Tinygrad decomposes them so there’s only “reciprocal”, but it’s a
slight loss of expressivity (and convenience!) wrt. “to-power-of”
In OCANNL it would be slightly harder to do complex gradients like
“to-power-of” and “division”
Reduces the scope of the tensor class a bit but tensors still have
grad field — no need to centralize for that
Inspiring: instead of tracking backprop at the assignments level,
track it at the tensor expression level like in tinygrad
We get rid of the diff / non-diff distinction, replace
diff : diff option by mutable
diff : diff option, and
type diff = { grad : t; zero_grads : asgns }
It shouldn’t reduce flexibility since the new
grad.forward would function like the old
diff.bprop
We will still use the global session state: to update the diff field
i.e. there will still be an implicit set of differentiated roots;
later, we could refactor that too
Polyhedral Optimization (2025.02.02 – 2025.02.06)
Quick Glance at Polyhedral Optimization (2025.02.02)
Sources: PLISS 2019 Polyhedral Compilation lecture slides (includes
a mini primer on MLIR); PolyTOPS: Reconfigurable and Flexible Polyhedral
Scheduler
Optimizing nested loops by affine transformations of loop
variables
Polyhedra are a technical detail
Tiling, Loop Fusion and Fission
Here, the dataflow perspective: T: topology matrix with nodes as columns and buffers as rows; v: node firings; Tv = 0: no deadlock or buffer overflow; v_c > 0 ∀ c: no starvation
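A toy instance of the balance condition (my own example, not from the slides): one buffer from node A to node B, with A producing 2 tokens per firing and B consuming 1.

    % T: one row per buffer, one column per node; v: firings per period.
    \[
      T = \begin{pmatrix} 2 & -1 \end{pmatrix}, \qquad
      v = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
      T v = 2 \cdot 1 - 1 \cdot 2 = 0 .
    \]
    % Firing A once and B twice per period keeps the buffer balanced (no
    % overflow, no deadlock), and both components of v are positive (no starvation).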
ISL: integer set library
Schedule: map from the iteration domain to timestamps (compared in lexicographic order)
Construct a producer/consumer dataflow diagram, reorder for
locality, and compactify buffers by reusing already-read cells
Optimal solving is too hard, and efficient code generation from a
solution is too hard
Simpler — synchronous dataflow: + finds efficient, deadlock-free schedules; − limited solutions: S_p(x) = q_p ⋅ x + d_p; − same amount of data at each node. q: rate, d: delay, p: producer, c: consumer
For example, minimize: q_p, q_c, d_p, d_c
subject to: q_c ⋅ x + d_c ≥ q_p ⋅ x/2 + d_p
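A sketch of how the example constraint resolves (my own reasoning, not from the source): since it must hold for every firing index x ≥ 0, it splits into a rate condition and a delay condition.

    \[
      \forall x \ge 0:\; q_c\, x + d_c \ge \tfrac{q_p}{2}\, x + d_p
      \;\Longleftrightarrow\;
      q_c \ge \tfrac{q_p}{2} \;\text{ and }\; d_c \ge d_p ,
    \]
    % so the cheapest consumer schedule fires at half the producer's rate with
    % the same delay: q_c = q_p / 2, d_c = d_p.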
Polyhedral Optimization Framework (2025.02.03)
Iteration domain: space of all executions of statements
i.e. single-cell assignments
Schedule = maps the iteration domain to execution times, preserving lexicographic order
Actually lines are also lexicographic: output dims, input dims
Loops introduce one dimension to the iteration domain per nesting
depth
Statement sequences introduce one dimension per sequence at nesting
depth (“unrolled loop”)
Called β-vectors when every other dimension is a sequencing one; they uniquely identify statements
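A toy example (mine, not from the slides) of the encoding, with one sequencing dimension interleaved at every depth:

    // Toy loop nest; the schedule of each statement alternates sequencing
    // constants with loop iterators, and the constants form its beta-vector.
    void toy(int n, int m, float *a, const float *b, float *c) {
      for (int i = 0; i < n; i++) {
        a[i] = 0.0f;                 // S1: schedule (0, i, 0)        beta (0, 0)
        for (int j = 0; j < m; j++)
          a[i] += b[i * m + j];      // S2: schedule (0, i, 1, j, 0)  beta (0, 1, 0)
      }
      *c = a[0];                     // S3: schedule (1)              beta (1)
    }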
Reuse — temporal vs spatial locality, cache lines
Valid schedule: dependences have lexicographically positive distance and the schedule is invertible — in any case, causality is preserved
Parallelism: give a benefit to distance-0 dependences (dimensions where all distances are 0 can run in parallel)
Code generation: invert the schedule; schedules made of affine equalities and inequalities are invertible (Gaussian elimination, Fourier-Motzkin elimination)
Optimizing schedule: ILP using heuristic objectives (trying to
approx. performance on idealized hardware); or discrete search with
benchmarking JITted solutions
When using ILP, ensure invertibility, e.g. iteratively optimize over rows, adding constraints to keep full row rank
Reuse: benefit accesses not using iterators from the last schedule
dimension
Fusion (when the sequence dim difference is zero): penalty for
fissioned statements sharing a data cell, OR penalty for fused
statements not sharing any data cells
Vectorization: penalty for iterator in the fastest-varying array
subscript and no others
“Opening Polyhedral Compiler’s Black Box” from Inria
(2025.02.04)
A statement can have multiple relations if disjunctive conditions in
control flow
Change in a scheduling relation = program transformation
Requirements — invertibility not necessary: all relations have the
same iteration domain; union of relations of a statement covers the
iteration domain and is 1-to-1; all dates are (vectors of) integers —
implies linear independence of dimensions
Decodes optimized scheduling relations into a sequence of syntactic
loop transformations
Encodes loop transformations as updates of scheduling relations
Syntactic transformations:
Reorder — changes sequencing
FuseNext — fuses a loop with its successor
Distribute — splits a loop in two
Interchange — swaps loops at given 2 depths in the given
nesting
Shift — moves given statements by given amount in the given
loop
Skew — scales the given output loop’s iterator by
given factor
Reshape — scales the given input loop’s iterator by
given factor
Reverse — reverses the iteration order for the given loop(s)
IndexSetSplit — partitions a scheduling relation into two disjoint ones, e.g. duplicates a loop, with one copy running up to a midpoint and the other from the midpoint
Collapse — reverse of IndexSetSplit
Grain / Densify — change how long it takes between two executions of
the given statements along the given loop
StripMine — decomposes the given loop into 2 nested loops, with one exec of the inner loop iterating for at most the given number of steps. + Interchange = Tiling (see the sketch after this list)
Linearize — reverse of StripMine
Unroll/Parallelize — not a real transformation, marks the given
output dimension for unrolling or vectorizing
Transforms not needed for decoding: Interchange, Collapse,
Linearize, Parallelize
Grain + Reshape = Rotation: aligns a dependence along a loop
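A minimal sketch (my own toy code, C-style) of StripMine followed by Interchange giving the classic tiled nest:

    constexpr int N = 1024, T = 32;          // T assumed to divide N, for brevity

    void original(float (*a)[N], float s) {  // plain 2D nest
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          a[i][j] *= s;
    }

    void strip_mined(float (*a)[N], float s) {  // StripMine(j, T): j -> (jj, j)
      for (int i = 0; i < N; i++)
        for (int jj = 0; jj < N; jj += T)
          for (int j = jj; j < jj + T; j++)
            a[i][j] *= s;
    }

    void tiled(float (*a)[N], float s) {        // + Interchange(i, jj) = tiling
      for (int jj = 0; jj < N; jj += T)
        for (int i = 0; i < N; i++)
          for (int j = jj; j < jj + T; j++)
            a[i][j] *= s;
    }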
“High performance compilers for parallel computing” M. Wolfe
1996
Another Glance at MLIR: Multi-Level Intermediate Representation
(2025.02.05)
Typed Static Single Assignment, functional: instead of Phi
functions, blocks take arguments
Many improvements over LLVM
Declarative general programming language with namespaces, pattern
matching, rewrite rules
Combined with C++ code (tricky rewrite rules, passes, trait
implementations etc.)
E.g. can define syntax of operations from within op. def.
Dialects are centrally or externally maintained extensions
— 48 dialects in the repository
Lowering sequence (all within MLIR): TensorFlow repr. → XLA HLO →
Polyhedral/Affine → LLVM IR
nvgpu dialect — PTX specific operations
amdgpu dialect
llvm dialect
Affine dialect — “Polyhedral” / for-loop lang with first-class loop
ranges
Linalg dialect — high-level hierarchical optimization for linalg
ops, especially convolutions; Structured Abstraction
4: sync and comms, within and between architectures
Handles sparsity
Integrated with ISL, e.g. to express ASTs for CUDA; Halide IR is used for CPU ASTs
Handles recurrent NNs and cyclic dataflows — via skewing
Optimizes distributed code
“A Deep Learning Based Cost Model for Automatic Code Optimization” —
cost model architecture:
Computation Embedding Layer: feedforward NN on computation vectors
(per statement A, B, C, D, E)
Recursive Loop Embedding Layer: LSTM over child loop embeddings and
child computation embeddings → new loop embedding
Regression Layer: feedforward NN → predicted speedup
Computation vectors composed of loop nestings including loop
transformations, and memory accesses, but padded to 1235 dims for
feedforward NN embedding
Autotuning / Program Search (2025.02.12 – 2025.02.16)
Halide Program Search — Learning to Optimize Halide with Tree Search
and Random Programs (2025.02.12–14)
Autotuning / program search: here Halide (2019), next time Tiramisu
(2021)
Iterates through network layers from output
Compute and storage granularity: how many loops to fuse, memory
reuse: fusing all loops = inlining
Add extra tiling to previous layer if it enables more choices
Steps: (1) add inner tilings (opt. vectorized / SIMD), (2) outer
tilings (opt. parallel threads); total 2n steps for n
layers
Choosing each tile size
Beam search: top-k for each step
Cost model trained on full pipelines; implicitly assume unscheduled
layers are inputs; when interdependent, assume max locality and dense
vector loads (optimistic)
Pruning: limit search space, e.g. loops compute→storage should be
single-dimensional (best results for line-buffer)
Coarse-to-fine: diversify the beam by prioritizing one
representative of a loop nest class; next pass: only optimize within
winning loop nest classes, with finer subclasses
Heuristic instead of search: loop unrolling, no nest reordering
except vector dims pushed inside, memory space selection
Cost model and runtime to be continued
See figure from yesterday (Tiramisu cost model)
Algo and schedule features, computed statically:
Histograms of operations to compute a single point
Count of evaluations of a region, of allocations of storage for
their result, of accesses of this storage: both in bytes and in
contiguous segments
Number of parallel tasks launched
Number of whole SIMD vectors computed, and of the corresponding
scalar values
Number of dense vectors and scalars loaded per vector computed
(amortized over sharing by unrolled loops)
Region shapes via symbolic interval arithmetic
A small cost network: features → coefficients for
simple manual heuristics, also built out of the features
Beam search with “dropout”: expand k-th best w/prob p^k
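A sketch of that selection rule as I read it (my own C++, not the paper's implementation; the (cost, id) pairing and names are assumptions):

    #include <algorithm>
    #include <random>
    #include <utility>
    #include <vector>

    // Expand the k-th best candidate (k = 0, 1, 2, ...) with probability p^k.
    std::vector<int> pick_to_expand(std::vector<std::pair<double, int>> beam,  // (cost, id)
                                    double p, std::mt19937 &rng) {
      std::sort(beam.begin(), beam.end());      // best (lowest predicted cost) first
      std::vector<int> chosen;
      double keep = 1.0;                        // p^0 for the best candidate
      for (int k = 0; k < (int)beam.size(); ++k) {
        if (std::bernoulli_distribution(keep)(rng))
          chosen.push_back(beam[k].second);     // this state gets expanded next step
        keep *= p;                              // p^(k+1) for the next-best candidate
      }
      return chosen;
    }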
Pretrained on random algos / networks
Autotune: benchmark end-of-beam candidates; finetune: update model
while beam-searching
Tiramisu Program Search (2025.02.16)
Searches for a sequence of transformations: fusion, tiling,
interchanges, unrolling
Heuristic / not optimized: parallelization, vectorization
Recursive NN over a loop embedding: combines outputs of LSTMs
Loop children: LSTM over loop transformations
Statement children: LSTM over computation vectors
Use of two LSTMs loses ordering info between nested loops and
statements
Computation vectors composed of loop nestings including loop
transformations, and memory accesses, but padded to 1235 dims for
feedforward NN embedding
Matrix Multiplication (2025.02.16 – 2025.02.21)
Matrix Multiplication on Multicore CPU (2025.02.16)
Source: siboehm.com
SGEMM = single-precision general matrix multiply: C := a*A*B + b*C
The machine FMA is accumulating! c := a*b + c; FMA = fused multiply-add = MAC = multiply-accumulate
Uses vectorized FMA instructions
Intel’s MKL has dedicated SGEMM functions / kernels for each
microarchitecture
CPUs are compute bound even for small matrices
-march=native specializes the C compiler's output to the microarchitecture it is compiled on
Big gain from loop reordering: non-contiguous iterator should never
be innermost — much better than assigning to a register in innermost
loop
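For reference, a minimal sketch of the reorder (my own toy code, row-major, not the article's): with j innermost, B and C are read contiguously; the naive order strides through B.

    // Naive i-j-k: the inner k loop walks B column-wise (stride N) -- slow.
    void matmul_ijk(int N, const float *A, const float *B, float *C) {
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];   // accumulates into C
    }

    // Reordered i-k-j: the inner j loop walks B and C contiguously.
    void matmul_ikj(int N, const float *A, const float *B, float *C) {
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
    }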
Tiling on multiple dimensions
Tiling the reduce loop (middle) with outermost replication
Idea: choose the tile width so that a single tile of the reduce loop fits in cache; in practice the best size is bigger
It’s two elementary loop transforms: strip-mine and reorder
Multithreading strategy: tile rows of A and columns of B so that
there’s no need to communicate except to gather the results
Matrix Multiplication in CUDA without Tensor Cores
(2025.02.17–18)
Source: siboehm.com
We implement FMA / MAC: C := alpha A @ B + beta C
cuBLAS at FP32 doesn’t use tensor cores, only at TF32 and BF16
Should be compute bound if well designed
Kernel 1: each cell of C a separate thread doing reduction, thread
ids end up increasing along columns of C; 1.3% eff.
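Roughly what kernel 1 looks like (a minimal sketch, not siboehm's exact code; the grid/block mapping and names are assumptions):

    // One thread per cell of C; computes C := alpha*A@B + beta*C.
    __global__ void sgemm_naive(int M, int N, int K, float alpha,
                                const float *A, const float *B,
                                float beta, float *C) {
      // Consecutive threadIdx.x values walk down a column of C, so A loads and
      // C accesses are strided across the warp (kernel 2 flips this mapping).
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      int col = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
          acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
      }
    }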
No change in assembly, access coalescing at runtime, requires
aligned access; compiler does partial inner loop unrolling
Kernel 2: ensure thread ids increase along rows of C; 8.5% eff.
Kernel 3: threads first populate shared mem, then sync, then reduce
the shared mem block; 12.8% eff.
It’s a strip-mining of the reduce loop, but tricky as pure loop
transforms — kinda also strip-mining of A and B’s non-shared axis (into
block-parallel and thread-parallel) and promoting tiles to shared
mem
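A sketch of the kernel-3 idea under simplifying assumptions (square BLOCK tiles, M, N, K divisible by BLOCK, launch with dim3(BLOCK, BLOCK) threads; not the article's exact code):

    #define BLOCK 32
    __global__ void sgemm_smem(int M, int N, int K, float alpha,
                               const float *A, const float *B,
                               float beta, float *C) {
      __shared__ float As[BLOCK][BLOCK];
      __shared__ float Bs[BLOCK][BLOCK];
      int row = blockIdx.y * BLOCK + threadIdx.y;   // output row owned by this thread
      int col = blockIdx.x * BLOCK + threadIdx.x;   // output col (coalesced in x)
      float acc = 0.0f;
      for (int t = 0; t < K; t += BLOCK) {          // strip-mined reduce loop
        // Each thread copies one element of the A and B tiles, then all sync.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < BLOCK; ++k)             // reduce over the shared tiles
          acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
      }
      C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }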
Kernel 4: kernel 3’s thread-parallel A loop strip-mined to new
innermost non-parallel loop, smaller tiling of reduce loop than kernel 3
to lower smem; 36.5% eff.
__syncthreads separates two implicit thread-parallel
loops, but logically the thread-parallel iterators are different (the
first loop’s iter for A becomes the innermost non-parallel loop for
A)
Kernel 5: also strip-mine kernel 4’s thread-parallel B loop to new
innermost non-parallel loop; now too few threads wrt. tile size: smem
copy loops; 68.7% eff.
Combats shared mem access bottleneck; why does the strip-mining
halve threads per threadblock?
Kernel 6: transpose a tile of A while copying to shared mem;
vectorize copying via reinterpret_cast<float4 *>;
78.4% eff.
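The vectorized, transposing copy boils down to something like this device helper (a sketch; BM, BK and the index names are my assumptions, and A is assumed 16-byte aligned at the accessed offset):

    // Load 4 consecutive floats of A's row with one 128-bit instruction, then
    // store them transposed into the shared-mem tile As[BK][BM].
    template <int BM, int BK>
    __device__ void load_a_transposed(const float *A, int K, int row, int col4,
                                      float (&As)[BK][BM]) {
      float4 tmp = reinterpret_cast<const float4 *>(&A[row * K + col4])[0];
      As[col4 + 0][row] = tmp.x;
      As[col4 + 1][row] = tmp.y;
      As[col4 + 2][row] = tmp.z;
      As[col4 + 3][row] = tmp.w;
    }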
Kernels 7 and 8: shared-mem bank conflicts
Kernel 9: autotuning: how much data cached — from global to shared
mem, from shared mem to register file i.e. local mem
Each level of tiling is: strip-mining, reordering, preparatory loops
for mem level transfer (sometimes with transpose)
Kernels 11 and 12: double buffering
Smem bank conflicts and double buffering continued after the Seb-V notes on AMD GPUs
Levels of parallelism: blocktiling (different Streaming Multiprocessors; syncing and shared mem within a tile), warptiling (different warp schedulers, mem bank benefits, warp-level ops e.g. tensor cores), threadtiling (Instruction Level Parallelism)
CUDA compiler can do the ILP vectorization
Matrix Multiplication on AMD RDNA3 GPUs (2025.02.19)
Source: seb-v.github.io
Nomenclature: Compute Unit = Streaming Multiprocessor; Stream Processor = CUDA Core; wavefront (32 or 64 threads) = warp (32 threads); workgroup = block; Local Data Share (LDS) = Shared MEMory (SMEM)
The programming model is almost the same, uses the same keywords
e.g. __shared__
Like before, we do C := a*A*B + b*C in FP32 prec.; 32*4 (b/c FP32) =
128 bytes mem transaction
Kernel 1 is identical to CUDA; 3.3% eff.
Kernel 2: LDS tiling; square As, Bs 32x32 of A / B, strip-mines
reduce dim and reorders, loads the tiles As, Bs w/ transpose into LDS
along rows of A, B; 13.1% eff.
Kernel 3: register tiling adds 2 (3) more tiling levels: 1-per-thread increases arithmetic intensity and 2x2-per-wave enables vectorized mem transfers; loads a column/row of the LDS tile to registers to reduce, strip-mines both tile dims
This is kinda swizzled layout
The wave tile is composed of 2x2 subtiles sized 8x4 that each fit in
a wavefront (8*4=32) (why?) doesn’t prevent mem bank conflicts b/c
accessed contiguously?
Also unlocks v_dual_fmac_f32: parallel vector
multiply-accumulate operations
Kernel 7 from siboehm.com prevents bank conflicts by dispersing
consecutive Bs, so on reading to regs they’re from the same bank
I don’t see how kernel 8 from siboehm.com prevents bank
conflicts?
Matrix Multiplication on AMD GPUs: Double Buffering and Bank
Conflicts (2025.02.20)
Source: seb-v.github.io
Needs __launch_bounds__, otherwise compiler uses
scratch mem instead of registers
Kernel 4: double buffering; load the next block’s GMEM data into
registers at the beginning of a step; sync at end of step and write to
SMEM; also adds unroll pragmas; 83.7% eff.
While some threads queue for loading from GMEM, others continue
computing
Kernel 5: bank conflicts → adds padding to As rows (smem tile of A)
so that accesses are staggered: original BM%32=0, new (BM+4)%32=4 no
same-thread same-bank conflicts; arithmetic intensity → 2x wider wave
tile; adds -cumode (each SIMD32 sticks to its own LDS);
109.8% eff.
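In code, the padding fix is roughly the following (a sketch with assumed tile sizes, written in CUDA-style syntax since the keywords are shared with HIP):

    #define BM 128
    #define BK 8
    __global__ void padded_lds_example() {
      // Without padding (As[BK][BM]) a column walk strides by BM floats; since
      // BM % 32 == 0, every step hits the same LDS bank.  Padding the row length
      // by 4 floats makes the stride BM + 4, and (BM + 4) % 32 == 4 staggers banks.
      __shared__ float As[BK][BM + 4];
      if (threadIdx.x < BM && threadIdx.y < BK)
        As[threadIdx.y][threadIdx.x] = 0.0f;   // placeholder use
    }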
Theoretically, close to 200% eff. might be possible because, for an unknown reason, rocBLAS is not using dual_fmac (2x multiply-accumulate ops).
Matrix Multiplication on AMD GPUs: from HIP to ISA (2025.02.21)
Source: seb-v.github.io
Nomenclature: HIP = CUDA; ISA = SASS (assembly); AMDIL / ROCm llvm
IR = PTX
Kernel 6: optimize vector arithmetics using
v_dual_fmac_f32: maximize number of vec registers read per
instruction; maximize use of cache; maintain a consistent symmetric
access pattern; don’t worry about contiguous use of registers: reorder
before writing to LDS
Kernel 6 details: ensure C_regs contiguous, assign A_col and B_row
to non-overlapping banks (0-1 and 2-3), update LDS loads and inner loop
v_dual_fmac accordingly, permute registers as needed by code writing to
global memory; 123.7% eff.
Kernel 7: loop unrolling; didn’t work with the HIP implementation
b/c the compiler prefetched more values from the LDS; now just duplicate
ISA loop body with address increments, remove branch instructions;
135.1% eff.
Kernel 8: batched GMEM loads; precompute global load offsets into
scalar registers (saving on vec registers); in the loop, update A,B
index vec registers, global_load_b32 output-reg, index-reg,
offset-reg format; spread the loads across the unrolled inner loop;
160.6% eff.
Remaining possible optimization: double buffering for LDS
This approach is not scalable but gives insight into optimal
execution on RDNA3
The FineWeb Recipe (2025.02.22)
OpenAI, Anthropic etc. use their own web crawlers and data
Common Crawl is a non-profit org extracting data for researchers;
currently over 250 billion pages; new crawls w/ 200-400TiB text every
1-2 months
CommonCrawl's own text extraction (WET files) is poor quality; text extraction is costly, but FineWeb redoes it anyway
FineWeb is derived from 96 CommonCrawl snapshots; has 15 trillion
tokens; 44TB
FineWeb-Edu: two variants, 1.3 trillion tokens (very high quality) and 5.4 trillion tokens (high quality)
Filtered by “educational content” classifier distilled from
Llama-3-70B-instruct prompted to focus on grade/middle-school
Data quality: train small models and evaluate on early-signal
benchmarks
Base filtering: URL blocklists; English only (fastText classifier);
repetition filter (many repeated lines); filters from MassiveText (used
to train Gopher); 36T tokens
MassiveText filters: docs between 50 and 100K words; mean word length from 3 to 10 chars; hash-to-word and ellipsis-to-word ratio < 0.1; less than 90% of lines are bullet points; less than 30% of lines end with an ellipsis; at least 80% of words contain an alphabetic char; contains at least 2 of: the, be, to, of, and, that, have, with
Filters from C4 dataset: no “lorem ipsum” and no curly brackets
C4-inspired custom filters (keep docs only if): fraction of lines ending with punctuation > 0.12; fraction of chars in duplicated lines < 0.1; fraction of lines shorter than 30 chars < 0.67