Mind Maps — March 2025

CUTLASS and CuTe (2025.03.04 – 2025.03.10)

CUTLASS: C++ CUDA Templates for Linear Algebra Subroutines (2025.03.04)

  • Python interface
    • Brief code thanks to good defaults
    • Brief but very informative error messages compared to C++ template compile errors
  • Tiling as dividing shape by shape, or layout by shape
    • Facilitate working with tensor cores
  • Tiling compute resources
  • Layout algebra: composing layouts (as projections?)
    • Nested layouts aka modes: subset of axes, have specific semantics e.g. for threads, tensor cores
    • Also with swizzle functors
    • Compile-time checks for static parts of layouts
  • layout = (shape) : (stride), concatenable (“multiply”) (see the sketch after this list)
    • Tile: shape changes, but not stride
  • CuTe: C++ CUDA Template Library for Tensors
    • Front-end to CUTLASS 3.0
  • Conceptual GEMM Hierarchy
    • Atom layer — centered around hardware-accelerated ops e.g. tensor core, vector op
    • Tiled MMA/Copy — centered around optimal use of a single GPU
    • Collective layer — centered around synchronization
    • Kernel layer — grid planning, load balancing, thread marshalling
      • Device layer — host-side entry point
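
A minimal host-side sketch of (shape):(stride) layouts and concatenation, assuming only the header-only CuTe library that ships with CUTLASS (built with nvcc, cutlass/include on the include path); all sizes are illustrative:

#include <cute/tensor.hpp>
#include <iostream>

int main() {
  using namespace cute;
  // 4x2 column-major layout, printed as (_4,_2):(_1,_4)
  auto col = make_layout(make_shape(Int<4>{}, Int<2>{}),
                         make_stride(Int<1>{}, Int<4>{}));
  // concatenation appends modes: ((_4,_2),_3):((_1,_4),_8)
  auto cat = make_layout(col, make_layout(Int<3>{}, Int<8>{}));
  print(col); print("\n");
  print(cat); print("\n");
  // a layout is a function from coordinates to offsets:
  std::cout << col(1, 1) << "\n";  // 1*1 + 1*4 = 5
}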

CUTLASS / CuTe Layout Algebra (2025.03.05)

  • Mode of a layout = axis = a rank-1 sub-layout (modes may themselves be nested)
  • Coordinate = logical position in a shape, in one of three equivalent forms (see the sketch after this list):
    • natural (aka h-D): the same nested tuple structure as the shape
    • colexicographic ordinal (aka 1-D)
    • mixed (aka R-D, R = rank): one ordinal per sub-shape
  • Index = offset into the layout = inner product of the natural coordinate with the stride
  • Possibly confusing: both the formalism and the default stride are column-major (colexicographic)
  • Tuple-like and NumPy-like operations: sublayouts (fully expressive), concatenation, grouping, flattening, slicing
  • Complement
    • Left/right inverse
  • Composition
    • Logical product
    • Logical divide
      • Zipped and tiled variants
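
A sketch of the coordinate forms: the 1-D ordinal and the natural coordinate name the same position, hence the same index (the nested shape and strides are arbitrary examples):

#include <cute/tensor.hpp>
#include <iostream>

int main() {
  using namespace cute;
  auto L = make_layout(make_shape (Int<2>{}, make_shape (Int<2>{}, Int<2>{})),
                       make_stride(Int<4>{}, make_stride(Int<1>{}, Int<2>{})));
  std::cout << L(5) << "\n";                                // 1-D ordinal -> 6
  std::cout << L(make_coord(1, make_coord(0, 1))) << "\n";  // natural coordinate -> 6
}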

CuTe Layout Algebra — Continued (2025.03.06)

  • Coalesced layout: remove modes (axes) of size (dim) 1; merge modes that continue contiguously from the preceding mode: (s0:d0, s1:(s0*d0)) → (s0*s1):d0 (sketch after this list)
    • Sorted layout: strides are non-decreasing
  • Complement of layout A wrt. size M: layout B such that (A, B) (concatenated) is contiguous (dense) of size M; i.e. B fills the gaps of A due to strides
    • The image of B is disjoint from the image of A except at offset 0 — their coordinates are disjoint when embedded in C = (A, B)
  • (Functional) composition of layouts: R(c) := (A ∘ B)(c) := A(B(c))
    • R has coordinates compatible with B
    • (s0:d0) ∘ (s1:d1) = s1:(d0*d1) — every d1-th element of A, s1 in total
    • Composition is left-distributive with concatenation, so focus on B with 1 mode
    • Coalesce A and let B = N:r; then A ∘ B keeps the modes (dimensions and strides) from the middle of A, with the first and last mode adjusted so that size(A ∘ B) = N; the first stride = the corresponding stride of A * (r / cumulative size of the modes of A to its left)
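
Both rules, checked with CuTe's coalesce and composition (illustrative sizes):

#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  // (2,4):(1,2): mode 1 continues mode 0 contiguously, so this coalesces to 8:1
  auto A = make_layout(make_shape(Int<2>{}, Int<4>{}),
                       make_stride(Int<1>{}, Int<2>{}));
  print(coalesce(A)); print("\n");                 // _8:_1
  // (6:2) o (3:2): every 2nd element of 6:2, 3 in total, giving 3:4
  print(composition(make_layout(Int<6>{}, Int<2>{}),
                    make_layout(Int<3>{}, Int<2>{})));
  print("\n");                                     // _3:_4
}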

CuTe Layout Algebra — Continued (2025.03.07)

  • complement(A,M) has one more mode than A: leading mode (d0:1) where d0 is leading stride of A; except d0=1 → same num of modes
  • A ∘ B selects parts of A and reshapes them
  • Tilers add expressivity via hierarchical layouts:
    • Layouts: A ∘ (B1, B2) = (A ∘ B1, A ∘ B2)
    • Tilers: (A1, A2) ∘ <B1, B2> = (A1 ∘ B1, A2 ∘ B2)
  • Logical divide: A / B := A ∘ (B, complement(B, size(A))) — split A into the elements pointed to by B, and the rest (see the sketch after this list)
    • In the hierarchical layout of A/B, the first toplevel mode is the tile, and the second is the iterator over tiles
    • Logical dividing by a tiler keeps the hierarchical structure: (A1, A2) / <B1, B2> = (A1 / B1, A2 / B2)
  • Zipped divide: puts the iterator modes coming from different pieces of the tiler into one toplevel sublayout
  • Tiled divide: puts the intra-tile modes of the tiler pieces into one toplevel sublayout
    • layout<0>(zipped_divide(A, B)) = layout<0>(tiled_divide(A, B)) = A ∘ B
  • Dividing by a multi-mode tiler: multidimensional tiles that can look different in each mode (axis) of A
  • Logical product: A × B := (A, complement(A, size(A)*cosize(B)) ∘ B)
    • A is the tile, B gives the number and order of repetitions, and A* = complement(A, …) is the layout of available repetition slots
    • logical_product is tricky because B needs to be designed for A; blocked_product simplifies this
    • raked_product disperses the tiles
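
A sketch of divide and product on small layouts (CuTe headers assumed; shapes are illustrative):

#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  auto A = make_layout(Int<24>{});           // 24:1
  auto B = make_layout(Int<4>{}, Int<2>{});  //  4:2, a strided tile
  // first top-level mode = the tile (A o B), second = the iterator over tiles
  print(logical_divide(A, B)); print("\n");
  // blocked_product: repeat a 2x2 tile over a 3x4 grid of tiles
  auto tile = make_layout(make_shape(Int<2>{}, Int<2>{}));
  auto grid = make_layout(make_shape(Int<3>{}, Int<4>{}));
  print(blocked_product(tile, grid)); print("\n");
}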

CuTe: Tensors and Algorithms (2025.03.08)

  • tensor = layout + engine + [non-]owning; engine = memory iterator + memory tag; memory tag = global | shared | register
  • Thread-Value partitioning: a layout mapping threads + per-thread values to coordinates
  • Tensor slice: a new non-owning tensor with an offset iterator and a new (restricted) layout (see the sketch after this list)
  • Partitioning: zipped_divide, then slice; inner_partition / local_tile give e.g. each threadblock its own tile; outer_partition / local_partition iterate over all tiles at a fixed position/slice
  • Since CuTe can represent datatype, mem tag, shape and stride at compile time, algorithms can specialize to hardware-specific instructions
    • The type-inferred default impl can be overridden
  • copy: the auto-selected implementation may require a matching explicit synchronization (e.g. cp_async_fence / cp_async_wait)
    • copy_if: copy with a mask tensor
  • axpby: y := a*x+b*y (generalizes FMA)
  • gemm: accumulating, one of: element-wise product, outer product, batched outer product, matrix product; selected by a type param
  • fill, clear memory
  • Generics: transform (map in-place), fold an operation (C++17 style)
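
A host-side sketch of a non-owning tensor and a slice (CuTe headers assumed; the buffer and sizes are illustrative):

#include <cute/tensor.hpp>
#include <iostream>
#include <vector>

int main() {
  using namespace cute;
  std::vector<float> buf(4 * 6);
  for (int i = 0; i < 24; ++i) buf[i] = float(i);
  // non-owning tensor: the engine wraps a raw pointer; shape-only layouts default to column-major
  Tensor T = make_tensor(buf.data(), make_shape(Int<4>{}, Int<6>{}));
  Tensor col2 = T(_, 2);           // slice: _ keeps a mode; result is a new non-owning view
  std::cout << col2(1) << "\n";    // buf[1 + 2*4] = 9
}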

CuTe Matrix Multiply (2025.03.10)

  • Mnemonic: D := A*B + C
  • Hardware instruction levels:
    • Single thread (e.g. FMA)
    • Quadpair (Volta: V100) — roughly an 8-thread tensor core unit, i.e. 4 QPs per warp; QP 0 = threads 0–3 and 16–19, etc.
    • Single warp (Ampere: A100, RTX 30x0)
    • Warpgroup (Hopper: H100) — not introduced in the material, but heavily optimized
      • A warpgroup = 128 consecutive (1-D) threads
  • MMA and Copy atoms: expose specific PTX instructions with a unified, templated interface
    • For a specific PTX operation, provide A,B,C,D types and layouts
  • M/N/K-major: has stride 1 in the M/N/K mode
  • Inputs are often represented as (M,K),(N,K) rather than (M,K),(K,N)
  • Simplest example: tiles over threadblocks aka CTAs
auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);             // (m, n, k): k left open with _
Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1,X,_1>{});  // (bM, bK, num_k_tiles)
  • Static templated args e.g. bM=128 × bN=128 × bK=8
  • The Step selects the corresponding sublayouts of both cta_tiler and cta_coord, dropping the positions marked X
  • Example: independently specifying shared mem layouts, partitioning patterns, PTX instruction via TiledCopy and TiledMMA
auto tA = make_layout(make_shape(Int<32>{},Int<8>{})); // 32x8 copy-thread layout
Tensor tAgA = local_partition(gA, tA, threadIdx.x);    // this thread's slice of the gmem tile
Tensor tAsA = local_partition(sA, tA, threadIdx.x);    // this thread's slice of the smem tile
// inside the main loop over k_tile:
copy(tAgA(_,_,k_tile), tAsA); cp_async_fence();        // async gmem -> smem copy
cp_async_wait<0>(); __syncthreads();                   // smem tile visible to all threads
gemm(tCsA, tCsB, tCrC);                                // accumulate tile product in registers
__syncthreads();                                       // before smem is overwritten next iteration
  • Builders take a Copy_Atom<...> or UniversalFMA<TC,TA,TB> plus a thread layout (and a value layout for TiledCopy); TiledCopy/TiledMMA .get_slice(threadIdx.x) then extracts the per-thread views tAgA, tAsA, tCsA, … (see the sketch below)
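
A sketch of the builder pattern, modeled on the CuTe example GEMMs; the element types, tile sizes, and the generic (non tensor core) atoms are illustrative choices:

#include <cute/tensor.hpp>
#include <iostream>

int main() {
  using namespace cute;
  // TiledCopy: 32x8 threads, each moving one float per copy instruction
  TiledCopy copyA = make_tiled_copy(Copy_Atom<UniversalCopy<float>, float>{},
                                    Layout<Shape<_32,_8>>{},   // thread layout
                                    Layout<Shape< _1,_1>>{});  // value layout
  // TiledMMA: 16x16x1 threads around the generic FMA atom
  TiledMMA mmaC = make_tiled_mma(UniversalFMA<float,float,float>{},
                                 Layout<Shape<_16,_16,_1>>{});
  print(copyA); print("\n");
  print(mmaC);  print("\n");
  // kernel side: ThrCopy thr = copyA.get_slice(threadIdx.x);
  //              Tensor tAgA = thr.partition_S(gA);   // per-thread source view
  //              Tensor tAsA = thr.partition_D(sA);   // per-thread destination view
}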

The Ultra-Scale Playbook (2025.03.24 – 2025.03.26)

Table of Contents (2025.03.24)

  • One GPU training
    • Profiling compute and communication
    • Recomputing activations
  • Data parallelism
    • Overlap grad reduction with backward pass
    • Bucket grads into concat tensors for reduction
    • ZeRO: 1 (optim. states), 2 (+ grads), 3 (+ params)
  • Tensor parallelism — along hidden dim
  • Sequence parallelism — along input dim
  • Context parallelism
    • Ring attention
    • Zig-zag ring attention
  • Expert parallelism
  • Pipeline parallelism
    • Zero Bubble and DualPipe
  • Fusing, threading, mixed precision
    • Flash Attention 1-3
  • Finding best training config
    • Step 1: fit step in mem
    • Step 2: target batch size
    • Step 3: throughput

Picotron (2025.03.26)

  • Vocab embedding
    • Mask the input on each GPU to only compute embeddings for tokens in that GPU’s vocabulary shard
    • And shift token ids by the shard’s vocabulary offset
  • Tensor parallel
    • Best split: first matmul by columns, then the second by rows (matmul → nonlin → matmul; likewise the attention output projection), avoiding a sync around the nonlinearity
  • Parallelisms (example: 8× parallel total): data → batch dim (DP=2), tensor → hidden dim (TP=2), pipeline → model layer dim (PP=2)
  • Process group manager
    • PP*TP*DP grid of GPUs
    • Selects subsets of GPUs for a given parallel schema
    • Sets up one process per GPU
  • Dataloader
    • tokenize_dataset
  • Data parallel
    • Distributed sampler
    • Data-parallel gradient buckets
    • all_reduce to average gradients
  • Grad accumulation: save memory by processing the batch as sequential micro-batches instead of all at once
    • Manually accumulate grads in a temporary buffer so the reduction happens at higher FP32 precision (see the sketch below)
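
A minimal sketch of that last point in plain C++ (kept in the same language as the other sketches; every name here is hypothetical, not Picotron's API): bf16 micro-batch gradients are widened and summed into an FP32 accumulator, so rounding error does not compound before the all_reduce:

#include <cstdint>
#include <cstring>
#include <vector>

// hypothetical helper: widen a bf16 value (stored as uint16_t) to float
inline float bf16_to_float(std::uint16_t x) {
  std::uint32_t bits = std::uint32_t(x) << 16;  // bf16 = top 16 bits of an fp32
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// hypothetical accumulation step, run once per micro-batch
void accumulate_grads(std::vector<float>& acc_fp32,
                      std::vector<std::uint16_t> const& grad_bf16) {
  for (std::size_t i = 0; i < acc_fp32.size(); ++i)
    acc_fp32[i] += bf16_to_float(grad_bf16[i]);  // sum at FP32 precision
}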