In OCANNL, it is easy to manually partition a model into components and
explicitly distribute them over devices and backends (sketch below).
More natural for pipeline parallelism
Awkward/hard for model parallelism, especially for parameter
sharding
We need good design for initialization
Tensor parallelism
In OCANNL, activations are typically a special case of
non-materialized tensors
Rethinking design: non-materialized tensors vs the eviction
mechanism
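A minimal sketch of such manual partitioning (plain Python/NumPy standing in for per-device computation; the stage and device names are hypothetical, not OCANNL API):

```python
# Illustrative only: manual pipeline partitioning, NumPy standing in for
# per-device computation; "cuda:0"/"cuda:1" are placeholder device names.
import numpy as np

rng = np.random.default_rng(0)

class Stage:
    """One pipeline stage: owns its parameters, nominally pinned to one device."""
    def __init__(self, d_in, d_out, device):
        self.w = rng.normal(size=(d_in, d_out)).astype(np.float32)
        self.device = device  # placement is explicit in this scheme

    def forward(self, x):
        # In a real setup, x would first be copied to self.device.
        return np.maximum(x @ self.w, 0.0)

# Explicitly distributed model: only activations flow between stages/devices.
stages = [Stage(16, 32, "cuda:0"), Stage(32, 8, "cuda:1")]
x = rng.normal(size=(4, 16)).astype(np.float32)
for stage in stages:
    x = stage.forward(x)   # the activation handoff is the pipeline boundary
print(x.shape)  # (4, 8)
```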
Parameter Sharding (2024.11.18)
Fully Sharded Data Parallel
Each sub-module / model layer is an FSDP unit.
Each unit’s non-shared parameters are flattened, concatenated and
sharded across backends/devices.
Before the unit’s computation, the required parameters are unsharded
(gathered); afterward, the other shards’ parameters are deleted (see the
sketch after this list).
Communicates parameters & their gradients on demand, for
unsharding & accumulation
Crosses abstraction levels: bad fit for OCANNL design
Great for balancing memory, computation, number of devices
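A hedged sketch of that FSDP-unit life cycle, simulating the shards in NumPy with no real communication (all names here are illustrative):

```python
# Illustrative FSDP-unit life cycle, simulated in NumPy (no real communication).
import numpy as np

N_SHARDS = 4  # stands in for the number of devices/backends

class FSDPUnit:
    def __init__(self, params):
        # Flatten and concatenate the unit's (non-shared) parameters, then shard.
        flat = np.concatenate([p.ravel() for p in params])
        flat = np.pad(flat, (0, (-len(flat)) % N_SHARDS))
        self.shapes = [p.shape for p in params]
        self.sizes = [p.size for p in params]
        self.shards = np.split(flat, N_SHARDS)   # each device keeps one shard

    def unshard(self):
        # "All-gather": reconstruct the full flat parameter vector before compute.
        flat = np.concatenate(self.shards)
        out, ofs = [], 0
        for shape, size in zip(self.shapes, self.sizes):
            out.append(flat[ofs:ofs + size].reshape(shape))
            ofs += size
        return out

    def compute(self, x):
        w, b = self.unshard()            # gather parameters on demand
        y = x @ w + b
        # After the unit's compute, only our own shard stays resident; the
        # gathered full parameters (other shards' data) are released.
        return y

unit = FSDPUnit([np.ones((8, 8), np.float32), np.zeros(8, np.float32)])
print(unit.compute(np.ones((2, 8), np.float32)).shape)  # (2, 8)
```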
Tensor Parallelism
Communicates activations, at computation boundaries
Upside: stays within the abstraction levels, keeping the design clean
(unlike FSDP above)
Downside: complicates model design with computational considerations
Manually: a slice operator indexed by backend/device/stream (sketch after
this list)
Hard work: finding good axes to balance memory, computation, and the number
of devices
Fits with OCANNL design
Automatically finding good axes — tricky, a worse fit for the OCANNL
design
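A minimal sketch of the manual route: a linear layer’s weight sliced along its output axis, one slice per notional device (NumPy simulation; names are hypothetical, not OCANNL’s slice operator):

```python
# Illustrative manual tensor parallelism: weight sharded along the output axis,
# one slice per notional device; names are hypothetical, not OCANNL API.
import numpy as np

DEVICES = ["cuda:0", "cuda:1"]
rng = np.random.default_rng(0)

d_in, d_out = 16, 8
w = rng.normal(size=(d_in, d_out)).astype(np.float32)

# "Slice operator indexed by device": column slice i of the weight lives on device i.
w_slices = np.split(w, len(DEVICES), axis=1)

def tp_linear(x):
    # Each device computes its partial output; the activation is communicated
    # and concatenated at the computation boundary.
    partials = [x @ w_dev for w_dev in w_slices]
    return np.concatenate(partials, axis=1)

x = rng.normal(size=(4, d_in)).astype(np.float32)
assert np.allclose(tp_linear(x), x @ w, atol=1e-5)
```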
Ideas for Upcoming Mind Maps (2024.11.19, updated 2024.11.27)
Convolutions and padding — 1/3 done: tinygrad
Design risk: interacts with the whole of OCANNL, esp. shape
inference
Even LLMs need padding
Visually attractive examples
Pipeline parallelism — mostly done
Continues recent theme
Program search in tinygrad — done
Super important for design
Pallas: extension of JAX for writing custom kernels — done
Mutable abstraction layer for JAX
What are blocks? (added 11.27)
More deep dives
Karpathy’s: llm.c, llama2.c, LLM101n, nanoGPT
jackpeck’s llama2.ml
Keller Jordan’s modded-nanogpt
keras → keras.core
Pipeline Parallelism and Optimizer/Gradient Sharding (2024.11.20)
Papers to read: ZeRO arXiv:1910.02054, Zero Bubble arXiv:2401.10241,
NanoFlow arXiv:2408.12757
Pipeline parallelism
Pipeline with nano-batches at the granularity of operations
PP most helpful for cross-server connections
Asynchronous PP breaks optimizer semantics
Zero Bubble
Optimizes microbatch scheduling of forward, backward (activation-gradient),
and weight-gradient passes
Parallelizes the optimizer step layer-wise by accumulating
progressively
Propagates the previous iteration’s global optimizer state while the
next iteration computes its initial forward steps
Redoes the optimizer step if the global check fails for any layer
(found INF, NaN, or gradient clipping needed; sketch below)
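An illustrative sketch of the optimistic-step-plus-rollback idea (not the paper’s implementation): step first, run the deferred global check, undo or redo if it fails; LR and CLIP are made-up hyperparameters:

```python
# Illustrative "optimistic step + rollback" for a deferred global check.
import numpy as np

LR, CLIP = 0.1, 1.0

def optimistic_step(params, grads):
    backups = [p.copy() for p in params]              # cheap rollback copies
    for p, g in zip(params, grads):                   # per-layer steps that can
        p -= LR * g                                   # overlap the next forward
    # Deferred global check, after the per-layer steps already ran:
    gnorm = np.sqrt(sum(float((g * g).sum()) for g in grads))
    if not np.isfinite(gnorm):                        # INF/NaN: undo the step
        for p, b in zip(params, backups):
            p[...] = b
    elif gnorm > CLIP:                                # clipping needed: redo with scaled grads
        scale = CLIP / gnorm
        for p, b, g in zip(params, backups, grads):
            p[...] = b - LR * scale * g

params = [np.ones(4), np.ones(3)]
optimistic_step(params, [np.full(4, 10.0), np.full(3, 10.0)])
print(params[0])
```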
Optimizer/gradient sharding
ZeRO: focus on minimizing per-GPU memory
ZeRO-DP is similar to FSDP, but lossy and much simpler
Partitions optimizer states, discards gradient parts for other
partitions
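A minimal sketch of the partitioning, with ranks simulated in NumPy: each rank keeps optimizer state and (after a reduce-scatter) gradients only for its own slice of the flat parameters, then the updated slices would be all-gathered:

```python
# Illustrative ZeRO-style partitioning of optimizer state and gradients
# (NumPy simulation; no real communication or multiple processes).
import numpy as np

WORLD = 4
LR, MOM = 0.1, 0.9
flat_params = np.zeros(32, np.float32)
param_slices = np.split(np.arange(flat_params.size), WORLD)   # rank -> index slice

# Each rank holds momentum only for its own partition (1/WORLD of the state).
momentum = {r: np.zeros(len(idx), np.float32) for r, idx in enumerate(param_slices)}

def step(per_rank_grads):
    # Reduce-scatter: every rank ends up with the summed gradient of its own
    # slice only; gradient parts for other partitions are discarded.
    full = np.sum(per_rank_grads, axis=0)
    for r, idx in enumerate(param_slices):
        g = full[idx]
        momentum[r][:] = MOM * momentum[r] + g
        flat_params[idx] -= LR * momentum[r]           # each rank updates its slice
    # (an all-gather of flat_params would follow so every rank sees the update)

step([np.ones(32, np.float32) for _ in range(WORLD)])
print(flat_params[:4])   # [-0.4 -0.4 -0.4 -0.4]
```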
More notes coming
Tinygrad Deep Dive (2024.11.21 – 2024.11.27)
Program Search in Tinygrad (2024.11.21)
Kernel-transforming operations — each usually has an axis and an
amt (amount / part of the axis):
LOCAL — converts part of a global axis to local
PADTO — adds padding to an axis
NOLOCALS — prohibits use of local axes
GROUPTOP
UPCASTMID
GROUP — related: fusing reduce ops??
TC (TensorCore)
UPCAST — converts part of an axis into an innermost
upcasted axis
UNROLL — converts part of an axis or a whole axis into an
innermost upcasted axis
SWAP — swaps two axes in a tensor
Upcasted axis is computed or reduced at higher precision??
In tinygrad, memory mode and projection semantics are assigned
per-axis
Axis color coding: blue = global dims, cyan = local dims, green =
reduce-local, white = reduce-late upcasted, red = reduce loops, purple =
reduce upcasted, yellow = normal upcasted
TensorCore’s reduce axis is selected starting from
first_reduce; the 2 input axes end at
first_reduce
Applying TC applies PADTO and UNROLL for the reduce axis, and UPCAST and
LOCAL for the TensorCore’s thread axes
Then optional “hand-coded” UPCAST and LOCAL
All of this is in codegen/kernel.py, beam_search itself
is in engine/search.py
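An illustrative beam search over (op, axis, amt) actions, as a generic sketch with a made-up cost model (not tinygrad’s kernel.py / search.py code):

```python
# Generic beam search over (op, axis, amt) kernel-transforming actions; the
# action semantics and cost model are made up purely for illustration.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Opt:
    op: str      # e.g. "UPCAST", "LOCAL", "UNROLL"
    axis: int
    amt: int     # how much of the axis to convert

ACTIONS = [Opt(op, ax, amt)
           for op, ax, amt in product(("UPCAST", "LOCAL", "UNROLL"), (0, 1), (2, 4, 8))]

def cost(schedule):
    # Stand-in for timing the generated kernel; a real search benchmarks candidates.
    return abs(8 - sum(o.amt for o in schedule if o.op == "UPCAST")) \
         + abs(4 - sum(o.amt for o in schedule if o.op == "LOCAL")) + 0.1 * len(schedule)

def beam_search(width=4, depth=3):
    beam = [()]
    for _ in range(depth):
        cands = [s + (a,) for s in beam for a in ACTIONS if a not in s]
        beam = sorted(cands, key=cost)[:width]       # keep the `width` best schedules
    return beam[0]

print(beam_search())
```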
Scheduler in Tinygrad (2024.11.22)
Why is all this in here? Where is the scheduling? —
schedule.py produces the things to schedule, not the scheduling itself
to_uop for non-const buffers is the ShapeTracker’s view
of the buffer
A UOp is_scheduled when its op is
Ops.VIEW
Selects groups to fuse, vs. what to materialize — via graph
rewriting
Prepares indexing (aka. movement ops), integrates indexing with
computation
What is swizzle??? → see later
ScheduleItem with disjoint input and output buffers
Similar to OCANNL’s inputs and outputs fields in
routine
Multiple outputs possible via Ops.SINK AST node,
otherwise single output
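A sketch of the data shape only (hypothetical names, not tinygrad’s actual classes): an AST plus disjoint output and input buffers, with multiple outputs grouped under a sink node:

```python
# Hypothetical shape of a schedule item: one fused kernel's AST plus the buffers
# it reads and writes (disjoint), multiple outputs grouped under a sink node.
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    name: str
    size: int

@dataclass(frozen=True)
class Node:
    op: str                      # e.g. "ADD", "REDUCE", "SINK"
    srcs: tuple = ()

@dataclass
class ScheduleItem:
    ast: Node                    # a SINK over one or more stores = multiple outputs
    outputs: tuple               # buffers written by this kernel
    inputs: tuple                # buffers read; must be disjoint from outputs

    def __post_init__(self):
        assert not set(self.outputs) & set(self.inputs), "in/out buffers must be disjoint"

a, b, out = Buffer("a", 1024), Buffer("b", 1024), Buffer("out", 1024)
item = ScheduleItem(ast=Node("SINK", (Node("ADD"),)), outputs=(out,), inputs=(a, b))
print(item.outputs, item.inputs)
```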
Upcoming design (PR 7065)
Gets rid of LazyBuffer (replaced with UOp)
and of engine/lazy.py
Gets rid of indexing processing in schedule.py, instead
exposes ShapeTracker methods in ops.py
Tracks materialized buffers (i.e. realized in tinygrad)
in ops.py
Graph rewriting to push views below computations, collect buffers
and kernels — in schedule.py
Pallas: a JAX Kernel Language (2024.11.23)
On TPU: args are already in SRAM (copied before running the kernel), then
copied explicitly into registers before computation
On GPU: args are in HBM/DRAM (global memory), explicitly copied into SRAM
(local/shared memory) before computation
Imperative: explicit assignments
Refs were introduced to make JAX stateful even before
Pallas, and are reused here
Kernels parallelize over grids
vmap of pallas_call: adds extra grid
axis
A BlockSpec projects an input or output to a
block-slice view per grid position (thread/stream), for blockwise parallelism
over the grid (sketch at the end of this section)
Manually specified for each input & output
Generalization of tiling
Dynamic slicing and masking
Also needs explicit out_shape
Not implemented yet: alternatives to BlockSpec
e.g. overlapping windows for convolutions
Doesn’t support: conv_general etc. (usually not
supported in hardware); gather/scatter (on
backends without noncontiguous memory reads/writes)
GPU and TPU are not entirely interchangeable?
More control over memory access (on GPU)
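To make BlockSpec / out_shape / grid concrete, a minimal blockwise-add kernel; a sketch assuming the newer BlockSpec(block_shape, index_map) argument order (JAX >= 0.4.31), with interpret=True only for CPU testing:

```python
# Minimal Pallas sketch: blockwise add over a 1-D grid.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs are mutable views of the current block; imperative, explicit assignment.
    o_ref[...] = x_ref[...] + y_ref[...]

def add(x, y, block=128):
    n_blocks = x.shape[0] // block
    # Block shape + index map (grid index -> block index); a generalization of tiling.
    spec = pl.BlockSpec((block,), lambda i: (i,))
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),  # out_shape must be explicit
        grid=(n_blocks,),
        in_specs=[spec, spec],   # manually specified for each input ...
        out_specs=spec,          # ... and output
        interpret=True,          # CPU emulation for testing; drop on a real GPU/TPU
    )(x, y)

x = jnp.arange(1024, dtype=jnp.float32)
print(add(x, x)[:4])
```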
Triton (2024.11.24)
Based on pointers
Pallas exposes the same load/store style, but on Refs rather than pointers
load, store with dynamic slicing and
mask
Nice idea: associate axes with strides to decouple tensor semantics
from memory layout
Shares code across backends (CUDA, CDNA), with dedicated sections to
compute e.g. the number of threads per multiprocessor
Execution over a grid, as in CUDA and Pallas
Both manual and auto-tuned splitting into warps
Autotune picks a config out of manually specified candidates
Each block must have 2^n elements; needs padding via a mask
Manual decoding of indices into per-block vectors of offsets for
load/store, but otherwise functions operate over n-dimensional arrays
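The same kind of example in Triton, showing pointers, the per-block offset vector, the power-of-two block size, and masking; a sketch that requires the triton and torch packages and a CUDA device:

```python
# Minimal Triton sketch: pointer arithmetic per block, power-of-two block size,
# mask as padding for the last partial block.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                 # which block of the 1-D grid we are
    offs = pid * BLOCK + tl.arange(0, BLOCK)    # per-block vector of offsets
    mask = offs < n                             # mask pads the last, partial block
    x = tl.load(x_ptr + offs, mask=mask)        # loads/stores take pointers + mask
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y, BLOCK=1024):                      # BLOCK must be a power of two
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), BLOCK),)     # execution over a grid, as in CUDA
    add_kernel[grid](x, y, out, x.numel(), BLOCK=BLOCK)
    return out

x = torch.randn(3000, device="cuda")
print(torch.allclose(add(x, x), x + x))
```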