Mind Maps — November 2024

OCANNL Design (2024.11.07 – 2024.11.12)

Factor Out of Backends (2024.11.07)

  • Memory modes
  • Is-in-context
  • Updating context arrays
    • Based on traced_arrays
    • Before linking in backend

Release 0.5 (2024.11.08)

  • Factor out mem context ✓
    • Synchronize running routines
  • Factor out copying prep ✓
    • Synchronize copying
  • Remove syncing from the data-parallel algo
    • Debug data-parallel perf

Program Search (2024.11.09)

  • Symbolic optimization
    • Generic across backends
  • Currently deterministic rewrite system
    • Nondeterministic rewrite system
      • Beam search and MCTS (toy sketch after this list)
  • Needs backend-specific fitness
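
  A minimal, hypothetical sketch of the beam-search idea above (the names and the fitness function are illustrative, not OCANNL or tinygrad APIs): a nondeterministic rewrite system proposes several successor programs per state, and a backend-specific fitness, e.g. measured kernel runtime, ranks them.

    def beam_search(initial_state, successors, fitness, width=4, steps=3):
        # Keep the `width` best rewrite sequences at each depth; lower fitness is better.
        beam = [initial_state]
        for _ in range(steps):
            candidates = [s for state in beam for s in successors(state)]
            if not candidates:
                break
            beam = sorted(candidates, key=fitness)[:width]
        return min(beam, key=fitness)

    # Toy usage: a state is the tuple of rewrites applied so far; fitness stands in for a benchmark.
    rewrites = ["unroll", "tile", "vectorize"]
    successors = lambda state: [state + (r,) for r in rewrites if r not in state]
    fitness = lambda state: 10.0 - len(state)  # pretend each distinct rewrite speeds things up
    best = beam_search((), successors, fitness)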

Training Rig (2024.11.10)

  • How best to connect the computers I have?
    • Thunderbolt 5 vs Ethernet
  • Mac Mini
    • Cheap but less memory
  • Razer Blade laptop?

Managing Scarce Memory (2024.11.11)

  • evacuated field
    • Actually, evacuated_to : [Host | Stream Tn.t] option
    • If a node is evacuated, automatically schedule bringing it back when needed (bookkeeping sketch after this list)
    • Can evacuation be automated?
  • to_host ~evacuate:true
  • device_to_device ~evacuate:true?
    • Bringing back across multiple hops
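
  A rough Python sketch of the bookkeeping idea only (not OCANNL code; the evacuated_to name mirrors the field above, everything else is hypothetical): evacuating a node records where its contents went, and any later use transparently schedules bringing it back.

    # Hypothetical bookkeeping for evacuation; "host"/"stream" stand in for real backends.
    class Node:
        def __init__(self, name, data):
            self.name = name
            self.data = data              # device-resident array, freed on evacuation
            self.evacuated_to = None      # None | ("host", payload) | ("stream", stream_id, payload)

    def evacuate_to_host(node):
        node.evacuated_to = ("host", node.data)
        node.data = None                  # free scarce device memory

    def ensure_on_device(node):
        # Called automatically before a routine that reads `node`.
        if node.evacuated_to is not None:
            node.data = node.evacuated_to[-1]   # schedule the copy back (here: just restore)
            node.evacuated_to = None
        return node.data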

Data Movement Automation (2024.11.12)

  • Default from_host to initialize hosted nodes on device
    • But the reverse for computational initialization, e.g. random distributions
  • Similar to evacuated_to (11.11)
  • Does not need an extra layer!
    • Explicit calls to backend work as currently
    • Automates backend calls
      • Available functions and/or exposed fields to disable or tune the automation

Low-Precision Numbers and Quantization (2024.11.13)

  • bfloat16
    • Broad hardware support
    • Standardized?
    • Lobby for inclusion in Bigarray?
    • Low value per effort — but good stress test of design
  • fp8
    • High-end hardware
    • Two fp8 types on H100
    • Blockwise quantization on Blackwell hardware
    • Missing compute precisions — no design effort
  • TF32: TensorFloat-32
    • Nvidia TensorCore
    • In fact it’s 19 bits
  • Quantized
    • Training vs inference
    • Simple discretization
    • Statistically quantizing trained models
    • Represent as multidim. precision
  • Compute precision kept separate from memory-storage precision (bit-level sketch below)
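
  A small NumPy illustration of separating storage precision from compute precision, assuming plain truncation to bfloat16 (real hardware typically rounds to nearest even): bfloat16 is just the top 16 bits of a float32.

    import numpy as np

    def to_bf16_bits(x):
        # Store only the upper 16 bits of the float32 pattern: 1 sign + 8 exponent + 7 mantissa bits.
        return (np.asarray(x, np.float32).view(np.uint32) >> 16).astype(np.uint16)

    def from_bf16_bits(b):
        # Widen back to float32 for computation (compute precision differs from storage precision).
        return (np.asarray(b, np.uint16).astype(np.uint32) << 16).view(np.float32)

    x = np.array([1.0, 3.14159, 1e-3], dtype=np.float32)
    stored = to_bf16_bits(x)                  # 16-bit memory representation
    y = from_bf16_bits(stored) * 2.0          # arithmetic done in float32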

Strong Models Are Very Large (2024.11.14 – 2024.11.18)

Strong Models Are Very Large (2024.11.14)

  • Pipeline parallelism
    • Synchronous
    • Asynchronous
      • Multiple copies of grads or weights
      • Enforces: 1 worker = multiple streams
      • Ensure: possible to assign all resources to a stream while others suspended
  • Networked devices
    • Add_network functor for backends
  • Model parallelism (see 11.16)
    • Heterogeneous backends
      • Also: at-home networking of Macs and Nvidia RTX machines
      • Should Add_network devices have a "no network" option? — No; networking is the primary motivation

Usability (2024.11.15)

  • Runs the necessary fwd and bprop code
    • Tracking of what needs recomputing
    • Invokes %op if needed
    • Requiring module Backend in scope
  • Extension layer %run?
    • Not needed for auto memory transfers
  • Documentation
    • Consider .mld and/or mdx
  • Install printers for Tnode.t and Tensor.t
  • Saving/restoring
    • Full model graph saving
      • W/o pickling aka. marshalling
    • Granular snapshotting

Model Parallelism (2024.11.16)

  • Fully Sharded Data Parallel
    • Paper from Meta AI arXiv:2304.11277
    • Parameter prefetching
  • Overlap communication with computation
    • Operation reordering — hard! requires passing activations
  • In OCANNL, easy to manually partition a model into components and explicitly distribute over devices and backends.
    • More natural for pipeline parallelism
    • Awkward/hard for model parallelism, especially for parameter sharding
  • We need good design for initialization
  • Tensor parallelism
    • In OCANNL, activations are typically a special case of non-materialized tensors
    • Rethinking design: non-materialized tensors vs the eviction mechanism

Parameter Sharding (2024.11.18)

  • Fully Sharded Data Parallel
    1. Each sub-module / model layer is an FSDP unit.
    2. Each unit’s non-shared parameters are flattened, concatenated and sharded across backends/devices.
    3. Before the unit's computation, the required parameters are unsharded; afterward, the parameters belonging to other shards are freed.
    • Communicates parameters & their gradients on demand, for unsharding & accumulation
    • Crosses abstraction levels: bad fit for OCANNL design
    • Great for balancing memory, computation, number of devices
  • Tensor Parallelism
    • Communicates activations: computation boundaries
    • Upside: keeps model design clean
    • Downside: complicates model design with computational considerations
    • Manually: a slice operator indexed by backend/device/stream (toy sketch after this list)
      • Hard work: find good axes to balance memory, computation, number of devices
      • Fits with OCANNL design
    • Automatically find good axes — tricky, worse fit for OCANNL design
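
  A toy NumPy sketch (not OCANNL code) of the manual slicing idea: a linear layer's weight is split column-wise across hypothetical devices, each device multiplies by its shard only, and the partial activations are concatenated, i.e. communicated at the computation boundary.

    import numpy as np

    x = np.random.randn(8, 64).astype(np.float32)     # activations (batch, in_features)
    W = np.random.randn(64, 256).astype(np.float32)   # full weight (in_features, out_features)

    n_devices = 4
    W_shards = np.split(W, n_devices, axis=1)          # each "device" holds a column shard

    partials = [x @ W_k for W_k in W_shards]           # local compute; weights never move
    y = np.concatenate(partials, axis=1)               # gather activations along the output axis

    assert np.allclose(y, x @ W, atol=1e-3)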

Ideas for Upcoming Mind Maps (2024.11.19, updated 2024.11.27)

  • Convolutions and padding — 1/3 done: tinygrad
    • Design risk: interacts with the whole of OCANNL, esp. shape inference
    • Even LLMs need padding
    • Visually attractive examples
  • Pipeline parallelism — mostly done
    • Continues recent theme
  • Program search in tinygrad — done
    • Super important for design
  • Pallas: extension of JAX for writing custom kernels — done
    • Mutable abstraction layer for JAX
  • What are blocks? (added 11.27)
  • More deep dives
    • Karpathy’s: llm.c, llama2.c, LLM101n, nanoGPT
    • jackpeck’s llama2.ml
    • Keller Jordan’s modded-nanogpt
    • keras → keras.core

Pipeline Parallelism and Optimizer/Gradient Sharding (2024.11.20)

  • Papers to read: ZeRO arXiv:1910.02054, Zero Bubble arXiv:2401.10241, NanoFlow arXiv:2408.12757
  • Pipeline parallelism
    • Pipeline with nano-batches at the granularity of operations
    • PP most helpful for cross-server connections
    • Asynchronous PP breaks optimizer semantics
    • Zero Bubble
      • Optimizes microbatch scheduling of Forward, Backward, and Weight-gradient
      • Layer-wise parallelizes the optimizer step by accumulating progressively
      • Propagates global optimizer state of previous iteration while the next iteration is computing the initial forward steps
      • Redoes the optimizer step if a global check fails for any layer (found INF, NaN, or gradient clip needed)
  • Optimizer/gradient sharding
    • ZeRO: focus on minimizing per-GPU memory
    • ZeRO-DP is similar to FSDP, but lossy and much simpler (toy sketch after this list)
      • Partitions optimizer states, discards gradient parts for other partitions
  • More notes coming
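
  A toy NumPy sketch of the ZeRO-DP partitioning idea above (no real communication; names are illustrative): each of N ranks keeps optimizer state only for its 1/N slice of the flattened parameters, receives only its slice of the averaged gradient (reduce-scatter), updates it, and the updated slices are gathered back.

    import numpy as np

    N = 4
    params = np.zeros(16, dtype=np.float32)                       # flattened parameters
    momentum = [np.zeros(4, dtype=np.float32) for _ in range(N)]  # each rank owns 1/N of the state

    def sgd_momentum_step(params, per_rank_grads, lr=0.1, beta=0.9):
        updated = []
        for r in range(N):
            sl = slice(4 * r, 4 * (r + 1))
            g = sum(grad[sl] for grad in per_rank_grads) / N      # "reduce-scatter": rank r's slice
            momentum[r] = beta * momentum[r] + g
            updated.append(params[sl] - lr * momentum[r])
        return np.concatenate(updated)                             # "all-gather" of updated slices

    grads = [np.random.randn(16).astype(np.float32) for _ in range(N)]  # one local gradient per rank
    params = sgd_momentum_step(params, grads)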

Tinygrad Deep Dive (2024.11.21 – 2024.11.27)

Program Search in Tinygrad (2024.11.21)

  • Kernel-transforming operations — each usually has an axis and an amt (amount, i.e. the part of the axis affected):
    • LOCAL — converts part of a global axis to local
    • PADTO — adds padding to an axis
    • NOLOCALS — prohibits use of local axes
    • GROUPTOP
    • UPCASTMID
    • GROUP — related: fusing reduce ops??
    • TC (TensorCore)
    • UPCAST — converts part of an axis into an innermost upcasted axis
    • UNROLL — converts part of or whole axis into an innermost upcasted axis
    • SWAP — swaps two axes in a tensor
  • Upcasted axis is computed or reduced at higher precision??
  • In tinygrad, memory mode and projection semantics are assigned per-axis
  • Axis color coding: blue = global dims, cyan = local dims, green = reduce-local, white = reduce-late upcasted, red = reduce loops, purple = reduce upcasted, yellow = normal upcasted
  • TensorCore's reduce axis is selected starting from first_reduce, with the 2 input axes ending at first_reduce
    • Applying TC applies PADTO, UNROLL for reduce axis, UPCAST, LOCAL for TensorCore’s threads’ axes
    • Then optional “hand-coded” UPCAST and LOCAL
  • All of this is in codegen/kernel.py, beam_search itself is in engine/search.py

Scheduler in Tinygrad (2024.11.22)

  • Why is all this in here? Where is the scheduling? — schedule.py: things to schedule, not scheduling
  • to_uop for non-const buffers is the ShapeTracker’s view of the buffer
    • A UOp is_scheduled when the op is Ops.VIEW
  • Selects groups to fuse, vs. what to materialize — via graph rewriting
  • Prepares indexing (aka. movement ops), integrates indexing with computation
    • What is swizzle??? → see later
  • ScheduleItem with disjoint input and output buffers
    • Similar to OCANNL’s inputs and outputs fields in routine
    • Multiple outputs possible via Ops.SINK AST node, otherwise single output
  • Upcoming design (PR 7065)
    • Gets rid of LazyBuffer (replaced with UOp) and of engine/lazy.py
    • Gets rid of indexing processing in schedule.py, instead exposes ShapeTracker methods in ops.py
    • Tracks materialized buffers (i.e. realized in tinygrad) in ops.py
    • Graph rewriting to push views below computations, collect buffers and kernels — in schedule.py

Pallas: a JAX Kernel Language (2024.11.23)

  • On TPU: args are already in SRAM (copied before running a kernel), then copied explicitly into registers before computation
  • On GPU: args are in HBM/DRAM (global memory), explicitly copied into SRAM (local memory) before computation
  • Imperative: explicit assignments
    • Refs were introduced to make JAX stateful even before Pallas and are reused here (minimal kernel sketch after this list)
  • Kernels parallelize over grids
    • vmap of pallas_call: adds extra grid axis
  • A BlockSpec projects an input or output to a block-slice view and threads/streams, for blockwise parallelism over grids
    • Manually specified for each input & output
    • Generalization of tiling
    • Dynamic slicing and masking
    • Also needs explicit out_shape
    • Not implemented yet: alternatives to BlockSpec e.g. overlapping windows for convolutions
  • Doesn’t support: conv_general etc. (usually not supported in hardware); gather/scatter (backends without noncontiguous memory reads/writes)
  • GPU and TPU are not entirely interchangeable?
  • More control over memory access (on GPU)
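
  A minimal Pallas kernel, adapted from the Pallas quickstart pattern; it shows the imperative Ref reads/writes and the explicit out_shape mentioned above (the interpret=True flag, assuming your JAX version supports it, runs the kernel in pure JAX for testing without a GPU/TPU).

    import jax
    import jax.numpy as jnp
    from jax.experimental import pallas as pl

    def add_kernel(x_ref, y_ref, o_ref):
        # Refs make the kernel imperative: read whole blocks, write the output block explicitly.
        o_ref[...] = x_ref[...] + y_ref[...]

    @jax.jit
    def add(x, y):
        return pl.pallas_call(
            add_kernel,
            out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),  # output shape must be given explicitly
            interpret=True,                                    # assumption: interpreter mode for CPU testing
        )(x, y)

    x = jnp.arange(8, dtype=jnp.float32)
    print(add(x, x))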

Triton (2024.11.24)

  • Based on pointers
    • Pallas exposes them but on refs
  • load, store with dynamic slicing and mask
  • Nice idea: associate axes with strides to decouple tensor semantics from memory layout
  • Sharing code across backends (CUDA, CDNA), but with dedicated sections computing e.g. the number of threads per multiprocessor
  • Execution over a grid, as in CUDA and Pallas
  • Both manual and auto-tune splitting into warps
    • Autotune picks a config out of manually specified candidates
  • Each block must have 2^n elements, needs padding via mask
  • Manual decoding of indices into per-block vectors of offsets for load/store, but otherwise functions over ndim arrays (kernel sketch after this list)
  • Nice tutorials
    • Maybe worth replicating triton.testing (perf_report|Benchmark)
  • Groups of blocks reduce memory transfers for the same number of output blocks
    • Not built-in
  • tl.exp is __expf from CUDA (approximate)
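
  The bullets above (masked load/store, per-block offset vectors, power-of-two block sizes, launch over a grid) are all visible in the canonical vector-add kernel from the Triton tutorials, reproduced here as a sketch; it needs a CUDA/ROCm GPU to actually run.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                              # position in the 1D launch grid
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)    # per-block vector of offsets
        mask = offsets < n_elements                              # pads the last block via the mask
        x = tl.load(x_ptr + offsets, mask=mask)                  # pointer arithmetic + masked load
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)          # BLOCK_SIZE must be a power of two
        return out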

Triton Part 2 (2024.11.25)

  • Counter-based Parallel Random Numbers
    • Computes pseudo-random numbers on the device from an int32 seed
  • Reducing across blocks in a group — manual, by locking at the end of a kernel
  • Vs scheduling DSLs
    • Separate algo and schedule: tile splits, loop reordering and unrolling, parallel axes
    • TVM has built-in automatic scheduling
  • Vs polyhedral DSLs
    • Use affine access functions
    • Support fusion, interchange, tiling, parallelization
    • Loop transformations
    • Large search space
    • Not applicable to (structured-)sparse networks
  • Manual optimizations:
    • Coalescing — orders threads within a micro-tile for contiguous memory access
    • Thread swizzling — transforms the row-major traversal into a column-major one within each group of group-size rows (plain-Python sketch after this list)
  • Auto optimizations:
    • Pre-fetching
    • Async copy scheduling
    • Shared-memory synchronization — inserts barriers into GPU code by detecting read-after-write and write-after-read hazards
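
  A plain-Python sketch of the grouped ("swizzled") block ordering used in the Triton matmul tutorial: consecutive program ids are remapped so that group-size rows of output tiles are visited together, improving L2 reuse compared to streaming a whole row of tiles at a time.

    def swizzled_tile(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
        num_pid_in_group = GROUP_SIZE_M * num_pid_n
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)   # last group may be shorter
        pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
        pid_n = (pid % num_pid_in_group) // group_size_m
        return pid_m, pid_n

    # For a 4x4 grid of output tiles with GROUP_SIZE_M=2, the first 8 pids cover a 2x4 band of tiles
    # instead of the same 1x4 row twice over, so loaded input tiles get reused from cache.
    order = [swizzled_tile(pid, 4, 4, 2) for pid in range(16)]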

Tinygrad ShapeTracker (2024.11.26)

  • Pure movement ops: RESHAPE, PERMUTE, PAD, EXPAND, SHRINK, STRIDE
    • PAD — fwd: NumPy pad, bprop: shrink
    • EXPAND — fwd: NumPy broadcast_to, bprop: reduce SUM
    • SHRINK — fwd: NumPy subarray view, bprop: pad
    • STRIDE — decomposed using Flip for < 0 and a combination of Pad and Reshape for > 1
  • Translates all of NumPy syntax to ops
  • Vs teenygrad = 1/10th of tinygrad
  • Is a list of View objects
  • A View has: shape (i.e. dims), strides, offset, mask (begin-end per axis), and whether it’s contiguous (minimal sketch after this list)
    • Default strides assume shape is rightmost-major
    • Canonicalized, e.g.: if mask uncovers just 1 index, convert to “no stride” and adjust the offset
  • Why many views per Tracker? → tomorrow
  • Direct relevant ops are: EXPAND, CONTRACT, VIEW, REDUCE_AXIS
  • Assigned to op nodes except DEFINE_LOCAL/GLOBAL/VAR, BUFFER, CONST
    • VIEW has a ShapeTracker
    • Other nodes inherit ShapeTracker from children; all children must have the same ShapeTracker!
  • VIEW = non-copy movement op; VIEW doesn’t have children, instead typically provides a view for the preceding DEFINE_GLOBAL
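
  A minimal Python sketch (not tinygrad's classes) of the View record described above, with default rightmost-major strides and a PERMUTE expressed as a pure movement op that only reorders shape and strides.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    def default_strides(shape):
        # Rightmost-major ("C-contiguous") strides: the last axis has stride 1.
        strides, acc = [], 1
        for d in reversed(shape):
            strides.append(acc)
            acc *= d
        return tuple(reversed(strides))

    @dataclass(frozen=True)
    class View:
        shape: Tuple[int, ...]
        strides: Tuple[int, ...]
        offset: int = 0
        mask: Optional[Tuple[Tuple[int, int], ...]] = None   # (begin, end) valid range per axis

        def index(self, idx):
            # Flat buffer index of a multi-dimensional index.
            return self.offset + sum(i * s for i, s in zip(idx, self.strides))

    v = View((2, 3, 4), default_strides((2, 3, 4)))   # strides (12, 4, 1)
    permuted = View((4, 3, 2), (1, 4, 12))            # PERMUTE: reorder dims and strides, no copy
    assert v.index((1, 2, 3)) == permuted.index((3, 2, 1)) == 23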

Tinygrad Followup (2024.11.27)

  • MultiLazyBuffer with per-device LazyBuffers
    • Stores per-device bounds of the sharded axis
    • Dedicated float4 support
  • Multigpu via sharding
    • Ends up duplicating the part of the compute graph below the reduction over the sharded axes
    • Its ScheduleItem forms a kernel
    • STORE: retain after kernel finishes
  • DEFINE_GLOBAL args: position in parameter list, param name, mutability
  • CAST and BITCAST
    • BITCAST first takes the address and pointer-casts it
  • UPCAST UOp actually means UNROLL!
    • What is UPCAST Opt (i.e. in program search)?
  • Variable
    • Axis dimensions placeholder
    • Has range — this range is used to render loops in Linearizer
    • Can remain symbolic
      • Generates symbolic mask guard
      • Passed as a kernel param
    • Can create_schedule_with_vars perform some shape inference?