Mind Maps — January 2025

Apple Platforms (2025.01.13 – 2025.01.15)

Apple’s MLX (2025.01.13)

  • Expression based like JAX — tracing → compute graph
    • But imperative and with stateful Module class
    • Pythonic: classes, generators
  • Expr. transformations work on function objects: grad, vmap, vjp, jvp
  • Specify device (cpu, gpu) per expression / result array — actually, you specify a stream (since memory is unified)
  • array vs. vector
  • Good C interface
  • Explicitly executed with eval() — implicitly executed by print, memoryview, save
    • Currently not compiled
  • compile() optimizes graph e.g. by fusion — only for pure functions
    • Can compile the whole SGD update, support for capturing mutable state
  • Formats: .npy, .npz, .safetensors (Hugging Face), .gguf (Georgi Gerganov's GGML)
  • Distributed comp. via MPI: all_sum, all_gather, send, recv (an array)
  • Exporting functions, can be imported cross-frontend
    • Only for fixed shapes, via example inputs
    • Exporting modules with or w/o params

Apple Metal Quick Glance (2025.01.15)

  • Can compile shaders offline from the command line, or at runtime from a source string, asynchronously or synchronously: newLibraryWithSource:options:completionHandler: / newLibraryWithSource:options:error:
  • Multilevel command buffer: blit/compute encoders
  • Doesn’t support double (?)

Habits (2025.01.19)

  • Atomic
    • Obvious — define trigger: definite time+location, or end of earlier habit
    • Attractive — community, temptation / ritual
    • Easy — reduce friction, prime the environment; downscale habit to two minutes; break down big ones into multiple; establish a <2min minimal satisfying version
    • Satisfying — immediate reward on completion; use habit tracker
  • Modify habit — keep trigger and reward, replace routine
  • Stepladder — small changes; Community; Important — dwell on priorities; Easy; Neurohack — behavior drives identity; Captivating rewards; Engrained — repetition

Andrej Karpathy’s Educational Resources (2025.01.20)

  • GitHub repositories
  • nn-zero-to-hero — introductory and experimental parts of the lectures
    • Lecture 1: micrograd — a minimal scalar autograd engine and tiny NN library in pure Python
    • Lectures 2–5: makemore — character-level LMs, starting from a bigram model and working up to MLPs
    • Lecture 8: nanoGPT — optimizes minGPT
  • minGPT — Karpathy's first GPT-in-PyTorch repo, purely educational
  • nanoGPT
    • ng-video-lecture — follows the first, 2-hour lecture reproducing nanoGPT
    • build-nanogpt — follows the latest, 4-hour lecture reproducing nanoGPT (improved lecture)
  • minBPE — tokenizer for GPT
  • char-rnn — character-level LM, old, in Lua Torch, vanilla Recurrent NN, GRU, LSTM
  • llama2.c
    • train.py in PyTorch
    • run.c in FP32 pure C
    • runq.c in int8 quantized
  • llm.c — very efficient C/CUDA, multi-GPU with ZeRO, optimized for pretraining GPT-2/3
    • Simple reference impls: train_gpt2.c in pure C, train_gpt2.cu in CUDA (this is the entry point to main impl.), train_gpt2.py in PyTorch

llm.c Deep Dive (2025.01.21 – 2025.01.23)

llm.c (2025.01.21)

  • Uses cuBLAS and cuBLASLt for matmul forward and backward; manual kernel for the backward bias term
  • Manual kernels for: encoder forward and backward, layernorm forward and backward, softmax forward and (in-place) backward, Cross-Entropy forward fused with backward, AdamW (see slide 2024.11.20)
  • Hyperparams for loading any GPT2 or GPT3
  • Random init scheme from GPT2, init computed on the CPU
  • Uses ZeRO for multi-GPU — uses MPI to distribute (to be continued)
  • Two variants: with & w/o cuDNN, used for attention
  • Uses stochastic rounding from FP32 to BF16 (see the sketch after this list)
  • Judiciously uses size_t to not overflow int
  • Conserves memory by reusing buffers
  • Manually manages memory (no per-tensor alloc/free)
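
A minimal sketch of the usual bit-trick behind stochastic FP32 → BF16 rounding (illustrative, not llm.c's exact kernel; the function name and the caller-supplied random bits are assumptions):

    #include <cuda_bf16.h>

    // Stochastic rounding FP32 -> BF16: add uniform random noise below the
    // BF16 cutoff, then truncate to the top 16 bits. In expectation the value
    // is preserved, which matters when accumulating many tiny updates in BF16.
    // Simplified: ignores NaN/Inf edge cases.
    __device__ __nv_bfloat16 stochastic_round_bf16(float x, unsigned int rand_bits) {
        unsigned int bits = __float_as_uint(x);
        bits += (rand_bits & 0xFFFFu);          // random offset within the truncated low 16 bits
        __nv_bfloat16_raw raw;
        raw.x = (unsigned short)(bits >> 16);   // keep sign + exponent + 7-bit mantissa
        return __nv_bfloat16(raw);
    }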

llm.c Distributed (2025.01.23)

  • Sharded Data Parallel and FSDP not supported yet — only optimizer-state sharding, i.e. ZeRO-1
  • Dedicated CUDA stream for NCCL ops
  • llmc/zero.cuh here, NCCL and MPI overview on future maps
  • multi_gpu_async_reduce_gradient: reduce-scatter if ZeRO else all-reduce (sketched after this list)
  • gpt2_calculate_grad_norm: reuses the activations buffer; without ZeRO the gradients are already averaged across all GPUs, so norms are summed locally; with ZeRO the local norms must be all-reduce-summed
  • Different network socket API causes some duplication between Windows and Linux
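
A hedged sketch of the branch above (illustrative names, not llm.c's actual code): with ZeRO-1 each rank only needs the reduced gradients of its own parameter shard, so Reduce-Scatter suffices; otherwise the full gradient buffer is All-Reduced.

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Illustrative sketch: average gradients across ranks on the NCCL stream.
    // ZeRO-1: rank r receives only the reduced gradients of its own shard,
    // in place at grads + r * shard_count, moving 1/nranks of the data.
    void reduce_gradients(float* grads, size_t total_count, size_t shard_count,
                          int rank, int use_zero1,
                          ncclComm_t comm, cudaStream_t stream) {
        if (use_zero1) {
            ncclReduceScatter(grads, grads + (size_t)rank * shard_count, shard_count,
                              ncclFloat, ncclAvg, comm, stream);
        } else {
            ncclAllReduce(grads, grads, total_count, ncclFloat, ncclAvg, comm, stream);
        }
    }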

CUDA Programming (2025.01.24 – 2025.01.25)

CUDA Cooperative Groups (2025.01.24)

  • API for thread subsets — first-class thread blocks
  • coalesced_group active = coalesced_threads() — the currently active (converged) threads of the warp
    • active.sync() — synchronizes only this group, avoiding the deadlocks that a divergent __syncthreads() can cause
  • group.sync(), group.thread_rank() (index of this thread within the group)
  • tiled_partition(group, size) — if size=32, gives warps
    • thread_block_tile<size>, tiled_partition<size>(group)
  • thread_block_tile::...: shfl(), shfl_down(), shfl_up(), shfl_xor(), any(), all(), ballot(), match_any(), match_all()
    • #pragma unroll works because the size is a compile-time constant — the compiler can then elide synchronization (removing it by hand is unsafe)
    • __ffs, __popc
  • CUDA Warp-level primitives — 32-bit (int) masks pick threads of a warp
    • Synchronized data exchange: __all_sync, __any_sync, __uni_sync, __ballot_sync; __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync; __match_any_sync, __match_all_sync
  • __activemask() returns the mask of currently active threads; sync with memory fence: __syncwarp(mask = FULL_MASK) (0xffffffff)
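
A small sketch of the API above (kernel and variable names are my own): partition a block into warp-sized tiles and sum a value with the tile's shfl_down.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Tree-reduce `val` across a 32-wide tile with the tile's shuffle methods.
    // The tile size is a compile-time constant, so the loop fully unrolls and
    // no explicit warp synchronization is needed around the shuffles.
    __device__ float warp_sum(cg::thread_block_tile<32> tile, float val) {
        #pragma unroll
        for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
            val += tile.shfl_down(val, offset);
        }
        return val;  // final sum is valid where tile.thread_rank() == 0
    }

    __global__ void block_sums(const float* in, float* out) {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
        float v = in[blockIdx.x * blockDim.x + threadIdx.x];
        float s = warp_sum(warp, v);
        if (warp.thread_rank() == 0) atomicAdd(&out[blockIdx.x], s);
    }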

CUDA Warp Functions: Vote, Match, Reduce, Shuffle (2025.01.25)

  • Each calling thread must be included in the mask, and all participating threads must pass the same mask
  • Vote: __all_sync, __any_sync — whether the predicate is non-zero for all/any of the threads
  • Match: __match_any_sync — returns the mask of threads with the same value as the calling thread; __match_all_sync — returns the given mask if all its threads have the same value, otherwise 0
  • Vote: __ballot_sync — returns the mask of threads with non-zero value
  • Reduce: __reduce_add/min/max_sync — reduce int or unsigned values; __reduce_and/or/xor_sync — bitwise-reduce unsigned values
  • Shuffle (shfl): exchange a variable between threads of a warp (faster than shared mem)
    • Unlike reduce functions, works with all numeric types, including __half2 and __nv_bfloat162
    • __shfl_sync specifies source lane explicitly
    • __shfl_up/down_sync specify delta, source is lower/higher
    • __shfl_xor_sync bitwise XORs calling thread lane ID with the given lane mask (see: butterfly pattern)
    • Optional width (a power of two ≤ 32; default warpSize) subdivides the operation into sub-groups, with group-relative lane addressing
  • Warp Matrix Functions leverage Tensor Cores for Matrix Multiply Add
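
A sketch combining a few of the primitives above (FULL_MASK assumes all 32 lanes participate; names are illustrative): ballot to count threads satisfying a predicate, then a shuffle-down tree sum.

    #define FULL_MASK 0xffffffffu

    // Ballot: bit i of `mask` is set iff lane i's value is positive.
    // Shuffle-down: tree reduction, so lane 0 ends up with the warp sum.
    __device__ float warp_stats(float val, int* positive_count) {
        unsigned int mask = __ballot_sync(FULL_MASK, val > 0.0f);
        if ((threadIdx.x & 31) == 0) {
            *positive_count = __popc(mask);     // how many lanes voted "yes"
        }
        for (int offset = 16; offset > 0; offset /= 2) {
            val += __shfl_down_sync(FULL_MASK, val, offset);
        }
        return val;                             // warp sum, valid in lane 0
    }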

cuDNN (2025.01.26 – 2025.01.27)

cuDNN: CUDA Deep Neural Network (2025.01.26)

  • Frontend: C++ and Python APIs; Backend: C API
  • C++ based on shared pointers
  • All tensors have from 3 to 8 axes (with leading dims 1 if not needed)
  • The cuDNN graph infers properties of virtual/temporary tensors: shapes, strides, precisions
    • Via Graph::validate
  • Graph: fusion etc.
  • Configurable defaults for: io, intermediate, and compute data type
  • Autotuning — filtering: numerical, behavior, functional properties; policy-based selection; multiple heuristic-based execution plans
  • Opt-in to use Tensor Core (in backend, not in frontend?)
  • Multinode graphs do not support in-place operations
  • Operations:
    • Matrix Mult. — matmul broadcasts even non-1 batch axes if needed
    • Convolution: forward, data grad, weight grad
    • Batch Normalization: forward, backprop, finalize stats
    • Layernorm: forward, backprop
    • Instancenorm: forward, backprop
    • Attention: forward, backprop — only for FP16, BF16, FP8
    • Pointwise: add, bias, scale, sub, mul, rsqrt, relu, elu, gelu, cmp_gt
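
A rough sketch of the cudnn_frontend (v1.x) C++ graph-building flow for a matmul followed by a pointwise ReLU; the fluent method names reflect my understanding of the frontend and should be checked against the installed version (handle, dims, and pointers are placeholders).

    #include <cudnn_frontend.h>
    namespace fe = cudnn_frontend;

    // Build a small graph: R = relu(A @ B). Defaults for io / intermediate /
    // compute data types are set on the graph; virtual tensors are inferred.
    auto build_matmul_relu(cudnnHandle_t handle, int64_t b, int64_t m, int64_t k, int64_t n) {
        fe::graph::Graph graph;
        graph.set_io_data_type(fe::DataType_t::HALF)
             .set_intermediate_data_type(fe::DataType_t::FLOAT)
             .set_compute_data_type(fe::DataType_t::FLOAT);

        auto A = graph.tensor(fe::graph::Tensor_attributes()
                                  .set_name("A").set_dim({b, m, k}).set_stride({m * k, k, 1}));
        auto B = graph.tensor(fe::graph::Tensor_attributes()
                                  .set_name("B").set_dim({b, k, n}).set_stride({k * n, n, 1}));

        auto C = graph.matmul(A, B, fe::graph::Matmul_attributes());   // virtual intermediate
        auto R = graph.pointwise(C, fe::graph::Pointwise_attributes()
                                         .set_mode(fe::PointwiseMode_t::RELU_FWD));
        R->set_output(true);

        // Property inference + plan construction (heuristics pick execution plans).
        graph.validate();
        graph.build_operation_graph(handle);
        graph.create_execution_plans({fe::HeurMode_t::A});
        graph.check_support(handle);
        graph.build_plans(handle);
        return graph;
    }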

cuDNN Graphs (2025.01.27)

  • Virtual tensors can be any type, but recommended FP32
  • Mixed precision inputs via (pointwise) identity
  • On Ada Lovelace, FP8 inputs trigger FP8 Tensor Cores
  • Compute type FP32 / CUDNN_DATA_FLOAT (recommended for backward pass) and CUDNN_DATA_FAST_FLOAT_FOR_FP8
  • Require exactly one batch axis
  • Pre-compiled single operation engines: convolution and normalization ops
  • Generic runtime fusion engines: only for pointwise ops → matmul or convolution (or none) → pointwise ops [→ reduction op]
  • Specialized pre-compiled engines
    • Convolution-BatchNorm with ReLU activation
    • FP8 Fused Flash Attention max sequence length 512
    • Fused Attention with max seq length 512, forward and backward, e.g. similar to BERT and T5
      • Allows optional: scalar key scaling, padding or causal masks, softmax, dropout
  • Specialized runtime fusion patterns
    • ResNet helpers: BatchNorm forward (with optional Add, ReLU, and (> 0) side output) and backward (with optional dReLU and side grad output for fwd’s Add)
    • Support multi-GPU batches
    • Fused Flash Attention forward and backward, usable for GPT- and BERT-like models
      • Configurable with many scaling, mask and dropout options

Communication Libraries (2025.01.28 – 2025.01.30)

MPI: Message Passing Interface and Collective Operations (2025.01.28)

  • llm.c usage: 1 GPU = 1 process
    • Can initialize NCCL with: tcp, mpi, fs (fs: file system synchronization)
    • MPI_Bcast to initialize NCCL rank
    • MPI_Allgather to find the GPU’s ordinal on a machine
  • Send, Recv: point-to-point communication
  • Broadcast: send from one to all
  • Reduce: send from all, reduce on the fly into a value received by one
  • All-Reduce: send from all, receive the same reduced value by all
  • Gather: send from all, all values received by one node in a container
  • All-Gather: send from all, every node receives all values in a container
  • Scatter: send from a container on one node, a different value to every node
  • Reduce-Scatter: pointwise reduces a vector and scatters the results
  • All-to-all: each node has a container, from which it sends a different value to every node’s receiving container
  • Scan: each node receives a partially reduced value depending on its rank
  • Vectorized ("v") versions of these exist, and the values can themselves be arrays
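
A minimal C sketch of two collectives from the list above (buffer names and the per-machine GPU count are made up): All-Reduce of a scalar loss and All-Gather of each process's device ordinal, similar in spirit to how llm.c bootstraps.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // All-Reduce: every rank contributes a value, every rank receives the sum.
        float local_loss = (float)rank + 0.5f, sum_loss = 0.0f;
        MPI_Allreduce(&local_loss, &sum_loss, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        // All-Gather: every rank contributes one value, every rank gets the whole array
        // (e.g. to learn each process's local GPU ordinal).
        int my_device = rank % 8;                       // assumption: up to 8 GPUs per machine
        int* all_devices = (int*)malloc(nranks * sizeof(int));
        MPI_Allgather(&my_device, 1, MPI_INT, all_devices, 1, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0) printf("mean loss = %.3f\n", sum_loss / nranks);
        free(all_devices);
        MPI_Finalize();
        return 0;
    }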

NCCL: NVIDIA Collective Communications Library (2025.01.29)

  • Deals with different GPU and interconnect types: PCIe, NVLINK, InfiniBand Verbs, IP sockets
  • Single kernel handling both communication and computation
  • Control: single-threaded, multi-threaded, multi-process including MPI
  • Operations: All-Reduce, Broadcast, Reduce, All-Gather, Reduce-Scatter, Send, Recv
    • Root rank gets the result of Reduce
  • A communicator groups nodes (ranks) and issues collective operations; each communicator object corresponds to one rank
  • ncclUniqueId identifies the root rank through which a given rank connects during initialization. It can also be a set of ncclUniqueIds; all nodes must use the same set.
  • Operations can optionally block; by default they are async, enqueued on streams — but enqueueing itself can block waiting for other ranks to arrive
  • Using communicators concurrently can cause deadlocks, even with separate streams, e.g. when one operation uses too many blocks
  • Don’t use streams while NCCL uses them
  • A device's communicator object can take host pointers but not pointers of a peer device, to avoid programmer errors
  • Cooperative Thread Array (CTA) — NVIDIA's term for a thread block
  • Group calls — dynamic scope ncclGroupStart/End
    • Must be used when single thread manages multiple devices
    • Aggregating operations might optimize communication
  • To avoid internal copies: ncclMemAlloc & ncclCommRegister
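
A hedged sketch of the standard bootstrap implied above: rank 0 creates the ncclUniqueId, MPI broadcasts it, every rank joins the communicator and enqueues an All-Reduce on its stream (device selection and buffer are illustrative).

    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <nccl.h>

    // One process per GPU: the unique id made on rank 0 is broadcast over MPI
    // so that every rank can join the same NCCL communicator.
    void nccl_allreduce_demo(int rank, int nranks, float* d_buf, size_t count) {
        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

        cudaSetDevice(rank % 8);                 // assumption: up to 8 GPUs per machine
        ncclComm_t comm;
        ncclCommInitRank(&comm, nranks, id, rank);

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // Asynchronously enqueued in-place sum across all ranks.
        ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        ncclCommDestroy(comm);
    }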

NCCL Usage in llm.c (2025.01.30)

  • NCCL setup and helpers in zero.cuh
  • multi_gpu_async_reduce_gradient: if not sharding then All-Reduce, if ZeRO-1 then Reduce-Scatter
  • gpt2_update: All-Gather the updated shards of params (see the sketch after this list)
  • gpt2_backward_and_reduce: All-Reduce gradients per-layer in the last microbatch step, also All-Reduce accumulated_mean_loss
  • gpt2_calculate_grad_norm: All-Reduce grad_norm_squared
  • Main: Periodically compute and All-Reduce: validation loss, HellaSwag accuracy
  • All-Reduce to compute memory stats
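
Complementing the Reduce-Scatter sketch earlier: once the optimizer has updated only this rank's parameter shard, an in-place All-Gather reassembles the full parameter buffer on every rank (illustrative names, not llm.c's actual code).

    #include <cuda_runtime.h>
    #include <nccl.h>

    // `params` is the full parameter buffer; this rank's freshly updated shard
    // lives at params + rank * shard_count. In-place All-Gather fills in the
    // shards updated by the other ranks.
    void gather_updated_params(float* params, size_t shard_count, int rank,
                               ncclComm_t comm, cudaStream_t stream) {
        ncclAllGather(params + (size_t)rank * shard_count,   // sendbuff: my shard
                      params,                                // recvbuff: full buffer
                      shard_count, ncclFloat, comm, stream);
    }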

Quick First Glance at Caten (2025.01.31)

  • Lazy like tinygrad, but more control like OCANNL: explicitly forced, can be explicitly compiled
    • Unlike OCANNL, compilation is optional
  • API the same as NumPy/PyTorch? — cumbersome, verbose notation
  • All graphs can be compiled with dynamic shapes (unlike tinygrad and current OCANNL)
  • Compilation pipeline: shape inference, high-level → scheduler i.e. kernel demarcation and ordering → low-level → polyhedral IR → rendering kernels
  • High-level (API) and low-level (AIR) graph interfaces, codegen, graph runner
  • Focus on “NN inference runtime”
  • Verdict: the code is too sprawling (e.g. too many files) to be worth digging into; better focus on tinygrad
    • But it’s impressive, maybe I’ll have some luck finding inspiration
  • Interesting to look at: source/api/tensor.lisp, source/codegen/scheduler.lisp, source/codegen/jit.lisp, source/codegen/memory-planner.lisp (minimizes peak memory usage), source/byoc/metal.lisp, external/llm/layers.lisp