Distributed computation via MPI: all_sum, all_gather, send, recv (of an array)
Exported functions can be imported cross-frontend
Only for fixed shapes, specified via example inputs
Modules can be exported with or without params
Apple Metal Quick Glance (2025.01.15)
Can compile shaders from the command line, or at runtime from a string, synchronously or asynchronously: newLibraryWithSource:options:error: / newLibraryWithSource:options:completionHandler:
Multilevel command buffer: blit/compute encoders
Doesn’t support double (?)
Habits (2025.01.19)
Atomic
Obvious — define trigger: definite time+location, or end of earlier
habit
Attractive — community, temptation / ritual
Easy — reduce friction, prime the environment; downscale habit to
two minutes; break down big ones into multiple; establish a <2min
minimal satisfying version
Satisfying — immediate reward on completion; use habit tracker
Modify habit — keep trigger and reward, replace routine
Stepladder — small changes; Community; Important — dwell on
priorities; Easy; Neurohack — behavior drives identity; Captivating
rewards; Engrained — repetition
Andrej Karpathy’s Educational Resources (2025.01.20)
GitHub repositories
nn-zero-to-hero — introductory and experimental parts of the
lectures
llm.c — very efficient C/CUDA, multi-GPU with ZeRO, optimized for
pretraining GPT-2/3
Simple reference implementations: train_gpt2.c in pure C, train_gpt2.cu in CUDA (the entry point to the main implementation), train_gpt2.py in PyTorch
llm.c Deep Dive (2025.01.21 – 2025.01.23)
llm.c (2025.01.21)
Uses cuBLAS and cuBLASLt for matmul forward and backward; a manual kernel for the backward bias term
Manual kernels for: encoder forward and backward, layernorm forward
and backward, softmax forward and (in-place) backward, Cross-Entropy
forward fused with backward, AdamW (see slide 2024.11.20)
Hyperparameters for loading any GPT-2 or GPT-3 model
Random init scheme from GPT-2; the init is computed on the CPU
Uses ZeRO for multi-GPU; uses MPI to distribute (continued in the 2025.01.23 entry)
Two variants: with and without cuDNN, which is used for attention
Uses stochastic rounding when casting from FP32 to BF16 (sketched after this list)
Judiciously uses size_t to avoid overflowing int
Conserves memory by reusing buffers
Manually manages memory (no per-tensor alloc/free)
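A minimal sketch of the stochastic FP32 to BF16 rounding idea (not llm.c's exact kernel; the helper name is hypothetical, and the random value is assumed to come from the caller's RNG):

```cuda
#include <cuda_bf16.h>

// Sketch: round an FP32 value to BF16, rounding up with probability
// proportional to the discarded low bits, so that rounding errors average
// out over many optimizer steps. `rand` is assumed to be a uniformly
// random 32-bit value supplied by the caller.
__device__ __nv_bfloat16 stochastic_round_bf16(float x, unsigned int rand) {
    unsigned int bits = __float_as_uint(x);
    // BF16 keeps the top 16 bits of the FP32 pattern. Adding a random
    // value in [0, 2^16) to the low 16 bits and then truncating makes the
    // carry into the kept bits happen with the right probability.
    // (NaN/Inf/overflow edge cases are ignored in this sketch.)
    bits += (rand & 0xFFFFu);
    return __float2bfloat16_rn(__uint_as_float(bits & 0xFFFF0000u));
}
```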
llm.c Distributed (2025.01.23)
Sharded Data Parallel and FSDP are not supported yet; only sharding of optimizer states, aka ZeRO-1
Dedicated CUDA stream for NCCL ops
llmc/zero.cuh here, NCCL and MPI overview on future
maps
multi_gpu_async_reduce_gradient: reduce-scatter if ZeRO is enabled, otherwise all-reduce (see the sketch after this list)
gpt2_calculate_grad_norm: reuses the activations buffer; without ZeRO, gradients are already averaged across all GPUs, so it sums the norms locally; with ZeRO, the norms need an all-reduce sum
Different network socket APIs cause some code duplication between Windows and Linux
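A hedged sketch of the gradient-reduction choice above, using NCCL on a dedicated stream (function name and parameters are simplified and hypothetical; llm.c's real code also handles shard offsets and overlap with the backward pass):

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// With ZeRO-1 each rank only needs the gradient shard whose optimizer
// state it owns, so a reduce-scatter is enough; without ZeRO every rank
// needs the full averaged gradient, hence an all-reduce.
void reduce_gradients(float* grads, float* grad_shard, size_t total_count,
                      int num_ranks, bool zero_stage_1,
                      ncclComm_t comm, cudaStream_t nccl_stream) {
    if (zero_stage_1) {
        // Each rank receives total_count / num_ranks averaged elements.
        ncclReduceScatter(grads, grad_shard, total_count / num_ranks,
                          ncclFloat, ncclAvg, comm, nccl_stream);
    } else {
        ncclAllReduce(grads, grads, total_count,
                      ncclFloat, ncclAvg, comm, nccl_stream);
    }
    // Issuing on a dedicated NCCL stream lets communication overlap with
    // other work; synchronize before the optimizer step.
}
```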
CUDA Programming (2025.01.24 – 2025.01.25)
CUDA Cooperative Groups (2025.01.24)
API for thread subsets; thread blocks become first-class objects
coalesced_group active = coalesced_threads(): the threads currently executing together in the warp
active.sync() cannot deadlock, unlike __syncthreads() inside divergent code
group.sync(), group.thread_rank() (index of this thread within the group)
tiled_partition(group, size): with size=32 this yields warps (see the sketch after this list)
Get the active mask with __activemask(); sync with a memory fence via __syncwarp(mask=FULL_MASK), where FULL_MASK is 0xffffffff
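An illustrative kernel (not from any particular codebase) tying the pieces above together: warp-sized tiles from tiled_partition, thread_rank(), tile-level shuffles and sync, and coalesced_threads() inside divergent code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cg_demo(const float* in, float* out, int n) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into warp-sized tiles (size 32 -> real warps).
    // The templated form is used here because it exposes shuffle methods;
    // the runtime form cg::tiled_partition(block, 32) also exists.
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    // Tile-level reduction via shuffles; thread_rank() is the index of
    // this thread within the tile.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2) {
        val += warp.shfl_down(val, offset);
    }
    if (warp.thread_rank() == 0) {
        atomicAdd(out, val);  // one partial sum per warp
    }

    // Inside divergent code, coalesced_threads() names exactly the threads
    // that are currently executing, so active.sync() cannot deadlock the
    // way a stray __syncthreads() in a branch can.
    if (i % 3 == 0) {
        cg::coalesced_group active = cg::coalesced_threads();
        active.sync();
    }
    block.sync();  // whole-block barrier, equivalent to __syncthreads()
}
```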
CUDA Warp Functions: Vote, Match, Reduce, Shuffle (2025.01.25)
Each calling thread must be included in the mask, and all participating threads must pass the same mask
Vote: __all_sync, __any_sync: whether the value is non-zero for all/any of the threads
Match: __match_any_sync — returns the mask of threads
with the same value as the calling thread; __match_all_sync
— returns the given mask if all its threads have the same value,
otherwise 0
Vote: __ballot_sync — returns the mask of threads with
non-zero value
Reduce: __reduce_add/min/max_sync reduce int or unsigned values; __reduce_and/or/xor_sync apply bitwise ops to unsigned values (sm_80+)
Shuffle (shfl): exchange a variable between threads of a warp
(faster than shared mem)
Unlike reduce functions, works with all numeric types, including
__half2 and __nv_bfloat162
__shfl_sync specifies source lane explicitly
__shfl_up/down_sync specify delta, source is
lower/higher
__shfl_xor_sync bitwise-XORs the calling thread's lane ID with the given lane mask to pick the source lane (butterfly pattern; see the sketch after this list)
Optional width (a power of two up to 32; default 32) subdivides the operation into independent groups, with group-relative source addressing
Warp Matrix Functions (wmma) leverage Tensor Cores for Matrix Multiply-Add (see the second sketch after this list)
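An illustrative kernel combining vote and shuffle (assumes all 32 lanes are active, i.e. a full mask): __ballot_sync collects which lanes satisfy a predicate, and an XOR-shuffle butterfly leaves the warp-wide sum in every lane:

```cuda
#define FULL_MASK 0xffffffffu

// Butterfly reduction: after the last step every lane holds the full sum.
__device__ float warp_sum(float val) {
    for (int lane_mask = 16; lane_mask > 0; lane_mask >>= 1) {
        val += __shfl_xor_sync(FULL_MASK, val, lane_mask);
    }
    return val;
}

// Output arrays are assumed to hold one slot per warp.
__global__ void warp_demo(const float* in, float* sums, int* positives, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Vote: bitmask of lanes whose value is positive.
    unsigned int ballot = __ballot_sync(FULL_MASK, v > 0.0f);
    float total = warp_sum(v);  // same result in every lane

    if ((threadIdx.x & 31) == 0) {
        int warp_id = i / 32;
        sums[warp_id] = total;
        positives[warp_id] = __popc(ballot);
        // On sm_80+, integer sums could instead use __reduce_add_sync.
    }
}
```

And a minimal warp-matrix (wmma) sketch: one warp multiplies a single 16x16x16 half-precision tile with float accumulation (requires sm_70+; tile sizes, layouts, and leading dimensions here are illustrative choices):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B for a 16x16 tile: A row-major, B col-major.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```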
cuDNN (2025.01.26 – 2025.01.27)
cuDNN: CUDA Deep Neural Network (2025.01.26)
Frontend: C++ and Python APIs; Backend: C API
C++ based on shared pointers
All tensors have 3 to 8 axes (pad with leading dims of 1 if fewer are needed)
The cuDNN graph infers shapes, strides, and precisions for virtual/temporary tensors, via Graph::validate
Graph: fusion etc.
Configurable defaults for: io, intermediate, and compute data
type
ResNet helpers: BatchNorm forward (with optional Add, ReLU, and a (> 0) mask side output) and backward (with optional dReLU and a side gradient output for the forward pass's Add)
Supports multi-GPU batches
Fused Flash Attention forward and backward, usable for GPT- and BERT-like models
Configurable with many scaling, masking, and dropout options
Communication Libraries (2025.01.28 – 2025.01.30)
MPI: Message Passing Interface and Collective Operations
(2025.01.28)
llm.c usage: 1 GPU = 1 process
Can initialize NCCL with: tcp, mpi, fs (fs: file system
synchronization)
MPI_Bcast to broadcast the NCCL unique ID used to initialize each rank
MPI_Allgather to find the GPU's ordinal on its machine
Send, Recv: point-to-point communication
Broadcast: send from one to all
Reduce: send from all, reduce on the fly into a value received by
one
All-Reduce: send from all, receive the same reduced value by
all
Gather: send from all, all values received by one node in a
container
All-Gather: send from all, every node receives all values in a
container
Scatter: send from a container on one node, a different value to
every node
Reduce-Scatter: pointwise reduces a vector and scatters the
results
All-to-all: each node has a container, from which it sends a
different value to every node’s receiving container
Scan: each node receives a partially reduced value depending on its
rank
Vectorized versions of these exist, as well as dedicated versions where the values are arrays (a minimal sketch of the core collectives follows)
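A minimal MPI sketch of the core collectives above (hypothetical data: each rank contributes its own rank as a float; assumes at most 64 ranks). Compile with mpicxx (or nvcc plus MPI includes) and run e.g. with mpirun -np 4:

```cuda
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float mine = (float)rank;

    // Reduce: the sum over all ranks lands only on rank 0.
    float sum_on_root = 0.0f;
    MPI_Reduce(&mine, &sum_on_root, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    // All-Reduce: every rank receives the same reduced value.
    float sum_everywhere = 0.0f;
    MPI_Allreduce(&mine, &sum_everywhere, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // All-Gather: every rank receives every rank's value, in rank order.
    float all_vals[64];
    MPI_Allgather(&mine, 1, MPI_FLOAT, all_vals, 1, MPI_FLOAT, MPI_COMM_WORLD);

    // Reduce-Scatter: pointwise-reduce a size-element vector, then give
    // element i of the result to rank i.
    float vec[64], my_piece = 0.0f;
    for (int i = 0; i < size; i++) vec[i] = (float)(rank + i);
    MPI_Reduce_scatter_block(vec, &my_piece, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Scan: rank r receives the reduction over ranks 0..r.
    float prefix = 0.0f;
    MPI_Scan(&mine, &prefix, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: allreduce=%g prefix=%g piece=%g\n",
           rank, sum_everywhere, prefix, my_piece);
    MPI_Finalize();
    return 0;
}
```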