A Range Is Not Its Shape: Reading Rangeify-Era tinygrad from OCANNL
Writing of this article was interrupted by the ban of Claude Fable.
A Range Is Not Its Shape
The previous essay placed tinygrad on a design-space map and left it as the most striking point on one axis: the framework that had just crossed from carrying movement as stride metadata to expressing it as loop structure, arriving — from the opposite direction — at the representation OCANNL was built on. That essay ended by calling OCANNL “the borrower, pulled by tinygrad toward a representation it does not yet have,” and moved on. This essay does not move on. It goes inside tinygrad, reads the rangeify-era machinery in detail, and asks the prospector’s question: which of tinygrad’s solutions are worth porting to OCANNL, in what form, and in what order.
The title is tinygrad’s own distinction, turned into the series’ refrain. In tinygrad’s framing, a shape is a declarative constraint — this tensor has these extents — while a range is an imperative statement of how the device should loop. They are distinct objects in a fixed correspondence, exactly the kind of pair this series keeps finding: a shape is not its index, a tensor is not its storage, and a range is not its shape. Once the two are distinct, you get to choose which one your compiler manipulates. tinygrad’s rangeify is the decision to manipulate the ranges; the rest of this essay is about what that decision bought them and what OCANNL should buy with it.
Some context on standing. tinygrad has been listed among OCANNL’s
inspirations in the README from early on, and OCANNL’s roadmap defines
its own scheduling ambition explicitly against it — “instead of dynamic
scheduling as in tinygrad, we can schedule statically by program
search,” followed immediately by “we should also reproduce the search
that tinygrad is doing.” There was also a prior personal deep dive, back
in the ShapeTracker era, so the question here is not what tinygrad
is but what changed — and the answer is: the thing that
made the two frameworks hardest to compare. All claims about tinygrad
below are pinned to the master branch at commit
2bfdf85f87da (2026-06-12, version 0.13.0), and to the
tinyspec document (spec/tinyspec.tex) the project now ships
as its core specification; claims about OCANNL are verified against the
current tree. Readers coming from the earlier posts in this series have
everything they need; readers who know tinygrad but not OCANNL may want
the storage essay first.
One dialect, end to end
The fastest way to convey tinygrad’s architecture is its spec’s
subtitle: a single dialect from Tensor programs to Command
Buffers. Everything — and the word is doing real work — is a
UOp: a tuple of an operation, a tuple of source UOps,
an op-dependent argument, and a processing tag. The user-facing tensor
graph is UOps. The movement operations are UOps. Loops are UOps
(Range, closed by End). Memory writes are UOps
(Store, the only op with observable side effects).
Workgroup barriers, GPU thread indices, conditionals, tensor-core
invocations are UOps. The linearized instruction sequence is a UOp
(Linear), and so is the compiled kernel itself
(Program, whose sources include the rendered
Source string and the machine-code Binary).
There is no IR boundary anywhere from the Python expression you typed to
the bytes the driver receives: compilation is one graph being rewritten
in place, through roughly twenty named rewrite stages
(full_rewrite_to_sink in codegen/__init__.py),
until what remains is the program.
Each UOp has five derived properties — dtype, shape, device, address
space, and min_max — computed bottom-up from local rules.
Hold on to two of those. Address space (GLOBAL,
LOCAL, REG) is the memory hierarchy made
explicit on every node, which is what lets a later stage assign and
reuse storage mechanically. And min_max is an interval — every
UOp knows a lower and upper bound on its value — threaded through the
whole graph by arithmetic rules. Both will return as port
candidates.
The op set itself is aggressively, delightfully minimal. The unary
arithmetic primitives are exactly three: Recip,
Trunc, Cast. Everything else is decomposed:
Neg is multiplication by −1, Sub is
add-of-neg, Sqrt is Exp2(0.5 · Log2(x)), and
Exp2/Log2/Sin are themselves
polynomial-approximation compositions. Random number generation is
Threefry, five rounds of add-rotate-xor on counters — the
same counter-based-RNG bet OCANNL made with its threefry4x32 machinery,
arrived at independently, and for the same reason: a counter-based
generator is a pure function, and pure functions are what a graph
compiler can reason about. The spec then demonstrates the philosophy at
tensor level by deriving common operations as compositions: matrix
multiply is reshape-multiply-reduce; arange is a prefix sum
of ones, itself built from a pad/reshape/expand sliding-window trick;
and gather is a one-hot mask contracted against the table:

```python
# gather: out[i] = T[idx[i]]. one-hot mask along gather axis, then reduce
pos = arange(K).reshape(K, 1)
mask = (pos == idx.reshape(1, -1)).cast(T.dtype)
return (T.reshape(K, 1) * mask).sum(0)
```

This is, symbol for symbol, the pattern OCANNL’s one-hot work (#343) starts from: represent indexed access as (k equals index) times table[k] summed over k, and trust the compiler to collapse the loop back into an indexed read. Two frameworks that refuse primitives end up refusing the same primitives, because the algebra underneath is the same. This is the deep kinship, and it is why the porting question is well-posed at all: the systems agree about what computation is; they differ in how the compiler is organized around it.
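The one-hot identity can be checked without any framework. The following is a minimal sketch in plain Python, with lists standing in for tensors; `gather_one_hot` is a hypothetical name used only for illustration, not a tinygrad or OCANNL function.

```python
def gather_one_hot(table, idx):
    """out[i] = table[idx[i]], computed as sum over k of (k == idx[i]) * table[k]."""
    K = len(table)
    out = []
    for i in idx:
        # mask[k] is 1 exactly where k equals the requested index
        mask = [1 if k == i else 0 for k in range(K)]
        # contract the mask against the table: only one term survives
        out.append(sum(m * t for m, t in zip(mask, table)))
    return out

table = [10, 20, 30, 40]
idx = [3, 0, 0, 2]
# The masked sum agrees with a direct indexed read.
assert gather_one_hot(table, idx) == [table[i] for i in idx]
```

The masked sum does K multiply-adds per output element where an indexed read does one, which is why both frameworks rely on the compiler to collapse the loop again.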
It is worth being concrete about that difference now, because it
frames every section below. tinygrad is one dynamically-typed graph
rewritten to fixpoint by a pattern-matching engine, with properties
derived locally and checked at rewrite time. OCANNL is a stack of
statically-typed IRs — tensor expressions, then the assignments language
(Assignments.comp), then the loop-nest IR
(Low_level.t), then backend C/CUDA/MSL syntax — with shapes
inferred globally ahead of lowering by the row-polymorphic system the
earlier posts in this series developed, and each layer’s invariants
carried in its types. One dialect versus typed layers; derived-upward
versus inferred-ahead; fixpoint rewriting versus ordered passes. Neither
is dominant; each makes some ports trivial and others structural. The
inventory below keeps returning to this.
From stride stacks to ranges
What rangeify replaced deserves a paragraph, because the replacement
is the whole reason this comparison became tractable. The old
ShapeTracker was a stack of view records — shape, strides, offset, and a
validity mask per view — composing the six movement operations (reshape,
permute, expand, pad, shrink, flip) without copying: the lazy,
multi-layer analogue of a NumPy view. It was tinygrad’s signature data
structure, and it carried the movement semantics of the entire
framework: any indexing question was answered by pushing an index
expression down through the stack of views, stride arithmetic at each
level. In the current tree it is gone — not deprecated, gone;
there is no file and no reference. Its replacement, in
schedule/rangeify.py, works the other way around. Loop
variables — Range UOps, each with an extent and an axis
type — are created early, on the operation graph itself. Movement
operations are then compiled away into index arithmetic on those ranges:
apply_movement_op (in schedule/indexing.py)
answers “given the output’s ranges, what index expression reads the
right input element?”, so a permute reorders which range feeds which
axis, a shrink adds an offset, a pad wraps the index in a validity
guard, a reshape solves for the input ranges that linearize to the same
address. When an indexing operation lands on a chain of movement ops,
the chain is folded into the index computation directly
(_mop_index), so no intermediate ever exists.
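The idea of movement as index arithmetic can be sketched in a few lines. This is an illustration under simplifying assumptions, not tinygrad's `apply_movement_op`: every helper name below is hypothetical, indices are plain integer tuples, and each op is a function from output indices to a pair of input indices and a validity flag.

```python
def permute(perm):
    # output axis j reads input axis perm[j]
    def index(out_idx):
        in_idx = [0] * len(perm)
        for j, p in enumerate(perm):
            in_idx[p] = out_idx[j]
        return tuple(in_idx), True
    return index

def shrink(starts):
    # a shrink adds an offset on each axis
    return lambda out_idx: (tuple(i + s for i, s in zip(out_idx, starts)), True)

def pad(before, in_shape):
    # a pad shifts the index and wraps the read in a validity guard
    def index(out_idx):
        in_idx = tuple(i - b for i, b in zip(out_idx, before))
        return in_idx, all(0 <= i < n for i, n in zip(in_idx, in_shape))
    return index

def reshape(out_shape, in_shape):
    # solve for the input indices that linearize to the same flat address
    def index(out_idx):
        flat = 0
        for i, n in zip(out_idx, out_shape):
            flat = flat * n + i
        in_idx = []
        for n in reversed(in_shape):
            in_idx.append(flat % n)
            flat //= n
        return tuple(reversed(in_idx)), True
    return index

def compose(*ops):
    # ops are listed in application order; the index is pushed through them
    # in reverse, from the final output back to the original input
    def index(out_idx):
        valid = True
        for op in reversed(ops):
            out_idx, ok = op(out_idx)
            valid = valid and ok
        return out_idx, valid
    return index

x = [[1, 2, 3], [4, 5, 6]]                      # shape (2, 3)
view = compose(permute((1, 0)), pad((1, 0), (3, 2)))  # transpose, then pad one row on top

def read(out_idx):
    (r, c), valid = view(out_idx)
    return x[r][c] if valid else 0

assert read((0, 0)) == 0   # inside the padding
assert read((2, 1)) == 5   # transposed[1][1]
assert read((3, 0)) == 3   # transposed[2][0]
```

No transposed or padded array is ever built: the chain of ops exists only as the function that maps an output position to an input read.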
Fusion then stops being a separate planning problem and becomes a
fact about ranges: operations that end up iterating over the
same ranges collapse into the same loop nest; operations that
don’t, don’t. What is genuinely decided — by a cost heuristic,
remove_bufferize, weighing how many buffers an expression
touches and whether reduction axes intervene — is
bufferization: whether an intermediate value gets materialized
into a buffer (Store, plus an After marking
the lifetime) or stays inline as a subexpression of its consumers. A
Contiguous marker forces materialization; kernel boundaries
fall out at reduction edges (split_reduceop).
Readers of this series have seen this machine before, under different
names, because it is OCANNL’s machine. OCANNL never had a ShapeTracker
to tear out: einsum specifications go through shape inference to
projections — for every operation, an assignment of affine
index expressions (Iterator, Fixed_idx, and
affine combinations of loop symbols, the axis_index type in
arrayjit/lib/indexing.ml) to every axis of every operand —
and projections lower to For_loop nests in
Low_level.t. Loops created early, movement as index
arithmetic, no stride metadata surviving to runtime: that has been the
architecture from the start, which is exactly what the previous essay
meant by the convergence coming from tinygrad’s side. And the
materialization decision has an equally direct counterpart: OCANNL’s
virtualization pass (virtual_llc) decides whether
a tensor node becomes a buffer or is inlined into its consumers as a
recomputed expression — bufferize is “materialized,”
remove_bufferize is “virtual,” and tinygrad’s buffer-count
cost heuristic plays the role of OCANNL’s visit-counting
(virtualize_max_visits) and inlining toggles. Even the
address-space story aligns: tinygrad’s
GLOBAL/LOCAL/REG is OCANNL’s
memory-mode lattice collapsing toward stored-versus-inlined, with
Local_scope scalars as the REG end.
A small Rosetta stone, for orientation — the left column from the tinygrad spec and sources, the right from OCANNL’s tree:
| tinygrad | OCANNL |
|---|---|
| UOp graph (one dialect) | Assignments.comp → Low_level.t → backend syntax (typed layers) |
| Range / End | For_loop { index; from_; to_; body; … } |
| movement ops → index rewrites (apply_movement_op) | einsum projections → affine axis_index per operand axis |
| bufferize vs. inline (remove_bufferize cost) | materialized vs. virtual node (virtual_llc, visit counts) |
| addrspace GLOBAL/LOCAL/REG | memory modes; Local_scope scalars at the REG end |
| Store + After (assign = write, then ordered passthrough) | assignments + compile-time hazard frontier (last_writer, last_readers) |
| graph_rewrite to fixpoint over PatternMatcher rules | ordered passes: virtual_llc → simplify_llc → CSE/hoisting |
| derived properties (dtype, shape, device, min_max) | static row-polymorphic shape inference ahead of lowering |
| min_max interval on every UOp | — (port candidate 3) |
| AxisType on every range | — (port candidate 2) |
| OptOps + BEAM search | — (port candidate 1) |
| device n-tuples, shard axis property | first-class contexts, merge buffers (port candidate 4: the property) |
The dashes in the right column are the article.
What stays home
Before the inventory of what to take, the inventory of what not to — the things tinygrad built that OCANNL gets from its own constitution, listed not to score points but because each one shapes a porting decision below.
Shape inference. tinygrad’s shapes are derived properties: computed bottom-up per UOp, checked when an op is constructed, with broadcasting resolved by right-aligned per-axis rules. It works, and in a trace-everything system it is the natural design. OCANNL’s shapes are inferred, globally and bidirectionally, with row variables and the broadcast order the first four posts built: under-specified shapes flow both ways through a program and a conflict is reported before anything lowers. Nothing in tinygrad needs importing here, and — more to the point — several ports below get easier on OCANNL’s side because shape information is total and static by the time the loop IR exists.
Graph capture. tinygrad’s @function
decorator exists because its programs are Python: to get a reusable
graph fragment out of a Python function you must trace it — run it
lazily, find the tensors, replace them with Param
placeholders, wrap the body in a Function UOp. OCANNL’s
%op/%cd syntax extensions build graphs
natively; there is nothing to trace and no tracing infrastructure to
maintain. The interesting part of tinygrad’s Function/Call
layer is not the capture but what happens after, and that part is a real
port candidate (number 5).
The rewrite engine, as machinery. tinygrad’s
PatternMatcher — rules as a list of (UPat
pattern, callback) pairs, applied by graph_rewrite to
fixpoint — is the engine the entire compiler runs on, and a beautiful
piece of bootstrapping in a language without pattern matching. OCaml has
pattern matching; OCANNL’s simplify_llc is an ordinary
exhaustive, typed, compiled match over
scalar_t. The machinery is not the port. The
discipline around it might be (number 6).
Contexts as values. tinygrad’s device state is
global and stringly addressed ("CUDA:0"); kernels compile
against the ambient device. OCANNL’s contexts are first-class values
with explicit lineage — the previous essay’s whole subject — and this
turns out to matter operationally for the very first port below: an
autotuner wants to compile and time many candidate schedules for the
same context state, and sibling compiles from one explicit frontier
are a cleaner substrate for that than careful mutation of a global.
What remains is what tinygrad has and OCANNL wants. Six areas, ordered roughly from the largest prize to the least urgent.
Port 1: the OptOps schedule layer
This is the big one, and rangeify is what moved it within reach.
In the ShapeTracker era, tinygrad’s kernel optimizations were
entangled with view arithmetic and hard to even state in OCANNL’s
vocabulary. In the rangeify era they are loop-nest transforms,
and the vocabulary is shared. The spec defines an optimization as a
triple — (op, axis, arg) — and the op set is small enough to quote
whole: Split an axis by a factor, the new sub-axis
acquiring a designated axis type; Padto an axis to a
multiple, introducing validity masks; Swap two axes;
Nolocals; and TC, mapping a reduction
onto tensor cores. That is the entire schedule language. The classical
names map one-to-one: Split is tiling when the new axis is a loop,
vectorization when its type is UPCAST, unrolling when
UNROLL, CPU parallelism when THREAD, a
shared-memory group reduction when GROUP_REDUCE; Swap is
loop interchange; Padto is the pad-to-multiple that makes tile sizes
divide. (The implementation spells the splits out as separate enum
members — UPCAST, UNROLL, LOCAL,
THREAD, GROUP, … in
codegen/opt/__init__.py — but they are one operation
differing in the target type of the new axis, and the spec’s unification
is the insight: a schedule transform is a split plus a
retyping.) Schedules compose left-to-right as plain lists of
Opt values; the search space is sequences of triples; and
the search itself (beam_search in
codegen/opt/search.py) is a BEAM over those sequences,
compiling each candidate and timing it on the device, in parallel,
keeping the fastest few per round until improvement stalls.
Halide’s lesson, which tinygrad inherits through TVM’s lineage (the TVM deep dive traces this), is that the algorithm and the schedule are separate artifacts, and the OptOps layer is that lesson reduced to its minimum viable form: not a scheduling DSL, just a list of triples acting on loop axes — small enough to search mechanically.
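To make "a list of triples acting on loop axes" concrete, here is a minimal sketch of a schedule as a value applied by a pure function. It is an illustration, not tinygrad's `apply_opt`: the `Axis` and `Opt` classes, the reduced op set (only Split and Swap), and the axis-naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Axis:
    name: str
    extent: int
    kind: str            # e.g. "LOOP", "UPCAST", "UNROLL", "LOCAL"

@dataclass(frozen=True)
class Opt:
    op: str              # "SPLIT" or "SWAP" in this sketch
    axis: int            # position of the axis in the nest
    arg: tuple           # SPLIT: (factor, new kind); SWAP: (other axis,)

def apply_opt(nest, opt):
    """Return a new loop nest; the input tuple is never modified."""
    nest = list(nest)
    if opt.op == "SPLIT":
        factor, kind = opt.arg
        a = nest[opt.axis]
        if a.extent % factor != 0:
            raise ValueError("factor must divide the extent (Padto's job)")
        outer = Axis(a.name + "_o", a.extent // factor, a.kind)
        inner = Axis(a.name + "_i", factor, kind)   # the retyped sub-axis
        nest[opt.axis:opt.axis + 1] = [outer, inner]
    elif opt.op == "SWAP":
        (other,) = opt.arg
        nest[opt.axis], nest[other] = nest[other], nest[opt.axis]
    else:
        raise ValueError(f"unknown op {opt.op}")
    return tuple(nest)

def apply_schedule(nest, schedule):
    # schedules compose left-to-right
    for opt in schedule:
        nest = apply_opt(nest, opt)
    return nest

nest = (Axis("i", 64, "LOOP"), Axis("j", 64, "LOOP"))
schedule = (Opt("SPLIT", 1, (4, "UPCAST")), Opt("SWAP", 0, (1,)))
out = apply_schedule(nest, schedule)
assert [a.name for a in out] == ["j_o", "i", "j_i"]
assert out[2].kind == "UPCAST"
```

Because both the nest and the schedule are immutable values, the kernel before and after scheduling can be compared, printed, and tested side by side.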
Now the OCANNL side, with the seed inventory first because it is more
advanced than “no schedule layer” suggests. The substrate the OptOps act
on — explicit loop nests — is Low_level.t’s native form.
Unrolling exists as a code-generation mode (unroll_dims
builds fully unrolled nests with fixed indices). The vectorization seed
exists: Set_from_vec with its vec_unop family
is a genuine vector-store primitive in the IR, currently with a single
occupant (the threefry bit-conversion
Uint4x32_to_prec_uniform) but with the right shape for
UPCAST-style codegen to grow into. And the masking
machinery Padto needs — generating guarded strips over the out-of-range
region of a loop — already exists for a different purpose, as
loop_over_padding_region, which walks exactly the
left-margin/valid-middle/right-margin decomposition per dimension that
pad-to-multiple requires. What is genuinely missing is bounded and falls
in three parts:
(a) Axis annotations richer than
For_loop’s current fields — that is port 2, the
prerequisite, below.
(b) The schedule as a value, kept separate from the
IR. Here OCANNL should follow the Halide discipline more
strictly than tinygrad does. tinygrad’s Opt triples
are values (frozen, serializable — that is what BEAM caches),
but their application (apply_opt in
codegen/opt/postrange.py) destructively rewrites the one
graph in place, mid-pipeline. The OCANNL-shaped version is a pure pass:
a schedule is a list of (optop, axis reference, argument)
triples, and applying it is a Low_level.t -> Low_level.t
function run after virtualization — the kernel before scheduling and the
kernel after are both ordinary values of the IR type, comparable,
printable, testable. This is the v0.8 tiling work’s natural form, and
the existing tiling proposals (#412 for GPU, the CPU matmul
plan for the sibling) describe the transforms; what no proposal yet
pins down is this schedule IR itself — the Tiramisu study’s thesis is
precisely that this layer is what OCANNL is missing, and the gap now has
a stub of its own.
(c) The search harness — and this is the easy part, not the hard one. BEAM over schedule prefixes with on-device timing is a few hundred lines in any language once (a) and (b) exist; the superoptimizer survey covers the design space of cost models and search strategies beyond it. The one structural remark worth making: contexts-as-values are a better autotuner substrate than tinygrad’s global device state. Candidate timings are sibling compiles from one frontier — the same context value, compiled against repeatedly, with no mutation to undo between candidates and no risk that timing candidate 7 perturbed the state candidate 8 compiles against. tinygrad’s BEAM manages this with care (per-worker device isolation in a multiprocessing pool); OCANNL’s design makes the isolation a non-event.
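A sketch of what the search harness in (c) amounts to, under loud assumptions: the state is a toy pair of tile sizes, the actions double one tile, and `cost` is a made-up formula standing in for "compile the candidate and time it on the device". None of these names come from tinygrad's `beam_search`.

```python
def beam_search(initial, actions, apply, cost, width=2, max_rounds=16):
    """Keep the `width` cheapest candidates per round; stop when a round does not improve."""
    best = (cost(initial), (), initial)
    beam = [best]
    for _round in range(max_rounds):
        candidates = []
        for _cost, schedule, state in beam:
            for action in actions(state):
                nxt = apply(state, action)        # pure: `state` is left untouched
                candidates.append((cost(nxt), schedule + (action,), nxt))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:width]
        if beam[0][0] >= best[0]:
            break
        best = beam[0]
    return best

# Toy problem: choose tile sizes (ti, tj) by repeated doubling.
def actions(state):
    return [a for a, t in zip(("double_i", "double_j"), state) if t < 64]

def apply(state, action):
    ti, tj = state
    return (ti * 2, tj) if action == "double_i" else (ti, tj * 2)

def cost(state):
    # hypothetical stand-in for a measured kernel time
    ti, tj = state
    return 4096 / (ti * tj) + ti * tj + abs(ti - tj)

best_cost, schedule, state = beam_search((1, 1), actions, apply, cost)
assert state == (8, 8) and len(schedule) == 6
```

Every candidate is derived from an unmodified parent state, which is the property the contexts-as-values argument above is about: there is nothing to undo between candidates.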
The honest design problem — the one the article cannot solve, only
locate — is pass ordering against virtualization.
Inlining changes which loops exist, so the schedule must act after
virtual_llc; but a schedule can change what should
have been inlined — Padto, in particular, grows a computation and wraps
it in guards, which can flip the cost balance that made a node virtual.
tinygrad dodges the question by construction: everything is one graph,
simplification is re-run between every stage because re-running a
fixpoint rewriter is cheap, and a bufferize decision revisited is just
another rewrite. In OCANNL’s layered architecture, “go back and
re-virtualize” is a structural event, not another rewrite round. The
plausible answer is the boring one — pick the order virtualize →
schedule → simplify, accept that a Padto-triggered re-virtualization is
a second iteration of the pipeline rather than a fixpoint subtlety, and
measure how often it matters — but it should be designed deliberately,
not discovered.
Port 2: AxisTypes — the prerequisite
Every tinygrad range carries an AxisType, and the
type system of axes is where the schedule language and the hardware
mapping share a vocabulary: GLOBAL (a GPU grid dimension),
LOCAL (a workgroup dimension, with access to shared
memory), WARP, THREAD (CPU parallelism), plain
LOOP, REDUCE, GROUP_REDUCE (a
reduction staged through shared memory), UPCAST (register
vectorization), UNROLL. An OptOp’s Split names the type of
the axis it creates; the code generator’s later stages
(pm_add_gpudims) turn
GLOBAL/LOCAL axes into
gidx/lidx thread indices rather than loops;
the renderer emits a Barrier where a
GROUP_REDUCE requires one. The mapping from loop structure
to hardware is, in other words, explicit in the IR, decided by
the schedule, and carried by annotation — not a convention buried in a
backend.
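A small sketch of what "carried by annotation" means for a code generator. This is not tinygrad's `pm_add_gpudims` or its renderer; `render` is a hypothetical function that emits pseudo-C text, and the only assumption it encodes is the one stated above: GLOBAL and LOCAL axes become `gidx`/`lidx` thread indices, and every other axis kind stays a loop.

```python
def render(axes, body):
    """axes: outermost-first list of (name, extent, kind).

    Returns (pseudo-C source, grid dims, workgroup dims).
    """
    lines, grid, block, depth = [], [], [], 0
    for name, extent, kind in axes:
        indent = "  " * depth
        if kind == "GLOBAL":
            # a grid dimension: no loop, the launch supplies the index
            lines.append(f"{indent}int {name} = gidx{len(grid)};")
            grid.append(extent)
        elif kind == "LOCAL":
            # a workgroup dimension: likewise an index, not a loop
            lines.append(f"{indent}int {name} = lidx{len(block)};")
            block.append(extent)
        else:
            lines.append(f"{indent}for (int {name} = 0; {name} < {extent}; {name}++) {{")
            depth += 1
    lines.append("  " * depth + body)
    for d in reversed(range(depth)):
        lines.append("  " * d + "}")
    return "\n".join(lines), grid, block

axes = [("i", 128, "GLOBAL"), ("j", 32, "LOCAL"), ("k", 64, "REDUCE")]
src, grid, block = render(axes, "acc += a[i][k] * b[k][j];")
assert grid == [128] and block == [32]
assert src.count("for (") == 1
```

Retyping `i` from `LOOP` to `GLOBAL` changes the launch dimensions and removes a loop without touching the body, which is the sense in which the schedule, not the backend, decides the hardware mapping.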
OCANNL today is at the opposite pole, and it is worth stating plainly
because the gap is the point: For_loop carries an index,
bounds, a body, and a tracing flag — nothing else — and every backend
currently emits single-threaded kernels. The CUDA backend
launches with a 1×1 grid and guards the kernel body with
if (threadIdx.x != 0 || blockIdx.x != 0) return;; Metal
dispatches one threadgroup. This is deliberate sequencing (correctness
and the optimizer first, parallel codegen at v0.8), but it means the
loop-to-hardware question has not yet been answered anywhere,
and the choice now is whether to answer it as a backend convention
(“backends parallelize outer loops they deem profitable”) or as IR
annotation. The tinygrad experience argues strongly for annotation: one
field on the loop — or on the loop symbol — is what makes the schedule
layer (port 1) expressible at all, since Split-and-retype needs a type
to retype to; it is what #412’s grid/block
mapping decision becomes once made explicit; and it is the natural home
for the address-space cousin already prospected at the constant end (#195, CUDA
__constant__ memory).
What the annotation unlocks beyond bare parallelism is the achievable
80% of “thread synchronization”: GROUP_REDUCE axes plus
workgroup Barriers plus LOCAL-address-space
buffers give shared-memory reductions and tiled matmuls — the
structured, well-understood patterns that Flash-attention-style kernels
eventually consume. And it is honest to mark where the 80% ends, because
tinygrad marks it too: the spec’s Barrier is
workgroup-scoped, kernels split at reduction boundaries, and
nothing in the current tree does grid-level synchronization or
persistent thread blocks — cross-kernel sequencing remains the host’s
command buffers. The megakernel ambition the previous essay discussed
(and OCANNL’s #318 write-up
explored) is not solved by tinygrad’s spec either; AxisTypes
are the staircase toward it, not the summit. Porting them buys what
tinygrad has — workgroup-level structure — and leaves the grid-level
question open on both sides.
Port 3: min/max intervals — cheapest, pays everywhere
Every tinygrad UOp carries min_max: an interval bound on
its value, computed as a cached derived property
(vmin/vmax in uop/ops.py) by
local rules — a constant is [v, v], a Range
over n is [0, n−1], addition adds endpoints, multiplication
takes the extremal product, division and modulus by constants get
careful special cases, and everything else defaults to its dtype’s
range. The payoff is diffuse and constant: the symbolic simplifier uses
intervals to discharge guards (a comparison whose operands’ intervals
don’t overlap is a constant; a validity mask provably true is deleted),
prove indices in-bounds (eliminating bounds checks), and fold the
comparisons that index arithmetic generates. It is the quiet enabler of
the louder features: Padto is affordable because most of the
masks it introduces are discharged by interval reasoning, and rangeify’s
index rewriting generates exactly the mod/div/comparison soup that
interval bounds clean up.
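The local rules described above fit in a short sketch. This is an illustration over a toy expression type (nested tuples), not tinygrad's `vmin`/`vmax` code; the tags and the `fold_lt` helper are hypothetical, and only the constant, range, addition, and multiplication rules are covered.

```python
def interval(e):
    """Return (lo, hi) bounds for a toy integer expression."""
    tag = e[0]
    if tag == "const":
        return (e[1], e[1])                  # a constant is [v, v]
    if tag == "range":
        return (0, e[1] - 1)                 # a range over n is [0, n-1]
    if tag == "add":
        (a0, a1), (b0, b1) = interval(e[1]), interval(e[2])
        return (a0 + b0, a1 + b1)            # addition adds endpoints
    if tag == "mul":
        (a0, a1), (b0, b1) = interval(e[1]), interval(e[2])
        prods = [a0 * b0, a0 * b1, a1 * b0, a1 * b1]
        return (min(prods), max(prods))      # extremal product
    raise ValueError(f"no rule for {tag}")

def fold_lt(a, b):
    """True/False when `a < b` is decided by intervals alone, else None."""
    (a0, a1), (b0, b1) = interval(a), interval(b)
    if a1 < b0:
        return True
    if a0 >= b1:
        return False
    return None

# idx = i*4 + j with i in range(16), j in range(4)
idx = ("add", ("mul", ("range", 16), ("const", 4)), ("range", 4))
assert interval(idx) == (0, 63)
assert fold_lt(idx, ("const", 64)) is True    # the guard is provably true: delete it
assert fold_lt(idx, ("const", 60)) is None    # undecided: the guard must stay
```

The first guard is the kind of mask a pad-to-multiple introduces and interval reasoning then removes; the second shows the analysis staying silent when the bounds overlap.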
The OCANNL port is unusually crisp because the receiving site already
exists. scalar_t — the scalar expression IR that
simplify_llc rewrites — has precisely the constructors an
interval lattice wants: Constant is exact,
Embed_index is bounded by its loop’s extent (statically
known, since projections fix every loop range), Get is
bounded by its dtype, and the arithmetic nodes take interval rules. An
interval_of : scalar_t -> float * float (or the integer
analogue over index expressions) slots into simplify_llc’s
world as one more analysis the rewrite arms can consult.
What makes this the best effort-to-payoff item in the inventory is that four existing efforts independently approximate it. The non-linear-inlining work’s injectivity stage (#133) is range arithmetic over loop extents by another name; the local-initialization tracking (#340) wants to know whether a read can precede a write, a question intervals over index expressions answer; the one-hot matcher (#343) needs side-conditions of the form “this index expression stays within these bounds”; and the surjectivity reasoning that landed in #420 is the same arithmetic again. Four passes, four ad-hoc approximations of one analysis — the port is the unifying upgrade, and it sequences naturally before the schedule layer, since Padto’s masks will want it on arrival.
Port 4: the shard axis — port the property, not the device tuples
tinygrad’s multi-device design has a part to admire and a part to decline, and they separate cleanly.
The mechanism: a tinygrad device may be an n-tuple of
devices, a buffer created on a tuple is split across them, and the split
position is tracked as one more derived property — axis,
the sharding dimension — propagated by local per-op rules exactly the
way dtype and shape are. Reshape remaps the axis so the shard boundary
is preserved (it must solve for an output position where the element
counts left of the boundary agree); Permute carries the axis through the
permutation; a Reduce on the shard axis annihilates it (the
result is no longer sharded); shrinking it likewise; Copy strips it.
With that bookkeeping in place, the collectives are not runtime
primitives but derived programs — the spec writes them out in
six lines each: a broadcast is reshape-expand-copy, and

```python
def allreduce(T):
    return allgather(reduce_scatter(T))
```

where reduce_scatter is itself a
reshape-permute-copy-sum composition. The system always knows where data
is split, because knowing it is just property propagation; and the
communication patterns are ordinary graph programs the compiler
schedules like anything else.
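The composition can be exercised with plain lists. In this sketch each "device" is a Python list holding a full-length vector of partial values; `reduce_scatter` and `allgather` are written directly rather than as the spec's reshape-permute-copy compositions, and the equal-shard-size check is an assumption the sketch makes explicit.

```python
def reduce_scatter(per_device):
    """Device d ends up with chunk d of the elementwise sum."""
    n = len(per_device)
    size = len(per_device[0])
    assert size % n == 0, "sketch assumes equal-sized shards"
    chunk = size // n
    return [
        [sum(buf[i] for buf in per_device) for i in range(d * chunk, (d + 1) * chunk)]
        for d in range(n)
    ]

def allgather(shards):
    """Every device ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def allreduce(per_device):
    return allgather(reduce_scatter(per_device))

partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
assert reduce_scatter(partials) == [[11, 22], [33, 44]]
assert allreduce(partials) == [[11, 22, 33, 44], [11, 22, 33, 44]]
```

The assertion inside `reduce_scatter` is the homogeneity assumption in miniature: the composition is tidy because every shard has the same size.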
The part to port is that bookkeeping — the propagation algebra. The part to decline is the substrate it rides on: interchangeable device n-tuples. A tuple of devices is a homogeneity assumption — every shard the same size, every device the same kind, any device substitutable for any other — and it is precisely the assumption OCANNL’s first-class heterogeneous contexts are built to avoid. The previous essay’s motivating pair — a Mac Studio and a MacBook over a network — has shards of different sizes on devices of different backends with different transfer costs, and “which context performs the combination” is a real decision (bandwidth, memory headroom, who is busy) that no annotation can make by itself.
This port area has a decision behind it now, which the analysis here
was input to. OCANNL’s sharding cluster (#293) resolved
its design verdict in June 2026 (the 293b elaboration): sharding
lands as per-shard backend contexts with explicit
primitives — shard_along, gather,
grad_sync — host-orchestrated, with merge-buffer copies as
the one cross-shard channel, composing with slice-as-alias for copy-free
sharding and the training-loop
integration on top. Read against that verdict, the tinygrad algebra
is not a rival proposal but the checking and sugar layer over
it: a per-tnode placement annotation, threaded through shape
inference and projections the way tinygrad threads axis
through UOps, would let the system (first) verify that the
explicit primitives are placed consistently — that a gather
really sits where a shard boundary dies, that a reduce over a sharded
axis is preceded by the grad_sync it needs — and (later)
insert them, lowering annotation boundaries to the same
merge-buffer transfers the explicit code writes today, with terser syntax
as the surface. The honest port is therefore the propagation rules plus
an explicit placement policy hook — the heterogeneity decision
as a first-class parameter, where tinygrad hardcodes interchangeability
— and its natural sequencing is after the explicit primitives ship and
the feasibility-study
end of multi-node work makes the policy hook earn its keep.
Port 5: the Function/Call layer — joint compilation of a lineage
tinygrad’s call layer is lambda calculus on graphs:
Function substitutes Param placeholders with
arguments, Call is an opaque invocation of something
compiled, Tuple/GetTuple package multiple
results, and ordering rides on After — a passthrough of a
buffer that guarantees its dependencies executed, so that “assign”
decomposes into a Store plus an After. Most of
this, OCANNL has natively or does not need. Graph capture, as noted, is
moot. The Call-shaped object exists as the routine;
After’s job — ordering effects on shared buffers — is done
at compile time by the hazard frontier in context.ml, which
tracks per-tnode last_writer and last_readers
and derives the read-after-write, write-after-read, and
write-after-write edges as routines are compiled in a lineage (the execution-dependency-tracking
design extends the same information to run time, and the
merge-buffer static verification of #288 already
leans on it).
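How a frontier of last writers and last readers yields the three kinds of hazard edge can be shown in a few lines. This is a sketch of the idea only, not the data structures in `context.ml`: routines are modeled as hypothetical `(name, reads, writes)` triples, and only the names `last_writer` and `last_readers` are taken from the text above.

```python
def hazard_edges(routines):
    """routines: list of (name, reads, writes) in compile order.

    Returns a sorted list of (earlier, later, kind) ordering edges.
    """
    last_writer, last_readers, edges = {}, {}, set()
    for name, reads, writes in routines:
        for t in reads:
            if t in last_writer:
                edges.add((last_writer[t], name, "RAW"))   # read-after-write
        for t in writes:
            for r in last_readers.get(t, []):
                if r != name:
                    edges.add((r, name, "WAR"))            # write-after-read
            if t in last_writer:
                edges.add((last_writer[t], name, "WAW"))   # write-after-write
        # advance the frontier only after this routine's edges are derived
        for t in reads:
            last_readers.setdefault(t, []).append(name)
        for t in writes:
            last_writer[t] = name
            last_readers[t] = []
    return sorted(edges)

lineage = [
    ("init", [], ["w"]),
    ("fwd", ["w", "x"], ["y"]),
    ("upd", ["y"], ["w"]),
]
assert hazard_edges(lineage) == [
    ("fwd", "upd", "RAW"),
    ("fwd", "upd", "WAR"),
    ("init", "fwd", "RAW"),
    ("init", "upd", "WAW"),
]
```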
The narrower idea worth taking is what tinygrad’s pipeline does
between capture and render, read as a pair. First: a
Function is a unit of joint compilation — everything inside
it is scheduled and optimized as one program, however many kernels fall
out. OCANNL’s analogue would be joint compilation of a lineage:
several comps whose hazard edges are already known, handed
to the compiler as one unit, producing one artifact — one C file, one
ordered kernel sequence — instead of routine-at-a-time compiles whose
boundaries are accidents of how the user grouped code. The hazard
frontier means the dependency information is already in hand; what is
missing is only the entry point that accepts more than one comp. Second,
the spec’s Memory Plan stage: “allocate and reuse
GLOBAL, LOCAL, and REG storage for values with non-overlapping
lifetimes.” (Implementation honesty: in the current tree this is a
per-kernel linear-scan register allocator,
codegen/late/regalloc.py, for the ISA backends — the grand
cross-kernel version is the spec’s ambition more than today’s code.)
OCANNL holds the two ends of this already — Local_scope
hoisting and CSE are the REG end, and the universal pool
allocator with its lifetime-based reuse is the GLOBAL
end (#340
sharpening the local end further). The middle — lifetime analysis
across the routines of a jointly compiled lineage, so that two
never-overlapping intermediates from different comps share a buffer — is
exactly what joint lineage compilation would make well-posed. The two
halves of this port are one feature seen from two sides.
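The "middle" described here, buffer sharing across non-overlapping lifetimes, is a classic greedy scan. The sketch below assumes lifetimes are already known as `(first_use, last_use, size)` triples in some global step numbering of a jointly compiled lineage; `plan_buffers` is a hypothetical name, and this is neither tinygrad's register allocator nor OCANNL's pool allocator.

```python
def plan_buffers(lifetimes):
    """lifetimes: {name: (first_use, last_use, size)}.

    Returns ({name: slot index}, [slot sizes]).
    """
    free, active, slots, assign = [], [], [], {}
    for name, (start, end, size) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # release slots whose value died strictly before this one is born
        still_live = []
        for last_use, slot in active:
            if last_use < start:
                free.append(slot)
            else:
                still_live.append((last_use, slot))
        active = still_live
        # reuse the first free slot that is big enough, else open a new one
        slot = next((s for s in free if slots[s] >= size), None)
        if slot is None:
            slot = len(slots)
            slots.append(size)
        else:
            free.remove(slot)
        assign[name] = slot
        active.append((end, slot))
    return assign, slots

lifetimes = {"a": (0, 2, 64), "b": (1, 3, 64), "c": (3, 5, 32), "d": (4, 6, 64)}
assign, slots = plan_buffers(lifetimes)
assert assign == {"a": 0, "b": 1, "c": 0, "d": 1}
assert sum(slots) == 128     # versus 224 with one buffer per value
```

If `a` and `c` came from different comps, this sharing is only possible when both lifetimes are visible to one planner, which is what joint compilation provides.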
Port 6: the rewrite engine — port the discipline, not the machinery
The case for restraint first. tinygrad’s PatternMatcher
exists because Python lacks pattern matching, and much of its
sophistication (pre-compiled match functions, early-reject sets) is
performance work to make an interpreted matcher viable on hot paths.
OCaml’s match is exhaustive, typed, and compiled;
simplify_llc’s rewrite arms — constant folding, identity
elimination, reassociation, FMA fusion, integer-power unrolling — are
ordinary code that the compiler checks. Porting the machinery would be
importing a workaround for a problem OCANNL’s language already
solves.
What does not come for free in OCaml is what tinygrad gets as a side
effect of rules-being-data: provenance and observability. Every
tinygrad rewrite step is attributable — a named rule, in a named stage,
and the VIZ=1 tooling will show you every rewrite that
fired on the way from tensor expression to kernel. OCANNL’s rewrites are
anonymous match arms, and the cost of anonymity has already been paid
once: the CSE
alpha-equivalence unsoundness was a subtle bad rewrite that lived in
the optimizer precisely because no per-rule accounting existed to make
it visible. The relevant trajectory is that the rule count is about to
grow — the one-hot/gather collapse (#343) and the
non-linear inlining family (#133) are
pattern-rewrites with side-conditions, exactly the population that wants
rules-as-data with provenance, run to fixpoint, with the lowering audit as
the documentation prerequisite. An e-graph over scalar_t —
equality saturation instead of ordered rewriting — is the possible
eventual shape, already surveyed as a v0.9 candidate in the superoptimizer review. Lowest
urgency of the six; the trigger to act is the moment match arms in
simplify_llc start needing comments that say “this must run
before that.”
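What "rules as data with provenance" buys can be shown on a toy. The sketch below is not tinygrad's `PatternMatcher` and not a proposal for `simplify_llc`'s actual shape: expressions are nested tuples, rules are hypothetical `(name, function)` pairs, and the only point is that every rewrite that fires leaves a named record.

```python
def rewrite_fixpoint(expr, rules, log):
    """Rewrite bottom-up until no rule fires; log (rule name, before, after)."""
    def step(e):
        if isinstance(e, tuple):
            e = (e[0],) + tuple(step(c) for c in e[1:])
        for name, rule in rules:
            out = rule(e)
            if out is not None:
                log.append((name, e, out))   # provenance: which rule, on what
                return step(out)
        return e
    return step(expr)

def add_zero(e):
    if isinstance(e, tuple) and e[0] == "add" and e[2] == 0:
        return e[1]

def mul_one(e):
    if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1:
        return e[1]

def fold_const(e):
    if isinstance(e, tuple) and all(isinstance(c, int) for c in e[1:]):
        if e[0] == "add":
            return e[1] + e[2]
        if e[0] == "mul":
            return e[1] * e[2]

RULES = [("add_zero", add_zero), ("mul_one", mul_one), ("fold_const", fold_const)]

log = []
expr = ("add", ("mul", "x", 1), ("add", 2, -2))
assert rewrite_fixpoint(expr, RULES, log) == "x"
assert [name for name, _, _ in log] == ["mul_one", "fold_const", "add_zero"]
```

The log is the whole feature: a bad rewrite shows up as a named entry with its input and output, rather than as a wrong kernel with no trail.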
The bets underneath
Step back and the six ports sort themselves by what kind of thing they are. Three are mechanisms tinygrad has and OCANNL lacks outright (the schedule layer, the axis types, the intervals), and they form a single dependency chain: intervals make masks cheap, axis types make schedules expressible, schedules make the search worth running. Their sequencing is forced, and it is the v0.8-and-beyond performance arc under another description. One is a property to re-derive on a different substrate (the shard axis, as sugar over the decided explicit primitives). One is a boundary redraw (joint lineage compilation, making the unit of optimization bigger than the unit of authorship). And one is a discipline (rules with provenance) whose machinery should stay home.
What the deep dive changes, relative to its ShapeTracker-era predecessor, is where the gap between the frameworks lives. It used to live in the representation: stride stacks versus loop nests was a translation problem, and every comparison stumbled on it. Rangeify dissolved that — the two systems now agree that movement is index arithmetic on early-created loops, that materialization is a cost decision, that views are the special case and not the foundation. The gap that remains is the schedule: tinygrad has a complete, minimal, searchable language for mapping one loop nest onto hardware many ways, and OCANNL — whose single-threaded kernels are a sequencing choice, but a fact — does not yet. That is a much better gap to have. It is bounded, it is well-specified by a working system whose spec fits on a page, and OCANNL’s side of the foundation (typed loop IR, static shapes, contexts as values) is laid.
And the roadmap sentence this all orbits — “instead of dynamic scheduling as in tinygrad, we can schedule statically by program search” — reads differently after the dive than before it. The opposition it names is real but narrower than it sounds: tinygrad’s BEAM is program search, run per-kernel at compile time against the live device; the “static” ambition is the same search moved offline, its results cached and shipped, its cost models trained rather than measured. The two frameworks disagree about when the search runs and what state it mutates — tinygrad searches against a global device at trace time, OCANNL would search over schedule values against context values, whenever convenient — and that disagreement is the same one this whole comparison kept finding: one dialect rewritten in place under derived properties, against typed layers and values inferred ahead. tinygrad spends late-bound flexibility to avoid early commitments; OCANNL spends early commitments to know things sooner. Both systems refuse primitives; they disagree about when to bind. The refusal is why the ports are possible. The binding is why they are worth doing.
tinygrad is at github.com/tinygrad/tinygrad;
the spec quoted throughout is spec/tinyspec.tex at commit
2bfdf85f87da. OCANNL is open source at github.com/ahrefs/ocannl.