A Range Is Not Its Shape: Reading Rangeify-Era tinygrad from OCANNL

Łukasz Stafiniak and Claude (Anthropic)

Writing of this article was interrupted by the ban of Claude Fable.

The previous essay placed tinygrad on a design-space map and left it as the most striking point on one axis: the framework that had just crossed from carrying movement as stride metadata to expressing it as loop structure, arriving — from the opposite direction — at the representation OCANNL was built on. That essay ended by calling OCANNL “the borrower, pulled by tinygrad toward a representation it does not yet have,” and moved on. This essay does not move on. It goes inside tinygrad, reads the rangeify-era machinery in detail, and asks the prospector’s question: which of tinygrad’s solutions are worth porting to OCANNL, in what form, and in what order.

The title is tinygrad’s own distinction, turned into the series’ refrain. In tinygrad’s framing, a shape is a declarative constraint — this tensor has these extents — while a range is an imperative statement of how the device should loop. They are distinct objects in a fixed correspondence, exactly the kind of pair this series keeps finding: a shape is not its index, a tensor is not its storage, and a range is not its shape. Once the two are distinct, you get to choose which one your compiler manipulates. tinygrad’s rangeify is the decision to manipulate the ranges; the rest of this essay is about what that decision bought them and what OCANNL should buy with it.

Some context on standing. tinygrad has been listed among OCANNL’s inspirations in the README from early on, and OCANNL’s roadmap defines its own scheduling ambition explicitly against it — “instead of dynamic scheduling as in tinygrad, we can schedule statically by program search,” followed immediately by “we should also reproduce the search that tinygrad is doing.” There was also a prior personal deep dive, back in the ShapeTracker era, so the question here is not what tinygrad is but what changed — and the answer is: the thing that made the two frameworks hardest to compare. All claims about tinygrad below are pinned to the master branch at commit 2bfdf85f87da (2026-06-12, version 0.13.0), and to the tinyspec document (spec/tinyspec.tex) the project now ships as its core specification; claims about OCANNL are verified against the current tree. Readers coming from the earlier posts in this series have everything they need; readers who know tinygrad but not OCANNL may want the storage essay first.

One dialect, end to end

The fastest way to convey tinygrad’s architecture is its spec’s subtitle: a single dialect from Tensor programs to Command Buffers. Everything — and the word is doing real work — is a UOp: a tuple of an operation, a tuple of source UOps, an op-dependent argument, and a processing tag. The user-facing tensor graph is UOps. The movement operations are UOps. Loops are UOps (Range, closed by End). Memory writes are UOps (Store, the only op with observable side effects). Workgroup barriers, GPU thread indices, conditionals, tensor-core invocations are UOps. The linearized instruction sequence is a UOp (Linear), and so is the compiled kernel itself (Program, whose sources include the rendered Source string and the machine-code Binary). There is no IR boundary anywhere from the Python expression you typed to the bytes the driver receives: compilation is one graph being rewritten in place, through roughly twenty named rewrite stages (full_rewrite_to_sink in codegen/__init__.py), until what remains is the program.

Each UOp has five derived properties — dtype, shape, device, address space, and min_max — computed bottom-up from local rules. Hold on to two of those. Address space (GLOBAL, LOCAL, REG) is the memory hierarchy made explicit on every node, which is what lets a later stage assign and reuse storage mechanically. And min_max is an interval — every UOp knows a lower and upper bound on its value — threaded through the whole graph by arithmetic rules. Both will return as port candidates.

The op set itself is aggressively, delightfully minimal. The unary arithmetic primitives are exactly three: Recip, Trunc, Cast. Everything else is decomposed: Neg is multiplication by −1, Sub is add-of-neg, Sqrt is Exp2(0.5 · Log2(x)), and Exp2/Log2/Sin are themselves polynomial-approximation compositions. Random number generation is Threefry, five rounds of add-rotate-xor on counters — the same counter-based-RNG bet OCANNL made with its threefry4x32 machinery, arrived at independently, and for the same reason: a counter-based generator is a pure function, and pure functions are what a graph compiler can reason about. The spec then demonstrates the philosophy at tensor level by deriving common operations as compositions: matrix multiply is reshape-multiply-reduce; arange is a prefix sum of ones, itself built from a pad/reshape/expand sliding-window trick; and gather is a one-hot mask contracted against the table —

# gather: out[i] = T[idx[i]]. one-hot mask along gather axis, then reduce
def gather(T, idx, K):  # K: number of entries in the table T
  pos  = arange(K).reshape(K, 1)
  mask = (pos == idx.reshape(1, -1)).cast(T.dtype)
  return (T.reshape(K, 1) * mask).sum(0)

— which is, symbol for symbol, the pattern OCANNL’s one-hot work (#343) starts from: represent indexed access as (k equals index) times table[k] summed over k, and trust the compiler to collapse the loop back into an indexed read. Two frameworks that refuse primitives end up refusing the same primitives, because the algebra underneath is the same. This is the deep kinship, and it is why the porting question is well-posed at all: the systems agree about what computation is; they differ in how the compiler is organized around it.
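To make the collapse concrete, here is a minimal sketch in plain Python, with lists standing in for tensors. The function name gather_one_hot is hypothetical and appears in neither codebase; the only point is that the masked sum and the indexed read agree element for element.

```python
def gather_one_hot(table, idx):
    """Compute out[i] = table[idx[i]] as a masked sum over k."""
    K = len(table)
    out = []
    for i in idx:
        # (k == i) is the one-hot mask; multiplying by table[k] and summing
        # over k leaves exactly the selected element.
        out.append(sum((1 if k == i else 0) * table[k] for k in range(K)))
    return out


table = [10.0, 20.0, 30.0, 40.0]
idx = [3, 0, 0, 2]

# The loop over k does K multiply-adds per output; a compiler that recognizes
# the pattern can replace it with the single read on the right-hand side.
assert gather_one_hot(table, idx) == [table[i] for i in idx]
```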

It is worth being concrete about that difference now, because it frames every section below. tinygrad is one dynamically-typed graph rewritten to fixpoint by a pattern-matching engine, with properties derived locally and checked at rewrite time. OCANNL is a stack of statically-typed IRs — tensor expressions, then the assignments language (Assignments.comp), then the loop-nest IR (Low_level.t), then backend C/CUDA/MSL syntax — with shapes inferred globally ahead of lowering by the row-polymorphic system the earlier posts in this series developed, and each layer’s invariants carried in its types. One dialect versus typed layers; derived-upward versus inferred-ahead; fixpoint rewriting versus ordered passes. Neither is dominant; each makes some ports trivial and others structural. The inventory below keeps returning to this.

From stride stacks to ranges

What rangeify replaced deserves a paragraph, because the replacement is the whole reason this comparison became tractable. The old ShapeTracker was a stack of view records — shape, strides, offset, and a validity mask per view — composing the six movement operations (reshape, permute, expand, pad, shrink, flip) without copying: the lazy, multi-layer analogue of a NumPy view. It was tinygrad’s signature data structure, and it carried the movement semantics of the entire framework: any indexing question was answered by pushing an index expression down through the stack of views, stride arithmetic at each level. In the current tree it is gone — not deprecated, gone; there is no file and no reference. Its replacement, in schedule/rangeify.py, works the other way around. Loop variables — Range UOps, each with an extent and an axis type — are created early, on the operation graph itself. Movement operations are then compiled away into index arithmetic on those ranges: apply_movement_op (in schedule/indexing.py) answers “given the output’s ranges, what index expression reads the right input element?”, so a permute reorders which range feeds which axis, a shrink adds an offset, a pad wraps the index in a validity guard, a reshape solves for the input ranges that linearize to the same address. When an indexing operation lands on a chain of movement ops, the chain is folded into the index computation directly (_mop_index), so no intermediate ever exists.
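The idea of movement as index arithmetic can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not tinygrad's apply_movement_op: every function name here is hypothetical, indices are concrete integers rather than symbolic ranges, and each movement op is modeled as a function from an output index to an input index plus a validity flag.

```python
# Each movement op maps an output index tuple to (input index tuple, valid).

def permute(order):
    # Output axis j reads input axis order[j].
    def f(out_idx):
        in_idx = [0] * len(order)
        for j, axis in enumerate(order):
            in_idx[axis] = out_idx[j]
        return tuple(in_idx), True
    return f

def shrink(starts):
    # A shrink adds an offset on each axis.
    return lambda out_idx: (tuple(i + s for i, s in zip(out_idx, starts)), True)

def pad(before, in_shape):
    # A pad subtracts the left margin and guards the read with a validity test.
    def f(out_idx):
        in_idx = tuple(i - b for i, b in zip(out_idx, before))
        valid = all(0 <= i < n for i, n in zip(in_idx, in_shape))
        return in_idx, valid
    return f

def reshape(out_shape, in_shape):
    # Linearize under the output shape, then delinearize under the input shape.
    def f(out_idx):
        flat = 0
        for i, n in zip(out_idx, out_shape):
            flat = flat * n + i
        in_idx = []
        for n in reversed(in_shape):
            in_idx.append(flat % n)
            flat //= n
        return tuple(reversed(in_idx)), True
    return f

def compose(*ops):
    # Ops are listed from the consumer inward; validity flags are and-ed.
    def f(out_idx):
        idx, valid = out_idx, True
        for op in ops:
            idx, v = op(idx)
            valid = valid and v
        return idx, valid
    return f

# x has shape (2, 3); y = x.permute(1, 0) has shape (3, 2);
# z = y padded by one row at the front has shape (4, 2).
read_z = compose(pad((1, 0), (3, 2)), permute((1, 0)))
assert read_z((3, 1)) == ((1, 2), True)   # z[3][1] reads x[1][2]
assert read_z((0, 0))[1] is False         # z[0][0] lies in the padding
```

The chain is folded into one index function, so neither y nor z is ever stored: the consumer of z reads x directly, behind a guard.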

Fusion then stops being a separate planning problem and becomes a fact about ranges: operations that end up iterating over the same ranges collapse into the same loop nest; operations that don’t, don’t. What is genuinely decided — by a cost heuristic, remove_bufferize, weighing how many buffers an expression touches and whether reduction axes intervene — is bufferization: whether an intermediate value gets materialized into a buffer (Store, plus an After marking the lifetime) or stays inline as a subexpression of its consumers. A Contiguous marker forces materialization; kernel boundaries fall out at reduction edges (split_reduceop).

Readers of this series have seen this machine before, under different names, because it is OCANNL’s machine. OCANNL never had a ShapeTracker to tear out: einsum specifications go through shape inference to projections — for every operation, an assignment of affine index expressions (Iterator, Fixed_idx, and affine combinations of loop symbols, the axis_index type in arrayjit/lib/indexing.ml) to every axis of every operand — and projections lower to For_loop nests in Low_level.t. Loops created early, movement as index arithmetic, no stride metadata surviving to runtime: that has been the architecture from the start, which is exactly what the previous essay meant by the convergence coming from tinygrad’s side. And the materialization decision has an equally direct counterpart: OCANNL’s virtualization pass (virtual_llc) decides whether a tensor node becomes a buffer or is inlined into its consumers as a recomputed expression — bufferize is “materialized,” remove_bufferize is “virtual,” and tinygrad’s buffer-count cost heuristic plays the role of OCANNL’s visit-counting (virtualize_max_visits) and inlining toggles. Even the address-space story aligns: tinygrad’s GLOBAL/LOCAL/REG is OCANNL’s memory-mode lattice collapsing toward stored-versus-inlined, with Local_scope scalars as the REG end.

A small Rosetta stone, for orientation — the left column from the tinygrad spec and sources, the right from OCANNL’s tree:

tinygrad	OCANNL
UOp graph (one dialect)	`Assignments.comp` → `Low_level.t` → backend syntax (typed layers)
`Range` / `End`	`For_loop { index; from_; to_; body; … }`
movement ops → index rewrites (`apply_movement_op`)	einsum projections → affine `axis_index` per operand axis
bufferize vs. inline (`remove_bufferize` cost)	materialized vs. virtual node (`virtual_llc`, visit counts)
addrspace `GLOBAL`/`LOCAL`/`REG`	memory modes; `Local_scope` scalars at the `REG` end
`Store` + `After` (assign = write, then ordered passthrough)	assignments + compile-time hazard frontier (`last_writer`, `last_readers`)
`graph_rewrite` to fixpoint over `PatternMatcher` rules	ordered passes: `virtual_llc` → `simplify_llc` → CSE/hoisting
derived properties (dtype, shape, device, min_max)	static row-polymorphic shape inference ahead of lowering
`min_max` interval on every UOp	— (port candidate 3)
`AxisType` on every range	— (port candidate 2)
OptOps + BEAM search	— (port candidate 1)
device n-tuples, shard `axis` property	first-class contexts, merge buffers (port candidate 4: the property)

The dashes in the right column are the article.

What stays home

Before the inventory of what to take, the inventory of what not to — the things tinygrad built that OCANNL gets from its own constitution, listed not to score points but because each one shapes a porting decision below.

Shape inference. tinygrad’s shapes are derived properties: computed bottom-up per UOp, checked when an op is constructed, with broadcasting resolved by right-aligned per-axis rules. It works, and in a trace-everything system it is the natural design. OCANNL’s shapes are inferred, globally and bidirectionally, with row variables and the broadcast order the first four posts built: under-specified shapes flow both ways through a program and a conflict is reported before anything lowers. Nothing in tinygrad needs importing here, and — more to the point — several ports below get easier on OCANNL’s side because shape information is total and static by the time the loop IR exists.

Graph capture. tinygrad’s @function decorator exists because its programs are Python: to get a reusable graph fragment out of a Python function you must trace it — run it lazily, find the tensors, replace them with Param placeholders, wrap the body in a Function UOp. OCANNL’s %op/%cd syntax extensions build graphs natively; there is nothing to trace and no tracing infrastructure to maintain. The interesting part of tinygrad’s Function/Call layer is not the capture but what happens after, and that part is a real port candidate (number 5).

The rewrite engine, as machinery. tinygrad’s PatternMatcher — rules as a list of (UPat pattern, callback) pairs, applied by graph_rewrite to fixpoint — is the engine the entire compiler runs on, and a beautiful piece of bootstrapping in a language whose built-in pattern matching is neither exhaustive nor statically typed. OCaml’s is both; OCANNL’s simplify_llc is an ordinary exhaustive, typed, compiled match over scalar_t. The machinery is not the port. The discipline around it might be (number 6).

Contexts as values. tinygrad’s device state is global and stringly addressed ("CUDA:0"); kernels compile against the ambient device. OCANNL’s contexts are first-class values with explicit lineage — the previous essay’s whole subject — and this turns out to matter operationally for the very first port below: an autotuner wants to compile and time many candidate schedules for the same context state, and sibling compiles from one explicit frontier are a cleaner substrate for that than careful mutation of a global.

What remains is what tinygrad has and OCANNL wants. Six areas, in descending value-per-difficulty.

Port 1: the OptOps schedule layer

This is the big one, and rangeify is what moved it within reach.

In the ShapeTracker era, tinygrad’s kernel optimizations were entangled with view arithmetic and hard to even state in OCANNL’s vocabulary. In the rangeify era they are loop-nest transforms, and the vocabulary is shared. The spec defines an optimization as a triple — (op, axis, arg) — and the op set is small enough to quote whole: Split an axis by a factor, the new sub-axis acquiring a designated axis type; Padto an axis to a multiple, introducing validity masks; Swap two axes; Nolocals; and TC, mapping a reduction onto tensor cores. That is the entire schedule language. The classical names map one-to-one: Split is tiling when the new axis is a loop, vectorization when its type is UPCAST, unrolling when UNROLL, CPU parallelism when THREAD, a shared-memory group reduction when GROUP_REDUCE; Swap is loop interchange; Padto is the pad-to-multiple that makes tile sizes divide. (The implementation spells the splits out as separate enum members — UPCAST, UNROLL, LOCAL, THREAD, GROUP, … in codegen/opt/__init__.py — but they are one operation differing in the target type of the new axis, and the spec’s unification is the insight: a schedule transform is a split plus a retyping.) Schedules compose left-to-right as plain lists of Opt values; the search space is sequences of triples; and the search itself (beam_search in codegen/opt/search.py) is a BEAM over those sequences, compiling each candidate and timing it on the device, in parallel, keeping the fastest few per round until improvement stalls.

Halide’s lesson, which tinygrad inherits through TVM’s lineage (the TVM deep dive traces this), is that the algorithm and the schedule are separate artifacts, and the OptOps layer is that lesson reduced to its minimum viable form: not a scheduling DSL, just a list of triples acting on loop axes — small enough to search mechanically.
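A list of triples acting on loop axes is small enough to sketch directly. The following is a toy model, not tinygrad's apply_opt: the Axis and Opt classes, the "o"/"i" naming of split axes, and the choice to place the new inner axis right after the outer one are all assumptions made for illustration. Only Split and Swap are modeled.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Axis:
    name: str
    extent: int
    kind: str   # "LOOP", "REDUCE", "UPCAST", "UNROLL", ...

@dataclass(frozen=True)
class Opt:
    op: str      # "SPLIT" or "SWAP" in this sketch
    axis: int    # position in the loop nest
    arg: object  # SPLIT: (factor, kind of the new axis); SWAP: other position

def apply_opt(nest, opt):
    """Pure function: returns a new nest and leaves the old one untouched."""
    axes = list(nest)
    if opt.op == "SPLIT":
        factor, kind = opt.arg
        a = axes[opt.axis]
        if a.extent % factor != 0:
            raise ValueError(f"{factor} does not divide extent {a.extent}")
        outer = replace(a, name=a.name + "o", extent=a.extent // factor)
        inner = Axis(a.name + "i", factor, kind)   # split plus retyping
        axes[opt.axis:opt.axis + 1] = [outer, inner]
    elif opt.op == "SWAP":
        j = opt.arg
        axes[opt.axis], axes[j] = axes[j], axes[opt.axis]
    else:
        raise ValueError(f"unknown op {opt.op}")
    return tuple(axes)

def apply_schedule(nest, schedule):
    # Schedules compose left to right.
    for opt in schedule:
        nest = apply_opt(nest, opt)
    return nest

nest = (Axis("i", 64, "LOOP"), Axis("k", 128, "REDUCE"))
schedule = [
    Opt("SPLIT", 0, (4, "UPCAST")),   # vectorize i by 4
    Opt("SPLIT", 2, (8, "UNROLL")),   # unroll k by 8
    Opt("SWAP", 1, 2),                # interchange to move ii inward
]
tiled = apply_schedule(nest, schedule)
assert [a.name for a in tiled] == ["io", "ko", "ii", "ki"]
```

Because the schedule is a plain list of frozen values, it can be hashed, cached, printed, and compared, which is what a search over schedules needs.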

Now the OCANNL side, with the seed inventory first because it is more advanced than “no schedule layer” suggests. The substrate the OptOps act on — explicit loop nests — is Low_level.t’s native form. Unrolling exists as a code-generation mode (unroll_dims builds fully unrolled nests with fixed indices). The vectorization seed exists: Set_from_vec with its vec_unop family is a genuine vector-store primitive in the IR, currently with a single occupant (the threefry bit-conversion Uint4x32_to_prec_uniform) but with the right shape for UPCAST-style codegen to grow into. And the masking machinery Padto needs — generating guarded strips over the out-of-range region of a loop — already exists for a different purpose, as loop_over_padding_region, which walks exactly the left-margin/valid-middle/right-margin decomposition per dimension that pad-to-multiple requires. What is genuinely missing is bounded and falls in three parts:

(a) Axis annotations richer than For_loop’s current fields — that is port 2, the prerequisite, below.

(b) The schedule as a value, kept separate from the IR. Here OCANNL should follow the Halide discipline more strictly than tinygrad does. tinygrad’s Opt triples are values (frozen, serializable — that is what BEAM caches), but their application (apply_opt in codegen/opt/postrange.py) destructively rewrites the one graph in place, mid-pipeline. The OCANNL-shaped version is a pure pass: a schedule is a list of (optop, axis reference, argument) triples, and applying it is a Low_level.t -> Low_level.t function run after virtualization — the kernel before scheduling and the kernel after are both ordinary values of the IR type, comparable, printable, testable. This is the v0.8 tiling work’s natural form, and the existing tiling proposals (#412 for GPU, the CPU matmul plan for the sibling) describe the transforms; what no proposal yet pins down is this schedule IR itself — the Tiramisu study’s thesis is precisely that this layer is what OCANNL is missing, and the gap now has a stub of its own.

(c) The search harness — and this is the easy part, not the hard one. BEAM over schedule prefixes with on-device timing is a few hundred lines in any language once (a) and (b) exist; the superoptimizer survey covers the design space of cost models and search strategies beyond it. The one structural remark worth making: contexts-as-values are a better autotuner substrate than tinygrad’s global device state. Candidate timings are sibling compiles from one frontier — the same context value, compiled against repeatedly, with no mutation to undo between candidates and no risk that timing candidate 7 perturbed the state candidate 8 compiles against. tinygrad’s BEAM manages this with care (per-worker device isolation in a multiprocessing pool); OCANNL’s design makes the isolation a non-event.
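The shape of such a harness can be shown with a generic beam search over action sequences. This sketch is not tinygrad's beam_search: the function signature is hypothetical, and the toy cost function stands in for compiling a candidate and timing it on the device.

```python
def beam_search(initial, actions, apply, cost, width=2, max_rounds=8):
    """Return (schedule, state, cost) for the best schedule found."""
    beam = [([], initial, cost(initial))]
    for _ in range(max_rounds):
        candidates = [
            (sched + [act], new_state, cost(new_state))
            for sched, state, _ in beam
            for act in actions(state)
            for new_state in [apply(state, act)]
        ]
        candidates.sort(key=lambda c: c[2])
        # Stop when nothing can be extended or the best candidate is no faster.
        if not candidates or candidates[0][2] >= beam[0][2]:
            break
        beam = candidates[:width]
    return beam[0]

# Toy problem: the state is a tile size, an action multiplies it by 2 or 4,
# and the made-up cost is lowest at a tile size of 16.
def toy_actions(tile):
    return [m for m in (2, 4) if tile * m <= 64]

def toy_apply(tile, m):
    return tile * m

def toy_cost(tile):
    return abs(tile - 16)   # stand-in for a measured kernel time

best_schedule, best_tile, best_cost = beam_search(1, toy_actions, toy_apply, toy_cost)
assert (best_schedule, best_tile, best_cost) == ([4, 4], 16, 0)
```

Every candidate is produced by a pure apply from a state already in the beam, so no candidate can perturb what another one is built from. That is the property the contexts-as-values remark is about.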

The honest design problem — the one the article cannot solve, only locate — is pass ordering against virtualization. Inlining changes which loops exist, so the schedule must act after virtual_llc; but a schedule can change what should have been inlined — Padto, in particular, grows a computation and wraps it in guards, which can flip the cost balance that made a node virtual. tinygrad dodges the question by construction: everything is one graph, simplification is re-run between every stage because re-running a fixpoint rewriter is cheap, and a bufferize decision revisited is just another rewrite. In OCANNL’s layered architecture, “go back and re-virtualize” is a structural event, not another rewrite round. The plausible answer is the boring one — pick the order virtualize → schedule → simplify, accept that a Padto-triggered re-virtualization is a second iteration of the pipeline rather than a fixpoint subtlety, and measure how often it matters — but it should be designed deliberately, not discovered.

Port 2: AxisTypes — the prerequisite

Every tinygrad range carries an AxisType, and the type system of axes is where the schedule language and the hardware mapping share a vocabulary: GLOBAL (a GPU grid dimension), LOCAL (a workgroup dimension, with access to shared memory), WARP, THREAD (CPU parallelism), plain LOOP, REDUCE, GROUP_REDUCE (a reduction staged through shared memory), UPCAST (register vectorization), UNROLL. An OptOp’s Split names the type of the axis it creates; the code generator’s later stages (pm_add_gpudims) turn GLOBAL/LOCAL axes into gidx/lidx thread indices rather than loops; the renderer emits a Barrier where a GROUP_REDUCE requires one. The mapping from loop structure to hardware is, in other words, explicit in the IR, decided by the schedule, and carried by annotation — not a convention buried in a backend.
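What "carried by annotation" means for code generation can be sketched with a toy renderer. This is an illustration only, not pm_add_gpudims: the render function is hypothetical, the emitted text is C-like pseudocode, and every axis type other than GLOBAL and LOCAL is treated as a plain loop, which glosses over how UPCAST and UNROLL axes are really expanded.

```python
def render(nest, body):
    """nest: list of (name, extent, axis_type). Returns (source, grid, block)."""
    lines, depth = [], 0
    grid, block = [], []
    for name, extent, kind in nest:
        indent = "  " * depth
        if kind == "GLOBAL":
            # A grid dimension: the hardware enumerates it, so no loop is emitted.
            lines.append(f"{indent}int {name} = gidx{len(grid)};  /* 0..{extent - 1} */")
            grid.append(extent)
        elif kind == "LOCAL":
            # A workgroup dimension: likewise a thread index, not a loop.
            lines.append(f"{indent}int {name} = lidx{len(block)};  /* 0..{extent - 1} */")
            block.append(extent)
        else:
            lines.append(f"{indent}for (int {name} = 0; {name} < {extent}; {name}++) {{")
            depth += 1
    lines.append("  " * depth + body)
    for d in reversed(range(depth)):
        lines.append("  " * d + "}")
    return "\n".join(lines), tuple(grid), tuple(block)

nest = [("i", 16, "GLOBAL"), ("j", 8, "LOCAL"), ("k", 128, "REDUCE")]
source, grid, block = render(nest, "acc += a[(i * 8 + j) * 128 + k];")
assert (grid, block) == ((16,), (8,))
assert source.count("for (") == 1   # only the reduction stays a loop
```

The launch dimensions fall out of the same annotations that removed the loops, so the schedule decides both at once.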

OCANNL today is at the opposite pole, and it is worth stating plainly because the gap is the point: For_loop carries an index, bounds, a body, and a tracing flag — nothing else — and every backend currently emits single-threaded kernels. The CUDA backend launches with a 1×1 grid and begins the kernel body with the guard “if (threadIdx.x != 0 || blockIdx.x != 0) return;” and Metal dispatches one threadgroup. This is deliberate sequencing (correctness and the optimizer first, parallel codegen at v0.8), but it means the loop-to-hardware question has not yet been answered anywhere, and the choice now is whether to answer it as a backend convention (“backends parallelize outer loops they deem profitable”) or as IR annotation. The tinygrad experience argues strongly for annotation: one field on the loop — or on the loop symbol — is what makes the schedule layer (port 1) expressible at all, since Split-and-retype needs a type to retype to; it is what #412’s grid/block mapping decision becomes once made explicit; and it is the natural home for the address-space cousin already prospected at the constant end (#195, CUDA __constant__ memory).

What the annotation unlocks beyond bare parallelism is the achievable 80% of “thread synchronization”: GROUP_REDUCE axes plus workgroup Barriers plus LOCAL-address-space buffers give shared-memory reductions and tiled matmuls — the structured, well-understood patterns that Flash-attention-style kernels eventually consume. And it is honest to mark where the 80% ends, because tinygrad marks it too: the spec’s Barrier is workgroup-scoped, kernels split at reduction boundaries, and nothing in the current tree does grid-level synchronization or persistent thread blocks — cross-kernel sequencing remains the host’s command buffers. The megakernel ambition the previous essay discussed (and OCANNL’s #318 write-up explored) is not solved by tinygrad’s spec either; AxisTypes are the staircase toward it, not the summit. Porting them buys what tinygrad has — workgroup-level structure — and leaves the grid-level question open on both sides.

Port 3: min/max intervals — cheapest, pays everywhere

Every tinygrad UOp carries min_max: an interval bound on its value, computed as a cached derived property (vmin/vmax in uop/ops.py) by local rules — a constant is [v, v], a Range over n is [0, n−1], addition adds endpoints, multiplication takes the extremal product, division and modulus by constants get careful special cases, and everything else defaults to its dtype’s range. The payoff is diffuse and constant: the symbolic simplifier uses intervals to discharge guards (a comparison whose operands’ intervals don’t overlap is a constant; a validity mask provably true is deleted), prove indices in-bounds (eliminating bounds checks), and fold the comparisons that index arithmetic generates. It is the quiet enabler of the louder features: Padto is affordable because most of the masks it introduces are discharged by interval reasoning, and rangeify’s index rewriting generates exactly the mod/div/comparison soup that interval bounds clean up.

The OCANNL port is unusually crisp because the receiving site already exists. scalar_t — the scalar expression IR that simplify_llc rewrites — has precisely the constructors an interval lattice wants: Constant is exact, Embed_index is bounded by its loop’s extent (statically known, since projections fix every loop range), Get is bounded by its dtype, and the arithmetic nodes take interval rules. An interval_of : scalar_t -> float * float (or the integer analogue over index expressions) slots into simplify_llc’s world as one more analysis the rewrite arms can consult.
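The analysis itself is short. The sketch below uses a hypothetical four-constructor expression type in Python rather than scalar_t, covers only integer index expressions, and omits the division and modulus cases; it shows the local rules and one guard being discharged.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class LoopIndex:
    extent: int   # the index ranges over 0 .. extent - 1

@dataclass(frozen=True)
class Add:
    a: object
    b: object

@dataclass(frozen=True)
class Mul:
    a: object
    b: object

def interval(e):
    """Return an inclusive (lo, hi) bound on the value of e."""
    if isinstance(e, Const):
        return (e.value, e.value)
    if isinstance(e, LoopIndex):
        return (0, e.extent - 1)
    if isinstance(e, Add):
        (al, ah), (bl, bh) = interval(e.a), interval(e.b)
        return (al + bl, ah + bh)
    if isinstance(e, Mul):
        # The extremes of a product are among the endpoint products.
        products = [x * y for x in interval(e.a) for y in interval(e.b)]
        return (min(products), max(products))
    raise TypeError(f"no interval rule for {e!r}")

def fold_lt(a, b):
    """Decide a < b from intervals: True, False, or None when undecided."""
    (al, ah), (bl, bh) = interval(a), interval(b)
    if ah < bl:
        return True
    if al >= bh:
        return False
    return None

# idx = i * 4 + j with i in [0, 15] and j in [0, 3], so idx is in [0, 63].
idx = Add(Mul(LoopIndex(16), Const(4)), LoopIndex(4))
assert interval(idx) == (0, 63)
assert fold_lt(idx, Const(64)) is True    # the guard idx < 64 can be deleted
assert fold_lt(idx, Const(32)) is None    # this one has to stay
```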

What makes this the best effort-to-payoff item in the inventory is that four existing efforts independently approximate it. The non-linear-inlining work’s injectivity stage (#133) is range arithmetic over loop extents by another name; the local-initialization tracking (#340) wants to know whether a read can precede a write, a question intervals over index expressions answer; the one-hot matcher (#343) needs side-conditions of the form “this index expression stays within these bounds”; and the surjectivity reasoning that landed in #420 is the same arithmetic again. Four passes, four ad-hoc approximations of one analysis — the port is the unifying upgrade, and it sequences naturally before the schedule layer, since Padto’s masks will want it on arrival.

Port 4: the shard axis — port the property, not the device tuples

tinygrad’s multi-device design has a part to admire and a part to decline, and they separate cleanly.

The mechanism: a tinygrad device may be an n-tuple of devices, a buffer created on a tuple is split across them, and the split position is tracked as one more derived property — axis, the sharding dimension — propagated by local per-op rules exactly the way dtype and shape are. Reshape remaps the axis so the shard boundary is preserved (it must solve for an output position where the element counts left of the boundary agree); Permute carries the axis through the permutation; a Reduce on the shard axis annihilates it (the result is no longer sharded); shrinking it likewise; Copy strips it. With that bookkeeping in place, the collectives are not runtime primitives but derived programs — the spec writes them out in six lines each: a broadcast is reshape-expand-copy, and

def allreduce(T):
  return allgather(reduce_scatter(T))

where reduce_scatter is itself a reshape-permute-copy-sum composition. The system always knows where data is split, because knowing it is just property propagation; and the communication patterns are ordinary graph programs the compiler schedules like anything else.
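Both halves of the mechanism can be modeled in plain Python. This is a sketch under stated assumptions, not the spec's definitions: the function names for axis propagation are hypothetical, the reduce rule assumes the reduced axis is kept with extent 1 so that positions do not shift, and the collectives operate on a list holding one flat vector per device.

```python
def permute_shard_axis(shard_axis, order):
    # Output axis j reads input axis order[j], so the shard lands at the
    # output position that names the old shard axis.
    return None if shard_axis is None else order.index(shard_axis)

def reduce_shard_axis(shard_axis, reduced_axis):
    # Reducing over the shard axis leaves an unsharded result.
    return None if shard_axis == reduced_axis else shard_axis

def reduce_scatter(per_device):
    # Device d ends up holding the sum, over all devices, of chunk d.
    n = len(per_device)
    size = len(per_device[0]) // n
    return [
        [sum(v[d * size + k] for v in per_device) for k in range(size)]
        for d in range(n)
    ]

def allgather(per_device):
    # Every device ends up holding the concatenation of all chunks.
    full = [x for chunk in per_device for x in chunk]
    return [list(full) for _ in per_device]

def allreduce(per_device):
    return allgather(reduce_scatter(per_device))

assert permute_shard_axis(0, (2, 0, 1)) == 1
assert reduce_shard_axis(1, 1) is None

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # one vector per device
assert reduce_scatter(grads) == [[11, 22], [33, 44]]
assert allreduce(grads) == [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Note what the toy takes for granted: equal chunk sizes and interchangeable devices.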

The part to port is that bookkeeping — the propagation algebra. The part to decline is the substrate it rides on: interchangeable device n-tuples. A tuple of devices is a homogeneity assumption — every shard the same size, every device the same kind, any device substitutable for any other — and it is precisely the assumption OCANNL’s first-class heterogeneous contexts are built to avoid. The previous essay’s motivating pair — a Mac Studio and a MacBook over a network — has shards of different sizes on devices of different backends with different transfer costs, and “which context performs the combination” is a real decision (bandwidth, memory headroom, who is busy) that no annotation can make by itself.

This port area now has a decision behind it, one to which the analysis here was an input. OCANNL’s sharding cluster (#293) reached its design verdict in June 2026 (the 293b elaboration): sharding lands as per-shard backend contexts with explicit primitives — shard_along, gather, grad_sync — host-orchestrated, with merge-buffer copies as the one cross-shard channel, composing with slice-as-alias for copy-free sharding and the training-loop integration on top. Read against that verdict, the tinygrad algebra is not a rival proposal but the checking and sugar layer over it: a per-tnode placement annotation, threaded through shape inference and projections the way tinygrad threads axis through UOps, would let the system (first) verify that the explicit primitives are placed consistently — that a gather really sits where a shard boundary dies, that a reduce over a sharded axis is preceded by the grad_sync it needs — and (later) insert them, lowering annotation boundaries to the same merge-buffer transfers the explicit code writes today, with terser syntax as the surface. The honest port is therefore the propagation rules plus an explicit placement policy hook — the heterogeneity decision as a first-class parameter, where tinygrad hardcodes interchangeability — and its natural sequencing is after the explicit primitives ship and the feasibility-study end of multi-node work makes the policy hook earn its keep.

Port 5: the Function/Call layer — joint compilation of a lineage

tinygrad’s call layer is lambda calculus on graphs: Function substitutes Param placeholders with arguments, Call is an opaque invocation of something compiled, Tuple/GetTuple package multiple results, and ordering rides on After — a passthrough of a buffer that guarantees its dependencies executed, so that “assign” decomposes into a Store plus an After. Most of this, OCANNL has natively or does not need. Graph capture, as noted, is moot. The Call-shaped object exists as the routine; After’s job — ordering effects on shared buffers — is done at compile time by the hazard frontier in context.ml, which tracks per-tnode last_writer and last_readers and derives the read-after-write, write-after-read, and write-after-write edges as routines are compiled in a lineage (the execution-dependency-tracking design extends the same information to run time, and the merge-buffer static verification of #288 already leans on it).
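The frontier bookkeeping is simple enough to model. The sketch below is not the code in context.ml: it borrows only the names last_writer and last_readers from the description above, represents a routine as a hypothetical (name, reads, writes) triple, and derives the three kinds of edges as routines are added in order.

```python
def hazard_edges(routines):
    """routines: list of (name, reads, writes). Returns sorted (earlier, later, kind)."""
    last_writer = {}    # tensor -> routine that last wrote it
    last_readers = {}   # tensor -> routines that read it since that write
    edges = set()
    for name, reads, writes in routines:
        for t in reads:
            if t in last_writer:
                edges.add((last_writer[t], name, "RAW"))
        for t in writes:
            for reader in last_readers.get(t, ()):
                if reader != name:
                    edges.add((reader, name, "WAR"))
            if t in last_writer:
                edges.add((last_writer[t], name, "WAW"))
        # Advance the frontier only after this routine's edges are derived.
        for t in reads:
            last_readers.setdefault(t, set()).add(name)
        for t in writes:
            last_writer[t] = name
            last_readers[t] = set()
    return sorted(edges)

lineage = [
    ("init", [], ["w"]),
    ("fwd", ["w", "x"], ["y"]),
    ("upd", ["y"], ["w"]),
]
assert hazard_edges(lineage) == [
    ("fwd", "upd", "RAW"),    # upd reads the y that fwd wrote
    ("fwd", "upd", "WAR"),    # upd overwrites the w that fwd read
    ("init", "fwd", "RAW"),   # fwd reads the w that init wrote
    ("init", "upd", "WAW"),   # upd overwrites init's w
]
```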

The narrower idea worth taking is what tinygrad’s pipeline does between capture and render, read as a pair. First: a Function is a unit of joint compilation — everything inside it is scheduled and optimized as one program, however many kernels fall out. OCANNL’s analogue would be joint compilation of a lineage: several comps whose hazard edges are already known, handed to the compiler as one unit, producing one artifact — one C file, one ordered kernel sequence — instead of routine-at-a-time compiles whose boundaries are accidents of how the user grouped code. The hazard frontier means the dependency information is already in hand; what is missing is only the entry point that accepts more than one comp. Second, the spec’s Memory Plan stage: “allocate and reuse GLOBAL, LOCAL, and REG storage for values with non-overlapping lifetimes.” (Implementation honesty: in the current tree this is a per-kernel linear-scan register allocator, codegen/late/regalloc.py, for the ISA backends — the grand cross-kernel version is the spec’s ambition more than today’s code.) OCANNL holds the two ends of this already — Local_scope hoisting and CSE are the REG end, and the universal pool allocator with its lifetime-based reuse is the GLOBAL end (#340 sharpening the local end further). The middle — lifetime analysis across the routines of a jointly compiled lineage, so that two never-overlapping intermediates from different comps share a buffer — is exactly what joint lineage compilation would make well-posed. The two halves of this port are one feature seen from two sides.
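The middle piece, sharing storage between intermediates whose lifetimes never overlap, is a small greedy algorithm once the lifetimes are known. The sketch below is a generic first-fit planner, not tinygrad's allocator or OCANNL's pool: the function name and the (first_use, last_use, size) representation are assumptions, with uses numbered by position in the jointly compiled sequence.

```python
def plan_buffers(lifetimes):
    """lifetimes: name -> (first_use, last_use, size).

    Returns (name -> slot id, list of slot sizes).
    """
    slots = []        # each slot is [busy_until, size]
    assignment = {}
    by_start = sorted(lifetimes.items(), key=lambda kv: kv[1][0])
    for name, (start, end, size) in by_start:
        for slot_id, slot in enumerate(slots):
            # Reuse a slot whose previous occupant died before we start
            # and which is large enough.
            if slot[0] < start and slot[1] >= size:
                slot[0] = end
                assignment[name] = slot_id
                break
        else:
            slots.append([end, size])
            assignment[name] = len(slots) - 1
    return assignment, [size for _, size in slots]

# Intermediates from two different comps of one lineage.
lifetimes = {
    "a": (0, 2, 100),
    "b": (1, 3, 100),
    "c": (3, 5, 80),    # starts after a has died
    "d": (4, 6, 100),   # starts after b has died
}
assignment, slot_sizes = plan_buffers(lifetimes)
assert assignment == {"a": 0, "b": 1, "c": 0, "d": 1}
assert slot_sizes == [100, 100]   # four values, two buffers
```

Compiled one routine at a time, a and c would never be seen together, and the reuse would not be available.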

Port 6: the rewrite engine — port the discipline, not the machinery

The case for restraint first. tinygrad’s PatternMatcher exists because Python lacks pattern matching of the kind a compiler can lean on (its match statement, added in 3.10, is checked for neither exhaustiveness nor types), and much of its sophistication (pre-compiled match functions, early-reject sets) is performance work to make an interpreted matcher viable on hot paths. OCaml’s match is exhaustive, typed, and compiled; simplify_llc’s rewrite arms — constant folding, identity elimination, reassociation, FMA fusion, integer-power unrolling — are ordinary code that the compiler checks. Porting the machinery would be importing a workaround for a problem OCANNL’s language already solves.

What does not come for free in OCaml is what tinygrad gets as a side effect of rules-being-data: provenance and observability. Every tinygrad rewrite step is attributable — a named rule, in a named stage, and the VIZ=1 tooling will show you every rewrite that fired on the way from tensor expression to kernel. OCANNL’s rewrites are anonymous match arms, and the cost of anonymity has already been paid once: the CSE alpha-equivalence unsoundness was a subtle bad rewrite that lived in the optimizer precisely because no per-rule accounting existed to make it visible. The relevant trajectory is that the rule count is about to grow — the one-hot/gather collapse (#343) and the non-linear inlining family (#133) are pattern-rewrites with side-conditions, exactly the population that wants rules-as-data with provenance, run to fixpoint, with the lowering audit as the documentation prerequisite. An e-graph over scalar_t — equality saturation instead of ordered rewriting — is the possible eventual shape, already surveyed as a v0.9 candidate in the superoptimizer review. Lowest urgency of the six; the trigger to act is the moment match arms in simplify_llc start needing comments that say “this must run before that.”
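What rules-as-data with provenance looks like can be shown in miniature. This is a toy rewriter, not tinygrad's PatternMatcher or a proposal for simplify_llc's exact shape: expressions are nested tuples, the three rules are hypothetical, and the log is a plain list of (rule name, before, after) entries.

```python
def mul_one(e):
    if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1:
        return e[1]
    return None

def add_zero(e):
    if isinstance(e, tuple) and e[0] == "add" and e[2] == 0:
        return e[1]
    return None

def fold_const(e):
    if isinstance(e, tuple) and isinstance(e[1], int) and isinstance(e[2], int):
        return e[1] + e[2] if e[0] == "add" else e[1] * e[2]
    return None

# Rules are data: each has a name, so each firing is attributable.
RULES = [("mul_one", mul_one), ("add_zero", add_zero), ("fold_const", fold_const)]

def rewrite(expr, rules, log):
    """Rewrite bottom-up to fixpoint, recording every rule that fires."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(c, rules, log) for c in expr[1:])
    for name, rule in rules:
        new = rule(expr)
        if new is not None:
            log.append((name, expr, new))
            return rewrite(new, rules, log)
    return expr

log = []
result = rewrite(("add", ("mul", "x", 1), ("mul", 0, 5)), RULES, log)
assert result == "x"
assert [name for name, _, _ in log] == ["mul_one", "fold_const", "add_zero"]
```

The log is the point: a bad rewrite shows up as a named entry with its before and after, rather than as an anonymous arm somewhere in a large match.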

The bets underneath

Step back and the six ports sort themselves by what kind of thing they are. Three are mechanisms tinygrad has and OCANNL lacks outright — the schedule layer, the axis types, the intervals — and they form a single dependency chain: intervals make masks cheap, axis types make schedules expressible, schedules make the search worth running. Their sequencing is forced, and it is the v0.8-and-beyond performance arc under another description. One is a property to re-derive on a different substrate (the shard axis, as sugar over the decided explicit primitives). One is a boundary redraw (joint lineage compilation, making the unit of optimization bigger than the unit of authorship). And one is a discipline (rules with provenance) whose machinery should stay home.

What the deep dive changes, relative to its ShapeTracker-era predecessor, is where the gap between the frameworks lives. It used to live in the representation: stride stacks versus loop nests was a translation problem, and every comparison stumbled on it. Rangeify dissolved that — the two systems now agree that movement is index arithmetic on early-created loops, that materialization is a cost decision, that views are the special case and not the foundation. The gap that remains is the schedule: tinygrad has a complete, minimal, searchable language for mapping one loop nest onto hardware many ways, and OCANNL — whose single-threaded kernels are a sequencing choice, but a fact all the same — does not yet. That is a much better gap to have. It is bounded, it is well-specified by a working system whose spec fits on a page, and OCANNL’s side of the foundation (typed loop IR, static shapes, contexts as values) is laid.

And the roadmap sentence this all orbits — “instead of dynamic scheduling as in tinygrad, we can schedule statically by program search” — reads differently after the dive than before it. The opposition it names is real but narrower than it sounds: tinygrad’s BEAM is program search, run per-kernel at compile time against the live device; the “static” ambition is the same search moved offline, its results cached and shipped, its cost models trained rather than measured. The two frameworks disagree about when the search runs and what state it mutates — tinygrad searches against a global device at trace time, OCANNL would search over schedule values against context values, whenever convenient — and that disagreement is the same one this whole comparison kept finding: one dialect rewritten in place under derived properties, against typed layers and values inferred ahead. tinygrad spends late-bound flexibility to avoid early commitments; OCANNL spends early commitments to know things sooner. Both systems refuse primitives; they disagree about when to bind. The refusal is why the ports are possible. The binding is why they are worth doing.

tinygrad is at github.com/tinygrad/tinygrad; the spec quoted throughout is spec/tinyspec.tex at commit 2bfdf85f87da. OCANNL is open source at github.com/ahrefs/ocannl.