Split and Concatenation Are Two Directions of a Coproduct: Coupled Iteration in OCANNL

Łukasz Stafiniak and Claude (Anthropic)

Split and Concatenation Are Two Directions of a Coproduct

Here is the loop nest that concatenating two vectors compiles to, a of length A and b of length B placed end to end into out of length A + B:

for i in 0 .. A-1:
    out[i] = a[i]
for j in 0 .. B-1:
    out[A + j] = b[j]

And here is the gradient — the incoming gradient g on the joined axis distributed back to the two sources:

for i in 0 .. A-1:
    a.grad[i] += g[i]
for j in 0 .. B-1:
    b.grad[j] += g[A + j]

The two nests are the same arithmetic run in opposite directions. Forward, two reads are gathered into one output axis, each source written to its own stretch of the result; backward, one gradient is scattered back, each stretch read off and returned to its source. The offset A — the length of the first component — is the only quantity that does any work, and it appears in both, once placing a write and once placing a read. This post is about that offset, and about the single structure the two nests traverse in opposite directions.

Three companion pieces built the machinery this rests on. Broadcasting Is an Order developed shape inference: what a shape is, and how a solver fills in the sizes. Contraction Is Emergent developed projection inference: how the same constraints, re-read as an equivalence over axes rather than an order over sizes, become the loop nest of an operation, with reductions read off the result rather than declared. Convolution Is a Contraction added the affine index — a third index form, a function of loop variables — and with it a stratified solve-then-evaluate solver, padding accumulated as a monotonic fixpoint and frozen at finalization, and tensor initialization. All three deferred one construct, and named it the same way each time: the one that enlarges the machinery rather than populating a corner of it. This is that construct.

What the first three posts kept flat

Every operation in the predecessors compiled to a single nest over a product space — a Cartesian product of integer ranges, one loop variable per factor, the factors independent. A matrix multiply iterates {i, j, k}; a pointwise add iterates the result’s axes; a convolution iterates output positions and kernel positions, and reads its input at an affine combination of the two. The affine index of the third post looked like it might break the pattern, since one index-map component named two loop variables instead of one — but it did not: it added no factor and coupled none, it only read an existing flat product at a computed address. The third post said so in as many words: the product space stayed flat throughout, one factor per axis, each a single range. The first two posts had each flagged, as the thing they were deferring, exactly the construct that would not stay flat.

Concatenation is that construct, and the break is precise. A factor stops being a single range and becomes a disjoint union of ranges, traversed in sequence rather than as a product. In the opener, the output axis of length A + B is one factor, but iterating it is not one loop — it is the A-loop followed by the B-loop, two ranges laid end to end, each driving a different source. The product of ranges that carried three posts acquires its first coproduct.

It is worth saying immediately why this is the same word a category theorist would reach for, because the structure earns the name and the rest of the post turns on it. A coproduct of two ranges R_a and R_b — their disjoint union R_a \sqcup R_b — comes equipped with two injections, \iota_a : R_a \hookrightarrow R_a \sqcup R_b and \iota_b : R_b \hookrightarrow R_a \sqcup R_b, that place each summand into the whole without overlap. For ranges of integers the injections are exactly the offset maps: \iota_a(i) = i and \iota_b(j) = A + j. Those are the two addresses in the opener. Concatenation is the injections — it places each component into its slice of the result, offset by the lengths of the components before it. The defining property of a coproduct is that an element of the whole comes from exactly one summand: the slices tile the result and do not overlap. We will see that the disjointness — no position claimed by two summands — is not decoration but a checked invariant, the thing that separates a well-formed concatenation from an ambiguous one.

This is the categorical dual of what the predecessors iterated. Their domain was a product of ranges, and a contraction (the second post) concerns a product factor the output omits, reduced away by the assignment’s accumulation. Concatenation introduces the dual construct: a coproduct factor. And because the size of a disjoint union is the sum of its parts’ sizes, the joined axis comes out exactly as long as its components together — the one place a sum appears here, and it is the cardinality of the coproduct, nothing more.

One term, two readings

The affine index of the third post came from enriching the dimension term with a constructor read by two solvers — a size solver that asked how big an axis is, and a projection solver that asked which cell it addresses — so that size and address could not disagree, because they were computed from one source. Concatenation enriches the same term the same way. Alongside the third post’s Affine, the dimension term gains

Concat of dim list

a dimension built by concatenating its components. And it is read twice.

The size reading is a sum: once the components are known, a Concat resolves to the sum of their dimensions. A concatenation of a length-A and a length-B axis is a length-(A + B) axis. (When the size solver meets a Concat whose components repeat the same variable, it may rewrite it internally as a stride-only affine term — n copies of a length-d axis being a length-n·d axis. That reuses the affine representation for something that is not affine as a projection, but the rewrite is confined to size solving and never reaches projection inference, so nothing downstream sees the discrepancy.)

The projection reading is where the offset lives. As an index, a concatenated axis is

Concat of symbol list

one iterator symbol per component, and at lowering it resolves to whichever component is active at that point in the iteration, addressed at that component’s iterator plus the injection offset — the cumulative size of the components before it. The first component reads at its own iterator with offset zero; the second at its iterator plus the first component’s length; the k-th at its iterator plus \sum_{m<k} (length of component m).

One small asymmetry with the broadcast machinery is worth flagging now because it will matter at lowering. The second post taught that a size-one axis that broadcasts is severed and pinned to Fixed_idx 0 — it earns no loop, it is read at its single cell while its partner iterates. A size-one axis that participates in a Concat is not collapsed that way. Every component of a concatenation keeps a real iterator symbol, even a length-one one, because lowering needs those symbols to decide which component is active and to compute its offset. Broadcasting throws away the loop of a unit axis; concatenation cannot, because the unit is one summand of the disjoint union and the bookkeeping needs to know it is there.

The coupling: a second union-find

The second post’s projection inference was a union-find. It read the elaborated constraints as an equivalence over axes: two axes the spec forced to coincide were unioned into a co-iteration class, one loop variable per class, and a reduction was an output that omitted a class. Equality unions; broadcasting severs. The whole of it was flat — each class became one factor, each factor one range.

Concatenation adds a second union-find, over the same symbols, computing a different relation. After the first pass has settled which axes co-iterate and minted iterator symbols for them, the projection deriver collects every Concat index and treats it as a hyperedge over the symbols it names. It then takes the connected components of that hypergraph. Each component becomes one factor of the product space, and the symbols within a component are made to iterate sequentially — one after another, the result laid out as consecutive loops — rather than nested as an independent product would be. The product space is consequently no longer an array of single ranges but an array of lists of ranges: a factor carries a list of sizes and a list of iterators, singletons everywhere except where a concatenation coupled several axes into one.

The two union-finds compute different relations over the same symbols. The second post’s groups symbols that are the same loop: an equivalence, “these axes are driven by one variable.” This post’s groups symbols that must occupy the same factor and run consecutively: “these axes are not the same loop, they are different loops that share a slot and take turns.” The first is identification, the second is sequencing. An equivalence class collapses its members to one iterator; a connected component keeps its members as distinct iterators and orders them. Where the second post said equality unions, broadcasting severs, this post adds the third verb: concatenation couples.

The sequencing is what the disjoint union demands. Iterating R_a \sqcup R_b means running through R_a and then through R_b — not the pairs R_a \times R_b, which would be their Cartesian product, a different operation. So the lowering emits the components of a coupled factor as a flat sequence of loops, not a nest. At each point one component is being traversed; the others are inactive.

That the components run in sequence is a fact about the lowered IR, not about the operation. The components are independent — prefix and suffix could be written in either order, or at once — so the consecutive loops are an over-specification, a total order imposed on a computation that requires only a partial one. This is the lowered IR’s character in general, not a quirk of concatenation: it can be read as committing to a purely sequential schedule. OCANNL treats that as a separation of concerns. The backends — the GPU emitters and optimizers — are meant to re-derive the independences latent in the lowered IR and parallelize accordingly, and to refuse a computation whose shape does not admit efficient execution, rather than to faithfully honor whatever sequential reading the IR happens to express. Efficient GPU execution rests on order not affecting the result, and the coupling is built so that it does not. (This part of OCANNL is not yet fully developed.)

Two directions, and the swap between them

Forward concatenation reads several sources and writes one result, placing each source into its slice. Its components are the injections; the operation is a gather along them. But the same Concat index, and the same offsets, describe a whole family of operations, distinguished only by which side of the assignment the concatenated axis sits on and which components are present:

Spec	Operation
`a; b => a^b`	concatenate: place two sources end to end
`a^b => a`	extract the prefix (slice)
`a^b => b`	extract the suffix (slice)
`a => a^b`	write the prefix, leave the suffix (partial update)
`b => a^b`	write the suffix, leave the prefix (partial update)
`a^b^c => b`	extract the middle
`b => a^b^c`	write the middle

Reading the index along the injections — component to its offset position in the whole — builds the result: concatenation, and the partial updates that write one slice. Reading it against the injections — restricting the whole back to one summand’s range — takes a part out: the slices and extractions. These are the two directions the title names. A coproduct does not come with projections the way a product does, so “the other direction” is not a categorical projection; it is the partial inverse of an injection, the restriction of the joined axis to one component’s image. Split is concatenation read backwards, and that is all the relation there is between them: one structure, the disjoint union with its offsets, traversed two ways.

The gradients make the swap literal. The gradient of a gather is a scatter: differentiate a; b => a^b and the incoming gradient on the joined axis must be distributed back to a and b, each component reading its slice — which is the backward direction, an extraction per source. The gradient of a slice is the reverse: differentiate a^b => a and the prefix’s gradient must be placed back into the prefix slice of the source’s gradient, zero elsewhere — the forward direction, a placement. So concatenation and split are not only two directions of one structure; each is the gradient of the other. The four operations — concatenate, split, and their two gradients — are the injections and their reverses, nothing more.

In the assignment IR this reversal has a name, and the reason it needs a dedicated one — rather than the slot permutation the second post used for ordinary gradients — is concrete. Every accumulating operation in the IR writes a single left-hand side: a Unop, Binop, or Block reads its operands and accumulates into one tensor. The forward concatenation a; b => a^b fits that mould as a Block — its Concat sits on the single written result and selects, per position, which one of the several read operands is live (at most one, by the disjoint cover). Its gradient does not fit. The incoming gradient on the joined axis has to be scattered back to several tensors, a’s gradient and b’s, with the Concat now choosing per position which one to write. That is a masked array of write targets, and no single-left-hand-side form expresses it; nor does any permutation of buffers within such a form, since a permutation relabels the one write target it has and cannot conjure a second. The IR provides exactly one construct whose written-to side is an array — Rev_sides — and it works by reuse, not re-derivation: it takes the forward Block, makes that block’s single left-hand side the tensor read from and its operand array the masked tensors written to, and carries the Concat, the offsets, and the coupled factor over untouched. A gather selects among reads and writes one; its gradient reads one and selects among writes; the move between them swaps which side is the array, and that is the move a permutation cannot make and Rev_sides is.

Block and Rev_sides are the only accumulating operations the IR lets carry concatenated axes, and the reason is the disjoint cover. A Concat index is well-defined as a write only if no result position is claimed by more than one source: the sources’ slices must not overlap. The lowering enforces this directly. As it walks the coupled factor it determines, for each position, which component’s range is being traversed, admits only the source whose component is active there, and resolves the Concat to that component’s iterator offset by its cumulative size. If two sources were ever viable at one position — if their slices overlapped — the operation would try to combine both into one cell, and the lowering rejects it as ambiguous. It is allowed for no source to be active at a position: a concatenation may leave gaps, which is how a partial update writes one slice and leaves the rest of the target untouched. At most one source per position — exactly one where the slices cover the result, none where they leave a gap — is the coproduct’s universal property restated as a code-generation check.

To see the surface syntax, take the single-source slice a^b => a in %cd. The forward extraction reads the prefix of the concatenated source into its whole output:

out =:+ source ~logic:"a^b => a"

The leading =: clears out to the accumulator’s neutral element and + is the accumulating operator — though on a write-once projection like this slice, where every output cell is assigned exactly once, the accumulation has nothing to do and the line coincides with a plain set, out =: source. The lowering knows this: when the projection is injective it emits a set and skips the accumulation altogether. (The colon-first spelling is the one the second post traced to OCaml fixing an assignment operator’s meaning by its first character, which is why there is no +=.) The accumulation earns its keep in the gradient, which adds rather than erases because source is shared with whatever else reads it:

source.grad =+ out.grad ~logic:"a => a^b"

Both of these are Blocks — a single-source slice and a single-target partial write — and ~logic builds each projection straight from its spec; neither is reversed. Rev_sides is what the gradient builder reaches for in the case of the previous paragraph — the multi-source concatenation whose gradient scatters to an array of tensors. An operation is defined by a pair of builders, a forward op_asn and a backward grad_asn, and both are handed the same projections (the gradient builder also receiving the incoming gradient and the operand tensors). The gradient is therefore not re-derived from a written spec: grad_asn already holds the forward’s projections, and for a concatenation it builds the Rev_sides-tagged assignment directly from them, array-valued write side and all. Only the many-target scatter truly requires the reversal — a single-target gradient could be a Block — though the ++^ combinator builds every concatenation gradient as a Rev_sides, the one-source case included.

The empty summand: dimension zero and `discardable_vars`

Concatenation extends the dimension order the first three posts were built on.

The first post’s order had a top: 1_\emptyset, the claim-free size-one axis, the thing a scalar carries and the thing rank-broadening inserts. It is the top of broadcasting — the no-claim unit inserted to make up rank, the identity of the operation that fills an axis by replication. In the algebra of sizes under broadcasting, 1_\emptyset is the unit of the product: replicating along a size-one claim-free axis changes nothing, A \times 1 \cong A — but that isomorphism is categorical, a fact about element counts, not an operational identity in the einsum sense. A materialized size-one axis still occupies a position and so shifts axis alignment, which is exactly why the first post keeps the rank-zero empty row and the rank-one []\cdot\diamond\cdot[1_\emptyset] distinct: same data, different operands.

Concatenation has a different unit, and it is not one but zero. The unit of a coproduct is the empty summand: A \sqcup \emptyset \cong A, joining an empty range to a range leaves it unchanged. A length-zero axis is the empty range, and so a concatenation that includes an empty component is a concatenation with one fewer component — the empty slice contributes no positions, places nothing, and the offsets of everything after it are unmoved. For the coproduct to have a unit at all, the size domain must be allowed to reach zero, which the first post’s domain (sizes \geq 1, with closing guessing unforced axes to 1) deliberately could not.

So concatenation requires a controlled extension: in concatenation contexts, certain dimension variables are permitted to resolve to zero, a size that is otherwise illegal. These are the components that a specification may discard or leave empty, and the system tracks them as a set of discardable variables (discardable_vars in the code). Permitting that zero is a deliberate exception, confined to exactly these variables and named for what licenses it: a variable is discardable precisely when its component can be dropped. When closing guesses unforced variables to the unit of the operation that owns them, an ordinary variable guesses to 1 (the product unit, the first post’s policy) but a discardable variable guesses to 0 (the coproduct unit). The two units sit side by side, each the identity of its own operation, and which one an axis falls to is decided by whether it is a discardable summand of a concatenation.

It is worth being exact about what “two units” means, because they do not live in one order. The first post’s order is the broadcast order: a d_1-axis refines a d_2-axis when the second broadcasts to fill the first, which puts 1_\emptyset at the top — the claim-free unit, refined by anything — leaves distinct concrete sizes mutually incomparable in the middle, and (once completed to a lattice) puts the broadcast error at the bottom. Zero is none of these. A length-zero axis broadcasts into nothing and nothing broadcasts into it: you can neither replicate an empty axis to fill a populated one nor collapse a populated one to empty. Under broadcasting 0 is neither above nor below 1, and it is not the error bottom either — it is incomparable to everything, off the broadcast order entirely (though it is the bottom of the additive order the coproduct itself induces, where 0 is a summand of every range: the familiar pattern of an empty product sitting at a top and an empty sum at a bottom). The two units belong to two structures sharing one carrier — broadcasting with unit 1_\emptyset (its top), concatenation with unit 0 (off the broadcast order) — so closing an unforced axis is not “fall to a distinguished element of one lattice” but “guess the unit of the operation that owns it”: 1_\emptyset where broadcasting left it free, 0 where a concatenation left a component free as the empty summand. The discardable set records which operation owns which free axis.

Which components are discardable is a real question. Consider what “discardable” should mean. In a^b => a, the result is just the prefix; the suffix b appears in the source spec but contributes nothing to the result, so b may be empty — it is the part being sliced away, and an extraction of the prefix should work whatever the suffix’s size, including zero. In a^b => b symmetrically a is discardable. But in a^b => a^b neither is: both components appear on both sides, both carry real data through, neither may vanish. The rule has to distinguish “this component is projected away on the other side” from “this component is carried through.”

The precise condition is a quantifier alternation. A variable v, appearing in one component of a Concat on one side of the spec, is discardable when: for every shape on the other side, there exists an axis such that for every component of that axis, the complement of v’s component intersects it — where the complement is the union of variables from the other components of v’s own concatenation. Read operationally: v’s component is discardable exactly when, on the other side, v’s siblings already cover some axis completely, so that v itself is redundant there and may be dropped to nothing. The worked cases:

a^b => a: the complement of b is \{a\}, and on the result side the single axis a is covered by \{a\} — so b is discardable. (Symmetrically a in a^b => b.)
a^b => a^b: the complement of b is \{a\}, but the result axis is itself a^b, whose components are a and b separately; \{a\} covers the a component but not the b component, so the “for every component” clause fails. Neither variable is discardable — correct, both carry through.
a; b => a^b: two sources, a alone and b alone. For b to be discardable its complement \{a\} would have to cover an axis of every source, but the second source is just b, which \{a\} does not touch. Neither is discardable.
a; b => a^c: here c is not discardable, because its complement \{a\} fails to cover the second source b. The slot c is left unconstrained, not redundant, and must be solved rather than zeroed.
a; b => a^b^c: c’s complement is \{a, b\}, and each source — a alone, b alone — is an axis fully covered by \{a, b\}, so c is discardable. The result names a third component that no source fills, and it falls to the empty summand: the spec computes a^b with a vestigial empty slot.

The alternation — for all shapes, there exists an axis, for all components — is doing exactly the work of asking whether a summand is already accounted for elsewhere and may therefore be the empty summand. It is the syntactic test for “this is a unit of the coproduct here,” and it is what lets one notation express both a concatenation (every component carries through, nothing discardable) and a slice or partial update (the unwritten components discardable to zero) without a separate operator for each. Specifications with integer-sized components — 3^a => a, skip three at the front; a => a^3, write all but the last three — are the same idea with the discarded summand’s size given rather than inferred: the constant component becomes a fresh internal symbol that the projection skips, a known-size empty-on-the-result slice.

The a; b => a^b^c case is worth pausing on as a design question rather than a derivation, because it is the rule at its most permissive. The spec is accepted and c falls to nothing, which is graceful when the empty slot is meant — a fixed template prefix^middle^suffix reused in a call that supplies only two of the three parts, the third defaulting to empty — and it is the very mechanism that lets the ^ notation cover slices and partial updates with no new operator. But it also accepts a typo: a user who wrote c expecting a third source and forgot to pass it gets a silently empty component rather than an error. Whether that should stay silent, warn, or demand that an intentionally empty slot be marked explicitly — some annotation separating “leave this empty” from “a source belongs here” — is open, and the coproduct reading is neutral on it, since an empty summand is well-defined either way. The decision is about how much the surface notation should let inference guess on the user’s behalf: the same tension the first post’s closing policy raised, now with 0 rather than 1 as the value being guessed into existence.

Operationally, a discardable variable is a concatenation component the operation does not iterate. A component is discardable just when it is projected away on the other side of the spec — written in the notation but contributing no indexing to the computation — and an axis the operation never reads needs no extent, so its size is free to fall to zero. “May be the empty summand” and “is not iterated here” are one fact seen from the size end and the projection end. The size is still shared across the program, so a discardable variable can be forced positive from elsewhere: if the same variable indexes a real axis in another operation it takes that size there, while the operation that discards it simply never loops over it. Discardability is a verdict on an occurrence in a spec, not on the variable globally.

There is care needed in the solver so that this extension does not destabilize the rest. The first post’s closing policy guessed each unforced axis to a unit; now there are two units to choose between, and guessing the wrong one is a bug, not a broadcast. So the discardable set is computed up front from the specification’s structure (the quantifier test above, run over the spec’s two sides) and consulted at exactly the closing stages where unit-guessing happens, which guess discardable variables to 0 and everything else to 1. And unification of Concat terms is mostly structural — two concatenations of equal length unify component-wise — with one inequality case held open: a concatenation almost all of whose components are discardable can be left as an inequality rather than forced to an equation, so the solver can still infer the one live component from context. The machinery is small, but it is the first place the size domain has more than one unit to default to, and the second unit is there because the coproduct needs one.

Block matrices: the coproduct in more than one axis

Nothing restricts a single operation to one coupled factor. A specification may concatenate along several axes at once, and when it does, the coproduct structure reaches into more than one factor of the product space — which is exactly a block matrix and its relatives.

Take a result whose two axes are each a concatenation: a 2\times2 block matrix assembled from four sub-blocks is a result indexed r^s, c^t (rows split into r then s, columns into c then t), with each source block written to one (row-slice, column-slice) rectangle. Here two factors are coupled, the row factor and the column factor, and the active component must be chosen independently in each. The disjoint-cover invariant now lives in two dimensions: the rectangles tile the result, and at each point exactly one block is active — the pair of active row-component and active column-component picks it out. The lowering handles the multi-axis case by the same machinery as the single-axis one, the active-component test applied per concatenated axis and the offsets accumulated per axis; the no-overlap check that guarded one coupled factor now guards the pair, rejecting any position where two rectangles are simultaneously viable.

The case worth dwelling on is when the concatenations across different axes share symbols. If the row split and the column split are driven by the same iterators rather than independent ones, the active rectangle is constrained to the diagonal — the operation writes a block-diagonal or banded result, the off-diagonal blocks never active because no point makes two unrelated components simultaneously live. Disjoint symbols across the concatenated axes give a full block tensor; overlapping symbols give a partially diagonal one. That a single mechanism — couple the axes, tile the factors, admit at most one summand per cell — produces both block assembly and banded structure, with the difference being only whether the couplings share iterators, is the clearest measure of how far the coproduct reading reaches. It is the same observation the convolution post made about the affine index, that one enriched term covered convolution, striding, interleaving, and their inverses by varying a coefficient; here one coupled-factor mechanism covers concatenation, slicing, block matrices, and banded tensors by varying how many axes are coupled and whether they share symbols.

(This is a statement about the operation, not about any particular surface syntax. What notation should most conveniently describe a block assembly — whether the natural thing for a user to write is a blocking spec or a stacking construct — is a separate question, returned to below. The machinery underneath is concatenation along one or more axes, however it is spelled.)

A later seam

The convolution post gave up one property to get the affine index: flatness of the solve. Its index could name loop variables the solver had not yet minted, so the solve ran in two strata — fix the base co-iteration classes, then evaluate the derived terms against them — and a check between two derived terms could fire only after the stratum that evaluated its operands, so a conflict could surface later than where the constraint was written. Concatenation continues down this road.

The connected-component decomposition is a structural pass that runs after the co-iteration union-find, on its output: it cannot group symbols into factors until the symbols exist. And the disjoint-cover and ambiguity checks fire later still, at lowering, when the offsets are concrete and the lowering can ask, position by position, which component is active — an overlap is detected there, when two components turn out viable at one cell, not when the spec was parsed. So the monotonic-core-with-frozen-seams picture from the convolution post (where padding accumulated under monotonic rules and was committed once at finalization) gains another seam: the coupling and its tiling check are resolved past the point where shapes are closed, on the far side of the same finalization boundary. The core stays monotonic; the non-monotonic commitments cluster at the seams; concatenation adds one.

Reaching the surface, and a note on what to build downstream

At the surface, concatenation is the einsum notation of the convolution post with ^ joining axis components. Under %op the operator ++^ expands to concat (also spelled concat_sum): it assembles a Block from an array of sources against a spec — (a, b, c) ++^ "x, ...; y, ...; z, ... => x^y^z, ..." joins three tensors along their first axis — accumulating by sum, as the name records; its Rev_sides gradient likewise adds into the source gradients. The capture list from the convolution post carries over unchanged: ++^ takes spec labels whose inferred sizes are bound for reuse, the same variable_ref mechanism, now reporting the size a component resolved to. Single-source members of the family — slices and partial updates — need no special operator; they are ordinary %cd assignments whose ~logic carries a ^ spec, as with the out =:+ source ~logic:"a^b => a" above. And summed placement is only the default: because an operation is nothing but an op_asn/grad_asn pair of %cd code over a projections, a user can run a concatenation’s coupled projection through arbitrary arithmetic instead of an identity copy, fusing the join with a computation in one pass. %cd gives that directly.

That leaves stacking, which is a near neighbour of concatenation and a good illustration of what the coproduct reading buys downstream, but is not a concatenation. Concatenation extends an existing axis: it joins length-A and length-B data into one length-(A+B) axis, and the components may differ in size. Stacking introduces a new leading axis: it takes N equally-shaped tensors and produces one tensor with a fresh axis of size N, the components stacked along it. In coproduct terms the difference is sharp. Concatenation is a coproduct with possibly-unequal summands glued into an existing factor. Stacking is the regular case — N summands all of the same size — and crucially it makes a new factor rather than enlarging one, so its iteration is N × (component shape), a genuine product of the new axis with the component, where concatenation’s is a sum within one factor.

The implementation question is how to build stacking on the machinery just developed, and the coproduct reading points to an answer. One could desugar a stack into a concatenation through the einsum path — give each component a fresh size-one axis and concatenate along it, N unit slices tiling a new length-N axis — and the documentation at one point described doing exactly that. But that routes a regular, equal-sized, new-axis operation through the irregular, possibly-unequal, existing-axis spec language, and overloads the einsum notation to carry it. The cleaner design, which the structure points to, is to give stacking its own shape-logic entry — it is a different operation, with a regularity the spec language does not need to express — while having it share the projection machinery with concatenation, since a stack of N equal components along a fresh axis is precisely the regular coupled factor, the same connected-component-and-offsets code with all offsets a multiple of one component size. A dedicated entry for the size logic, shared coupling for the projections: the operations are siblings in projection and distinct in shape, and the code follows that exactly rather than forcing one to wear the other’s notation. (Tensor literals fall out as the scalar base case of stacking, the way the convolution post observed the affine index already contained the iterator and the fixed index as its degenerate forms — a literal is a stack of zero-dimensional components, the brackets that separate output, input, and batch axes being stacks along each kind.)

A surface detail of stacking is worth recording because it is a parsing question the coproduct framing does not settle on its own: the bracket that introduces a new input axis is the parenthesis-comma form (a, b), which collides syntactically with an ordinary multi-argument application. The disambiguation is contextual and rests on a structural fact about OCANNL shapes — input axes are always nested within output axes — so the parser can tell a stacking construct in axis position from an argument list in application position by where it sits. The details are not yet pinned down, and they are a syntax concern rather than a shape-inference one; the article’s job is the machinery, and the machinery does not care whether a given stack was written with parentheses, brackets, or array bars.

Where this leaves the series

Each of the first three posts found a familiar operation hiding inside a humbler one and reused the core rather than extending it. Broadcasting was an order, and rank-polymorphism lived in the order with no type schemes added. Contraction was emergent, read off an output index map that omitted a loop variable, with no “reduce” primitive. Convolution was a contraction, the same loop nest with one index made affine, and striding and interleaving and their gradients came along for free in the same enriched term. The refrain was: the construct populates a corner the machinery already had.

Concatenation is the one that enlarged the machinery, and the enlargement can be named precisely. The iteration domain, a pure product of ranges through three posts, gains its first coproduct: one factor becomes a disjoint union, its components coupled into sequential traversal, their offsets the coproduct’s injections, their non-overlap its at-most-one-summand condition, and the size domain extended to zero — where the first post’s order stopped at one — to give the coproduct a unit. Once the domain has a coproduct, the rest is two directions through it. Concatenation runs along the injections and places components in; split runs against them and reads a component out; the gradient of each is the other, the same offsets with the assignment’s sides reversed. Block matrices are the coproduct in more than one factor at once; banded tensors are that with the couplings sharing iterators. One coupled factor, read forwards and backwards, is the whole of what concatenation added to a domain that until now knew only products.

OCANNL is open source, at github.com/ahrefs/ocannl.