Split and Concatenation Are Two Directions of a Coproduct: Coupled Iteration in OCANNL
Split and Concatenation Are Two Directions of a Coproduct
Here is the loop nest that concatenating two vectors compiles to,
a of length A and b of length
B placed end to end into out of length
A + B:
for i in 0 .. A-1:
out[i] = a[i]
for j in 0 .. B-1:
out[A + j] = b[j]
And here is the gradient — the incoming gradient g on
the joined axis distributed back to the two sources:
for i in 0 .. A-1:
a.grad[i] += g[i]
for j in 0 .. B-1:
b.grad[j] += g[A + j]
The two nests are the same arithmetic run in opposite directions.
Forward, two reads are gathered into one output axis, each
source written to its own stretch of the result; backward, one gradient
is scattered back, each stretch read off and returned to its
source. The offset A — the length of the first component —
is the only quantity that does any work, and it appears in both, once
placing a write and once placing a read. This post is about that offset,
and about the single structure the two nests traverse in opposite
directions.
Three companion pieces built the machinery this rests on. Broadcasting Is an Order developed shape inference: what a shape is, and how a solver fills in the sizes. Contraction Is Emergent developed projection inference: how the same constraints, re-read as an equivalence over axes rather than an order over sizes, become the loop nest of an operation, with reductions read off the result rather than declared. Convolution Is a Contraction added the affine index — a third index form, a function of loop variables — and with it a stratified solve-then-evaluate solver, padding accumulated as a monotonic fixpoint and frozen at finalization, and tensor initialization. All three deferred one construct, and named it the same way each time: the one that enlarges the machinery rather than populating a corner of it. This is that construct.
What the first three posts kept flat
Every operation in the predecessors compiled to a single nest over a
product space — a Cartesian product of integer ranges,
one loop variable per factor, the factors independent. A matrix multiply
iterates {i, j, k}; a pointwise add iterates the result’s
axes; a convolution iterates output positions and kernel positions, and
reads its input at an affine combination of the two. The affine index of
the third post looked like it might break the pattern, since one
index-map component named two loop variables instead of one — but it did
not: it added no factor and coupled none, it only read an existing flat
product at a computed address. The third post said so in as many words:
the product space stayed flat throughout, one factor per axis, each
a single range. The first two posts had each flagged, as the thing
they were deferring, exactly the construct that would not stay flat.
Concatenation is that construct, and the break is precise. A factor
stops being a single range and becomes a disjoint union of
ranges, traversed in sequence rather than as a product. In the
opener, the output axis of length A + B is one factor, but
iterating it is not one loop — it is the A-loop
followed by the B-loop, two ranges laid end to
end, each driving a different source. The product of ranges that carried
three posts acquires its first coproduct.
It is worth saying immediately why this is the same word a category theorist would reach for, because the structure earns the name and the rest of the post turns on it. A coproduct of two ranges R_a and R_b — their disjoint union R_a \sqcup R_b — comes equipped with two injections, \iota_a : R_a \hookrightarrow R_a \sqcup R_b and \iota_b : R_b \hookrightarrow R_a \sqcup R_b, that place each summand into the whole without overlap. For ranges of integers the injections are exactly the offset maps: \iota_a(i) = i and \iota_b(j) = A + j. Those are the two addresses in the opener. Concatenation is the injections — it places each component into its slice of the result, offset by the lengths of the components before it. The defining property of a coproduct is that an element of the whole comes from exactly one summand: the slices tile the result and do not overlap. We will see that the disjointness — no position claimed by two summands — is not decoration but a checked invariant, the thing that separates a well-formed concatenation from an ambiguous one.
This is the categorical dual of what the predecessors iterated. Their domain was a product of ranges, and a contraction (the second post) concerns a product factor the output omits, reduced away by the assignment’s accumulation. Concatenation introduces the dual construct: a coproduct factor. And because the size of a disjoint union is the sum of its parts’ sizes, the joined axis comes out exactly as long as its components together — the one place a sum appears here, and it is the cardinality of the coproduct, nothing more.
One term, two readings
The affine index of the third post came from enriching the dimension
term with a constructor read by two solvers — a size solver that asked
how big an axis is, and a projection solver that asked which cell it
addresses — so that size and address could not disagree, because they
were computed from one source. Concatenation enriches the same term the
same way. Alongside the third post’s Affine, the dimension
term gains
Concat of dim list
a dimension built by concatenating its components. And it is read twice.
The size reading is a sum: once the components are
known, a Concat resolves to the sum of their dimensions. A
concatenation of a length-A and a length-B
axis is a length-(A + B) axis. (When the size solver meets
a Concat whose components repeat the same variable, it may
rewrite it internally as a stride-only affine term — n
copies of a length-d axis being a length-n·d
axis. That reuses the affine representation for something that is not
affine as a projection, but the rewrite is confined to size solving and
never reaches projection inference, so nothing downstream sees the
discrepancy.)
The projection reading is where the offset lives. As an index, a concatenated axis is
Concat of symbol list
one iterator symbol per component, and at lowering it resolves to whichever component is active at that point in the iteration, addressed at that component’s iterator plus the injection offset — the cumulative size of the components before it. The first component reads at its own iterator with offset zero; the second at its iterator plus the first component’s length; the k-th at its iterator plus \sum_{m<k} (length of component m).
One small asymmetry with the broadcast machinery is worth flagging
now because it will matter at lowering. The second post taught that a
size-one axis that broadcasts is severed and pinned to
Fixed_idx 0 — it earns no loop, it is read at its single
cell while its partner iterates. A size-one axis that participates in a
Concat is not collapsed that way. Every
component of a concatenation keeps a real iterator symbol, even a
length-one one, because lowering needs those symbols to decide which
component is active and to compute its offset. Broadcasting throws away
the loop of a unit axis; concatenation cannot, because the unit is one
summand of the disjoint union and the bookkeeping needs to know it is
there.
The coupling: a second union-find
The second post’s projection inference was a union-find. It read the elaborated constraints as an equivalence over axes: two axes the spec forced to coincide were unioned into a co-iteration class, one loop variable per class, and a reduction was an output that omitted a class. Equality unions; broadcasting severs. The whole of it was flat — each class became one factor, each factor one range.
Concatenation adds a second union-find, over the same symbols,
computing a different relation. After the first pass has settled which
axes co-iterate and minted iterator symbols for them, the projection
deriver collects every Concat index and treats it as a
hyperedge over the symbols it names. It then takes the
connected components of that hypergraph. Each component
becomes one factor of the product space, and the symbols within a
component are made to iterate sequentially — one after another,
the result laid out as consecutive loops — rather than nested as an
independent product would be. The product space is consequently no
longer an array of single ranges but an array of lists of
ranges: a factor carries a list of sizes and a list of iterators,
singletons everywhere except where a concatenation coupled several axes
into one.
The two union-finds compute different relations over the same symbols. The second post’s groups symbols that are the same loop: an equivalence, “these axes are driven by one variable.” This post’s groups symbols that must occupy the same factor and run consecutively: “these axes are not the same loop, they are different loops that share a slot and take turns.” The first is identification, the second is sequencing. An equivalence class collapses its members to one iterator; a connected component keeps its members as distinct iterators and orders them. Where the second post said equality unions, broadcasting severs, this post adds the third verb: concatenation couples.
The sequencing is what the disjoint union demands. Iterating R_a \sqcup R_b means running through R_a and then through R_b — not the pairs R_a \times R_b, which would be their Cartesian product, a different operation. So the lowering emits the components of a coupled factor as a flat sequence of loops, not a nest. At each point one component is being traversed; the others are inactive.
That the components run in sequence is a fact about the lowered IR, not about the operation. The components are independent — prefix and suffix could be written in either order, or at once — so the consecutive loops are an over-specification, a total order imposed on a computation that requires only a partial one. This is the lowered IR’s character in general, not a quirk of concatenation: it can be read as committing to a purely sequential schedule. OCANNL treats that as a separation of concerns. The backends — the GPU emitters and optimizers — are meant to re-derive the independences latent in the lowered IR and parallelize accordingly, and to refuse a computation whose shape does not admit efficient execution, rather than to faithfully honor whatever sequential reading the IR happens to express. Efficient GPU execution rests on order not affecting the result, and the coupling is built so that it does not. (This part of OCANNL is not yet fully developed.)
Two directions, and the swap between them
Forward concatenation reads several sources and writes one result,
placing each source into its slice. Its components are the injections;
the operation is a gather along them. But the same
Concat index, and the same offsets, describe a whole family
of operations, distinguished only by which side of the assignment the
concatenated axis sits on and which components are present:
| Spec | Operation |
|---|---|
a; b => a^b |
concatenate: place two sources end to end |
a^b => a |
extract the prefix (slice) |
a^b => b |
extract the suffix (slice) |
a => a^b |
write the prefix, leave the suffix (partial update) |
b => a^b |
write the suffix, leave the prefix (partial update) |
a^b^c => b |
extract the middle |
b => a^b^c |
write the middle |
Reading the index along the injections — component to its offset position in the whole — builds the result: concatenation, and the partial updates that write one slice. Reading it against the injections — restricting the whole back to one summand’s range — takes a part out: the slices and extractions. These are the two directions the title names. A coproduct does not come with projections the way a product does, so “the other direction” is not a categorical projection; it is the partial inverse of an injection, the restriction of the joined axis to one component’s image. Split is concatenation read backwards, and that is all the relation there is between them: one structure, the disjoint union with its offsets, traversed two ways.
The gradients make the swap literal. The gradient of a gather is a
scatter: differentiate a; b => a^b and the incoming
gradient on the joined axis must be distributed back to a
and b, each component reading its slice — which is the
backward direction, an extraction per source. The gradient of a
slice is the reverse: differentiate a^b => a and the
prefix’s gradient must be placed back into the prefix slice of the
source’s gradient, zero elsewhere — the forward direction, a
placement. So concatenation and split are not only two directions of one
structure; each is the gradient of the other. The four operations —
concatenate, split, and their two gradients — are the injections and
their reverses, nothing more.
In the assignment IR this reversal has a name, and the reason it
needs a dedicated one — rather than the slot permutation the second post
used for ordinary gradients — is concrete. Every accumulating operation
in the IR writes a single left-hand side: a Unop,
Binop, or Block reads its operands and
accumulates into one tensor. The forward concatenation
a; b => a^b fits that mould as a Block —
its Concat sits on the single written result and selects,
per position, which one of the several read operands is live (at most
one, by the disjoint cover). Its gradient does not fit. The incoming
gradient on the joined axis has to be scattered back to several
tensors, a’s gradient and b’s, with the
Concat now choosing per position which one to
write. That is a masked array of write targets, and no
single-left-hand-side form expresses it; nor does any permutation of
buffers within such a form, since a permutation relabels the one write
target it has and cannot conjure a second. The IR provides exactly one
construct whose written-to side is an array — Rev_sides —
and it works by reuse, not re-derivation: it takes the forward
Block, makes that block’s single left-hand side the tensor
read from and its operand array the masked tensors written to, and
carries the Concat, the offsets, and the coupled factor
over untouched. A gather selects among reads and writes one; its
gradient reads one and selects among writes; the move between them swaps
which side is the array, and that is the move a permutation cannot make
and Rev_sides is.
Block and Rev_sides are the only
accumulating operations the IR lets carry concatenated axes, and the
reason is the disjoint cover. A Concat index is
well-defined as a write only if no result position is claimed by more
than one source: the sources’ slices must not overlap. The lowering
enforces this directly. As it walks the coupled factor it determines,
for each position, which component’s range is being traversed, admits
only the source whose component is active there, and resolves the
Concat to that component’s iterator offset by its
cumulative size. If two sources were ever viable at one position — if
their slices overlapped — the operation would try to combine both into
one cell, and the lowering rejects it as ambiguous. It is allowed for
no source to be active at a position: a concatenation may leave
gaps, which is how a partial update writes one slice and leaves the rest
of the target untouched. At most one source per position — exactly one
where the slices cover the result, none where they leave a gap — is the
coproduct’s universal property restated as a code-generation check.
To see the surface syntax, take the single-source slice
a^b => a in %cd. The forward extraction
reads the prefix of the concatenated source into its whole output:
out =:+ source ~logic:"a^b => a"
The leading =: clears out to the
accumulator’s neutral element and + is the accumulating
operator — though on a write-once projection like this slice, where
every output cell is assigned exactly once, the accumulation has nothing
to do and the line coincides with a plain set,
out =: source. The lowering knows this: when the projection
is injective it emits a set and skips the accumulation
altogether. (The colon-first spelling is the one the second post traced
to OCaml fixing an assignment operator’s meaning by its first character,
which is why there is no +=.) The accumulation earns its
keep in the gradient, which adds rather than erases because
source is shared with whatever else reads it:
source.grad =+ out.grad ~logic:"a => a^b"
Both of these are Blocks — a single-source slice and a
single-target partial write — and ~logic builds each
projection straight from its spec; neither is reversed.
Rev_sides is what the gradient builder reaches for in the
case of the previous paragraph — the multi-source concatenation whose
gradient scatters to an array of tensors. An operation is defined by a
pair of builders, a forward op_asn and a backward
grad_asn, and both are handed the same
projections (the gradient builder also receiving the
incoming gradient and the operand tensors). The gradient is therefore
not re-derived from a written spec: grad_asn already holds
the forward’s projections, and for a concatenation it builds the
Rev_sides-tagged assignment directly from them,
array-valued write side and all. Only the many-target scatter truly
requires the reversal — a single-target gradient could be a
Block — though the ++^ combinator builds every
concatenation gradient as a Rev_sides, the one-source case
included.
The empty summand: dimension zero and
discardable_vars
Concatenation extends the dimension order the first three posts were built on.
The first post’s order had a bottom: 1_\emptyset, the claim-free size-one axis, the thing a scalar carries and the thing rank-broadening inserts. It is the bottom of broadcasting — the unit that a smaller operand becomes when it has to fit a larger one, the identity of the operation that fills an axis by replication. In the algebra of sizes under broadcasting, 1_\emptyset is the unit of the product: replicating along a size-one claim-free axis changes nothing, A \times 1 \cong A.
Concatenation has a different unit, and it is not one but zero. The unit of a coproduct is the empty summand: A \sqcup \emptyset \cong A, joining an empty range to a range leaves it unchanged. A length-zero axis is the empty range, and so a concatenation that includes an empty component is a concatenation with one fewer component — the empty slice contributes no positions, places nothing, and the offsets of everything after it are unmoved. For the coproduct to have a unit at all, the size domain must be allowed to reach zero, which the first post’s domain (sizes \geq 1, with closing guessing unforced axes to 1) deliberately could not.
So concatenation requires a controlled extension: in concatenation
contexts, certain dimension variables are permitted to resolve to
zero, a size that is otherwise illegal. These are the
components that a specification may discard or leave empty, and the
system tracks them as a set of discardable variables
(discardable_vars in the code). Permitting that zero is a
deliberate exception, confined to exactly these variables and named for
what licenses it: a variable is discardable precisely when its component
can be dropped. When closing guesses unforced variables to their minimal
value, an ordinary variable guesses to 1 (the product unit, the first post’s policy)
but a discardable variable guesses to 0
(the coproduct unit). The two bottoms sit side by side, each the
identity of its own operation, and which one an axis falls to is decided
by whether it is a discardable summand of a concatenation.
It is worth being exact about what “two bottoms” means, because they do not live in one order. The first post’s order is the broadcast order: a d_1-axis is below a d_2-axis when the first broadcasts into the second, which puts 1 at the bottom — a unit axis broadcasts into anything — and leaves distinct concrete sizes mutually incomparable. Zero does not slot in beneath 1 there. A length-zero axis broadcasts into nothing and nothing broadcasts into it: you can neither replicate an empty axis to fill a populated one nor collapse a populated one to empty. Under broadcasting 0 is not lower than 1, it is incomparable to everything, off the order. The two units belong to two structures sharing one carrier — broadcasting with unit 1, concatenation with unit 0 — so closing an unforced axis is not “descend to the bottom of the lattice” but “guess the unit of the operation that owns it”: 1 where broadcasting left it free, 0 where a concatenation left a component free as the empty summand. The discardable set records which operation owns which free axis.
Which components are discardable is a real question. Consider what
“discardable” should mean. In a^b => a, the result is
just the prefix; the suffix b appears in the source spec
but contributes nothing to the result, so b may be empty —
it is the part being sliced away, and an extraction of the prefix should
work whatever the suffix’s size, including zero. In
a^b => b symmetrically a is discardable.
But in a^b => a^b neither is: both components appear on
both sides, both carry real data through, neither may vanish. The rule
has to distinguish “this component is projected away on the other side”
from “this component is carried through.”
The precise condition is a quantifier alternation. A variable v, appearing in one component of a
Concat on one side of the spec, is discardable when:
for every shape on the other side, there exists an axis such
that for every component of that axis, the complement of v’s component intersects it — where
the complement is the union of variables from the other
components of v’s own concatenation.
Read operationally: v’s component is
discardable exactly when, on the other side, v’s siblings already cover some axis
completely, so that v itself is
redundant there and may be dropped to nothing. The worked cases:
a^b => a: the complement ofbis \{a\}, and on the result side the single axisais covered by \{a\} — sobis discardable. (Symmetricallyaina^b => b.)a^b => a^b: the complement ofbis \{a\}, but the result axis is itselfa^b, whose components areaandbseparately; \{a\} covers theacomponent but not thebcomponent, so the “for every component” clause fails. Neither variable is discardable — correct, both carry through.a; b => a^b: two sources,aalone andbalone. Forbto be discardable its complement \{a\} would have to cover an axis of every source, but the second source is justb, which \{a\} does not touch. Neither is discardable.a; b => a^c: herecis not discardable, because its complement \{a\} fails to cover the second sourceb. The slotcis left unconstrained, not redundant, and must be solved rather than zeroed.a; b => a^b^c:c’s complement is \{a, b\}, and each source —aalone,balone — is an axis fully covered by \{a, b\}, socis discardable. The result names a third component that no source fills, and it falls to the empty summand: the spec computesa^bwith a vestigial empty slot.
The alternation — for all shapes, there exists an axis, for all
components — is doing exactly the work of asking whether a summand is
already accounted for elsewhere and may therefore be the empty summand.
It is the syntactic test for “this is a unit of the coproduct here,” and
it is what lets one notation express both a concatenation (every
component carries through, nothing discardable) and a slice or partial
update (the unwritten components discardable to zero) without a separate
operator for each. Specifications with integer-sized components —
3^a => a, skip three at the front;
a => a^3, write all but the last three — are the same
idea with the discarded summand’s size given rather than inferred: the
constant component becomes a fresh internal symbol that the projection
skips, a known-size empty-on-the-result slice.
The a; b => a^b^c case is worth pausing on as a
design question rather than a derivation, because it is the rule at its
most permissive. The spec is accepted and c falls to
nothing, which is graceful when the empty slot is meant — a fixed
template prefix^middle^suffix reused in a call that
supplies only two of the three parts, the third defaulting to empty —
and it is the very mechanism that lets the ^ notation cover
slices and partial updates with no new operator. But it also accepts a
typo: a user who wrote c expecting a third source and
forgot to pass it gets a silently empty component rather than an error.
Whether that should stay silent, warn, or demand that an intentionally
empty slot be marked explicitly — some annotation separating “leave this
empty” from “a source belongs here” — is open, and the coproduct reading
is neutral on it, since an empty summand is well-defined either way. The
decision is about how much the surface notation should let inference
guess on the user’s behalf: the same tension the first post’s closing
policy raised, now with 0 rather than
1 as the value being guessed into
existence.
Operationally, a discardable variable is a concatenation component the operation does not iterate. A component is discardable just when it is projected away on the other side of the spec — written in the notation but contributing no indexing to the computation — and an axis the operation never reads needs no extent, so its size is free to fall to zero. “May be the empty summand” and “is not iterated here” are one fact seen from the size end and the projection end. The size is still shared across the program, so a discardable variable can be forced positive from elsewhere: if the same variable indexes a real axis in another operation it takes that size there, while the operation that discards it simply never loops over it. Discardability is a verdict on an occurrence in a spec, not on the variable globally.
There is care needed in the solver so that this extension does not
destabilize the rest. The first post’s closing policy guessed unforced
axes to the bottom of the order; now there are two bottoms, and guessing
the wrong one is a bug, not a broadcast. So the discardable set is
computed up front from the specification’s structure (the quantifier
test above, run over the spec’s two sides) and consulted at exactly the
closing stages where minimal-value guessing happens, which guess
discardable variables to 0 and
everything else to 1. And unification
of Concat terms is mostly structural — two concatenations
of equal length unify component-wise — with one inequality case held
open: a concatenation almost all of whose components are discardable can
be left as an inequality rather than forced to an equation, so the
solver can still infer the one live component from context. The
machinery is small, but it is the first place the size domain has more
than one floor, and the second floor is there because the coproduct
needs a unit.
Block matrices: the coproduct in more than one axis
Nothing restricts a single operation to one coupled factor. A specification may concatenate along several axes at once, and when it does, the coproduct structure reaches into more than one factor of the product space — which is exactly a block matrix and its relatives.
Take a result whose two axes are each a concatenation: a 2\times2 block matrix assembled from four
sub-blocks is a result indexed r^s, c^t (rows split into
r then s, columns into c then
t), with each source block written to one (row-slice,
column-slice) rectangle. Here two factors are coupled, the row factor
and the column factor, and the active component must be chosen
independently in each. The disjoint-cover invariant now lives in two
dimensions: the rectangles tile the result, and at each point exactly
one block is active — the pair of active row-component and active
column-component picks it out. The lowering handles the multi-axis case
by the same machinery as the single-axis one, the active-component test
applied per concatenated axis and the offsets accumulated per axis; the
no-overlap check that guarded one coupled factor now guards the pair,
rejecting any position where two rectangles are simultaneously
viable.
The case worth dwelling on is when the concatenations across different axes share symbols. If the row split and the column split are driven by the same iterators rather than independent ones, the active rectangle is constrained to the diagonal — the operation writes a block-diagonal or banded result, the off-diagonal blocks never active because no point makes two unrelated components simultaneously live. Disjoint symbols across the concatenated axes give a full block tensor; overlapping symbols give a partially diagonal one. That a single mechanism — couple the axes, tile the factors, admit at most one summand per cell — produces both block assembly and banded structure, with the difference being only whether the couplings share iterators, is the clearest measure of how far the coproduct reading reaches. It is the same observation the convolution post made about the affine index, that one enriched term covered convolution, striding, interleaving, and their inverses by varying a coefficient; here one coupled-factor mechanism covers concatenation, slicing, block matrices, and banded tensors by varying how many axes are coupled and whether they share symbols.
(This is a statement about the operation, not about any particular surface syntax. What notation should most conveniently describe a block assembly — whether the natural thing for a user to write is a blocking spec or a stacking construct — is a separate question, returned to below. The machinery underneath is concatenation along one or more axes, however it is spelled.)
A later seam
The convolution post gave up one property to get the affine index: flatness of the solve. Its index could name loop variables the solver had not yet minted, so the solve ran in two strata — fix the base co-iteration classes, then evaluate the derived terms against them — and a check between two derived terms could fire only after the stratum that evaluated its operands, so a conflict could surface later than where the constraint was written. Concatenation continues down this road.
The connected-component decomposition is a structural pass that runs after the co-iteration union-find, on its output: it cannot group symbols into factors until the symbols exist. And the disjoint-cover and ambiguity checks fire later still, at lowering, when the offsets are concrete and the lowering can ask, position by position, which component is active — an overlap is detected there, when two components turn out viable at one cell, not when the spec was parsed. So the monotonic-core-with-frozen-seams picture from the convolution post (where padding accumulated under monotonic rules and was committed once at finalization) gains another seam: the coupling and its tiling check are resolved past the point where shapes are closed, on the far side of the same finalization boundary. The core stays monotonic; the non-monotonic commitments cluster at the seams; concatenation adds one.
Reaching the surface, and a note on what to build downstream
At the surface, concatenation is the einsum notation of the
convolution post with ^ joining axis components. Under
%op the operator ++^ expands to
concat (also spelled concat_sum): it assembles
a Block from an array of sources against a spec —
(a, b, c) ++^ "x, ...; y, ...; z, ... => x^y^z, ..."
joins three tensors along their first axis — accumulating by sum, as the
name records; its Rev_sides gradient likewise adds into the
source gradients. The capture list from the convolution post carries
over unchanged: ++^ takes spec labels whose inferred sizes
are bound for reuse, the same variable_ref mechanism, now
reporting the size a component resolved to. Single-source members of the
family — slices and partial updates — need no special operator; they are
ordinary %cd assignments whose ~logic carries
a ^ spec, as with the
out =:+ source ~logic:"a^b => a" above. And summed
placement is only the default: because an operation is nothing but an
op_asn/grad_asn pair of %cd code
over a projections, a user can run a concatenation’s
coupled projection through arbitrary arithmetic instead of an identity
copy, fusing the join with a computation in one pass. %cd
gives that directly.
That leaves stacking, which is a near neighbour of concatenation and
a good illustration of what the coproduct reading buys downstream, but
is not a concatenation. Concatenation extends an existing axis:
it joins length-A and length-B data into one
length-(A+B) axis, and the components may differ in size.
Stacking introduces a new leading axis: it takes N
equally-shaped tensors and produces one tensor with a fresh axis of size
N, the components stacked along it. In coproduct terms the
difference is sharp. Concatenation is a coproduct with possibly-unequal
summands glued into an existing factor. Stacking is the regular
case — N summands all of the same size — and crucially it
makes a new factor rather than enlarging one, so its iteration
is N × (component shape), a genuine product of the new axis
with the component, where concatenation’s is a sum within one
factor.
The implementation question is how to build stacking on the machinery
just developed, and the coproduct reading points to an answer. One could
desugar a stack into a concatenation through the einsum path — give each
component a fresh size-one axis and concatenate along it, N
unit slices tiling a new length-N axis — and the
documentation at one point described doing exactly that. But that routes
a regular, equal-sized, new-axis operation through the irregular,
possibly-unequal, existing-axis spec language, and overloads the einsum
notation to carry it. The cleaner design, which the structure points to,
is to give stacking its own shape-logic entry — it is a
different operation, with a regularity the spec language does not need
to express — while having it share the projection
machinery with concatenation, since a stack of N
equal components along a fresh axis is precisely the regular coupled
factor, the same connected-component-and-offsets code with all offsets a
multiple of one component size. A dedicated entry for the size logic,
shared coupling for the projections: the operations are siblings in
projection and distinct in shape, and the code follows that exactly
rather than forcing one to wear the other’s notation. (Tensor literals
fall out as the scalar base case of stacking, the way the convolution
post observed the affine index already contained the iterator and the
fixed index as its degenerate forms — a literal is a stack of
zero-dimensional components, the brackets that separate output, input,
and batch axes being stacks along each kind.)
A surface detail of stacking is worth recording because it is a
parsing question the coproduct framing does not settle on its own: the
bracket that introduces a new input axis is the parenthesis-comma form
(a, b), which collides syntactically with an ordinary
multi-argument application. The disambiguation is contextual and rests
on a structural fact about OCANNL shapes — input axes are always nested
within output axes — so the parser can tell a stacking construct in axis
position from an argument list in application position by where it sits.
The details are not yet pinned down, and they are a syntax concern
rather than a shape-inference one; the article’s job is the machinery,
and the machinery does not care whether a given stack was written with
parentheses, brackets, or array bars.
Where this leaves the series
Each of the first three posts found a familiar operation hiding inside a humbler one and reused the core rather than extending it. Broadcasting was an order, and rank-polymorphism lived in the order with no type schemes added. Contraction was emergent, read off an output index map that omitted a loop variable, with no “reduce” primitive. Convolution was a contraction, the same loop nest with one index made affine, and striding and interleaving and their gradients came along for free in the same enriched term. The refrain was: the construct populates a corner the machinery already had.
Concatenation is the one that enlarged the machinery, and the enlargement can be named precisely. The iteration domain, a pure product of ranges through three posts, gains its first coproduct: one factor becomes a disjoint union, its components coupled into sequential traversal, their offsets the coproduct’s injections, their non-overlap its at-most-one-summand condition, and the size domain extended to zero — where the first post’s order stopped at one — to give the coproduct a unit. Once the domain has a coproduct, the rest is two directions through it. Concatenation runs along the injections and places components in; split runs against them and reads a component out; the gradient of each is the other, the same offsets with the assignment’s sides reversed. Block matrices are the coproduct in more than one factor at once; banded tensors are that with the couplings sharing iterators. One coupled factor, read forwards and backwards, is the whole of what concatenation added to a domain that until now knew only products.
OCANNL is open source, at github.com/ahrefs/ocannl.