Convolution Is a Contraction: Indexing and Initialization in OCANNL
Convolution Is a Contraction
Here is the loop nest a one-dimensional convolution compiles to, one input channel and one output channel, so that only the slide is in play:
for o in 0 .. O-1:
out[o] = 0
for k in 0 .. K-1:
out[o] += x[o + k] * kernel[k]
There is a loop for the output position and a loop for the kernel
position. There is none for the input’s axis: the input is the largest
array in sight, yet it is never traversed, only read at
o + k, a position computed from the two loops that do
exist. Set that nest beside the one a matrix-vector product compiles
to:
for i in 0 .. I-1:
out[i] = 0
for j in 0 .. J-1:
out[i] += a[i, j] * v[j]
The two are the same nest. Both accumulate a sum of products over an
axis the output does not have; both pre-clear the output cell because
that axis is summed away. They differ in one place: a[i, j]
reads at a pair of loop variables, x[o + k] at their sum.
This post is about that one difference.
Two companion pieces built the machinery underneath. Broadcasting Is an Order developed shape inference: what a shape is, and how a solver fills in the sizes. Contraction Is Emergent developed projection inference: how the same constraints, re-read as an equivalence over axes rather than an order over sizes, become the loop nest of an operation, with reductions read off the result rather than declared. Both deferred two constructs: strided and convolutional indexing, and axis concatenation. This post takes the first. Concatenation changes the shape of the iteration domain and is left to a sequel; convolution leaves the domain alone and changes only what is read inside it.
The third index form
The second post compiled an operation to a product space — a
Cartesian product of integer ranges, one loop variable per factor — and,
for each tensor, an index map sending a product point to a multi-index
into that tensor’s array. Each component of an index map took one of two
forms: Iter(s), the axis driven by loop variable
s, or Fix(c), the axis pinned to the constant
c. The matrix-vector product uses only these: output
(Iter i), matrix (Iter i, Iter j), vector
(Iter j), with j shared, so the product space
is {i, j} and the output’s omission of j is
what makes j a reduction axis. The user wrote
a * v; the shared axis, the contraction, and the pre-clear
were inferred.
Convolution needs one more form:
Affine(Σ cᵢ·sᵢ + o): the axis is an affine combination of loop variables, with integer coefficients and an integer offset.
The convolution’s input map is (Affine(1·o + 1·k)) — the
sum of two loops that exist for reasons unrelated to the slide:
o because the output is traversed, k because
the kernel is summed.
The new form contains the old two: Iter(s) is
Affine(1·s + 0), Fix(c) is
Affine(0 + c). OCANNL keeps all three as separate variants
and forbids an Affine carrying a single unit-coefficient
symbol or no symbols at all — such an index must be written as the
Iter or Fix it is. Convolution adds no form
that was not already the general shape of an index; it populates an
interior the core left empty. The product space stays flat: one factor
per iterated axis, each a single range. The only change is that one
index-map component now names two factors instead of one.
The reduction characterization carries over verbatim. A product axis
is a reduction axis exactly when the output map is independent of its
loop variable. The convolution’s output map (Iter o) omits
k, so k is summed, by the predicate that
summed j above. The affine index uses
k to address the input, but addressing an operand with a
loop variable is not mentioning it in the output, and only the latter
governs reduction. The kernel is contracted because the output omits it;
the arithmetic, the accumulation, and the reduction are the
contraction’s, untouched. Only the input’s address is new.
One term, two readings
The affine index comes from an enriched constraint, as every projection in the second post came from a constraint re-read. The dimension term was a size and a basis, with a per-axis identifier added in the second post for axis identity. We add a constructor:
Affine { stride; over; conv; stride_offset }
over is the transformed dimension — the output axis;
stride > 0 is its coefficient;
stride_offset, with
0 ≤ stride_offset < stride, selects a position within a
stride window; conv is an optional
{ dilation; kernel; use_padding }, absent for pure striding
and present to add the kernel. Two solvers read this one term, and
reading it from one source is what keeps the input’s size and address
from disagreeing.
The size solver asks how big the input axis is, given the output. For
pure striding, input = stride · output. For valid
convolution (use_padding = false),
input = stride · (output − 1) + span, where
span = 1 + (kernel − 1) · dilation is the range of input
positions a dilated kernel covers: the largest input index reached is
(output − 1)·stride + (kernel − 1)·dilation, at the last
output and kernel positions, and the axis is one longer than that.
Solving for the output,
output = (input − span) / stride + 1, requires
input − span non-negative and divisible by the stride; an
image and kernel that do not tile leave no integer output, and the
constraint fails. For padded convolution
(use_padding = true), input = stride · output
again: the padding absorbs the kernel’s overrun, so the kernel enters
the address but not the size. The solver simplifies where it can — a
stride = 1 term over a known dimension is that dimension, a
valid-mode term with a known kernel evaluates the span — the same solver
the first post built for broadcasting, now with more to compute than an
order.
The projection solver asks which input cell a product point reads. The same term evaluates to
stride · output_iterator + dilation · kernel_iterator + stride_offset − padding
the kernel term and padding present only in convolution mode:
coefficient stride on the output loop,
dilation on the kernel loop, a constant offset.
stride_offset is where the two readings come apart most
cleanly. It appears in no size relation; the same input is read at every
offset from 0 to stride − 1 with no size
changing. A size concern would grow the tensor; an addressing concern
moves the read.
Solve, then evaluate
Writing Affine(1·o + 1·k) for the input’s index assumes
o and k already exist — the actual loop
variables minted for the output and kernel axes. But minting those is
the solver’s job, and the input’s index is something the solver is meant
to produce: the index is a function of results the solve has not yet
computed.
So the projection solve runs in two strata. The first is the core
solver of the second post, unchanged: union-find the base classes (here
the k-class, shared by the input’s affine expression and
the kernel, and the o-class, shared by the expression and
the output), pin any class forced to a constant, and mint one fresh loop
variable per iterated class — except for an axis whose index will come
from an affine term, which gets no loop of its own, its index being
computed rather than iterated. The second stratum walks the deferred
affine terms and evaluates each against the loop variables the first
produced.
The second post noted that its core projection language had no variable form: every index was an atom the union-find settled, leaving nothing to stand for an unknown. Here a variable form returns. An affine index names loop variables the first stratum has not yet minted, so the solver carries a projection variable and a small binding environment to hold the not-yet-known iterator until the second stratum resolves it. The variable was dispensable exactly as long as indices were atoms.
Two affine terms can be required to denote the same index, and whether they do cannot be checked until both are evaluated; the solver defers such checks to the second stratum. The core solver, the second post observed, never resolves an ambiguity — it succeeds or reports a conflict — which made it more principal than shape inference. The affine extension keeps that; there is still nothing to guess. What it gives up is flatness: a check can fire only after the stratum that evaluates its operands, so a conflict can surface there rather than where the constraint was written.
Padding
A padded convolution reads outside its input: at output
0, kernel 0, the address is
0 − padding, left of the array. The buffer is widened to
hold the overrun, and the widening is never written down — it is
inferred, in the same sense the contraction was.
OCANNL separates the two notions of size this requires. Shapes, shape
inference, and tensors carry sizes without padding; the underlying
buffers carry it. A convolution’s input shape is
stride · output; the buffer is larger by the
padding margins. The margin is computed during projection inference,
keyed by the input axis’s projection identifier, as the running maximum
of the kernel extents of every operation addressing that axis: an axis
read by a 3-wide kernel and elsewhere a 5-wide one is padded for the 5,
and the maximum is what makes every reader’s overrun fit. For a kernel
of size k the margin splits
right = (k + 1) / 2, left = k − right.
That maximum tightens monotonically as constraints are processed and is committed once, at finalization — the boundary at which the first post substituted unsolved shape variables by their bounds, and the second required shapes closed before projections could be read. Inference in OCANNL is monotonic in its core rules and non-monotonic only in the large: something must decide when a shape is final, because a tensor cannot be allocated until it is. Padding is one more quantity accumulated under the monotonic rules and frozen at that seam; after the freeze the buffer can be allocated, before it the margin can still grow to fit another reader.
Striding without a kernel
Drop the kernel and the same machinery describes plain strided
access. The term
Affine { stride; over; conv = None; stride_offset } has
size reading input = stride · output and index reading
stride · output_iterator + stride_offset — one loop
variable scaled by the stride, plus an offset, no second iterator and no
padding. This is the smallest affine index: a single coefficient and a
constant. It is enough to express operations that have nothing to do
with convolution.
Interleaving is the clearest example. Given two tensors of the same length, produce one twice as long, alternating their elements: the first tensor’s values land in the even positions, the second’s in the odd. In OCANNL it is two strided copies:
t =:+ id t1 ~logic:"…, i => …, 2*i";
t =+ id t2 ~logic:"…, i => …, 2*i + 1"
Each line copies a source cell i to a result cell at an
affine position. The first writes to 2*i — stride two,
offset zero. The second writes to 2*i + 1 — stride two,
offset one. The stride spaces the writes out; the offset is what makes
the two copies miss each other and tile the result exactly. There is no
kernel, no contraction, no padding; the whole content of the operation
is in the two affine output indices, and they differ only in the
constant the first section called stride_offset.
Two things are worth drawing out. First, the affine index here sits on the side written to, not the side read — convolution addressed an operand, interleaving addresses the result. The form is indifferent to which: an index is an index, whether it places a read or a write. Second, the offset, which in a strided convolution merely chose a phase within each window and never mattered to a result, is here load-bearing: it is the entire difference between the even copy and the odd one. The same constant that was a projection-time afterthought for convolution is, for interleaving, the operation.
The inverses are the same form read the other way. Extracting the
even-indexed elements is a copy from 2*i to i;
the odd elements, from 2*i + 1 to i. And the
gradients are the inverses again: the gradient of the even-extraction
scatters i back to 2*i, returning the strided
write the forward pass undid. A whole small family of layouts —
interleave, its two projections, their gradients — is generated by one
coefficient and one offset, varied.
What the notation carries
The surface syntax is the einsum notation of the second post with one
addition: a label position may hold an affine expression. An expression
like stride*oh+kh triggers multi-character mode, where axes
are comma-separated identifiers and stride,
dilation may be integer literals or, under the syntax
extensions, spliced-in integer variables.
"... | stride*oh<+kh, stride*ow<+kw, ic -> oc; kh, kw, ic -> oc => ... | oh, ow, oc"
is a 2-D convolution: spatial input axes as affine expressions,
ic the contracted input channel, oc the output
channel, oh, ow, oc the result. The < after
the output label marks valid mode; written =, or omitted
where use_padding is in scope, it marks padded mode. The
notation is the affine index written down: input position equals
stride times the output iterator plus the kernel
iterator.
The product space stayed flat throughout: one factor per axis, each a single range. The construct that breaks this is concatenation, where several axes share one factor and iterate within it in sequence; that is the sequel. Two smaller pieces of OCANNL’s indexing fit here first, because each reuses machinery the affine index already introduced.
Part 2: Vectorized Operations and Initialization
A vectorized operation writes several output cells per step, the
converse of the contraction’s several reads per write. Initialization,
where a tensor’s cells are first filled, is where shapes are seeded from
data. Where convolution enriched the dimension term with an
affine constructor, these two enrich the row constraint instead
— with Total_elems and Exact, the forms that
say how many elements a row has, or which axes it has exactly. The two
enrichments are siblings, growing different parts of the same constraint
language; they touch at one point, a shared notion of a strided count,
reached below.
A vectorized operation writes several cells per step
The map-reduce of the second post read one value per right-hand cell, combined them with a scalar function, and wrote one value per output cell. A vectorized operation breaks the last symmetry: it reads one cell and writes several. OCANNL has one such operation at present, the conversion from a 128-bit random word to a block of typed numbers, and its shape is the cleanest illustration of the idea.
The operation is uint4x32_to_prec_uniform. Its input is
a tensor of uint4x32 cells — each a 128-bit value, the
output of a counter-based random generator — and its output is a tensor
of floats (or other typed numbers) at some target precision, drawn
uniformly. One 128-bit word yields as many output numbers as fit:
sixteen bytes, eight half-precision floats, four single-precision, two
double, one if the target is itself uint4x32. The count is
16 / prec_in_bytes(target). The vectorized step reads one
input cell and writes that many output cells.
Shape: an element-count relation
The output has more elements than the input, by exactly the precision
ratio, and this is a relation across rows rather than within an
axis. The convolution related one input dimension to one output
dimension, input_dim = stride · output_dim; here the
relation is between the operation’s two ends as wholes: the output’s
total elements are coeff · (input's total elements), with
coeff the precision ratio. That needs a different
constraint form than a dimension equation. The shape logic emits two.
The input’s rows are pinned to a single fresh variable,
Exact [Var v] — the operation assumes a single input axis,
so as not to force minimum sizes on the output. The output’s rows carry
Total_elems { numerator = Strided_var { coeff; var = v; denom = 1 } }:
the output has coeff · v elements in total, where
v is the input’s element count.
Total_elems is the row constraint that says “these rows,
inclusive of their row variable, have this many elements,” and
Strided_var is its numerator when the count is not yet a
literal but a coefficient times an unknown — a placeholder for “the
element count is coeff · v, for a v to be
solved.” This is the one point of contact with convolution promised
above: the coeff of a Strided_var is a stride,
the same kind of multiplicative coefficient the affine dimension
carried, but here it scales a whole row’s element count rather than a
single axis. The constraint solver resolves v from
whichever side is known first — an input element count fixes the
output’s, or a required output count fixes the input’s — and the
coefficient is forced at the stage that resolves precision-derived
coefficients, the second post’s stage 2. On the shape side a vectorized
operation is a tensor whose two ends are related by a multiplicative
element-count constraint — the same row-constraint form a reshape uses,
which we meet again under initialization.
Projection: one read, several writes
On the projection side the operation iterates a product space as any
operation does, but its assignment is not the map-reduce’s. It lowers to
a dedicated form, Set_from_vec, carrying the output index,
the input read, the vector operation, and a length — the
number of output cells the one read produces. Where the second post’s
accumulation read several cells and wrote one, this writes
length contiguous cells from one read.
The length is computed at lowering, from the same
precision ratio the shape constraint used. The index maps are otherwise
ordinary — the product iterators substitute into the output and input
indices as usual — and the construct deliberately does not admit the
coupled indices of the next post: a concatenated index has no meaning
for a vectorized write, and the lowering rejects it. The vectorized
operation adds no index form, only a fan-out at the leaf of the loop
body, sized by the ratio that sized its shape.
Surface
The conversion is a unary primitive, written
uint4x32_to_prec_uniform in the %cd and
%op notations, classified “dedicated” rather than pointwise
because of the element-count change. It is rarely written directly. Its
role is the last step of random initialization: a counter-based
generator, threefry4x32 (the binary primitive
^^^^), produces uint4x32 words, and this
conversion unpacks them into uniform numbers at the tensor’s precision.
How those words are produced, and why the result is deterministic, we
return to once initialization is on the table.
Initialization seeds shapes from data
A tensor’s cells must start somewhere. In OCANNL the starting value
is a Fetch — a reset of an array by a computation or a data
read — or a Data init carrying an array literal. Both are
terminal shape logic: a leaf of the expression graph, with no
sub-tensor to take a shape from, so whatever shape information exists
must come from the initializer itself. This is the from-usage inference
of the first post seen from its other end. There, a leaf’s shape was
forced upward by its uses; here, an initializer can pin a leaf’s shape
from below, and the two meet in the closing policy.
Shape: pin exactly, or constrain the count
The terminal initializers split by how much they say about the shape,
and the split is exactly the
Exact-versus-Total_elems distinction.
An initializer that knows its axes emits Exact.
Keep_shape_no_padding and Padded carry an
array whose dimensions are taken verbatim,
Exact [d₁; d₂; …], pinning each axis. Slice —
taking one batch index of an existing tensor — emits the sliced tensor’s
dimensions with the leading axis dropped, again Exact.
These leave nothing to inference: the leaf’s shape is the data’s
shape.
An initializer that knows only its size emits
Total_elems. Reshape carries an array but lets
its elements be laid into whatever axes inference assigns, so it
constrains only the product:
Total_elems { numerator = Num_elems n }, where
n is the element count. Constant_fill — a
small array unrolled into the cells, the rightmost axis contiguous —
does the same. These are shape-polymorphic data: the same numbers fill a
vector, a matrix, or a batch of either, as the surrounding computation
decides, and Total_elems is what lets the count be honored
while the axes stay free. It is the row constraint of the vectorized
operation again, with a bare count for a numerator instead of a
coefficient times a variable.
The remaining fetch-ops — Constant,
Constant_bits, Range_over_offsets,
Embed_symbol, Embed_dim,
Embed_self_id — assert no shape at all; they mark the shape
terminal and let the closing policy of the first post finish it, upward
to the uses’ least upper bound, or downward to the unit shape if nothing
pins it. A constant fills whatever shape it lands in, exactly as the
scalar that opened the first post did.
Projection: the index is the data’s layout
Initializers do not take part in the operation-level projection
inference of the second post — there is no contraction to infer, no
operand to align. They lower directly to a loop over the array’s own
dimensions, writing each cell: Constant zeroes or fills,
Constant_fill unrolls its literal, Slice reads
the source at the fixed batch index prepended to the running
indices.
One is worth singling out, because it is where an index becomes data.
Range_over_offsets fills each cell with its own linear
offset — how many cells from the start it lies, in the tensor’s logical
layout. Computing that offset is the inverse of indexing: given the
per-axis indices and the dimensions, collapse them to a single number by
the row-major strides. OCANNL computes it by reflecting the projection —
folding the index array against the dimensions into a strided sum — and
embedding the result as the cell’s value. The same arithmetic that an
index map runs forward to address a cell is run backward here to label
it. (This reflection does not yet handle the coupled indices of the next
post; it is defined for the iterator, fixed, and affine forms of the
first part, which is all an offset range needs.)
Surface: literals, inline declarations, and the unit boundary
Two surface mechanisms introduce initialized tensors. The first is
the literal. An array literal uses OCaml’s list, tuple, and array syntax
to separate the three axis kinds: a list [ … ; … ] for
output axes, a tuple ( … , … ) for input axes, an array
[| … |] for batch axes, nesting for higher rank.
[ (1, 2, 3); (4, 5, 6) ] is a matrix taking
three-dimensional input to two-dimensional output. A bare number is a
scalar. An OCaml type annotation on an axis container names that axis’s
basis — ([ 2.0 ] : rgb) tags the output axis
rgb — the same dimension basis the first post used to keep
incompatible axes from fusing. A literal written this way gets its exact
shape from parsing; there is no variable for inference to resolve.
The second is the inline declaration, the record-brace syntax.
{ w } introduces a tensor named w whose shape
is left to inference; { w = expr } supplies an
initialization expression; further labelled fields supply output
dimensions or other parameters, as in { b; o = [hid_dim] }.
Under %op a declaration is a parameter, differentiable,
randomly initialized by default when no initialization expression is
given. Under %cd it is a non-differentiable intermediate,
self-referential, with no separate initializer. The distinction the
first post drew at the closing policy — a parameter errors if nothing
pins its shape, an ordinary leaf guesses the unit — is set here, at the
declaration.
Where these declarations take effect is governed by the unit
parameter the first post described. In a function
let layer = make () in …, the code before the
() runs once when make () is evaluated and
mints the parameter tensors; the body after it builds a fresh expression
on each application, reusing those same parameter values. This is an
OCaml-evaluation boundary: it decides when Tensor.t values
are constructed, and so whether two applications share a parameter or
each get their own. It says nothing about when any compiled code runs.
Execution in OCANNL is explicit and separate — a tensor’s forward code,
its initialization, and its gradient updates are each compiled to
routines the caller runs by hand — and the unit parameter is invisible
to all of it. What the unit controls is the OCaml-level sharing of
tensor values; what runs, and when, is a later and independent
matter.
What the tensor carries
A Tensor.t holds, besides its shape and its forward
code, a set of tensors called its params — the descendants
whose own forward code is not folded into this tensor’s forward
code, because it is meant for initialization rather than recomputation.
An ordinary subexpression is embedded: its forward code becomes part of
the enclosing forward code and runs whenever the forward routine runs. A
parameter is held out: its forward code — a literal, or the random
initializer built below — is collected separately, and what the
enclosing computation embeds is only a read of the parameter’s array.
Gathering that held-out code, transitively through parameters of
parameters, produces an initialization computation, which is compiled
and run as its own routine; a skip-check on what is already initialized
keeps a parameter from being seeded twice across routines that share it.
The params set is thus a statement about where code
goes — held out of the forward routine, gathered into the
initialization routine — not about when anything runs; the staging is
explicit and in the caller’s hands.
The set carries a second reading, the one a user of a model sees. The
descendants that need initializing before the forward code can run are,
read another way, the trainable parameters of the expression when it is
taken to be a model. There is no separate notion of “the parameters of a
model”: a model in OCANNL is just a tensor, and its parameters are the
members of its params set. The two meanings coincide
because they are one fact about a leaf — that it is held out of the
forward code, its value supplied separately and then only read, rather
than recomputed inline. Held out for initialization and exposed for
training are the same property under two names.
Random initialization, assembled from the parts
The default initializer for a parameter — uniform random values — is not a primitive. It is an expression built from pieces already on the table, and seeing how it is built shows where the determinism comes from. Stripped to its skeleton, the uniform generator is
uint4x32_to_prec_uniform
(threefry4x32
(threefry4x32 (embed_self_id ()) (random_seed ()))
range_over_offsets)
Read it inside out. embed_self_id is the
Embed_self_id fetch-op of the initialization subject: a
leaf that fills its cell with the tensor’s own integer id.
random_seed is a global seed tensor, shared across the
program. The inner threefry4x32 — the counter-based hash,
the binary primitive ^^^^ — combines the two, hashing the
global seed under the tensor’s id to produce a sub-key unique to that
tensor. The outer threefry4x32 combines that sub-key with
range_over_offsets, the fetch-op that fills each cell with
its own linear position: so each cell is hashed under its own offset,
giving every cell of the tensor a distinct value from one sub-key.
Finally uint4x32_to_prec_uniform, the vectorized operation
of the first subject, unpacks each 128-bit hash into typed uniform
numbers.
Every part is something the article has already introduced — two leaf fetch-ops, the threefry primitive, the vectorized unpack — composed into an ordinary tensor expression; nothing is special-cased. The determinism falls out of the composition. The key is split by the tensor’s id and the counter is the cell’s offset, so the values a parameter receives are a pure function of its identity and the global seed: no mutable random state threaded through the program, no seed argument passed down into the layers that build parameters. A tensor’s id is already in scope — it is the tensor — so the splitting is implicit, and re-running with the same seed reproduces every cell exactly. Counter-based generation keyed by identity is what makes pseudorandomness both deterministic and argument-free.
Closing
Three extensions, three reuses. The affine index is a function of
loop variables, read from one term by two solvers and resolved in a
second stratum. The vectorized operation is a fan-out at the loop leaf,
sized by the same kind of count constraint a reshape uses.
Initialization is from-usage inference run from the data end, pinning a
leaf with Exact or constraining its count with
Total_elems, lowering an index backward into data where an
offset range needs it, and held out of the forward code in the tensor’s
params set — the same leaves that are a model’s trainable
parameters. None adds a mechanism the first two posts did not already
contain; each populates a corner the core left empty. Random
initialization adds nothing further at all: it is those corners composed
— two leaf fetch-ops, a counter-based hash, and the vectorized unpack —
into an expression whose determinism comes from keying the hash on a
tensor’s identity, no state and no arguments threaded through. The
construct that genuinely enlarges the machinery — coupling several axes
into one iteration factor, and reversing a gather into a scatter — is
concatenation, and it is the sequel.
OCANNL is open source, at github.com/ahrefs/ocannl.