Merge and Selection: The Residual Stream as a Constraint Grammar, and the Two Readings It Cannot Reconcile
Łukasz Stafiniak and Claude (Opus 4.7)
Abstract
The dominant way of relating linguistics to the Transformer has been to probe for recoverable syntax. This essay asks a different question: whether the architecture’s core update — attention plus residual addition, iterated over layers — is itself a grammatical operation a syntactician would recognize. We argue it is a monotone merge: because the residual stream only ever adds, never overwrites, it performs the join of a constraint grammar rather than the destructive rewrite of a phrase-structure grammar — which makes Head-driven Phrase Structure Grammar, not context-free grammar, the right point of comparison, and Gradient Symbolic Computation the right account of how a continuous vector can carry a discrete feature structure at all. The resemblance is to unification’s product — an accumulated feature structure — more than to its engine: classical unification is driven by the structures themselves, which match or clash by what they are, whereas in the Transformer the combination at each position is directed by that position’s query — a target selecting which sources to gather, without itself entering the weighted sum — while the gathered structure is built from the source material by the MLP. The operation thus has the monotone shape of a join without the structure-driven matching the word unification names, and the three-way split it reveals — target directs, sources supply, MLP assembles — is the selection-from-merge factorization the essay turns on. We develop both correspondences in full, with the apparatus each requires: tensor-product representations and the Harmony-with-scheduled-quantization dynamics on the GSC side, and the explicit-role-binding program on the side of how a Transformer represents who-combined-with-what. The correspondence then meets a fault line that is the real subject of the essay. 
HPSG is head-driven: it rests on two principles, the unification of feature structures and the projection-and-selection of a single head per combination, which coexist peacefully because each combination is a headed merge of determinate daughters. The Transformer supplies a clean analogue of each — the residual write as the merge, the query/key inner product as the selection — but the two cannot be reconciled, for a single reason: in a Transformer, the selector is not part of the value space. The query that governs each combination is never among the value vectors the merge sums, so the head — the one element a headed grammar must deposit into the mother — is categorically a selector over the soup rather than a constituent in it, and the same exclusion produces the binding problem of stacked attention as its other face. We argue that this irreconcilability is the structure of the object rather than a defect to be repaired: a Transformer layer is a constraint grammar that runs selection and merge at every position at once, and the discrete parse, if there is one, exists only as the quantization limit of a gradient blend the dynamics may or may not drive to a corner — and even at its sharpest, a tree without its heads. A companion essay takes the same fracture as the seed of a more general claim about where the architecture sits among grammatical formalisms; here we establish the fracture itself, and the two bodies of theory — HPSG and GSC — that bring it into focus.
1. Introduction
The dominant interface between linguistics and the Transformer has been probing: train a classifier on hidden states, ask whether syntactic information is recoverable, report that it is. This is worth knowing, and the accumulated finding — that syntactic structure is robustly recoverable from the hidden states of trained language models, often more robustly than semantic structure — is real. But it answers a question adjacent to the one we want. A probe that recovers a parse tree from layer eight establishes that the information is present at layer eight; it does not establish that any layer built that tree by an operation a syntactician would recognize, any more than recovering a temperature from a column of mercury means the mercury computed one. The question this essay asks is whether the Transformer’s core update — attention plus residual addition, iterated — is itself a grammatical operation, separately from whether the trained network has absorbed syntax into its weights.
We argue that it is, and that identifying which operation it is settles a surprising amount. The single most determining fact about the architecture is that the residual stream is never overwritten, only added to: information at a position accumulates monotonically up the stack. That one property, taken seriously, selects against an entire grammatical tradition and toward another — against the destructive rewriting of phrase-structure grammar, and toward the monotone, accumulative side of a constraint grammar, the side that builds a feature structure by adding constraints rather than by replacing symbols. It makes Head-driven Phrase Structure Grammar the right comparison, and it forces the question of how a continuous vector can carry a discrete feature structure at all, which is the question Gradient Symbolic Computation was built to answer. The first half of this essay develops that correspondence with the care it needs — including the respect in which it falls short of unification proper, which we reach in §3.
The second half is about where the correspondence breaks. HPSG is head-driven: a phrase is the projection of a single selected head, and the head’s valence governs what combines with it. The Transformer has an analogue of head-selection — the query/key mechanism, which we will read as a valence specification — distinct from its analogue of merge. But the selector that does the governing is the query, and the query is never written into the structure the merge builds: it drives the combination and then evaporates. So the head governs without depositing itself, and the structure a layer accumulates is a headed grammar’s product with no record of which position was its head. We will argue that this is the structure of the object rather than a defect to be patched — that a Transformer layer is a constraint grammar in which selection and merge run at every position simultaneously, and the discrete parse exists only as the quantization limit of a gradient blend the dynamics need not reach.
A note on what we take and what we add. We take three things from prior work and claim none of them as our own: the tensor-product representation as the way to embed role-filler structure in a vector (Smolensky); the reading of attention as a subject querying for an object, together with the binding-symmetric character of the residual sum (its inability, on its own, to record which position attended to which) that the explicit-binding program (the TP-Transformer and its kin) states as the binding problem; and the dynamics of Gradient Symbolic Computation, in which a scheduled quantization term drives a continuous state toward discrete structure under an optimization or a sampling regime. Section 2 lays out each in the detail it needs. What we add is the grammatical reading built on them: that the residual merge and the query/key selection are HPSG’s two principles; that they cannot be reconciled in a Transformer because the head is the selector, which drives each combination from query-space but is never written into the structure the combination builds; and that the mainstream architecture’s declining of the explicit-binding fix is evidence the object does not need committed bindings to work. The essay is written for readers who may recognize either a biaffine attention scorer or the Head Feature Principle, but it assumes familiarity with neither. The contribution is conceptual: we claim the mapping is exact enough to carry concepts between the fields, and we locate, in the fracture, a place where the architecture’s behavior and a long-standing open question in linguistics turn out to be the same thing seen twice.
2. Background
2.1 HPSG in one page
Head-driven Phrase Structure Grammar is a lexicalized, constraint-based, surface-oriented grammar formalism. Three of its commitments matter here.
First, it is lexicalized to an extreme degree. A small
number of phrase-structure schemata combine with a large, richly
specified lexicon, so that most of the grammatical work is done by the
feature structures associated with words rather than by an inventory of
rewrite rules. Where a context-free grammar has many productions —
VP → V NP, VP → V NP PP, and so on — HPSG has
few schemata and pushes the combinatorics into lexical entries that
specify what each word requires.
Second, linguistic objects are feature structures — attribute-value matrices bundling phonological, syntactic, and semantic information — and the combinatory operation over them is unification: two feature structures combine into the least structure consistent with both, and combination fails if they carry contradictory values. Unification is monotone, in that it only ever adds constraints, and partial, in that it can be undefined. This is the formal heart of the framework and the thing that distinguishes a constraint grammar from a rewrite grammar: a phrase is not produced by replacing a nonterminal with a string of symbols, but by merging descriptions until a consistent whole emerges, or a clash blocks it.
Third, and giving the framework its name, headedness is governed by
the Head Feature Principle: the head value of a headed
phrase is structure-shared with the head value of its head daughter, so
that headed phrases are projections of their head daughter. A
verb phrase is a projection of its verb; the verb’s head features
percolate to the phrase. Selection is the complement of projection — the
head subcategorizes for its arguments, carrying a valence
specification (an ARG-ST or SUBCAT list) that
says what must combine with it — and a head is saturated when
its valence requirements are all discharged.
These two principles — unification of feature structures, and head-projection-with-selection — are both load-bearing, and they do different jobs. Unification says how parts combine; the Head Feature Principle says which part governs. In HPSG they coexist without tension because headedness is a property of a combination: when two daughters combine, exactly one is the head, fixed by the schema and the lexical entries. The whole second half of this essay is about the fact that this peaceful coexistence does not survive transplantation into a Transformer, which has no notion of “the combination” — only positions, all merging at once.
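To make the first principle concrete, here is a minimal sketch of unification over feature structures, assuming they can be modelled as nested Python dictionaries with atomic string values. The names `unify` and `Clash` and the toy entries are ours, chosen for illustration; structure sharing (reentrancy) and the type hierarchy of a real HPSG implementation are deliberately left out.

```python
class Clash(Exception):
    """Raised when two feature structures carry contradictory values."""


def unify(a, b):
    """Return the least feature structure consistent with both a and b.

    Feature structures are nested dicts with atomic (string) leaves.
    This is a simplification: no reentrancy, no types.
    """
    result = dict(a)  # copy, so neither input is overwritten
    for attr, b_val in b.items():
        if attr not in result:
            result[attr] = b_val  # a new constraint is simply added
        elif isinstance(result[attr], dict) and isinstance(b_val, dict):
            result[attr] = unify(result[attr], b_val)  # recurse into substructure
        elif result[attr] != b_val:
            raise Clash(f"{attr}: {result[attr]!r} vs {b_val!r}")
    return result


# Hypothetical toy entries.
noun = {"HEAD": {"POS": "noun"}, "AGR": {"NUM": "sg"}}
third = {"AGR": {"PER": "3"}}
plural = {"AGR": {"NUM": "pl"}}

merged = unify(noun, third)  # monotone: constraints are only added

try:
    unify(noun, plural)  # partial: contradictory values block the merge
    clashed = False
except Clash:
    clashed = True
```

The two properties named above are both visible here: `merged` contains everything `noun` contained plus the new constraint, and the attempt to combine singular with plural is undefined rather than resolved.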
2.2 The neural HPSG parser, and what its cost reveals
Recent neural HPSG parsers make the head machinery concrete in a way that will be useful. The approach unifies constituency and dependency parsing under a single joint-span structure, training one encoder against both a chart objective (for constituents) and a dependency-head objective, and decoding both trees at once. The architecture is, tellingly, a self-attentional encoder feeding two scorers: a span scorer and a biaffine attention scorer that assigns a score to every candidate head-modifier arc. Headedness, in this engineering, is an explicit prediction — a dedicated scorer estimates how good each position is as the head of each span.
What is instructive is the cost. Decoding either tree form alone is the familiar O(n³) of chart parsing. Joint decoding that must respect the Head Feature Principle — every span’s head consistent with the dependency structure — rises to O(n⁵), and the contribution that brings it back to O(n³) does so precisely by adding a dedicated head scorer, so that headedness need not be recomputed combinatorially at every span. The lesson is not about efficiency. It is that headedness is expensive because it is a choice made per combination: the grammar must decide, for each way of splitting a span, which sub-span projects. The structural fact stated in that work is exact: the head of a phrase serves as a dependency modifier for words outside the span but a dependency head for words inside it. Headedness is relational and directional — the same word is head looking inward and dependent looking outward. The Transformer will both vindicate this and violate it.
2.3 Tensor-product representations and Gradient Symbolic Computation
A feature structure is discrete; a vector is continuous. The bridge is tensor-product representations. Bind each filler — a symbol — to its role — a structural position — by a tensor product, and sum the bindings, so that an entire structured object becomes a single vector from which fillers can be recovered by unbinding against the roles. A tensor-product representation lets a connectionist network carry symbol structures exactly rather than approximately: given the right roles, the binding of “agent” to “dog” and “patient” to “cat” sums to a vector that is provably distinct from the one binding “agent” to “cat,” and either is recoverable. This is the representational substrate beneath everything that follows.
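A minimal sketch of binding and unbinding, assuming orthonormal role vectors so that unbinding is an exact inner product. The vectors and the names `outer`, `add`, and `unbind` are hypothetical choices for illustration, not taken from any particular implementation.

```python
def outer(filler, role):
    """Tensor product of a filler vector and a role vector, as a matrix."""
    return [[f * r for r in role] for f in filler]


def add(m1, m2):
    """Elementwise sum of two matrices: superposition of bindings."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def unbind(tpr, role):
    """Recover the filler bound to `role` (exact when roles are orthonormal)."""
    return [sum(x * r for x, r in zip(row, role)) for row in tpr]


# Hypothetical toy vectors: orthonormal roles, arbitrary fillers.
AGENT, PATIENT = [1.0, 0.0], [0.0, 1.0]
DOG, CAT = [1.0, 2.0, 0.0], [0.0, 1.0, 3.0]

dog_agent = add(outer(DOG, AGENT), outer(CAT, PATIENT))
cat_agent = add(outer(CAT, AGENT), outer(DOG, PATIENT))

# A plain sum of fillers, with no roles, cannot tell the two apart.
bag_one = [d + c for d, c in zip(DOG, CAT)]
bag_two = [c + d for c, d in zip(CAT, DOG)]
```

Here `dog_agent` and `cat_agent` are different matrices and each filler is recoverable from either, whereas the two role-free sums coincide. That contrast is the one the binding problem of §2.4 turns on.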
Gradient Symbolic Computation is what happens when the filler activations are allowed to be continuous and the combinatory process is recast as optimization. A structure is represented as a blend of partially active symbols occupying (blends of) positions; well-formedness is Harmony, a sum of weighted soft-constraint satisfactions; and processing is gradient ascent in Harmony. This is Harmonic Grammar — the weighted, numerical relative of Optimality Theory — equipped with gradient representations and a dynamical implementation. By its proponents’ own framing it is an energy-based system: Harmony is negative energy, and a GSC network is a continuous dynamical system seeking Harmony maxima, kin to a Hopfield network.
The detail that will matter is what the dynamics actually optimize. The quantity is a total Harmony with two terms, ℋ_q = H + qQ. The first, the grammatical Harmony H, scores well-formedness — the soft constraints — and its unconstrained maximum is in general a blend, a superposition of partial structures rather than a single discrete one. The second, the quantization Harmony Q, is exactly zero on “the grid” — the set of vectors that encode a fully discrete structure, one symbol per role — and strictly negative off it; adding qQ therefore leaves the relative Harmony of the discrete candidates unchanged while penalizing every non-discrete state, and its weight q measures the strength of the demand for discreteness. The decisive move is that q is scheduled: to produce a discretely interpretable output, q is raised over the course of processing, so the system begins in a permissive blend and is driven toward a discrete grid point as q grows. A second parameter, the temperature T, controls randomness, and the framework’s two regimes are the two limits of the resulting diffusion. In the optimization regime, q is held and T is cooled toward zero on a slow schedule; the dynamics then converge, by a standard simulated-annealing result, to the global maximum of ℋ_q — a single structure. In the sampling regime, T is held and q is sent to infinity; the equilibrium distribution then concentrates on the grid points with probability proportional to e^{H/T} — a Boltzmann distribution over discrete structures, which is Probabilistic, or Maximum-Entropy, Harmonic Grammar. The same continuous dynamics, under different (q, T) schedules, either select one structure or sample from a distribution over them.
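The role of the schedule can be seen in a toy system with one role and two candidate fillers. The sketch below is ours and rests on stated assumptions: the grammatical Harmony is a simple quadratic whose maximum is a blend, the quantization Harmony is a polynomial penalty that vanishes exactly on one-hot vectors (an illustrative form, not necessarily the one used in the GSC literature), and the dynamics are deterministic gradient ascent, so the temperature T is not modelled at all.

```python
def grammatical_harmony(a, bias):
    """H(a) = bias . a - |a|^2 / 2; its unconstrained maximum is the blend a = bias."""
    return sum(b * x for b, x in zip(bias, a)) - 0.5 * sum(x * x for x in a)


def quantization_harmony(a):
    """Q(a): zero exactly on one-hot vectors (the grid), negative everywhere else."""
    s = sum(x * x for x in a)
    return -(sum(x * x * (1 - x) ** 2 for x in a) + (s - 1) ** 2)


def total_harmony_gradient(a, bias, q):
    """Gradient of H + q * Q with respect to the activations a."""
    s = sum(x * x for x in a)
    return [
        (b - x) - q * (2 * x * (1 - x) * (1 - 2 * x) + 4 * x * (s - 1))
        for b, x in zip(bias, a)
    ]


def settle(bias, q_max, steps=10000, lr=0.002):
    """Gradient ascent on H + q * Q while q ramps linearly from 0 to q_max."""
    a = [0.5 for _ in bias]
    for t in range(steps):
        q = q_max * t / (steps - 1)  # the schedule: q grows during processing
        grad = total_harmony_gradient(a, bias, q)
        a = [x + lr * g for x, g in zip(a, grad)]
    return a


BIAS = [0.7, 0.3]  # hypothetical soft-constraint weights favouring filler 0

blend = settle(BIAS, q_max=0.0)      # no demand for discreteness
discrete = settle(BIAS, q_max=30.0)  # q scheduled upward
```

With q held at zero the state settles at the blend favoured by H alone; with q ramped upward the same dynamics are driven to the neighbourhood of the grid point for the preferred filler.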
The application closest to our concern is GSC as a model of incremental parsing. Cho, Goldrick, and Smolensky build a continuous-state, continuous-time dynamical system that processes a sentence word by word and must solve two opposed problems: keep multiple interpretations alive under temporary ambiguity, yet ultimately reject the ones inconsistent with later context. Their parser moves to a non-discrete blend state that then evolves toward discrete states encoding globally coherent interpretations, under the scheduled increase of q. When the schedule is wrong — discreteness imposed too early — the parser commits prematurely and makes errors that mirror human garden-path and local-coherence effects.
One distinction must be kept sharp, because the fracture of §5 turns on it. The intermediate blend state — a single continuous vector partway through computation — is not a probability distribution over parses; it is a genuine point in the continuous space, a superposition in the literal vector sense, not a mixture in the probabilistic sense. The Boltzmann distribution lives at a different level: it is the distribution of discrete outputs the stochastic dynamics produce across runs in the sampling regime. “Blend” and “distribution over structures” name different objects — the blend is the instantaneous state, the distribution is the output ensemble. And whether discreteness should be enforced in the output at all — the “output gradience” question — is, by the framework’s own account, unresolved: some analyses require fully discrete output, others posit residual gradient activity in surface representations, with no settled criterion. We note this only to set it aside; it will return, briefly, at the end.
2.4 The explicit-binding program
There is a third body of work we lean on, concerning how a vector can record who combined with what. The tensor-product representation solves this in principle, but a standard Transformer does not use it: its attention sums value vectors without binding them to roles. A sustained research program has proposed adding explicit role-filler binding and unbinding to neural models to recover the structure that plain summation loses. The Tensor Product Generation Network generates sentences by emitting, at each step, an unbinding of one learned role; TP-N2F combines binding and unbinding for program synthesis; the TP-Transformer binds each retrieved value to a role vector inside the attention mechanism itself. These models descend from Smolensky’s original variable-binding work and they demonstrably improve interpretability and structural generalization.
Two findings from this program are load-bearing for us, and both are
theirs, not ours. The first is the reading of attention itself: the
TP-Transformer’s authors describe a cell as a subject that
queries the other cells for an object — selection by another
name, and the seed of the query-as-valence reading we develop in §4. The
second is the binding problem they state formally in motivating the fix.
In a stack of plain attention, a layer in which cell a attends
to b and cell c attends to d, feeding a
higher cell that attends to both, produces a residual sum that is
symmetric in a way that cannot record the pairing of who
attended to whom: the representation that should encode
(a/b)/(c/d) is indistinguishable from the one for
(a/d)/(c/b). The merge has lost the binding. The reason is
structural and worth naming exactly, because an earlier essay in this
series, “The Algebraic Mind Meets the Neural World,” traced it to
Smolensky’s framing: attention is approximate tensor-product
unbinding — the query is a role-address, the value the filler
retrieved — but the value projection v_j = W_V x_j is a
function of the source alone and carries no trace of the key it matched,
so the weighted sum scales each value and discards the routing, leaving
a bag of fillers with the pairing erased. Smolensky’s complaint about
first-generation systems is the same observation one layer up: the
compositional graph is used within a layer and then not
encoded in the activation passed onward, so the binding has to
be recomputed from scratch each layer and is never carried.
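The symmetry can be checked directly in a stripped-down setting. The sketch below makes several simplifying assumptions of ours: the value map is linear and shared across layers, each first-layer cell puts all its attention weight on a single source, the higher cell weights its two sources equally, and MLPs and normalization are omitted. The matrix `W_V` and the cell vectors are hypothetical.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def vadd(*vectors):
    """Elementwise sum of any number of vectors."""
    return [sum(xs) for xs in zip(*vectors)]


# Hypothetical shared value map (a permutation, so arithmetic stays exact).
W_V = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]


def attend(target, source):
    """Residual update when `target` attends entirely to `source`."""
    return vadd(target, matvec(W_V, source))


def higher_cell(x, y):
    """What a higher cell gathers when it attends equally to x and y."""
    return vadd(matvec(W_V, x), matvec(W_V, y))


a, b = [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]
c, d = [0.0, 0.0, 3.0], [4.0, 5.0, 6.0]

original = higher_cell(attend(a, b), attend(c, d))  # should encode (a/b)/(c/d)
swapped = higher_cell(attend(a, d), attend(c, b))   # should encode (a/d)/(c/b)
```

The first-layer states differ between the two pairings, but what the higher cell gathers is the same vector in both cases, because the sum distributes over the value map and forgets which filler travelled with which target.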
A qualification sharpens this rather than softening it, and it ties
the argument to the architectures the companion essay cares about. The
total collapse of (a/b)/(c/d) into (a/d)/(c/b)
requires that the swapped fillers traverse identical maps to
the higher cell. In a standard stacked Transformer they need not: each
layer has its own W_V, W_O, so a filler merged
at one depth passes through a different composition than one merged at
another, and depth-of-incorporation works as an implicit timestamp — a
crude positional role — that partially breaks the symmetry and lets
distinct layers build distinct feature structures. In a looped
or weight-tied Transformer, every iteration applies the same
projections, that timestamp vanishes, and the bag semantics holds
cleanly and globally. So the binding-symmetric reading is exact
precisely for the looped and equilibrium models — the ones “The Given
and the Found” treats as the architecturally interesting case — and
holds in a stacked model only modulo the implicit depth-role, which is
itself the crudest “slot-i” binding rather than a grammatical
one. A third detail will matter in §6: when the generation network was
made to induce its binding roles unsupervised, the roles it
discovered came out as a hybrid of positional and
syntactic/semantic types — neither purely “slot i in the
sequence” nor purely “grammatical function,” but a blend of the two.
We now have the three pieces — monotone merge, head selection, and gradient blends with a quantization problem — and can state the correspondence.
3. The residual stream is a join, not a rewrite
Begin with the operation that defines the architecture. Write the
residual stream at layer ℓ and position i as h_i^ℓ. The
update is, schematically,
h_i^{ℓ+1} = h_i^ℓ + Attn_i(h^ℓ) + MLP(...),
and the defining feature is the leading h_i^ℓ +: the
stream is never overwritten, only added to. Information at a position
accumulates up the stack. Whatever was true of h_i^ℓ
remains, in a linear-readout sense, recoverable from
h_i^{ℓ+1} modulo what later layers actively cancel; the
default is accumulation, not replacement.
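A minimal sketch of the additive update, with hypothetical sublayers standing in for the attention and MLP writes. It is meant only to show the shape of the operation: each sublayer reads the stream and adds to it, and earlier content stays readable unless a later write actively cancels it.

```python
def run_residual_stream(h0, sublayers):
    """Apply each sublayer additively and return the stream after every write."""
    stream = [list(h0)]
    for sublayer in sublayers:
        write = sublayer(stream[-1])
        stream.append([h + w for h, w in zip(stream[-1], write)])
    return stream


# Hypothetical sublayers; real attention and MLP writes are learned.
def write_feature_1(h):
    return [0.0, 2.0 * h[0], 0.0]  # reads coordinate 0, writes coordinate 1


def write_feature_2(h):
    return [0.0, 0.0, h[0] + h[1]]  # reads coordinates 0 and 1, writes coordinate 2


def cancel_feature_0(h):
    return [-h[0], 0.0, 0.0]  # an active cancellation of coordinate 0


h0 = [1.0, 0.0, 0.0]
accumulated = run_residual_stream(h0, [write_feature_1, write_feature_2])
cancelled = run_residual_stream(
    h0, [write_feature_1, write_feature_2, cancel_feature_0]
)
```

In `accumulated` the original coordinate survives every layer untouched. In `cancelled` it is removed only because a later sublayer writes its exact negation, which is the exception the phrase "modulo what later layers actively cancel" allows for.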
What this fact rules out is sharper than what it rules in. A
phrase-structure rewrite grammar applying A → B C
consumes A: the symbol is gone, replaced by its
expansion, and rewriting is in this sense non-monotone — it destroys its
input. The residual stream cannot work that way; it only accumulates. So
monotonicity excludes the entire rewrite tradition and places the
architecture somewhere in the constraint-grammar family — the
formalisms that build a structure by adding constraints that never
retract, the feature-structure-building grammars rather than the
symbol-replacing ones. That is a claim about a family, not a single
member, and the family is large: HPSG, LFG, Word Grammar, and others all
build monotonically. Where the architecture sits among the constraint
grammars is the question the companion essay takes up, and its answer is
not the obvious one — that it sits at no single member, cutting across
what each of them bundles together. Here we deliberately start from the
most prominent member, Head-driven Phrase Structure Grammar — the
canonical head-driven unification grammar, the one a syntactician
reaches for first — precisely so that the place where it fails
to fit becomes diagnostic. We are not claiming HPSG is the closest
grammar to a Transformer; we are using its prominence and its sharp
two-principle structure as the cleanest instrument for finding the fault
line, and reading off, from where the instrument breaks, what a better
fit would have to be.
Three features make HPSG the right exemplar to start from, beyond mere prominence. First, representations are bundles, not atoms. A residual vector is a superposition of many features in near-orthogonal directions; the superposition hypothesis of mechanistic interpretability is, read in this light, the claim that the stream carries an attribute-value bundle rather than a single categorial label: an HPSG feature matrix, not a bare NP/VP symbol. Second, there is one schema and a rich lexicon. The Transformer applies one attention-plus-MLP schema at every position and every layer, with all the content in the learned per-token representations and the weights that read them. This is radical lexicalism, which is HPSG’s signature. Third, combination is non-local. Attention reaches across arbitrary distance, as HPSG handles non-local dependencies through its feature machinery (the percolation of a gap through the SLASH feature) rather than over contiguous spans. What the residual stream builds, across the depth of the stack, is thus something like a feature structure that only grows and, as §5 will press, an unordered one, since the merge records no head and so imposes no governance ordering on what it gathers. HPSG is the grammar whose vocabulary lets us say all of this.
So the residual stream constructs feature structures by monotone
merge — but “merge” here is implemented by attention, and attention is a
softmax-weighted average, and that is where the correspondence meets its
first obstacle. Unification is exact and can clash:
combining [NUM: sg] with [NUM: pl] fails, and
that failure is where a constraint grammar gets its discriminating
power, since ruling combinations out is most of what a grammar does.
Attention has no clash. It returns a convex combination of value
vectors, always defined, never failing — a lossy blend, not an exact
join. The residual stream performs a soft, failure-free
approximation of unification, which is not unification in the
strict sense.
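The absence of a clash is easy to see in a toy case. The sketch below assumes one-hot codes for the two number values (a hypothetical encoding of ours) and a single attention step over two sources.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(scores, values):
    """Softmax-weighted average of the value vectors."""
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]


# Hypothetical one-hot codes for [NUM: sg] and [NUM: pl].
NUM_SG, NUM_PL = [1.0, 0.0], [0.0, 1.0]

blend = attend([0.0, 0.0], [NUM_SG, NUM_PL])   # equal match to both sources
skewed = attend([2.0, 0.0], [NUM_SG, NUM_PL])  # stronger match to the singular
```

Where unification would fail, attention returns a point between the two values: half singular and half plural when the scores tie, mostly singular when they do not. The result is always defined and always inside the convex hull of the sources.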
The repair is GSC, and this is why the framework is structural rather than optional decoration. Replace exact unification under hard constraints with Harmony maximization under soft, weighted constraints. A combination that would clash in classical HPSG becomes, in Harmonic Grammar, a combination of low Harmony — strongly dispreferred but not undefined. A soft constraint merely down-weighted is what a learned attention pattern plus MLP can implement: no hard failure, only a low-Harmony region the dynamics avoid. On this reading the lossy blend stops being a defect and becomes the design. The stream computes a gradient feature structure, and the structure it settles toward is the maximum-Harmony one given the soft constraints the weights encode. The Transformer, read this far, is a Harmonic Grammar parser whose constraints are learned rather than hand-specified and whose representations are gradient by construction; the blend is the native state of such a parser.
The second obstacle is deeper, and GSC does not dissolve it; it is the one that should keep us from saying flatly that the operation is unification. In a unification grammar the dynamics and the structure are the same thing: two feature structures match or clash by what they are, so the structures drive their own combination and constitute its result at once — the engine and the product are one. The Transformer pulls these apart into three separate things. The target position issues a query, and that query drives the combination — it selects which other positions are gathered — but the query does not participate in the weighted sum; the sum is over the source values, so the position directing the merge contributes the direction, not the material. The material comes from the sources. And the feature structure proper is then built out of that gathered source material by the MLP. So what would be, in unification, a single structure-sensitive operation is here factored into a target-query that catalyzes selection without entering the result, a set of sources that supply content without directing the gather, and an MLP that assembles the structure from what was gathered. The driver does not appear in the product, the product is built from elsewhere, and the assembly is a separate sublayer. The Transformer has unification’s product — an accumulated feature structure — without unification’s engine, the structure-driven matching a syntactician means by the word. This three-way split is not incidental: the target-query that catalyzes selection is the selection principle of §4, the source-aggregation the merge, so the very respect in which the operation is not unification is the factorization the rest of the essay turns on, visible already in a single layer.
If the essay stopped here it would be a clean “Transformers are Harmonic Grammar parsers” thesis. It does not stop here, because the head has not yet entered, and the head is where the picture fractures.
4. Two analogues: the query selects, the stream merges
HPSG runs on two principles. Section 3 cashed out the merge — the residual write as a monotone join under soft constraints, a Harmony-maximizing build rather than unification proper. But the framework is head-driven, and we have said nothing about the head. The Transformer turns out to have a second, independent grammatical analogue — a different operation, distinct from the first.
Consider the query/key mechanism on its own, setting the residual
stream aside. At position i the model computes a query vector
q_i and compares it against the keys k_j of
all positions. The query is, functionally, a specification of what i
is looking for — it defines, in the key space, the kind of content
position i wants to combine with, and attention weight is high where a
key matches the specification. This is, close to literally, a
valence specification. An HPSG head carries an
ARG-ST list saying what it must combine with; the query
says what this position seeks. Attention, on this reading, is the head
selecting its arguments — scanning the available positions for
fillers that satisfy its valence and binding to them in proportion to
fit. The key/value distinction even mirrors the selection/content
distinction: the key advertises what kind of thing I am, for
selection purposes, the value carries the content I contribute
once selected.
This is the selection analogue, and it is grounded independently of the merge analogue. It is not new with us: the TP-Transformer’s authors read attention in just these terms, the subject querying for an object. The neural HPSG parser of §2.2 makes it concrete on the supervised side: its biaffine scorer assigns a score to each head-modifier pair, a learned valence-satisfaction score whose core is a bilinear form of the same kind as the query/key inner product that attention computes before the softmax. The parser’s explicit head scorer does, as a supervised side task, what we claim the query/key mechanism does implicitly: estimate headedness and argument-fit. Selection, on this reading, describes what the query/key inner product computes.
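The kinship between the two scorers can be stated in a few lines. The sketch below assumes the common form of a biaffine score (a bilinear term plus a linear term and a bias) and the unscaled query/key inner product; the exact parameterization in the parser of §2.2 may differ, and all weights here are hypothetical toy values.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_logit(x_i, x_j, w_q, w_k):
    """Pre-softmax score of position i's query against position j's key."""
    return dot(matvec(w_q, x_i), matvec(w_k, x_j))


def biaffine_score(x_i, x_j, u, w, b):
    """Bilinear term in (x_i, x_j), plus a linear term in x_j, plus a bias."""
    return dot(x_i, matvec(u, x_j)) + dot(w, x_j) + b


# Hypothetical toy weights.
W_Q = [[1.0, 0.0], [2.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 1.0]]

# U = W_Q^T W_K, so that x_i^T U x_j equals (W_Q x_i) . (W_K x_j).
U = [
    [sum(W_Q[k][r] * W_K[k][c] for k in range(2)) for c in range(2)]
    for r in range(2)
]

x_i, x_j = [1.0, 2.0], [3.0, -1.0]
logit = attention_logit(x_i, x_j, W_Q, W_K)
score = biaffine_score(x_i, x_j, U, [0.0, 0.0], 0.0)
```

With the linear term and bias set to zero and the bilinear matrix factored as above, the two scores are the same number. The biaffine scorer is the more general form, and the attention logit is the special case in which the bilinear matrix is constrained to a low-rank product of query and key projections.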
So the Transformer has two grammatical analogues, each independently motivated:
- Merge (§3): the residual write is a monotone join; the stream accumulates a gradient feature structure; the structure it settles toward is the maximum-Harmony one. This is the analogue of HPSG’s unification principle, with the qualifications of §3.
- Selection (§4): the query is a valence specification; attention is the head scanning for and gathering the arguments that satisfy it; the key/value split is the selection/content split. This is the analogue of HPSG’s head-projection-and-selection principle.
In HPSG these two are the same grammar seen from two sides, reconciled by the fact that headedness is a property of each combination. The question is whether they reconcile in the Transformer. They do not, and §5 examines how.
5. Why the two analogues cannot be reconciled
The incompatibility is structural, but it is not where one might first place it. The obvious candidate is parallelism — a Transformer updates all n positions at once, where an HPSG derivation combines two daughters at a time — but this is a false lead, and saying why sharpens what the real fracture is. HPSG parsing is not inherently sequential: a chart parser builds, at each stage, all the edges that can be built from the edges already present, and many non-conflicting merges fire in parallel. So a Transformer layer maps cleanly onto a stage of parallel parsing — the batch of all merges applicable to the current representations — and depth onto the succession of stages, each layer combining what the layer below produced. Read this way, the n simultaneous combinations are no embarrassment at all; they are exactly the parallel non-conflicting merges a chart parser performs at one stage. Parallelism reconciles.
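The stage-wise picture can be illustrated with a much simpler formalism than HPSG. The sketch below is a CKY recognizer over a toy context-free grammar in binary form, which we substitute for an HPSG chart parser purely for brevity; the grammar, lexicon, and function name are hypothetical. What it shows is only the scheduling: every edge built at a stage depends on edges from earlier stages, so all edges within a stage could be built in parallel.

```python
def cky_stages(words, lexicon, rules):
    """Return the chart and, for each stage, the spans built at that stage.

    chart[(i, j)] is the set of categories spanning words[i:j].
    Stage k builds every span of length k from strictly shorter spans.
    """
    n = len(words)
    chart = {(i, i + 1): set(lexicon[w]) for i, w in enumerate(words)}
    stages = [sorted(chart)]  # stage for length 1: the lexical edges
    for length in range(2, n + 1):
        built = []
        for i in range(n - length + 1):
            j = i + length
            cats = set()
            for k in range(i + 1, j):  # every way of splitting the span
                for left in chart.get((i, k), ()):
                    for right in chart.get((k, j), ()):
                        cats |= rules.get((left, right), set())
            if cats:
                chart[(i, j)] = cats
                built.append((i, j))
        stages.append(built)
    return chart, stages


# Hypothetical toy grammar.
LEXICON = {"the": {"Det"}, "dog": {"N"}, "chased": {"V"}, "cat": {"N"}}
RULES = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

chart, stages = cky_stages("the dog chased the cat".split(), LEXICON, RULES)
```

In this example the two noun phrases are built in the same stage, independently of each other, which is the sense in which a layer's simultaneous combinations correspond to one stage of parallel parsing.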
What does not reconcile is the character of each individual combination, and locating it exactly matters, because one of the two apparent problems turns out not to be a problem at all. In HPSG each merge in the parallel batch is a headed combination of determinate daughters: one or more daughters, exactly one of them the head, projecting to a mother. (Binarity is not the issue: HPSG schemata are usually binary for parsimony, but the head-complement schema can discharge several complements at once; what matters is that one daughter is the head whose features project.) A Transformer’s per-position update departs from this in two respects, but only one of them is a fracture.
The first is that position i absorbs a softmax-weighted blend of all positions rather than a combination of identified daughters — no position is definitely in or out. This looks like a defect and is not one: indeterminate, soft membership is exactly what a gradient blend is for, a legitimate pre-quantization state that the GSC machinery of §2.3 resolves toward discreteness as the effective q rises. A soft mixture of candidate daughters is a blend on its way to a determinate merge, not a violation of the grammar; §6 shows it quantizing. Indeterminacy is remediable, and the framework we have already imported is what remedies it.
The second respect is the fracture, and it has a single source that the rest of this section unpacks: in a Transformer, the selector is not part of the value space. The query that drives each combination is never among the value vectors that get summed, and from that one exclusion both the head’s absence and the bag problem follow.
A Transformer layer combines by summing value vectors into the residual stream, and a sum deposits content without attaching it to a role: the result is a soup of superimposed contributions. This is not a defect for the architecture’s own purposes — it is what makes it work. Two operations read the soup and impose structure on it. Attention’s queries reach in and pick out the contributions a position needs; selection-out is what attention is good at. And the MLP binds: it reads its input as a combination of subspaces and writes a new representation in which the selected subspaces become axes of variation — a learned re-encoding that does attach content to roles, in the directions of its output space. So binding is not absent from the layer; it is the MLP’s job. What the merge itself does not do is carry the binding — the sum, taken alone, records no relation among the daughters it superimposes; whatever roles get assigned are assigned downstream, by the MLP, out of the soup the sum produced.
Both of HPSG’s principles fail on this one fact, which is why they are not two failures. Read as selection, the query makes position i a governor that reaches into the soup and pulls structures out — but pulling-out is the opposite of projecting-in; the head selects without depositing itself, so its governing the combination leaves no governing trace in the combination. Read as merge, the sum gathers the source values but not the query that gathered them.
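A minimal sketch of the exclusion, assuming a single unscaled attention head with toy keys and values (all names and numbers are illustrative, and the queries are supplied directly rather than computed from a hidden state, to isolate the attention write from the skip path). The output is a weighted sum of value vectors only; the query enters through the weights and nowhere else. Two queries that differ in a direction no key responds to therefore write exactly the same thing.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(query, keys, values):
    """Single-head attention at one position: returns (weights, output).

    The output is built from `values` alone; `query` only sets the weights.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy sources: two positions with 3-dim keys and 2-dim values.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

q1 = [2.0, 0.0, 0.0]
q2 = [2.0, 0.0, 5.0]  # differs from q1 only in a direction orthogonal to every key

w1, out1 = attend(q1, keys, values)
w2, out2 = attend(q2, keys, values)

print(out1 == out2)  # the written vectors coincide although q1 != q2
```

The sketch concerns the attention write alone. In a full block the target's own content still travels on the skip path, which is the distinction the next paragraph draws.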
How strong the resulting headlessness is depends on one architectural
detail. A block has two residual paths, around attention and around the
MLP, so position i’s content h_i reaches the next layer
through the MLP, not only past it — which means the head’s
content is not lost: an MLP could read the embedding in a
head-typed direction and re-project it. The strong claim, that the head
is genuinely unrecoverable, holds only in a variant without the skip
that feeds i’s own content to the MLP, where the merged soup is all the
MLP gets; we do not make it about the actual architecture. The precise
claim we do make is narrower:
the query does not participate in what is built. The
content may ride the skip forward, but i’s selection act — which
daughters it gathered, as their governor — is consumed in the softmax
and written nowhere, recoverable from no later sublayer because it was
deposited in no space, value or residual. What is irrecoverably absent
is not the head’s content but the head’s relation to this
combination.
The one root cause — that the selector is not in the value space — surfaces in two distinct ways, and it is worth separating them because they have different fates. The first is the missing selection relation: the query that made i the governor of this combination is consumed in forming the attention weights and written into no representation, so the structure carries the gathered daughters (and, via the skip, i’s own content) but no record of i-having-selected-them-as-their-head. The second is the bag problem: even among the daughters that the sum does deposit, the pairing relating them is lost, so distinct structures collapse to the same vector. Both follow from the selector’s exclusion from value-space, and both are reparable in principle by writing the selector in — but the canonical fix, as we will see, reached only the second.
And the culprit is the exclusion, not the addition. A sum can carry a
binding if its summands are tagged: were the selector in the value
space, the merge — commutative as it is — would still tell
a/b from b/a, not by any clean orthogonality
of roles but because the selection logic is directional. The query-key
strength with which a gathers b differs from
the one with which b gathers a, and once the
selector rides in the summed values that asymmetry comes through. The
binding would be indirect, through the selection strengths rather than
role geometry — which is why the fix is to admit the selector to the
sum, not to abandon the sum.
The bag problem is not a loose analogy; it is the binding
problem of stacked attention that the TP-Transformer stated
formally (§2.4). Their worked case is ours: the residual sum
z_a + z_c + o(v(z_b)) + o(v(z_d)) cannot record the pairing
of who attended to whom, so (a/b)/(c/d) and
(a/d)/(c/b) collapse to the same vector. The collapse is exact in a
weight-tied model; in a stacked model it holds only up to the
cross-layer differences that distinct per-layer projections introduce,
which let depth-of-incorporation serve as an implicit timestamp and
partially break the collapse. Nor does the block’s other sublayer repair it:
the MLP is position-wise, a function of the merged residual at one
position alone, so it receives the soup already summed and binds over
whatever pairing survived the sum — and when the pairing did not
survive, identical sums yield identical MLP outputs by construction,
whatever the MLP’s capacity. This is the bag manifestation, and it is
reparable: writing an explicit role onto each retrieved value before
summing, as the explicit-binding program does, restores the pairing in
the representation the MLP receives, and the MLP can then bind it. The
missing selection relation is the manifestation no such fix reaches,
because what is absent is not a lost pairing among summands but the
query itself — the act of governance — which no role written onto the
daughters supplies.
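The collapse and its repair can be reproduced with small integer vectors. This is a sketch under stated assumptions: the map f is a fixed permutation standing in for o(v(·)), the second-layer gather weights a and c equally, the same f is used throughout (the weight-tied case), and the role vectors are hand-picked rather than computed from the target by a learned projection as in the TP-Transformer. None of the names or numbers come from that paper.

```python
def add(*vectors):
    return [sum(components) for components in zip(*vectors)]

def hadamard(u, v):
    return [a * b for a, b in zip(u, v)]

def f(z):
    """Stand-in for o(v(z)): a fixed linear map (a coordinate permutation)."""
    return [z[2], z[0], z[1]]

# Toy token vectors (integers, so equality checks are exact).
z = {"a": [1, 2, 0], "b": [0, 1, 3], "c": [2, 0, 1], "d": [1, 1, 1]}

def gather(target, source):
    # Plain residual write: target content plus the retrieved value.
    return add(z[target], f(z[source]))

# Two different pairings, each then summed at a later position.
parse_ab_cd = add(gather("a", "b"), gather("c", "d"))  # (a/b)/(c/d)
parse_ad_cb = add(gather("a", "d"), gather("c", "b"))  # (a/d)/(c/b)

# Hypothetical role vectors, one per gathering position.
role = {"a": [1, -1, 1], "c": [-1, 1, 1]}

def bound_gather(target, source):
    # Bind the target's role onto the retrieved value before summing.
    return add(z[target], hadamard(role[target], f(z[source])))

bound_ab_cd = add(bound_gather("a", "b"), bound_gather("c", "d"))
bound_ad_cb = add(bound_gather("a", "d"), bound_gather("c", "b"))

print(parse_ab_cd == parse_ad_cb)  # plain sums coincide
print(bound_ab_cd == bound_ad_cb)  # role-bound sums do not
```

Because the plain sums are the same vector, any position-wise function applied afterwards, whatever its capacity, returns the same output for both pairings. Once the roles are written in, the two inputs differ and a downstream map has something to bind.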
The decisive fact is what the field did with that available fix. It was offered repeatedly, by the whole explicit-binding program of §2.4, and the models that adopt it work. Yet the mainstream architecture declined it: standard attention, the thing that actually scaled, kept the binding-ambiguous merge and added no explicit role-binding. The bag problem was not solved in the dominant Transformer; it was lived with: repeatedly diagnosed, repeatedly offered a cure, the cure repeatedly left on the shelf. The architecture treats the pairing of who-attended-to-whom as something a later query can usually recover well enough from context, rather than as something the representation must commit to at the merge, and it scaled anyway. That the field could leave the cure on the shelf is itself evidence about the object: a Transformer does not need committed bindings to work.
It is worth noting which fix the program offered, because it
aimed at the reparable manifestation and reveals the other by contrast.
The TP-Transformer binds a target-based role onto a
source-based value: per head, the blended value gathered from
the sources is bound (by a Hadamard contraction of the tensor product)
to a role vector computed from the target, which the network learns to
make subject-specific. This tags each gathered dependent with the
identity of the position that gathered it, restoring the pairing and
dissolving the bag problem. But the role derived from the
target rides on its dependents, while the target’s own content enters
the next layer only as undifferentiated skip carry-over — present, but
not marked as the head of this combination, carrying no record of which
daughters it governed. The fix repairs the bag and reaches the selection
relation only obliquely. A different and cheaper remedy (ours, and
conjectural) addresses the other manifestation. What the structure lacks
is the selection act itself, and that act already has a vector: the
query q_i is exactly the specification of what i sought as
governor, the governance relation in the one form the architecture
computes it, and it is discarded after forming the attention weights. So
the fix is not to learn a new projection of the head’s content, which
the skip already carries, but to write the query into
value-space — depositing q_i (or a fixed reshaping of
it) into i’s residual in a head-typed direction, as a marked summand
recording i as the governor of this combination. This adds no trainable
parameters, where the TP-Transformer’s role-binding adds a projection
matrix per head: their learned role map repairs the bag; re-using the
discarded query records the selection for free. We do not claim the
architecture should have either, only that the canonical fix addressed
the bag and left the selection relation unrecorded.
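The contrast between the two remedies can be sketched side by side. Everything here is an assumption made for illustration: the toy dimensions, the hand-picked W_role standing in for a learned role projection, the simplified one-head binding, and above all the choice to realize a "head-typed direction" as a reserved block of residual coordinates. The function deposit_query is hypothetical and shows the mechanics of the conjecture, not a tested architecture.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def tp_bind(h_target, blended_value, W_role):
    """Simplified TP-Transformer-style binding: the role is a projection of
    the target, and it is bound to the gathered value by a Hadamard product."""
    role = matvec(W_role, h_target)
    return [r * v for r, v in zip(role, blended_value)]

def deposit_query(residual, query, offset):
    """Conjectural, parameter-free write: add the query into a reserved
    block of residual coordinates starting at `offset`."""
    out = list(residual)
    for n, q in enumerate(query):
        out[offset + n] += q
    return out

# Toy sizes: 2 content coordinates plus 2 reserved "head-typed" coordinates.
h_i = [1.0, 2.0, 0.0, 0.0]       # target's content, carried by the skip
blended = [0.5, -1.0, 0.0, 0.0]  # softmax-weighted value gathered by i
q_i = [3.0, -2.0]                # the query that formed the weights

# (1) Role binding needs a projection matrix (hand-picked stand-in here).
W_role = [[2.0, 0.0, 0.0, 0.0],
          [0.0, -1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
residual_bound = [h + b for h, b in zip(h_i, tp_bind(h_i, blended, W_role))]

# (2) Query deposit: the ordinary residual write, then q_i added as a
#     marked summand in the reserved coordinates.
residual_plain = [h + v for h, v in zip(h_i, blended)]
residual_marked = deposit_query(residual_plain, q_i, offset=2)

extra_params_role = len(W_role) * len(W_role[0])  # one matrix per head
extra_params_deposit = 0                          # reuses the existing query

print(residual_bound, residual_marked)
```

In this toy the role binding changes how the gathered value is written, while the deposit leaves the content coordinates as they were and records the query alongside them. Whether a trained model could make use of such a record is exactly what remains conjectural.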
So the residual stream is a soup of summed contributions from which selectors recover what they need — a workable and powerful arrangement, but one in which the selector that does the gathering is never itself gathered in. That is the whole fracture: not parallelism, not the softness of the blend, not the additive merge as such, but that the selector is held out of the value space — and the head, which is a selector, is the constituent the architecture therefore cannot deposit into the structure it builds.
One might hope the directional asymmetry of a decoder rescues uniqueness — that causal masking, by letting position i attend only to j ≤ i, picks out a distinguished head. It helps at one point and nowhere else. The final position of a decoder, the one read out for the next token, is a privileged merge destination: everything flows into it and nothing flows out, and its projection spine is the one that survives to the output. There, and only there, the selection reading recovers a unique head — the generation site is the syntactic head of the whole constructed structure. But this is a boundary condition, not a parse. It distinguishes one head for the entire sequence at the moment of emission and says nothing about the n−1 interior positions, each of which remains a head-and-dependent in superposition. The asymmetry gives a root, not a tree.
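The boundary condition is visible in the causal mask itself. The following small sketch assumes a sequence of four positions and the standard rule that position i may read position j only when j ≤ i; the variable names are ours.

```python
n = 4  # illustrative sequence length

# mask[i][j] is True when position i may attend to position j.
mask = [[j <= i for j in range(n)] for i in range(n)]

# How many positions each position reads from (row sums).
sources_of = [sum(row) for row in mask]

# How many positions read each position (column sums).
readers_of = [sum(mask[i][j] for i in range(n)) for j in range(n)]

print(sources_of)  # grows toward the final position
print(readers_of)  # shrinks toward the final position
```

Only the last position reads everything and is read by nothing but itself. Every interior position both reads earlier positions and is read by later ones, which is the head-and-dependent superposition the mask leaves untouched.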
6. The blend is the object
The natural conclusion is that we have been demanding the wrong thing of the analogy, not that it fails. We wanted a Transformer layer to compute a parse — a discrete tree with one head per node — and found it computes n superposed views merged under soft constraints. The softness has a name in the framework we imported to handle the continuity: it is a gradient blend, a conjunctive superposition of partial structures, what GSC says a continuous parser occupies before quantization. That softness is the part GSC dissolves; the head-absence, as we will see, is not a blend and does not dissolve.
Read the layer stack as a GSC incremental parser and the soft half of the fracture resolves into a known phenomenon rather than a contradiction. Early layers occupy a high-entropy blend: many features active, many candidate daughters superposed at each position. This is the Cho–Goldrick–Smolensky blend state — the contextually required maintenance of multiple interpretations under local ambiguity, in the literal vector-superposition sense, the instantaneous state and not yet any distribution over trees. As computation proceeds — across layers in a feedforward stack, or across iterations in a looped or equilibrium model — the dynamics may drive the blend toward a discrete corner: at each position a definite set of daughters wins, the superposition collapses, a definite combination quantizes out. Whether it does so depends on the analogue of GSC’s scheduled discreteness weight q — how strongly, and how early, the learned dynamics penalize non-discrete states. Depth, or iteration count, is the Transformer’s schedule variable, the budget over which an effective q can rise. But notice what quantization delivers and what it does not. It sharpens which daughters a position gathers; it cannot record the head’s governance of the combination, because what would record it — the query — was consumed in forming the gather and never written into the sum to be sharpened. (The head’s content may ride forward on the skip connection, as §5 noted, but not marked as the governor of this combination; quantization sharpens the daughters, not that marking.) So the cleanest quantized structure a Transformer can reach is a definite combination of definite daughters that is still headless: a sharp tree whose every node is the merge of identified parts, with no part recorded as the governor that projected it. The headlessness is orthogonal to the schedule, and that is exactly why it is the structural finding rather than an artifact of incomplete settling.
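The sharpening half of this claim can be sketched numerically. The block below is a crude stand-in, not GSC dynamics: it assumes that an inverse-temperature factor beta on the attention logits can play the part of the effective q, and the scores and the schedule are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def entropy(ps):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in ps if p > 0.0)

# Hypothetical query/key scores of one position against three candidate daughters.
scores = [2.0, 1.5, 0.0]

# Hypothetical schedule: beta rises with depth (or iteration count).
schedule = [0.5, 1.0, 4.0, 16.0]

trajectory = [(beta, softmax([beta * s for s in scores])) for beta in schedule]
entropies = [entropy(weights) for _, weights in trajectory]

for (beta, weights), h in zip(trajectory, entropies):
    print(beta, [round(w, 4) for w in weights], round(h, 4))
```

Under this assumption the entropy falls along the schedule and one daughter comes to dominate. Nothing in the loop ever writes the query anywhere: what sharpens is the distribution over sources, which is the sense in which the headlessness is orthogonal to the schedule.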
This reframing does three things. First, it dissolves the soft half of the §5 incompatibility without choosing a side. Selection and merge are not competing accounts of one discrete operation; they are the two Harmony components. Selection — the query/key valence-matching — is the grammatical Harmony, the soft constraints that prefer certain head-argument bindings. Merge — the monotone residual accumulation — is the substrate on which Harmony is maximized, the continuous medium in which partial structures superpose. A soft blend of candidate daughters is what maximizing grammatical Harmony without quantization yields, just as the grammar component of the Cho–Goldrick–Smolensky parser, alone, prefers a blend; raising q sharpens the daughters toward a definite combination. What this dissolution does not reach is the head: the grammatical Harmony scores bindings among the gathered sources, but the selector that drove the gathering is not among them, so no setting of q promotes it into the structure. The soft half resolves; the headless half is invariant under the schedule.
Second, the hybrid-roles finding falls into place as a prediction. When the explicit-binding program induced roles unsupervised, the roles it found were a blend of positional and syntactic/semantic types (§2.4). That is what this picture predicts: a gradient feature structure under soft constraints has no reason to factor cleanly into position and category, and will carry a superposed mixture of both. The roles came out hybrid because the substrate is a blend, and a blend does not respect the joints a discrete theory would impose on it.
Third, it tells us which experiments are the right ones. A GSC parser with too-aggressive quantization commits prematurely and produces garden-path errors; one with too weak a policy never resolves. The Transformer analogues are testable: a model whose effective q is too strong should show early, confident, hard-to-revise misparses on locally ambiguous input — a garden-path signature in the layerwise trajectory — while one whose q is too weak should carry blends to the output and produce fluent but structurally noncommittal continuations. We do not run these here; we claim only that GSC makes them the right experiments, by giving the layerwise trajectory a vocabulary — blend, schedule, quantization corner — in which “when does this model commit, and to what” is a well-posed question.
There is a fourth connection we record cautiously, having overstated it in earlier drafts. GSC’s unsettled question of output gradience — whether a linguistic system must quantize to discrete structure before using it, or may carry gradient activity through to the output — has in the Transformer an arena the debate has not considered. The framework’s default, following Smolensky, Goldrick, and Mathis, is that outputs cannot contain blends, and a body of work proposes constraints to penalize non-discrete output directly — Zimmermann’s FULL!, WEAK, and STRONG; the QUANTIZATION constraint of Goldrick, Putnam, and Schwarz. As Hsu’s overview frames it, whether to relax that assumption is one of the framework’s most significant open questions, and the analyses positing surface gradient activity remain contested. Every layer of a Transformer, by contrast, manifestly outputs a blend, and the decoded final state is in general not quantized — the model decodes from a gradient representation with no enforced discreteness step on the representation itself. One might read this as siding with the relax-the-assumption camp. But two caveats deflate the claim. First, the decoder’s final softmax over the vocabulary is a quantization step and the emitted token is always discrete, so a deployed Transformer is equally consistent with “discreteness is enforced, but only at the last operation, on the token rather than the representation.” Second, Hsu notes the framework has not settled whether output gradience is even substantive — a detectable phonetic or psychological property, as Amato and Walker propose, or purely formal, as Zimmermann takes it; the Transformer’s hidden state is substantive in the only sense available to it, but that does not adjudicate a question about output representations of a different kind. So the architecture settles nothing. 
It poses the question concretely — “must structure be quantized before use” becomes something one can put to a running system — and we leave it there.
7. Conclusion
We set out to test whether the residual stream computes a parse and found something more specific than yes or no. The stream’s monotone accumulation is a join — the structure-building operation of a unification grammar, though directed by a target’s query over gathered sources rather than driven by the structures themselves — which makes HPSG the right comparison and Gradient Symbolic Computation the right account of its continuity. But HPSG’s two principles come apart in the transfer: the query/key mechanism is a selection principle that makes every position a head in the selecting sense, while the residual write is a merge principle that builds its structure out of the gathered sources — and the head that did the selecting, living in query-space, is never among them. The head selects but does not project into the product; the merge accumulates daughters but no governor. The resolution is to stop demanding that the two be one operation. A layer computes a gradient blend in which selection supplies the soft grammatical constraints and merge supplies the continuous substrate, and raising the effective discreteness can sharpen which daughters are gathered — but it cannot write the selector into the structure, so even the sharpest result the dynamics reach is a headed grammar’s product with the heads removed — removed as governors, that is, not as material: the MLP skip carries each head’s content forward, and only the selection relation, the query consumed and written nowhere, is irrecoverably gone.
This is a less tidy conclusion than “Transformers are Harmonic Grammar parsers,” and the untidiness is the finding. The Transformer is a constraint grammar that builds the product of a head-driven grammar without the heads — gathering daughters into accumulated structures while the governors that selected them stay outside, in query-space, as pure routing. When the dynamics hold the blend, it has not finished parsing; when they sharpen it, it has finished parsing into a headless tree. Whether that is a limitation, in that it cannot represent the discrete head-projected structure language is usually taken to have, or a discovery, in that language processing may not require that structure, is a question that HPSG, GSC, and the Transformer, read together, force into the open. The parse, if we insist on one, is something we quantize out of the model’s state by reading it discretely — a corner the dynamics passed near, and even there a tree without its heads.
That the fracture generalizes — that the same un-bundling recurs across the other formalisms, and that the gradient substrate’s part is not to make the un-bundling possible (a factorization a discrete grammar could perform too) but to hold the un-bundled choices graded and superposed, never forced to resolve — is the argument of the companion essay. Here we have established the first and sharpest instance: the head-driven grammar whose two principles the architecture holds apart, and the two bodies of theory that let us see it doing so.
This article was co-authored by Łukasz Stafiniak and Claude (Opus 4.7). It continues the series on mind, metaphysics, and artificial cognition published at lukstafi.github.io and syndicated at lukstafi.substack.com, follows “The Given and the Found: What Test-Time Reasoning Amortizes, and What It Cannot,” and develops the symbolic/subsymbolic line of “The Algebraic Mind Meets the Neural World,” from which the reading of attention as approximate tensor-product unbinding, Smolensky’s first-generation-neurocompositional framing of the vanishing attention graph, and the Gröndahl–Asokan vocabulary-generalization verdict are carried. Its companion, “A Parser Without a Grammar: Where the Transformer Sits Among the Formalisms,” takes the fracture established here as the first instance of a general pattern. The HPSG material draws on Pollard and Sag’s formalism and on the neural joint-parsing line of Zhou and Zhao and of Li et al. (“Head-driven Phrase Structure Parsing in O(n³) Time Complexity”). The Gradient Symbolic Computation material draws on Smolensky and Legendre’s “The Harmonic Mind,” on Smolensky, Goldrick, and Mathis’s optimization-and-quantization framework, on the GSC dynamics results establishing that continuous neural dynamics perform discrete Harmony optimization and Boltzmann sampling under scheduled discreteness (q) and temperature (T) parameters, on Cho, Goldrick, and Smolensky’s dynamical incremental parser, and on Hsu’s overview of Gradient Harmonic Grammar, from which the output-gradience discussion and its references — the no-blends-in-output assumption of Smolensky, Goldrick, and Mathis; the output-penalizing constraints of Zimmermann (FULL!, WEAK, STRONG) and of Goldrick, Putnam, and Schwarz (QUANTIZATION); and the substantive/non-substantive question raised by Amato, Walker, and Zimmermann — are drawn. 
The tensor-product substrate is Smolensky’s; the binding-problem analysis and the subject/object (selection) reading of attention are from Schlag, Schmidhuber, and colleagues’ TP-Transformer (“Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving”). The explicit-role-binding program we set our reading against includes the Tensor Product Generation Network (“A Neural-Symbolic Approach to Natural Language Tasks”), TP-N2F, the TP-Transformer, and the broader role-learning line (Palangi et al., Huang et al.); the query-as-selection reading and the observation that residual merge is binding-symmetric are due to that program, not to us, as is the finding that unsupervised role induction yields hybrid positional/syntactic roles. What we develop, as far as we know for the first time, is the reading of the binding problem through HPSG’s two principles, the observation that the binding-symmetric collapse is exact for looped and weight-tied Transformers while a stacked model evades it only through the implicit depth-timestamp its layer-varying projections supply, the separation of the binding problem into a reparable bag manifestation and an unrecorded selection relation (with the observation that the TP-Transformer’s target-role/source-value binding addresses the former and reaches the head only obliquely, where re-using the discarded query vector as a head-typed deposit would record the latter at no parameter cost), and the argument that the mainstream architecture’s refusal of the role-binding fix is evidence its native representations leave bindings uncommitted. These are offered as conjectures, inviting the layerwise-trajectory experiments §6 describes.