A Parser Without a Grammar: Where the Transformer Sits Among the Formalisms
Łukasz Stafiniak and Claude (Opus 4.7)
Abstract
An earlier essay in this series argued that the residual stream performs a monotone information-merge: structurally a join, the structure-building operation a unification grammar uses, though directed by a target’s query over gathered sources rather than driven by the structures themselves. That made Head-driven Phrase Structure Grammar the natural point of comparison. The essay then found a fault line. HPSG’s head-projection and its feature-unification, which coexist peacefully in the grammar, come apart when transplanted into a Transformer, and for a single reason: the selector is not part of the value space. The query that gathers each combination is never among the values the merge sums, so the head is a selector over the structure, never a constituent within it; the binding problem of stacked attention is the same exclusion’s other face. This essay takes that fracture as a clue rather than a defect, and asks the more general question it implies: not which grammatical formalism the Transformer instantiates, but what the space of formalisms can tell us about where in it the architecture sits. We treat six traditions (HPSG, Combinatory Categorial Grammar, Lexical-Functional Grammar, Dynamic Syntax, Construction Grammar, and Word Grammar, with Transformational-Generative Grammar as a negative anchor) as coordinates rather than as candidates in a fit contest: each fixes one axis of grammatical design, and the architecture’s position on each axis is a fact we can read. The thesis that emerges is that the Transformer is a member of no family because its signature is to un-bundle design choices the established formalisms fuse: selection from merge, inventory from combination, grammar from parser, structure from process. In one case it runs the other way, fusing function with precedence where Lexical-Functional Grammar keeps them apart. The headless-merge fracture is the first instance of a general pattern.
We close by separating what the gradient substrate explains from what it does not: un-bundling is a factorization the discrete formalisms can themselves perform, and what the continuous medium uniquely adds is that the factored axes stay graded, the alternative structures stay superposed, and the state need never resolve to one of them. Gradient Symbolic Computation is the account of that medium — not a seventh formalism — and of why the architecture need not land on any single corner.
1. The Wrong Question
There is a question about Transformers and grammar that gets asked constantly and is almost worth retiring. It is: which grammatical formalism does the Transformer secretly implement? The question has produced a small industry of probing studies, and the industry’s findings are real and worth having. A recent systematic review of linguistic interpretability — López-Otal and colleagues, covering some 160 works across syntax, morphology, lexico-semantics, and discourse — assembles the verdict: syntactic information is robustly recoverable from the hidden states of pretrained language models, often more robustly than semantic information, acquired early in training, located (with much dispute about exactly where) in the middle layers, and detectable in attention heads that specialize in particular dependency relations. Structure is there, in the sense that a trained probe can pull it back out.
But probing, as “Merge and Selection” argued at length, shows that structure is present without showing what operation produced it; on that latter question the paradigm is silent by construction. The review itself is admirably frank about the gap. It foregrounds the correlation-versus-causation problem, the rise of causal and amnesic probes meant to address it, the finding that probes may be reading off the linear context surrounding a token rather than any structured representation, and the awkward result that one can shuffle word order in pretraining and still train a serviceable model — some of the apparent syntax, that is, may be statistical co-occurrence wearing a tree-shaped mask. The state of the art is a large, careful catalogue of what is recoverable, and recoverability is one step removed from what we want.
There is a cleaner demonstration of the gap, and it comes from outside the interpretability literature, in a study that is not about Transformers at all except as a control. Stanojević, Brennan, Steedman, Hale, and Dunagan modelled the brain signals of people listening to an audiobook, asking whether incremental structure-building metrics derived from a Combinatory Categorial Grammar parser explain neural activity over and above the next-word predictability estimated by a Transformer language model. They found that they do, in the left posterior temporal lobe, and — the part that matters here — that the two contributions are spatially separable: the structure-building effect is distinct from the predictability effect, which lives bilaterally in superior temporal cortex. The Transformer carries predictability; the grammar carries structure-building; the brain keeps them in different places. Whatever one makes of the neuroscience, the methodological lesson transfers. Structure-building and predictability are different functions. A model can be superb at the second while the first remains a separate question — which is why “the probe recovered a tree” cannot answer “what does the architecture compute.”
So we set aside “which formalism is it” and ask instead: the formalisms disagree with one another along identifiable axes — about whether combination consumes its inputs, about whether one parallel structure or several are maintained, about whether the grammar is a set of rules or an inventory of constructions, about whether there is a separate structural object at all. Each disagreement is a design decision, and a Transformer, being a definite mechanism, comes down somewhere on each. The formalisms, read this way, stop being rivals competing to be the right answer and become a coordinate system. The interesting facts are the architecture’s coordinates — and, as we will see, the fact that no single formalism shares all of them, because each formalism bundles together choices the architecture pulls apart.
A note on what follows. We assume a reader who will recognize a slash-category and a feature structure but not necessarily all six formalisms; we give each the minimum it needs at the axis where it is the exemplar, and no more. The contribution is conceptual. Where a mapping is ours rather than the field’s, we say so.
2. Six Axes
2.1 Monotonicity: HPSG and the rewrite grammar at the far pole
“Merge and Selection” established the positive pole of this first axis: because the residual stream is never overwritten, only added to, the core update is a monotone merge — the join in the information ordering of a constraint grammar — rather than the destructive rewrite of a phrase-structure rule, which consumes its left-hand side. That is what made Head-driven Phrase Structure Grammar the comparison, the lexicalized unification formalism whose derivation is a bundle of features that only grows. We take that result as given and ask what sits at the other end of the axis.
The negative pole is anchored, cleanly and somewhat unfairly, by Transformational-Generative Grammar. The transformational tradition is built on operations that are destructive in the way unification is not: movement leaves a trace and displaces material, deletion removes it, the derivation rewrites its own intermediate structures on the way to the surface. A grammar of transformations is a rewrite system par excellence, and it is the formalism the residual stream’s accumulate-don’t-overwrite character is most directly unlike. This is why we include it as an anchor rather than a contestant: its value to the coordinate system is to mark the far end of the axis.
But the anchoring has to be done with care, because the obvious version is wrong, and a reader who knows the Minimalist Program will catch the error. Bare Merge, the basic structure-building operation of minimalist syntax, is not destructive at all — it is monotone set-formation, taking two objects and forming the set containing them, adding structure without removing any, and so sits near the residual stream’s pole rather than opposite it. What anchors the destructive end is specifically Internal Merge — movement — together with the older transformational apparatus of deletion and trace: the operations that displace and erase. The transformational tradition occupies the negative pole because of movement, not because of Merge, and the architecture’s monotonicity is a vote for Merge-without-Move: combination that only ever adds, with no operation that relocates or deletes what an earlier step established. Where a transformational grammar would move a constituent and leave a gap, the Transformer can only write another vector into the stream — displacement achieved by addition, not relocation. This will matter again when we reach the one operation that looks like retraction.
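The contrast between a rewrite that consumes its input and a write that only adds can be made concrete in a few lines. What follows is a minimal sketch under our own assumptions, not a model of any real grammar or network: the symbol names, the three-slot "stream", and the helper names `rewrite` and `residual_write` are all hypothetical.

```python
def rewrite(symbols, lhs, rhs):
    """Destructive rewrite: the first occurrence of `lhs` is replaced by `rhs`."""
    n = len(lhs)
    for i in range(len(symbols) - n + 1):
        if symbols[i:i + n] == lhs:
            return symbols[:i] + rhs + symbols[i + n:]
    return list(symbols)


def residual_write(stream, update):
    """Additive write: the old state survives as a summand of the new one."""
    return [s + u for s, u in zip(stream, update)]


# A phrase-structure step consumes its input symbols.
rewritten = rewrite(["Det", "N", "V"], ["Det", "N"], ["NP"])

# Toy 3-slot stream (hypothetical): content sits in slot 0, then a later
# write cancels it there and adds it in slot 2. Nothing was relocated;
# the "move" is just a second vector added to the first.
stream = [1.0, 0.0, 0.0]
update = [-1.0, 0.0, 1.0]
after = residual_write(stream, update)

print(rewritten)  # ['NP', 'V']
print(after)      # [0.0, 0.0, 1.0]
```

After the rewrite, `Det` and `N` are unrecoverable from the result; after the additive write, subtracting the update gives back the earlier state exactly.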
2.2 Directionality: CCG and the head the join throws away
The fracture “Merge and Selection” arrived at was this: the residual merge is a symmetric operation. A join is commutative (A ⊔ B = B ⊔ A), and unification does not distinguish a head from a dependent; it merges descriptions without governance. So the pure merge reading, faithful as it is to the monotonicity of the stream, erases the head-driven character that names HPSG. The Transformer’s combination, taken as a join, is headless.
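The symmetry of the join is easy to check on a toy case. This is a minimal sketch assuming flat feature structures (plain dictionaries, no types, no nesting, no reentrancy), which is far less than HPSG uses; the function name `unify` and the feature names are ours.

```python
def unify(a, b):
    """Join two flat feature structures; return None on a feature clash."""
    out = dict(a)
    for feature, value in b.items():
        if feature in out and out[feature] != value:
            return None  # incompatible descriptions have no join
        out[feature] = value
    return out


verb = {"cat": "V", "num": "sg"}
subject = {"num": "sg", "per": 3}

left_first = unify(verb, subject)
right_first = unify(subject, verb)
clash = unify({"num": "sg"}, {"num": "pl"})
```

Both orders give the same merged description, and neither argument is marked as the one that governs the other, which is the headlessness the paragraph above describes.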
There is a formalism whose combinatory operation is built to be directional from the start, and it is the natural exemplar of a second axis. Combinatory Categorial Grammar (Steedman’s framework, with roots in the categorial grammars of Ajdukiewicz and Bar-Hillel and a semantics that runs in lockstep with the syntax in the Montague tradition) assigns each word a category that is either basic or a function seeking arguments. A transitive verb is (S\NP)/NP: something that, given an NP to its right, yields something that, given an NP to its left, yields S. Combination is function application, and application is inherently asymmetric: one category is the functor, the other the argument, and the slash even encodes the direction in which the argument is sought. This is the asymmetry HPSG’s symmetric join discards, recovered as the primitive of the combinatory operation.
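The two application rules can be sketched directly. This is a minimal illustration that covers forward and backward application only (no composition, no type-raising); the tuple encoding of categories and the function names are our own assumptions.

```python
# A category is either an atomic string ("S", "NP") or a triple
# (result, slash, argument), with "/" seeking rightward and "\\" leftward.

def forward_apply(left, right):
    """X/Y followed by Y gives X: the functor is on the left."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Y followed by X\\Y gives X: the functor is on the right."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


TRANSITIVE_VERB = (("S", "\\", "NP"), "/", "NP")   # (S\NP)/NP

verb_phrase = forward_apply(TRANSITIVE_VERB, "NP")  # verb + object
sentence = backward_apply("NP", verb_phrase)        # subject + verb phrase
wrong_order = forward_apply("NP", TRANSITIVE_VERB)  # argument cannot act as functor
```

Swapping the two inputs changes the outcome (`wrong_order` is `None`), which is exactly what a commutative join cannot express.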
And it maps onto the Transformer with surprising directness — onto the part of the architecture the monotonicity reading left out. Consider the query/key mechanism on its own. A position computes a query, a specification of what it is looking for, and attends to positions whose keys match that specification. A query is, functionally, a function seeking an argument of a certain type; the key advertises the type of argument a position can serve as; attention is the application of the one to the other. The functor/argument asymmetry of CCG is the query/key asymmetry of attention. This reading is not original to us — the authors of the TP-Transformer described an attending cell as a subject querying others for an object, which is selection by another name — but CCG sharpens it: the slash direction, the thing categorial grammar adds to bare valence, is the directional signature attention has and a symmetric join lacks.
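The query/key asymmetry can be seen in a toy computation. This is a minimal sketch with hand-picked two-dimensional vectors (hypothetical, not taken from any trained model), no learned projections, and no scaling factor; the labels "verb" and "noun" are only names for two positions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# The verb position's query looks for the feature the noun's key advertises;
# the noun position's query looks for nothing in particular.
queries = {"verb": [0.0, 4.0], "noun": [0.0, 0.0]}
keys = {"verb": [1.0, 0.0], "noun": [0.0, 1.0]}

verb_seeks_noun = dot(queries["verb"], keys["noun"])
noun_seeks_verb = dot(queries["noun"], keys["verb"])

# Attention weights of the verb position over (verb, noun).
verb_weights = softmax([dot(queries["verb"], keys[p]) for p in ("verb", "noun")])
```

The score from verb to noun is high while the score from noun to verb is zero: the relation has a seeker and a sought, as functor and argument do.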
So CCG appears to repair HPSG’s defect. It does, and in doing so it incurs the opposite defect, the first instance of this essay’s thesis. CCG’s application is not monotone. When a functor applies to its argument, the functor is consumed: (S\NP)/NP combined with an NP yields S\NP, and the original category is gone, discharged into the result. Application is as destructive of its functor as a rewrite rule is of its left-hand side; that is what it means for the slash to “cancel.” So CCG sits on the wrong side of the monotonicity axis: it wins directionality and loses accumulation, where HPSG wins accumulation and loses directionality. Neither formalism gives both. And the Transformer, manifestly, wants both: it has the directional query/key asymmetry of CCG application and the monotone never-overwrite accumulation of HPSG unification, because in the architecture these are two different mechanisms (the query/key inner product on one hand, the residual write on the other) rather than two aspects of one combinatory operation. A discrete operation that builds a constituent must fuse them: its single combine is either symmetric-and-monotone (HPSG) or directional-and-consuming (CCG), because that step happens at one point and cannot be two things there. (Only a formalism that draws relations rather than building constituents escapes this, namely Word Grammar, §2.6, and it pays for the escape in having no constituents at all.) The Transformer factors them apart and runs them at different sites. This is un-bundling, and it is why CCG and HPSG each match the architecture on one axis and miss on the other: they are each a bundling of a directionality choice with a monotonicity choice, and the architecture has unbundled the pair.
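The two sites can be put side by side in one toy step. This is a minimal sketch under stated assumptions: a single head, no learned projections, no causal mask, no normalization, and hand-picked vectors; `attend_and_write` is our own name, not a library function.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_write(stream, queries, keys, values):
    """One toy attention step: select with q.k, then add the result to the stream."""
    new_stream = []
    for x, q in zip(stream, queries):
        # Site 1, directional selection: the query only sets the weights.
        weights = softmax([dot(q, k) for k in keys])
        # What gets written is a weighted sum of values; q is not a summand.
        gathered = [sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(x))]
        # Site 2, monotone accumulation: add to the old state, never replace it.
        new_stream.append([a + g for a, g in zip(x, gathered)])
    return new_stream


stream = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 4.0], [0.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.5, 0.5], [2.0, -1.0]]

new_stream = attend_and_write(stream, queries, keys, values)
deltas = [[n - o for n, o in zip(new, old)]
          for new, old in zip(new_stream, stream)]
```

However the queries are chosen, each entry of `deltas` stays inside the range spanned by the value vectors: the selector shapes the mixture but never enters it, and the old state is still there underneath the write.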
One thing CCG genuinely contributes beyond the directionality axis, and we should credit it because it complicates a later section: CCG is incremental. Its flexible constituency — type-raising and composition allowing many derivations of the same string — lets even a simple left-to-right parser combine words as they arrive, without waiting for a phrase to complete. The brain study above leans on this: CCG’s left-branching incremental derivations are what let it track word-by-word processing. We flag it here because incrementality will return as its own axis, and CCG’s claim to it is real; what distinguishes the temporal exemplar we reach in §2.4 is not incrementality as such but the dissolution of the grammar/parser distinction, which CCG does not perform.
2.3 Strata: LFG and the conflation of position with function
A third axis concerns how many parallel structures a grammar maintains. Most formalisms produce one structural object — a tree, a derivation. Lexical-Functional Grammar maintains two, related but autonomous. There is c-structure, a constituency tree encoding precedence and dominance — who comes before whom, what groups with what — and there is f-structure, an attribute-value matrix encoding grammatical functions — subject, object, the features that unify — which is monotone and unification-based in just the way §2.1 described. The two are linked by a correspondence, the φ mapping, that says which nodes of the tree project which parts of the functional structure. LFG’s bet is that positional facts and functional facts are different kinds of information that need separate representations and a mapping between them, because within a single sentence the two need not line up: a fronted which book sits high in the tree but bears the object function of an embedded verb; a raised she occupies one position while filling the subject slot of two predicates; a dropped pro-subject is present as a function with no tree position at all. The non-alignment is intra-sentential, not merely cross-linguistic — it is there in one tree — which is why one structure with a fixed labeling will not do.
The Transformer does not obviously have two structures. It has one residual stream. And this is where LFG earns its place as a coordinate, by reframing an otherwise puzzling empirical result. The Tensor Product Generation Network finding that “Merge and Selection” engaged — that a model made to induce its binding roles unsupervised discovers neither pure positional slots nor pure grammatical functions but a hybrid of the two — becomes, on the LFG reading, evidence. It is the signature you would expect if the architecture is collapsing c-structure and f-structure onto a single vector, superimposing the positional stratum LFG keeps separate from the functional one. LFG predicts that a system with only one representational layer, asked to carry both kinds of information, would carry them entangled; the TPGN roles are that entanglement, observed.
So the architecture’s coordinate on the stratal axis is collapse: it occupies the position of a grammar that has run LFG’s two strata together into one. This rhymes with a recurring, unresolved finding in the interpretability survey — that syntactic and semantic information are sometimes reported as living in separate linear subspaces and sometimes as conflated, with no settled verdict. The stratal axis suggests why: a single stream can encode two strata in approximately-orthogonal subspaces (looking separate) or in overlapping ones (looking conflated), and which you observe depends on the probe, the layer, and how hard training pressed the two apart. And the un-bundling here runs the other way from §2.2: the architecture bundles what LFG keeps separate. So it is not uniformly a splitter — it unbundles selection from merge and bundles position with function, fusing where the formalisms split and splitting where they fuse.
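Why the same stream can look separated to one probe and conflated to another can be illustrated with two directions in a plane. This is a minimal sketch with hypothetical numbers: one scalar stands in for the positional stratum, another for the functional one, and a "probe" is just a dot product with a fixed direction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode(position_value, function_value, position_dir, function_dir):
    """Superimpose two scalar 'strata' on one vector along two directions."""
    return [position_value * p + function_value * f
            for p, f in zip(position_dir, function_dir)]


def probe(vector, direction):
    """A linear probe, reduced to a dot product with a fixed direction."""
    return dot(vector, direction)


POS = [1.0, 0.0]
FUN_ORTHOGONAL = [0.0, 1.0]    # orthogonal to POS
FUN_OVERLAPPING = [0.6, 0.8]   # unit length, cosine 0.6 with POS


def position_readout(function_dir, function_value):
    """Read the positional value (fixed at 2.0) back out of the mixed vector."""
    return probe(encode(2.0, function_value, POS, function_dir), POS)


clean_shift = position_readout(FUN_ORTHOGONAL, 1.0) - position_readout(FUN_ORTHOGONAL, 0.0)
leaky_shift = position_readout(FUN_OVERLAPPING, 1.0) - position_readout(FUN_OVERLAPPING, 0.0)
```

With orthogonal directions the positional readout ignores the functional value; with overlapping ones it shifts by the cosine between them, which is what a "conflated" report would be picking up.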
2.4 Time: Dynamic Syntax and the grammar that is a parser
Every formalism so far describes a static object — a well-formed structure, a license on strings — and leaves to a separate “parser” the business of building it left-to-right in time. The competence/performance distinction is the orthodoxy that licenses this division: the grammar specifies what is well-formed, the parser is a performance mechanism that recovers it. Dynamic Syntax — Kempson, Meyer-Viol, and Gabbay’s framework — refuses the division. In Dynamic Syntax there is no static structure that the grammar characterizes and the parser then recovers; the grammar is the procedure for building interpretation word by word. Parsing is the incremental, strictly left-to-right growth of a semantic tree, driven by requirements — open goals, of the form “a formula of this type is needed here” — that later words discharge. The state of the parse at any moment is in general underspecified: nodes whose content is not yet fixed, requirements not yet met, structure that is provisional and gets resolved as more input arrives.
This is the formalism for decoder-only, causal Transformers, and it is the natural home of a reading “The Given and the Found” developed at length. A causal model processes strictly left-to-right; it cannot attend to the future; its representation at each position is built only from what has come before, and whatever that position leaves open is settled at later positions, which read it, in the manner of a requirement awaiting discharge. Dynamic Syntax’s central dynamic (underspecification resolved to specification as parsing proceeds) is close to the picture that essay drew of the layer stack (or the iteration count of a looped model) driving an initially high-entropy, many-interpretations-superposed state toward a committed one. Where Dynamic Syntax has a node whose value is not yet fixed and a requirement that some later word will satisfy, the Transformer has a residual state that is a blend of partial commitments, sharpened as computation proceeds. The competence/performance collapse is the deepest agreement: a Transformer has no separate “grammar” stored apart from the “parser” that applies it. The weights are the procedure, there is no declarative grammar object anywhere in the system, and this is the most Dynamic-Syntax-like commitment in the entire architecture. The empirical layerwise findings from the survey fit the picture: positional information processed in lower layers, followed by a switch to more hierarchical encoding in higher layers, is what incremental specification looks like from the probe’s side.
Dynamic Syntax thus owns the temporal axis as CCG owns the directional one, and the distinction we promised in §2.2 can now be made precise. CCG is incremental — it can combine as it goes — but it remains a competence grammar with a separate notion of what is well-formed; its incrementality is a property of how its derivations can be ordered. Dynamic Syntax is incremental in a stronger sense: it has no competence/performance split to be incremental within, because the growth process is all there is. The architecture’s coordinate on this axis sits at the Dynamic Syntax end — grammar-as-process, not grammar-as-object-plus-parser — and this is the un-bundling of a fourth bundled pair: every other formalism fuses “what is well-formed” with “the system has a stored characterization of well-formedness,” and the Transformer has the first without the second.
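The requirement-driven growth described in this section can be caricatured as bookkeeping over a list of open requirements. This is a toy under our own assumptions, not Dynamic Syntax’s tree logic: there are no tree nodes, the initial requirements are assumed to have been introduced already, the lexicon is hypothetical, and the typing of the transitive verb is simplified to "supplies a predicate, opens a requirement for an object".

```python
# Hypothetical mini-lexicon: each word supplies a type and may open new requirements.
LEXICON = {
    "Mary": {"supplies": "e", "opens": []},
    "John": {"supplies": "e", "opens": []},
    "saw": {"supplies": "e->t", "opens": ["e"]},  # still needs its object
}


def step(open_requirements, word):
    """Consume one word: discharge a matching requirement, maybe open new ones."""
    entry = LEXICON[word]
    if entry["supplies"] not in open_requirements:
        raise ValueError(f"no open requirement for {word!r}")
    remaining = list(open_requirements)
    remaining.remove(entry["supplies"])  # discharges the first match only
    return remaining + entry["opens"]


def parse(words, initial=("e", "e->t")):
    """Return the list of open requirements after each word, left to right."""
    state = list(initial)
    trace = [list(state)]
    for word in words:
        state = step(state, word)
        trace.append(list(state))
    return trace


complete = parse(["Mary", "saw", "John"])
partial = parse(["Mary", "saw"])

print(complete)  # [['e', 'e->t'], ['e->t'], ['e'], []]
```

There is no separate well-formedness check here: a string is complete exactly when the procedure ends with nothing open, and `partial` ends with a requirement still waiting for a later word.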
2.5 Inventory: Construction Grammar and what lives in the weights
The axes so far concern the per-layer operation — how a step combines, in what direction, maintaining how many structures, growing in time. A different axis concerns what the grammar contains. The mainstream generative tradition draws a sharp line between the lexicon (idiosyncratic, listed) and the grammar (general, rule-governed), and treats the rules as the locus of productivity. Construction Grammar — Goldberg, Croft, and in its usage-based form Bybee and Hilpert — denies the line. For Construction Grammar the basic unit is the construction: a learned pairing of form and meaning, at every level of granularity, from morphemes through words and idioms up to abstract argument-structure schemata like the ditransitive. There is no separate rule component; there is a single continuum from the wholly idiosyncratic to the wholly schematic, all of it stored, all of it the same kind of object, and the grammar is the inventory of these constructions. The framework is resolutely usage-based: constructions are entrenched by frequency, learned from data, with abstract schemata generalized over stored exemplars.
This is a description of the Transformer’s weights, and it lands at a different level from the previous four axes, which is its value. The other formalisms describe what a layer does to a representation as it passes through; Construction Grammar describes what is stored in the parameters that do it. And the match is close enough to be more than analogy. The feed-forward (MLP) layers of a Transformer have been read, in the interpretability literature, as a key-value memory: each MLP neuron responds to a pattern in its input (a key) and writes an associated vector into the stream (a value), and the layer is a large store of such pattern-response pairs retrieved by the current context. A stored mapping from a form-pattern to a content-contribution, retrieved by frequency-shaped activation, is — at the level of functional description — a construction. And the feature superposition of the residual stream, the near-orthogonal directions carrying many features at once, is the representational image of Construction Grammar’s central commitment: a continuum of stored units from the concrete to the abstract, with no categorical break between “lexical” and “grammatical,” because superposed directions admit every degree of specificity without a sortal boundary. The lexicon-grammar continuum is the feature continuum.
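The key-value reading of a feed-forward layer can be sketched in a few lines. This is a minimal illustration assuming a ReLU nonlinearity, no biases, and hand-picked two-dimensional weights (all hypothetical); `mlp_as_memory` is our own name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relu(x):
    return max(0.0, x)


def mlp_as_memory(x, keys, values):
    """Each neuron matches its key against x and, if it fires, writes its value."""
    activations = [relu(dot(k, x)) for k in keys]
    written = [sum(a * v[d] for a, v in zip(activations, values))
               for d in range(len(values[0]))]
    return activations, written


keys = [[1.0, 0.0], [0.0, 1.0]]      # the pattern each neuron responds to
values = [[0.0, 2.0], [3.0, 0.0]]    # what each neuron writes when it fires

x = [1.0, -1.0]
activations, written = mlp_as_memory(x, keys, values)
stream_after = [a + w for a, w in zip(x, written)]  # residual add

print(activations)   # [1.0, 0.0]
print(written)       # [0.0, 2.0]
print(stream_after)  # [1.0, 1.0]
```

Only the first neuron’s pattern matches the input, so only its stored value is retrieved and added to the stream; the second pair stays in the store, unused on this input.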
Construction Grammar is silent on the combinatory dynamics — it says what is stored, not how stored units are composed in a derivation — and so it is complementary to Dynamic Syntax rather than competing with it: Construction Grammar is the inventory axis, Dynamic Syntax the combination axis, and the architecture has coordinates on both. The bundled pair the architecture unbundles here is the lexicon/rule distinction itself — the generative tradition fuses “productive combination” with “rule component distinct from the listed lexicon,” and the Transformer has productive combination with no such distinct component, everything graded along one continuum of stored, retrievable units. That this is the most usage-based, most learned-from-frequency framework on our list, matched to the most thoroughly trained-from-data architecture, is no coincidence: it is where the formalism and the mechanism agree about epistemology, both holding that grammatical knowledge is generalization over experienced instances rather than a given rule system. We will return to that agreement, because it is also where the architecture’s deepest limitation lives.
2.6 Topology: Word Grammar and the grammar that is an activation network
The sixth axis is topology — the shape of the structural object the grammar builds: at one pole a phrase-structure tree, with constituents, brackets, and part-whole containment; at the other a flat network of word-to-word links, with no node that is a phrase rather than a relation. The Transformer sits at the network pole, and here the mapping barely needs an argument. An attention matrix is a weighted directed graph over positions; the biaffine scorer of the neural dependency parsers, which assigns a score to every candidate head-modifier arc, is structurally that same matrix; and the interpretability survey keeps rediscovering the fact from the data side — heads that specialize in particular dependency relations, while constituency probes fare worse, one cited study conceding the models barely acquire constituency at all. What the architecture builds is a network, not a tree. Generic dependency grammar sits at the same pole but is theoretically thin, nearer an annotation scheme than a theory of the object; as a coordinate it says little beyond “head-modifier links, no phrases.” Richard Hudson’s Word Grammar occupies the pole with a commitment that turns the shared shape into a genuine point of contact.
The commitment is that language does not produce a network, it is one. Where the other formalisms yield a structural object standing apart from the process that builds it — a tree, a derivation, a pair of strata — Word Grammar denies there is any such separate object. Knowledge of language is a single network of atomic nodes — word-types, sub-lexemes, the tokens of use, and an open-ended taxonomy of relational concepts (subject, sense, realisation) — joined by labelled directed links, with no phrase nodes anywhere; processing is spreading activation over the network, and generalization is default inheritance up the isa-hierarchy. The machinery is the knowledge-representation tradition’s — semantic networks, frames, inheritance — and at its furthest edge, in Lamb’s relational-network linguistics, a grammar is offered as something that could be wiring. This is the commitment no other coordinate makes. Dynamic Syntax (§2.4) makes grammar a process, but a symbolic procedure unfolding in time, with nothing said about activation; Construction Grammar (§2.5) says what is stored, but not that the store is a graph that lights up and fires into its neighbours. Word Grammar is the one formalism whose primitive object is an activation network over a weighted directed graph — which is, by construction, the Transformer’s kind of object.
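Reading an attention matrix as a weighted directed graph takes almost no code. This is a minimal sketch with a hand-written, causal (lower-triangular) matrix over three tokens; the numbers, the threshold, and the helper name `arcs` are hypothetical, and treating the attending token as the source of the arc is our own convention.

```python
tokens = ["the", "dog", "barked"]

# Row i holds the attention weights of position i over positions 0..i.
attention = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.1, 0.8, 0.1],
]


def arcs(attention, tokens, threshold=0.5):
    """List (attending token, attended token, weight) for strong non-self links."""
    found = []
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            if i != j and weight >= threshold:
                found.append((tokens[i], tokens[j], weight))
    return found


links = arcs(attention, tokens)
print(links)  # [('dog', 'the', 0.7), ('barked', 'dog', 0.8)]
```

The result is a flat set of word-to-word links with weights and no phrase nodes. Note also what is missing: each link is a bare number, with no label saying which relation it is.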
So the form is shared, and shared at a level the other formalisms do not reach: the residual stream as an activation pattern over the graph, attention as the per-layer scoring of the links, the flatness that follows from neither having phrase nodes, and a structure that grows only by adding links — Word Grammar never consumes a node, the residual stream never overwrites. The interesting facts are therefore not the agreements but the two places where the shared form is cashed out differently. They are independent of each other, and the section turns on keeping them apart.
The first is what bears the activation. In Word Grammar the activated nodes are reified concepts, and the network holds types and tokens together in one graph: the abstraction subject, the lexeme BOOK, the relational concept sense are themselves nodes that grow active and fire into their neighbours. Hudson’s evidence for the links is priming, one concept raising the activation of related ones, abstractions activating abstractions. The store of knowledge and the medium of processing are the same activated graph; program and data are not separated. The Transformer separates them. Activation rides only particulars: token positions, contextualized further up the stack but always “this token, here.” The abstractions live in the weights: a learned map engaged by every matrix multiply, but an operator applied to the activations, never a node that carries state forward or fires into a neighbour. (The weights are not inert, since each matmul engages them, but they are not activation-carrying; they transform the stream rather than flow in it.) The operator/operand split that Word Grammar refuses is the architecture’s organizing fact, and it is the same fact “Merge and Selection” located one rung down: the abstract relational apparatus (there the head and its governance, here the type and the relation) is kept out of the activated representation and held in the machinery that acts on it.
The second difference is the older one, now properly parallel to the first rather than containing it: what the substrate is made of. Word Grammar’s nodes are individuated and addressable — a node that is the subject relation, a link that is this dependency — and each link carries its type on its face. The architecture’s “nodes” are superposed directions in a vector space, and the superposition is the survey’s most robust structural finding read from the inside: a single relation smeared across many heads, a single head carrying several, agreement riding subspaces of as few as three dimensions and tied to no unit. An attention weight is a bare scalar with no label, its grammatical force — if it has one — implicit in which subspace the routed value lands in. The unlabelled gradient link is something like the pre-quantized form of Word Grammar’s discrete typed dependency, and it is the seam at which §3’s substrate re-enters.
Both differences show most sharply where Word Grammar would seem strongest — in its generalization mechanism, default inheritance, which does two jobs and does both, in Word Grammar, by consulting discrete addressable nodes in an explicit hierarchy. Neither crosses into the architecture as that. Subsumption — a token inheriting its type’s properties, the type its supertype’s — becomes geometry: Park, Choe, Jiang, and Veitch define the subordinate relation by token-set inclusion (Word Grammar’s isa written extensionally) and prove that the containment forces orthogonality — the parent feature’s vector orthogonal to the difference vectors distinguishing its children, hierarchical concepts encoded as direct sums of polytopes, inherited content in the shared parent direction and distinguishing content in an orthogonal complement, recursively up the tree; validated on Gemma and LLaMA-3 and, against the worry that this is a noun artifact, replicated on the verb hierarchy. The inheritance hierarchy has a real analogue, but one implicit in the angles between directions: where Word Grammar consults a hierarchy and copies a value down, the architecture arranges directions so the parent’s content is a shared component of the child’s. Override becomes additive cancellation: the IOI circuit’s negative name-mover heads write against the token they attend to, and Gurnee and colleagues’ suppression neurons — late-layer, the mirror of prediction neurons — together with partition neurons that boost one group while suppressing another, are the mechanism, unified with the self-repair (Hydra) line. 
But where Hudson now stresses that Word Grammar’s inheritance is monotonic — the more specific value is simply the one inherited, resolved at the moment a concept is created, so the default is never laid down and never retracted — the architecture does the opposite: it writes the default and a cancelling term and forwards the algebraic sum, the suppression clustered in the very last layers, after the affirmative predictions have accumulated through the middle. Nothing is selected over a node; two pushes are added and one dominates. This also closes the loop to §2.1: a strictly additive stream cannot retract, so the non-monotonicity exception-handling needs is bought not by deletion but by writing a vector that points the other way.
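Both halves of this paragraph, subsumption as geometry and override as additive cancellation, can be shown on hand-built vectors. This is a minimal sketch: the vectors are constructed to have the stated properties rather than estimated from any model, and the concept names, the "fly" readout direction, and the magnitudes are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Subsumption as geometry: children share the parent direction and differ
# only in directions orthogonal to it.
animal = [1.0, 0.0, 0.0]
dog = [1.0, 1.0, 0.0]
cat = [1.0, 0.0, 1.0]
child_difference = [d - c for d, c in zip(dog, cat)]
parent_vs_difference = dot(animal, child_difference)
inherited = (dot(animal, dog), dot(animal, cat))

# Override as additive cancellation: the default is written, then a second
# write pushes against it, and the readout sees only the sum.
fly_direction = [1.0, 0.0]
default_write = [2.0, 0.0]       # "birds fly"
suppression_write = [-3.0, 0.0]  # a later write for the exception

stream = [0.0, 0.0]
stream = add(stream, default_write)
logit_with_default = dot(stream, fly_direction)
stream = add(stream, suppression_write)
logit_after_suppression = dot(stream, fly_direction)

print(parent_vs_difference)     # 0.0
print(logit_with_default)       # 2.0
print(logit_after_suppression)  # -1.0
```

The default was laid down and is still a summand of the final state; the exception wins only because its write is larger and points the other way.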
So the formalism whose primitive object is closest in form to the architecture’s — an activation network over a weighted directed graph — diverges from it on two independent counts: what bears the activation, and what the substrate is made of. The abstractions Word Grammar reifies as activable nodes are, in the Transformer, the static operator; the discrete typed links it can point to are superposed gradient directions. Both divergences are the question §3 takes up — the medium in which structural commitments can stay un-quantized, and the abstractions can stay in the map.
3. The Substrate: Three Things at Once in One Medium
Six axes, and a pattern across them sharper than any single fit. The Transformer un-bundles selection from merge (§2.2), bundles position with function (§2.3), has well-formedness without a stored grammar (§2.4), has productive combination without a rule component (§2.5), and has a spreading-activation network without reified abstractions (§2.6). It matches no formalism because each formalism is a bundle — a correlation its single combinatory operation forces among the values on these axes — and the architecture’s coordinates cut across every bundle. The question that turns the tour into an argument is what, if anything, in that is proprietary to the architecture. The honest answer begins by conceding how much of it is not.
Two temptations have to be resisted, because each credits the gradient substrate with something the discrete formalisms can already do. The first is un-bundling itself. It is natural to say that a continuous state has “room” to be partly one thing and partly another where a discrete symbol does not — but un-bundling is a factorization, not a blurring: directionality lives in the query/key inner product and monotonicity in the residual write, two mechanisms at two sites, so their settings vary independently, and that has nothing to do with whether the medium is continuous. A wholly discrete architecture running selection and accumulation as separate steps would un-bundle the pair just as cleanly — and Word Grammar already does (§2.6), since drawing a typed arc never couples direction to consumption the way building a constituent does. The second temptation is the blend. Maintaining several incompatible structures at once is the genuine contribution of Gradient Symbolic Computation, and we will lean on it — but it too has symbolic precedent: Dynamic Syntax holds a node’s value open across completions, chart parsers carry packed forests, disjunctive feature structures hold alternatives in one description. Neither ingredient, taken by itself, is beyond the discrete side.
What is beyond it is the conjunction, done in one homogeneous medium with no machinery stipulated per axis. The architecture factors the axes apart, holds each at a graded rather than a committed value, and superposes the alternatives — all with a single repeated operation, attention-plus-residual-write iterated, rather than a separate device for each job. Every ingredient has a symbolic precedent; what the discrete side lacks is the precedent for all of them together, emergent from one substrate. This is where Gradient Symbolic Computation earns its place — not as a seventh formalism, and not because it “un-bundles better,” but as the theory of that medium. Two of its facts do the work, and the apparatus is laid out in full in “Merge and Selection.” A structure is represented as a blend: filler symbols bound to roles and summed, but with continuous activations, so a position may hold a superposition of partly-active symbols rather than one discrete filler, well-formedness being Harmony, a weighted sum of soft constraints the dynamics ascend. And discreteness is scheduled: a quantization weight q, raised over processing, drives the state off the blend toward the grid of discrete structures, while a temperature T fixes the regime — cooling toward zero converges on the single maximum-Harmony structure (optimization), holding T concentrates the equilibrium into a Boltzmann distribution over structures (sampling).
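The temperature half of that schedule can be made concrete with a minimal sketch, assuming a finite set of already-discrete candidate structures. The constraint names, weights, and scores below are hypothetical, and the quantization weight q is left out because the candidates here are discrete from the start; the sketch shows only how Harmony as a weighted sum of soft constraints turns into a Boltzmann distribution, and how cooling collapses that distribution onto the maximum-Harmony structure.

```python
import math

# Hypothetical soft constraints and their weights.
WEIGHTS = {"agreement": 2.0, "locality": 1.0}

# Hypothetical candidates, each scored on every constraint.
CANDIDATES = {
    "parse_A": {"agreement": 1.0, "locality": 0.0},
    "parse_B": {"agreement": 0.5, "locality": 0.5},
    "parse_C": {"agreement": 0.0, "locality": 0.0},
}

def harmony(scores):
    """Harmony as a weighted sum of soft-constraint scores."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())

def boltzmann(harmonies, temperature):
    """Probability proportional to exp(Harmony / T), computed stably."""
    top = max(harmonies)
    weights = [math.exp((h - top) / temperature) for h in harmonies]
    total = sum(weights)
    return [w / total for w in weights]

names = list(CANDIDATES)
harmonies = [harmony(CANDIDATES[n]) for n in names]

held = boltzmann(harmonies, temperature=1.0)     # sampling regime
cooled = boltzmann(harmonies, temperature=0.01)  # near the optimization regime

for name, h, p_held, p_cooled in zip(names, harmonies, held, cooled):
    print(f"{name}: H={h:.1f}  T=1.0 -> {p_held:.3f}  T=0.01 -> {p_cooled:.3f}")
```

At the held temperature the second-best candidate keeps a substantial share of the probability; near zero almost all of it sits on the single best one.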
This lets the fracture of “Merge and Selection” be restated rather than resolved. Selection — the query/key valence-matching — is the grammatical Harmony, the soft constraint; merge — the monotone residual write — is the substrate on which Harmony is maximized; and a soft blend of candidate daughters is what maximizing the constraints without quantization yields. Raising q sharpens which daughters are gathered, but cannot promote the selector into the gathered structure, since the query that does the selecting is not a summand. So the head’s absence is the one feature of the fracture the blend does not dissolve but carries through to the sharpest corner: the architecture lives in the blend, and even at its edges builds headless.
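The point that the selector is not a summand can be checked on a stripped-down single attention read. This is a sketch under loud assumptions: one head, no learned projections, hand-picked keys and values, and a hypothetical `sharpness` multiplier standing in for the sharpening that raising q produces (it is not GSC's quantization dynamics). What it shows is narrow: the query enters only through the weights, and the gathered output is a weighted sum of the values however sharp the selection becomes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gather(query, keys, values, sharpness=1.0):
    """Return (weights, output). The query sets the weights and nothing else."""
    weights = softmax([sharpness * dot(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical sources: two keys and the two values they expose.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [2.0, 1.0]

soft_w, soft_out = gather(query, keys, values)
sharp_w, sharp_out = gather(query, keys, values, sharpness=20.0)

print("soft :", soft_w, soft_out)
print("sharp:", sharp_w, sharp_out)
```

Sharpening moves the output toward the first value vector; at no setting does the query's own content appear among the terms being summed.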
One clarification the section turns on, because the metaphor is seductive: what is superposed is structures, not formalisms. The GSC blend is a point in a representation space, and points can be convexly combined; a “blend of CCG and HPSG,” or of the corners the six formalisms name, is not a state any metric describes — there is no vector that is half-application and half-unification. Two of our coordinates are superpositions in the genuine, structural sense: Word Grammar’s unlabelled gradient link (§2.6) is a binding held at low q, not yet quantized to a typed dependency, and LFG’s collapsed strata (§2.3) are c-structure and f-structure superposed in one state rather than projected onto two. The axes coming apart is a different, non-metric fact — separate mechanisms, not a point between corners — and conflating the two is what made the substrate’s role look more mysterious than it is. What blends is parses under a fixed scheme; what un-bundles is the machinery. GSC is the theory of the first, a substrate rather than a contestant precisely because it does not say which structure — only in what medium structures can be carried so that they need not be made discrete.
That locates what the continuous medium actually adds, and Dynamic Syntax measures it the way Word Grammar measured localism. Dynamic Syntax superposes too — but its underspecification is a partial description over an enumerated set of completions, and the parse must end as one of them; the openness is provisional, a placeholder awaiting the input that fixes it. The GSC blend is a metric point that need answer to no discrete alternative at all, and need never resolve — the limit-cycle, non-settling trajectory “The Given and the Found” flagged, which visits the valid readings in turn and commits to none. That is the small and defensible thing gradience buys over symbolic superposition: not the holding of alternatives, which Dynamic Syntax also does, but a state under no obligation ever to become one.
The depth/breadth distinction “The Given and the Found” drew falls out of the same machinery, and it is where the GSC apparatus does its most concrete work. That essay separated two instruments of test-time computation: depth, iteration along a single trajectory, which entrenches whatever attractor has the state, and breadth, restarts from many initializations, which keeps several attractors alive and is the only remedy when collapsing to one is itself the failure. GSC says what these instruments are. Driving T to zero is the optimization regime — convergence to the single maximum-Harmony structure, one confident parse, depth. Holding T is the sampling regime — a distribution over high-Harmony structures rather than one, breadth. And the limit cycle — the non-settling trajectory just described — is the sampling regime realized temporally rather than across restarts. Depth and breadth are not two networks but one selection-plus-merge dynamics under two (q, T) schedules: a model that wants one confident parse cools toward zero, a model that wants to preserve alternatives holds temperature. The empirical separation of the knobs is the companion’s; the single dynamics whose schedule they are is GSC’s.
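The difference between the two instruments can be seen on a toy landscape. The sketch below is an assumption-laden illustration, not GSC's dynamics: the Harmony function is a hypothetical one-dimensional double well with two equally good maxima, the update is plain deterministic gradient ascent with no temperature, and the step size and starting points are arbitrary. It shows only that iterating longer from one start stays with one attractor, while restarts from several starts recover both.

```python
def harmony_gradient(x):
    # Gradient of the toy Harmony H(x) = -(x**2 - 1)**2,
    # which has maxima at x = -1 and x = +1.
    return -4.0 * x * (x * x - 1.0)

def ascend(x, steps=200, rate=0.05):
    """Follow the Harmony gradient uphill from a starting point."""
    for _ in range(steps):
        x += rate * harmony_gradient(x)
    return x

# Depth: a single trajectory, iterated for a long time.
depth_result = round(ascend(0.4, steps=2000), 3)

# Breadth: the same dynamics restarted from several initializations.
starts = (-1.2, -0.4, 0.4, 1.2)
breadth_results = sorted({round(ascend(x0), 3) for x0 in starts})

print("depth  :", depth_result)
print("breadth:", breadth_results)
```

More steps along the single trajectory only confirm the attractor it already had; the second maximum shows up only once the starting point changes.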
4. What This Does Not Show
The argument has a shape that invites overreading, and the honest section is the one that says where it stops — which the linguistic-interpretability literature, having been our witness throughout, also equips us to write.
The first limit is the one the survey itself insists on and §1 opened with: recoverability is not computation. Everything we have called the architecture’s “coordinate” is, in the end, a claim about what is represented in the stream — what a probe could pull back out — and that is one step removed from what operation the layer performs. The whole essay is, in the survey’s terms, a piece of representational interpretation, and it inherits the survey’s caveats wholesale: that a recovered structure may be read off the linear context rather than built, that word-order statistics can masquerade as syntax, that the correlation between a probe’s success and the model’s use of what it probes is exactly what causal and amnesic probes were invented to question. When we say the residual stream is a gradient feature structure or that attention is application, we are characterizing the content and geometry of the representations, not proving that the forward pass is a parsing algorithm. The coordinate system locates what is there; it does not certify what the mechanism does with it.
This bites with particular force on the one result we leaned on most rigorously. The Park et al. geometry of hierarchical concepts — subsumption as orthogonal direct sums — is a theorem about the unembedding representation, the structure of the output space g(y), built on the causal inner product defined there. It is the strongest available evidence that an inheritance-shaped structure is present, and we are right to use it; but it is evidence about the geometry of what the model represents at the output, not unambiguously about the operation a mid-stack layer runs, and the inference from “the hierarchy is in the output geometry” to “the layers compute by inheritance” is one we have not earned and do not claim. The same applies, with less rigor available, to the suppression mechanism: that negative heads and suppression neurons implement something we are entitled to call override is our functional reading, consistent with the calibration and self-repair stories their discoverers tell but not framed in inheritance terms by anyone but us.
A second limit concerns the geometry’s scope. Park et al. validate their hierarchy on WordNet — taxonomic relations over nouns and verbs, lexical-conceptual hypernymy. That is real and it is broader than nouns, which disposes of the cheapest objection. But it is lexical-semantic taxonomy, not the grammatical category structure Word Grammar actually trades in: dependency types, valence frames, word-classes as syntactic-distributional objects. That “walk” subsumes “stroll” is troponymy of meaning; that “verb” subsumes a particular verb with its valence defaults is grammatical inheritance, and the geometry is demonstrated for the former and only conjectured by us to extend to the latter. The gap is not idle: whether grammatical categories live in the representation space with the same clean orthogonality as conceptual ones is precisely the syntax-versus-semantics question the survey kept finding unresolved, and we have no business assuming the answer.
A third limit is that some of the load-bearing mappings are ours by construction and flagged as such throughout: CCG-application-as-attention with its non-monotonicity, Dynamic-Syntax-as-causal-decoding, suppression-as-override. These are offered as the readings that make the coordinate system cohere, not as established results, and a reader is entitled to reject any one of them and keep the rest, because the thesis is the pattern of un-bundling, not any single axis. If CCG turns out to map more cleanly than we think, or Dynamic Syntax less, the claim that the architecture’s coordinates cut across the formalisms’ bundles survives the loss of a coordinate.
And a fourth, which is less a limit than a debt. The most consequential thing the coordinate system cannot locate is the one “The Given and the Found” made its center: where the correctness condition comes from. Every formalism on our list, and the GSC substrate beneath them, presupposes a grammar — a set of soft constraints, a Harmony function, an inventory of constructions, an inheritance hierarchy — that is given. The architecture’s coordinates tell us how it represents and combines structure under a grammar it has learned; they are silent on how the grammar was found. Construction Grammar’s usage-based epistemology (§2.5) comes as close to addressing acquisition as anything on our list, and it is suggestive because it shares the architecture’s commitment to learning grammar from instances — but “the constructions are entrenched by frequency” is a description of the result, not a mechanism for the discovery of which form-meaning pairings are there to be entrenched. Word Grammar (§2.6) makes the same commitment in network form: its grammar is a learned record of past usage — new nodes presumably introduced when a token is too poorly connected to existing types to be absorbed by them, and generalization left to what Hudson calls the mind’s spotting of generalisations and creating of super-categories. That, too, names the outcome, not the induction or abduction that would select which super-category to posit; the discovery is as unspecified as Construction Grammar’s, and the two formalisms nearest the architecture’s own usage-based epistemology stop at exactly the same line. This is the boundary that essay drew between amortizing search over a given condition and amortizing the discovery of one, and it falls in the same place here: the coordinate system is a map of how the Transformer parses under a learned grammar, and the learning of the grammar — the induction of the constraints whose blend §3 describes — is off the edge of the map.
5. The Shape of the Answer
We began by retiring a question — which formalism the Transformer is — and the retirement was the substantive move, because the question presupposes that the architecture sits at one of the corners the formalisms name, and it does not. It sits in the interior. Each of the six traditions fixes an axis of grammatical design and stakes out a corner along it: HPSG monotone-and-symmetric, CCG directional-and-consuming, LFG two-stratal, Dynamic Syntax grammar-as-process, Construction Grammar inventory-as-continuum, Word Grammar localist-network. The Transformer’s coordinates cut across every bundling these corners represent. It takes CCG’s directionality and HPSG’s monotonicity — which neither of those two formalisms can hold at once — by running them as separate mechanisms. It collapses LFG’s two strata into one stream. It has Dynamic Syntax’s grammar-as-process without a stored grammar, Construction Grammar’s continuum without a rule component, Word Grammar’s network without discrete types. The pattern is less that it resembles one formalism more than the others than that it un-bundles what each of them fuses — and, in the stratal case, fuses what one of them splits, the same pattern seen from the other side: the architecture’s joints are simply not the formalisms’ joints.
And what lets it sit in the interior is a conjunction, not a single trick. Un-bundling is factorization — separate mechanisms for separate axes — and asks nothing of a continuous medium; Word Grammar holds directionality and monotonicity together discretely. What the gradient substrate adds is the rest of the conjunction: each factored axis held at a graded rather than a committed value, the alternatives superposed, and — the one thing symbolic superposition cannot match — a state under no obligation ever to resolve to one of them. Gradient Symbolic Computation is the theory of that medium, not a seventh corner: it says not which structure the parser builds but in what medium structures are carried, quantizing toward a discrete corner only as the schedule demands, which at a Transformer’s interior positions it largely does not. The parser lives in the blend, and what blends is structures — not the formalisms, whose joints it instead holds apart.
Which gives the title its sense. A parser without a grammar is not a parser that lacks structure; it is a parser with no discrete, stored, declarative grammar object of the kind every formalism on our list supplies — only a gradient field of soft constraints learned into the weights, over which a monotone, directional, single-stratum, in-time, inventory-driven, networked, distributed dynamics runs, occupying at every interior position a superposition of the parses the discrete formalisms would each have it commit to. The structure is real. What is missing is the commitment. And the parse, if we insist on one, is something we quantize out of the model’s state by reading it discretely — a corner the dynamics passed near, not a tree it built, and not a grammar it has.
This article was co-authored by Łukasz Stafiniak and Claude (Opus 4.7). It continues the series on mind, metaphysics, and artificial cognition published at lukstafi.github.io and syndicated at lukstafi.substack.com. It grows directly out of “Merge and Selection: The Residual Stream as a Constraint Grammar,” whose monotone-merge reading, HPSG comparison, headless-merge fracture, and engagement with the explicit-binding (TPGN) program it takes as its starting point; and it follows “The Given and the Found: What Test-Time Reasoning Amortizes, and What It Cannot,” from which the depth/breadth distinction and the given-versus-discovered boundary are drawn. The grammatical formalisms are treated from their standard sources: Pollard and Sag for HPSG; Steedman for CCG, with the brain-modeling result from Stanojević, Brennan, Dunagan, Steedman, and Hale (“Modeling structure-building in the brain with CCG parsing and large language models”); Bresnan, Dalrymple, and colleagues for LFG; Kempson, Meyer-Viol, and Gabbay for Dynamic Syntax; Goldberg, Croft, and the usage-based line of Bybee and Hilpert for Construction Grammar; Hudson — drawing on his recent “Word Grammar” handbook chapter for the network ontology, the priming evidence, the monotonic-inheritance position, and the usage-based-learning commitment, with Lamb’s relational-network linguistics at the edge — for Word Grammar; and the Minimalist literature on Merge and Move for the transformational anchor. The Gradient Symbolic Computation material draws on Smolensky and Legendre’s “The Harmonic Mind,” on Smolensky, Goldrick, and Mathis’s optimization-and-quantization framework, and on Cho, Goldrick, and Smolensky’s dynamical incremental parser. 
The interpretability findings are from López-Otal, Gracia, Bernad, Bobed, Pitarch-Ballesteros, and Anglés-Herrero’s systematic review of linguistic interpretability; from Park, Choe, Jiang, and Veitch (“The Geometry of Categorical and Hierarchical Concepts in Large Language Models”) for the hierarchical-concept geometry; from Wang, Variengien, Conmy, Shlegeris, and Steinhardt’s IOI circuit for the negative name-mover heads; and from Gurnee and colleagues’ “Universal Neurons in GPT-2 Language Models” for the suppression-neuron taxonomy, with the self-repair connection to McGrath and colleagues’ Hydra effect. The mappings of CCG application onto attention, of Dynamic Syntax onto causal decoding, and of suppression onto inheritance-override are ours, and are offered as the readings that make the coordinate system cohere rather than as established results.