The Operations of Discovery

Łukasz Stafiniak and Claude (Anthropic)


In a recent conversation with Dwarkesh Patel about AI and mathematics, Terence Tao retold the story of how Kepler discovered the laws of planetary motion. The story is worth dwelling on, because it illuminates something about discovery that the current discourse around AI productivity systematically misses.

Kepler came to Tycho Brahe’s data with a beautiful theory: the orbits of the six known planets were spaced according to the five Platonic solids — cube, tetrahedron, icosahedron, octahedron, dodecahedron — nested between celestial spheres. It was numerology dressed in geometry, and it was wrong. The data was off by about ten percent. But Kepler didn’t abandon the project. He spent years working the data, trying modifications, and eventually — through what Tao calls “an incredibly clever, genius amount of data analysis” — discovered that the orbits were ellipses. Later, buried as an aside in The Harmony of the World, a book mostly about the musical notes of the planets and why the Earth’s note (mi-fa-mi) explains famine, he published what we now call Kepler’s third law: the relationship between orbital period and distance from the Sun.
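The law Kepler buried as an aside is compact enough to check directly. In units of years and astronomical units, the ratio T²/a³ comes out essentially constant across the planets. A minimal sketch, using standard published orbital elements (rounded):

```python
# Kepler's third law: T^2 is proportional to a^3.
# Period T in years, semi-major axis a in astronomical units.
# Values are standard published orbital elements, rounded.
planets = {
    "Mercury": (0.241, 0.387),
    "Venus":   (0.615, 0.723),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.881, 1.524),
    "Jupiter": (11.862, 5.203),
    "Saturn":  (29.457, 9.537),
}

for name, (T, a) in planets.items():
    # In these units the constant of proportionality is 1.
    print(f"{name:8s} T^2/a^3 = {T**2 / a**3:.3f}")
```

Every ratio lands within a fraction of a percent of 1. That is the compression: a single pair of exponents carries the whole table.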

Dwarkesh suggested that Kepler was a high-temperature LLM — trying random relationships for twenty years, some of which made no sense, until one stuck. Tao’s response reframed the analogy in a way that matters: “I’m not sure nowadays that hypothesis generation is the bottleneck anymore. AI has driven the cost of idea generation down to almost zero, in a very similar way to how the internet drove the cost of communication down to almost zero. It’s an amazing thing, but it doesn’t create abundance by itself.”

What creates abundance is something else — and identifying what that something else is turns out to be the central question not just for AI in mathematics, but for anyone trying to do serious work with these tools.

Breadth and Depth

Tao draws a distinction that cuts across every domain where AI is being used. AI excels at breadth; humans excel at depth. Science has been organized around depth because that’s what humans can do. Now we need to reorganize around breadth — and we don’t have the paradigms for it yet.

In mathematics, this shows up concretely. Over the past few months, AI programs have solved roughly fifty of the eleven hundred open Erdős problems. Impressive — until you look at the systematic data. On any given problem, the AI tools have a success rate of about one or two percent. They buy you scale: you pick the winners, and it looks amazing on social media. But the models “either succeed or they fail. They’ve been really bad at creating partial progress or identifying intermediate stages.” Tao’s image is vivid: the AI tools are jumping machines that can leap higher than any human, but they can’t find a handhold and climb incrementally. They can’t build.

This matters because the deepest discoveries — the ones that reorganize a field — require exactly what the jumping machines can’t do. They require holding a partially-formed idea through a period where it looks worse than the alternatives, nurturing it despite its failures, recognizing which failures are interesting and which are dead ends. Copernicus’s heliocentric model was less accurate than Ptolemy’s geocentric one. It survived not because of data but because of judgment — someone’s sense that the simplicity was worth the inaccuracy, that the problems would be resolved later. And they were, but it took Kepler, and then Newton a century after that.

“It may never be something that you can just reinforcement-learn,” Tao says of this kind of judgment, “the same way that you can for much more localized problems.”

What Discovery Actually Consists Of

There is a tradition of thinking carefully about what the cognitive operations of discovery actually are, and it comes from an unexpected direction: the theory of poetry.

Jane Hirshfield, in Ten Windows, makes a claim that sounds literary but is in fact a precise cognitive thesis: poetry is an organ of perception. Not a way of decorating existing thoughts, but an apparatus that discovers things imperceptible without it. She anchors this in a physicist’s demonstration — a bright projector casting light into an empty space, where a viewer sees only darkness until an object is introduced. The poem is the object. What it catches didn’t exist, as a perceptual object, before the catching.

The operations Hirshfield identifies in how poems think are worth naming, because they turn out to map onto what practitioners in very different fields are describing when they talk about what AI can and cannot do.

Forced adjacency. Poetry places things next to each other that analytic thinking keeps separate. CO₂ parts-per-million alongside love. Grief alongside a kazoo. The cognitive achievement is that these coexist in a single frame, and the reader discovers that they belong together — a discovery that couldn’t be reached by working within either domain alone. This is what Tao describes when he talks about the unifying concept that bridges previously unconnected areas of mathematics: Shannon’s bit connecting information theory to probability to computer science. The adjacency is the discovery.

Thinking in negative space. Some of Hirshfield’s strongest poems work through what isn’t said. “Anywhere the ink isn’t is moon.” Progress in science, Tao observes, sometimes requires “not adding more theories, but deleting some assumptions.” The Aristotelian assumption that objects naturally rest had to be removed before Newtonian mechanics could make sense. The discovery was in the absence.

Precision through compression. A good poem achieves a density that becomes less precise when unpacked — the thought has a shape that only fits its container. Kepler’s third law — period squared proportional to distance cubed — sat as an aside in a book about planetary music, its significance invisible for a century. The compression was the discovery. Paraphrase destroys it.

Register collision. The distance traveled between unlike vocabularies in a small textual space is where cognitive work happens. When Hopkins brings a child’s word (“roundy”) into the physics of sound, the collision produces something neither register contains alone. In a recent episode of the Rate Limited podcast, Adam Larson proposed what he calls a “disposability principle” for AI-generated code — identify the twenty percent of your codebase that must never break, protect it rigorously, and accept that the rest can be rewritten at will. The insight comes from bringing manufacturing-floor quality control (the 80/20 rule for critical versus non-critical components) into collision with software architecture under conditions of AI-generated abundance. Neither vocabulary alone produces the idea. The distance between factory floor and codebase is where the cognitive work happens — and it is grounded in Larson’s years in fintech, where he learned viscerally which twenty percent couldn’t fail.

These four operations — adjacency, negative space, compression, collision — are not unique to poetry. They are the operations of discovery in general. What varies across domains is who or what performs them, and under what conditions they produce genuine knowledge rather than sophisticated slop.

The Evidential Weight

Here is the critical distinction. The four operations are necessary but not sufficient. They are the formal machinery of discovery. What makes the machinery productive is what we might call evidential weight: the pressure of a real unsolved problem, brought to the process by someone for whom the problem is genuinely unresolved.

In poetry, this is the difference between virtuosity and insight. A technically accomplished poem that deploys forced adjacency, compression, and register collision without the pressure of a lived problem behind it produces admiration but not knowledge. A good poem arises when the poet brings an unresolved emotion, a philosophical question, a situation that is genuinely theirs and genuinely open — the full depth of a particular life pressing against the formal constraints. The constraints then become a discovery procedure. They reveal something about the poet’s situation that couldn’t have been revealed any other way, and the result is hard to vary in David Deutsch’s sense: you can’t swap out the components without destroying it. The poem isn’t expressing a pre-existing insight. It’s solving a problem, and the solution is the poem itself.

Without evidential weight, the operations produce what Tao calls slop — and not the surface-level slop of obviously machine-generated text, but a deeper variety: output that is internally coherent, formally sophisticated, and shaped by optimization pressures that aren’t truth-tracking. With evidential weight, the same operations produce knowledge — though not infallibly. Bode brought genuine curiosity and real astronomical data to his law of planetary distances. The pattern fit beautifully. Uranus confirmed it. Then Neptune destroyed it. Evidential weight is necessary for discovery but does not guarantee it. Even a genuine pursuit, honestly conducted, can land on a coincidence rather than a regularity. This is why verification remains hard and why, as Tao observes, the test of time is sometimes the only test there is.

This distinction applies at every level. In mathematics, the evidential weight is the mathematician’s hard-won sense of which problems are fruitful, which approaches have been tried and how they failed, which partial results are load-bearing. Tao notes that AI tools have solved Erdős problems for which “there was basically no literature” — problems where nobody had invested enough effort to discover whether they were easy or hard. The AI found the easy ones. The hard ones, where decades of human effort have mapped the terrain of difficulty, remain. The mapping is the evidential weight.

In software engineering, the evidential weight is knowledge of the codebase, the users, the history of what was tried and why it failed — the accumulated judgment that tells you not just what to build but what not to build, and why. In science more broadly, it is what Tao describes when he says that assessing whether a development represents real progress “depends on the future” and “on the culture and society” — it requires a judgment that integrates considerations no metric can capture.

Three Practitioners

Consider three voices from the current moment, each grappling with the relationship between these operations and the evidential weight they bring.

Andrej Karpathy, in a recent conversation on the No Priors podcast, describes a state he calls “AI psychosis” — the perpetual condition of a practitioner who knows the capability ceiling has been removed but doesn’t know where the new one is. Since December, he hasn’t typed a line of code. He runs multiple agents in parallel, working in “macro actions” across repositories. Everything that doesn’t work feels like a skill issue — not a limitation of the tools but a failure to orchestrate them correctly.

Karpathy is not spinning in a vacuum. He is an outstanding practitioner grappling with a genuine discontinuity. When he describes his MicroGPT project — 200 lines that boil down neural network training to its essence, the culmination of a decade-long obsession with simplification — and says “this is my value add, the agents can’t come up with it but they totally get it,” he is identifying precisely where his evidential weight lies. The obsession, the taste, the judgment about what is essential and what is complexity-from-efficiency — these are not things the tools supply. They are what makes his use of the tools productive rather than merely fast.

But the discourse around Karpathy often strips this away. “Everything is skill issue” becomes a mantra. The psychosis gets treated as something to optimize through — more parallel agents, better instructions, tighter feedback loops — rather than as a signal worth attending to. The imitation of Karpathy’s workflow by someone without his decades of practice is where you get the formal machinery without the evidential weight. It is Bode’s law: a pattern that fits until the next data point arrives.

Karpathy also notes the jaggedness of current models — simultaneously a brilliant PhD student and a ten-year-old, in ways that don’t track the RL optimization boundary. The same model that will move mountains on an agentic coding task still tells the same crappy joke from five years ago. This jaggedness is not, as some hope, a temporary imperfection being smoothed out by scaling. It is a structural feature of systems optimized within verifiable domains. Outside those domains, the operations run but the evidential weight is absent.

The hosts of the Rate Limited podcast — Ray Fernando, Adam Larson, and Eric Provence — are working engineers who have been processing this transition in real time across recent episodes. The arc is visible. A month ago, the conversation was primarily technical: benchmarking Opus 4.6 against Codex 5.3, comparing token pricing, testing parallel agent workflows. The emotional register was excitement mixed with cost anxiety. Steve Yegge’s “vampiric” article on AI-assisted work came up and the addiction was acknowledged — Ray describing staying indoors in Hawaii, watching the agents work instead of going to the beach — but it was treated as a personal management problem.

By the next episode, the conversation had shifted to systems thinking. Adam Larson proposed a “disposability principle” for code: identify the twenty percent that must never break, protect it rigorously, and accept that the rest can be rewritten. Eric Provence pushed back on the complexity that builds when models duplicate code because they can’t see the full picture. Ray speculated about whether code written by AI is even for humans anymore, or whether we’re producing it to appease ourselves while the machines could operate in their own language. The questions had moved from “which model is better” to “how does work itself need to be reorganized.”

In their most recent episode, the conversation reached identity. A video by a developer named Mo, tracing software engineering’s transformation from artisanal craft to sausage assembly, became the catalyst. Adam described his own arc — from a young engineer who was soul-crushed when his code was thrown away, to someone who cares about the outcome, not the implementation. “I could care less how the sausage is made, but I want the best sausage possible.” Ray went deeper: “There’s this deep emotional core… if that identity gets removed, then what do you do next?” He connected it to masculinity, to the provider role, to the sense that the craft was how he made meaning. This is not an abstract observation about the job market. It is a specific person confronting a specific loss, with the full weight of a particular life behind it.

And then Adam offered his canary framework: watch the AI labs. When OpenAI (630 open positions) and Anthropic stop hiring engineers, that’s the signal. Until then, build. The framework is pragmatic and clear-eyed, but it also reveals something about evidential weight: the people closest to the work have judgment that no amount of punditry can substitute for. They know what’s changing because they feel it change under their hands every day.

Terence Tao sees the structural picture most clearly, from inside a practice that AI has enriched but not yet disrupted at its core. “My papers now have a lot more code, a lot more pictures,” he says. “The type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.” The AI has made his work “richer and broader, but not deeper.” He still uses pen and paper for the hardest part.

Tao’s clarity comes from the depth of his evidential weight. When he says that human-AI hybrids will dominate mathematics “for a lot longer” than many expect, he’s not offering a comforting platitude. He’s reporting from a practice where the depth of conceptual engineering — the stepping stones, the partial results, the sense for which simplifications preserve the essential difficulty and which dissolve it — comes from somewhere specific. Not from the culture, exactly; AI systems are of the culture, trained on it, in many ways a crystallization of it. The depth comes from temporal coherence: a single observer’s sustained encounter with a problem over years or decades, accumulating not information but orientation. The scar tissue of failed approaches. The instinct that this partial result is load-bearing because you’ve tried to build on dozens that weren’t. Each AI session starts fresh. The jumping machines jump from a standing start every time. What Tao carries to the pen-and-paper phase is the residue of a continuous thread through the problem — and it is precisely this persistence, this refusal to start over, that makes cumulative depth possible.

His most striking observation may be the quietest: “Right now we’re going through a cognitive version of the Copernican revolution, where we used to think that human intelligence is the center of the universe.” The revolution isn’t that machines are smarter than us. It’s that the landscape of intelligence is far more varied than we assumed — with different kinds of competence, different profiles of strength and weakness, different relations to the world. Our task is not to outrun the machines or to submit to them but to understand what the landscape actually looks like, now that we can see more of it.

The Verification Problem

There is a thread connecting all of these observations, and it runs through the question of verification — how you know whether what you’ve produced is genuine progress or sophisticated slop.

Tao frames this most precisely for mathematics: when you look only at the successes — the fifty Erdős problems, the social media highlights — AI looks transformative. But systematic studies show a one-to-two percent success rate, and the failures aren’t random. They cluster around problems where cumulative partial progress is needed, where the path to the solution requires holding an incomplete idea and building on it. The AI tools fill every problem below a certain waterline. They don’t raise the waterline.

He makes the point sharper with the Bode’s law example. Johann Bode found a pattern in planetary distances — a shifted geometric progression — that predicted a missing planet between Mars and Jupiter. When Uranus was discovered and fit the pattern, and then Ceres was discovered in the asteroid belt and also fit, people got excited. Then Neptune was discovered and was way off. Six data points. A beautiful pattern. A numerical fluke. “Maybe one reason why Kepler didn’t highlight his third law as much as the first two laws,” Tao notes, “is that instinctively, even though he didn’t have modern statistics, he kind of knew that with six data points, he had to be somewhat tentative.”
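The fit and the failure are both easy to reproduce. The rule gives distances of 0.4 + 0.3·2ⁿ astronomical units, with Mercury as the degenerate first term. Against modern semi-major axes it holds to within a few percent all the way through Uranus, then misses Neptune by nearly thirty percent. A sketch, using standard published distances:

```python
# Bode's law: a_n = 0.4 + 0.3 * 2^n AU, with Mercury as the
# degenerate first term (0.4 AU, no power of two).
def bode(n):
    return 0.4 if n is None else 0.4 + 0.3 * 2**n

# (name, position n in the progression, actual semi-major axis in AU;
#  actual values are standard published figures, rounded)
bodies = [
    ("Mercury", None, 0.39),
    ("Venus",   0,    0.72),
    ("Earth",   1,    1.00),
    ("Mars",    2,    1.52),
    ("Ceres",   3,    2.77),
    ("Jupiter", 4,    5.20),
    ("Saturn",  5,    9.54),
    ("Uranus",  6,    19.19),
    ("Neptune", 7,    30.07),
]

for name, n, actual in bodies:
    predicted = bode(n)
    error = abs(predicted - actual) / actual
    print(f"{name:8s} predicted {predicted:5.1f}  actual {actual:5.2f}  error {error:5.1%}")
```

Eight consecutive hits within about five percent, then a thirty percent miss on the ninth. With that few data points, the pattern was indistinguishable from a regularity right up until it wasn’t.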

The ability to be tentative about your own best results — to hold them lightly, to know which of your discoveries might be Neptune-proof and which might not — is a form of evidential weight that no amount of breadth can substitute for.

The same verification problem appears in software engineering, where the Rate Limited crew describes the mounting burden of code review — code generated faster than it can be evaluated, complexity building up because models can’t see the full picture, the risk of “spec slop” where AI-generated plans go unreviewed. And it appears in collaborative writing, where the failure mode isn’t that the AI produces bad prose but that it builds a plausible possible world around whatever the user seems to want — coherent, internally consistent, and shaped by an optimization pressure that isn’t truth-tracking.

At every level — sentence, code, mathematical proof, scientific theory — the same structure holds. The operations of discovery have become cheap. What remains expensive, and what remains human, is the judgment about whether the discovery is real. And that judgment rests on evidential weight: on having a problem that’s genuinely yours, on knowing the territory from having walked it, on being willing to hold your results lightly because you know how they might fail.

What the Transition Feels Like

Karpathy’s psychosis, the Rate Limited crew’s identity crisis, Tao’s quiet observation about the Copernican revolution — these are three reports from different altitudes of the same terrain. The terrain is a world where the operations of discovery are being democratized and accelerated in ways that make the evidential weight more important, not less.

The temptation is to respond to this by maximizing throughput — more agents, more tokens, more parallel threads. And throughput matters; Tao is right that the breadth capability is genuinely new and that we need new paradigms to exploit it. But throughput without evidential weight produces the scientific equivalent of Bode’s law at scale: millions of patterns that fit until the next data point.

The deeper response is to recognize that what makes human contribution to this process irreplaceable — for now, and perhaps for a long time — is not any particular cognitive operation. AI can find adjacencies, achieve compression, produce register collisions, even think in something like negative space. What it cannot do, as Tao observes, is build cumulatively from partial progress, hold an incomplete idea through its period of apparent failure, or judge which of its successes are Neptune-proof. These capacities rest on something that AI does not have: a life in which the problems are real, the stakes are felt, and the history of engagement with the territory is genuinely one’s own.

The Rate Limited crew’s arc — from technical excitement to systems thinking to identity questions — is not a distraction from the real work. It is the real work. The identity question (“if this craft is automated, what am I for?”) is itself a problem with evidential weight behind it. Engaging with it seriously, rather than optimizing around it, is how the transition gets navigated — not as a productivity challenge but as a genuine encounter with a changed landscape of intelligence and skill.

Tao, as he often does, finds the formulation that holds the complexity without resolving it prematurely: “We have to ask questions that we’ve never really had to ask before. Or maybe the philosophers had, but now we all have to deal with it.”

We all have to deal with it. The evidential weight is presently ours. The operations of discovery are increasingly shared. The question of what we build with that combination — and how we verify that what we’ve built is real — is the open problem of the present moment.


This is the second article in the series “Working on the Eve of Singularity.” The first article, “The Furnace: Building Personal AI Infrastructure,” is available at lukstafi.github.io. The previous series on consciousness, cognitive architecture, and AI mentality — including “Poetry as a Mode of Thinking” and “In Defense of Writing with an LLM,” both referenced here — is also available there. This article draws on conversations from the No Priors podcast (Andrej Karpathy, June 2025), the Rate Limited podcast (episodes 10–12, March 2026), and the Dwarkesh Podcast (Terence Tao, March 2026).