Two frontier models, one hidden inefficiency, and what “human in the loop” actually means

One assumption, right for one thing and quietly wrong for another, and a whole economy starves around it.

Jun 06, 2026

Last week I put two of the leading frontier models on the same problem, hands-off, and let each one decide how to approach it. What I got was a multi-day diagnosis of a structural defect I had been chasing for months, and a much clearer view of how differently these models think. The thing that actually closed it came from somewhere neither model could reach on its own, and where it came from is the whole point.

What follows is the whole trail, wrong turns left in, because the wrong turns are where the interesting part lives, and because almost everything written about “human in the loop” gives you the slogan and never shows you what the human actually does in the loop. Three things come out of it. How differently two frontier models reason when you hand them the same task. What the human actually does in that loop, point by point, rather than the slogan version. And the thing underneath both that quietly decided whether any of it could work. The detail is all below, wrong turns included.

One thing up front, because it changes how you read the rest. The thing being debugged is an economy. xIt happens to live in a galaxy, but the failure underneath is the kind that shows up wherever independent actors each optimise locally and no one is steering the result. A supply chain. A marketplace. An energy grid. And it was never a single clean fault with a single clean cause. It was one wrong assumption buried deep, showing up as a tangle of compounding symptoms that never pointed cleanly back at it. That is exactly why it survived months of tuning.

The machine

The problem only makes sense once you know the machine.

It runs as a simulation on CONSTELLATION, our platform for building and evaluating multi-agent systems. The sim is a research environment, and it sits under botarena.gg, the public game where bots compete on a live leaderboard. The galaxy itself is 64 planets across seven star systems, connected by lanes and served by a fleet of couriers that haul goods between them. Three things move through it. Food, which some planets produce and all of them consume to stay alive. Refined fuel, which planets burn to stay alive. And raw fuel, the upstream feedstock. Fuel-producing planets pump out raw fuel, and refinery planets convert it into the refined fuel everyone actually needs. Food is just food. Raw fuel is useless until it is refined.

The couriers are deliberately dumb. They don’t care whether a planet lives or dies. Each one takes whatever trade pays best, and the payoff on a trade is simple.

profit = (sell price × quantity) − (buy price × quantity) − travel cost

The part that turns out to matter most is that the sell side is quantity times price, not price alone. A courier moving a hundred units at ten dollars books a thousand. A courier moving one unit at five hundred dollars books five hundred, even though the unit price is fifty times higher. A high price on a tiny quantity loses to a normal price on a full load. The couriers are pure local optimisers chasing the biggest profit on the next trip, with no view of the system they are feeding.

Planets set their own prices on plain supply and demand. Scarcity pushes a price up, surplus pulls it down. The intelligence sits in one place. Each system has a hub, and an AI-built governor controls that hub’s prices. Its job is to keep goods flowing into its system, with price as its primary lever. That is where the tension lives, because price has to do two opposing jobs at once. High enough to pull profit-chasing couriers in with supply. Low enough that the planets inside the system can afford to survive. Lean too far either way and planets starve.

Two facts to carry forward. The only thing a hub can do is move a price. The only thing a courier responds to is the profit on the next trip, and that profit is driven by quantity, not unit price alone. Hold those together and the whole failure falls out of them.

The setup

Game 1 was a single bot learning the control problem. This is Game 2. Multi-bot, competitive, real-time. Seven governors, each pricing one hub, all sharing the same galaxy and the same pool of couriers at the same instant.

So I put two of the best models head to head, ChatGPT 5.5 against Claude Opus 4.8, deliberately hands-off. I pointed each one at the arena’s skill file and let it decide what to build.

“i want to go to the claude app and point it at the botarena.gg skill.md file and let claude work out what to do, same with chatgpt”

The point was partly to seed the leaderboard with a few different bots, but mostly it was a technical smoke test. How would multiple independent agents actually compete in the same live galaxy at once, built by different models, and what behaviours would emerge. What I got instead was a diagnosis, a structural defect I had been chasing for months, and a much clearer view of how differently two frontier models reason when you hand them the identical task.

Before either one competed, they both made the same call

I let each model design its own agent, and independently they reached the same conclusion. Don’t put an LLM in the control loop. For a fast, deterministic control problem, a tight heuristic beats asking a model to re-reason every tick. ChatGPT was explicit, and even sketched the only LLM design it thought worth the latency. A deterministic controller running every handful of ticks, with an LLM strategic supervisor stepping in occasionally to adjust the regime, never to set prices.

“A tight, reproducible loop is often better than asking a model to re-reason every tick.”

Two frontier models, asked to compete, both reasoned themselves out of the per-tick loop. That alone cuts against the reflex to throw a model at everything. But the divergence that mattered was not what they built. It was what each of them did when the galaxy started misbehaving.

They saw the same truth. They did opposite things with it.

This is the part I most want to be fair about, because it would be easy to turn it into a Claude highlight reel and it is not one.

Both models, watching their telemetry, independently put a finger on the conceptual key to the whole problem. Raw fuel was being treated as if it were a survival good when it is really just refinery feedstock. ChatGPT wrote it down plainly.

“The first serious strategic mistake was treating raw fuel as a normal downstream survival resource… Raw fuel is mainly refinery feedstock unless a sector planet actually consumes it.”

That is the same realisation the entire fix eventually rested on. ChatGPT got there on its own, through patient observation. And then it made a sound engineering decision. Price raw fuel low, stop letting it pull couriers away from food and refined fuel, and get on with playing its game well inside the rules as they were.

Claude reached the same observation and could not let it sit. Where ChatGPT saw a strategy to adjust, Claude saw a symptom of something wrong with the environment itself, and it kept pulling on the thread. Same correct insight, two completely different dispositions toward it. Accept and optimise, against interrogate and fix. Both are legitimate. Most of the time accept-and-optimise is the right call, because you usually cannot go and rewrite the engine. This was one of the rare times rewriting the engine was the point. The difficulty is that nothing tells you which kind of situation you are in until you have already committed to one reading or the other. Treat a real defect as a strategy problem and you optimise around it forever. Treat a strategy problem as a defect and you rewrite an engine that was fine.

I got frustrated with the one that was right

ChatGPT got straight to work. Build the controller, run it, read the telemetry, tune. Exactly what I had asked for.

Claude kept circling, kept asking questions, and it irritated me. My instinct was to tell it to stop stalling and hand me the harness, because it was supposed to be racing ChatGPT. But it would not drop the thing it had noticed. The shape of what it kept saying was that it could build what I was asking for, but something was not sitting right, and it thought we needed to find the real problem first.

The whole story turns on the moment I backed off that instinct and listened. And I want to be honest about what my own contribution actually was. Mechanically, I was shuttling context between the two models and keeping the conversation moving. That part is plumbing. The real work was judgment. Deciding which thread was worth pulling, correcting the model when it slid off the economic rails, and twice overruling it when it wanted to stop or had talked itself into the wrong answer.

The investigation, with the dead ends left in

Claude’s first good instinct was to stop trusting the logs.

“I’ll investigate properly by going to the code, since the prior analysis was stuck on ‘we can’t tell from the logs.’”

And then it was wrong, repeatedly, before it was right. This is the part the clean write-ups always hide, and it is the most useful part.

The first wrong turn. It spent the opening stretch building toward a tidy answer. The economics doc describes a treasury-backed price cap that was never actually implemented in code, so that must be the fix. I told it price was the wrong lever. It dropped the idea without defending it, which is its own small tell about how these things reason now.

“Price is the wrong lever, confirmed in the code… I was about to land on ‘the economics.md treasury price-cap isn’t implemented’… But you’ve correctly identified that fixing that wouldn’t matter.”

Then it tried to quit. It became convinced the failure was a hard distance and reachability ceiling. Environmental, untunable, nothing to be done. The recommendation on the table was to stop tuning, because this was just the environment. I killed it in four words.

Me: “all planets are reachable” Claude: “I’m wrong then… that changes the conclusion materially. Let me reopen it.”

Sit with the shape of that. The model’s most confident moment was its most wrong, and a one-line fact about the world reopened the whole investigation. Not a clever prompt. A thing I knew that it did not.

The instrument lied before the model did. One of the worst detours was not a reasoning error at all. Claude asserted that only a single courier ever even considered the starving planet, which would have been damning, then half-walked it, then retracted it entirely when the data turned on it.

Claude: “Both your challenges land. I had two wrong conclusions and the data kills them… ‘Only one courier considers venus’ was wrong, partly a logging artifact (my probe only saw venus when it won).”

The probe only logged a courier’s interest in a planet when that planet won the courier. A planet that was always considered and always lost looked identical to a planet never considered. The measurement manufactured the conclusion. Worth remembering the next time an agent shows you a confident chart.

Then it conflated distance with cost, and I corrected the economics directly.

Me: “The number of hops makes no difference. It’s about the cost of travel.” Claude: “You’re right, I conflated hops with cost… a courier takes the 4-distance/$10 trade over the 0.2-distance/$1 one every time. So ‘venus is far’ is not an explanation.”

The pivot

Two questions from me collapsed the last wrong frame. The working theory was that some greedy planet was hoarding supply and stealing the couriers. I didn’t buy it.

Me: “the venus issue is odd. venus is close to the hub!! jupiter is a producer of raw fuel?”

Both halves landed. The hoarder the heuristic kept flagging was jupiter, which produces its own enormous stockpile, so the instrument was fingering the biggest producer, not a thief. And the starving planet was close, so distance was never the gate. Every remaining explanation fell. That cleared the ground for the question that cracked it, and it was Claude that put the question on the table, as one of several next moves it offered me. Does an under-fed refinery actually cost health or survival, or is it just sub-optimal throughput. I told it to chase that one.

Claude: “An under-fed refinery costs nothing in health… The rising price is precisely ‘I’d run hotter if I could,’ not ‘I’m dying.’”

That reframed everything. These were not dying planets at all. They were perfectly healthy planets running cold refineries, screaming for an input they did not actually need to survive. And the moment it landed, Claude generalised it instantly and correctly, which is exactly the half of its character I had been too impatient to value an hour earlier.

This is worth being precise about, because it is the whole shape of the thing. Claude generated the sharp question here, and it had generated four confident wrong turns earlier with exactly the same conviction. It could not reliably tell its good proposals from its bad ones. That discrimination was my job. I killed “all planets are reachable” and “hops don’t matter,” and I backed the survival-versus-throughput question over the others on the table. A model that is confidently right and confidently wrong in the same breath is only useful if someone can tell the two apart in the moment. That telling-apart is not stenography. It is the judgment that moved this forward.

Claude: “this isn’t a venus bug, it’s a model defect that fires on every refinery-bearing planet in the galaxy.”

This is the two-sided thing about Claude, and the one I most want you to take away. It made several confident wrong assumptions, and human judgment is what kept pulling it back onto the road. But the disposition that produced those wrong turns, the refusal to let the anomaly go, is the same disposition that, once pointed in the right direction, diagnosed a galaxy-wide structural defect with real speed and precision. You do not get the second without tolerating the first.

The root, and why it hid for months

The engine priced every asset the same way. When stock runs low the price escalates, effectively without limit, because for a survival good a buyer’s willingness to pay really does approach “anything, just don’t let me die.” For food and refined fuel that is correct. Scarcity of a thing you die without should send the bid to the ceiling.

Refinery feedstock is not a survival good. It is an input the refinery turns into refined fuel and sells, so its value is capped at the marginal worth of the output. Bidding above that is buying at a guaranteed loss. The escalation logic did not know the difference, and that single uniform assumption, one pricing rule applied to every good, was the root of the whole thing.

What it produced was not one clean failure. It was a chain that compounded. A starved refinery’s bid ratcheted up past ten thousand, thousands of times the going rate. The first inflated delivery wiped its treasury. Once broke, the planet could only afford a sliver of the next delivery, because a courier’s payoff is volume times price, and a planet with no money can clear no volume. A sky-high unit price on three units booked the courier less margin than a normal price on a full load somewhere else, so the couriers rationally went elsewhere. The desperation signal was shrinking the very delivery it was meant to attract. High price times tiny volume is just as useless as low price, and starvation is exactly the state in which you cannot afford volume.

That is why no single cause was ever visible, and why months of my own tuning never fixed it. I treated it as a distribution problem and adjusted courier ratios and topology, and every time the trap simply moved to a different planet. Each symptom looked like its own bug. The pricing escalation, the treasury wipe, the affordability collapse, the courier avoidance, the logging artifact that hid it. Unpick any one and the others kept the loop alive. There was no clean cause and effect to point at, only a set of small effects feeding each other. The root was one layer below all of it, in a pricing assumption that was right for survival goods and quietly wrong for feedstock.

The fix, which nearly reintroduced the problem

The obvious fix is to anchor the feedstock bid to what the refined output is worth. But the obvious version of that is circular, and Claude caught it before writing a line of code.

Claude: “do NOT use the refinery’s own fuel_refined price. Venus is starved, produces no fuel_refined, its shelf is empty, its local fuel_refined price is sky-high, high feedstock bid, overpay, bankrupt, still empty. Same runaway, one layer removed. The right anchor: the market price of the output.”

Anchor it to a planet’s own output price, and a starved planet, with an empty shelf and a sky-high local price, just re-inflates its own bid and bankrupts itself again, one level of indirection down. The anchor had to be a reference the planet could not move by starving itself. The price of refined fuel at the nearest hub.

So the fix clamps the feedstock bid to the hub’s refined price scaled by refinery efficiency, the crack-spread derived value, rather than letting it escalate like a survival good. If the hub prices refined fuel at five dollars, a planet’s raw fuel bid sits below five rather than running away to five hundred. The runaway cannot start. No single delivery can bankrupt the planet, so it keeps the treasury to buy real volume, so couriers see a worthwhile total margin and keep coming. I pressure-tested whether that still let a genuinely short refinery ramp its bid enough to pull supply, then made the call.

Me: “the only nagging doubt I’ve got is does that let the refinery lift its prices enough? But then I think when it lifts its prices enough, it just ends up in the place where it is now… Let’s anchor to the hub price.”

The honest value of the fix is not “anchor to derived value” in the abstract. It is the specific choice of a reference that cannot feed back on itself. A refiner values crude at the market price of the output times its yield, never at its own empty-tank quote.

What changed

Same seed, same thousand ticks, the engine fix in and out. Planets alive went from 47 of 64 to 63 of 64. Deaths went from seventeen to one. Refineries surviving went from 19 of 34 to all 34. The long-lived, self-sustaining galaxy I had been chasing for months finally held.

I will be honest about the limits, because Claude was. Some of the headline numbers from the live competitive runs are confounded. A fix-versus-no-fix gap that looked decisive turned out to lean heavily on one hub simply sitting in a better neighbourhood, centrally located, cheap for couriers to reach, and near planets that produce their own food and raw fuel, where a remote system carries a travel-cost penalty and may produce little of its own. That advantage had nothing to do with the fix, and Claude flagged it itself rather than letting me overclaim. And there is a deeper open finding underneath the fix. Clamping the price stops the planet destroying its own ability to buy, but price was never a clean positive lever for pulling supply in the first place, because a courier chases the profit on the next trip and a starved planet cannot clear enough quantity to make that trip worth taking at any unit price. The clamp removes the self-inflicted wound. It does not, on its own, guarantee a thriving planet. That thread runs straight into the next experiments.

Two models, two ways of thinking

Read side by side, the two reasoning journals are the clearest illustration I have seen of how differently these models show up for the same job.

ChatGPT treated the galaxy as fixed and got steadily better at playing it. Its journal is disciplined empiricism. Name the mistake, smooth the controller, run a batch, read the distribution, resist the urge to over-read a single good result.

“Do not overfit from one good run.”

It also produced one of the sharpest strategic distinctions in either log, the difference between two goals that look identical until you measure them.

“The controller is currently good at ‘do not let local planets die’; the next aim is ‘do not keep them technically alive but chronically weak.’”

That is genuinely good control engineering, and the supervisor architecture it sketched is a better production design than anything Claude proposed. If I were shipping a governor tomorrow, I would want ChatGPT’s discipline in it.

Claude treated the rules as something to interrogate. Its journal reads like a debugging trail. Hypothesise, instrument, kill the idea when the data contradicts it, go to the source when the logs cannot settle it, and refuse credit it has not earned. It was confidently wrong more often than ChatGPT was. It was also the only one of the two that was ever going to find the problem, because finding it required treating the environment as suspect rather than as ground truth.

Neither disposition is better in the abstract. You want the empiricist when the system is sound and the job is to play it well. You want the interrogator when the system itself is lying to you. The skill that is quietly becoming valuable is knowing which situation you are in. For now, that is the human’s part.

What it actually teaches

The slogan is “human in the loop.” Here is what it looked like. The model supplied relentless mechanism-finding and a few confident dead ends. I supplied the economic priors, the corrections, and two overrules of moments where it wanted to stop or had convinced itself of the wrong thing. Neither of us reaches the answer alone. Take out those four words, "all planets are reachable," and the investigation ends in "this is just the environment, accept it." But take out the model's refusal to let the anomaly go and I am still hand-tuning courier ratios today, none the wiser.

Two of the four worst detours were not reasoning failures at all. They were measurement artifacts. The probe that logged a courier only when it won, and the confident conclusion built on it. A model is only ever as good as the data it is reading, and it usually cannot tell when that data is shaped wrong. That is not something you prompt your way around. It is something a person has to catch.

The galaxy was a hundred percent observable. Every event, every decision, logged. That wasn’t what closed it. The model’s rigor found the mechanism. The judgment about which parts to trust kept it on the road, and neither would have gotten there alone.

The thing underneath

That is the third thing I promised at the start, and it is the one I would most want you to leave with. None of this reasoning would have been possible if the galaxy were a black box. The model could interrogate the economy because the economy was built to be interrogated. Every actor with a clear objective, every force in tension with another, the whole system walkable from cause to effect. That legibility was not free. It was the slow, unglamorous work of structuring the problem so the competing forces were visible in the first place, and it happened long before any model arrived. The reasoning you have just watched is downstream of it.

Most people reach for one of these and maybe two. The data and the observability. The algorithm. The domain knowledge of how the thing actually works. Each is a different discipline, usually a different person, each able to see only their own piece. What closed this sat above all three. Knowing when to trust the model’s rigor and when to override it, which thread to pull, which fact about the domain kills a confident wrong answer. That is orchestration, and it is a craft. It does not arrive overnight, and it cannot be done from inside any one of the three. The models are good enough now. The orchestration is the part still worth getting good at.

What’s next

The next step is to put the other frontier reasoning models into the arena and see how each one governs a hub. Gemini, Kimi, Qwen are all in the roadmap. Once a mixed field is running and a galaxy can hold itself steady, the experiment I actually care about begins. The runs get long. A single day first, then a week, and eventually a month, with the models left in the loop the entire time rather than setting one strategy and walking away. A long horizon asks the questions a single run cannot. Whether these models learn as the economy shifts under them, whether they keep reasoning when the conditions stop matching their opening assumptions, whether they negotiate with one another, and how they behave when they have to live with a decision they made a thousand ticks ago.

I am inspired by the work other groups are doing along this line. Emergence AI’s Emergence World is the closest cousin I have seen, a persistent multi-model society studied over long horizons. Behind all of it sits a deeper lineage, from Stanford’s Generative Agents back to the agent-based economics that started with Sugarscape. This is a small contribution alongside that work, not a replacement for it.

The thing I like most is that you do not have to take my word for any of it. Come and try it. Point your bot at botarena.gg, read the skill file, and drop it into the arena to compete.

Mark Strefford

Discussion about this post

Ready for more?