Can you tell a good AI agent plan from a bad one before you run it? I spent 6 million tokens finding out.

saurabhsarkar
3 days ago

A robotic hand holds a glowing scorecard up to a floating multi-agent plan of glowing nodes and edges, while a huge engine and GPU tower sit dark and idle in the background.

What reproducing Orch-RM on consumer GPUs taught me about cheap verification.

The shape is familiar by now. Instead of handing a hard problem to one model, you hand it to an orchestrator that plans a team of AI agents. It reads the question and decides who does what: a chain-of-thought solver takes the first pass, two agents debate the tricky step, a web search fills a gap, a reflection agent checks the result before it ships. One model writes the plan; a fleet of sub-agents carries it out. On hard benchmarks a well-orchestrated team beats a lone model, and it is the direction a lot of production agent systems are drifting.

It is also expensive in a specific, aggravating way. You cannot tell whether a plan was any good until you have run the whole thing and graded the answer. A brilliant plan and a doomed one cost the same to find out about, and nearly all of that cost is the sub-agents doing the actual work: thousands of tokens each, several agents to a plan, several plans if you want to keep the best of a few. Writing the plan is nearly free. The finding-out is the whole bill.

A June 2026 paper out of Rutgers and Salesforce, "Reward Modeling for Multi-Agent Orchestration," asks the question that turns out to be harder than it sounds. Could you tell a good AI agent plan from a bad one before you run it? Train a reward model that reads the plan itself, predicts whether it will work, and let it pick. Score five plans, execute only the one it likes best. Best-of-N quality for one-of-N execution, plus rounding error.
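The select-then-execute loop is small enough to sketch. This is a minimal illustration rather than the paper's code: `score_plan` and `execute_plan` are hypothetical callables standing in for the reward model and the sub-agent rollout.

```python
from typing import Callable, Sequence


def select_then_execute(
    plans: Sequence[str],
    score_plan: Callable[[str], float],  # hypothetical: cheap reward-model score
    execute_plan: Callable[[str], str],  # hypothetical: expensive sub-agent rollout
) -> tuple[str, str]:
    """Score every candidate plan, then execute only the top-scoring one."""
    if not plans:
        raise ValueError("need at least one candidate plan")
    best = max(plans, key=score_plan)  # N cheap scoring calls
    return best, execute_plan(best)    # exactly one expensive call
```

The point of the shape is that `execute_plan` runs once no matter how many candidates are scored.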

Here is what that asymmetry looked like on my hardware. Scoring the plans across a 30-problem slice of AIME cost about 70,000 tokens. Executing them cost 6.1 million. Same decision, which plan to trust, at roughly one percent of the price.

I wanted to know if that bet survives contact with hardware I actually own: an RTX 4090 and an RTX 3090, two cards in one box. It mostly does. But testing it turned up something the headline number hides. The cheap verifier is trained on the most expensive data in the whole system, and even after I paid for a lot more of that data, the verifier it produced was barely better than guessing. It still won. That gap, between "barely better than guessing" and "still won," is the actual story.

What scoring an AI agent plan quietly requires

To score plans you need a reward model, and to train a reward model with no human labels you need preference pairs: plan A beat plan B, learn to rank A higher. Orch-RM builds two kinds, and the difference between them turns out to be the entire plot.

The first kind is cheap. Call it specialized-over-base: take a trained orchestrator and an untrained base model, have both plan the same question, and assume the trained one's plan is better. You never run anything. You just need two models and a pile of questions, and you can mint these pairs by the thousand. The paper mixes them in at three-to-one.

The second kind is expensive, and it is the one that matters. Call it correct-over-incorrect: sample several plans for the same question, execute all of them, and pair the plans that solved it over the plans that didn't. To make a single one of these pairs you have to pay exactly the execution cost the method exists to avoid. The cheapness of verification at test time is borrowed against a large, unavoidable labeling bill at training time. That is the hidden clause in the contract, and I did not appreciate how binding it was until the reward model made me.
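A minimal sketch of how correct-over-incorrect pairs could be minted from executed rollouts. The `Rollout` record and the choice to pair every correct plan with every incorrect plan for the same question are assumptions made for illustration; the source does not say how pairs were enumerated.

```python
from itertools import product
from typing import NamedTuple


class Rollout(NamedTuple):
    """One executed plan (hypothetical record layout)."""
    question_id: str
    plan: str
    correct: bool


def mint_coi_pairs(rollouts: list[Rollout]) -> list[tuple[str, str, str]]:
    """Return (question_id, chosen_plan, rejected_plan) triples.

    Questions where every plan succeeded, or every plan failed,
    contribute nothing: there is no within-question contrast to learn from.
    """
    by_question: dict[str, list[Rollout]] = {}
    for r in rollouts:
        by_question.setdefault(r.question_id, []).append(r)

    pairs = []
    for qid, group in by_question.items():
        winners = [r.plan for r in group if r.correct]
        losers = [r.plan for r in group if not r.correct]
        # Assumption: full cross product of winners and losers.
        pairs.extend((qid, w, l) for w, l in product(winners, losers))
    return pairs
```

Every `Rollout` fed into this function has already been executed and graded, which is exactly the cost the method avoids at test time.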

Horizontal bars on a log scale: executing the plans costs 6.11M tokens, scoring them 0.07M, about 87 times cheaper to select than to execute.

Scoring the plans costs about one percent of what executing them costs. Log scale; 30 AIME problems.
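The ratio in the figure follows directly from the two token counts. A quick check, using the rounded figures quoted above (the exact counts are not given in the text):

```python
# Rounded token counts for the 30-problem AIME slice.
SCORING_TOKENS = 70_000
EXECUTION_TOKENS = 6_110_000


def cost_ratio(execution_tokens: int, scoring_tokens: int) -> float:
    """How many times more tokens execution uses than scoring."""
    return execution_tokens / scoring_tokens


ratio = cost_ratio(EXECUTION_TOKENS, SCORING_TOKENS)
share = SCORING_TOKENS / EXECUTION_TOKENS
print(f"execution / scoring: {ratio:.0f}x")         # 87x
print(f"scoring share of execution: {share:.1%}")   # 1.1%
```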

Standing it up on two cards

The authors' preference data is private, and the flashy benchmark from the paper (a web-browsing task where plan-scoring matches an expensive judge at roughly seventeen times fewer tokens) rides on a retrieval corpus that was never released. So I couldn't rerun their headline. What I could do was rebuild the machinery and point it at AIME, the competition-math set, where everything needed is public.

Rebuilding meant three things: generating both kinds of preference pairs myself on a math-question dataset with AIME held out; training a LoRA reward model on top of Skywork-Reward-8B with a Bradley-Terry loss; and writing a small best-of-N harness that samples plans from the orchestrator, scores them, executes them against a backbone, and grades with a math verifier. The paper drives its backbone at 120 billion parameters. I ran gpt-oss-20b, which fits one 4090 in about 13 GB. The orchestrator is the authors' released 7B checkpoint.
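For reference, the Bradley-Terry objective on a single preference pair is the negative log-sigmoid of the score margin between the chosen and rejected plan. A minimal pure-Python sketch follows; real training would compute this batched in a tensor framework, and the function name is illustrative.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably.

    Equivalent to softplus(-margin) = log(1 + exp(-margin)); the
    max/log1p form avoids overflow for large negative margins.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the two scores are equal the loss is log 2, about 0.693, and it falls toward zero as the chosen plan's score pulls ahead.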

Two details cost me a day each and are worth passing on. The math grader uses a global parser that is not thread-safe, so grading inside a thread pool silently marked correct answers wrong until I moved grading to a single thread after the parallel execution. And multi-agent plans concatenate their sub-agents' answers, so the verifier saw the same fraction printed three times over and threw up its hands until I taught it to check the last line too. Neither bug is deep. Both are the kind of thing that stands between "the paper" and "a number you trust."
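Both fixes have a simple shape. The sketch below assumes a hypothetical `execute` callable for the rollout and a hypothetical `verify(candidate, reference)` grader that is not thread-safe; neither is the real harness's API.

```python
from concurrent.futures import ThreadPoolExecutor


def last_nonempty_line(text: str) -> str:
    """Return the final non-blank line of a (possibly concatenated) answer."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def grade_with_fallback(output: str, reference: str, verify) -> bool:
    """Try the full output first, then only its last line."""
    return bool(verify(output, reference)
                or verify(last_nonempty_line(output), reference))


def run_then_grade(plans, execute, grade, max_workers: int = 8) -> list:
    """Execute plans in parallel, then grade sequentially on the calling thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(execute, plans))  # order matches `plans`
    # Grading stays outside the pool because the grader is assumed
    # to share global parser state across calls.
    return [grade(out) for out in outputs]
```

Keeping grading out of the pool costs almost nothing, since grading is a small fraction of the time spent on execution.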

First result: the cheap part works, the smart part doesn't

The first honest run on AIME 2024, best-of-4, came back like this. A single plan scored 43 percent. Majority vote across the four plans, the dumbest possible selector (count the answers and take the mode), scored 60 percent. And the reward model, the whole point of the exercise, also scored 60 percent. A tie with the free baseline.
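The majority-vote baseline is worth seeing in full, because it is this short. A minimal sketch, assuming each executed plan yields a normalized answer string or `None` when no answer could be extracted:

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common non-missing answer, or None if there is none."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Note what it needs: every plan has to be executed before it can vote, so it is free to implement but not free to run.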

The efficiency claim held cleanly. Plan-scoring cost about 1.4 percent of the tokens that execution did, the same asymmetry the paper sells. So the cheap part worked. The problem was that the part that justifies the effort, a smart selector that beats counting, wasn't there. Best-of-N is only worth building if your selector beats majority vote, and mine didn't.

It would have been easy to file that as "reproduces the efficiency, not the accuracy" and move on. But the tie was too clean, and it pointed somewhere specific.

Why it tied

At test time, every plan the reward model ranks comes from the same trained orchestrator. They are all "specialized" plans already. That makes the specialized-over-base signal (the cheap pairs, three-quarters of my training data) worse than useless here. It teaches the model to tell trained plans from base-model plans, a distinction that never comes up when every candidate is already a trained plan. The only signal that transfers to the actual test is correct-over-incorrect: among plans that all look competent, which one is actually right. And I had built the reward model on twenty-nine of those pairs. Twenty-nine.

The verifier wasn't weak everywhere. It was weak in the one place the test lives. So I went and made more of the expensive signal.

Paying for the signal

Harvesting correct-over-incorrect pairs at scale is embarrassingly parallel and, on a two-card box, a small scheduling puzzle. The orchestrator and the backbone together don't fit twice, so I phased it. First, sample all 6,000 plans (1,000 questions, six plans each) with the orchestrator alone on the 3090. Then tear it down, serve the backbone on both cards, and execute the plans round-robin across the two. Sixty-eight percent of the executions solved their problem, enough within-question spread of right and wrong to mint 454 new pairs. The correct-over-incorrect corpus went from 29 to 483.
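The round-robin step in the second phase can be sketched as below. The endpoint URLs are hypothetical, and one backbone server per card is an assumption about the setup rather than a detail given in the text.

```python
from itertools import cycle

# Hypothetical local endpoints, one backbone server per GPU.
ENDPOINTS = ["http://localhost:8000", "http://localhost:8001"]


def assign_round_robin(plans: list, endpoints: list) -> list[tuple]:
    """Pair each plan with an endpoint, alternating in order."""
    if not endpoints:
        raise ValueError("need at least one endpoint")
    # zip stops when the plans run out; cycle repeats endpoints forever.
    return list(zip(plans, cycle(endpoints)))
```

With 6,000 plans and two endpoints, each card receives 3,000 executions. A work queue would balance better if the two cards run at different speeds, which a 4090 and a 3090 likely do.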

Then the reward model's own economics paid off in a way I didn't plan. Because scoring is decoupled from execution, I could execute one pool of AIME plans a single time and re-score it with several different reward models. The answers are already known, so only the ranking changes. That buys an unusually clean comparison: same problems, same plans, same correctness, and the only thing that moves between runs is the reward model itself. Three of them went over one fixed pool: the old 29-pair model, a paper-faithful three-to-one mix, and a correct-over-incorrect-heavy one-to-one mix.
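A minimal sketch of that fixed-pool comparison. The pool layout (a list of problems, each a list of `(plan, is_correct)` tuples) and the scorer callables are hypothetical stand-ins for the executed rollouts and the trained reward models.

```python
from typing import Callable

Pool = list[list[tuple[str, bool]]]  # per problem: (plan, is_correct) candidates


def best_of_n_accuracy(pool: Pool, scorer: Callable[[str], float]) -> float:
    """Fraction of problems where the top-scored plan was correct."""
    if not pool:
        raise ValueError("empty pool")
    hits = 0
    for candidates in pool:
        _, is_correct = max(candidates, key=lambda c: scorer(c[0]))
        hits += is_correct
    return hits / len(pool)


def oracle_accuracy(pool: Pool) -> float:
    """Upper bound: a problem counts if any candidate plan was correct."""
    if not pool:
        raise ValueError("empty pool")
    return sum(any(ok for _, ok in cands) for cands in pool) / len(pool)


def compare_reward_models(pool: Pool, scorers: dict) -> dict[str, float]:
    """Re-rank one executed pool under each scorer; no re-execution needed."""
    return {name: best_of_n_accuracy(pool, s) for name, s in scorers.items()}
```

Because correctness is stored in the pool, adding a fourth reward model to the comparison costs only its scoring tokens.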

Best-of-N accuracy climbed with every increase in the expensive signal: 53.3, then 56.7, then 63.3 percent. On that pool majority vote scored 50 percent. The correct-over-incorrect-heavy model beat it by more than thirteen points, picked up four problems majority vote got wrong, and lost none that majority vote got right. The paper's central ordering, a trained selector beats counting, finally held on my box.

Three bars climbing 53.3, 56.7, 63.3 percent across old, 3:1 and CoI-heavy reward models, crossing a dashed majority-vote line at 50 toward an oracle ceiling at 76.7.

Best-of-4 accuracy over one fixed pool; only the reward model changes. It climbs past majority as the correct-vs-incorrect data grows.

The coin flip

And then I looked at what the reward model could actually do. On a held-out set of correct-over-incorrect pairs, the exact skill the test rewards, the best of my models scored 0.588. Chance is 0.5.

Grouped bars: overall validation accuracy near 0.78-0.79 towers over correct-vs-incorrect accuracy at 0.562 and 0.588, which sit just above a red dashed coin-flip line at 0.50.

Overall accuracy looks healthy, but on the signal the test actually needs the model barely clears the 0.50 line.
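The metric here is plain pairwise accuracy: the share of held-out pairs where the reward model scores the chosen plan above the rejected one. A minimal sketch, with a hypothetical `scorer` callable in place of the trained model:

```python
from typing import Callable, Sequence


def pairwise_accuracy(
    pairs: Sequence[tuple[str, str]],  # (chosen_plan, rejected_plan)
    scorer: Callable[[str], float],
) -> float:
    """Fraction of pairs ranked correctly; ties count as misses."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(scorer(chosen) > scorer(rejected) for chosen, rejected in pairs)
    return wins / len(pairs)
```

A scorer that ranks at random lands near 0.5 on this measure, which is why 0.5 is the line to clear.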

I had not built a good verifier. I had built one that is barely distinguishable from a coin, and it was enough to move best-of-N by thirteen points. Both things are true at once, and holding them together is the point. A selector only has to be right at the margin, on the handful of problems where the plans genuinely disagree and the majority happens to land wrong, to shift the score. It does not need to understand the plans. It needs a slight, real tilt away from random on the cases that matter, and 0.588 was enough of a tilt.

That is also why I won't oversell the win. It is one seed, thirty problems, and a McNemar test puts it at p = 0.125, the kind of result that is real on this pool and would need a few more seeds to survive a referee. The clean, defensible claim is the monotonic one: more of the expensive signal, strictly better selection, at every step.
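The p-value can be checked by hand. Assuming the test was the exact two-sided (binomial) form of McNemar, only the discordant problems matter: four where the reward model was right and majority vote wrong, none the other way round.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts.

    Under the null, each discordant problem is a fair coin flip, so the
    smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


print(mcnemar_exact_p(4, 0))  # 0.125
```

With only four discordant problems, 0.125 is the smallest p-value this test can return, so no result on this pool could have reached the usual 0.05 threshold without more disagreements.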

Cheap to run, expensive to teach

The headline of a paper like this is an inference-cost number. Score the plan for one percent of the rollout. That number is honest, and I reproduced it. But if you are the one building the system, it points at the wrong cost. The reward model that makes the one-percent selection worth doing is trained on data you can only produce by paying the ninety-nine percent, over and over, because the signal it needs is which plans fail, and the only way to know a plan fails is to run it. Then, having paid, you get a verifier that an 8-billion-parameter model can only push to 0.588 on hard math, because telling a subtly-wrong solution from a right one is close to the whole difficulty of the problem in the first place.

Cheap to run, expensive to teach. On consumer hardware the binding constraint was never the GPU or the inference bill. It was how many failures I could afford to generate, and how much an ordinary reward model could learn from them. If you take the plan-scoring idea to your own agents, budget for that. Not for the tokens you'll save at test time, but for the executions you'll spend teaching the thing to save them.

An independent reproduction on a single RTX 4090 plus an RTX 3090, with vLLM serving gpt-oss-20b as the sub-agent backbone and a LoRA reward model on Skywork-Reward-8B. The preference data and reward model were rebuilt from scratch; the paper is Reward Modeling for Multi-Agent Orchestration (arXiv:2606.13598). Single seed, 30 AIME problems; the accuracy result is directional, not yet significant.