I Gave My Coding Agent Karpathy's Discipline Rules. It Got Too Careful to Fix the Bug.

saurabhsarkar
13 minutes ago
6 min read

A matte-black race car stalled at the start line, brakes locked and rear tires pouring white smoke, going nowhere, while a small glowing circuit-board bug crawls across the track ahead, never reached.

Brakes, not horsepower: all that power, stalled at the line, while the bug it never reaches crawls past.

I ran the actual viral guideline against a clean baseline on 100 real GitHub bugs, three times. It didn't help. It made the agent fix fewer of them, every run, and the reason is the discipline itself.

Andrej Karpathy complained, in a post that went around, that coding models "make wrong assumptions on your behalf and just run along with them without checking." Someone turned that into a discipline file for Claude Code, and some version of it is now pasted into a lot of agent prompts: think before you code, keep it simple, make surgical changes, define what success looks like and loop until it's verified. The pitch is intuitive. Agents fail because they barrel ahead, so tell them to slow down and be careful and they should do better work.

I gave a coding agent the actual file and measured it against a clean baseline on a hundred real bug-fixing tasks. It did worse. Not by a rounding error, and not once: it fixed about six fewer bugs per hundred, in every run I did. The reason turns out to be the advice.

What I actually tested

The file is a community repo that turns Karpathy's observation into four disciplines. I injected the whole thing into the agent's prompt, verbatim. The exact rules:

1. Think before coding. Don't assume. Don't hide confusion. Surface tradeoffs.

State your assumptions explicitly; if uncertain, ask.
If multiple interpretations exist, present them, don't pick silently.
If a simpler approach exists, say so; push back when warranted.
If something is unclear, stop, name what's confusing, ask.

2. Simplicity first. Minimum code that solves the problem. Nothing speculative.

No features beyond what was asked; no abstractions for single-use code.
No flexibility or configurability that wasn't requested.
No error handling for impossible scenarios.
If you write 200 lines and it could be 50, rewrite it.

3. Surgical changes. Touch only what you must. Clean up only your own mess.

Don't improve adjacent code, comments, or formatting.
Don't refactor things that aren't broken; match existing style.
Remove only the imports and variables your own changes made unused.

4. Goal-driven execution. Define success criteria. Loop until verified.

"Fix the bug" becomes "write a test that reproduces it, then make it pass."
Strong success criteria let the agent loop on its own.

It's a reconstruction by fans, not a file from Karpathy himself, but it's the thing people actually paste in, so it's the thing worth testing.

The comparison is paired: every task runs twice, once with a normal default agent and once with the same agent plus those rules. The only thing that changes is whether they're in the prompt. The baseline is a capable default agent, not a strawman, because a skill only looks good against a weak baseline and that wouldn't survive contact with your setup. And I ran the whole thing three times, because a single run lies: at temperature zero this stack still drifts by a couple of tasks just from pressing run again.

Setup

Everything ran locally on consumer hardware.

Hardware: 1 RTX 4090 (24 GB) and 1 RTX 3090 (24 GB).
Serving: llama.cpp built from source (CUDA), exposing an OpenAI-compatible endpoint. Not vLLM.
Model: Qwen3.6-27B, 4-bit quantized (Q4_K_M GGUF, about 17 GB), a dense reasoning coder.
Benchmark: SWE-bench Lite, test split, first 100 issues. Each is a real GitHub bug with a hidden test that must go from failing to passing and a set that must stay green.
Harness: SWE-agent, driving the model through native function calling with a real str_replace editor.
Budget per task: 30 tool calls, a 32k-token context window, a 4k output cap, temperature 0.
Runs: 3 paired runs. Run 1 on the 4090; the two repeats split across both GPUs via a work queue, keeping each task's baseline and skill on the same card.

It got worse, and it kept getting worse

Across three runs the baseline resolved 47, 49, and 45 of the hundred bugs. With the rules, the same agent resolved 40, 43, and 40. The lift was negative every single time: minus seven, minus six, minus five. Mean minus six, about a thirteen percent relative drop.

Three bars, all negative: per-run lift of minus seven, minus six, minus five, below a mean line at minus six.

Three independent runs, every lift between minus five and minus seven. The effect never wobbled to zero.

The number that matters there isn't the minus six. It's that it never wobbled to zero. Three independent runs all landed between minus five and minus seven, which is hard to write off as luck. The statistics agree: treat the three run-level results as replicates and a t-test puts it at p around 0.009; pool the per-task outcomes and McNemar lands near 0.022. A stricter test that counts each task once instead of three times is softer, around 0.09, so I won't hang everything on one p-value. But every way I cut it the effect is negative, and most ways clear the usual bar.

Why Karpathy's careful advice loses

Here's the part worth keeping, because it's identical across all three runs. The guideline doesn't make the agent dumber. It makes it more cautious, and on this benchmark caution is a trade with a losing side.

Look at what the disciplined agent does well. It leaves the existing tests almost untouched: around 99.9% of the regression checks still pass, against roughly 96% for the baseline (those rates are over different sets of tasks, so trust the direction, not the exact gap). It breaks less and leaves the repo cleaner. By the standard of first, do no harm, it's the better agent.

It just fixes fewer bugs. The four disciplines are mostly brakes: surgical changes, minimum code, touch only what you must, don't fix what isn't broken. On a benchmark whose entire score is did the failing test pass, an agent that edits less finishes fewer fixes. So it cuts both ways: safer, and lower-scoring. Here is the exact ledger, every one of the hundred tasks, paired.

A 2x2 of baseline versus skill outcomes: both fixed 35, skill broke 12, skill rescued 6, neither 47.

What changed, per problem: the skill broke about twice as many bugs as it rescued.

The two cells that matter are off the diagonal. The guideline broke about twelve bugs the baseline had already fixed and rescued about six it had missed: twice as many losses as wins, in every run. Everything else is the agent and the baseline agreeing. The skill is doing exactly what it says on the tin. The trouble is that what it says on the tin costs you fixes.

But at least it's leaner

The usual fallback for a discipline skill is efficiency. Even if it doesn't raise the score, surely a careful agent flails less: fewer wasted tool calls, less context burned. For a lot of people that's the whole reason to add one.

It doesn't hold up either.

Horizontal bars of percent change versus baseline: input tokens roughly flat, tool calls up, output tokens down twelve percent.

Not cheaper either: it reads the same, makes more tool calls, and writes less.

The cost of one of these runs is dominated by reading the repository, and the disciplined agent reads just as much: input tokens are flat, give or take noise, and input is about 99.8% of the bill. What does move goes the wrong way. The agent makes slightly more tool calls, not fewer, and it generates about twelve percent less output. That last one isn't a saving, because the output is a rounding error next to the input. It's the caution again, showing up in the token log as more deliberation and smaller patches. You pay the same to read, click a little more, write a little less, and fix fewer bugs for it.

What this doesn't say

One model, one harness, one benchmark, and the limits matter. SWE-agent already bakes in a fair amount of process discipline, so the baseline isn't naive to begin with. These hundred tasks lean heavily on django and astropy, which a model like Qwen3.6 has almost certainly seen plenty of in training; the 47% base solve rate is high partly because these are well-trodden, probably-memorized problems, not a clean capability number. And while the effect replicated, it rests on three reruns of the same hundred tasks, so the runs aren't fully independent. A cleaner version would draw fresh tasks or run the full set.

The boundary that matters most: I ran the same guideline once on a smaller, harder, less-familiar set of tasks, and there it was a wash, not a loss. So this is a result about easy, well-trodden problems on a capable-enough agent. It is not discipline is bad. It's discipline didn't help here, and where it changed anything, it changed it for the worse.

Brakes, not horsepower

None of this is the skill failing to do its job. It did its job. It made the agent careful, and careful is genuinely good for some things: it broke less working code, every run. The trouble is that a discipline file is a set of brakes, and brakes only help a driver going too fast.

So before you paste one of these into your agent, the question isn't is this good advice. It reads like good advice. The question is which problem you actually have. If your agent is reckless, editing before it understands the bug, rewriting half the file, claiming a fix it never ran, then brakes are exactly right. If it's already careful, or just running out of room to finish, the brakes do nothing but slow it down, and the bill comes due in bugs it didn't fix.

An independent test on a single RTX 4090 plus an RTX 3090, with llama.cpp serving Qwen3.6-27B (4-bit) through SWE-agent on SWE-bench Lite. The guideline tested is a community reconstruction of Karpathy's coding observations, the andrej-karpathy-skills repo, not a file from Karpathy himself.