I ran DFlash on a MacBook. The 4x is a data-center number.

saurabhsarkar
8 hours ago
7 min read

A cinematic casino table where AI word tokens are accepted or discarded, with a glowing GPU server behind the dealer

Speculative decoding is a bet placed on every step: draft several tokens, keep the ones the target model accepts, and pay for the guesses it rejects.

On June 15, LMSYS published a number that is easy to misread. DFlash, their new speculative decoding method, hit more than 4.3 times the throughput of a baseline server on Qwen 3.5 397B-A17B. The detail that matters is one line down, in the hardware: eight B200s. I wanted to know what survives the trip from that rack to the machine on my desk, so I ran the same idea on an M5 Max against a 4B model, and then, when the laptop raised a question it could not answer, on a rented RTX 4090.

The short version, before the evidence: DFlash is real and worth turning on in the right place, but the size of the win is set by the workload, the backend, and how many people are waiting. The laptop story and the data-center story are not the same story.

What the trick actually is

Start with why a big model is slow. For every word it writes, it reads its entire set of weights, billions of numbers, out of memory. The arithmetic is quick. The reading is the bottleneck. Producing one word is less like solving a hard problem and more like hauling a heavy book down from a high shelf to write a sentence, reshelving it, then hauling it back for the next.

Speculative decoding turns that into an opening. A small, fast model guesses the next handful of words, and the big model checks them in a single read of its weights, the same read it would have spent on one word, keeping the guesses it accepts. Guess five, land four, and you got four words for the price of one trip up the shelf. In principle the target model still verifies the text; in practice, as the results below show, backend numerics can still change exact wording.

DFlash sharpens the guesser two ways. It drafts a whole block of words at once and then refines it, closer to sketching a line and tightening it than printing letter by letter, which the paper calls block diffusion. And it hands the guesser the big model's own working notes, the internal state already computed, so the small model can guess well without redoing the thinking. That second part is KV injection.

One number governs all of it: acceptance length, how many guessed words the big model keeps per check. High acceptance and you fly. Low acceptance and the big model throws the guesses away, you paid to make them anyway, and you can finish slower than if you had never guessed.

The target and draft are two different models

In my test, the main model was still the normal Qwen/Qwen3.5-4B. I paired it with Z Lab's premade DFlash draft model, z-lab/Qwen3.5-4B-DFlash. That draft is not a layer pasted on top of Qwen. It is a separately trained, roughly 634-million-parameter model that sits beside the 4B target model at inference time.

The target model is not retrained to accept draft tokens. DFlash changes the generation loop around it. First the target reads the prompt and exposes hidden states from selected internal layers. The draft model uses those hidden states to propose a block of future tokens. Then the target checks that whole proposed block in one forward pass.

If the draft's first few tokens match what the target would have produced, the runtime keeps that matching prefix. At the first mismatch, it can keep the target's correction token, but it discards the rest of the draft block because those later guesses were conditioned on the wrong token. The target stays the authority; the draft is just a learned proposer trying to save target-model passes.

Getting it running, and nearly fooling myself

None of the setup was hard in theory, and it still ate most of an afternoon. The download stalled at 3 of 11 files after forty-seven minutes, so I gave up on the parallel fetcher and wrote a single-worker one that walked the files one at a time. The DFlash draft shipped a config in a schema the MLX loader did not expect, two fields it had never seen, so I patched the loader to read them. And the gotcha that cost me twenty minutes of pure staring: MLX cannot reach the Mac's Metal GPU from inside a sandboxed process, so every benchmark has to run from a plain shell or it dies on import.

The worse trap was one I set for myself. My first quality pass capped output at 1300 tokens, and the 4B model is chatty: fourteen of twenty-four answers ran into the ceiling and got cut off mid-sentence. My rubric scored the stumps about even, so the easy conclusion wrote itself, same quality and much faster. It was grading half-written answers. I added a "be concise" line to every prompt, raised the cap, watched truncation fall to zero, and only then were the numbers comparing finished work to finished work. If your benchmark truncates, your quality comparison is fiction. I almost published the fiction.

Where the speedup actually lives

With that fixed, the laptop result came out clean. Same machine, block size sixteen, baseline against DFlash across four kinds of work:

SWE bug-fix / review: 54.16 baseline tok/s, 89.82 DFlash tok/s, 1.66x speedup, 4.45 acceptance.
Code generation: 43.08 baseline tok/s, 67.78 DFlash tok/s, 1.57x speedup, 4.82 acceptance.
Reasoning: 53.58 baseline tok/s, 67.06 DFlash tok/s, 1.25x speedup, 3.43 acceptance.
Long-context synthesis: 39.86 baseline tok/s, 40.14 DFlash tok/s, 1.01x speedup, 2.88 acceptance.

Read the acceptance number next to the speedup. The two rise and fall together. Tight, predictable work, a bug fix or a small function, lets the drafter land four or five tokens a step, and you collect most of the win. Open-ended synthesis is unpredictable token to token, the drafter gets rejected more, and on the long-context prompts the throughput was basically tied while the wall clock got worse: DFlash took 17.1 seconds against 14.9, with time to first token doubling.

DFlash speedup by workload on an M5 Max, showing gains from 1.01x to 1.66x

On the M5 Max, speedup tracks acceptance length. Coding work gets the win; long-context synthesis basically gives it back.

It changes the words, not the quality

The obvious fear with making a model faster is that you are quietly trading away correctness. On the coding set I could not find the trade. With the cap raised high enough that nothing truncated, the baseline and DFlash answers landed within noise on the rubric, 0.863 against 0.879, and DFlash was the one slightly ahead. Eleven of the twelve prompts scored the same, and on the twelfth DFlash gave the better answer.

What did change was the text, which I had not expected. At greedy decoding, temperature zero, DFlash sounds like it should return the baseline's exact tokens, since the target model verifies what it accepts. It did not: only two of eleven answers came back identical. Most of the gap was an artifact worth knowing, the MLX path printing the end-of-turn token into the output while the baseline strips it, which makes a naive comparison scream that everything differs. The rest was real and small.

So the useful version is narrow. DFlash changed the wording and, as far as a deterministic rubric and a hand check could tell, left the quality alone. Speculative decoding is close to lossless, not exactly lossless. If a test pins exact model output, DFlash will break it for reasons that have nothing to do with being wrong.

The question a laptop cannot answer

All of that was one user at a time, which is the one thing a laptop is good for and the one thing a server rarely is. So I put the same model on a 4090, served it with vLLM, and ran the suite at four levels of concurrency, once plain and once with DFlash, measuring aggregate throughput: total tokens over the wall clock for the whole batch.

1 concurrent request: baseline 99.5 tok/s, DFlash 162.0 tok/s, 1.63x speedup.
4 concurrent requests: baseline 352.6 tok/s, DFlash 483.0 tok/s, 1.37x speedup.
8 concurrent requests: baseline 513.8 tok/s, DFlash 488.8 tok/s, 0.95x speedup.
16 concurrent requests: baseline 959.2 tok/s, DFlash 487.8 tok/s, 0.51x speedup.

At a single request DFlash lands at 1.63 times baseline, almost exactly the 1.66 the laptop showed. Two unrelated stacks, a Metal laptop and a CUDA server, agreeing to the second digit is strong evidence that the single-stream win is real. Then watch what the requests do to it. DFlash climbs to about 485 tokens a second and stops dead, flat from four requests through sixteen. Baseline keeps climbing, all the way to nearly a thousand.

Aggregate throughput on an RTX 4090 at concurrency 1, 4, 8, and 16

On the 4090, DFlash climbs at low concurrency and then plateaus. Plain batching keeps scaling.

DFlash speedup versus concurrency, crossing below baseline around eight concurrent requests

The crossover: a single-stream win turns into a throughput loss once enough requests are waiting.

The reason is the memory-versus-compute story from the top, played in reverse. At one request the GPU sits mostly idle between memory reads, and speculation spends that idle compute to buy speed. But batching wants the same idle compute, to serve more users at once, and it uses it far better. You can spend that headroom on latency for one user or on throughput for many. You cannot spend it twice.

So should you turn it on?

Two questions decide it: what are you generating, and how many people are waiting.

One user on your own machine, generating short structured text, a coding assistant on Apple Silicon: turn it on. Roughly 1.5 to 1.7 times faster at no measurable quality cost, for the price of the draft model's memory. Long-form writing or summarization, where the answers run long and the drafter guesses badly: leave it off, you may be paying for speculation you never get back.

A server taking more than a handful of requests at once is more complicated. In my vLLM/4090 test, DFlash stopped being a win around eight concurrent requests and cost throughput after that. That does not disprove LMSYS's result; it explains why their Spec V2 scheduler matters. Buying latency for one stream by spending the compute that batching turns into throughput is a poor trade when throughput is the whole point, unless the serving stack is specifically built to recover that batching efficiency.

So do not quote the 4x as a general rule. It is real, on eight B200s with a 397B-A17B model and the serving stack LMSYS measured. On a 4B model in my tests, DFlash ran anywhere from a 2x loss to a 1.66x win, set by what I asked and how many were asking at once. Both numbers are true. Only one of them is about the machine on your desk.

I will be straight about what this is not. Two backends and two machines, but one draft size and one speculative-token setting, neither swept. Quality checked only on the coding set and only on the laptop; the server run measured speed, not correctness. The non-code suites were three or four prompts each, enough to show the shape of the curve and not enough to trust the second decimal. The direction, though, I would bet on now.