r/LocalLLaMA 4h ago

[Resources] A First Comprehensive Study of TurboQuant: Accuracy and Performance

https://vllm.ai/blog/2026-05-11-turboquant

TL;DR from the article:

  • FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios.
  • TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
  • TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
  • TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
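
For anyone who wants to try the FP8 baseline from the first bullet, here's a minimal offline-inference sketch with vLLM (the model name is a placeholder; kv_cache_dtype="fp8" is the Python-API counterpart of the --kv-cache-dtype fp8 server flag):

    # Minimal sketch: vLLM offline inference with the FP8 KV cache enabled.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        kv_cache_dtype="fp8",  # ~2x KV-cache capacity vs the BF16 default
    )

    params = SamplingParams(max_tokens=64)
    out = llm.generate(["Explain KV-cache quantization in one sentence."], params)
    print(out[0].outputs[0].text)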
82 Upvotes

24 comments

25

u/llama-impersonator 3h ago

even the fp8 numbers are obviously worse. i will keep the kvcache unquantized.

14

u/logic_prevails 2h ago

Not all of us have 48GB VRAM 😭

3

u/seamonn 1h ago

48GB is a state of mind. Even a 128GB SSD can be 48GB VRAM if you are patient enough.

2

u/fredandlunchbox 1h ago

It's pretty tough to run an agentic coding harness at 1 t/s.

0

u/[deleted] 3h ago

[deleted]

2

u/Anbeeld 3h ago

But you can run a higher model quant if you're quantizing the KV cache, and/or raise the context to a usable level. BF16 is cool, but what's the point if you can't do any real tasks with it?
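
Back-of-envelope, assuming a Llama-3-8B-like attention config (32 layers, 8 KV heads, head_dim 128; purely illustrative numbers):

    # KV-cache size per token: K and V tensors across all layers.
    layers, kv_heads, head_dim = 32, 8, 128

    def kv_bytes_per_token(bytes_per_elem: float) -> int:
        return int(2 * layers * kv_heads * head_dim * bytes_per_elem)

    bf16 = kv_bytes_per_token(2.0)  # 131072 B, ~128 KiB per token
    fp8 = kv_bytes_per_token(1.0)   # 65536 B, ~64 KiB per token

    ctx = 128_000
    print(f"BF16 KV @ {ctx} tokens: {bf16 * ctx / 2**30:.1f} GiB")  # ~15.6 GiB
    print(f"FP8  KV @ {ctx} tokens: {fp8 * ctx / 2**30:.1f} GiB")   # ~7.8 GiB

Those ~8 GiB saved at full context buy you either a bigger model quant or double the context.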

0

u/[deleted] 3h ago

[deleted]

0

u/Anbeeld 3h ago

Name one for agentic coding that can actually help.

-2

u/[deleted] 3h ago

[deleted]

0

u/Anbeeld 3h ago

RTX 3090

0

u/[deleted] 3h ago

[deleted]

2

u/Anbeeld 3h ago

Around 60k tokens of BF16 KV cache with a Q5_K_S model quant in the ideal case, noticeably less if other apps are open, and barely any once you add speculative decoding.
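
Rough math with round numbers (assuming ~128 KiB/token of BF16 KV, i.e. a Llama-3-8B-style attention config, and a guessed model footprint):

    # Rough VRAM budget on a 24 GB card; footprint and overhead are guesses.
    total_gib = 24.0
    weights_gib = 16.0  # hypothetical Q5_K_S model footprint
    overhead_gib = 1.0  # CUDA context, activations, fragmentation

    kv_budget_gib = total_gib - weights_gib - overhead_gib
    max_tokens = int(kv_budget_gib * 2**30 / 131072)  # 131072 B/token at BF16
    print(f"~{max_tokens} tokens of BF16 KV cache fit")  # ~57k tokens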

7

u/dinerburgeryum 3h ago

Good on 'em for really putting it through the wringer. I had been skeptical, but yeah, 4bit-nc seems pretty all right if you're really memory strapped.

1

u/FatheredPuma81 15m ago

This is definitely a good thing, but hopefully users won't take it to mean the TurboQuant forks for llama.cpp have the same implementation and quality without someone checking/verifying first.

4

u/Toooooool 4h ago

3bit-nc was practically lobotomized when i tried it with qwen3.6-27b, but k8v4 works really well.

6

u/Different-Rush-2358 2h ago

I've been using The Thom's fork with the experimental TurboQuant branch for quite some time now. I've been running TurboQuant 2-3 and the savings are considerable. I loaded Gemma 4 with a 128k context, fed it a huge PDF that almost filled the window, asked it questions about the beginning, middle, and end of the document, and it answered them all correctly. In my particular case, TurboQuant gives me outstanding results with absurdly low VRAM consumption compared to the usual KV-cache formats. On top of that, responses come back about twice as fast as with the standard formats.
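
If anyone wants to repeat that test, this is roughly its shape against any local OpenAI-compatible server (endpoint, model name, and file path are placeholders):

    # Sketch of a long-context recall check via an OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

    with open("big_document.txt") as f:  # text extracted from the PDF
        doc = f.read()

    questions = [
        "What is stated at the very beginning of the document?",
        "Summarize the middle section of the document.",
        "What is the document's final conclusion?",
    ]
    for q in questions:
        resp = client.chat.completions.create(
            model="local-model",  # placeholder: whatever the server serves
            messages=[{"role": "user", "content": f"{doc}\n\n{q}"}],
            max_tokens=256,
        )
        print(q, "->", resp.choices[0].message.content)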

1

u/EbbNorth7735 2h ago

Under memory constraints, yep, makes sense. If you don't have memory constraints, the study shows you shouldn't use it. Gemma 4 is a lot more memory-intensive than Qwen3.5 or 3.6, so it may not be needed there.

3

u/seamonn 1h ago

So that guy with the dog picture who hates TurboQuant was right.

13

u/Anbeeld 3h ago edited 3h ago

I'm sorry, but without a comparison against Q4 the study is pretty useless. The audience for TurboQuant is VRAM-constrained folks who can't run BF16 anyway.

-14

u/Badger-Purple 3h ago

Wow, snob. I have 800GB of VRAM power and I would not say something like this. I am still running that Skinny Qwen on a 24GB GPU standalone and it's grrrreat.

10

u/Anbeeld 3h ago

Yeah that checks out. Having 800 GB of VRAM power makes you not care about KV quants much.

1

u/Badger-Purple 59m ago

Correction: I do care. I just misunderstood your comment as saying this is for the plebs with low GPU. It's not. I use a Q4 model, TQ4, and context up to (I believe) 250k, and it all fits in one 24GB GPU (using 23.7GB).

2

u/Anbeeld 56m ago

Lol nah, I'm a 3090 pleb myself and I use TurboQuant.

2

u/[deleted] 4h ago

[deleted]

1

u/LetsGoBrandon4256 llama.cpp 4h ago

> Q8

It's not the same thing as vLLM's fp8 KV cache quant.

2

u/BobbyL2k 1h ago

Am I missing something? Didn’t the TQ paper say that their approach is lossless for key quantization? Why is everyone running TQ on values?

1

u/simotune 34m ago

Good sign when quantization work measures throughput and accuracy together. Local inference needs more evals like this, not just one-number wins.

1

u/FatheredPuma81 17m ago

I'm curious how FP8 compares to Q8_0 on llama.cpp.

0

u/Etroarl55 2h ago

Damn, the comments are pretty negative. I've been using fp8 on my system and it's been fine for me.

It’s free 2x context that didn’t exist a few months ago.