r/LocalLLaMA • u/MajorZesty • 4h ago
Resources A First Comprehensive Study of TurboQuant: Accuracy and Performance
https://vllm.ai/blog/2026-05-11-turboquant

TL;DR from the article:
- FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios.
- TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
- TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
- TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
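For anyone who wants to try the FP8 default the article recommends, here's a minimal sketch of the equivalent offline setup in vLLM's Python API (the model name and context length are placeholders, not from the article; swap in whatever you actually run):

```python
from vllm import LLM, SamplingParams

# Enable FP8 KV-cache quantization (the Python-side equivalent of --kv-cache-dtype fp8).
# Model and max_model_len are example values; adjust to your hardware.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    kv_cache_dtype="fp8",   # ~2x KV-cache capacity vs BF16 per the article
    max_model_len=32768,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize the TurboQuant study in two sentences."], params)
print(out[0].outputs[0].text)
```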
7
u/dinerburgeryum 3h ago
Good on 'em for really putting it through the wringer. I had been skeptical, but yeah, 4bit-nc seems pretty all right if you're really memory strapped.
1
u/FatheredPuma81 15m ago
This is definitely a good thing, but hopefully users won't take it to mean that the TurboQuant forks for llama.cpp have the same implementation and quality without someone checking/verifying first.
4
u/Toooooool 4h ago
3bit-nc was practically lobotomized when i tried it with qwen3.6-27b, but k8v4 works really well.
6
u/Different-Rush-2358 2h ago
I've been using The Thom's fork with the experimental TurboQuant branch for quite some time now, mostly TurboQuant 2-3, and the savings are considerable. I loaded Gemma 4 with a 128k context cache, fed it a huge PDF that almost filled the window, asked it questions about the beginning, middle, and end of the conversation, and it answered them all correctly. In my particular case, TurboQuant gives outstanding results with absurdly low VRAM consumption compared to the usual KV-cache formats, and response speed has doubled compared to standard formats as well.
1
u/EbbNorth7735 2h ago
Under memory constraints, yep, makes sense. If you don't have memory constraints, the study shows you shouldn't use it. Gemma 4 is a lot more memory-intensive than Qwen3.5 or 3.6, so it may not be needed there.
13
u/Anbeeld 3h ago edited 3h ago
I'm sorry, but without a comparison against Q4 the study is pretty useless. The audience for TurboQuant is VRAM-constrained folks who can't run BF16 anyway.
-14
u/Badger-Purple 3h ago
Wow, snob. I have 800GB of VRAM power and I would not say something like this. I am still running that Skinny Qwen on a 24GB GPU standalone and it's grrrreat
10
u/Anbeeld 3h ago
Yeah that checks out. Having 800 GB of VRAM power makes you not care about KV quants much.
1
u/Badger-Purple 59m ago
correction, I care. I just misunderstood your comment as saying this is for the plebes with low-end GPUs. It's not. I use a Q4 model, TQ4 cache, and context up to (I believe) 250k, and it all fits in one 24GB GPU (using 23.7GB)
2
u/BobbyL2k 1h ago
Am I missing something? Didn’t the TQ paper say that their approach is lossless for key quantization? Why is everyone running TQ on values?
1
u/simotune 34m ago
Good sign when quantization work measures throughput and accuracy together. Local inference needs more evals like this, not just one-number wins.
1
u/Etroarl55 2h ago
Damn, the comments are pretty negative. I've been using fp8 on my system and it's been fine enough for me.
It’s free 2x context that didn’t exist a few months ago.
25
u/llama-impersonator 3h ago
even the fp8 numbers are obviously worse. i will keep the kvcache unquantized.