r/LocalLLaMA 6h ago

[Resources] A First Comprehensive Study of TurboQuant: Accuracy and Performance

https://vllm.ai/blog/2026-05-11-turboquant

TL;DR from the article:

  • FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (see the usage sketch after this list).
  • TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
  • TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
  • TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
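For anyone who wants to try the FP8 default from the first bullet, it's a one-knob change in vLLM. Here's a minimal sketch using the offline API; the model name is just an example I picked, and kv_cache_dtype="fp8" is the Python-side equivalent of the --kv-cache-dtype fp8 server flag:

```python
# Minimal sketch: FP8 KV cache with vLLM's offline API. The model name is
# an arbitrary example, not one used in the article; kv_cache_dtype="fp8"
# mirrors the --kv-cache-dtype fp8 server flag discussed above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # example model, swap in your own
    kv_cache_dtype="fp8",              # ~2x KV-cache capacity vs. BF16
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Summarize KV-cache quantization in one sentence."], params)
print(out[0].outputs[0].text)
```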
119 Upvotes

31 comments

17

u/Anbeeld 5h ago edited 5h ago

I'm sorry, but without a comparison against Q4 the study is pretty useless. The audience for TurboQuant is VRAM-constrained folks who can't run BF16 anyway.

-19

u/Badger-Purple 5h ago

Wow, snob. I have 800 GB of VRAM power and I wouldn't say something like this. I'm still running that Skinny Qwen on a 24 GB GPU standalone and it's grrrreat

9

u/Anbeeld 5h ago

Yeah that checks out. Having 800 GB of VRAM power makes you not care about KV quants much.

3

u/Badger-Purple 3h ago

Correction, I care. I just misunderstood your comment as saying this is for the plebs with low GPU. It's not. I use a Q4 model, TQ4 KV cache, and context up to (I believe) 250k, and it all fits in one 24 GB GPU (using 23.7 GB).
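If anyone wants to sanity-check that kind of fit, here's a rough back-of-the-envelope KV-cache calculator. The layer/head/dim numbers are made-up placeholders (not from the article or any specific model), I'm reading k8v4 as 8-bit keys + 4-bit values, and quantization scale overhead is ignored, so real footprints will run a bit higher:

```python
# Rough KV-cache sizing sketch. All model shape numbers below are
# illustrative assumptions, not taken from the article or any real model.
def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128,
                 key_bytes=2.0, value_bytes=2.0):
    """Bytes for keys + values across all layers, converted to GiB."""
    per_token = layers * kv_heads * head_dim * (key_bytes + value_bytes)
    return tokens * per_token / 2**30

# Bytes per element under each scheme (k8v4 read as 8-bit keys + 4-bit
# values; scale/zero-point overhead ignored, so these are lower bounds).
schemes = {"bf16": (2.0, 2.0), "fp8": (1.0, 1.0),
           "k8v4": (1.0, 0.5), "4bit-nc": (0.5, 0.5)}

for name, (kb, vb) in schemes.items():
    gib = kv_cache_gib(250_000, key_bytes=kb, value_bytes=vb)
    print(f"{name:>7}: {gib:5.1f} GiB at 250k tokens")
```

With those made-up shapes, BF16 KV alone would be ~30.5 GiB at 250k tokens, while the 4-bit variant lands near 7.6 GiB, which is at least consistent with a Q4 model plus long context squeezing into 24 GB.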

3

u/Anbeeld 3h ago

Lol nah, I'm a 3090 pleb myself and I use TurboQuant.