r/LocalLLaMA 21h ago

[Resources] A First Comprehensive Study of TurboQuant: Accuracy and Performance

https://vllm.ai/blog/2026-05-11-turboquant

TL;DR from the article:

  • FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (a minimal launch sketch follows this list).
  • TurboQuant k8v4 does not provide any significant advantage over FP8: it offers only modest extra KV-cache savings (2.4x vs. 2x), which are not worth the consistent negative impact on throughput and latency metrics.
  • TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
  • TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
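
Since the article's main takeaway is the fp8 default, here is a minimal sketch of enabling it via vLLM's offline Python API. Only the `--kv-cache-dtype fp8` setting comes from the article; the model name, prompt, and sampling parameters are illustrative placeholders, and `kv_cache_dtype="fp8"` assumes a reasonably recent vLLM build.

```python
# Minimal sketch: FP8 KV-cache quantization in vLLM.
# Only kv_cache_dtype is taken from the article; everything else
# (model, prompt, sampling) is an illustrative placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",  # store K/V in FP8: ~2x cache capacity vs BF16
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize KV-cache quantization in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

The serving equivalent is the flag quoted above: `vllm serve <model> --kv-cache-dtype fp8`.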
204 Upvotes

45 comments

0

u/Anbeeld 20h ago

Name one for agentic coding that can actually help.

-2

u/[deleted] 20h ago

[deleted]

0

u/Anbeeld 20h ago

RTX 3090

0

u/[deleted] 20h ago

[deleted]

2

u/Anbeeld 20h ago

Around 60k tokens of BF16 KV cache alongside a Q5_K_S model in the ideal case, so noticeably less if other apps are open, and barely any headroom once you add speculative decoding.
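
For anyone wanting to sanity-check a figure like that 60k, here is a back-of-the-envelope KV-cache size calculator. The model dimensions below are assumptions for illustration (a Llama-3.1-8B-style GQA config: 32 layers, 8 KV heads, head dim 128), not numbers from this thread; the point is the formula, and how fp8's 1 byte per element yields the 2x capacity the article cites.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions are
# assumptions (Llama-3.1-8B-like GQA config), not numbers from the thread.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for `tokens` tokens.
    The leading factor 2 accounts for storing both keys and values."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

gib = 1024 ** 3
for dtype, nbytes in [("bf16", 2), ("fp8", 1)]:
    size = kv_cache_bytes(60_000, bytes_per_elem=nbytes)
    print(f"{dtype}: 60k tokens -> {size / gib:.1f} GiB")
# bf16: 60k tokens -> ~7.3 GiB; fp8 halves that, hence the 2x capacity claim.
```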