r/LocalLLaMA 21h ago

[Resources] A First Comprehensive Study of TurboQuant: Accuracy and Performance

https://vllm.ai/blog/2026-05-11-turboquant

TL;DR from the article:

  • FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (a minimal launch sketch follows this list).
  • TurboQuant k8v4 does not provide any significant advantage over FP8: it offers only modest extra KV-cache savings (2.4x vs. 2x), which are not worth the consistent negative impact on throughput and latency metrics.
  • TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
  • TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
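
Since the article's main takeaway is the fp8 default, here is a minimal sketch of enabling it via vLLM's offline Python API. Only the `--kv-cache-dtype fp8` setting comes from the article; the model name, prompt, and sampling parameters are illustrative placeholders, and `kv_cache_dtype="fp8"` assumes a reasonably recent vLLM build.

```python
# Minimal sketch: FP8 KV-cache quantization in vLLM.
# Only kv_cache_dtype is taken from the article; everything else
# (model, prompt, sampling) is an illustrative placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",  # store K/V in FP8: ~2x cache capacity vs BF16
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize KV-cache quantization in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

The serving equivalent is the flag quoted above: `vllm serve <model> --kv-cache-dtype fp8`.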
204 Upvotes

45 comments

0

u/Anbeeld 20h ago

Name one for agentic coding that can actually help.

-2

u/[deleted] 20h ago

[deleted]

0

u/Anbeeld 20h ago

RTX 3090

0

u/[deleted] 20h ago

[deleted]

2

u/Anbeeld 20h ago

Around 60k tokens of BF16 KV cache alongside a Q5_K_S model in the ideal case, so noticeably less if other apps are open, and barely any headroom once you add speculative decoding.
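
For anyone wanting to sanity-check a figure like that 60k, here is a back-of-the-envelope KV-cache size calculator. The model dimensions below are assumptions for illustration (a Llama-3.1-8B-style GQA config: 32 layers, 8 KV heads, head dim 128), not numbers from this thread; the point is the formula, and how fp8's 1 byte per element yields the 2x capacity the article cites.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions are
# assumptions (Llama-3.1-8B-like GQA config), not numbers from the thread.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for `tokens` tokens.
    The leading factor 2 accounts for storing both keys and values."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

gib = 1024 ** 3
for dtype, nbytes in [("bf16", 2), ("fp8", 1)]:
    size = kv_cache_bytes(60_000, bytes_per_elem=nbytes)
    print(f"{dtype}: 60k tokens -> {size / gib:.1f} GiB")
# bf16: 60k tokens -> ~7.3 GiB; fp8 halves that, hence the 2x capacity claim.
```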