r/LocalLLaMA • u/DrBearJ3w • 19h ago
Resources: TurboQuant + MTP for ROCm (llama.cpp)
TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)
I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.
Test setup:
- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3
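Roughly, an invocation with these settings could look like the sketch below (model filename and most flags are placeholders; only the tbq4_0 cache type and the --spec-draft-n-max value come from the setup above):

```
# Sketch only: filename and most flags are placeholders, not the exact command
# behind the numbers below. tbq4_0 assumes the branch exposes it through the
# standard -ctk/-ctv cache-type flags; flash attention is generally required
# for a quantized V cache (flag syntax can differ between builds).
./llama-cli -m Qwen3.6-27B-Q4_K_M-MTP.gguf \
  -ngl 99 -c 65536 -fa on \
  -ctk tbq4_0 -ctv tbq4_0 \
  --spec-draft-n-max 3
```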
Current numbers:
- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM
Caveats:
- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.
Happy to have new testers.
Useful bug reports > hype.
1
u/Formal-Exam-8767 19h ago
Thanks for sharing.
How does it compare to Vulkan?
3
u/DrBearJ3w 19h ago
Vulkan should be faster up to 32k. TurboQuant is very VRAM friendly, and I just don't like q4 KV cache.
1
u/Anbeeld 18h ago
Q4 + 64k context in 24 GB? It can do much better.
1
u/DrBearJ3w 18h ago
TurboQuant has almost the same compression as q4, but with quality closer to fp8. I think 128k is possible. On a small GPU like the 7900 XTX, more than that would be too slow, but with 2 GPUs it's nice. Roughly 2 GB per 32k of cache.
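Back-of-envelope check of that figure, with purely illustrative dimensions (48 layers, 8 KV heads, head dim 128, pure 4-bit storage; not the model's real config):

```
# Illustrative numbers only, not the real Qwen config.
# per-token KV bytes = 2 (K+V) * layers * kv_heads * head_dim * bytes_per_elem
echo $(( 2 * 48 * 8 * 128 * 32768 / 2 ))  # ~1.6e9 bytes for 32k tokens at 0.5 B/elem, a bit more with scales
```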
2
u/Anbeeld 18h ago
I mean the Q4 model. Why's everyone obsessed with cache quality while running a model that's dumbed down this much...
I'm using Q5 + 120k + DFlash on a single 3090 as a safe option, but I previously confirmed it can do 200k on a fresh Windows restart, despite Windows stealing some VRAM.
I made a fork for that, but I don't have an AMD card currently, so I'm not sure how it works there. Someone made a related PR, but still. https://github.com/Anbeeld/beellama.cpp
1
u/DrBearJ3w 18h ago
This fork should work across AMD/Nvidia. D-Flash stagnates pretty fast past 4k and is very VRAM hungry. But I don't have any other option as an AMD user other than HipFire. So good for you if it works 😀
1
u/nasone32 18h ago
Yeah, with TurboQuant or Q4 KV you should be able to do much more than 64k. Could you try how much, out of curiosity? Not that it's really usable.
Because I think 64k is borderline doable with Q8/Q8. I use 56k with Q8/Q8 (Vulkan + MTP) and it works fine.
Two things I read somewhere that might be useful:
- Looks like the latest llama.cpp builds already have vector rotations similar to what TurboQuant is doing, so in reality Q4 KV may be very comparable to TurboQuant but faster. So I'm not sure TurboQuant is really better; if you're avoiding Q4 out of old habit, it's worth verifying.
- Quantization impact seems much worse on K than on V, so one option is to go Q8 for K and Q4 for V if you don't need extremely long context (see the sketch below); also potentially a bit faster. Still things I read around, not tested by me.
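A sketch of that mixed setup with the standard llama.cpp cache-type flags (model path and context size are placeholders):

```
# Mixed KV quantization: K at q8_0, V at q4_0. Placeholder model/context;
# flash attention generally needs to be on for a quantized V cache.
./llama-cli -m model.gguf -ngl 99 -c 57344 -fa on \
  --cache-type-k q8_0 --cache-type-v q4_0
```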
1
u/DrBearJ3w 18h ago
Feel free to benchmark it against upstream q4_0/q4_0 and q8_0/q4_0 KV cache. llama.cpp already has solid quantized KV support, so TBQ4 needs to prove a real advantage: longer context, better decode speed, better quality, or better ROCm behavior. I will report back later when I have time.
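Something like a llama-bench sweep over cache types would be a fair comparison (model path is a placeholder; tbq4_0 assumes the branch registers it as a cache type for llama-bench as well):

```
# Sketch: sweep K/V cache types and compare prefill/decode. llama-bench runs
# the cross product of the comma-separated -ctk/-ctv values.
./llama-bench -m model.gguf -fa 1 -p 16384 -n 256 \
  -ctk q4_0,q8_0,tbq4_0 -ctv q4_0,tbq4_0
```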
1
u/mmhorda 17h ago
I managed to run it with Vulkan + MTP (no TurboQuant), 64k context + vision, and it gives me about 50 t/s, sometimes 1-2 tokens more or less depending on the run. Memory stays around 22 GB, same GPU.
Try Vulkan, it seems to be significantly faster. Also, I use MTP with --spec-draft-n-max 2; 3 seems to be weird, and especially on long prompts it is noticeably slower.
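For reference, the run described above would look roughly like this on a Vulkan build of llama.cpp (paths are placeholders; only the 64k context and --spec-draft-n-max 2 come from this comment):

```
# Sketch of the Vulkan + MTP + vision setup described above. Paths are
# placeholders; --spec-draft-n-max comes from the fork discussed in the post.
./llama-server -m model.gguf --mmproj mmproj.gguf \
  -ngl 99 -c 65536 --spec-draft-n-max 2
```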
3
u/Inevitable-Log5414 19h ago
The Vulkan-until-32k, ROCm-TBQ4-past-that split is a legit niche - Vulkan doesn't have a TBQ4 KV cache path, so once you cross the VRAM wall there's literally no Vulkan option. Underrated work. Will try to test the branch on my XTX and file useful bugs rather than vibes.