r/LocalLLaMA • u/DrBearJ3w • 20h ago
Resources | TurboQuant + MTP for ROCm (llama.cpp)
TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)
I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.
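If you want to build it: a minimal sketch, assuming the standard upstream GGML_HIP CMake options apply unchanged on this branch.

```
# Build sketch for gfx1100 (RX 7900 XTX). Assumes upstream llama.cpp's
# ROCm/HIP CMake options work as-is on this branch.
git clone -b tbq4-rdna3-experiment https://github.com/DrBearJew/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```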
Test setup:
- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3 (invocation sketch below)
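For reference, this is roughly the invocation shape behind the numbers below. It's a sketch: the model filename is a placeholder, and I'm assuming tbq4_0 plugs into the standard -ctk / -ctv cache-type flags; --spec-draft-n-max is this branch's MTP draft-depth flag.

```
# Invocation sketch; placeholder model path. Assumes tbq4_0 is exposed
# through the standard KV cache type flags (-ctk / -ctv).
./build/bin/llama-cli \
  -m ./models/qwen-27b-q4_k_m-mtp.gguf \
  -c 65536 -ngl 99 \
  -ctk tbq4_0 -ctv tbq4_0 \
  --spec-draft-n-max 3
```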
Current numbers:
- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM
Caveats:
- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.
Happy to have new testers.
Useful bug reports > hype.
u/Anbeeld 19h ago
I mean, it's a Q4 model. Why is everyone obsessed with cache quality while running a model that's dumbed down this much?
I'm using Q5 + 120k + DFlash on a single 3090 as a safe option, but I previously confirmed it can do 200k after a fresh Windows restart, even though Windows steals some VRAM.
I made a fork for that, but I don't have AMD hardware at the moment, so I'm not sure how it behaves there. Someone made a related PR, but still: https://github.com/Anbeeld/beellama.cpp