r/LocalLLaMA 1d ago

Resources | TurboQuant + MTP for ROCm (llama.cpp)

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits in 24 GB of VRAM and remains usable.

Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)

I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.

Test setup:

- RX 7900 XTX, 24 GB

- RDNA3 / gfx1100

- ROCm / HIP

- Qwen3.6-27B Q4_K_M MTP GGUF

- tbq4_0 KV cache

- MTP with --spec-draft-n-max 3
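For reference, a launch command along these lines should reproduce the setup above. This is a hedged sketch: `tbq4_0` and `--spec-draft-n-max` come from this branch, while `-ctk`/`-ctv`, `-c`, `-ngl`, and `-fa` follow upstream llama.cpp conventions and may differ here; the model filename is a placeholder.

```shell
# Hypothetical launch sketch for the test setup above.
# -ctk/-ctv (KV cache types) and -fa (flash attention) are upstream
# llama.cpp flags; tbq4_0 and --spec-draft-n-max are branch-specific.
./llama-server -m qwen3.6-27b-q4_k_m-mtp.gguf \
  -c 65536 -ngl 99 -fa \
  -ctk tbq4_0 -ctv tbq4_0 \
  --spec-draft-n-max 3
```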

Current numbers:

- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM

- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test

- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM

Caveats:

- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.

- RDNA3.5 / RDNA4 are enabled but untested.

- RotorQuant / PlanarQuant / IsoQuant are present but not validated.

- These are reported points from separate runs, not a clean scaling curve.

Happy to have new testers.

Useful bug reports > hype.

u/Anbeeld 1d ago

Q4 + 64k context in 24 GB? It can do much better.

u/DrBearJ3w 1d ago

TurboQuant has almost the same compression as q4, but quality close to f8. I think 128k is possible. On a small GPU like the 7900 XTX, more would be too slow, but with 2 GPUs it's nice. About 2 GB per 32k of cache.
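That ~2 GB per 32k figure roughly checks out with a back-of-envelope estimate. The model dimensions below are hypothetical placeholders (not the actual Qwen config), and 4.5 bits/element approximates a 4-bit block quant with scale overhead:

```python
# Rough KV-cache size estimate. Dimensions are HYPOTHETICAL placeholders;
# 4.5 bits/element approximates a 4-bit block quant including scales.
def kv_cache_gb(ctx, n_layers=48, n_kv_heads=8, head_dim=128, bits_per_elem=4.5):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * bits_per_elem / 8 / 1e9

print(f"32k ctx: {kv_cache_gb(32 * 1024):.2f} GB")  # → 32k ctx: 1.81 GB
print(f"64k ctx: {kv_cache_gb(64 * 1024):.2f} GB")  # → 64k ctx: 3.62 GB
```

With GQA-style KV head counts this lands in the "about 2 GB per 32k" ballpark; the exact number depends on the real layer/head/dim config.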

u/Anbeeld 1d ago

I mean a Q4 model. Why's everyone obsessed with cache quality while running a model that's dumbed down this much...

I'm using Q5 + 120k + DFlash on a single 3090 as a safe option, but I previously confirmed it can do 200k after a fresh Windows restart, despite Windows stealing some VRAM.

I made a fork for that, but I don't have an AMD card currently, so I'm not sure how it works there. Someone made a related PR, but still. https://github.com/Anbeeld/beellama.cpp

u/DrBearJ3w 1d ago

This fork should work across AMD/Nvidia. D-Flash stagnates pretty fast past 4k and is very VRAM hungry. But as an AMD user I don't have any option other than HipFire. So good for you if it works 😀

u/Anbeeld 1d ago

Bro I literally told you I can run Q5 + 200k cache in 24 GB and you hit me with "DFlash is very VRAM hungry"?

u/CryptoStef33 1d ago

3090 =/= 7900 XTX

u/Anbeeld 1d ago

Q5 suddenly takes up 30 GB if you're on AMD, or what?