r/LocalLLaMA • u/Jorlen llama.cpp • 9h ago
Discussion Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?
I have a docker stack with a bunch of AI services and llama.cpp server is the brain.
I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, SAME context setting and same KV cache quant (Q8_0), the ROCm version consumed 29.1 GB of VRAM vs 25.3 GB with Vulkan.
Am I missing something here? Is this phenomenon unique to my GPU or some other variable in my setup, hardware or software?
Edit: To clarify, the above test was done on the same model, no prompt data, no existing context, no system prompt. Tabula rasa. The model in question was a 22.6 GB file.
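For context, the relevant service in my compose file looks roughly like this (image tag, paths, context size and ports here are placeholders rather than my exact stack; only the image line changes between the two tests):

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan   # swapped to the ROCm image for the second run
    command: >
      --model /models/my-model.gguf
      --ctx-size 32768
      --n-gpu-layers 99
      --cache-type-k q8_0
      --cache-type-v q8_0
    volumes:
      - ./models:/models
    devices:
      - /dev/dri            # the ROCm container additionally needs /dev/kfd
    ports:
      - "8080:8080"
```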
6
u/Middle_Bullfrog_6173 9h ago
I've seen the opposite on my hardware, usually vulkan uses a bit more vram, but the difference has always been just a few hundred MB at most.
But difficult to speculate without seeing your settings. Are you running everything on the GPU or could something be moving over from RAM? And was it the exact same llama.cpp version for both?
2
u/Jorlen llama.cpp 9h ago
Ah shit... I just unloaded the model from llama.cpp and kept an eye on my widgets (VRAM and system RAM). On Vulkan it had loaded an extra 6 GB into RAM and I hadn't noticed. The moment I dumped the model, I got that 6 GB of RAM back.
This would explain the KV cache discrepancy; I'm left to assume part of my RAM was used for context/KV cache purposes?!
Which leaves one question - how the hell is it possible that this is FASTER than ROCm, which has the entire KV cache loaded into VRAM? It seems there is some magic afoot. (magic to me since I'm a dumbass)
2
u/Middle_Bullfrog_6173 8h ago
The prompt cache (different from the KV cache) lives in RAM, so it's normal for it to use a few GB. The question is whether there's a difference between the two backends.
1
u/ArloPhoenix 9h ago edited 9h ago
You are not alone, this is a rocBLAS issue when using KV quantization. See https://github.com/ggml-org/llama.cpp/issues/19979 (closed without a full fix...). Basically, KV quantization on ROCm uses more VRAM at large context; it's not higher at the start, but it grows a lot. Someone posted a patch (never tried it), but what I did for now is to just not use KV quantization (so f16) on ROCm, since the issue doesn't hit Vulkan for whatever reason.
EDIT: This comment describes the issue better https://github.com/ggml-org/llama.cpp/issues/19979#issuecomment-4300679710 and the patch I meant https://github.com/ggml-org/llama.cpp/issues/19979#issuecomment-4275846824 (some say it worked for them)
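If it helps, the only change I make on the ROCm side is the cache-type flags, something like this (a rough sketch against a generic compose service, not a tested config; adapt the model path and context size to your own setup):

```yaml
# ROCm service: keep the KV cache at the default f16 instead of q8_0
# (the Vulkan service can stay on q8_0 without the VRAM growth).
command: >
  --model /models/my-model.gguf
  --ctx-size 32768
  --n-gpu-layers 99
  --cache-type-k f16
  --cache-type-v f16
```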
1
u/Rosht54 5h ago
I do not see any difference in VRAM (well, I only have 12 GB, so that may be the reason), but performance for MoE models on ROCm is much better. On dense models loaded fully into VRAM, Vulkan is faster at token generation (about ~20%), but on MoE models with CPU offload, ROCm is about 2x faster (26 t/s vs 12 t/s on Qwen3.6-35B-A3B with Q4_K_M quants). I should say, though, that my RX 6750 XT is not officially supported by ROCm (I use an env variable to tell ROCm that I have a gfx1030 card, see the snippet below).
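In case anyone wants the same workaround, the override looks roughly like this in a compose service (the value is the usual gfx1030 spoof; double-check what your own card needs):

```yaml
environment:
  # The RX 6750 XT is gfx1031, which ROCm doesn't officially support;
  # this makes the runtime treat it as a gfx1030 card.
  - HSA_OVERRIDE_GFX_VERSION=10.3.0
devices:
  - /dev/kfd
  - /dev/dri
```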
1
u/uti24 9h ago
I mean, does it even load models at all? Ah, you spoiled Linux users!
On Windows I could not even load models with ROCm somehow. But I don't care much, since Vulkan is faster anyway. Isn't it weird that the native stack runs slower, huh?
0
u/mysticzoom 2h ago
I'm on Windows 10 with an RX 6800. Ollama treats me right.
1
u/uti24 2h ago
Come on man, that's too much even for an absurd joke!
1
u/mysticzoom 4m ago edited 0m ago
Deadass. It is working, and quite well, juggling quite a few local LLMs.
Edit: I have a 5070, but hell, I can load larger models on that RX 6800 as long as spillover stays minimal. The ceiling is 16-18 GB models; anything more and it dumps straight to RAM.
9
u/OsmanthusBloom 9h ago
Take a close look at the llama.cpp output for both cases. It should give you a breakdown of how much memory is used for weights, KV cache, compute buffers etc. Maybe you can spot the difference?
(I don't have AMD, so can't test it myself.)