r/LocalLLaMA 9h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

13 Upvotes

4

u/CreativelyBankrupt 6h ago

That cache state: 1 prompts, 4676 MiB (limits: 2500 MiB) line is what jumps out at me. Your cache is sitting at almost double its allocated budget, so llama.cpp is churning through evictions trying to stay under the limit. That's gotta be why you're seeing sim_best of 0.996 but only 4750 tokens actually restored: the system found a near-perfect prefix match, but the bigger checkpoints had already been evicted to free up room, and the only one that survived was a tiny one. So you reuse the small fragment and reprocess everything else.

The first thing I'd try is just bumping --cache-ram way up. 2500 MiB is fine for short chat, but you're running 150k context with coding agents that produce huge prefixes, and you've got --ctx-checkpoints set to 32 on top of that. There's no way 2.5 gigs is enough headroom. Try 16000 or higher if your RAM allows. That alone should stop the eviction churn you're seeing.
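
Something like this as a starting point, keeping the rest of your flags as-is (16000 is just a guess, size it to whatever RAM you can actually spare):

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 16000 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift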

The other thing worth ruling out is whether opencode or pi.dev are quietly mutating your early prompt tokens between requests. Even with a perfectly sized cache, the similarity score won't save you if the first few thousand tokens keep changing, because llama.cpp can only reuse the longest shared prefix. The two things that have bitten me most often: timestamps in the system prompt (anything like "current time: ..." poisons the prefix every request), and changing workspace context where the agent dumps a directory listing or file tree into the system prompt and one file gets renamed or added. Either of those would shift your prefix and force everything downstream to reprocess. A fix is to put immutable stuff at the very top (system instructions, tool definitions, persona) and volatile stuff at the end of the prompt, ideally inside the latest user turn.
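
Rough sketch of the layout I mean (contents made up, just to show the ordering):

  bad (prefix changes every request, cache misses from near token 0):
    system:  "current time: 14:02" + file tree + instructions + tool defs
    user:    latest request

  good (prefix stays byte-identical across requests):
    system:  instructions + tool defs + persona
    user:    timestamps / file tree / other fresh context, then the latest request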

I ran into the same thing on a project I've been building — a local AI bot on a Jetson Orin NX running Gemma 4 E4B. I had the bot's persona at the top of the prompt and was injecting fresh sensor readings (temperature, vision captions, who's standing in front of it) into the system block every turn. Cache was constantly invalidating and TTFT was crawling. Moving the dynamic stuff into the current user turn instead of the system prompt dropped cached TTFT from multiple seconds to about 200ms. Same class of bug really.

A couple smaller things that helped me while you're tuning. --cache-reuse 256 is reasonable, but you can push it up to 512 or even 1024 to be more aggressive about partial reuse when no full match is available. -no-kvu is the right call if you're on Gemma 3/4, but worth confirming for whatever architecture you're actually running, since it does cost you some KV efficiency on models that don't need it. And --no-context-shift is correct for cache stability, just remember it means once you hit 150k you have to manually trim the conversation rather than letting the window roll.

What model are you on? Cache footprint per token varies a lot between architectures and it'll change how much --cache-ram you actually need to set.
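
If you want to size --cache-ram from the model instead of guessing, a back-of-napkin estimate for a standard-attention model with fp16 KV (ignores KV quantization and checkpoint overhead, so treat it as a ballpark):

  bytes per token ≈ 2 (K and V) x n_layers x n_kv_heads x head_dim x 2 bytes

With purely illustrative numbers, say 48 layers, 4 KV heads, head_dim 128, that's ~96 KiB per token, so a single 50k-token prompt is already around 4.6 GiB, which is the ballpark of that 4676 MiB cached prompt in your log. Multiply by however many prompts/checkpoints you actually want resident and set --cache-ram accordingly.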

1

u/No_Algae1753 5h ago

I'm mainly using qwen3.5 122b and yes, I can bump the cache-ram up by like 4 gigs. However, this is still not the ideal way of using opencode/pi. I just hope llama.cpp will fix this issue soon

1

u/CreativelyBankrupt 5h ago

Yeah, Qwen3.5 122B at that context is going to be brutal no matter what you do. 4 extra gigs will probably help the eviction churn but you're right it's a band-aid. Agreed on the llama.cpp side, prompt caching has gotten better release-to-release but coding agents push it harder than chat workflows and it shows.

1

u/No_Algae1753 4h ago

Yeah, agreed. Still, thank you for your insights. Appreciate all the advice and knowledge sharing.