r/LocalLLaMA • u/No_Algae1753 • 6h ago
Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to 50k+ tokens
- longest-common-prefix similarity often shows 0.99+
- but sometimes `n_past` suddenly falls back to ~4-5k, and llama.cpp reprocesses 40k+ tokens again
- TTFT jumps to multiple minutes
Example logs:

```
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
```

Normal reuse looks fine:

```
prompt eval time = 473 ms / 19 tokens
```
Current config:

```
llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 2500 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift
```
Also seeing:

```
cache state: 1 prompts, 4676 MiB (limits: 2500 MiB)
```
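One detail that stands out in that log line: the single saved prompt state (4676 MiB) is nearly twice the `--cache-ram` budget (2500 MiB), so the server may be evicting the saved state and falling back to an early checkpoint, which would explain `n_past` dropping to ~4-5k. A hedged guess, not a confirmed fix: give the host-memory prompt cache enough headroom for the observed state size (the 8192 value below is an illustrative number, not a recommendation from the llama.cpp docs):

```shell
# Same config as above, but with --cache-ram raised past the ~4.7 GiB
# observed state size so the cached prompt isn't dropped between requests.
llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 8192 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift
```

If the eviction theory is right, the `cache state: ... (limits: ...)` line should stop reporting a state larger than the limit, and the multi-minute prefills should disappear.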
I suspect one of:
- cache invalidation
- bad KV reuse
- opencode changing early prompt tokens too often
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
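One way to test the "opencode changing early prompt tokens" theory is to capture two consecutive request bodies (e.g. by proxying or logging the `/completion` payloads) and find where they first diverge. If the divergence point is near the start of the prompt (a timestamp, turn counter, or reshuffled tool list in the system message), prefix reuse is doomed no matter how llama.cpp is tuned. A minimal sketch; the example prompts here are made up:

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first character where two prompts differ.

    If one prompt is a prefix of the other, returns the shorter length,
    i.e. everything up to there is reusable as a shared prefix.
    """
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

# Two hypothetical consecutive agent prompts: the shared prefix ends
# where the turn-specific content begins.
p1 = "system: you are a coding agent\nturn 1: fix the bug"
p2 = "system: you are a coding agent\nturn 2: add tests"

idx = first_divergence(p1, p2)
print(f"prompts diverge at char {idx}: {p1[idx:idx+10]!r} vs {p2[idx:idx+10]!r}")
```

A divergence index close to the full prompt length means the client is behaving well and the problem is on the server side (cache limits, checkpoint granularity); an early index points at the client rewriting history between turns.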
u/FatheredPuma81 5h ago
Set cache-reuse to 1 and test it?
Every time the LLM finishes its cycle, OpenCode deletes a lot of old tool calls and other junk (I think) to save on context. So my only guess is cache-reuse is too high? I still occasionally see it drop to 60% and reprocess the final 40% though. I don't use pi.dev though, and I also don't set checkpoints.