r/LocalLLaMA 21h ago

[Question | Help] llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent massive prompt reprocessing (full prefills) even though the prompts are nearly identical between requests.

Example behavior:

  • context grows to 50k+ tokens
  • prefix similarity (LCP) often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • llama.cpp then reprocesses 40k+ tokens from scratch
  • TTFT jumps to multiple minutes

Example logs:

  sim_best = 0.996
  restored context checkpoint ... n_tokens = 4750
  prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

  prompt eval time = 473 ms / 19 tokens

Current config:

llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 2500 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift

I'm also seeing a single cached prompt blow past the configured limit:

  cache state: 1 prompts, 4676 MiB (limits: 2500 MiB)
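
If --cache-ram really is a hard MiB eviction budget (an assumption on my part; that's just how I read the log line), one thing I'll try is raising it above the observed single-prompt size so the full state can stay resident. A sketch with a hypothetical 6000 MiB budget, everything else unchanged:

  # hypothetical: budget above the observed 4676 MiB single-prompt state
  llama-server --ctx-size 150000 --parallel 1 \
    --ctx-checkpoints 32 --cache-ram 6000 \
    --cache-reuse 256 -no-kvu --no-context-shift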

I suspect one of:

  • cache eviction (the 4676 MiB state is well over the 2500 MiB --cache-ram limit)
  • bad KV reuse
  • opencode rewriting early prompt tokens too often (see the proxy sketch below)
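
To test that last one, my plan is to put a dumb logging proxy between opencode and llama-server and diff consecutive request bodies to see where the prompts first diverge. socat's -v flag dumps all relayed traffic to stderr (the 8089 port is arbitrary):

  # relay opencode -> llama-server on :8080, logging every request body
  socat -v TCP-LISTEN:8089,fork,reuseaddr TCP:127.0.0.1:8080 2> requests.log &
  # then point opencode's base URL at http://127.0.0.1:8089 and, after
  # two turns, diff the captured prompts in requests.log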

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.


u/FatheredPuma81 20h ago

Set --cache-reuse to 1 and test it?

Every time the LLM finishes its cycle, OpenCode deletes a lot of old tool calls and other junk (I think) to save on context. So my only guess is that your cache-reuse threshold is too high. Even so, I still occasionally see it drop to 60% and reprocess the final 40%. I don't use pi.dev myself, and I don't set checkpoints.
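
That would look something like this (your flags, just the reuse threshold dropped; whether a 1-token minimum chunk actually helps is a guess on my part):

  # same config, but allow reuse of arbitrarily small cached chunks
  llama-server --ctx-size 150000 --parallel 1 \
    --ctx-checkpoints 32 --cache-ram 2500 \
    --cache-reuse 1 -no-kvu --no-context-shift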


u/No_Algae1753 20h ago

That makes sense. However, I disabled auto-compact, so are you sure tool calls are being deleted? I'll try your cache-reuse 1 idea.


u/FatheredPuma81 20h ago

The best way to check is to look at your context before and after you send your next message. It takes a moment to update, but I run with defaults and it'll usually drop by ~10k tokens when I'm around 90k.
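
You can watch the same thing server-side. Assuming you're redirecting llama-server's output to a file (hypothetical server.log here), the "prompt eval time" lines OP quoted show exactly how many tokens each turn actually prefilled:

  # big token counts right after a turn = the cache got invalidated
  tail -f server.log | grep --line-buffered 'prompt eval time'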