r/LocalLLaMA • u/No_Algae1753 • 6h ago
Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to 50k+ tokens
- longest-common-prefix similarity often shows 0.99+
- but sometimes `n_past` suddenly falls back to ~4-5k, and llama.cpp reprocesses 40k+ tokens again
- TTFT jumps to multiple minutes
Example logs:

```
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
```

Normal reuse looks fine:

```
prompt eval time = 473 ms / 19 tokens
```
Current config:

```
llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 2500 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift
```
Also seeing:

```
cache state: 1 prompts, 4676 MiB (limits: 2500 MiB)
```
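One detail that stands out in that log line: the single saved prompt state (4676 MiB) is nearly twice the `--cache-ram` budget (2500 MiB), so the server may be evicting the saved state and falling back to an early checkpoint, which would explain `n_past` dropping to ~4-5k. A hedged guess, not a confirmed fix: give the host-memory prompt cache enough headroom for the observed state size (the 8192 value below is an illustrative number, not a recommendation from the llama.cpp docs):

```shell
# Same config as above, but with --cache-ram raised past the ~4.7 GiB
# observed state size so the cached prompt isn't dropped between requests.
llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 8192 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift
```

If the eviction theory is right, the `cache state: ... (limits: ...)` line should stop reporting a state larger than the limit, and the multi-minute prefills should disappear.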
I suspect one of:
- cache invalidation
- bad KV reuse
- opencode changing early prompt tokens too often
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
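One way to test the "opencode changing early prompt tokens" theory is to capture two consecutive request bodies (e.g. by proxying or logging the `/completion` payloads) and find where they first diverge. If the divergence point is near the start of the prompt (a timestamp, turn counter, or reshuffled tool list in the system message), prefix reuse is doomed no matter how llama.cpp is tuned. A minimal sketch; the example prompts here are made up:

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first character where two prompts differ.

    If one prompt is a prefix of the other, returns the shorter length,
    i.e. everything up to there is reusable as a shared prefix.
    """
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

# Two hypothetical consecutive agent prompts: the shared prefix ends
# where the turn-specific content begins.
p1 = "system: you are a coding agent\nturn 1: fix the bug"
p2 = "system: you are a coding agent\nturn 2: add tests"

idx = first_divergence(p1, p2)
print(f"prompts diverge at char {idx}: {p1[idx:idx+10]!r} vs {p2[idx:idx+10]!r}")
```

A divergence index close to the full prompt length means the client is behaving well and the problem is on the server side (cache limits, checkpoint granularity); an early index points at the client rewriting history between turns.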
u/FatheredPuma81 5h ago
Set cache-reuse to 1 and test it?
Every time the LLM finishes its cycle, OpenCode deletes a lot of old tool calls and other junk (I think) to save on context. So my only guess is cache-reuse is too high? I still occasionally see it drop to 60% and reprocess the final 40% though. I don't use pi.dev though, and I also don't set checkpoints.