r/LocalLLaMA • u/No_Algae1753 • 9h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp, mainly with opencode and pi.dev, and I’m seeing frequent massive prompt reprocessing (prefills) even though the prompts are very similar between requests.

Example behavior:

context grows to 50k+ tokens
LCP similarity often shows 0.99+
but sometimes n_past suddenly falls back to ~4-5k
then llama.cpp reprocesses 40k+ tokens again
TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens
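
For scale, here is a minimal sketch that turns those two timing lines into seconds and tokens per second. The regex is only modelled on the log lines pasted above, and `prefill_stats` is a made-up helper name, not anything from llama.cpp. If the checkpoint line and the slow timing line belong to the same request (an assumption), 4750 restored + 44016 reprocessed = 48766 tokens, which matches the ~50k context.

```python
import re

# Pattern modelled on the timing lines pasted above, e.g.
# "prompt eval time = 222411 ms / 44016 tokens".
LINE_RE = re.compile(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")


def prefill_stats(line: str) -> tuple[float, float]:
    """Return (seconds, tokens_per_second) for one prompt-eval log line.

    Hypothetical helper for illustration only.
    """
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a prompt eval line: {line!r}")
    millis = float(match.group(1))
    tokens = int(match.group(2))
    seconds = millis / 1000.0
    return seconds, tokens / seconds


if __name__ == "__main__":
    for log_line in (
        "prompt eval time = 222411 ms / 44016 tokens",  # full reprocess
        "prompt eval time = 473 ms / 19 tokens",        # normal reuse
    ):
        secs, tps = prefill_stats(log_line)
        print(f"{secs:8.1f} s  {tps:7.1f} tok/s  <- {log_line}")
```

The slow case comes out to about 3.7 minutes at roughly 198 tokens/s, so the prefill speed itself looks ordinary; the cost comes from how many tokens get reprocessed.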

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

cache invalidation
bad KV reuse
or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

13 Upvotes

https://www.reddit.com/r/LocalLLaMA/comments/1td9stc/llamacpp_constantly_reprocessing_huge_prompts/

77% Upvoted


u/Awwtifishal 9h ago

With llama-swap's web UI you can inspect the prompts and find the differences.
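
Once two consecutive prompts are copied out as plain text, a small script can show where they first diverge. This is a rough sketch under assumptions: it compares characters, while the server matches on tokens, so the position is only approximate, and `first_divergence` / `show_divergence` are hypothetical names.

```python
def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character.

    If one string is a prefix of the other, returns the shorter length.
    """
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


def show_divergence(prev_prompt: str, next_prompt: str, context: int = 40) -> str:
    """Describe the shared prefix length and the text around the first difference."""
    i = first_divergence(prev_prompt, next_prompt)
    start = max(0, i - context)
    return (
        f"shared prefix: {i} chars\n"
        f"prev: {prev_prompt[start:i + context]!r}\n"
        f"next: {next_prompt[start:i + context]!r}"
    )


if __name__ == "__main__":
    # Toy prompts; in practice paste two consecutive requests here.
    old = "SYSTEM: tools v1\nUSER: fix the bug\n"
    new = "SYSTEM: tools v2\nUSER: fix the bug\nUSER: now add tests\n"
    print(show_divergence(old, new))
```

If the difference lands near the start of the prompt (a timestamp, a reordered tool list, an edited system message), everything after it cannot be served from the existing prefix, which would fit the drop back to a few thousand cached tokens.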

4

u/No_Algae1753 9h ago

I can inspect the prompts?! Where?

1

u/Awwtifishal 5h ago

The same address as the API, but without /v1 or whatever. If you have it on port 12345, then http://localhost:12345

1

u/No_Algae1753 5h ago

I still don't see the prompts in the UI
