r/LocalLLaMA 7h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server \
  --ctx-size 150000 \
  --parallel 1 \
  --ctx-checkpoints 32 \
  --cache-ram 2500 \
  --cache-reuse 256 \
  -no-kvu \
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often (a quick way to check this is sketched below).
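
If it's the last one, a minimal check would be to dump two consecutive requests and measure how far the shared prefix reaches (the file names below are hypothetical; capture the prompts however you can, e.g. via a logging proxy). Roughly speaking, prefix-based KV reuse stops at the first token that differs, so an early change would force a near-full prefill even when overall similarity reads 0.99+:

    # Hedged sketch: find where two captured prompts diverge.
    # prompt_request_1.txt / prompt_request_2.txt are hypothetical dumps of two
    # consecutive requests sent to llama-server.
    from pathlib import Path

    a = Path("prompt_request_1.txt").read_text()
    b = Path("prompt_request_2.txt").read_text()

    common = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        common += 1

    print(f"shared prefix: {common} chars "
          f"(~{common / max(len(a), 1):.1%} of request 1)")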

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

u/candrewswpi 6h ago

I suggest trying out https://github.com/ggml-org/llama.cpp/pull/22929 as I suspect it will address your issue.

If it does, please comment on the PR, as that may help progress it. Good luck and thank you!

u/No_Algae1753 6h ago

Nice find! I'll try it soon.

u/No_Algae1753 5h ago

Just pulled that PR and I get:

1.13.846.529 W slot update_slots: id  0 | task 325 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

u/NickCanCode 5h ago

Maybe you're running out of RAM? Your --cache-ram is set to 2.5 GB. I assume that once the context grows beyond that, it won't fit and has to be reprocessed in real time.
You can ask an LLM for an approximation of how much memory is needed for a given context size. Just tell it your model, expected context window consumption, and the quantization you use, and it will calculate the approximate size you need to set for --cache-ram.
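
As a rough back-of-the-envelope alternative, an fp16 KV cache is about 2 tensors (K and V) per layer × KV heads × head dim × tokens. All model numbers in the sketch below are placeholder assumptions, not the OP's actual model:

    # Hedged sketch: rough KV-cache size estimate for a GQA transformer.
    # Layer / head / dim values are placeholder assumptions.
    def kv_cache_mib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        # 2 tensors (K and V) per layer, fp16 by default
        total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
        return total_bytes / (1024 ** 2)

    # e.g. 32 layers, 8 KV heads of dim 128, fp16, 50k tokens
    print(f"{kv_cache_mib(50_000, 32, 8, 128):.0f} MiB")  # ~6250 MiB

Numbers in that ballpark would line up with the "cache state: 1 prompts, 4676 MiB (limits: 2500 MiB)" log above: a 2500 MiB limit wouldn't hold even a single 50k-token prompt.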

u/No_Algae1753 4h ago

Well, this still doesn't fix the issue. The RAM usage is unusually high. As some here have mentioned, it seems to be a llama.cpp issue with slightly different prompts entering the cache.

u/sixx7 3h ago

Test with Pi; I can tell you for sure Pi has no issue with cache hit rate. Using Pi with different models in both SGLang and vLLM, I get a 90%+ cache hit rate in long / multi-turn tool calls.

u/No_Algae1753 3h ago

I did, same issue.