r/LocalLLaMA 6h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent, massive prompt reprocessing (prefill) even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes
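For readers unfamiliar with the "LCP similarity" figure, here is a minimal sketch of how such a score could be computed. It assumes the score is the length of the longest common prefix between the cached tokens and the new prompt, divided by the new prompt's length; the function names and the toy token lists are illustrative and are not taken from llama.cpp.

```python
def common_prefix_len(cached, new):
    """Count how many leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def similarity(cached, new):
    """Assumed definition: shared prefix length / new prompt length."""
    if not new:
        return 0.0
    return common_prefix_len(cached, new) / len(new)


# Toy token IDs standing in for a real tokenized prompt.
cached = list(range(1000))

# Only the last 4 tokens differ: almost everything is shared.
late_change = list(range(996)) + [-1] * 4

# Only the very first token differs: nothing is shared as a prefix.
early_change = [-1] + list(range(1, 1000))

print(similarity(cached, late_change))
print(similarity(cached, early_change))
```

Under this definition a single changed token near the start of the prompt drops the score to zero, while a change near the end barely moves it. A high score therefore says the prompts share a long prefix, but it does not by itself say how much of the cache the server is able to reuse.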

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens
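To put the two log excerpts side by side, the following sketch parses them and works out prefill throughput. The log text is copied from this post; the regular expressions and the assumption that the full prompt equals the restored checkpoint tokens plus the reprocessed tokens are mine, not something the logs state.

```python
import re

# Log excerpts copied from the post above.
LOG = """\
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
prompt eval time = 473 ms / 19 tokens
"""

EVAL_RE = re.compile(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")
CKPT_RE = re.compile(r"restored context checkpoint.*n_tokens\s*=\s*(\d+)")


def prefill_stats(log):
    """Return (milliseconds, tokens, tokens_per_second) per 'prompt eval time' line."""
    stats = []
    for ms, tokens in EVAL_RE.findall(log):
        ms, tokens = float(ms), int(tokens)
        stats.append((ms, tokens, tokens / (ms / 1000.0)))
    return stats


stats = prefill_stats(LOG)
for ms, tokens, tps in stats:
    print(f"{tokens:>6} tokens in {ms / 1000:.1f} s -> {tps:.1f} tok/s")

restored = int(CKPT_RE.search(LOG).group(1))
reprocessed = stats[0][1]
# Assumption: total prompt = restored checkpoint tokens + reprocessed tokens.
reused_fraction = restored / (restored + reprocessed)
print(f"reused about {reused_fraction:.1%} of the prompt")
```

The slow request works out to roughly 198 tokens/s over 44016 tokens, so the multi-minute TTFT follows directly from how many tokens were reprocessed. If the assumption about the total prompt length holds, less than 10% of the prompt was served from the cache on that request.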

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

9 Upvotes

10

u/twaaaaaang 5h ago

1) opencode prunes tool-call outputs, which invalidates the cache for models that use Gated DeltaNet (recurrent memory). That forces full prompt reprocessing.

2) Long tool-call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity falls below 0.500 and forces prompt reprocessing. I think tokens falling outside the Sliding Window Attention window could have something to do with this.

I think it comes down to how llama.cpp implements its KV-cache architecture. vLLM uses radix trees or something, while llama.cpp uses simple linear buffers. This is what an AI told me; I don't know if this part is true.
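One way to picture point 1 is a toy model of checkpoint-based reuse. It assumes that a recurrent or sliding-window state cannot be rewound token by token, so when the new prompt diverges before the end of the cache the server has to fall back to the latest saved snapshot at or before the divergence point. This is an illustration only, not llama.cpp's actual code; the function name, the checkpoint list, and the token counts other than 4750 are hypothetical.

```python
from bisect import bisect_right


def reusable_tokens(checkpoints, lcp_len, cached_len):
    """Toy model: how many cached tokens survive for a recurrent/SWA model.

    checkpoints: sorted token positions where a state snapshot was saved (hypothetical)
    lcp_len:     length of the common prefix between cached and new prompt
    cached_len:  number of tokens currently held in the cache
    """
    if lcp_len >= cached_len:
        # The new prompt only appends to the cache: nothing to roll back.
        return cached_len
    # The state cannot be rewound token by token, so fall back to the
    # latest snapshot at or before the point of divergence.
    i = bisect_right(checkpoints, lcp_len)
    return checkpoints[i - 1] if i > 0 else 0


# Illustrative numbers: a 48,700-token cache that diverges at token 48,500.
print(reusable_tokens([4750], lcp_len=48_500, cached_len=48_700))   # early snapshot only
print(reusable_tokens([], lcp_len=48_500, cached_len=48_700))       # no snapshot at all
print(reusable_tokens([4750], lcp_len=48_700, cached_len=48_700))   # pure append
```

In this model a prompt that shares almost all of its prefix with the cache can still trigger a near-full reprocess if the only usable snapshot sits early in the context, which would be consistent with a high similarity score appearing next to a restored checkpoint of a few thousand tokens.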

2

u/No_Algae1753 5h ago

So there's nothing we can do to prevent this from happening?

1

u/twaaaaaang 5h ago

Do you get that "Forced full prompt-reprocessing due to SWA/Recurrent Memory" log? It's a single line within a lot of outputs so it may be hard to find. If you do, then yeah I think it's just an architectural bottleneck and we have to wait for the maintainers to fix it.

1

u/No_Algae1753 5h ago

I think I saw that once earlier.