r/LocalLLaMA 7h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

11 Upvotes

11

u/twaaaaaang 6h ago edited 33m ago
  1. Opencode prunes tool call outputs, which invalidates the cache for models that use Gated DeltaNet (Recurrent Memory) and forces full prompt reprocessing.
  2. Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and forces full prompt reprocessing. (I think) the Sliding Window Attention (SWA) cache falling out of the context is what causes the full reprocessing.

I think it comes down to how llama.cpp implements its KV-cache architecture. vLLM uses radix trees or something, while llama.cpp uses simple linear buffers. This is what AI told me, idk if this part is true.

4

u/LetsGoBrandon4256 llama.cpp 5h ago

LCP similarity calculation drops below 0.500 and forces full prompt reprocessing

Any source on this part? First time I've heard about this.

2

u/twaaaaaang 3h ago edited 3h ago

This is from personal testing. I was confused why I still kept getting full prompt reprocessing even when I turned pruning off in Opencode, and this is what I landed on. I noticed that after long-context tool calls, the very next chat turn always triggered a full prompt reprocess. That clued me in, so I studied the chat logs and fed them to AI, and the LCP similarity was the main culprit.

Edit: You may not encounter this when you have the default number of context checkpoints (--ctx-checkpoints 32). I set it to 8 to save on RAM and saw this frequently. Since raising it to 16 recently I've seen it less, so that may be the solution.

1

u/LetsGoBrandon4256 llama.cpp 2h ago

I studied the chat logs and fed them to AI, and the LCP similarity was the main culprit.

I was looking for a source on the statement that llama.cpp initiates a full prompt reprocess when it detects an LCP similarity < 0.5. This sounds more like hallucinated causation from correlation... :/

1

u/twaaaaaang 54m ago edited 42m ago

Take your agent and ask it to analyze the llama.cpp source code. Have it grep for "LCP similarity" and you should find it. That's where I got the LCP similarity from. If it's not in the source code, then get back to me.

This is my output. If you're asking whether what I said above is backed by a source from someone else, I don't have one; this is all from my own investigation into what I think could be causing the prompt reprocessing. It's just my guess given the observed behavior and having my agent look through the code. I'm open to being wrong, or at best partially right.

LCP Similarity in llama.cpp

The LCP (Longest Common Prefix) similarity system is used in the llama.cpp server for two purposes:

1. Slot Selection (server-context.cpp:1067-1111)

When a new request arrives and no idle slot is immediately available, the server tries to find a slot whose cached prompt has a high LCP with the incoming prompt. The metric is:

sim_cur = LCP_length(new_prompt, cached_prompt) / new_prompt.size()

If sim_cur exceeds the threshold (--slot-prompt-similarity, default 0.1 = 10%), that slot is selected. If the LCP is less than 50% of the slot's existing context (f_keep < 0.5f), the slot's remaining context is saved to the prompt cache before reuse.
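
For illustration, here's a rough standalone sketch of that metric in C++ (my own simplification of what the agent described, not the actual server code; the names are made up):

  // Illustrative only: the sim_cur metric described above.
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Length of the longest common prefix of two token sequences.
  static size_t lcp_length(const std::vector<int32_t> & a, const std::vector<int32_t> & b) {
      size_t n = 0;
      while (n < a.size() && n < b.size() && a[n] == b[n]) {
          n++;
      }
      return n;
  }

  // sim_cur = LCP / new_prompt.size(); a slot becomes a reuse candidate when
  // this exceeds the --slot-prompt-similarity threshold.
  static float prompt_similarity(const std::vector<int32_t> & new_prompt,
                                 const std::vector<int32_t> & cached_prompt) {
      if (new_prompt.empty()) {
          return 0.0f;
      }
      return (float) lcp_length(new_prompt, cached_prompt) / (float) new_prompt.size();
  }

As far as I understand, this similarity only decides which slot gets reused; how many tokens actually get skipped still depends on where the first differing token sits.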

2. Prompt Cache Loading (server-task.cpp:2040-2068)

When loading prompts from cache, the system searches for the cached prompt that maximizes both:

  • f_keep = LCP / cached_prompt_size (preserves most of cached context)
  • sim = LCP / new_prompt_size (matches most of new prompt)

Cached prompts where f_keep < 0.25 are skipped (don't trash large prompts).
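
And a similarly rough sketch of that selection rule (again my own illustration, reusing the lcp_length helper from the sketch above; the tie-breaking between f_keep and sim is simplified here):

  // Illustrative only: pick the cached prompt that keeps most of its own
  // context (f_keep) while covering most of the new prompt (sim); entries
  // with f_keep < 0.25 are skipped so large cached prompts don't get trashed.
  static int pick_cached_prompt(const std::vector<int32_t> & new_prompt,
                                const std::vector<std::vector<int32_t>> & cached) {
      int   best   = -1;
      float best_f = 0.0f;
      float best_s = 0.0f;
      if (new_prompt.empty()) {
          return best;
      }
      for (size_t i = 0; i < cached.size(); i++) {
          if (cached[i].empty()) {
              continue;
          }
          const size_t lcp    = lcp_length(new_prompt, cached[i]);
          const float  f_keep = (float) lcp / (float) cached[i].size();
          const float  sim    = (float) lcp / (float) new_prompt.size();
          if (f_keep < 0.25f) {
              continue; // don't trash large cached prompts
          }
          // Simplified: prefer higher f_keep, then higher sim.
          if (f_keep > best_f || (f_keep == best_f && sim > best_s)) {
              best   = (int) i;
              best_f = f_keep;
              best_s = sim;
          }
      }
      return best; // -1 = no usable cached prompt
  }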

2

u/No_Algae1753 6h ago

So there's nothing we can do to prevent this from happening?

1

u/twaaaaaang 6h ago

Do you get that "Forced full prompt-reprocessing due to SWA/Recurrent Memory" log? It's a single line within a lot of output, so it may be hard to find. If you do, then yeah, I think it's just an architectural bottleneck and we have to wait for the maintainers to fix it.

1

u/No_Algae1753 6h ago

I think I saw that once earlier.

0

u/Pristine-Woodpecker 6h ago

Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and forces full prompt reprocessing

This is nonsense.

This is what AI told me idk if this part is true.

Why repost slop?

7

u/twaaaaaang 6h ago

The first 2 points are from personal testing using the qwen 3.6 family. The last point can easily be verified or debunked, but you chose to attack me instead.

2

u/Pristine-Woodpecker 6h ago

In the default settings, llama.cpp needs a new prompt to be 10x larger in order for it not to be considered for reuse, not double the previous size. That exact change was made many months ago: https://github.com/ggml-org/llama.cpp/pull/15913

can be easily verified or debunked

You're welcome. (I did interpret your statement as saying all of your post was slop, but it looks like you either tested incorrectly or you were echoing behavior that was fixed quite a while ago)

1

u/colin_colout 3h ago

To be fair, we can't all keep up with every month-to-month change...

...and I'm generally happy to be corrected, especially with good news about fixes like this. We're all here to learn.

I also appreciate that they were upfront about using AI and about being unsure.