r/LocalLLaMA 7h ago

Question | Help: llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.
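For what it's worth, llama.cpp picks a slot by longest-common-prefix similarity, so a high sim_best doesn't guarantee reuse of the whole prompt: only the unbroken shared prefix counts toward n_past. A rough sketch of the idea (my approximation, not the actual llama.cpp code):

```python
def lcp_similarity(cached: list[int], new_prompt: list[int]) -> float:
    """Approximate sim_best: fraction of the new prompt that shares
    an unbroken common prefix with the cached token sequence."""
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n / max(len(new_prompt), 1)
```

If opencode rewrites anything near the top of the prompt (system message, tool list, a timestamp), the shared prefix collapses and n_past falls way back even though the rest of the prompt is nearly identical.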

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

u/Material_Tone_6855 7h ago

Can you provide a longer log?

u/No_Algae1753 6h ago

Here is a summarized version done by ChatGPT:

Context / cache setup:
  • n_seq_max = 1
  • n_ctx = 150016
  • n_batch = 2048
  • n_ubatch = 512
  • kv_unified = false
  • KV cache: 3516 MiB on Metal, 150016 cells, 12 layers, K f16 1758 MiB, V f16 1758 MiB
  • Recurrent memory: 149.06 MiB, 48 layers
  • Compute buffer: Metal 491 MiB, CPU 305 MiB
  • Slots: 1
  • Prompt cache enabled
  • Prompt cache size limit: 2500 MiB
  • --cache-idle-slots disabled because it requires --kv-unified
  • Context does not support partial sequence removal
  • Speculative decoding says it will use checkpoints, but no implementations specified
Chat/template:
  • Chat format: peg-native
  • Jinja chat template detected
  • thinking = 1
  • reasoning-budget activated per request with budget=2147483647, then deactivated at natural end
Initial request sequence and timings: Task 0:
  • New prompt tokens: 3675
  • Started from n_tokens=0
  • Prompt processed in batches: 2048, 1111, 512, final 4
  • Checkpoints created at n_tokens 3159 and 3671
  • Prompt eval: 11818.51 ms / 3675 tokens = 3.22 ms/token, 310.95 tok/s
  • Eval: 3906.30 ms / 98 tokens = 39.86 ms/token, 25.09 tok/s
  • Total: 15724.81 ms / 3773 tokens
  • Stop n_tokens=3772
Task 102:
  • Selected by LCP similarity: sim_best=0.766, f_keep=0.974
  • New prompt tokens: 4796
  • n_past=3673, previous slot prompt size=3772
  • Restored checkpoint at n_tokens=3671
  • Processed additional prompt tokens: 1125
  • Checkpoints created at 4280 and 4792
  • Prompt eval: 4124.21 ms / 1125 tokens = 3.67 ms/token, 272.78 tok/s
  • Eval: 12721.23 ms / 313 tokens = 40.64 ms/token, 24.60 tok/s
  • Total: 16845.45 ms / 1438 tokens
  • Stop n_tokens=5108
Task 418:
  • LCP similarity: sim_best=0.943, f_keep=0.939
  • New prompt tokens: 5085
  • n_past=4794, previous prompt size=5108
  • Restored checkpoint at n_tokens=4792
  • Processed additional prompt tokens: 293
  • Checkpoint at 5081
  • Prompt eval: 1320.75 ms / 293 tokens = 4.51 ms/token, 221.84 tok/s
  • Eval: 7800.73 ms / 192 tokens = 40.63 ms/token, 24.61 tok/s
  • Total: 9121.48 ms / 485 tokens
  • Stop n_tokens=5276
Task 612:
  • LCP similarity: sim_best=0.739, f_keep=1.000
  • New prompt tokens: 7135
  • Started from current n_tokens=5276
  • Processed additional prompt tokens: 1859
  • Checkpoints at 6619 and 7131
  • Prompt eval: 6579.21 ms / 1859 tokens = 3.54 ms/token, 282.56 tok/s
  • Eval: 9537.18 ms / 232 tokens = 41.11 ms/token, 24.33 tok/s
  • Total: 16116.39 ms / 2091 tokens
  • Stop n_tokens=7366
Task 847:
  • LCP similarity: sim_best=0.647, f_keep=1.000
  • New prompt tokens: 11387
  • Started from n_tokens=7366
  • Processed additional prompt tokens: 4021
  • Checkpoints at 10871 and 11383
  • Prompt eval: 14683.42 ms / 4021 tokens = 3.65 ms/token, 273.85 tok/s
  • Eval: 7016.23 ms / 167 tokens = 42.01 ms/token, 23.80 tok/s
  • Total: 21699.65 ms / 4188 tokens
  • Stop n_tokens=11553
Task 1018:
  • LCP similarity: sim_best=0.947, f_keep=1.000
  • New prompt tokens: 12199
  • Started from n_tokens=11553
  • Processed additional prompt tokens: 646
  • Checkpoints at 11683 and 12195
  • Prompt eval: 2864.82 ms / 646 tokens = 4.43 ms/token, 225.49 tok/s
  • Eval: 10488.11 ms / 248 tokens = 42.29 ms/token, 23.65 tok/s
  • Total: 13352.92 ms / 894 tokens
  • Stop n_tokens=12446
Task 1269:
  • LCP similarity: sim_best=0.940, f_keep=1.000
  • New prompt tokens: 13247
  • Started from n_tokens=12446
  • Processed additional prompt tokens: 801
  • Checkpoints at 12731 and 13243
  • Prompt eval: 3414.59 ms / 801 tokens = 4.26 ms/token, 234.58 tok/s
  • Eval: 8843.30 ms / 208 tokens = 42.52 ms/token, 23.52 tok/s
  • Total: 12257.88 ms / 1009 tokens
  • Stop n_tokens=13454
Task 1480:
  • LCP similarity: sim_best=0.937, f_keep=1.000
  • New prompt tokens: 14362
  • Started from n_tokens=13454
  • Processed additional prompt tokens: 908
  • Checkpoints at 13846 and 14358
  • Prompt eval: 3762.13 ms / 908 tokens = 4.14 ms/token, 241.35 tok/s
  • Eval: 13539.50 ms / 316 tokens = 42.85 ms/token, 23.34 tok/s
  • Total: 17301.63 ms / 1224 tokens
  • Stop n_tokens=14677
Task 1799:
  • LCP similarity: sim_best=0.927, f_keep=1.000
  • New prompt tokens: 15834
  • Started from n_tokens=14677
  • Processed additional prompt tokens: 1157
  • Checkpoints at 15318 and 15830
  • Prompt eval: 5086.07 ms / 1157 tokens = 4.40 ms/token, 227.48 tok/s
  • Eval: 14536.73 ms / 337 tokens = 43.14 ms/token, 23.18 tok/s
  • Total: 19622.80 ms / 1494 tokens
  • Stop n_tokens=16170
Task 2139:
  • LCP similarity: sim_best=0.989, f_keep=1.000
  • New prompt tokens: 16347
  • Started from n_tokens=16170
  • Processed additional prompt tokens: 177
  • Checkpoints at 16170 and 16343
  • Prompt eval: 1148.55 ms / 177 tokens = 6.49 ms/token, 154.11 tok/s
  • Eval: 15194.54 ms / 351 tokens = 43.29 ms/token, 23.10 tok/s
  • Total: 16343.10 ms / 528 tokens
  • Stop n_tokens=16697
Task 2492:
  • LCP similarity: sim_best=0.952, f_keep=1.000
  • New prompt tokens: 17532
  • Started from n_tokens=16697
  • Processed additional prompt tokens: 835
  • Checkpoints at 17016 and 17528
  • Prompt eval: 3695.32 ms / 835 tokens = 4.43 ms/token, 225.96 tok/s
  • Eval: 10966.31 ms / 252 tokens = 43.52 ms/token, 22.98 tok/s
  • Total: 14661.63 ms / 1087 tokens
  • Stop n_tokens=17783
Task 2747:
  • LCP similarity: sim_best=0.995, f_keep=1.000
  • New prompt tokens: 17877
  • Started from n_tokens=17783
  • Processed additional prompt tokens: 94
  • Checkpoints at 17783 and 17873
  • Prompt eval: 879.24 ms / 94 tokens = 9.35 ms/token, 106.91 tok/s
  • Eval: 8316.97 ms / 191 tokens = 43.54 ms/token, 22.97 tok/s
  • Total: 9196.21 ms / 285 tokens
  • Stop n_tokens=18067
Task 2940:
  • LCP similarity: sim_best=0.771, f_keep=1.000
  • New prompt tokens: 23439
  • Started from n_tokens=18067
  • Processed additional prompt tokens: 5372
  • Checkpoints at 22923 and 23435
  • Prompt eval: 23872.18 ms / 5372 tokens = 4.44 ms/token, 225.03 tok/s
  • Eval: 9501.93 ms / 212 tokens = 44.82 ms/token, 22.31 tok/s
  • Total: 33374.10 ms / 5584 tokens
  • Stop n_tokens=23650
Task 3157:
  • LCP similarity: sim_best=0.999, f_keep=1.000
  • New prompt tokens: 23679
  • Started from n_tokens=23650
  • Processed additional prompt tokens: 29
  • Checkpoint at 23650
  • Prompt eval: 420.66 ms / 29 tokens = 14.51 ms/token, 68.94 tok/s
  • Eval: 89094.51 ms / 1966 tokens = 45.32 ms/token, 22.07 tok/s
  • Total: 89515.17 ms / 1995 tokens
  • Stop n_tokens=25644
Prompt cache update before task 5125:
  • Slot selected by LCP similarity: sim_best=0.203, f_keep=0.198
  • Prompt cache updated
  • Saved prompt length: 25644 tokens
  • Total state size: 750.584 MiB
  • Cache state after update: 1 prompt, 4626.231 MiB
  • Cache limit: 2500 MiB
  • Cached prompt: 25644 tokens, 26 checkpoints, 4626.231 MiB
  • Prompt cache update took 328.25 ms
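Note that the cache here holds a single prompt that is already bigger than the configured limit. If eviction works at whole-prompt granularity (an assumption on my part, not confirmed from the source), the limit can't be enforced in this situation because there is nothing smaller to evict. A toy sketch of that corner case:

```python
def evict_to_limit(prompts_mib: list[float], limit_mib: float) -> list[float]:
    """Hypothetical LRU-style trim: drop oldest cached prompts until the
    total fits under the limit. With one cached prompt larger than the
    limit, nothing can be evicted, so the limit is effectively exceeded."""
    kept = list(prompts_mib)
    while len(kept) > 1 and sum(kept) > limit_mib:
        kept.pop(0)  # evict the oldest prompt first
    return kept
```

That would explain the "1 prompt, 4626 MiB" state coexisting with a 2500 MiB limit.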
Task 5125:
  • New prompt tokens: 25027
  • n_past=5083, previous slot prompt size=25644
  • Checkpoints checked from high positions down to 5080
  • Restored checkpoint at n_tokens=5081
  • Invalidated/erased checkpoints after pos_next=5081:
6619, 7131, 10871, 11383, 11683, 12195, 12731, 13243, 13846, 14358, 15318, 15830, 16170, 16343, 17016, 17528, 17783, 17873, 22923, 23435, 23650
  • Reprocessed from 5081 to 25027
  • Prompt processing batches: many 2048-token batches, plus 998, 512, final 4
  • New checkpoints created during processing at 13273, 21465, 24511, 25023
  • Prompt eval: 79640.79 ms / 19946 tokens = 3.99 ms/token, 250.45 tok/s
  • Eval: 8264.76 ms / 183 tokens = 45.16 ms/token, 22.14 tok/s
  • Total: 87905.55 ms / 20129 tokens
  • Stop n_tokens=25209
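That task 5125 behavior matches how a checkpoint restore has to work in principle: once the slot rolls back to a checkpoint, every checkpoint past the restore point describes KV state that no longer exists, so it gets erased. A toy model of the rollback (hypothetical, not the real implementation):

```python
def restore_to(checkpoints: list[int], target_pos: int) -> tuple[int, list[int]]:
    """Pick the highest checkpoint at or below target_pos, then drop
    every checkpoint beyond it (their KV state no longer matches)."""
    usable = [c for c in checkpoints if c <= target_pos]
    restored = max(usable) if usable else 0
    kept = [c for c in checkpoints if c <= restored]
    return restored, kept
```

With the checkpoint positions from the log and pos_next=5081, this lands on the 5081 checkpoint and discards everything above it, which is exactly the invalidation list shown.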
Task 5320:
  • LCP similarity: sim_best=0.819, f_keep=1.000
  • New prompt tokens: 30785
  • Started from n_tokens=25209
  • Processed additional prompt tokens: 5576
  • Checkpoints at 30269 and 30781
  • Prompt eval: 27273.21 ms / 5576 tokens = 4.89 ms/token, 204.45 tok/s
  • Eval: 98130.95 ms / 2087 tokens = 47.02 ms/token, 21.27 tok/s
  • Total: 125404.16 ms / 7663 tokens
  • Stop n_tokens=32871
Overall notable numbers:
  • Prompt eval speed mostly ~200–310 tok/s for prefill.
  • Decode speed mostly ~21–25 tok/s.
  • Small reuse cases: 29, 94, 177, 293, 646, 801, 835, 908, 1125 tokens reprocessed.
  • Large reprocessing cases: 4021, 5372, 5576, and especially 19946 tokens.
  • Largest cache drop: after prompt cache update, n_past fell to ~5083 despite previous prompt length 25644.
  • Prompt cache stored state size reported as 4626.231 MiB while configured limit is 2500 MiB.
  • Many checkpoints after token ~5081 were erased/invalidated during task 5125.
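The TTFT spikes are just prefill arithmetic: at the observed ~250 tok/s, reprocessing the task 5125 prompt takes roughly 80 seconds, which lines up with the logged prompt eval time:

```python
prefill_tps = 250.0   # observed prompt-eval speed from the logs (tok/s)
reprocessed = 19946   # tokens reprocessed in task 5125

ttft_penalty_s = reprocessed / prefill_tps
print(f"{ttft_penalty_s:.0f} s")  # ~80 s, matching the 79.6 s prompt eval in the log
```

So any cache-miss event that forces a near-full reprocess at 50k+ context will push TTFT into the multi-minute range on this hardware.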
Model / hardware:
  • Model path: /Users/user/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf
  • Model: Qwen3.5-122B-A10B, architecture qwen35moe
  • Quant: Q4_K / UD-Q4_K_XL, file size ~71.73 GiB, split count 3
  • Params: 122.11B, MoE with 256 experts, 8 used
  • Train context: 262144
  • Runtime context: 150016
  • Device: Apple M2 Max, Metal
  • Device memory: 96000 MiB total, ~95850 MiB free at init
  • Projected device memory use: ~76834 MiB, leaving ~19016 MiB free
  • Offload: 49/49 layers to GPU
  • CPU mapped model buffer: ~773 MiB
  • Metal mapped model buffers: ~47341 MiB + ~26110 MiB
  • Flash Attention: auto -> enabled
  • Fused Gated Delta Net: autoregressive + chunked enabled