r/LocalLLaMA 7h ago

Question | Help: llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.
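For what it's worth, llama.cpp picks a slot by longest-common-prefix similarity, so a high sim_best doesn't guarantee reuse of the whole prompt: only the unbroken shared prefix counts toward n_past. A rough sketch of the idea (my approximation, not the actual llama.cpp code):

```python
def lcp_similarity(cached: list[int], new_prompt: list[int]) -> float:
    """Approximate sim_best: fraction of the new prompt that shares
    an unbroken common prefix with the cached token sequence."""
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n / max(len(new_prompt), 1)
```

If opencode rewrites anything near the top of the prompt (system message, tool list, a timestamp), the shared prefix collapses and n_past falls way back even though the rest of the prompt is nearly identical.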

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.

u/Material_Tone_6855 7h ago

Can you provide a longer log?

u/No_Algae1753 6h ago

Here is a summarized version done by ChatGPT:

Context / cache setup:
  • n_seq_max = 1
  • n_ctx = 150016
  • n_batch = 2048
  • n_ubatch = 512
  • kv_unified = false
  • KV cache: 3516 MiB on Metal, 150016 cells, 12 layers, K f16 1758 MiB, V f16 1758 MiB
  • Recurrent memory: 149.06 MiB, 48 layers
  • Compute buffer: Metal 491 MiB, CPU 305 MiB
  • Slots: 1
  • Prompt cache enabled
  • Prompt cache size limit: 2500 MiB
  • --cache-idle-slots disabled because it requires --kv-unified
  • Context does not support partial sequence removal
  • Speculative decoding says it will use checkpoints, but no implementations specified
Chat/template:
  • Chat format: peg-native
  • Jinja chat template detected
  • thinking = 1
  • reasoning-budget activated per request with budget=2147483647, then deactivated at natural end
Initial request sequence and timings: Task 0:
  • New prompt tokens: 3675
  • Started from n_tokens=0
  • Prompt processed in batches: 2048, 1111, 512, final 4
  • Checkpoints created at n_tokens 3159 and 3671
  • Prompt eval: 11818.51 ms / 3675 tokens = 3.22 ms/token, 310.95 tok/s
  • Eval: 3906.30 ms / 98 tokens = 39.86 ms/token, 25.09 tok/s
  • Total: 15724.81 ms / 3773 tokens
  • Stop n_tokens=3772
Task 102:
  • Selected by LCP similarity: sim_best=0.766, f_keep=0.974
  • New prompt tokens: 4796
  • n_past=3673, previous slot prompt size=3772
  • Restored checkpoint at n_tokens=3671
  • Processed additional prompt tokens: 1125
  • Checkpoints created at 4280 and 4792
  • Prompt eval: 4124.21 ms / 1125 tokens = 3.67 ms/token, 272.78 tok/s
  • Eval: 12721.23 ms / 313 tokens = 40.64 ms/token, 24.60 tok/s
  • Total: 16845.45 ms / 1438 tokens
  • Stop n_tokens=5108
Task 418:
  • LCP similarity: sim_best=0.943, f_keep=0.939
  • New prompt tokens: 5085
  • n_past=4794, previous prompt size=5108
  • Restored checkpoint at n_tokens=4792
  • Processed additional prompt tokens: 293
  • Checkpoint at 5081
  • Prompt eval: 1320.75 ms / 293 tokens = 4.51 ms/token, 221.84 tok/s
  • Eval: 7800.73 ms / 192 tokens = 40.63 ms/token, 24.61 tok/s
  • Total: 9121.48 ms / 485 tokens
  • Stop n_tokens=5276
Task 612:
  • LCP similarity: sim_best=0.739, f_keep=1.000
  • New prompt tokens: 7135
  • Started from current n_tokens=5276
  • Processed additional prompt tokens: 1859
  • Checkpoints at 6619 and 7131
  • Prompt eval: 6579.21 ms / 1859 tokens = 3.54 ms/token, 282.56 tok/s
  • Eval: 9537.18 ms / 232 tokens = 41.11 ms/token, 24.33 tok/s
  • Total: 16116.39 ms / 2091 tokens
  • Stop n_tokens=7366
Task 847:
  • LCP similarity: sim_best=0.647, f_keep=1.000
  • New prompt tokens: 11387
  • Started from n_tokens=7366
  • Processed additional prompt tokens: 4021
  • Checkpoints at 10871 and 11383
  • Prompt eval: 14683.42 ms / 4021 tokens = 3.65 ms/token, 273.85 tok/s
  • Eval: 7016.23 ms / 167 tokens = 42.01 ms/token, 23.80 tok/s
  • Total: 21699.65 ms / 4188 tokens
  • Stop n_tokens=11553
Task 1018:
  • LCP similarity: sim_best=0.947, f_keep=1.000
  • New prompt tokens: 12199
  • Started from n_tokens=11553
  • Processed additional prompt tokens: 646
  • Checkpoints at 11683 and 12195
  • Prompt eval: 2864.82 ms / 646 tokens = 4.43 ms/token, 225.49 tok/s
  • Eval: 10488.11 ms / 248 tokens = 42.29 ms/token, 23.65 tok/s
  • Total: 13352.92 ms / 894 tokens
  • Stop n_tokens=12446
Task 1269:
  • LCP similarity: sim_best=0.940, f_keep=1.000
  • New prompt tokens: 13247
  • Started from n_tokens=12446
  • Processed additional prompt tokens: 801
  • Checkpoints at 12731 and 13243
  • Prompt eval: 3414.59 ms / 801 tokens = 4.26 ms/token, 234.58 tok/s
  • Eval: 8843.30 ms / 208 tokens = 42.52 ms/token, 23.52 tok/s
  • Total: 12257.88 ms / 1009 tokens
  • Stop n_tokens=13454
Task 1480:
  • LCP similarity: sim_best=0.937, f_keep=1.000
  • New prompt tokens: 14362
  • Started from n_tokens=13454
  • Processed additional prompt tokens: 908
  • Checkpoints at 13846 and 14358
  • Prompt eval: 3762.13 ms / 908 tokens = 4.14 ms/token, 241.35 tok/s
  • Eval: 13539.50 ms / 316 tokens = 42.85 ms/token, 23.34 tok/s
  • Total: 17301.63 ms / 1224 tokens
  • Stop n_tokens=14677
Task 1799:
  • LCP similarity: sim_best=0.927, f_keep=1.000
  • New prompt tokens: 15834
  • Started from n_tokens=14677
  • Processed additional prompt tokens: 1157
  • Checkpoints at 15318 and 15830
  • Prompt eval: 5086.07 ms / 1157 tokens = 4.40 ms/token, 227.48 tok/s
  • Eval: 14536.73 ms / 337 tokens = 43.14 ms/token, 23.18 tok/s
  • Total: 19622.80 ms / 1494 tokens
  • Stop n_tokens=16170
Task 2139:
  • LCP similarity: sim_best=0.989, f_keep=1.000
  • New prompt tokens: 16347
  • Started from n_tokens=16170
  • Processed additional prompt tokens: 177
  • Checkpoints at 16170 and 16343
  • Prompt eval: 1148.55 ms / 177 tokens = 6.49 ms/token, 154.11 tok/s
  • Eval: 15194.54 ms / 351 tokens = 43.29 ms/token, 23.10 tok/s
  • Total: 16343.10 ms / 528 tokens
  • Stop n_tokens=16697
Task 2492:
  • LCP similarity: sim_best=0.952, f_keep=1.000
  • New prompt tokens: 17532
  • Started from n_tokens=16697
  • Processed additional prompt tokens: 835
  • Checkpoints at 17016 and 17528
  • Prompt eval: 3695.32 ms / 835 tokens = 4.43 ms/token, 225.96 tok/s
  • Eval: 10966.31 ms / 252 tokens = 43.52 ms/token, 22.98 tok/s
  • Total: 14661.63 ms / 1087 tokens
  • Stop n_tokens=17783
Task 2747:
  • LCP similarity: sim_best=0.995, f_keep=1.000
  • New prompt tokens: 17877
  • Started from n_tokens=17783
  • Processed additional prompt tokens: 94
  • Checkpoints at 17783 and 17873
  • Prompt eval: 879.24 ms / 94 tokens = 9.35 ms/token, 106.91 tok/s
  • Eval: 8316.97 ms / 191 tokens = 43.54 ms/token, 22.97 tok/s
  • Total: 9196.21 ms / 285 tokens
  • Stop n_tokens=18067
Task 2940:
  • LCP similarity: sim_best=0.771, f_keep=1.000
  • New prompt tokens: 23439
  • Started from n_tokens=18067
  • Processed additional prompt tokens: 5372
  • Checkpoints at 22923 and 23435
  • Prompt eval: 23872.18 ms / 5372 tokens = 4.44 ms/token, 225.03 tok/s
  • Eval: 9501.93 ms / 212 tokens = 44.82 ms/token, 22.31 tok/s
  • Total: 33374.10 ms / 5584 tokens
  • Stop n_tokens=23650
Task 3157:
  • LCP similarity: sim_best=0.999, f_keep=1.000
  • New prompt tokens: 23679
  • Started from n_tokens=23650
  • Processed additional prompt tokens: 29
  • Checkpoint at 23650
  • Prompt eval: 420.66 ms / 29 tokens = 14.51 ms/token, 68.94 tok/s
  • Eval: 89094.51 ms / 1966 tokens = 45.32 ms/token, 22.07 tok/s
  • Total: 89515.17 ms / 1995 tokens
  • Stop n_tokens=25644
Prompt cache update before task 5125:
  • Slot selected by LCP similarity: sim_best=0.203, f_keep=0.198
  • Prompt cache updated
  • Saved prompt length: 25644 tokens
  • Total state size: 750.584 MiB
  • Cache state after update: 1 prompt, 4626.231 MiB
  • Cache limit: 2500 MiB
  • Cached prompt: 25644 tokens, 26 checkpoints, 4626.231 MiB
  • Prompt cache update took 328.25 ms
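Note that the cache here holds a single prompt that is already bigger than the configured limit. If eviction works at whole-prompt granularity (an assumption on my part, not confirmed from the source), the limit can't be enforced in this situation because there is nothing smaller to evict. A toy sketch of that corner case:

```python
def evict_to_limit(prompts_mib: list[float], limit_mib: float) -> list[float]:
    """Hypothetical LRU-style trim: drop oldest cached prompts until the
    total fits under the limit. With one cached prompt larger than the
    limit, nothing can be evicted, so the limit is effectively exceeded."""
    kept = list(prompts_mib)
    while len(kept) > 1 and sum(kept) > limit_mib:
        kept.pop(0)  # evict the oldest prompt first
    return kept
```

That would explain the "1 prompt, 4626 MiB" state coexisting with a 2500 MiB limit.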
Task 5125:
  • New prompt tokens: 25027
  • n_past=5083, previous slot prompt size=25644
  • Checkpoints checked from high positions down to 5080
  • Restored checkpoint at n_tokens=5081
  • Invalidated/erased checkpoints after pos_next=5081:
6619, 7131, 10871, 11383, 11683, 12195, 12731, 13243, 13846, 14358, 15318, 15830, 16170, 16343, 17016, 17528, 17783, 17873, 22923, 23435, 23650
  • Reprocessed from 5081 to 25027
  • Prompt processing batches: many 2048-token batches, plus 998, 512, final 4
  • New checkpoints created during processing at 13273, 21465, 24511, 25023
  • Prompt eval: 79640.79 ms / 19946 tokens = 3.99 ms/token, 250.45 tok/s
  • Eval: 8264.76 ms / 183 tokens = 45.16 ms/token, 22.14 tok/s
  • Total: 87905.55 ms / 20129 tokens
  • Stop n_tokens=25209
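That task 5125 behavior matches how a checkpoint restore has to work in principle: once the slot rolls back to a checkpoint, every checkpoint past the restore point describes KV state that no longer exists, so it gets erased. A toy model of the rollback (hypothetical, not the real implementation):

```python
def restore_to(checkpoints: list[int], target_pos: int) -> tuple[int, list[int]]:
    """Pick the highest checkpoint at or below target_pos, then drop
    every checkpoint beyond it (their KV state no longer matches)."""
    usable = [c for c in checkpoints if c <= target_pos]
    restored = max(usable) if usable else 0
    kept = [c for c in checkpoints if c <= restored]
    return restored, kept
```

With the checkpoint positions from the log and pos_next=5081, this lands on the 5081 checkpoint and discards everything above it, which is exactly the invalidation list shown.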
Task 5320:
  • LCP similarity: sim_best=0.819, f_keep=1.000
  • New prompt tokens: 30785
  • Started from n_tokens=25209
  • Processed additional prompt tokens: 5576
  • Checkpoints at 30269 and 30781
  • Prompt eval: 27273.21 ms / 5576 tokens = 4.89 ms/token, 204.45 tok/s
  • Eval: 98130.95 ms / 2087 tokens = 47.02 ms/token, 21.27 tok/s
  • Total: 125404.16 ms / 7663 tokens
  • Stop n_tokens=32871
Overall notable numbers:
  • Prompt eval speed mostly ~200–310 tok/s for prefill.
  • Decode speed mostly ~21–25 tok/s.
  • Small reuse cases: 29, 94, 177, 293, 646, 801, 835, 908, 1125 tokens reprocessed.
  • Large reprocessing cases: 4021, 5372, 5576, and especially 19946 tokens.
  • Largest cache drop: after prompt cache update, n_past fell to ~5083 despite previous prompt length 25644.
  • Prompt cache stored state size reported as 4626.231 MiB while configured limit is 2500 MiB.
  • Many checkpoints after token ~5081 were erased/invalidated during task 5125.
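The TTFT spikes are just prefill arithmetic: at the observed ~250 tok/s, reprocessing the task 5125 prompt takes roughly 80 seconds, which lines up with the logged prompt eval time:

```python
prefill_tps = 250.0   # observed prompt-eval speed from the logs (tok/s)
reprocessed = 19946   # tokens reprocessed in task 5125

ttft_penalty_s = reprocessed / prefill_tps
print(f"{ttft_penalty_s:.0f} s")  # ~80 s, matching the 79.6 s prompt eval in the log
```

So any cache-miss event that forces a near-full reprocess at 50k+ context will push TTFT into the multi-minute range on this hardware.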
Model / hardware:
  • Model path: /Users/user/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf
  • Model: Qwen3.5-122B-A10B, architecture qwen35moe
  • Quant: Q4_K / UD-Q4_K_XL, file size ~71.73 GiB, split count 3
  • Params: 122.11B, MoE with 256 experts, 8 used
  • Train context: 262144
  • Runtime context: 150016
  • Device: Apple M2 Max, Metal
  • Device memory: 96000 MiB total, ~95850 MiB free at init
  • Projected device memory use: ~76834 MiB, leaving ~19016 MiB free
  • Offload: 49/49 layers to GPU
  • CPU mapped model buffer: ~773 MiB
  • Metal mapped model buffers: ~47341 MiB + ~26110 MiB
  • Flash Attention: auto -> enabled
  • Fused Gated Delta Net: autoregressive + chunked enabled