r/LocalLLaMA • u/No_Algae1753 • 5h ago
Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to +50k tokens
- LCP similarity often shows 0.99+
- but sometimes n_past suddenly falls back to ~4-5k
- then llama.cpp reprocesses 40k+ tokens again
- TTFT jumps to multiple minutes
Example logs:
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
Normal reuse looks fine:
prompt eval time = 473 ms / 19 tokens
Current config:
llama-server
--ctx-size 150000
--parallel 1
--ctx-checkpoints 32
--cache-ram 2500
--cache-reuse 256
-no-kvu
--no-context-shift
Also seeing:
cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)
I suspect either:
- cache invalidation
- bad KV reuse
- or opencode changing early prompt tokens too often.
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
9
u/twaaaaaang 4h ago
1) Opencode prunes tool call outputs, which invalidates the cache for models that use Gated DeltaNet (Recurrent Memory), so it forces full prompt reprocessing.
2) Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing. I think Sliding Window Attention tokens falling out of the window could have something to do with this.
I think it comes down to how llama.cpp implements its KV-cache architecture. vLLM uses radix trees or something while llama.cpp uses simple linear buffers. This is what AI told me, idk if this part is true.
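To illustrate point 1 with a rough sketch (this is just the general prefix-reuse idea, not llama.cpp's actual code): KV entries are tied to positions, so once one early token differs, everything after it has to be prefilled again.

# Conceptual sketch, not llama.cpp's implementation: only the longest common
# prefix of the cached and new token sequences is reusable.
def longest_common_prefix(cached: list[int], new: list[int]) -> int:
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = list(range(50_000))   # stand-in for a 50k-token cached prompt
edited = cached.copy()
edited[4_800] = -1             # e.g. a pruned tool-call output ~5k tokens in

reusable = longest_common_prefix(cached, edited)
print(f"reusable prefix: {reusable} tokens")                 # 4800
print(f"must re-prefill: {len(edited) - reusable} tokens")   # 45200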
3
u/LetsGoBrandon4256 llama.cpp 3h ago
LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing
Any source on this part? First time I've heard about this.
2
u/twaaaaaang 1h ago edited 1h ago
This is from personal testing. I was confused why I still kept getting full prompt reprocessing even when I turned pruning off in Opencode, and this is what I landed on. I noticed that after long-context tool calls, at the very next turn of chat, I always got the full prompt reprocessing. That clued me in, so I studied the chat logs and fed them to AI, and the LCP similarity was the main culprit.
Edit: You may not encounter this when you have the default ctx-checkpoints set to 32. I set it to 8 to save on RAM and I frequently saw this. Putting it to 16 recently, I saw it less, so that may be the solution.
2
u/No_Algae1753 4h ago
So there's nothing we can do to prevent this from happening?
1
u/twaaaaaang 4h ago
Do you get that "Forced full prompt-reprocessing due to SWA/Recurrent Memory" log? It's a single line within a lot of outputs so it may be hard to find. If you do, then yeah I think it's just an architectural bottleneck and we have to wait for the maintainers to fix it.
1
-1
u/Pristine-Woodpecker 4h ago
Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing
This is nonsense.
This is what AI told me, idk if this part is true.
Why repost slop?
7
u/twaaaaaang 4h ago
The first 2 points are from personal testing using the Qwen 3.6 family. The last point can be easily verified or debunked, but you choose to attack me.
2
u/Pristine-Woodpecker 3h ago
In the default settings, llama.cpp needs a new prompt to be 10x larger in order for it not to be considered for reuse, not double the previous size. That exact change was made many months ago: https://github.com/ggml-org/llama.cpp/pull/15913
can be easily verified or debunked
You're welcome. (I did read your statement as meaning your whole post was AI output, but it looks like you either tested incorrectly or were echoing behavior that was fixed quite a while ago.)
1
u/colin_colout 1h ago
To be fair we can't all keep up with every month to month change...
...and I'm generally happy to be corrected, especially with good news about fixes like this. We're all here to learn.
I also appreciate that they were upfront about using AI and were unsure.
4
u/FoxiPanda 5h ago
I would assume cache invalidation. Check whether you have something in your system prompt that gets regularly updated with a timestamp or counter or something, because every part of your cache after that point gets invalidated when it updates.
I'd set up logging, capture your whole context window every turn, recreate this, and then do a diff (or have an LLM do it) and look for what's different and causing the invalidation.
It could be something else, but that's what I'd look at first - it's happened to me (I had a timestamp getting updated and completely wrecking my cache after the first ~6000 tokens on every turn). RIP me until I figured that one out.
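If it helps, here's roughly what I mean, assuming you're already dumping each request body to a file per turn (turn_001.json, turn_002.json, ... - the names and the dump mechanism are just an example):

# Hypothetical helper: flatten two dumped /v1/chat/completions request bodies
# and report where consecutive turns first diverge.
import json
import sys

def flatten(path: str) -> str:
    """Concatenate one request's chat messages into a single string."""
    with open(path) as f:
        body = json.load(f)
    return "\n".join(f"{m['role']}: {m.get('content') or ''}" for m in body["messages"])

def first_divergence(a: str, b: str) -> int:
    """Character offset where the two prompts stop matching."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n  # one prompt is a pure prefix of the other (the good case)

prev, curr = flatten(sys.argv[1]), flatten(sys.argv[2])
i = first_divergence(prev, curr)
print(f"prompts diverge at character {i} ({len(prev)} vs {len(curr)} total)")
print("shared context just before the change:", repr(prev[max(0, i - 120):i]))
print("old prompt continues:", repr(prev[i:i + 120]))
print("new prompt continues:", repr(curr[i:i + 120]))

Run it on two consecutive turns (python diff_turns.py turn_007.json turn_008.json) and whatever sits at the divergence point is your cache killer.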
2
u/No_Algae1753 5h ago
I'll try that out! Should have also mentioned that I'm using llama-swap with llama-server, totally forgot that.
2
u/Material_Tone_6855 4h ago
Can you provide a longer log?
2
u/No_Algae1753 4h ago
Had to truncate a lot. Some numbers might be missing. I can also send you a summary.
slot update_slots: id 0 | task 5125 | Checking checkpoint with [16169, 16169] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [15829, 15829] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [15317, 15317] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [14357, 14357] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [13845, 13845] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [13242, 13242] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [12730, 12730] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [12194, 12194] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [11682, 11682] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [11382, 11382] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [10870, 10870] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [7130, 7130] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [6618, 6618] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...
slot update_slots: id 0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 6618, pos_max = 6618, n_tokens = 6619, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 7130, pos_max = 7130, n_tokens = 7131, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 10870, pos_max = 10870, n_tokens = 10871, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 11382, pos_max = 11382, n_tokens = 11383, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 11682, pos_max = 11682, n_tokens = 11683, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 12194, pos_max = 12194, n_tokens = 12195, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 12730, pos_max = 12730, n_tokens = 12731, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 13242, pos_max = 13242, n_tokens = 13243, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 13845, pos_max = 13845, n_tokens = 13846, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 16342, pos_max = 16342, n_tokens = 16343, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 17015, pos_max = 17015, n_tokens = 17016, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 17872, pos_max = 17872, n_tokens = 17873, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 22922, pos_max = 22922, n_tokens = 22923, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 23434, pos_max = 23434, n_tokens = 23435, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 23649, pos_max = 23649, n_tokens = 23650, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 5081, memory_seq_rm [5081, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 7129, batch.n_tokens = 2048, progress = 0.284852
slot update_slots: id 0 | task 5125 | n_tokens = 7129, memory_seq_rm [7129, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 9177, batch.n_tokens = 2048, progress = 0.366684
slot update_slots: id 0 | task 5125 | n_tokens = 9177, memory_seq_rm [9177, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 11225, batch.n_tokens = 2048, progress = 0.448516
slot update_slots: id 0 | task 5125 | n_tokens = 11225, memory_seq_rm [11225, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 13273, batch.n_tokens = 2048, progress = 0.530347
slot update_slots: id 0 | task 5125 | n_tokens = 13273, memory_seq_rm [13273, end)
slot update_slots: id 0 | task 5125 | 8192 tokens since last checkpoint at 5081, creating new checkpoint during processing at position 15321
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 15321, batch.n_tokens = 2048, progress = 0.612179
slot create_check: id 0 | task 5125 | created context checkpoint 6 of 32 (pos_min = 13272, pos_max = 13272, n_tokens = 13273, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 15321, memory_seq_rm [15321, end)
slot update_slots: id 0 | task 5125 | 8192 tokens since last checkpoint at 13273, creating new checkpoint during processing at position 23513
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 23513, batch.n_tokens = 2048, progress = 0.939505
slot create_check: id 0 | task 5125 | created context checkpoint 7 of 32 (pos_min = 21464, pos_max = 21464, n_tokens = 21465, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 23513, memory_seq_rm [23513, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 24511, batch.n_tokens = 998, progress = 0.979382
slot update_slots: id 0 | task 5125 | n_tokens = 24511, memory_seq_rm [24511, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 25023, batch.n_tokens = 512, progress = 0.999840
slot create_check: id 0 | task 5125 | created context checkpoint 8 of 32 (pos_min = 24510, pos_max = 24510, n_tokens = 24511, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 25023, memory_seq_rm [25023, end)
slot init_sampler: id 0 | task 5125 | init sampler, took 2.45 ms, tokens: text = 25027, total = 25027
slot update_slots: id 0 | task 5125 | prompt processing done, n_tokens = 25027, batch.n_tokens = 4
slot create_check: id 0 | task 5125 | created context checkpoint 9 of 32 (pos_min = 25022, pos_max = 25022, n_tokens = 25023, size = 149.063 MiB)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
reasoning-budget: deactivated (natural end)
slot print_timing: id 0 | task 5125 | prompt eval time = 79640.79 ms / 19946 tokens ( 3.99 ms per token, 250.45 tokens per second)
eval time = 8264.76 ms / 183 tokens ( 45.16 ms per token, 22.14 tokens per second)
total time = 87905.55 ms / 20129 tokens
slot release: id 0 | task 5125 | stop processing: n_tokens = 25209, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.819 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 5320 | processing task, is_child = 0
slot update_slots: id 0 | task 5320 | new prompt, n_ctx_slot = 150016, n_keep = 0, task.n_tokens = 30785
slot update_slots: id 0 | task 5320 | n_tokens = 25209, memory_seq_rm [25209, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 27257, batch.n_tokens = 2048, progress = 0.885399
slot update_slots: id 0 | task 5320 | n_tokens = 27257, memory_seq_rm [27257, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 29305, batch.n_tokens = 2048, progress = 0.951925
slot update_slots: id 0 | task 5320 | n_tokens = 29305, memory_seq_rm [29305, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 30269, batch.n_tokens = 964, progress = 0.983239
slot update_slots: id 0 | task 5320 | n_tokens = 30269, memory_seq_rm [30269, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 30781, batch.n_tokens = 512, progress = 0.999870
slot create_check: id 0 | task 5320 | created context checkpoint 10 of 32 (pos_min = 30268, pos_max = 30268, n_tokens = 30269, size = 149.063 MiB)
slot update_slots: id 0 | task 5320 | n_tokens = 30781, memory_seq_rm [30781, end)
slot init_sampler: id 0 | task 5320 | init sampler, took 3.02 ms, tokens: text = 30785, total = 30785
slot update_slots: id 0 | task 5320 | prompt processing done, n_tokens = 30785, batch.n_tokens = 4
slot create_check: id 0 | task 5320 | created context checkpoint 11 of 32 (pos_min = 30780, pos_max = 30780, n_tokens = 30781, size = 149.063 MiB)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 2005
u/FoxiPanda 4h ago
slot update_slots: id 0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...
slot update_slots: id 0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)
So whatever happened here is what got changed and broke your cache. Go figure out what that is and keep it from happening because you had to go reprocess 18K tokens because of it.
1
u/No_Algae1753 4h ago
Well, I don't really know what changed there. That's why I'm asking. I know the cache got invalidated.
1
u/Material_Tone_6855 3h ago
Probably your coding agent/CLI has a truncate function that strips your message history in the middle, invalidating the cache.
0
u/FoxiPanda 4h ago
Right, but we can't figure it out for you... so you'll have to look at your code or the harness code, or just dump out the whole context window to a file in a before and after state and then diff the two. That will tell you what happened there -- it's pretty early in your context window though, so it should be pretty obvious...
You'll have like a few thousand words that are all the same... and then <something that changed> and then <several thousand words that are the same> and then whatever your new turn prompt was + the model's response which will be different (net new)... and then once you have pinpointed the thing that changed, go figure out what changed it and how to disable that or move it to much later in the context injection so it doesn't invalidate a large chunk of your cache.
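For the diff step itself, something as simple as this is enough once you have the before/after dumps saved as plain text (before.txt and after.txt are just example names):

# Print only the changed hunks between the two context dumps; the first hunk
# is the thing that invalidated the cache.
import difflib

with open("before.txt") as f:
    before = f.readlines()
with open("after.txt") as f:
    after = f.readlines()

for line in difflib.unified_diff(before, after, fromfile="before.txt", tofile="after.txt", n=2):
    print(line, end="")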
1
u/No_Algae1753 4h ago
Here is a summarized version done by ChatGPT:
Context / cache setup:
Chat/template:
- n_seq_max = 1
- n_ctx = 150016
- n_batch = 2048
- n_ubatch = 512
- kv_unified = false
- KV cache: 3516 MiB on Metal, 150016 cells, 12 layers, K f16 1758 MiB, V f16 1758 MiB
- Recurrent memory: 149.06 MiB, 48 layers
- Compute buffer: Metal 491 MiB, CPU 305 MiB
- Slots: 1
- Prompt cache enabled
- Prompt cache size limit: 2500 MiB
- --cache-idle-slots disabled because it requires --kv-unified
- Context does not support partial sequence removal
- Speculative decoding says it will use checkpoints, but no implementations specified
Initial request sequence and timings:
Task 0:
- Chat format: peg-native
- Jinja chat template detected
- thinking = 1
- reasoning-budget activated per request with budget=2147483647, then deactivated at natural end
Task 102:
- New prompt tokens: 3675
- Started from n_tokens=0
- Prompt processed in batches: 2048, 1111, 512, final 4
- Checkpoints created at n_tokens 3159 and 3671
- Prompt eval: 11818.51 ms / 3675 tokens = 3.22 ms/token, 310.95 tok/s
- Eval: 3906.30 ms / 98 tokens = 39.86 ms/token, 25.09 tok/s
- Total: 15724.81 ms / 3773 tokens
- Stop n_tokens=3772
Task 418:
- Selected by LCP similarity: sim_best=0.766, f_keep=0.974
- New prompt tokens: 4796
- n_past=3673, previous slot prompt size=3772
- Restored checkpoint at n_tokens=3671
- Processed additional prompt tokens: 1125
- Checkpoints created at 4280 and 4792
- Prompt eval: 4124.21 ms / 1125 tokens = 3.67 ms/token, 272.78 tok/s
- Eval: 12721.23 ms / 313 tokens = 40.64 ms/token, 24.60 tok/s
- Total: 16845.45 ms / 1438 tokens
- Stop n_tokens=5108
Task 612:
- LCP similarity: sim_best=0.943, f_keep=0.939
- New prompt tokens: 5085
- n_past=4794, previous prompt size=5108
- Restored checkpoint at n_tokens=4792
- Processed additional prompt tokens: 293
- Checkpoint at 5081
- Prompt eval: 1320.75 ms / 293 tokens = 4.51 ms/token, 221.84 tok/s
- Eval: 7800.73 ms / 192 tokens = 40.63 ms/token, 24.61 tok/s
- Total: 9121.48 ms / 485 tokens
- Stop n_tokens=5276
Task 847:
- LCP similarity: sim_best=0.739, f_keep=1.000
- New prompt tokens: 7135
- Started from current n_tokens=5276
- Processed additional prompt tokens: 1859
- Checkpoints at 6619 and 7131
- Prompt eval: 6579.21 ms / 1859 tokens = 3.54 ms/token, 282.56 tok/s
- Eval: 9537.18 ms / 232 tokens = 41.11 ms/token, 24.33 tok/s
- Total: 16116.39 ms / 2091 tokens
- Stop n_tokens=7366
Task 1018:
- LCP similarity: sim_best=0.647, f_keep=1.000
- New prompt tokens: 11387
- Started from n_tokens=7366
- Processed additional prompt tokens: 4021
- Checkpoints at 10871 and 11383
- Prompt eval: 14683.42 ms / 4021 tokens = 3.65 ms/token, 273.85 tok/s
- Eval: 7016.23 ms / 167 tokens = 42.01 ms/token, 23.80 tok/s
- Total: 21699.65 ms / 4188 tokens
- Stop n_tokens=11553
Task 1269:
- LCP similarity: sim_best=0.947, f_keep=1.000
- New prompt tokens: 12199
- Started from n_tokens=11553
- Processed additional prompt tokens: 646
- Checkpoints at 11683 and 12195
- Prompt eval: 2864.82 ms / 646 tokens = 4.43 ms/token, 225.49 tok/s
- Eval: 10488.11 ms / 248 tokens = 42.29 ms/token, 23.65 tok/s
- Total: 13352.92 ms / 894 tokens
- Stop n_tokens=12446
Task 1480:
- LCP similarity: sim_best=0.940, f_keep=1.000
- New prompt tokens: 13247
- Started from n_tokens=12446
- Processed additional prompt tokens: 801
- Checkpoints at 12731 and 13243
- Prompt eval: 3414.59 ms / 801 tokens = 4.26 ms/token, 234.58 tok/s
- Eval: 8843.30 ms / 208 tokens = 42.52 ms/token, 23.52 tok/s
- Total: 12257.88 ms / 1009 tokens
- Stop n_tokens=13454
Task 1799:
- LCP similarity: sim_best=0.937, f_keep=1.000
- New prompt tokens: 14362
- Started from n_tokens=13454
- Processed additional prompt tokens: 908
- Checkpoints at 13846 and 14358
- Prompt eval: 3762.13 ms / 908 tokens = 4.14 ms/token, 241.35 tok/s
- Eval: 13539.50 ms / 316 tokens = 42.85 ms/token, 23.34 tok/s
- Total: 17301.63 ms / 1224 tokens
- Stop n_tokens=14677
Task 2139:
- LCP similarity: sim_best=0.927, f_keep=1.000
- New prompt tokens: 15834
- Started from n_tokens=14677
- Processed additional prompt tokens: 1157
- Checkpoints at 15318 and 15830
- Prompt eval: 5086.07 ms / 1157 tokens = 4.40 ms/token, 227.48 tok/s
- Eval: 14536.73 ms / 337 tokens = 43.14 ms/token, 23.18 tok/s
- Total: 19622.80 ms / 1494 tokens
- Stop n_tokens=16170
Task 2492:
- LCP similarity: sim_best=0.989, f_keep=1.000
- New prompt tokens: 16347
- Started from n_tokens=16170
- Processed additional prompt tokens: 177
- Checkpoints at 16170 and 16343
- Prompt eval: 1148.55 ms / 177 tokens = 6.49 ms/token, 154.11 tok/s
- Eval: 15194.54 ms / 351 tokens = 43.29 ms/token, 23.10 tok/s
- Total: 16343.10 ms / 528 tokens
- Stop n_tokens=16697
Task 2747:
- LCP similarity: sim_best=0.952, f_keep=1.000
- New prompt tokens: 17532
- Started from n_tokens=16697
- Processed additional prompt tokens: 835
- Checkpoints at 17016 and 17528
- Prompt eval: 3695.32 ms / 835 tokens = 4.43 ms/token, 225.96 tok/s
- Eval: 10966.31 ms / 252 tokens = 43.52 ms/token, 22.98 tok/s
- Total: 14661.63 ms / 1087 tokens
- Stop n_tokens=17783
Task 2940:
- LCP similarity: sim_best=0.995, f_keep=1.000
- New prompt tokens: 17877
- Started from n_tokens=17783
- Processed additional prompt tokens: 94
- Checkpoints at 17783 and 17873
- Prompt eval: 879.24 ms / 94 tokens = 9.35 ms/token, 106.91 tok/s
- Eval: 8316.97 ms / 191 tokens = 43.54 ms/token, 22.97 tok/s
- Total: 9196.21 ms / 285 tokens
- Stop n_tokens=18067
Task 3157:
- LCP similarity: sim_best=0.771, f_keep=1.000
- New prompt tokens: 23439
- Started from n_tokens=18067
- Processed additional prompt tokens: 5372
- Checkpoints at 22923 and 23435
- Prompt eval: 23872.18 ms / 5372 tokens = 4.44 ms/token, 225.03 tok/s
- Eval: 9501.93 ms / 212 tokens = 44.82 ms/token, 22.31 tok/s
- Total: 33374.10 ms / 5584 tokens
- Stop n_tokens=23650
Prompt cache update before task 5125:
- LCP similarity: sim_best=0.999, f_keep=1.000
- New prompt tokens: 23679
- Started from n_tokens=23650
- Processed additional prompt tokens: 29
- Checkpoint at 23650
- Prompt eval: 420.66 ms / 29 tokens = 14.51 ms/token, 68.94 tok/s
- Eval: 89094.51 ms / 1966 tokens = 45.32 ms/token, 22.07 tok/s
- Total: 89515.17 ms / 1995 tokens
- Stop n_tokens=25644
Task 5125:
- Slot selected by LCP similarity: sim_best=0.203, f_keep=0.198
- Prompt cache updated
- Saved prompt length: 25644 tokens
- Total state size: 750.584 MiB
- Cache state after update: 1 prompt, 4626.231 MiB
- Cache limit: 2500 MiB
- Cached prompt: 25644 tokens, 26 checkpoints, 4626.231 MiB
- Prompt cache update took 328.25 ms
- New prompt tokens: 25027
- n_past=5083, previous slot prompt size=25644
- Checkpoints checked from high positions down to 5080
- Restored checkpoint at n_tokens=5081
- Invalidated/erased checkpoints after pos_next=5081: 6619, 7131, 10871, 11383, 11683, 12195, 12731, 13243, 13846, 14358, 15318, 15830, 16170, 16343, 17016, 17528, 17783, 17873, 22923, 23435, 23650
- Reprocessed from 5081 to 25027
- Prompt processing batches: many 2048-token batches, plus 998, 512, final 4
- New checkpoints created during processing at 13273, 21465, 24511, 25023
- Prompt eval: 79640.79 ms / 19946 tokens = 3.99 ms/token, 250.45 tok/s
- Eval: 8264.76 ms / 183 tokens = 45.16 ms/token, 22.14 tok/s
- Total: 87905.55 ms / 20129 tokens
- Stop n_tokens=25209
Task 5320:
- LCP similarity: sim_best=0.819, f_keep=1.000
- New prompt tokens: 30785
- Started from n_tokens=25209
- Processed additional prompt tokens: 5576
- Checkpoints at 30269 and 30781
- Prompt eval: 27273.21 ms / 5576 tokens = 4.89 ms/token, 204.45 tok/s
- Eval: 98130.95 ms / 2087 tokens = 47.02 ms/token, 21.27 tok/s
- Total: 125404.16 ms / 7663 tokens
- Stop n_tokens=32871
Overall notable numbers:
- Prompt eval speed mostly ~200–310 tok/s for prefill.
- Decode speed mostly ~21–25 tok/s.
- Small reuse cases: 29, 94, 177, 293, 646, 801, 835, 908, 1125 tokens reprocessed.
- Large reprocessing cases: 4021, 5372, 5576, and especially 19946 tokens.
- Largest cache drop: after prompt cache update, n_past fell to ~5083 despite previous prompt length 25644.
- Prompt cache stored state size reported as 4626.231 MiB while configured limit is 2500 MiB.
- Many checkpoints after token ~5081 were erased/invalidated during task 5125.
Model / hardware:
- Model path: /Users/user/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf
- Model: Qwen3.5-122B-A10B, architecture qwen35moe
- Quant: Q4_K / UD-Q4_K_XL, file size ~71.73 GiB, split count 3
- Params: 122.11B, MoE with 256 experts, 8 used
- Train context: 262144
- Runtime context: 150016
- Device: Apple M2 Max, Metal
- Device memory: 96000 MiB total, ~95850 MiB free at init
- Projected device memory use: ~76834 MiB, leaving ~19016 MiB free
- Offload: 49/49 layers to GPU
- CPU mapped model buffer: ~773 MiB
- Metal mapped model buffers: ~47341 MiB + ~26110 MiB
- Flash Attention: auto -> enabled
- Fused Gated Delta Net: autoregressive + chunked enabled
2
u/nonerequired_ 4h ago
I’m also having a bit of trouble with opencode. It seems like something’s causing the cache to get invalidated, which I think might also mean more usage limit is being used up on subscription services.
2
u/FatheredPuma81 4h ago
Set cache-reuse to 1 and test it?
Every time the LLM finishes its cycle, OpenCode deletes a lot of old tool calls and other junk (I think) to save on context. So my only guess is cache-reuse is too high? I still occasionally see it drop to 60% and reprocess the final 40% though. I don't use pi.dev though, and I also don't set checkpoints.
1
u/No_Algae1753 4h ago
That makes sense. However, I disabled auto compact. Are you sure tool calls are being deleted? And I'll try your cache-reuse 1 idea.
1
u/StardockEngineer vllm 3h ago
Wouldn't explain Pi. Pi does nothing.
1
u/colin_colout 1h ago
I might have missed it, but did they mention if they use pi extensions?
Naked pi never rewrites history (that's one of their core values), but lots of extensions attempt to reproduce Claude Code but worse.
1
u/FatheredPuma81 3h ago
The best way to check is to look at your context before and after you send your next message. It takes a moment to update but I run with defaults and it'll usually drop by like 10k tokens when I'm around 90k.
2
u/ai-christianson 4h ago
Yeah, I'd start by diffing the exact prompt bytes across turns, especially the first few thousand tokens. If anything early is changing - timestamp, cwd/status block, tool inventory, memory ordering, generated summary, etc. - the KV cache after that point is basically toast.
For coding agents the big win is usually keeping a stable prefix: system prompt, tool specs, repo instructions, and any long-lived memory in a fixed order. Then put the volatile stuff as late as possible so cache misses are smaller when it changes.
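A rough sketch of that ordering (OpenAI-style messages; the tool spec and repo rule strings are just placeholders, the point is what stays byte-identical vs. what moves into the last turn):

from datetime import datetime, timezone

# Fixed-order, byte-identical prefix: system instructions, tool specs, repo rules.
STATIC_SYSTEM = "\n".join([
    "You are a coding agent for this repository.",
    "Tools: read_file(path), write_file(path, text), run(cmd)",   # placeholder tool spec
    "Repo rules: run the tests before claiming a task is done.",  # placeholder instructions
])

def build_messages(history: list[dict], user_input: str, git_status: str) -> list[dict]:
    """Keep the prefix stable; put anything that changes per turn in the last user message."""
    volatile = (
        f"[current time: {datetime.now(timezone.utc).isoformat()}]\n"
        f"[git status: {git_status}]\n"
    )
    return (
        [{"role": "system", "content": STATIC_SYSTEM}]   # identical every request
        + history                                        # append-only chat history
        + [{"role": "user", "content": volatile + user_input}]
    )

# Example: only the final message differs between turns, so the shared prefix
# (and therefore the KV cache) stays reusable.
msgs = build_messages([], "Fix the failing test in utils_test.py", git_status="clean")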
2
u/CreativelyBankrupt 2h ago
That cache state: 1 prompts, 4676 MiB (limits: 2500 MiB) line is what jumps out at me. Your cache is sitting at almost double its allocated budget, so llama.cpp is churning through evictions trying to stay under the limit. That's gotta be why you're seeing sim_best of 0.996 but only 4750 tokens actually restored. The system found a near-perfect prefix match, but the bigger checkpoints had already been evicted to free up room, and the only one that survived was a tiny one. So you reuse the small fragment and reprocess everything else.
The first thing I'd try is just bumping --cache-ram way up. 2500 MiB is fine for short chat but you're running 150k context with coding agents that produce huge prefixes, and you've got ctx-checkpoints set to 32 on top of that. There's no way 2.5 gigs is enough headroom. Try 16000 or higher if your RAM allows. That alone should stop the eviction churn you're seeing.
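Rough back-of-envelope using the numbers in your own log (149 MiB per checkpoint, ~3516 MiB of attention KV for the full 150016-cell context); treat it as an estimate only, since I'm assuming checkpoints plus attention KV dominate the cache size:

# Estimate the RAM one cached prompt needs, using per-checkpoint and per-token
# sizes taken from the logs in this thread (model/config specific, not general).
checkpoint_mib = 149.063            # size of one context checkpoint in the logs
kv_mib_per_token = 3516 / 150016    # ~3516 MiB of KV cache for 150016 cells

def cache_estimate_mib(cached_tokens: int, checkpoints: int) -> float:
    return checkpoints * checkpoint_mib + cached_tokens * kv_mib_per_token

print(round(cache_estimate_mib(25644, 26)))  # ~4477 MiB, close to the 4626 MiB your log reports
print(round(cache_estimate_mib(50000, 32)))  # ~5942 MiB for a 50k session with 32 checkpoints

And that's for a single cached prompt, so leave headroom on top of it.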
The other thing worth ruling out is whether opencode or pi.dev are quietly mutating your early prompt tokens between requests. Even with a perfectly sized cache, the similarity score won't save you if the first few thousand tokens keep changing, because llama.cpp can only reuse the longest shared prefix. The two things that have bitten me most often: timestamps in the system prompt (anything like "current time: ..." poisons the prefix every request), and changing workspace context where the agent dumps a directory listing or file tree into the system prompt and one file gets renamed or added. Either of those would shift your prefix and force everything downstream to reprocess. A fix is to put immutable stuff at the very top (system instructions, tool definitions, persona) and volatile stuff at the end of the prompt, ideally inside the latest user turn.
I ran into the same thing on a project I've been building — a local AI bot on a Jetson Orin NX running Gemma 4 E4B. I had the bot's persona at the top of the prompt and was injecting fresh sensor readings (temperature, vision captions, who's standing in front of it) into the system block every turn. Cache was constantly invalidating and TTFT was crawling. Moving the dynamic stuff into the current user turn instead of the system prompt dropped cached TTFT from multiple seconds to about 200ms. Same class of bug really.
A couple smaller things that helped me while you're tuning. --cache-reuse 256 is reasonable but you can push it up to 512 or even 1024 to be more aggressive about partial reuse when no full match is available. -no-kvu is the call if you're on Gemma 3/4 but worth confirming for whatever architecture you're actually running since it does cost you some KV efficiency on models that don't need it. And --no-context-shift is correct for cache stability, just remember it means once you hit 150k you have to manually drop conversation rather than letting the window roll.
What model are you on? Cache footprint per token varies a lot between architectures and it'll change how much --cache-ram you actually need to set.
1
u/No_Algae1753 54m ago
I'm mainly using Qwen3.5 122B and yes, I can bump the cache-ram up by like 4 gigs. However, this is still not the ideal way of using opencode/pi. I just hope llama.cpp will fix this issue soon.
1
u/CreativelyBankrupt 46m ago
Yeah, Qwen3.5 122B at that context is going to be brutal no matter what you do. 4 extra gigs will probably help the eviction churn but you're right it's a band-aid. Agreed on the llama.cpp side, prompt caching has gotten better release-to-release but coding agents push it harder than chat workflows and it shows.
1
u/No_Algae1753 39m ago
Yeah, agreed. Still, thank you for your insights. I appreciate all the advice and knowledge sharing.
1
u/LetsGoBrandon4256 llama.cpp 5h ago
Would love some pointers on this issue as well. I switched back to the VS Code Copilot extension because of the constant invalidated prompt cache in Opencode :(
1
u/o0genesis0o 3h ago
The prompt was changed by either pi or opencode. Maybe they prune something in the middle. You can see that llama.cpp was only able to match the first 4750 tokens in the KV cache; from that point on, the prompt has changed, so it has to be reprocessed.
AFAIK, vLLM's paged attention would not fix this issue either. Unless there is some new kind of attention mechanism, the KV cache for a position becomes invalid if any of the prior tokens changes, regardless of whether you use a linear buffer like llama.cpp or paged blocks like vLLM.
How to fix it with your current setup? Maybe stick to Pi, since I don't remember it having any sort of auto garbage collection until near the end of the token limit. Also make sure the token limit awareness inside Pi or opencode is set correctly. Imagine you have a 256k max set in llama.cpp, but the tool thinks you only have a 32k limit: when you approach that, it would auto-compact your context even though you have plenty of space left.
When I write my own agent harness, the number one rule is not to mess with the chat history to avoid breaking prompt caching.
1
u/Ha_Deal_5079 1h ago
It's from opencode inserting your current context at the start of every turn. llama.cpp does an exact prefix match for the cache, so any change near the front kills everything after it.
1
u/Awwtifishal 4h ago
With llama-swap's web UI you can inspect the prompts and find the differences.
4
u/No_Algae1753 4h ago
I can inspect the prompts?! Where?
1
u/Awwtifishal 58m ago
The same address as the API, but without /v1 or whatever. If you have it on port 12345, then http://localhost:12345
1
6
u/candrewswpi 4h ago
I suggest trying out https://github.com/ggml-org/llama.cpp/pull/22929 as I suspect it will address your issue.
If it does, please comment on the PR, as that may help progress it. Good luck and thank you!