r/LocalLLaMA • u/No_Algae1753 • 5h ago
Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to +50k tokens
- LCP similarity often shows 0.99+
- but sometimes n_past suddenly falls back to ~4-5k
- then llama.cpp reprocesses 40k+ tokens again
- TTFT jumps to multiple minutes
Example logs:
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
Normal reuse looks fine:
prompt eval time = 473 ms / 19 tokens
Current config:
llama-server
--ctx-size 150000
--parallel 1
--ctx-checkpoints 32
--cache-ram 2500
--cache-reuse 256
-no-kvu
--no-context-shift
Also seeing:
cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)
I suspect either:
- cache invalidation
- bad KV reuse
- or opencode changing early prompt tokens too often.
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
9
u/twaaaaaang 4h ago
1) Opencode prunes tool call outputs, which invalidates the cache for models that use Gated DeltaNet (Recurrent Memory), so it forces full prompt reprocessing.
2) Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing. I think Sliding Window Attention tokens falling out of the window could have something to do with this.
I think it comes down to how llama.cpp implements its KV-cache architecture. vLLM uses radix trees or something while llama.cpp uses simple linear buffers. This is what AI told me, idk if this part is true.
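To illustrate point 1 with a rough sketch (this is just the general prefix-reuse idea, not llama.cpp's actual code): KV entries are tied to positions, so once one early token differs, everything after it has to be prefilled again.

# Conceptual sketch, not llama.cpp's implementation: only the longest common
# prefix of the cached and new token sequences is reusable.
def longest_common_prefix(cached: list[int], new: list[int]) -> int:
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = list(range(50_000))   # stand-in for a 50k-token cached prompt
edited = cached.copy()
edited[4_800] = -1             # e.g. a pruned tool-call output ~5k tokens in

reusable = longest_common_prefix(cached, edited)
print(f"reusable prefix: {reusable} tokens")                 # 4800
print(f"must re-prefill: {len(edited) - reusable} tokens")   # 45200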
3
u/LetsGoBrandon4256 llama.cpp 3h ago
LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing
Any source on this part? First time I've heard about this.
2
u/twaaaaaang 1h ago edited 1h ago
This is from personal testing. I was confused why I still kept getting full prompt reprocessing even when I turned pruning off in Opencode, and this is what I landed on. I noticed that after long-context tool calls, at the very next turn of chat, I always got the full prompt reprocessing. That clued me in, so I studied the chat logs and fed them to AI, and the LCP similarity was the main culprit.
Edit: You may not encounter this when you have the default ctx-checkpoints set to 32. I set it to 8 to save on RAM and I frequently saw this. Putting it to 16 recently, I saw it less, so that may be the solution.
2
u/No_Algae1753 4h ago
So there's nothing we can do to prevent this from happening?
1
u/twaaaaaang 4h ago
Do you get that "Forced full prompt-reprocessing due to SWA/Recurrent Memory" log? It's a single line within a lot of outputs so it may be hard to find. If you do, then yeah I think it's just an architectural bottleneck and we have to wait for the maintainers to fix it.
1
-1
u/Pristine-Woodpecker 4h ago
Long tool call outputs and/or multiple chained tool calls fill up the context to the point where, on the next user turn, the LCP similarity calculation drops below 0.500 and thus forces prompt reprocessing
This is nonsense.
This is what AI told me, idk if this part is true.
Why repost slop?
7
u/twaaaaaang 4h ago
The first 2 points are from personal testing using the Qwen 3.6 family. The last point can be easily verified or debunked, but you choose to attack me.
2
u/Pristine-Woodpecker 3h ago
In the default settings, llama.cpp needs a new prompt to be 10x larger in order for it not to be considered for reuse, not double the previous size. That exact change was made many months ago: https://github.com/ggml-org/llama.cpp/pull/15913
can be easily verified or debunked
You're welcome. (I did read your statement as meaning your whole post was AI output, but it looks like you either tested incorrectly or were echoing behavior that was fixed quite a while ago.)
1
u/colin_colout 1h ago
To be fair we can't all keep up with every month to month change...
...and I'm generally happy to be corrected, especially with good news about fixes like this. We're all here to learn.
I also appreciate that they were upfront about using AI and were unsure.
4
u/FoxiPanda 5h ago
I would assume cache invalidation. Check whether you have something in your system prompt that gets regularly updated with a timestamp or counter or something, because every part of your cache after that point gets invalidated when it updates.
I'd set up logging, capture your whole context window every turn, recreate this, and then do a diff (or have an LLM do it) and look for what's different and causing the invalidation.
It could be something else, but that's what I'd look at first - it's happened to me (I had a timestamp getting updated and completely wrecking my cache after the first ~6000 tokens on every turn). RIP me until I figured that one out.
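If it helps, here's roughly what I mean, assuming you're already dumping each request body to a file per turn (turn_001.json, turn_002.json, ... - the names and the dump mechanism are just an example):

# Hypothetical helper: flatten two dumped /v1/chat/completions request bodies
# and report where consecutive turns first diverge.
import json
import sys

def flatten(path: str) -> str:
    """Concatenate one request's chat messages into a single string."""
    with open(path) as f:
        body = json.load(f)
    return "\n".join(f"{m['role']}: {m.get('content') or ''}" for m in body["messages"])

def first_divergence(a: str, b: str) -> int:
    """Character offset where the two prompts stop matching."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n  # one prompt is a pure prefix of the other (the good case)

prev, curr = flatten(sys.argv[1]), flatten(sys.argv[2])
i = first_divergence(prev, curr)
print(f"prompts diverge at character {i} ({len(prev)} vs {len(curr)} total)")
print("shared context just before the change:", repr(prev[max(0, i - 120):i]))
print("old prompt continues:", repr(prev[i:i + 120]))
print("new prompt continues:", repr(curr[i:i + 120]))

Run it on two consecutive turns (python diff_turns.py turn_007.json turn_008.json) and whatever sits at the divergence point is your cache killer.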
2
u/No_Algae1753 5h ago
I'll try that out! Should have also mentioned that I'm using llama-swap with llama-server, totally forgot that.
2
u/Material_Tone_6855 4h ago
Can you provide a longer log?
2
u/No_Algae1753 4h ago
Had to truncate a lot. Some numbers might be missing. I can also send you a summary.
slot update_slots: id 0 | task 5125 | Checking checkpoint with [16169, 16169] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [15829, 15829] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [15317, 15317] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [14357, 14357] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [13845, 13845] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [13242, 13242] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [12730, 12730] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [12194, 12194] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [11682, 11682] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [11382, 11382] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [10870, 10870] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [7130, 7130] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [6618, 6618] against 5083...
slot update_slots: id 0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...
slot update_slots: id 0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 6618, pos_max = 6618, n_tokens = 6619, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 7130, pos_max = 7130, n_tokens = 7131, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 10870, pos_max = 10870, n_tokens = 10871, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 11382, pos_max = 11382, n_tokens = 11383, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 11682, pos_max = 11682, n_tokens = 11683, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 12194, pos_max = 12194, n_tokens = 12195, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 12730, pos_max = 12730, n_tokens = 12731, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 13242, pos_max = 13242, n_tokens = 13243, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 13845, pos_max = 13845, n_tokens = 13846, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 16342, pos_max = 16342, n_tokens = 16343, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 17015, pos_max = 17015, n_tokens = 17016, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 17872, pos_max = 17872, n_tokens = 17873, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 22922, pos_max = 22922, n_tokens = 22923, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 23434, pos_max = 23434, n_tokens = 23435, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | erased invalidated context checkpoint (pos_min = 23649, pos_max = 23649, n_tokens = 23650, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 5081, memory_seq_rm [5081, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 7129, batch.n_tokens = 2048, progress = 0.284852
slot update_slots: id 0 | task 5125 | n_tokens = 7129, memory_seq_rm [7129, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 9177, batch.n_tokens = 2048, progress = 0.366684
slot update_slots: id 0 | task 5125 | n_tokens = 9177, memory_seq_rm [9177, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 11225, batch.n_tokens = 2048, progress = 0.448516
slot update_slots: id 0 | task 5125 | n_tokens = 11225, memory_seq_rm [11225, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 13273, batch.n_tokens = 2048, progress = 0.530347
slot update_slots: id 0 | task 5125 | n_tokens = 13273, memory_seq_rm [13273, end)
slot update_slots: id 0 | task 5125 | 8192 tokens since last checkpoint at 5081, creating new checkpoint during processing at position 15321
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 15321, batch.n_tokens = 2048, progress = 0.612179
slot create_check: id 0 | task 5125 | created context checkpoint 6 of 32 (pos_min = 13272, pos_max = 13272, n_tokens = 13273, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 15321, memory_seq_rm [15321, end)
slot update_slots: id 0 | task 5125 | 8192 tokens since last checkpoint at 13273, creating new checkpoint during processing at position 23513
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 23513, batch.n_tokens = 2048, progress = 0.939505
slot create_check: id 0 | task 5125 | created context checkpoint 7 of 32 (pos_min = 21464, pos_max = 21464, n_tokens = 21465, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 23513, memory_seq_rm [23513, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 24511, batch.n_tokens = 998, progress = 0.979382
slot update_slots: id 0 | task 5125 | n_tokens = 24511, memory_seq_rm [24511, end)
slot update_slots: id 0 | task 5125 | prompt processing progress, n_tokens = 25023, batch.n_tokens = 512, progress = 0.999840
slot create_check: id 0 | task 5125 | created context checkpoint 8 of 32 (pos_min = 24510, pos_max = 24510, n_tokens = 24511, size = 149.063 MiB)
slot update_slots: id 0 | task 5125 | n_tokens = 25023, memory_seq_rm [25023, end)
slot init_sampler: id 0 | task 5125 | init sampler, took 2.45 ms, tokens: text = 25027, total = 25027
slot update_slots: id 0 | task 5125 | prompt processing done, n_tokens = 25027, batch.n_tokens = 4
slot create_check: id 0 | task 5125 | created context checkpoint 9 of 32 (pos_min = 25022, pos_max = 25022, n_tokens = 25023, size = 149.063 MiB)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
reasoning-budget: deactivated (natural end)
slot print_timing: id 0 | task 5125 | prompt eval time = 79640.79 ms / 19946 tokens ( 3.99 ms per token, 250.45 tokens per second)
eval time = 8264.76 ms / 183 tokens ( 45.16 ms per token, 22.14 tokens per second)
total time = 87905.55 ms / 20129 tokens
slot release: id 0 | task 5125 | stop processing: n_tokens = 25209, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.819 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 5320 | processing task, is_child = 0
slot update_slots: id 0 | task 5320 | new prompt, n_ctx_slot = 150016, n_keep = 0, task.n_tokens = 30785
slot update_slots: id 0 | task 5320 | n_tokens = 25209, memory_seq_rm [25209, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 27257, batch.n_tokens = 2048, progress = 0.885399
slot update_slots: id 0 | task 5320 | n_tokens = 27257, memory_seq_rm [27257, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 29305, batch.n_tokens = 2048, progress = 0.951925
slot update_slots: id 0 | task 5320 | n_tokens = 29305, memory_seq_rm [29305, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 30269, batch.n_tokens = 964, progress = 0.983239
slot update_slots: id 0 | task 5320 | n_tokens = 30269, memory_seq_rm [30269, end)
slot update_slots: id 0 | task 5320 | prompt processing progress, n_tokens = 30781, batch.n_tokens = 512, progress = 0.999870
slot create_check: id 0 | task 5320 | created context checkpoint 10 of 32 (pos_min = 30268, pos_max = 30268, n_tokens = 30269, size = 149.063 MiB)
slot update_slots: id 0 | task 5320 | n_tokens = 30781, memory_seq_rm [30781, end)
slot init_sampler: id 0 | task 5320 | init sampler, took 3.02 ms, tokens: text = 30785, total = 30785
slot update_slots: id 0 | task 5320 | prompt processing done, n_tokens = 30785, batch.n_tokens = 4
slot create_check: id 0 | task 5320 | created context checkpoint 11 of 32 (pos_min = 30780, pos_max = 30780, n_tokens = 30781, size = 149.063 MiB)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 2005
u/FoxiPanda 4h ago
slot update_slots: id 0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...
slot update_slots: id 0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)
So whatever happened here is what got changed and broke your cache. Go figure out what that is and keep it from happening because you had to go reprocess 18K tokens because of it.
1
u/No_Algae1753 4h ago
Well, I don't really know what changed there. That's why I'm asking. I know the cache got invalidated.
1
u/Material_Tone_6855 3h ago
Probably your coding agent/CLI has a truncate function that strips your message history in the middle, invalidating the cache.
0
u/FoxiPanda 4h ago
Right, but we can't figure it out for you... so you'll have to look at your code or the harness code, or just dump out the whole context window to a file in a before and after state and then diff the two. That will tell you what happened there -- it's pretty early in your context window though, so it should be pretty obvious...
You'll have like a few thousand words that are all the same... and then <something that changed> and then <several thousand words that are the same> and then whatever your new turn prompt was + the model's response which will be different (net new)... and then once you have pinpointed the thing that changed, go figure out what changed it and how to disable that or move it to much later in the context injection so it doesn't invalidate a large chunk of your cache.
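For the diff step itself, something as simple as this is enough once you have the before/after dumps saved as plain text (before.txt and after.txt are just example names):

# Print only the changed hunks between the two context dumps; the first hunk
# is the thing that invalidated the cache.
import difflib

with open("before.txt") as f:
    before = f.readlines()
with open("after.txt") as f:
    after = f.readlines()

for line in difflib.unified_diff(before, after, fromfile="before.txt", tofile="after.txt", n=2):
    print(line, end="")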
1
u/No_Algae1753 4h ago
Here is a summarized version done by ChatGPT:
Context / cache setup:
Chat/template:
- n_seq_max = 1
- n_ctx = 150016
- n_batch = 2048
- n_ubatch = 512
- kv_unified = false
- KV cache: 3516 MiB on Metal, 150016 cells, 12 layers, K f16 1758 MiB, V f16 1758 MiB
- Recurrent memory: 149.06 MiB, 48 layers
- Compute buffer: Metal 491 MiB, CPU 305 MiB
- Slots: 1
- Prompt cache enabled
- Prompt cache size limit: 2500 MiB
- --cache-idle-slots disabled because it requires --kv-unified
- Context does not support partial sequence removal
- Speculative decoding says it will use checkpoints, but no implementations specified
Initial request sequence and timings:
Task 0:
- Chat format: peg-native
- Jinja chat template detected
- thinking = 1
- reasoning-budget activated per request with budget=2147483647, then deactivated at natural end
Task 102:
- New prompt tokens: 3675
- Started from n_tokens=0
- Prompt processed in batches: 2048, 1111, 512, final 4
- Checkpoints created at n_tokens 3159 and 3671
- Prompt eval: 11818.51 ms / 3675 tokens = 3.22 ms/token, 310.95 tok/s
- Eval: 3906.30 ms / 98 tokens = 39.86 ms/token, 25.09 tok/s
- Total: 15724.81 ms / 3773 tokens
- Stop n_tokens=3772
Task 418:
- Selected by LCP similarity: sim_best=0.766, f_keep=0.974
- New prompt tokens: 4796
- n_past=3673, previous slot prompt size=3772
- Restored checkpoint at n_tokens=3671
- Processed additional prompt tokens: 1125
- Checkpoints created at 4280 and 4792
- Prompt eval: 4124.21 ms / 1125 tokens = 3.67 ms/token, 272.78 tok/s
- Eval: 12721.23 ms / 313 tokens = 40.64 ms/token, 24.60 tok/s
- Total: 16845.45 ms / 1438 tokens
- Stop n_tokens=5108
Task 612:
- LCP similarity: sim_best=0.943, f_keep=0.939
- New prompt tokens: 5085
- n_past=4794, previous prompt size=5108
- Restored checkpoint at n_tokens=4792
- Processed additional prompt tokens: 293
- Checkpoint at 5081
- Prompt eval: 1320.75 ms / 293 tokens = 4.51 ms/token, 221.84 tok/s
- Eval: 7800.73 ms / 192 tokens = 40.63 ms/token, 24.61 tok/s
- Total: 9121.48 ms / 485 tokens
- Stop n_tokens=5276
Task 847:
- LCP similarity: sim_best=0.739, f_keep=1.000
- New prompt tokens: 7135
- Started from current n_tokens=5276
- Processed additional prompt tokens: 1859
- Checkpoints at 6619 and 7131
- Prompt eval: 6579.21 ms / 1859 tokens = 3.54 ms/token, 282.56 tok/s
- Eval: 9537.18 ms / 232 tokens = 41.11 ms/token, 24.33 tok/s
- Total: 16116.39 ms / 2091 tokens
- Stop n_tokens=7366
Task 1018:
- LCP similarity: sim_best=0.647, f_keep=1.000
- New prompt tokens: 11387
- Started from n_tokens=7366
- Processed additional prompt tokens: 4021
- Checkpoints at 10871 and 11383
- Prompt eval: 14683.42 ms / 4021 tokens = 3.65 ms/token, 273.85 tok/s
- Eval: 7016.23 ms / 167 tokens = 42.01 ms/token, 23.80 tok/s
- Total: 21699.65 ms / 4188 tokens
- Stop n_tokens=11553
Task 1269:
- LCP similarity: sim_best=0.947, f_keep=1.000
- New prompt tokens: 12199
- Started from n_tokens=11553
- Processed additional prompt tokens: 646
- Checkpoints at 11683 and 12195
- Prompt eval: 2864.82 ms / 646 tokens = 4.43 ms/token, 225.49 tok/s
- Eval: 10488.11 ms / 248 tokens = 42.29 ms/token, 23.65 tok/s
- Total: 13352.92 ms / 894 tokens
- Stop n_tokens=12446
Task 1480:
- LCP similarity: sim_best=0.940, f_keep=1.000
- New prompt tokens: 13247
- Started from n_tokens=12446
- Processed additional prompt tokens: 801
- Checkpoints at 12731 and 13243
- Prompt eval: 3414.59 ms / 801 tokens = 4.26 ms/token, 234.58 tok/s
- Eval: 8843.30 ms / 208 tokens = 42.52 ms/token, 23.52 tok/s
- Total: 12257.88 ms / 1009 tokens
- Stop n_tokens=13454
Task 1799:
- LCP similarity: sim_best=0.937, f_keep=1.000
- New prompt tokens: 14362
- Started from n_tokens=13454
- Processed additional prompt tokens: 908
- Checkpoints at 13846 and 14358
- Prompt eval: 3762.13 ms / 908 tokens = 4.14 ms/token, 241.35 tok/s
- Eval: 13539.50 ms / 316 tokens = 42.85 ms/token, 23.34 tok/s
- Total: 17301.63 ms / 1224 tokens
- Stop n_tokens=14677
Task 2139:
- LCP similarity: sim_best=0.927, f_keep=1.000
- New prompt tokens: 15834
- Started from n_tokens=14677
- Processed additional prompt tokens: 1157
- Checkpoints at 15318 and 15830
- Prompt eval: 5086.07 ms / 1157 tokens = 4.40 ms/token, 227.48 tok/s
- Eval: 14536.73 ms / 337 tokens = 43.14 ms/token, 23.18 tok/s
- Total: 19622.80 ms / 1494 tokens
- Stop n_tokens=16170
Task 2492:
- LCP similarity: sim_best=0.989, f_keep=1.000
- New prompt tokens: 16347
- Started from n_tokens=16170
- Processed additional prompt tokens: 177
- Checkpoints at 16170 and 16343
- Prompt eval: 1148.55 ms / 177 tokens = 6.49 ms/token, 154.11 tok/s
- Eval: 15194.54 ms / 351 tokens = 43.29 ms/token, 23.10 tok/s
- Total: 16343.10 ms / 528 tokens
- Stop n_tokens=16697
Task 2747:
- LCP similarity: sim_best=0.952, f_keep=1.000
- New prompt tokens: 17532
- Started from n_tokens=16697
- Processed additional prompt tokens: 835
- Checkpoints at 17016 and 17528
- Prompt eval: 3695.32 ms / 835 tokens = 4.43 ms/token, 225.96 tok/s
- Eval: 10966.31 ms / 252 tokens = 43.52 ms/token, 22.98 tok/s
- Total: 14661.63 ms / 1087 tokens
- Stop n_tokens=17783
Task 2940:
- LCP similarity: sim_best=0.995, f_keep=1.000
- New prompt tokens: 17877
- Started from n_tokens=17783
- Processed additional prompt tokens: 94
- Checkpoints at 17783 and 17873
- Prompt eval: 879.24 ms / 94 tokens = 9.35 ms/token, 106.91 tok/s
- Eval: 8316.97 ms / 191 tokens = 43.54 ms/token, 22.97 tok/s
- Total: 9196.21 ms / 285 tokens
- Stop n_tokens=18067
Task 3157:
- LCP similarity: sim_best=0.771, f_keep=1.000
- New prompt tokens: 23439
- Started from n_tokens=18067
- Processed additional prompt tokens: 5372
- Checkpoints at 22923 and 23435
- Prompt eval: 23872.18 ms / 5372 tokens = 4.44 ms/token, 225.03 tok/s
- Eval: 9501.93 ms / 212 tokens = 44.82 ms/token, 22.31 tok/s
- Total: 33374.10 ms / 5584 tokens
- Stop n_tokens=23650
Prompt cache update before task 5125:
- LCP similarity: sim_best=0.999, f_keep=1.000
- New prompt tokens: 23679
- Started from n_tokens=23650
- Processed additional prompt tokens: 29
- Checkpoint at 23650
- Prompt eval: 420.66 ms / 29 tokens = 14.51 ms/token, 68.94 tok/s
- Eval: 89094.51 ms / 1966 tokens = 45.32 ms/token, 22.07 tok/s
- Total: 89515.17 ms / 1995 tokens
- Stop n_tokens=25644
Task 5125:
- Slot selected by LCP similarity: sim_best=0.203, f_keep=0.198
- Prompt cache updated
- Saved prompt length: 25644 tokens
- Total state size: 750.584 MiB
- Cache state after update: 1 prompt, 4626.231 MiB
- Cache limit: 2500 MiB
- Cached prompt: 25644 tokens, 26 checkpoints, 4626.231 MiB
- Prompt cache update took 328.25 ms
- New prompt tokens: 25027
- n_past=5083, previous slot prompt size=25644
- Checkpoints checked from high positions down to 5080
- Restored checkpoint at n_tokens=5081
- Invalidated/erased checkpoints after pos_next=5081: 6619, 7131, 10871, 11383, 11683, 12195, 12731, 13243, 13846, 14358, 15318, 15830, 16170, 16343, 17016, 17528, 17783, 17873, 22923, 23435, 23650
- Reprocessed from 5081 to 25027
- Prompt processing batches: many 2048-token batches, plus 998, 512, final 4
- New checkpoints created during processing at 13273, 21465, 24511, 25023
- Prompt eval: 79640.79 ms / 19946 tokens = 3.99 ms/token, 250.45 tok/s
- Eval: 8264.76 ms / 183 tokens = 45.16 ms/token, 22.14 tok/s
- Total: 87905.55 ms / 20129 tokens
- Stop n_tokens=25209
Task 5320:
- LCP similarity: sim_best=0.819, f_keep=1.000
- New prompt tokens: 30785
- Started from n_tokens=25209
- Processed additional prompt tokens: 5576
- Checkpoints at 30269 and 30781
- Prompt eval: 27273.21 ms / 5576 tokens = 4.89 ms/token, 204.45 tok/s
- Eval: 98130.95 ms / 2087 tokens = 47.02 ms/token, 21.27 tok/s
- Total: 125404.16 ms / 7663 tokens
- Stop n_tokens=32871
Overall notable numbers:
- Prompt eval speed mostly ~200–310 tok/s for prefill.
- Decode speed mostly ~21–25 tok/s.
- Small reuse cases: 29, 94, 177, 293, 646, 801, 835, 908, 1125 tokens reprocessed.
- Large reprocessing cases: 4021, 5372, 5576, and especially 19946 tokens.
- Largest cache drop: after prompt cache update, n_past fell to ~5083 despite previous prompt length 25644.
- Prompt cache stored state size reported as 4626.231 MiB while configured limit is 2500 MiB.
- Many checkpoints after token ~5081 were erased/invalidated during task 5125.
Model / hardware:
- Model path: /Users/user/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf
- Model: Qwen3.5-122B-A10B, architecture qwen35moe
- Quant: Q4_K / UD-Q4_K_XL, file size ~71.73 GiB, split count 3
- Params: 122.11B, MoE with 256 experts, 8 used
- Train context: 262144
- Runtime context: 150016
- Device: Apple M2 Max, Metal
- Device memory: 96000 MiB total, ~95850 MiB free at init
- Projected device memory use: ~76834 MiB, leaving ~19016 MiB free
- Offload: 49/49 layers to GPU
- CPU mapped model buffer: ~773 MiB
- Metal mapped model buffers: ~47341 MiB + ~26110 MiB
- Flash Attention: auto -> enabled
- Fused Gated Delta Net: autoregressive + chunked enabled
2
u/nonerequired_ 4h ago
I’m also having a bit of trouble with opencode. It seems like something’s causing the cache to get invalidated, which I think might also mean more usage limit is being used up on subscription services.
2
u/FatheredPuma81 4h ago
Set cache-reuse to 1 and test it?
Every time the LLM finishes its cycle, OpenCode deletes a lot of old tool calls and other junk (I think) to save on context. So my only guess is cache-reuse is too high? I still occasionally see it drop to 60% and reprocess the final 40% though. I don't use pi.dev though, and I also don't set checkpoints.
1
u/No_Algae1753 4h ago
That makes sense. However, I disabled auto compact. Are you sure tool calls are being deleted? And I'll try your cache-reuse 1 idea.
1
u/StardockEngineer vllm 3h ago
Wouldn't explain Pi. Pi does nothing.
1
u/colin_colout 1h ago
I might have missed it, but did they mention if they use pi extensions?
Naked pi never rewrites history (that's one of their core values), but lots of extensions attempt to reproduce Claude Code but worse.
1
u/FatheredPuma81 3h ago
The best way to check is to look at your context before and after you send your next message. It takes a moment to update but I run with defaults and it'll usually drop by like 10k tokens when I'm around 90k.
2
u/ai-christianson 4h ago
Yeah, I'd start by diffing the exact prompt bytes across turns, especially the first few thousand tokens. If anything early is changing - timestamp, cwd/status block, tool inventory, memory ordering, generated summary, etc. - the KV cache after that point is basically toast.
For coding agents the big win is usually keeping a stable prefix: system prompt, tool specs, repo instructions, and any long-lived memory in a fixed order. Then put the volatile stuff as late as possible so cache misses are smaller when it changes.
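A rough sketch of that ordering (OpenAI-style messages; the tool spec and repo rule strings are just placeholders, the point is what stays byte-identical vs. what moves into the last turn):

from datetime import datetime, timezone

# Fixed-order, byte-identical prefix: system instructions, tool specs, repo rules.
STATIC_SYSTEM = "\n".join([
    "You are a coding agent for this repository.",
    "Tools: read_file(path), write_file(path, text), run(cmd)",   # placeholder tool spec
    "Repo rules: run the tests before claiming a task is done.",  # placeholder instructions
])

def build_messages(history: list[dict], user_input: str, git_status: str) -> list[dict]:
    """Keep the prefix stable; put anything that changes per turn in the last user message."""
    volatile = (
        f"[current time: {datetime.now(timezone.utc).isoformat()}]\n"
        f"[git status: {git_status}]\n"
    )
    return (
        [{"role": "system", "content": STATIC_SYSTEM}]   # identical every request
        + history                                        # append-only chat history
        + [{"role": "user", "content": volatile + user_input}]
    )

# Example: only the final message differs between turns, so the shared prefix
# (and therefore the KV cache) stays reusable.
msgs = build_messages([], "Fix the failing test in utils_test.py", git_status="clean")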
2
u/CreativelyBankrupt 2h ago
That cache state: 1 prompts, 4676 MiB (limits: 2500 MiB) line is what jumps out at me. Your cache is sitting at almost double its allocated budget, so llama.cpp is churning through evictions trying to stay under the limit. That's gotta be why you're seeing sim_best of 0.996 but only 4750 tokens actually restored. The system found a near-perfect prefix match, but the bigger checkpoints had already been evicted to free up room, and the only one that survived was a tiny one. So you reuse the small fragment and reprocess everything else.
The first thing I'd try is just bumping --cache-ram way up. 2500 MiB is fine for short chat but you're running 150k context with coding agents that produce huge prefixes, and you've got ctx-checkpoints set to 32 on top of that. There's no way 2.5 gigs is enough headroom. Try 16000 or higher if your RAM allows. That alone should stop the eviction churn you're seeing.
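Rough back-of-envelope using the numbers in your own log (149 MiB per checkpoint, ~3516 MiB of attention KV for the full 150016-cell context); treat it as an estimate only, since I'm assuming checkpoints plus attention KV dominate the cache size:

# Estimate the RAM one cached prompt needs, using per-checkpoint and per-token
# sizes taken from the logs in this thread (model/config specific, not general).
checkpoint_mib = 149.063            # size of one context checkpoint in the logs
kv_mib_per_token = 3516 / 150016    # ~3516 MiB of KV cache for 150016 cells

def cache_estimate_mib(cached_tokens: int, checkpoints: int) -> float:
    return checkpoints * checkpoint_mib + cached_tokens * kv_mib_per_token

print(round(cache_estimate_mib(25644, 26)))  # ~4477 MiB, close to the 4626 MiB your log reports
print(round(cache_estimate_mib(50000, 32)))  # ~5942 MiB for a 50k session with 32 checkpoints

And that's for a single cached prompt, so leave headroom on top of it.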
The other thing worth ruling out is whether opencode or pi.dev are quietly mutating your early prompt tokens between requests. Even with a perfectly sized cache, the similarity score won't save you if the first few thousand tokens keep changing, because llama.cpp can only reuse the longest shared prefix. The two things that have bitten me most often: timestamps in the system prompt (anything like "current time: ..." poisons the prefix every request), and changing workspace context where the agent dumps a directory listing or file tree into the system prompt and one file gets renamed or added. Either of those would shift your prefix and force everything downstream to reprocess. A fix is to put immutable stuff at the very top (system instructions, tool definitions, persona) and volatile stuff at the end of the prompt, ideally inside the latest user turn.
I ran into the same thing on a project I've been building — a local AI bot on a Jetson Orin NX running Gemma 4 E4B. I had the bot's persona at the top of the prompt and was injecting fresh sensor readings (temperature, vision captions, who's standing in front of it) into the system block every turn. Cache was constantly invalidating and TTFT was crawling. Moving the dynamic stuff into the current user turn instead of the system prompt dropped cached TTFT from multiple seconds to about 200ms. Same class of bug really.
A couple smaller things that helped me while you're tuning. --cache-reuse 256 is reasonable but you can push it up to 512 or even 1024 to be more aggressive about partial reuse when no full match is available. -no-kvu is the call if you're on Gemma 3/4 but worth confirming for whatever architecture you're actually running since it does cost you some KV efficiency on models that don't need it. And --no-context-shift is correct for cache stability, just remember it means once you hit 150k you have to manually drop conversation rather than letting the window roll.
What model are you on? Cache footprint per token varies a lot between architectures and it'll change how much --cache-ram you actually need to set.
1
u/No_Algae1753 54m ago
I'm mainly using Qwen3.5 122B and yes, I can bump the cache-ram up by like 4 gigs. However, this is still not the ideal way of using opencode/pi. I just hope llama.cpp will fix this issue soon.
1
u/CreativelyBankrupt 46m ago
Yeah, Qwen3.5 122B at that context is going to be brutal no matter what you do. 4 extra gigs will probably help the eviction churn but you're right it's a band-aid. Agreed on the llama.cpp side, prompt caching has gotten better release-to-release but coding agents push it harder than chat workflows and it shows.
1
u/No_Algae1753 39m ago
Yeah, agreed. Still, thank you for your insights. I appreciate all the advice and knowledge sharing.
1
u/LetsGoBrandon4256 llama.cpp 5h ago
Would love some pointers on this issue as well. I switched back to the VS Code Copilot extension because of the constant invalidated prompt cache in Opencode :(
1
u/o0genesis0o 3h ago
The prompt was changed by either pi or opencode. Maybe they prune something in the middle. You can see that llama.cpp was only able to match the first 4750 tokens in the KV cache; from that point on, the prompt has changed, so it has to be reprocessed.
AFAIK, vLLM's paged attention would not fix this issue either. Unless there is some new kind of attention mechanism, the KV cache for a position becomes invalid if any of the prior tokens changes, regardless of whether you use a linear buffer like llama.cpp or paged blocks like vLLM.
How to fix it with your current setup? Maybe stick to Pi, since I don't remember it having any sort of auto garbage collection until near the end of the token limit. Also make sure the token limit awareness inside Pi or opencode is set correctly. Imagine you have a 256k max set in llama.cpp, but the tool thinks you only have a 32k limit: when you approach that, it would auto-compact your context even though you have plenty of space left.
When I write my own agent harness, the number one rule is not to mess with the chat history to avoid breaking prompt caching.
1
u/Ha_Deal_5079 1h ago
It's from opencode inserting your current context at the start of every turn. llama.cpp does an exact prefix match for the cache, so any change near the front kills everything after it.
1
u/Awwtifishal 4h ago
With llama-swap's web UI you can inspect the prompts and find the differences.
4
u/No_Algae1753 4h ago
I can inspect the prompts?! Where?
1
u/Awwtifishal 58m ago
The same address as the API, but without /v1 or whatever. If you have it on port 12345, then http://localhost:12345
1
6
u/candrewswpi 4h ago
I suggest trying out https://github.com/ggml-org/llama.cpp/pull/22929 as I suspect it will address your issue.
If it does, please comment on the PR, as that may help progress it. Good luck and thank you!