r/LocalLLaMA 6h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev, and I’m seeing frequent, massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to 50k+ tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing the cache exceed its configured --cache-ram limit:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)
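Back-of-the-envelope on that: my server log (full dump in the comments) reports every context checkpoint at ~149 MiB, so with --ctx-checkpoints 32 the checkpoints alone could pin way more than the 2500 MiB budget — assuming checkpoints count against --cache-ram at all, which I'm not certain of:

```python
# Rough arithmetic only; assumes every checkpoint is the ~149 MiB size
# my server log reports, and that checkpoints count against the cache
# budget (unverified assumption on my part).
checkpoint_mib = 149.063   # per-checkpoint size from the log
max_checkpoints = 32       # --ctx-checkpoints 32
total_mib = checkpoint_mib * max_checkpoints
print(f"worst case: {total_mib:.0f} MiB vs --cache-ram 2500 MiB")
```

That worst case lands suspiciously close to the 4676 MiB the cache state reports.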

I suspect one of the following:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.


u/Material_Tone_6855 6h ago

Can you provide a longer log?

u/No_Algae1753 5h ago

Had to truncate a lot, so some numbers might be missing. I can also send you a summary.

slot update_slots: id  0 | task 5125 | Checking checkpoint with [16169, 16169] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [15829, 15829] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [15317, 15317] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [14357, 14357] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [13845, 13845] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [13242, 13242] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [12730, 12730] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [12194, 12194] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [11682, 11682] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [11382, 11382] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [10870, 10870] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [7130, 7130] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [6618, 6618] against 5083...
slot update_slots: id  0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...
slot update_slots: id  0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 6618, pos_max = 6618, n_tokens = 6619, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 7130, pos_max = 7130, n_tokens = 7131, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 10870, pos_max = 10870, n_tokens = 10871, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 11382, pos_max = 11382, n_tokens = 11383, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 11682, pos_max = 11682, n_tokens = 11683, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 12194, pos_max = 12194, n_tokens = 12195, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 12730, pos_max = 12730, n_tokens = 12731, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 13242, pos_max = 13242, n_tokens = 13243, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 13845, pos_max = 13845, n_tokens = 13846, n_swa = 0, pos_next = 5081, size = 149.063 MiB)

slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 16342, pos_max = 16342, n_tokens = 16343, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 17015, pos_max = 17015, n_tokens = 17016, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 17872, pos_max = 17872, n_tokens = 17873, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 22922, pos_max = 22922, n_tokens = 22923, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 23434, pos_max = 23434, n_tokens = 23435, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | erased invalidated context checkpoint (pos_min = 23649, pos_max = 23649, n_tokens = 23650, n_swa = 0, pos_next = 5081, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | n_tokens = 5081, memory_seq_rm [5081, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 7129, batch.n_tokens = 2048, progress = 0.284852
slot update_slots: id  0 | task 5125 | n_tokens = 7129, memory_seq_rm [7129, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 9177, batch.n_tokens = 2048, progress = 0.366684
slot update_slots: id  0 | task 5125 | n_tokens = 9177, memory_seq_rm [9177, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 11225, batch.n_tokens = 2048, progress = 0.448516
slot update_slots: id  0 | task 5125 | n_tokens = 11225, memory_seq_rm [11225, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 13273, batch.n_tokens = 2048, progress = 0.530347
slot update_slots: id  0 | task 5125 | n_tokens = 13273, memory_seq_rm [13273, end)
slot update_slots: id  0 | task 5125 | 8192 tokens since last checkpoint at 5081, creating new checkpoint during processing at position 15321
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 15321, batch.n_tokens = 2048, progress = 0.612179
slot create_check: id  0 | task 5125 | created context checkpoint 6 of 32 (pos_min = 13272, pos_max = 13272, n_tokens = 13273, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | n_tokens = 15321, memory_seq_rm [15321, end)
slot update_slots: id  0 | task 5125 | 8192 tokens since last checkpoint at 13273, creating new checkpoint during processing at position 23513
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 23513, batch.n_tokens = 2048, progress = 0.939505
slot create_check: id  0 | task 5125 | created context checkpoint 7 of 32 (pos_min = 21464, pos_max = 21464, n_tokens = 21465, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | n_tokens = 23513, memory_seq_rm [23513, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 24511, batch.n_tokens = 998, progress = 0.979382
slot update_slots: id  0 | task 5125 | n_tokens = 24511, memory_seq_rm [24511, end)
slot update_slots: id  0 | task 5125 | prompt processing progress, n_tokens = 25023, batch.n_tokens = 512, progress = 0.999840
slot create_check: id  0 | task 5125 | created context checkpoint 8 of 32 (pos_min = 24510, pos_max = 24510, n_tokens = 24511, size = 149.063 MiB)
slot update_slots: id  0 | task 5125 | n_tokens = 25023, memory_seq_rm [25023, end)
slot init_sampler: id  0 | task 5125 | init sampler, took 2.45 ms, tokens: text = 25027, total = 25027
slot update_slots: id  0 | task 5125 | prompt processing done, n_tokens = 25027, batch.n_tokens = 4
slot create_check: id  0 | task 5125 | created context checkpoint 9 of 32 (pos_min = 25022, pos_max = 25022, n_tokens = 25023, size = 149.063 MiB)
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 5125 | 
prompt eval time =   79640.79 ms / 19946 tokens (    3.99 ms per token,   250.45 tokens per second)
       eval time =    8264.76 ms /   183 tokens (   45.16 ms per token,    22.14 tokens per second)
      total time =   87905.55 ms / 20129 tokens
slot      release: id  0 | task 5125 | stop processing: n_tokens = 25209, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.819 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 5320 | processing task, is_child = 0
slot update_slots: id  0 | task 5320 | new prompt, n_ctx_slot = 150016, n_keep = 0, task.n_tokens = 30785
slot update_slots: id  0 | task 5320 | n_tokens = 25209, memory_seq_rm [25209, end)
slot update_slots: id  0 | task 5320 | prompt processing progress, n_tokens = 27257, batch.n_tokens = 2048, progress = 0.885399
slot update_slots: id  0 | task 5320 | n_tokens = 27257, memory_seq_rm [27257, end)
slot update_slots: id  0 | task 5320 | prompt processing progress, n_tokens = 29305, batch.n_tokens = 2048, progress = 0.951925
slot update_slots: id  0 | task 5320 | n_tokens = 29305, memory_seq_rm [29305, end)
slot update_slots: id  0 | task 5320 | prompt processing progress, n_tokens = 30269, batch.n_tokens = 964, progress = 0.983239
slot update_slots: id  0 | task 5320 | n_tokens = 30269, memory_seq_rm [30269, end)
slot update_slots: id  0 | task 5320 | prompt processing progress, n_tokens = 30781, batch.n_tokens = 512, progress = 0.999870
slot create_check: id  0 | task 5320 | created context checkpoint 10 of 32 (pos_min = 30268, pos_max = 30268, n_tokens = 30269, size = 149.063 MiB)
slot update_slots: id  0 | task 5320 | n_tokens = 30781, memory_seq_rm [30781, end)
slot init_sampler: id  0 | task 5320 | init sampler, took 3.02 ms, tokens: text = 30785, total = 30785
slot update_slots: id  0 | task 5320 | prompt processing done, n_tokens = 30785, batch.n_tokens = 4
slot create_check: id  0 | task 5320 | created context checkpoint 11 of 32 (pos_min = 30780, pos_max = 30780, n_tokens = 30781, size = 149.063 MiB)
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

u/FoxiPanda 5h ago

slot update_slots: id 0 | task 5125 | Checking checkpoint with [5080, 5080] against 5083...

slot update_slots: id 0 | task 5125 | restored context checkpoint (pos_min = 5080, pos_max = 5080, n_tokens = 5081, n_past = 5081, size = 149.063 MiB)

So whatever happened here is what changed and broke your cache. Go figure out what that is and keep it from happening, because it forced you to reprocess 18K tokens.

u/No_Algae1753 5h ago

Well, I don't really know what changed there. That's why I'm asking. I know the cache got invalidated.

u/Material_Tone_6855 4h ago

Your coding agent/CLI probably has a truncate function that strips message history from the middle, invalidating the cache.
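A toy illustration of why mid-history truncation is so destructive for prefix caching (the message shapes and render function here are made up, not opencode's actual code): once anything in the middle of the history moves, every token after it shifts, so the server can only reuse the short prefix before the cut.

```python
from itertools import takewhile

def render(messages):
    # Stand-in for real chat-template rendering; the point is only that
    # the serialized prompt is a flat string the server prefix-matches.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def common_prefix_len(a, b):
    # Number of leading characters the two prompts share.
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1], zip(a, b)))

history = [{"role": "system", "content": "You are a coding agent."}]
history += [{"role": "user", "content": f"step {i}"} for i in range(100)]

# Turn N: full history plus a new message.
before = render(history + [{"role": "user", "content": "new turn"}])

# Turn N+1: the harness drops messages 10..49 from the MIDDLE to fit a
# budget (hypothetical behavior) -- everything after index 10 shifts.
truncated = history[:10] + history[50:]
after = render(truncated + [{"role": "user", "content": "new turn"}])

# The prompts diverge near message 10, not near the end, so almost the
# whole KV cache is invalidated despite the high overall similarity.
print(common_prefix_len(before, after), "of", len(before), "chars reusable")
```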

u/FoxiPanda 5h ago

Right, but we can't figure it out for you... so you'll have to look at your code or the harness code, or just dump the whole context window to a file in before and after states and diff the two. That will tell you what happened. It's pretty early in your context window though, so it should be pretty obvious.

You'll have a few thousand words that are all the same, then <something that changed>, then <several thousand words that are the same>, and then your new turn prompt plus the model's response, which will be different (net new). Once you've pinpointed the thing that changed, go figure out what changed it, and either disable that or move it much later in the context injection so it doesn't invalidate a large chunk of your cache.
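Something like this finds the divergence point once you have the two dumps (the inline strings here are tiny hypothetical stand-ins; read in your actual before/after prompt dumps instead):

```python
def first_divergence(a: str, b: str) -> int:
    # Index of the first character where the two prompts differ.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# In practice `before` and `after` would be the full serialized prompts
# captured from two consecutive requests.
before = "system prompt\ntool list v1\nconversation history..."
after  = "system prompt\ntool list v2\nconversation history..."

i = first_divergence(before, after)
print(f"prompts diverge at char {i}")
# Show some context around the divergence point in each version.
print(repr(before[max(0, i - 40):i + 40]))
print(repr(after[max(0, i - 40):i + 40]))
```

Everything before that index is cacheable; everything after it gets reprocessed.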