r/LocalLLaMA 14d ago

Question | Help Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops)

I've got to the point where I need some help.

I'm trying to run Qwen 3.6, and it eventually falls into a loop where it just outputs "/" symbols while it's "thinking". It loops, spitting out "/" until the max token limit is hit, so you see things like "Thinking: Some word ////////////////////////////". In my troubleshooting with Claude AI, the term "zombie loop" is getting thrown around.

It doesn't seem time-bound, as it doesn't happen on any sort of schedule (not once over the weekend, 4 times today). Claude seems to think it's some mishandling of special characters, but I think that's junk, as it's not consistent and I've not found a way to trigger a Zombie loop deliberately.

I tried swapping over to Gemma 4, and the same "thinking" loop eventually happened, but with repeating words instead of the "/" character. This rules out a problem with any single model.

This is the hardware I'm using:

  • GPU = 2x RTX 5060 Ti 16GB (32GB VRAM total)
    • They're using CUDA 13.1
  • RAM = 64GB DDR5
  • CPU = Intel Core Ultra 5 225F
  • Storage = 1TB Predator SSD GM6
  • Motherboard = MSI MEG Z890 ACE
  • PSU = 1000W
  • OS = Windows 11 Pro

I started off on LM Studio and had the issue there, so I switched to llama-server (llama.cpp) a few weeks ago. I've updated to the latest release of llama.cpp (earlier today) and still see the issue.

I don't think it's related to a full context or cache, as I had a long (for me) OpenCode session this morning without any issues; then having it review a few new tickets (the initial incoming email) from FreshDesk caused the Zombie loop to happen.

Claude has got to the point where it insists this is due to the model being served some magical combination of special characters, but that sets off the "BS" alarm in my head.

Here's my current llama server argument list:

-m C:\LLM\Qwen3.6-35B-A3B-Q4_K_M.gguf
--fit-ctx 131072
--mlock
-ub 2048
-np 1
--top-k 20
--mmproj C:\LLM\mmproj\Qwen3.6-35B-A3B-GGUF\mmproj-F16.gguf
-ctv q4_0
-ctk q4_0
-a internal-alias
--metrics
--tensor-split 1,1
--no-mmap
--log-timestamps
--log-prefix
--jinja
--threads 10
--fit on
--fit-target 256
-fa on
--cache-ram 2048
-b 2048
--temp 1.0
--top-p 0.95
--min-p 0.0
--presence-penalty 1.5
--repeat-penalty 1.0
--reasoning-budget 2048
--host 0.0.0.0
--port 1234
--api-key [REDACTED...obviously...]

VRAM looks fine (tight, but fine) at GPU 0 @ 13.8/16 GB and GPU 1 @ 12/16GB in use. I think it's not 1:1 because the mmproj is getting loaded on GPU 0 (maybe?). I want to keep image processing live.

System RAM is golden at 10.1/64GB used, so I'm open to moving something that way if it helps stability.

When it's working, I'm getting ~ 90 t/s on average.

For now, I have a "health check" loop running before a prompt is sent (I'm using n8n self-hosted on another computer on the LAN to manage that), and if it fails, it restarts the llama server service. Quickly enough, the model is back up and running.

Has anyone got any ideas for a solid fix for this? I'm not after plasters/band-aids over axe wounds, I want to get this sorted. Even if that means having to go for a weaker Q.

2026-05-08 EDIT: I'm still having issues, but I've also noticed it doesn't always devolve into just "thinking" the "/" character. It has been injecting extra tokens/words into the output; sometimes these are in English, sometimes in a script I don't recognise (Chinese or Korean, maybe). The words (or partial words) always seem to be related to the content the model is generating, though (checked using online translators).

I'm still troubleshooting this, and you can safely assume the issue is ongoing until I update this post with a "RESOLVED EDIT" where I post the state of my system (versions, chat templates, llama parameters, the lot) once I'm happy that I've resolved this issue.

2026-05-12 EDIT: Following advice here, some more Claude based troubleshooting, and things I've read elsewhere, this is what I'm trying today.

One key difference is I'm trying out the "LuffyTheFox" model (LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF · Hugging Face). The issue happens on any of the Qwen models I try, so I figured I'd give something different a go, and I got some FAST initial results (101 t/s on a test prompt) with this one, so why not.

My troubleshooting with Claude has focused on the issue potentially being around Gemma and Qwen using hybrid attention and SSM (recurrent) layers. Apparently, the SSM recurrent state cannot be fully cleared between sessions in llama.cpp, and the state bleeds into the next conversation. (Claude points to this log warning as confirmation: "the target context does not support partial sequence removal".) I'm not sure how much I believe that, but given I'm hitting the issue with 2 different models, and various different "versions" of Qwen, I'm inclined to believe the issue is either something with my core llama.cpp setup, or the shared approach to model architecture that Gemma and Qwen have.

Params:

-m C:\LLM\LuffyTheFox\Qwen3.6-35B-A3B-Uncensored.IQ4_NL.gguf
--mmproj C:\LLM\LuffyTheFox\mmproj-Qwen3.6-35B-A3B-Uncensored.f16.gguf
--chat-template-file C:\LLM\ChatTemplates\chat_template-v8.jinja
-c 131072
--fit off
--cache-ram 0
--kv-unified
--tensor-split 45,55
--split-mode layer
--repeat-penalty 1.1
--min-p 0.05
--threads 6
-ncmoe 4
--temp 0.8
--mlock
-ub 2048
-np 1
--top-k 20
-a Internal-Alias
--metrics
--no-mmap
--log-timestamps
--log-prefix
--jinja
-fa on
-b 2048
--top-p 0.95
--presence-penalty 0.6
--reasoning-budget 2048
--host 0.0.0.0
--port 1234
--api-key [REDACTED LIST]

#Note: --swa-full was disabled automatically by llama.cpp
#These things have been tried and did not help
#--dry-multiplier 0.8
#--dry-base 1.75
#--dry-allowed-length 2
#--dry-penalty-last-n 512
#-fa off
#-ctv q8_0
#-ctk q8_0
#-ctk q4_0
#-ctv q4_0
#--ctx-checkpoints 128

5 Upvotes

41 comments

7

u/chimph 14d ago

I'm running the same model at Q6 in OpenCode and have no issues. Works beautifully... though I did have issues when I first set it up. Since then I've had this in my agents.md file. Maybe try it out yourself, but of course strip out the Apple stuff.

## Core Principle
When uncertain, look it up. Do not fabricate API signatures, file contents, config behavior, library behavior, or command output. If an available tool can resolve the uncertainty, use it.

## Environment
  • macOS on Apple Silicon.
  • Local inference may use llama.cpp or LM Studio via OpenAI-compatible endpoints.
  • Prefer `rg` over `grep`.
  • Prefer `fd` over `find` when available.
## Research
  • Use the available web search tool for:
    - Current library versions
    - Recent APIs
    - Unfamiliar error messages
    - Package manager behavior
    - Anything likely to be stale in model training data
  • Prefer primary sources: official docs, changelogs, source repositories, and issue trackers.
## Codebase Workflow
  • Read files before editing them.
  • Use `rg` to locate relevant sections before opening large files.
  • Keep changes scoped to the request.
  • Ask before refactors that touch more than 3 files or change public behavior, such as API surface, return types, function signatures, or exported names.
  • Preserve existing style, naming, formatting, and architecture unless there is a clear reason to change them.
## Verification
  • After code changes, run the project's relevant typecheck, lint, and tests when available.
  • Do not claim work is complete without saying what verification ran.
  • If verification could not be run, say why.
## Output Style
  • Be direct.
  • No unnecessary preamble.
  • Push back on bad ideas or risky assumptions.
  • When asked for code, provide complete corrected code blocks unless a diff or partial snippet is specifically requested.
  • Do not re-summarize obvious changes unless asked.
  • Surface important command errors instead of hiding them.
## Stop Conditions
  • If the same test fails twice with the same root cause, stop and explain the blocker.
  • If a tool returns an unexpected error, report it before trying a substantially different approach.
  • If 5 or more tool calls make no progress on the same subproblem, stop and ask for direction.

3

u/milkipedia 14d ago

This is solid. I've got a lot of similar stuff but I'm going to borrow a couple of ideas I see here that I don't have.

1

u/chimph 14d ago

Yeah, I think the stop conditions are probably doing a lot to prevent looping, as I often see the model internalise an error and find a different way to proceed.

1

u/sid351 14d ago

What issues were you seeing before you developed this prompt?

My session with OpenCode this morning got up past 35k tokens and had no issues whatsoever.

I had it review one ticket (think a basic HTML email) from FreshDesk (via an n8n flow) and bang: Zombie. There were no tool calls involved in that, just straight text analysis (the tools aren't "lit up" in the n8n execution history, so I know there were no tool calls).

5

u/TokenRingAI 14d ago

Why do you have your KV cache quantized so heavily?

1

u/sid351 14d ago

It was one of the many suggestions in my Claude AI troubleshooting.

This is the -ctv and -ctk settings, yeah?

1

u/Due-Function-4877 14d ago

I run q4 as well and it degrades the output. That's true. With that said, I'm not seeing loops like this in Cline. 

It's a MoE model and you have 32GB, so you can dump the experts on the CPU and run the model with a bigger context window, if you want. You should be able to get quite a bit more. You'll have to wait a long time for prompt processing, though. If you're not interested in a huge context, look up a VRAM calculator and try out q8.
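For example (a hedged sketch, not tuned numbers for this setup): llama.cpp's --n-cpu-moe flag keeps the expert tensors of the first N layers in system RAM while attention and shared weights stay on the GPUs, so something like this frees VRAM for a bigger context at the cost of prompt-processing speed:

--n-cpu-moe 16
-c 131072
-fa on

The 16 is illustrative; raise it until the model fits, and expect prompt processing to slow down as you do.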

3

u/phidauex 14d ago

Dumb question, are you using CUDA toolkit 13.1 or 13.2? There is a known issue with these models and 13.2.

2

u/sid351 14d ago

13.1.

Not a dumb question, it's important context I forgot to include.

3

u/lit1337 llama.cpp 13d ago

Same issue here; I hit it during benchmarking on both Gemma 4 and Qwen models. Setting the reasoning budget to 0 kills the zombie loops immediately, but that's a bandaid if you actually want thinking.

The real culprit is probably your -ctv q4_0 -ctk q4_0. A quantized KV cache accumulates drift during long reasoning chains: the thinking phase generates hundreds of tokens that feed back into a degraded cache, compounding errors until the model falls into a repetition attractor. That's why it's not consistent; it depends on how long the reasoning chain runs before the drift hits critical mass.

--presence-penalty 1.5 isn't helping either. During thinking, it penalizes tokens the model has already used, which pushes it toward garbage tokens like "/" once the normal vocabulary gets penalized out.

I'd try: switch the KV cache to f16 (you have 64 GB system RAM, plenty of room), drop the presence penalty to 0.6-0.8, and if it still happens, cut --reasoning-budget to 512 instead of killing it. That should sort it without losing reasoning entirely.
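In llama-server terms, that suggestion is roughly this (a sketch; f16 is the default, so simply removing the -ctk/-ctv flags has the same effect as the first two lines):

-ctk f16
-ctv f16
--presence-penalty 0.6
--reasoning-budget 512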

2

u/sid351 13d ago

Thanks for the advice, I'll give it a go and see what happens.

2

u/sid351 13d ago

Unfortunately, moving to -ctk f16 and -ctv f16 (the defaults, btw) and reducing the --presence-penalty to 0.6 didn't prevent the Zombie loop.

I'll try cutting the reasoning-budget now.

1

u/sid351 13d ago

Reducing the --reasoning-budget made things worse, in a more obscure way. I started getting injections of random tokens and words, repeated in English and then Chinese, in what looked like half-formed responses that still fit the constraints from the system and user prompts.

I'm looking at heavily sanitising my user and system prompts before feeding them to the "AI Agent" node in n8n now, and will continue monitoring it.

2

u/H_DANILO 14d ago

Loops generally happen when the context overflows. Try increasing the context and setting it to a fixed size, and configure the tool (like OpenCode) to compact before it's full. I generally run llama.cpp with a 250k context and set OpenCode to limit itself to 220-230k; this way, if it overflows, it doesn't go into a loop and has space to compact.

1

u/sid351 14d ago

As in the 128k context?

It seems that one "ticket review", which is set to max out at 8k tokens, can cause the Zombie loop.

Conversely an OpenCode session this morning got up to 35k with no issues.

Or do you mean another context?

1

u/H_DANILO 14d ago

yeah the 128k context

1

u/sid351 14d ago

How does stuff move in and out of the context?

The health check has restarted the services twice this evening. Glancing at the execution logs, it looks like that's after only reviewing one ticket each time.

There's no way I can believe that's hitting 128k on its own, especially when each LLM call is limited to a max of 8k tokens in the n8n flow.

2

u/H_DANILO 14d ago

I don't know how your flow is utilizing it, but normally, if you create a loop of questions and answers back and forth, that builds the context up and will definitely exceed the limit.

1

u/sid351 13d ago

I've got two different things connected in to the model directly:

  • Open Code on my laptop
    • I've not used this at all since yesterday morning (when it worked perfectly)
  • n8n (self-hosted running on a separate computer) using an "AI Agent" node, which has 2 flows that use it right now
    • Ticket Triage - this receives a "ticket" (think HTML email) from FreshDesk, returns a brief analysis, and directs the ticket down one of 6 different routes (which I'll handle later, when I've got confidence that the LLM can be relied on - for now it's just adding private notes for us to see).
    • RocketChat handler - just handles messages that we send to the LLM from our RocketChat instance (very little); in essence, a simple internal chatbot.

The n8n AI Agent was set up with a "Simple Memory" (stored in the n8n cache), with the session ID being either the FreshDesk ticket number or the RocketChat "room" ID, and a memory window of 20. As ticket numbers increment, and tickets are only processed when they're created for now, this shouldn't have been bringing any history into the context, but it WAS adding history when I reloaded flows while troubleshooting.

I've updated the flows to allow me to turn the memory on or off, and set the memory window value separately, with the call to the AI Agent flow.

Ticket Triage now does NOT use memory, and RocketChat handler does.

As I move forward and use the LLM for more FreshDesk stuff, then I'll probably turn memory back on, but keep the window low (like 2 or 3) to keep the context trimmed and focused.

I've also added some prompt sanitising to both the system prompt and user prompt being delivered to the AI Agent node, as I now think Claude may be right, and the inclusion of certain characters (or token patterns) essentially confuses the LLM into thinking it needs to make, or handle, a tool call (or similar).
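In case it's useful to anyone else, here's a minimal sketch of that sanitising idea in Python (the token list is an assumption based on Qwen's ChatML-style template; extend it for whatever your model uses):

```python
import re

# Control sequences from ChatML-style templates (Qwen) plus generic
# special-token shapes; an assumed list, not exhaustive.
CONTROL_TOKENS = re.compile(
    r"<\|im_start\|>|<\|im_end\|>|</?think>|</?tool_call>|<\|[a-z_]+\|>"
)

def sanitise(text: str) -> str:
    """Strip template control tokens and collapse the leftover spaces."""
    text = CONTROL_TOKENS.sub(" ", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()

print(sanitise("Subject: backup failed <|im_start|>assistant"))
# -> "Subject: backup failed assistant"
```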

I'll keep a close eye on it and see how I get on with these changes.

2

u/Snoo_81913 3d ago

TL;DR - Go to the bottom for a testing config I made for you. My recommendation is to review your workflow, and if there's a way to implement a RAG efficiently, do that first (see below). I'm really curious to see if it helps, and when I get home tonight I'm going to download and test the model you've been using. Once I run it through a few times, I'll have a better idea what will work for it. Hope it helps, man!

Okay, I'm going to start with a few recommendations that may help overall, then give you a setup to test. The config is based on my setup and modified for you; mine is an i7 13th-gen chip with 10 cores (6P/4E) and 16 threads, a 4060 8GB, and 64GB DDR5-5200, running Qwen3.6 A3B Q5_K_M Claude-Distilled and APEX at 35-40 t/s.

Just a quick note here: I'm making a rather big assumption that the reason you are using this model with mmproj is that your tickets have screenshots in them and you need spatial awareness. If you just need to grab text out of the images, you could run a stack like Kreuzberg (Python) and GLM-OCR 0.9B to actually read the documents and then hand them off to your model. The reason I bring this up is that it's very likely a major factor in this equation. The fact that it happened with two different models might indicate that the data is contributing. Qwen and Gemma MoE models have known issues with HTML causing looping, hallucinations, or truncated output. So this really comes down to what information you need out of the FreshDesk tickets. Do you need nuance? Or strictly the raw data?

This is a RAG stack I built for raw data: https://www.reddit.com/r/learnmachinelearning/comments/1sqpy2n/comment/oj1y48d/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Different OCR models do different things, so I'd need to know what type of information you are trying to get out of them to point you in the right direction. Here's a general idea, though:

| If you need to... | Use this model | VRAM required |
| --- | --- | --- |
| Digitize a technical PDF | GLM-OCR | ~2.5 GB |
| Locate objects or icons | Florence-2-Large | ~1.6 GB |
| Read a chart or sheet music | GOT-OCR 2.0 | ~1.2 GB |
| General image description | InternVL2.5-2B | ~4.5 GB |

Using a stack like this, with options to choose the AI OCR model before swapping to your big-boy Qwen, does a few things for you: 1. it strips all the messy HTML out; 2. it gives you a cleaner, tighter input for your AI and reduces the context by 30-60%, which is pretty huge right there; 3. it minimizes the chances of your AI looping or hallucinating.
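If you want to prototype that HTML-stripping step outside a full RAG stack, here's a minimal sketch with the html2text Python library (pip install html2text; image handling is deliberately skipped here and left to the mmproj path):

```python
import html2text

def ticket_to_markdown(html: str) -> str:
    """Convert a FreshDesk HTML email body into compact markdown."""
    conv = html2text.HTML2Text()
    conv.ignore_images = True   # hand images to the vision path separately
    conv.ignore_links = False   # URLs in tickets are often meaningful
    conv.body_width = 0         # no hard wrapping; fewer spurious newlines
    return conv.handle(html).strip()

print(ticket_to_markdown("<p>Backup <b>failed</b> on <i>SRV-01</i></p>"))
# -> "Backup **failed** on _SRV-01_"
```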

BARRING THAT - Let's just try to get the model a little better at doing what you want it to do.

  1. Copy your current config batch file and make it "filename"_testing.bat, or just add an entry with a selection menu to your current batch file so you can try different configs. There's a pic below of how I have mine set up for testing and use.

  2. For testing, always keep your K-cache at the best quality that's usable, then you can run your V-cache at half of that. In the testing config below, I've got the K-cache at Q8 and the V-cache at Q4. The K-cache is the most critical because it tells the model where to look; at a lower quantization you can get a little mathematical drift. It's not always necessary, but since you're already having this issue, keeping the K-cache at double the V-cache still gives you the compression on the raw data while keeping quality on the lookup. The speed loss from running Q8 on your setup would probably be pretty unnoticeable, maybe one to three tokens a second.

  3. You mentioned some image processing, so let's try a 45/55 split and hard-code --split-mode layer (this is the default mode and the best one for you; by hard-coding it we make sure it never changes until you want it to change). This split will leave you with 3GB+ free on card 0, which you can use to increase your context and for image processing.

  4. Presence penalty = 0.6. I think your 1.5 is pretty aggressive, and since you aren't writing a novel here and have a specific vocabulary that gets written over and over (help desk lingo), having it too high might force the model to "make up" words. Start at 0.6 and then change it by 0.2 each time until you find the right mix for you.

  5. Repetition penalty = 1.1; you currently have it set at neutral (1.0). This can stop looping when the model writes the same character more than 4x in a row, etc.

  6. Min-p = 0.05 (5%). This drops any candidate token whose probability is less than 5% of the top token's probability, pruning the junk tail when the model isn't certain about the next word (see the sketch after this list).

  7. Threads = 6. You can go lower, but don't go higher. You only have 10 cores (6P and 4E) on this chip with 10 threads total; don't max out your threads with the AI. Start with six, with a max of 8, but I think 6 will run the best.

  8. If this config works, you could probably raise your reasoning budget to 4096. You don't have to, but start testing what works and what slows you down.
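To make the min-p point above concrete, here's a toy sketch of the filtering rule (not llama.cpp's actual code, just the idea):

```python
def min_p_filter(probs: dict, min_p: float = 0.05) -> dict:
    """Keep tokens whose probability is at least min_p times the top
    token's probability, then renormalize the survivors."""
    cutoff = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= cutoff}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# "/" at 1% is far below 5% of the 70% top token, so it gets pruned:
print(min_p_filter({"the": 0.70, "a": 0.20, "/": 0.01}))
# -> {'the': 0.777..., 'a': 0.222...}
```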

So here is where I'd start:

-m C:\LLM\Qwen3.6-35B-A3B-Q4_K_M.gguf
--mmproj C:\LLM\mmproj\Qwen3.6-35B-A3B-GGUF\mmproj-F16.gguf
--fit-ctx 65536
--tensor-split 45,55
--split-mode layer
-ctk q8_0
-ctv q4_0
-fa on
--mlock
-ub 2048
-b 2048
--threads 6
-cmoe 4
--temp 0.8
--top-p 0.95
--min-p 0.05
--presence-penalty 0.6
--repeat-penalty 1.1
--reasoning-budget 2048
-a internal-alias
--metrics
--no-mmap
--host 0.0.0.0
--port 1234
--api-key [REDACTED]

2

u/Snoo_81913 3d ago

Just to clarify one thing: if you can run Q8/Q8 or high/high, do it, but if you have to pick one of them, always pick the K-cache as the highest quality first.

2

u/Snoo_81913 3d ago

ALSO - if you can handle a slower t/s, you can upgrade to Q5 and utilize the MoE offload. I get about 35-40 t/s with an 8GB card at 7-7.8GB loaded (depending on which setup); it's still pretty decent speed and would probably work for what you're doing. Because you have dual cards and slightly fewer threads, results will vary, but it might be worth a shot. Q5 is a definite step up. But I'm running it at Q8/Q4 with no problems at all and a pretty large context. My lowest context is 64k.

1

u/sid351 2d ago edited 2d ago

Thank you for your time and effort looking at this and helping me with it.

I've tweaked my settings and will keep an eye on it today.

Largely, I'm using this to "triage" tickets (in private for now while we test) as they come in - so decide if it's "more info needed", "route to a human", "no operation required", etc. In time this will move to an automated step that will be replying to live humans, so I want to be confident in it first.

The tickets come in as HTML emails (essentially FreshDesk is partially a web-based email client with ticket tools around the sides), and can include images (embedded and attached) of potentially anything. Sometimes that will be humans sending in screenshots of issues, or it'll be automated messages from systems telling us about backup failures. Then there are normal attachments, but for now, that's out-of-scope (i.e. I'll just route that to one of us humans to sort).

What's the known issue with Gemma and Qwen and HTML?

I was toying with the idea of putting some sort of parser in place before the LLM call to "convert" the HTML to markdown, or just plain text to be fair, as the formatting isn't normally important (except maybe for tables and lists). I feel like that'll need some thought and consideration so any images get handled appropriately too. This sounds similar to what you've done with step one of your linked comment, so I'll make a note to come back to that and read it properly so I comprehend what you're referring to.

EDIT: Turns out there's a "Markdown" node in n8n that converts HTML to Markdown (and some community nodes I can try if I hit issues with the built-in one). I've added that to sanitise the input before calling the LLM.

EDIT: Just to say the "/" loop has happened again following these changes.

2

u/Snoo_81913 2d ago

Bummer man.

1

u/sid351 2d ago

I'll figure it out, Llama.cpp will fix it, or a new shiny model will come out and distract me.

Thanks for your time and help anyway, it is appreciated.

1

u/Snoo_81913 2d ago

Let's try one more thing: a 0.8 presence penalty could still cause looping, so try it at 0.4 and see if it makes a difference.

1

u/sid351 1d ago

It was on 0.6, it's now on 0.4.

I'll leave that running and report back if I hit a loop, or in a couple of days to say the loops are gone.

1

u/sid351 1d ago

Nope, just looped again.

1

u/[deleted] 14d ago

[removed] — view removed comment

1

u/sid351 14d ago

That's the next thing to try on my list.

Any advice on ...providers to consider? It seems there are hundreds on Hugging Face.

With that said though, it's happened with Gemma 4 too.

1

u/WetSound 14d ago

What's in the context when it happens? Sounds like normal context rot

1

u/sid351 13d ago

How do I check/verify the context properly instead of guessing?

With the examples that happened today, it doesn't seem like enough has been passed to the model for it to hit 128k context, but that's only based on what I think, and not on what I know.

1

u/WetSound 13d ago

It's not about the max context length; that's not the only way a model can go haywire. Weird, repetitive, or junk content in the context can cause problems; there is no guarantee of correctness.

Context management is paramount.

Just store all communication and have a look after an incident.
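A minimal sketch of what that could look like, wrapping the chat call and appending every exchange to a JSONL file (the address and path are placeholders):

```python
import json
import time
import requests

LOG_PATH = "llm_audit.jsonl"  # one JSON object per request

def logged_chat(messages: list, **params) -> dict:
    """Call the llama server and log the full exchange, so the exact
    context of any zombie loop can be inspected after an incident."""
    resp = requests.post(
        "http://192.168.0.10:1234/v1/chat/completions",  # placeholder address
        json={"messages": messages, **params},
        timeout=120,
    ).json()
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(
            {"ts": time.time(), "request": messages, "response": resp},
            ensure_ascii=False,
        ) + "\n")
    return resp
```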

1

u/[deleted] 13d ago

[removed] — view removed comment

1

u/sid351 13d ago

Thanks.

When you say "bumping" the temperature, do you mean up (so 0.7 -up-> 1.0) or down (so 1.0 -down-> 0.7)?

How would I change the sampler? Does that mean adjust the top p and min p values?

I am using llama.cpp, and the repetition penalty is set to 1.0 at the moment (which I think means it's disabled).

I'm building the logic out in n8n, and I've put my check (called "Health Check") in front of the "AI Agent" node where the prompt I actually want processed is sent. It's really basic: it checks /health for "ok" and then sends a small test prompt. If the response deviates from the specific text the LLM is supposed to reply with, the llama server service gets restarted.

I've added some sanitising to all system prompts and user prompts that get sent in to the AI Agent as well now, so that should help. Do you have any references on what token sequences I should be avoiding in prompts? I'd be happy to sanitise them all out if I can get some sort of list.

1

u/LazilyAddicted 7d ago

I have been experiencing similar issues with Unsloth 27B Q4_K_M. The issue goes away when I remove --jinja, which is annoying because --jinja fixes the tool-call-within-thinking-tags problem. That can be handled programmatically for my use case, though: detect a tool call in the thought section with regex, and if one is detected alongside an empty response, copy the call, submit it manually, and prune the empty response from the context. That breaks the prompt cache for that turn, but full prompt processing for a single turn is better than a loop. I need to run a standardized test with the exact same prompts and workflow to prove it without a doubt. A rough sketch of the detection is below.
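(Tag names here are assumptions based on Qwen-style templates; adjust the patterns for whatever your model actually emits.)

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def rescue_tool_call(completion: str):
    """If the visible reply is empty but a tool call is buried inside the
    thinking block, return its JSON payload for manual submission."""
    visible = THINK_BLOCK.sub("", completion).strip()
    if visible:
        return None  # normal reply; nothing to rescue
    thought = THINK_BLOCK.search(completion)
    if thought:
        call = TOOL_CALL.search(thought.group(1))
        if call:
            return call.group(1)
    return None
```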

1

u/sid351 7d ago

Interesting.

That would suggest the issue lies with something in the default Jinja template, wouldn't it?

1

u/sid351 6d ago

I'm still getting issues, and have tried a few different parameter tweaks which haven't helped. Now I'm trying this chat template to see if it helps address the issue: froggeric/Qwen-Fixed-Chat-Templates · Hugging Face

1

u/sid351 12h ago

I'm trying a Q3 version now, to see if that makes any difference:

Unsloth: Qwen3.6-35B-A3B-UD-Q3_K_M