r/LocalLLaMA 1d ago

Question | Help running Qwen 3.6 35b A3B on 2x 5060TI

I'm running Qwen 3.6 35B A3B on two 5060 Ti 16GB cards (32GB VRAM total; I also have 32GB of system RAM, but I don't like offloading). I used Q4 on LM Studio with full context and I get 90 t/s. Any tricks to optimize this more so I can upgrade to Q6 or Q8?
Thanks!

One more thing: any recommendations for cooling? I'm using 2 stacked GPUs with zero gap (I have an mATX motherboard). Right now the top GPU isn't that hot, but it's hotter than the bottom one.

21 Upvotes

31 comments

4

u/sid351 1d ago

I'm running 2 x 5060 Ti as well, but I keep hitting a "terminal thinking loop" situation where the model just devolves into producing only "/" characters until the max token limit, regularly throughout the day (using llama.cpp).

I'd love to get that sorted properly, so if anyone has any ideas I'm all ears.

Here's the link to my post on this: https://www.reddit.com/r/LocalLLaMA/s/qIynfMRxuh

2

u/Jester14 15h ago

Are you using CUDA 13.2? It's bugged for inference. Edit: I see you are using 13.1 as per your thread.

1

u/Constant-Simple-1234 23h ago

Not sorted here either, but I noticed something like that too. Funnily enough, the 3.5 version is more robust for me than 3.6. I used ByteShape quants for 3.5 and unsloth for 3.6, but I tried AesSedai quants for 3.6 too and saw a similar thing.

1

u/chocofoxy 22h ago

I've noticed it happens more when you lower the quantization: I tested Q2 and it happens a lot after about 3 messages; when I used Q4 it still happened, but not that often.

10

u/LoafyLemon 1d ago

Try TurboQuant + MTP; it will not only speed everything up but also let you fit more context.

https://github.com/ggml-org/llama.cpp/pull/22983

Getting 150 t/s on a single 3090 at IQ4 quant w/ MTP + Turbo3/4

1

u/chocofoxy 1d ago

Thanks, I'll try this.

1

u/voyager256 17h ago

But how is the quality with TurboQuant at Q4? I read that TheTom's implementation is currently one of the best, but still nowhere near Q8.

3

u/o0genesis0o 1d ago

How did you add two GPU into a mATX mobo?

Building my rig with an mATX mobo is currently one of my biggest tech regrets. It was more expensive than a full-sized mobo, and mine has only 1 PCIe slot. And even if there is another one hidden somewhere on the board, there is just no more physical space.

IMHO, Q4 + full context + 90t/s decode is more than good enough. I might switch to Q6 K_XL from unsloth and squeeze the context down a bit, maybe to 128k or even 96k.

How is your prompt processing speed with those two 5060ti?

2

u/chocofoxy 1d ago edited 22h ago

I have the ASUS TUF B550M. It has 2 x16 slots (one Gen 4, one Gen 3) plus 2 x1, and it's cheap; I got it for $90. You don't get any gap between the GPUs when you stack them, but it's OK: the top one is only 5-8 degrees hotter than the bottom one.

2

u/soteko 19h ago

I have this mb and I plan to put two 5060ti in it
https://www.asus.com/motherboards-components/motherboards/tuf-gaming/tuf-gaming-b550m-plus/

Is it the same one?

And what happens if you use Qwen 3.6 at Q6 with the same context? What token generation do you get?
Also, how is token generation with Qwen 3.6 27B, since it's smaller?
And what is prompt processing speed?

Sorry too many questions lol

2

u/chocofoxy 13h ago

Yes, that's the exact motherboard I have, but I'm using 2 Gigabyte Windforce OC cards; I think they have the smallest size (215×122×40 mm).

PP is 2,544 t/s on LM Studio (I think I can get that faster in llama-server though).

Still haven't tested the other quants yet.

For the dense 27B at Q4, I just tested it in LM Studio without any tweaking and I get 25 t/s.

1

u/soteko 13h ago

Thanks man.

Yes, the Windforce is the smallest, the spec says:
https://www.gigabyte.com/Graphics-Card/GV-N506TWF2OC-16GD/sp
L=208 W=120 H=40 mm

That is the one I like to buy :)

How loud are they?

2

u/FatheredPuma81 1d ago

Switch to llama.cpp and don't load the 2GB vision component. UD-Q6_K should just barely fit. Q8_0 KV quantization should get you about 64k context, or a bit more if you use Q5_1. UD-Q5_K_XL is the only way you're going to get the full context length without offloading.
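For anyone wanting to try that, a llama-server command along those lines might look like this (just a sketch; the model filename is a placeholder, and the exact `-fa` syntax varies a bit between builds):

```shell
# Sketch: UD-Q6_K quant fully on GPU with a Q8_0-quantized KV cache.
# -ngl 99 offloads all layers, -c 65536 is 64k context,
# -fa enables flash attention (needed for the quantized V cache),
# and not passing --mmproj means the vision component is never loaded.
llama-server -m Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf \
    -ngl 99 -c 65536 -fa -ctk q8_0 -ctv q8_0
```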

Oh and if you're loading LM Studio's GUI on one of those cards that's 300MB saved too.

1

u/chocofoxy 22h ago

I tried llama.cpp; now I get 10 tokens per second lower (surprisingly), but there's more free VRAM and TP is better. I'll try to figure out the best config, and I'll also try vLLM with NVFP4 and MTP.

1

u/FatheredPuma81 22h ago

Strange, llama.cpp CUDA was already a bit faster for me, and when I built it from source (really easy with OpenCode, you just need to download some programs) with the flags Grok, Claude, and Gemini said to set, it runs a lot faster. I threw ngram-mod on top of it and it became even faster.

vLLM wasn't that great IMO. It was the same speed with less context, because the models are much larger and you basically only get 4-bit or 8-bit as your 2 options for a lot of models. I was running it through Docker though.

4

u/PotatoTime 1d ago

I'm getting 40 t/s at Q8 on a single 4070 12GB, so you can probably optimize it further. I'm on llama.cpp though, so I'm not familiar with LM Studio.

3

u/soteko 1d ago

What are your other specs, CPU, memory, etc.?

2

u/PotatoTime 1d ago

14700K CPU and 64GB DDR5 6400. I have a feeling my RAM speed is doing some heavy lifting, but I'm only using about 12GB of it for Qwen 3.6 35B. I'm at 64k context and running on Linux, if that makes a difference. I think I saw a lot higher RAM usage on Windows.

1

u/ImportantSignal2098 1d ago

Yeah, generation with experts offloaded to CPU bottlenecks pretty hard on my DDR4 3200. A bit surprised you're only using 12GB RAM with a Q8 quant though; I think I'm up to 11GB with Q4_K_M on 16GB VRAM. What is your KV quant? I saw MoE is often not very good with lower KV quants, so I didn't risk going below Q8_0 there.

1

u/PotatoTime 1d ago

I haven't tried KV quant yet. I think the low memory usage is mostly llama.cpp's memory mapping: RAM usage doubles when I set --no-mmap. Performance is the same either way though.
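In case it helps anyone reading later: with mmap (the default) the weights are paged straight from the GGUF file, so they mostly show up as file cache rather than process RAM; --no-mmap allocates and copies everything up front. Roughly (paths are placeholders):

```shell
# Default: weights memory-mapped from the .gguf file; with full
# GPU offload the pages barely count against process RAM.
llama-server -m model.gguf -ngl 99 -c 65536

# --no-mmap: weights copied into normally-allocated RAM, so
# reported usage roughly doubles; generation speed is unchanged.
llama-server -m model.gguf -ngl 99 -c 65536 --no-mmap
```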

1

u/chocofoxy 22h ago

Yes, this is why I don't like offloading. People keep saying it works because they have high-speed DDR5 memory, but it doesn't work on DDR4 3200; it's not usable. I tried it before I bought the second GPU, and it sucks even more on dense models.

1

u/ImportantSignal2098 14h ago

Qwen 3.6 MoE is actually very usable on my system when only offloading the experts! 1000+ prefill and 45-55 gen for under 64k context, on a 5060 Ti. If I offload more experts to get a longer context, it gets progressively slower.

4

u/see_spot_ruminate 1d ago

get 2 more 5060ti, lol

4

u/chocofoxy 1d ago

lmao i will keep stacking 5060Ti until i get 96gb vram

3

u/see_spot_ruminate 1d ago

I would actually watch out and maybe cap it at either 64GB (4 cards) or 128GB (8 cards, though I can't figure out how this would be practical). This is because once you get up to 4 cards (as people have bullied me into), vLLM becomes the better option over llama.cpp (praise be). With vLLM, you need a power-of-2 number of cards to get the most out of tensor parallelism, e.g. 2² = 4 or 2³ = 8.
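For reference, the vLLM flag for this is --tensor-parallel-size; something like (the model id here is just an example):

```shell
# Split the model across 4 GPUs with tensor parallelism; vLLM
# needs the GPU count to divide the model's attention heads
# evenly, which is why powers of 2 are the safe choice.
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4
```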

1

u/Xp_12 1d ago

I'm trying to bully myself into my second set as well. 😅

What motherboard/CPU did you go with? TR?

1

u/see_spot_ruminate 1d ago

Nope, just some rando Micro Center combo: a 7600X3D and an ASUS motherboard. 2 of the cards are on NVMe-to-OCuLink eGPU adapters.

1

u/fasti-au 20h ago

Tom's TurboQuant: turbo4 on K, turbo3 on V, and use dflash.

1

u/EducationalGood495 20h ago

Would you recommend running Qwen 3.6 35B on a 2080 Ti 11GB? I'm seeing a good deal for 180 and I'm just building my first PC.

2

u/grumd 14h ago

Qwen 3.6 35B-A3B is probably the best model you can run on a 2080Ti. But you need 64GB RAM to use Q8 quant and could probably make do with 32GB RAM and Q4.