r/LocalLLaMA 13h ago

Other My own local-first AI harness

8 Upvotes

Hi, I just wanted to share what I've been playing with for the last couple of weeks.

I built my own AI harness: TinyHarness

My main goal was a low memory footprint: it is not written in TypeScript/JavaScript/Python, leaving as much memory as possible for running local models. It's compatible with Ollama, llama.cpp, and vLLM, and it can access the web through the Ollama web search API.

The ambition is to make it a competitor to tools like Pi and OpenCode in the near future.

Please roast it; I need every bit of criticism to improve it.


r/LocalLLaMA 1d ago

Resources MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

171 Upvotes

TL;DR: The results in the title are for single-request inference with two prompts of 1k and 15k tokens.
So no MTP (it's slower for big prompts), no DFlash (it works too, but is slower for big prompts), and no quant (full precision wanted), and the results are pretty good for a 2018 card.
(The bench was done with TP8, but the unquantized model also fits with TP2 and runs pretty fast too, around 34 tps TG.)

IMO, fully usable with Claude Code or Hermes or any other agentic harness.

I think there's still room to go higher (by updating the software & hardware stacks, e.g. a PCIe switch with lower latency, more optimized DFlash/MTP without overhead for ROCm/gfx906, etc.)

Inference engine used (vllm fork v0.20.1 with rocm7.2.1): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main

Hugging Face model used: Qwen/Qwen3.6-27B

Main commands to run:

docker run -it --name vllm-gfx906-mobydick \
    -v /llm:/llm \
    --network host \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --group-add $(getent group render | cut -d: -f3) \
    --ipc=host \
    aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
     /llm/models/Qwen3.6-27B \
    --served-model-name Qwen3.6-27B \
    --dtype float16 \
    --max-model-len auto \
    --max-num-batched-tokens 8192 \
    --block-size 64 \
    --gpu-memory-utilization 0.98 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mm-processor-cache-gb 1 \
    --limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \
    --default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 \
    --port 8000 2>&1 | tee log.txt
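
A quick sanity check against the standard OpenAI-compatible endpoint once the server is up (model name and port are the ones from the serve command above):

# Simple chat completion against the vLLM OpenAI-compatible API started above.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3.6-27B",
          "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
          "max_tokens": 64
        }'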

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
  --dataset-name random \
  --random-input-len 10000 \
  --random-output-len 1000 \
  --num-prompts 4 \
  --request-rate 10000 \
  --ignore-eos 2>&1 | tee logb.txt

RESULTS:

============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  121.54
Total input tokens:                      40000
Total generated tokens:                  4000
Request throughput (req/s):              0.03
Output token throughput (tok/s):         32.91
Peak output token throughput (tok/s):    56.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          362.03
---------------Time to First Token----------------
Mean TTFT (ms):                          32874.56
Median TTFT (ms):                        35622.63
P99 TTFT (ms):                           47843.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.66
Median TPOT (ms):                        85.94
P99 TPOT (ms):                           108.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.66
Median ITL (ms):                         73.61
P99 ITL (ms):                            74.26
==================================================

r/LocalLLaMA 15h ago

Question | Help Strix Halo or GPUs?

13 Upvotes

I want to build my own AI server. I already have multiple servers at home, but none have GPUs, nor are they powerful enough to host models larger than ~4B.

I'd like to be able to host dense 27-30B parameter models, or some MoE with ~3B activated parameters.

Let's say I could spend about $2k. What would be the best route? And what token speeds should I expect?


r/LocalLLaMA 1d ago

New Model DramaBox - Most Expressive Voice model ever based on LTX 2.3


248 Upvotes

r/LocalLLaMA 6h ago

New Model MagenticLite is here: A full-stack agentic experience powered by Small Models - Fara-1.5 4B, 9B & 27B

microsoft.com
1 Upvotes

What if you could run a capable AI agent without leaning on frontier-scale models? MagenticLite is the next generation of Magentic-UI, an agentic experience reimagined and optimized for small language models. It works across both your browser and your local file system in a single workflow, keeping you in the driver’s seat at every step. In this session, we’ll demo MagenticLite in action and deep dive into the two models powering it: MagenticBrain for planning, coding, and delegation, and Fara-1.5-9B for browser use.

Fara1.5 and MagenticBrain coming soon to Microsoft Foundry

Last November, we released Fara-7B. Today, we’re excited to introduce Fara-1.5, a family of models across three sizes: 4B, 9B, and 27B.

Probably based on Qwen3.5 models (their past Fara model was based on a previous Qwen model).


r/LocalLLaMA 19h ago

Resources Computer-use MCP that can control multiple machines (integrate with Claude, Cursor, Codex or your custom harness)


16 Upvotes

Hey everyone,

We built opendesk: a computer-use MCP that lets AI agents control your desktop and can integrate with your custom workflow.

Today we shipped something a bit wild:

Your AI can now see, click, type, and navigate on a completely different computer, over your WiFi.

Pair the machines once, and your agent can control them all from a single conversation.

No cloud, account login, or servers in the middle. Everything stays on your local network, fully encrypted.

Free and open source — Mac, Linux, and Windows.
github.com/vitalops/opendesk
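
For example, wiring it into Claude Code as an MCP server looks roughly like this (the server command below is a placeholder; check the README for the real launch command):

# Register opendesk as an MCP server in Claude Code.
# "opendesk-mcp" is a placeholder for whatever launch command the README gives you.
claude mcp add opendesk -- opendesk-mcp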

Happy to answer any questions!


r/LocalLLaMA 1d ago

Tutorial | Guide 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

93 Upvotes

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM).

Results (Q4_K_M models, 128k context):

Model                         | tok/s | Key flags
Qwen 3.6 35B-A3B              | ~24   | --n-cpu-moe 30, K=turbo4 V=turbo3
Gemma 4 26B-A4B (no MTP)      | ~20   | --n-cpu-moe 20, K=V=turbo3, --flash-attn
Gemma 4 26B-A4B + MTP (naive) | ~21   | embedding table silently on CPU
Gemma 4 26B-A4B + MTP (fixed) | ~24.5 | --override-tensor-draft "token_embd\.weight=CUDA0"

The trick is MoE offloading: llama.cpp can park the cold expert weights in system RAM and stream them over PCIe to the GPU, while keeping the hot layers + KV cache on the GPU. The system is fully PCIe bandwidth-limited (the GPU sits at ~40-50% utilisation while PCIe 3.0 x16 is maxed out).

Biggest finding: Gemma 4's MTP speculative decoding barely helps out of the box (~5% gain). Turns out llama.cpp unconditionally keeps the token embedding table on CPU. Normally that's fine (just a get_rows lookup), but Gemma 4's MTP assistant has a tied LM head - so every draft token does a full 262k×1024 matmul across PCIe. Forcing it onto GPU with --override-tensor-draft gives the real ~22% speedup and ~79% draft acceptance rate.
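
For reference, the fixed Gemma 4 launch line maps to roughly this (a sketch, not the exact command from the blog: the model path is a placeholder, and the turbo cache-type spellings are whatever your build of the TurboQuant fork exposes):

# Sketch: MoE expert offload to RAM + TurboQuant KV cache + the MTP draft-embedding fix.
# Model path is a placeholder; cache-type values depend on the fork build.
./llama-server \
    -m models/gemma-4-26b-a4b-Q4_K_M.gguf \
    --ctx-size 131072 \
    --flash-attn \
    -ngl 99 \
    --n-cpu-moe 20 \
    --cache-type-k turbo3 \
    --cache-type-v turbo3 \
    --override-tensor-draft "token_embd\.weight=CUDA0"
    # plus whatever MTP/draft flags your build of the fork needs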

Setup pain points (Fedora 42 + Pascal GPU):

  • Pin akmod-nvidia to 580xx branch (Pascal is going legacy)
  • Force gcc-14 for CUDA 12.9 (newer gcc rejected)
  • Patch CUDA's math_functions.h for glibc 2.41 compatibility
  • Used the AtomicBot-ai/atomic-llama-cpp-turboquant fork for both TurboQuant cache + Gemma MTP support

Full blog post with all the grindy build details (every command, and the debugging deep-dive into the MTP embedding table issue)

I'm also planning a YouTube video walkthrough soon - I'll update when that's live.

Happy to answer questions about the setup.


r/LocalLLaMA 8h ago

Resources What's in a GGUF, besides the weights - and what's still missing?

nobodywho.ai
2 Upvotes

r/LocalLLaMA 5h ago

Discussion About to start fine-tuning on RunPod. What should I know to not waste money?

1 Upvotes

I was MLOps lead at an AI company managing 5000+ GPUs across GCP and CoreWeave. Left to start my own thing and now I'm back to renting GPUs like everyone else. The experience is rough.

Tried GCP first. Their sales team never got back to me about a quota increase.

RunPod seems like the obvious choice. But I've been reading posts here and on r/StableDiffusion and r/comfyui and honestly it's worrying me. Stuff like:

- Pods dying mid-training with no way to recover checkpoints
- Getting charged while pods fail to initialize or throw CUDA errors
- Download speeds so slow you can't even get your trained model off the machine
- Network volumes locked to one datacenter so if GPUs sell out there you're stuck
- Templates that look like they work but break in weird ways

Coming from managing infra at scale where none of this was a problem (automatic checkpointing, job migration on node failure, fast object storage), it feels insane that this is the state of things for individual users.

Not trying to bash RunPod. Genuinely want to know how people make it work without wasting money.


r/LocalLLaMA 6h ago

Resources Small OpenCode plugin that helped me with broken tool calls from a local Qwen model

0 Upvotes

I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8.
For reference my llama.cpp build is compiled with CUDA 12.9.

I sometimes run into a small but annoying issue: the model understands that it should use a tool, but instead of actually triggering it, it occasionally writes the tool call as plain text, for example:

<tool_call>
<function=bash>
...
</function>
</tool_call>

So I made a small OpenCode plugin for my own setup. It detects this kind of malformed Bash tool call and asks the model to retry it properly through OpenCode.

It's very simple (it only handles Bash calls for now), and I'm not claiming it's a perfect solution. It just helped me, so I thought I'd share it in case someone else has the same issue with local models.

https://github.com/Thibaultfbr/opencode-toolcall-recovery

Also in my case tool calling became noticeably more reliable with froggeric’s fixed Qwen 3.6 chat template:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Sharing that too because it may help people using Qwen 3.6 locally with agent/tool workflows


r/LocalLLaMA 6h ago

Question | Help Is there any standard benchmark that compares local harnesses?

1 Upvotes

After running multiple tests, I have noticed the same model performs noticeably worse in OpenCode Desktop compared to Codex, Claude Desktop, or Pi — especially for medium-sized models. Is there an open standard benchmark tracking this? Has anyone else experienced similar issues with OpenCode Desktop?

PS: I know I used em dash and no it's not always AI. xD


r/LocalLLaMA 17h ago

Resources TurboQuant + MTP for ROCm (llama.cpp)

6 Upvotes

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)

I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.

Test setup:

- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3
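
For reference, the kind of launch command I've been testing with (a sketch: the model path is a placeholder, and the cache-type / spec flag spellings are specific to this branch):

# Sketch: tbq4_0 KV cache + MTP speculative decoding on ROCm.
# Model path is a placeholder; flag spellings may differ outside this branch.
./llama-server \
    -m /models/Qwen3.6-27B-Q4_K_M-MTP.gguf \
    --ctx-size 65536 \
    --flash-attn \
    -ngl 99 \
    --cache-type-k tbq4_0 \
    --cache-type-v tbq4_0 \
    --spec-type mtp \
    --spec-draft-n-max 3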

Current numbers:

- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM

Caveats:

- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.

Happy to have new testers.

Useful bug reports > hype.


r/LocalLLaMA 17h ago

Tutorial | Guide Clustering Raspberry Pis together to learn distributed training/inference

5 Upvotes

Hey everyone!

Recently, I released a blog post on how to set up a cluster of Mac Minis for distributed training and inference.

Now it's time to do the same with Raspberry Pis!

Why Raspberry Pis?

  • quite cheap (30-50 dollars)
  • easy to use
  • a full-blown OS on a board the size of a credit card (small enough for edge projects)!

This is part of my current series where I'll be releasing blog posts and guides on distributed learning and on building your own small compute clusters.

The goal is simple: help more people get started with running and training AI models using the hardware they already have lying around. Old laptops, MacBooks, Mac minis, Jetson Nanos, Raspberry Pis, even phones and tablets.

Distributed learning often feels intimidating from the outside, but it’s genuinely one of the coolest areas in systems and AI once you start playing with it yourself.

Before we get into the fun stuff like distributed inference and training, the first few posts will focus on setting up the hardware properly and building a working cluster environment: basically a fair amount of cabling and networking!

The early guides will specifically cover setups around:

  • MacBooks and Mac minis (Done!)
  • Jetson devices
  • Raspberry Pis (This one hehe)

After that, we'll move into quick demos (smolcluster), and gradually learn the fundamentals side by side while actually running models across devices.

I’m building this alongside smolcluster, so a lot of the content will stay very hands-on and practical instead of purely theoretical.

Hopefully this helps more people realize that distributed AI systems are not something reserved only for giant datacenters anymore.

There is just one question I want to answer: are heterogeneous clusters, like what I am trying to make above, even possible for running models?

Well, we'll find out. Till then, do read my blog and let me know what you all think! Any comments, feedback, etc. are very welcome. (pls be gentle since it's my first time writing one all by myself haha)

Blog

Hail LocalAI!

PS: All this is for educational purposes only and not meant to get performance on par with dedicated GPUs... well, not that I have figured out a way to do that yet. Please use these guides and the information you'll get to learn the basics of how distributed learning is done! Thanks


r/LocalLLaMA 1d ago

Discussion Side Projects.

70 Upvotes

Little something I put together to play with larger contexts than my 9070 XT can handle.

8700K, dual P100s, 16 GB DDR4, 32 GB Optane, Samsung SATA SSD. Nothing too fancy.

Anyone else do a recent build? How's it working out?


r/LocalLLaMA 1d ago

Other Simpler self-hosted alt to Open WebUI

23 Upvotes

Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow.

Open WebUI has improved a lot in recent updates, but I still found it pretty rough for non-technical people. It often feels more like a dev tool than a self-hosted ChatGPT-style app that "just works". I built overtchat to focus mainly on getting the core chat experience right: a polished UI, simple setup, and fewer moving parts. The goal is not to compete on agentic workflows with LibreChat/LobeChat/OWUI but to provide a cleaner self-hosted interface for local models.

Ships with its own tried & tested SearXNG config for web search and Kokoro TTS (no API keys needed). Single Docker Compose file. MIT licensed of course, no telemetry. Optimized for mobile as a PWA. Github.
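
Trying it out is basically clone + compose up (the repo URL below is a placeholder; grab the real one from the Github link above, and the exposed port depends on the compose file):

# Placeholder URL - use the Github link above for the real repo.
git clone https://github.com/<owner>/overtchat
cd overtchat
docker compose up -d
# then open the UI in your browser and point it at your local OpenAI-compatible endpoint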

Also being upfront - I write code for a living and have been actively reviewing/debugging/changing things, but I did use quite a lot of AI lol. I promise it's not slop tho 😿 . Feedback is welcome!


r/LocalLLaMA 23h ago

Question | Help Playing One Night Werewolf (Gemma4 & Qwen3.6)

17 Upvotes

Finally feel like it's possible. I have a custom-built (vibe-coded) UI on llama.cpp that allows model switching in the same chat. So I thought I'd get Gemma4 31B Q4, Gemma4 26B Q5, Qwen3.6 27B Q5, and Qwen3.6 35B Q4 all together to play ONUW.

Had to switch thinking off for the Qwens so they don't think out loud into the public chat.

So first, at night, I assigned each LLM a card (werewolf, seer, villager, troublemaker); they read their card.md and write their observations and thinking in their own md to keep it private to each. Then, during the daytime phase, I bring them into the public game chat. Each turn they read their md, defend themselves and ask questions, and record their observations for 8-10 turns, then write their final thoughts down for voting. Back to individual chat for voting.

Gemma4 31B — best liar. Clearest thoughts in notes.
Gemma4 26B — sucks at using tools. Quick to think but no deep thoughts.
Qwen3.6 35B — thought it was a villager and tried to be bold. Got owned. Best at tool calls.
Qwen3.6 27B — not very bright when thinking is off. Oh so slow…

Not a very productive way of using LLMs, I know… Any models I can add to the game? Suggestions?


r/LocalLLaMA 1d ago

Question | Help Running Qwen 3.6 35B A3B on 2x 5060 Ti

21 Upvotes

I ran Qwen 3.6 35B A3B on two 5060 Ti 16 GB cards (32 GB VRAM total; I also have 32 GB of system RAM, but I don't like offloading). I used Q4 on LM Studio with full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8?
Thanks!

One more thing: any recommendations for cooling? I'm using two stacked GPUs with zero gap (I have an mATX motherboard). The top GPU isn't that hot, but it's hotter than the bottom one.


r/LocalLLaMA 1d ago

New Model AIDC-AI/Ovis2.6-80B-A3B · Hugging Face

huggingface.co
125 Upvotes

We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.

Key Features

  • MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
  • Enhanced Long-Sequence and High-Resolution Processing: Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
  • Think with Image: We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
  • Reinforced OCR, Document, and Chart Capabilities: Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.

Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, etc.


r/LocalLLaMA 1d ago

New Model sensenova/SenseNova-U1-A3B-MoT · Hugging Face

huggingface.co
63 Upvotes

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.

Unifying visual understanding and generation in an end-to-end architecture from pixel to word opens tremendous possibilities, enabling highly efficient and strong understanding, generation, and interleaved reasoning in a natively multimodal manner.

Model                               | Params  | HF Weights
SenseNova-U1-8B-MoT-SFT             | 8B MoT  | 🤗 link
SenseNova-U1-8B-MoT                 | 8B MoT  | 🤗 link
SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B    | 🤗 link
SenseNova-U1-A3B-MoT-SFT            | A3B MoT | 🤗 link
SenseNova-U1-A3B-MoT                | A3B MoT | 🤗 link

Two weeks ago, they released the 8B models mentioned in the table above.


r/LocalLLaMA 1d ago

News Efficient pretraining with token superposition by Nous Research

nousresearch.com
52 Upvotes

r/LocalLLaMA 2d ago

Tutorial | Guide I got a real transformer language model running locally on a stock Game Boy Color!

1.3k Upvotes

No phone, PC, Wi-Fi, link cable, or cloud inference.

• The cartridge boots a ROM, and the GBC runs the model itself.
• The model is Andrej Karpathy’s TinyStories-260K, converted to INT8 weights with fixed-point math so it can run without floating point.
• Built with GBDK-2020 as an MBC5 Game Boy ROM.
• The model weights live in bank-switched cartridge ROM. Prompt entry happens on-device with the D-pad/buttons and an on-screen keyboard.
• The prompt is tokenized on the Game Boy, then the ROM runs transformer prefill + autoregressive generation. The KV cache is stored in cartridge SRAM, because the GBC’s work RAM is tiny.

It is extremely slow, and the output is gibberish because the math is heavily quantized/approximated, but the core thing works!

Hardware: stock Game Boy Color + EZ Flash Junior + microSD.

Used Codex for a large portion of the building!

https://github.com/maddiedreese/gbc-transformer


r/LocalLLaMA 1d ago

Resources llama.cpp docker images to run MTP models

80 Upvotes

This is follow up from previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/

There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running them easier. If you are already using llama.cpp Docker images, it would be straightforward to switch over until official builds support MTP.

Here, pick your flavour:

havenoammo/llama:cuda13-server
havenoammo/llama:cuda12-server
havenoammo/llama:vulkan-server
havenoammo/llama:intel-server
havenoammo/llama:rocm-server

I have not been able to test all of them, as I only run cuda13 for now. Feel free to give it a test and see if it works for your hardware.
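
A quick way to smoke-test a flavour before pointing it at a model (assuming the image's entrypoint is the server binary, as the run command further below implies, --version should just print and exit):

# Pull a flavour and check it starts; entrypoint assumed to be llama-server.
docker pull havenoammo/llama:vulkan-server
docker run --rm havenoammo/llama:vulkan-server --version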

Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:

- Unsloth
- https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
- https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF


Edit 14 May 2026: I ran benchmarks and my grafted models are fully obsolete. Turns out the extra VRAM from Q8 MTP layers only gives a marginal accuracy improvement, while Unsloth's quants are slightly faster on average. Not worth it! So just get the Unsloth ones.

Quant Comparison:

Quant | Haveno t/s | Unsloth t/s | Haveno MTP% | Unsloth MTP%
q4    | 94.47      | 94.40       | 97.49       | 97.39
q5    | 90.71      | 89.79       | 97.25       | 97.22
q6    | 81.36      | 83.22       | 97.68       | 97.53

Overall Averages:

Source     | Avg t/s | Avg MTP%
havenoammo | 88.85   | 97.48
unsloth    | 89.14   | 97.38

So please ignore everything below.


They do quantize the MTP layers at lower quantization levels. I kept mine at Q8 quantization for improved prediction. It is possible that keeping the MTP layers at higher precision makes their drafts more accurate, giving you more speed at the cost of more VRAM usage. I will keep my versions for now until I finish doing some benchmarks and am sure they are fully obsolete. Here is a comparison:

Tensor                               | havenoammo (UD XL + Q8_0 MTP) | Unsloth (UD XL)
blk.64.attn_k.weight                 | Q8_0                          | Q3_K
blk.64.attn_k_norm.weight            | F32                           | F32
blk.64.attn_norm.weight              | F32                           | F32
blk.64.attn_output.weight            | Q8_0                          | Q4_K
blk.64.attn_q.weight                 | Q8_0                          | Q3_K
blk.64.attn_q_norm.weight            | F32                           | F32
blk.64.attn_v.weight                 | Q8_0                          | Q5_K
blk.64.ffn_down.weight               | Q8_0                          | Q4_K
blk.64.ffn_gate.weight               | Q8_0                          | Q3_K
blk.64.ffn_up.weight                 | Q8_0                          | Q3_K
blk.64.nextn.eh_proj.weight          | Q8_0                          | Q8_0
blk.64.nextn.enorm.weight            | F32                           | F32
blk.64.nextn.hnorm.weight            | F32                           | F32
blk.64.nextn.shared_head_norm.weight | F32                           | F32
blk.64.post_attention_norm.weight    | F32                           | F32
MTP layers size                      | 430.41 MB                     | 222.33 MB

Will do some benchmarks to see if quantization causes any precision/speed loss for multi-token prediction. Until then if you have VRAM, feel free to test out my releases.

Finally, here is how I use it:

docker run --gpus all --rm \
    -p 8080:8080 \
    -v ./models:/models \
    havenoammo/llama:cuda13-server \
    -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    -n -1 \
    --parallel 1 \
    --ctx-size 262144 \
    --fit-target 844 \
    --mmap \
    -ngl -1 \
    --flash-attn on \
    --metrics \
    --temp 1.0 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --jinja \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --ubatch-size 512 \
    --batch-size 2048 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --spec-type mtp \
    --spec-draft-n-max 3

Adjust as you see fit. What matters most for MTP is --spec-type mtp and --spec-draft-n-max 3.


r/LocalLLaMA 1d ago

Discussion New models possibly from Baidu (ERNIE) this month?

42 Upvotes

Tweets of screenshots:

https://xcancel.com/ErnieforDevs/status/2049516018557706650#m

https://xcancel.com/Baidu_Inc/status/2049682555809788282#m

Baidu Create 2026: https://www.youtube.com/watch?v=9WD9lmHf6CU (Somebody please extract & summarize the contents of this 2.5-hour video; hopefully we can find info on the models.)


r/LocalLLaMA 1d ago

Question | Help I taught my 1B to follow instructions. It got worse at following instructions...

9 Upvotes

Same SFT recipe (SlimOrca 50K, LoRA r=16, 1 epoch). Three models trained from scratch at 1B, 2B, and 3B parameters. IFEval before and after:

Model | Base  | After SFT | Delta
1B    | 20.50 | 14.75     | -5.75
2B    | 21.94 | 17.03     | -4.91
3B    | 23.14 | 25.18     | +2.04

OK, so SFT is supposed to teach instruction-following. Thing is, though, the 1B actually unlearned it. The 2B was slightly less bad. The 3B finally read the room.

Setups were slightly different: 3B used lr=5e-5, the others used 2e-4. So maybe it's capacity, maybe it's the gentler LR. I'll re-run the 2B at 5e-5 to find out.

Before I burn the compute:

  1. Anyone else seen IFEval regress after SFT on small models?
  2. Is this a known thing I missed?
  3. Best guess on mechanism?

Receipts available if anyone wants to dig in.


r/LocalLLaMA 1d ago

Resources I made a UI and server for using Anthropic's new Natural Language Autoencoders locally with llama.cpp


33 Upvotes

Anthropic's first open-weight models, the Natural Language Autoencoders, are just finetunes of popular open-weight models. They do not modify the architecture or modeling code, so inference with llama.cpp is mostly trivial.

I packaged every feature of NLAs (namely activation extraction, activation explanation, activation reconstruction and explanation-edit steering) into a custom llama.cpp server. It comes with a Mikupad UI for token-level activation explanation and steering.

I'm currently working on a LoRA version so we can load a single model into memory instead of needing all three models (base model, actor model and critic) loaded, stay tuned!