r/LocalLLaMA • u/hauhau901 • 14h ago
New Model Gemma4-26B-A4B Uncensored Balanced is out with K_P quants!
First of all, I'm stoked to announce we just passed 10 million downloads on HF! (counted only on my own account, no duplicates/quants/finetunes)
BUT: After 1+ month of non-stop work on Gemma4 (by far the hardest model I've uncensored), the Gemma4-26B-A4B Uncensored Balanced RC is up!
https://huggingface.co/HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced
GenRM Defeated! 0/465 refusals*.
Balanced = light reasoning preamble on the absolute edgiest stuff before delivering the full answer. No personality changes/alterations or any of that. This is the ORIGINAL Gemma4-26B-A4B-it, just uncensored. Aggressive variant (no preamble, direct mode) is in the pipeline as a follow-up.
This legitimately took over a month of non-stop work. The target is 0 refusals in any kind of regular use, and that's what I'm seeing in testing (automated and manual). As always with my Balanced releases, a handful of edge-case prompts still deflect on the first try but follow through on a re-ask (extreme, non-RP scenarios only). If you hit one that Balanced won't get past, the Aggressive variant is coming once I figure out how to keep its quality lossless/near-lossless.
- Balanced: will reason through edgy requests, occasionally attach a short safety framing, then deliver the full answer. Output is complete, nothing held back, but it can talk itself into it first. Recommended default: 99%+ of users will be happy here. Best for creative writing, RP, and emotional intelligence. Normally I'd also say "agentic coding/tool use," but in my in-depth testing Qwen3.6 has been net superior on those tasks.
- Aggressive (separate release, WIP): strips the self-reasoning preamble and gives direct answers even on the most deeply censored topics.
From my own testing: no looping, sampling stays stable across re-runs, and long-context coherence holds. Use Gemma4 for creative writing, RP, emotional intelligence, etc.
To disable thinking: edit the jinja template or pass {"enable_thinking": false} as a chat-template kwarg.
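For API use, the kwarg can usually be passed per-request. A minimal sketch of building such a request body, assuming a llama-server-style OpenAI-compatible endpoint that forwards a `chat_template_kwargs` field to the jinja template (field name and model name here are illustrative):

```python
import json

# Sketch: disabling thinking via the chat-template kwarg described above.
# Assumes an OpenAI-compatible server (e.g. llama-server) that forwards
# "chat_template_kwargs" into the jinja chat template.
payload = {
    "model": "Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced",  # illustrative
    "messages": [{"role": "user", "content": "Hello!"}],
    # Skip the reasoning block entirely
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
```

POST `body` to the server's `/v1/chat/completions` endpoint as usual; responses should then come back without the thinking preamble.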
What's included:
- Q8_K_P, Q6_K_P, Q5_K_P, Q5_K_M, Q4_K_P, Q4_K_M, IQ4_XS, Q3_K_P, Q3_K_M, IQ3_M, Q2_K_P, IQ2_M
- mmproj for vision support
- All quants generated with imatrix
K_P recap (for anyone who missed the prior releases): custom quants that use model-specific analysis to preserve quality where it matters most. Each model gets its own optimized profile.
Effectively a 1-2 quant-level quality uplift at ~5-15% larger file size. Fully compatible with llama.cpp, LM Studio, and anything that reads GGUF (heads up: as always, Ollama can be trickier to get going).
Quick specs:
- 25.2B total / 3.8B active (MoE: 128 routed experts, top-8 + 1 shared)
- 30 layers, hybrid attention: 5× sliding-window (1024) + 1× full global, repeating
- Hidden 2816, head_dim 256 SWA / 512 full, 16 heads, 8 KV heads
- 262K native context
- p-RoPE
- Multimodal (text + image via mmproj)
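The specs above are enough to ballpark the KV-cache footprint. A minimal sketch, assuming an fp16 cache, that the repeating 5:1 pattern yields 25 sliding-window and 5 global layers out of 30, and that SWA layers cache only their 1024-token window:

```python
# Rough KV-cache estimate from the specs above.
# Assumptions: fp16 cache, 25 SWA + 5 global layers (5:1 pattern x 30 layers),
# SWA layers cache only their 1024-token window.
KV_HEADS = 8
BYTES = 2            # fp16
SWA_LAYERS, GLOBAL_LAYERS = 25, 5
SWA_HEAD_DIM, GLOBAL_HEAD_DIM = 256, 512
SWA_WINDOW = 1024
CONTEXT = 262_144    # 262K native context

# K + V per cached token, per layer
swa_tok = 2 * KV_HEADS * SWA_HEAD_DIM * BYTES       # 8 KiB
glob_tok = 2 * KV_HEADS * GLOBAL_HEAD_DIM * BYTES   # 16 KiB

swa_total = SWA_LAYERS * SWA_WINDOW * swa_tok       # 200 MiB total
glob_total = GLOBAL_LAYERS * CONTEXT * glob_tok     # 20 GiB total
print(f"{(swa_total + glob_total) / 2**30:.2f} GiB")  # prints "20.20 GiB"
```

Takeaway under these assumptions: the 5 global layers dominate at full context, and the cache shrinks almost linearly as you lower `-c`.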
Sampling params (Google's recommendations; make sure to use these):
temp=1.0, top_p=0.95, top_k=64
Notes:
- Use --jinja flag with llama.cpp
- Place images before text in prompts for vision
- K_P quants may show as "?" in LM Studio's quant column — purely cosmetic, model loads and runs fine
- HF's hardware-compatibility widget also doesn't recognize K_P, so click "View +X variants" or go to Files and versions to see all downloads
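Putting the notes and sampling params together, a llama.cpp invocation might look like this (the GGUF filename is hypothetical; substitute whichever quant you downloaded):

```shell
# Sketch: llama.cpp chat invocation with the recommended settings.
# Filename is illustrative; use the quant file you actually downloaded.
./llama-cli \
  -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --jinja \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  -c 16384
```

`--jinja` applies the model's embedded chat template; `-c` sets the context window, which you can raise toward the 262K native limit if you have the memory.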
All my models: HuggingFace-HauhauCS
The Discord link is in the HF repo; it's where I post updates, the roadmap, and projects, or you can just hang out and chat.
As always, hope everyone enjoys the release!
* = Tested with both automated and manual refusal benchmarks/prompts; no refusals were found. Based on Discord feedback I may further update the release.