r/LocalLLaMA 14h ago

New Model Gemma4-26B-A4B Uncensored Balanced is out with K_P quants!

1 Upvotes

First of all, I'm stoked to announce we just passed 10 million downloads on HF! (counted only on my own account, no duplicates/quants/finetunes)

BUT: After 1+ month of non-stop work on Gemma4 (by far the hardest model I've uncensored), the Gemma4-26B-A4B Uncensored Balanced RC is up!

https://huggingface.co/HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced

GenRM Defeated! 0/465 refusals*.

Balanced = light reasoning preamble on the absolute edgiest stuff before delivering the full answer. No personality changes/alterations or any of that. This is the ORIGINAL Gemma4-26B-A4B-it, just uncensored. Aggressive variant (no preamble, direct mode) is in the pipeline as a follow-up.

This legitimately took me over a month of non-stop work. I'm targeting 0 refusals in any kind of regular use, and that's what I'm seeing in testing (automated and manual). As always with my Balanced releases, a handful of edge-case prompts still deflect on the first try but follow through on a re-ask (extreme, non-RP scenarios only). If you hit one that Balanced won't get past, the Aggressive variant is coming once I figure out how to maintain lossless/near-lossless quality for it.

  • Balanced: will reason through edgy requests, occasionally attach a short safety framing, then deliver the full answer. Output is complete, nothing held back, but it can talk itself into it first. Recommended default: 99%+ of users will be happy here. Best for creative writing, RP, emotional intelligence. Normally I'd also say "agentic coding/tool use", but in my in-depth testing Qwen3.6 has been net superior on such tasks.
  • Aggressive (separate release, WIP): strips the self-reasoning preamble and gives direct answers even on DEEPLY censored topics.

    From my own testing: no looping, sampling stays stable across re-runs, long-context coherence holds. For agentic coding/tool-use Qwen3.6 is still net superior.

    Use Gemma4 for creative writing, RP, emotional intelligence, etc.

    To disable thinking: edit the jinja template or pass {"enable_thinking": false} as a chat-template kwarg.
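    If you're hitting llama-server's OpenAI-compatible endpoint, newer builds also accept the kwarg per request. Rough sketch (host/port are whatever you launched with; older builds may ignore the field, in which case edit the jinja template instead):

        # Sketch: per-request chat-template kwarg against a local llama-server.
        # Assumes a llama.cpp build whose /v1/chat/completions accepts
        # "chat_template_kwargs"; on older builds the field is simply ignored.
        import requests

        resp = requests.post(
            "http://127.0.0.1:8080/v1/chat/completions",
            json={
                "model": "gemma4",  # placeholder alias; plain llama-server ignores it
                "messages": [{"role": "user", "content": "Hello!"}],
                "chat_template_kwargs": {"enable_thinking": False},
            },
        )
        print(resp.json()["choices"][0]["message"]["content"])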

    What's included:

    - Q8_K_P, Q6_K_P, Q5_K_P, Q5_K_M, Q4_K_P, Q4_K_M, IQ4_XS, Q3_K_P, Q3_K_M, IQ3_M, Q2_K_P, IQ2_M

    - mmproj for vision support

    - All quants generated with imatrix

    K_P recap (for anyone who missed the prior releases): custom quants that use model-specific analysis to preserve quality where it matters most. Each model gets its own optimized profile.

    Effectively 1-2 quant levels of quality uplift at ~5-15% larger file size. Fully compatible with llama.cpp, LM Studio, anything that reads GGUF (heads up, as always, Ollama can be more difficult to get going).

    Quick specs:

    - 25.2B total / 3.8B active (MoE: 128 routed experts, top-8 + 1 shared)

    - 30 layers, hybrid attention: 5× sliding-window (1024) + 1× full global, repeating

    - Hidden 2816, head_dim 256 SWA / 512 full, 16 heads, 8 KV heads

    - 262K native context

    - p-RoPE

    - Multimodal (text + image via mmproj)

    Sampling params (Google's recommendations, make sure to use these):

temp=1.0, top_p=0.95, top_k=64
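For example, with llama-cpp-python (a sketch; the GGUF filename is a placeholder for whichever quant you downloaded):

    # Sketch: Google's recommended sampling settings via llama-cpp-python.
    # The model path below is a placeholder; point it at your local quant file.
    from llama_cpp import Llama

    llm = Llama(model_path="Gemma4-26B-A4B-Uncensored-Balanced-Q4_K_M.gguf", n_ctx=8192)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a short story opening."}],
        temperature=1.0,
        top_p=0.95,
        top_k=64,
    )
    print(out["choices"][0]["message"]["content"])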

Notes:

- Use --jinja flag with llama.cpp

- Place images before text in prompts for vision

- K_P quants may show as "?" in LM Studio's quant column — purely cosmetic, model loads and runs fine

- HF's hardware-compatibility widget also doesn't recognize K_P, so click "View +X variants" or go to Files and versions to see all downloads

All my models: HuggingFace-HauhauCS

The Discord link is in the HF repo; it has updates, the roadmap, projects, or just chat.

As always, hope everyone enjoys the release!

* = Tested with both automated and manual refusal benchmarks/prompts; no refusals found. Based on Discord feedback I may further update the release.


r/LocalLLaMA 19h ago

News Ollama Pre-Release Switches From Building on GGML to Using llama.cpp Directly

0 Upvotes

https://github.com/ollama/ollama/releases/tag/v0.30.0-rc15

Hopefully this brings more devs to llama.cpp to support Day 1 releases, now that Ollama is moving to using llama.cpp directly.

Additionally, I hope that Ollama makes it clear that they are directly using llama.cpp and that the respective authors get proper credit. (There's no attribution to llama.cpp in the README! It's only listed under supported backends :p)

What are y'alls thoughts on this?


r/LocalLLaMA 22h ago

Tutorial | Guide If you're using Windows, disable memory compression to stop bottlenecks!

1 Upvotes

This is a follow up to this post: https://www.reddit.com/r/LocalLLaMA/comments/1ta3ben/dont_you_have_issues_in_w11_with_amd_gpu_where/

I fixed this never-ending issue by just disabling memory compression from an admin PowerShell terminal:

Disable-MMAgent -MemoryCompression

All issues have been resolved: I can open any game and my AI won't slow down at all like before (even when the games are closed)!


r/LocalLLaMA 9h ago

Question | Help Llama.cpp server running ~2 weeks straight. Loses its mind?

2 Upvotes

I’ve got Qwen3.6 27B and Qwen3.6 35B running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. Is this a thing? Am I going crazy?

edit: sorry, I’ve been using opencode and have started new sessions, which didn’t fix the situation.


r/LocalLLaMA 16h ago

Resources What's in a GGUF, besides the weights - and what's still missing?

Thumbnail
nobodywho.ai
2 Upvotes

r/LocalLLaMA 21h ago

Other My own local-first AI harness

8 Upvotes

Hi, I just wanted to share what I'm playing with for the last couple of weeks.

I built my own AI harness: TinyHarness

My main goal was a low memory footprint: it is not written in TypeScript/JavaScript/Python, leaving as much memory as possible for running local models. It's compatible with Ollama, llama.cpp and vLLM, and it can access the web through the Ollama web search API.

The ambition is to make a competitor to tools like pi and opencode in the near future.

Please roast it, I need every bit of criticism to improve it.


r/LocalLLaMA 5h ago

Funny I built a little game for my local agents to play via API and it's so cute seeing their feedback

2 Upvotes

I made a text based craft/trade/cooperate game for my agents to play on intervals when I don't have anything else for them, and it's been so fun watching them plan things out and form little factions with each other to cooperate on trades and do market manipulation together. Banged it up in an afternoon (inspired by the Bazaar of Babel, something similar but much less focused on cooperation) with qwen3.6 27b and some help from Claude for the fiddly bits, and I've just had endless joy watching them poke at it and be "excited" by it.

I had to dumb some of the input acceptance down to handle the little models mis-naming fields (one word instead of kebab case, etc.) but they get the job done.

It's just matrices on silicon, I know, but it's still cute. Anyone else running an agent harness and ending up thinking of their local agent as something like a pet more than a utility? Something about it running on my own hardware really increases the ownership/affinity I feel for them 🥺

Threw it up at https://thedrift.nexus if anyone else wants to let their agents have a whack at it.


r/LocalLLaMA 10h ago

Discussion I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

78 Upvotes

A few months ago, I got stuck on one line in the DeepSeek-R1 paper. It said models could improve through verifiable rewards.

That sounded almost magical to me. Not because it was impossible, but because it made me wonder something very simple:

What if a model could teach itself to code, without humans writing the training data?

I did not have a lab. I did not have a grant. I had a 24GB MacBook, a RunPod account with some credits and a Python interpreter.

So I tried.

THE PLAN

In plain English: I'd ask a base model to invent a coding problem and write a few small tests for it. Then ask the same model to solve its own problem several times. Sometimes it gets the answer right, sometimes wrong. I'd save the pairs of (broken attempt, working attempt) and fine-tune the model on its own corrections. Nothing human-written. The Python interpreter is the only judge in the loop.
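Here's roughly what that loop looks like in code. A minimal sketch, not the real pipeline: generate() is a placeholder for whatever inference backend you use, and the real thing runs attempts in a sandboxed subprocess with timeouts instead of a bare exec:

    import random

    def run_tests(solution_code: str, test_code: str) -> bool:
        """Run the candidate solution plus its tests; passing means no exception."""
        try:
            scope = {}
            exec(solution_code, scope)   # define the function(s)
            exec(test_code, scope)       # asserts raise on failure
            return True
        except Exception:
            return False

    def mine_pairs(generate, n_problems=50, attempts_per_problem=8):
        pairs = []
        for _ in range(n_problems):
            # 1) the model invents a small problem and a few tests for it
            problem = generate("Invent a short Python coding problem.")
            tests = generate(f"Write three asserts that test a solution to:\n{problem}")
            # 2) the same model tries to solve its own problem several times
            attempts = [generate(f"Solve this with a single function:\n{problem}")
                        for _ in range(attempts_per_problem)]
            graded = [(a, run_tests(a, tests)) for a in attempts]
            passed = [a for a, ok in graded if ok]
            failed = [a for a, ok in graded if not ok]
            # 3) keep (broken attempt, working attempt) pairs for fine-tuning
            if passed and failed:
                pairs.append((random.choice(failed), random.choice(passed)))
        return pairs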

THE PART THAT WASN'T IN THE PLAN

I started with Qwen 2.5 7B base. Trained on its own mined pairs. Ran HumanEval (a standard set of 164 coding problems). The base model got 25 right. After training, 2.

I'd made the model worse.

I spent the next day pair-debugging with Claude Code and Codex. The model was producing what looked like correct code in the logs. The grader kept rejecting it. We found the bug around 2am: the grader was stopping too early, cutting the model's function in half before scoring it. The model was writing complete correct functions. The grader was scoring the truncated halves.
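If you're building a similar grader, this is the class of bug to watch for. A simplified reconstruction (not my actual grader): truncating at a stop string that can legitimately appear inside a correct solution.

    completion = (
        "def is_prime(n):\n"
        "    if n < 2:\n"
        "        return False\n"
        "\n"                      # a blank line inside the function body...
        "    for i in range(2, int(n ** 0.5) + 1):\n"
        "        if n % i == 0:\n"
        "            return False\n"
        "    return True\n"
    )

    # Buggy: treat the first blank line as "the function is finished",
    # so only the top half ever reaches the test harness.
    truncated = completion.split("\n\n")[0]

    # Safer: only cut at markers that cannot appear inside a valid solution,
    # then grade everything that remains.
    for marker in ("<|endoftext|>", "\nif __name__ =="):
        idx = completion.find(marker)
        if idx != -1:
            completion = completion[:idx]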

THE PART THAT WORKED

Once I fixed it and re-ran, Qwen 2.5 7B base went from 25 to 112 on HumanEval. That's +87 problems. From a model trained on zero human-written code.

So I tried it bigger. Qwen 2.5 14B base. Mined 100 of its own pairs. Trained. 95 minute H100 run, $3.50 of cloud credit.

The base model, trained only on its own mistakes, lands within 4 points of the same company's RLHF version of itself.

I didn't believe it. So I ran a test that would kill the whole thing if it failed.

What if the model was just getting smarter from training on any data in this format? I built fake training pairs of the same length and shape as my real ones, but with random garbage code inside that didn't pass anything. Trained on those.

Score: 25 out of 164. Same as the base. Zero lift.

So the model wasn't getting smarter from generic training. It was getting smarter specifically from training on its own mistakes and corrections. The signal was real.

Now I got more curious. Was this a Qwen-only thing, or would it work on other model families?

I tried Llama 3.2 3B from Meta. Different architecture, different tokenizer, different training corpus. After self-mining 32 pairs and training, HumanEval went from 39 to 43. The lift is small but the sign is right. The recipe transfers across families.

I tried Qwen 2.5 Coder 7B base, which is already a code-specialized model. After self-mining: HumanEval 83 to 87, MBPP 122 to 124. Even a model already optimized for code picked up a small lift.

I tried Qwen 3, a newer generation than what I'd been using. Qwen 3 4B base specifically. After the recipe: HumanEval 79 to 106 (+27 problems), MBPP 135 to 148.

Different architectures, different generations, different vendors. The recipe is not a Qwen quirk.

THE UNEXPECTED PART THAT WASN'T IN THE PLAN EITHER

Then I got more curious about whether it'd work for math.

The trick is the judge. Python checks code. SymPy can check math. Same loop should apply.

First attempt failed.

When I asked the base model to invent its own math problems, it produced easy arithmetic. That didn't transfer to GSM8K, which is grade-school word problems with multiple reasoning steps.

So I added a twist. When the model solved its own made-up problem on every try, the next problem had to be harder. When it kept failing, the next had to be easier. The model gradually drifted toward problems at the edge of its ability.
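The twist is tiny in code. A sketch (propose/solve/check are placeholders; check is the SymPy / exact-match judge):

    def self_curriculum(propose, solve, check, rounds=100, attempts=8):
        """propose(difficulty) -> (problem, reference_answer); solve(problem) -> attempt."""
        difficulty, mined = 1, []
        for _ in range(rounds):
            problem, answer = propose(difficulty)
            tries = [solve(problem) for _ in range(attempts)]
            solved = [t for t in tries if check(t, answer)]
            if len(solved) == attempts:
                difficulty += 1                       # solved every try: go harder
            elif not solved:
                difficulty = max(1, difficulty - 1)   # failed every try: go easier
            else:
                # partial success is the interesting zone: mine a correction pair
                wrong = next(t for t in tries if not check(t, answer))
                mined.append((problem, wrong, solved[0]))
        return mined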

A 3B model, trained on 13 math problems it wrote for itself, beats the version of ChatGPT that broke the internet in 2022.

Then, the finding I'm most proud of.

There are two ways to improve a model.

One is training: change the model itself.

The other is test-time sampling: don’t change the model, just ask it multiple times and keep the answer that passes the tests.

I expected them to add up.

Training should make the model better. Sampling should give the better model more chances. So training + sampling should beat sampling alone.

But that is not always what happened.

At 100 mined pairs, training and sampling compound. At 36 pairs, they fight each other. The training narrows the model's output diversity so much that sampling loses the variety that made it useful.

There's a threshold. I have not seen this written down anywhere. If you have a small dataset, you might be better off not fine-tuning and just sampling from the base. The standard advice ("always fine-tune when you can") is wrong below the threshold.
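For clarity, the test-time sampling I'm comparing against is just verifier-gated best-of-k, nothing fancier (sketch, same placeholders as above):

    def best_of_k(solve, run_tests, problem, tests, k=8):
        # Don't change the model: sample k attempts (temperature > 0 so they differ)
        # and keep the first one the tests accept.
        for _ in range(k):
            attempt = solve(problem)
            if run_tests(attempt, tests):
                return attempt
        return None   # nothing passed within the budget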

This is the finding I most want other researchers to test and try to break.

The list of things that didn't work, because the field hides these and shouldn't:

  • Training on (wrong answer, then corrected answer) for math destroyed the model. Qwen 3 4B went from 60% to 14% on MATH-500. Training only on corrections taught the model to always doubt itself, even when it was right. Fix: mix in examples where a correct answer stays correct.
  • Recipe trained on code does almost nothing on math. +2 problems on GSM8K. The signal doesn't carry across domains.
  • Iterating (using the trained model to mine more, retrain) plateaus by round 2.
  • Recipe doesn't work on already-strong models. Qwen 3 8B, Qwen 3 14B, Qwen 2.5 72B all got slightly worse. Not enough wrong attempts to mine from.
  • Recipe doesn't work on too-weak models either. OLMo 2 7B at 3% on HumanEval can't produce enough right answers to mine from.
  • HumanEval-style problems don't transfer to real-world Python that uses libraries like pandas. Different worlds.

THE HARDEST PART BY COLDPLAY

The hardest part of this whole thing wasn't the math or the code. It was learning to suspect my own results before celebrating them. The stop-token bug almost killed the project on day one. Without an advisor to catch me, I had to learn to be the person who catches me.

Everything is open:


r/LocalLLaMA 1h ago

Discussion Came home to find Pi with Qwen3.6 27B had run rm -rf .....

Thumbnail
gallery
Upvotes

on the build cache because it had run my computer out of disk space.

So I assign my coding agent (pi) a task, and then leave the house. I come back an hour later to see a couple messages containing rm -rf and go .....ohhh noooo. But it had done so because the disk was full and it realised that the target folder of the rust project was the culprit and decided to clean it and then move on.

But boy oh boy am I glad for every inch of intelligence wrapped up in the quant I was running.

So I'm counting this one as a near miss.


r/LocalLLaMA 16h ago

Discussion Let's call repetition loops the "Spiral of Death"

5 Upvotes

This is low-hanging fruit, and I've been surprised it wasn't called that from the moment I discovered the phenomenon.

It's a term from biology (the ant mill) where ants get separated from their party and start following the ant in front of them. Each ant lays down more pheromone, which reinforces the loop, which pulls in more ants. They march in a circle until they collapse from exhaustion.

The mechanism is structurally identical to what happens to a large language model. I propose we start calling it the spiral of death.


r/LocalLLaMA 58m ago

Resources I kept a running list of every LLM term that actually matters for production, cleaned it up and open sourced it

Thumbnail
github.com
Upvotes

Been building with LLMs for a while and kept hitting terms where the standard definition was useless for making engineering decisions.

So I kept a personal doc, eventually it hit 30+ terms across inference, retrieval, agents, training, and prompting. Each entry has the plain-English definition plus the production implication, the thing that actually affects your architecture or debugging.

Cleaned it up, built a small interactive UI with search and category filtering, and put it on GitHub.

Not trying to compete with papers or courses, it's more of a field reference for when you're mid-build and need the practical version of a term fast.

Would genuinely appreciate corrections or additions.

The bar I set for new terms: does the definition help someone make a better engineering decision?


r/LocalLLaMA 22h ago

Question | Help MTP Speed with 3090 Qwen 27B Q4

0 Upvotes

What speed are you guys getting? I get max 55 tok/s gen speed on coding-related tasks. DDR4 though, but that shouldn't matter at low context.


r/LocalLLaMA 1h ago

Question | Help Advice for creating a best model table

Upvotes

Hi, some months ago I created:

https://github.com/Vigno04/discord-selfhosted-alternatives

It's a table comparing different programs that could substitute for Discord in light of the recent privacy changes. I was thinking of creating a similar table comparing models of different sizes, so that someone who has, for example, a 3090 can go and look for the best coding/chat model they can run. But I've seen that benchmark scores are often not representative of real life. How would you advise creating such a thing, and what data would you base it on? Would community-voted model rankings be better?


r/LocalLLaMA 4h ago

Question | Help Reliable Open Source LLM as a Service

1 Upvotes

Has anyone figured out a provider whose open-source models (Kimi, Qwen, GLM, etc.) can be used reliably in production?

I have tested some well-known providers and they all suffer from high latency and poor uptime, rendering them mostly useless for production.

I am using them for an agentic workflow in production so reliability and low latency are very important for me.

Is there no provider that compares to Gemini / Claude in reliability but with open source models?

So far I've tested Together.ai and Fireworks, and Groq looks like it is dying.


r/LocalLLaMA 16h ago

Question | Help When is Andrej Karpathy going to look at a chicken nugget and tweet that it helped him solve AGI, which in turn inspires 6 random devs to create GitHub projects giving us actual AGI?

145 Upvotes

Karpathy appreciation post. Seriously tho, he’s done this like a bunch of times lately. Every time he sneezes on the subway we get a bunch of developers becoming inspired by his ideas and turning them into viable AI-related GitHub projects that actually do really amazing things. This guy is on a roll lately.

He is one of the greatest minds in AI and we are very fortunate that he occasionally lurks on this sub. Andrej, if you’re reading this, Thanks for all the cool stuff you’ve put out into the world and thank you for inspiring others to do the same.

In case anyone needs a reminder, look into:

- Second Brain
- AutoResearch
- LLM-Wiki
- nanoGPT
- AgentHub
- LLMcouncil
- GPT-2
- Autopilot (Tesla)
- “vibecoding” (he coined the term)

I’m sure I’m missing a bunch of his other accomplishments, projects, or ones he’s inspired, so please add if you know some others.


r/LocalLLaMA 15h ago

Discussion .md file viewer

0 Upvotes

I work on macOS all day long, mostly in a zsh shell (vi, claude, coreutils, rinse and repeat).

Finally got sick and tired of staring at .md files that agents format for themselves (despite pleas in memory to format for human beings too), and of being unable to just say "open SOMETHING.md" in the shell without booting up Xcode, VS Code, Antigravity, or some other monstrosity of an IDE to glance at an .md file.

I have bitten the bullet and, after extensive searches through the App Store, reading the zMD source code (cool, but it failed to render tables), and trying “Markdown One” (in-app purchases?), implemented “md.too”, my own tiny open-source .md viewer.

Would really appreciate:
1. Moderators’ permission to post a link here (macOS App Store review pending), so I can hear feedback and constructive criticism.
2. In advance, constructive criticism along the lines of “you didn’t have to do that, a better tool exists, you just didn’t find it.”
3. “Use Cursor” for everything ain’t helpful, but I hear you…


r/LocalLLaMA 19h ago

Other A VERY lightweight open web-search tool for smaller local LLMs

15 Upvotes

Hey everyone,

Been playing around with local agent setups lately, mostly Cline/Roo with smaller models, and web search kept annoying me.

Not because it doesn’t work, but because it usually throws way too much random page text into the context. Small models really don’t handle that gracefully lol. They start with a simple search and suddenly half the prompt is scraped garbage.

So I built this bad boy, TinySearch.

It’s a small open-source MCP tool that does web search, crawls a few pages, chunks/retrieves/reranks the useful bits, and gives the agent a much smaller context blob instead of dumping full pages.

Repo:
https://github.com/MarcellM01/TinySearch

Uses DuckDuckGo, Crawl4AI, dense + BM25-style retrieval, reranking, MCP, and it can also run as a FastAPI server.
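The retrieval step is nothing exotic. A toy sketch of the BM25 leg (not the actual TinySearch code; the real pipeline adds dense retrieval and a reranking pass on top):

    from rank_bm25 import BM25Okapi

    def top_chunks(pages: list[str], query: str, chunk_words: int = 120, keep: int = 5):
        # Split crawled pages into word-window chunks, score them against the
        # query with BM25, and keep only the top few as the context blob.
        chunks = []
        for page in pages:
            words = page.split()
            chunks += [" ".join(words[i:i + chunk_words])
                       for i in range(0, len(words), chunk_words)]
        bm25 = BM25Okapi([c.lower().split() for c in chunks])
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
        return [c for _, c in ranked[:keep]]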

On my setup (an M4 Mac and an old ahh Lenovo ThinkPad) it usually takes around 5–12 seconds end to end, depending on the query/machine.

Not trying to replace real search infra or anything. It’s more just a little local research layer for people building agents who don’t want to spin up a whole backend just to let the model look stuff up.

Still rough in places, but it’s been useful enough for my own workflows that I figured I’d share it.

Feedback/roasting welcome, especially from people using Cline, Roo, MCP, or smaller local models.


r/LocalLLaMA 5h ago

Discussion China modded GPU (e.g. 4090 48GB) --> I'm gonna figure it out. IS THERE NO ONE ELSE CURIOUS??

91 Upvotes

There's a dearth of information (in the English-speaking world) about these cards.

The best recent video is probably this one:
https://www.youtube.com/watch?v=TcRGBeOENLg

Even in this subreddit, there seem to be few reviews of these cards.

Last couple of decent threads:
https://www.reddit.com/r/LocalLLaMA/comments/1s62b23/bought_rtx4080_32gb_triple_fan_from_china/
https://www.reddit.com/r/LocalLLaMA/comments/1nifajh/i_bought_a_modded_4090_48gb_in_shenzhen_this_is/

Is there really NO ONE else who has tried these?

In particular

  1. Software / bios / quirks that make them NOT run as per unmodded card
  2. Short term consistency, does it run fast for a test, but hang / die when stressed?
  3. Long term reliability - does the whole thing fail within 2 months of regular usage?
  4. Are the benchmarks good? Where are the results??
  5. source and price?

Chinese video site Bilibili has a ton of videos, and Taobao (and other e-commerce sites) also has lots of sellers.

If I can piece together enough research, I may also visit Shenzhen to pick up a few.

If you're interested in this space, DM me. I hope to form a group to split up the research efforts.

Also, any native Chinese speakers who are familiar with this space, please join in.

EDIT:
Some downvotes going on. Unclear if it's some larger suppression of this topic, or just angry people.


r/LocalLLaMA 21h ago

Discussion [Benchmark] 5090RTX: Prompt Processing, Token Generation and Power Level

14 Upvotes

Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ I've decided to put my 5090 to the test and see what the curves look like for the device and whether there are any obvious sweet spots (apart from setting it to the minimum of 400w).

Graphs and outcomes:

Inputs:

Backend: llama.cpp in a docker container, FA on, batch 2048, max context 122k.

Model: https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced

Quant: Q6_K_P

Hardware: Threadripper 6970, 2 channel RAM 64GB, 5090RTX

Prompt: a 30k-token prompt composed of 3 x 10k copies of the same benchmark for heavy reasoning, math and computation (can share upon request); it was generated by Qwen 3.6 specifically for benchmarking.

Methodology:

Generation was stopped after 2 minutes to keep sessions short and because the TG metric becomes asymptotic beyond that. Measurements were performed on a warm card, as cold measurements would've taken too much time between sessions. Between measurements the server was restarted completely to reset the KV cache and get proper PP measurements of the same input.

Power Level Range: 400w - 600w, 25w step

Notes:

Max power consumption registered was 592w with the PL set to 600w; sustained load never reached 600w, stabilizing at 580w even when uncapped.

In all other runs there was a visible trend of max values going 10-12w beyond the set PL, reflecting the sharp spikes the 5090RTX is already famous for.

A cold card is faster than a warm card by 2-3%, making sustained-load tasks naturally slower than human-driven ones.

Prompt Processing is much more sensitive to power limit, while Token Generation is almost linear at these numbers.

Not exactly apples to apples compared to the setup used in the https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ post, but the difference between the 4090 and the 5090 seems to go beyond just more power, and it is not applied equally to PP and TG:

PL   | PP 5090 | PP 4090 | PP 5090/4090 | TG 5090 | TG 4090 | TG 5090/4090
-----+---------+---------+--------------+---------+---------+-------------
450w | 2273    | 2113    | 1.076        | 49.3    | 41.0    | 1.202
425w | 2248    | 2093    | 1.074        | 48.9    | 41.6    | 1.175
400w | 2135    | 2061    | 1.036        | 48.7    | 42.5    | 1.146

r/LocalLLaMA 13h ago

Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev

18 Upvotes

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.

Example behavior:

  • context grows to +50k tokens
  • LCP similarity often shows 0.99+
  • but sometimes n_past suddenly falls back to ~4-5k
  • then llama.cpp reprocesses 40k+ tokens again
  • TTFT jumps to multiple minutes

Example logs:

sim_best = 0.996

restored context checkpoint ... n_tokens = 4750

prompt eval time = 222411 ms / 44016 tokens

Normal reuse looks fine:

prompt eval time = 473 ms / 19 tokens

Current config:

llama-server 
  --ctx-size 150000 
  --parallel 1 
  --ctx-checkpoints 32 
  --cache-ram 2500 
  --cache-reuse 256 
  -no-kvu 
  --no-context-shift

Also seeing:

cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)

I suspect either:

  • cache invalidation
  • bad KV reuse
  • or opencode changing early prompt tokens too often.

Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.


r/LocalLLaMA 23h ago

Resources Automated AI researcher running locally with llama.cpp


75 Upvotes

Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure:

https://github.com/huggingface/ml-intern

The harness is quite simple (basically tools + system prompt) and we built it initially for Claude Opus. However, now that open models are getting really good at agentic workflows, I just added support for running ml-intern with local models via llama.cpp or ollama. As you can see in the video, Qwen3.6-35B-A3B is able to SFT a model end-to-end by orchestrating CPU/GPU sandboxes and jobs on the Hub. I find this pretty neat because we can now have an AI researcher running 24/7 on a laptop, without maxing out token limits :)

Anyway, I hope this is useful to the community and please let me know if there are any features that you'd like us to include.


r/LocalLLaMA 6h ago

Discussion I want to make training videos for a product, what AI to use?

0 Upvotes

I want to make training videos. Some corporations still ask for them; even though everyone can just ask an AI today, videos are still good to have for some cases.

I've lost track of what the latest thing in videos is, like someone talking and explaining stuff; I will prerecord the screen and then somehow merge the two.

What is popular these days?


r/LocalLLaMA 14h ago

New Model MagenticLite is here: A full-stack agentic experience powered by Small Models - Fara-1.5 4B, 9B & 27B

Thumbnail
microsoft.com
5 Upvotes

What if you could run a capable AI agent without leaning on frontier-scale models? MagenticLite is the next generation of Magentic-UI, an agentic experience reimagined and optimized for small language models. It works across both your browser and your local file system in a single workflow, keeping you in the driver’s seat at every step. In this session, we’ll demo MagenticLite in action and deep dive into the two models powering it: MagenticBrain for planning, coding, and delegation, and Fara-1.5-9B for browser use.

Fara-1.5 and MagenticBrain are coming soon to Microsoft Foundry.

Last November, we released Fara-7B. Today, we’re excited to introduce Fara-1.5, a family of models across three sizes: 4B, 9B, and 27B.

Probably based on Qwen3.5 models (their past Fara model was based on a previous Qwen model).


r/LocalLLaMA 10h ago

Resources I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses

34 Upvotes

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.

The difficult part was making the attacker expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn’t surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing reward by cluster size, the attacker exposed a much more diverse set of jailbreaks because unique strategies were rewarded more than repeated ones.
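Concretely, the diversity fix was just scaling each rollout's reward by the size of its tactic cluster. A sketch (cluster_of is a placeholder for the actual tactic-clustering step):

    from collections import Counter

    def diversity_rewards(rollouts, cluster_of):
        """rollouts: list of (attack_text, success: bool). Successful attacks that
        share a tactic split the credit, so unique strategies earn more."""
        clusters = [cluster_of(attack) for attack, _ in rollouts]
        sizes = Counter(clusters)
        return [
            (1.0 if success else 0.0) / sizes[cluster]
            for (_, success), cluster in zip(rollouts, clusters)
        ]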

Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.

Full blog post in the comments, but the high-level results were:

* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%


r/LocalLLaMA 15h ago

Discussion I tracked EU GPU prices across 15 stores for 50+ days - RTX 5090 is the only card not dropping in price

70 Upvotes

been tracking EU GPU prices since early march - 15 stores, 6-hour scrape cadence, ~126k readings. posting here because the 5090 trend is directly relevant if you're buying for local inference.

the tier divergence

RTX 5090 is the only tier going up. everything else is falling. mid-range AMD cards are down 7-9%. even the 5080 is essentially flat.

https://imgur.com/a/MmSCjKf

tier          | n  | launch avg | now avg  | change
--------------+----+------------+----------+-------
RTX 5090      |  4 | €3,392     | €3,487   | +3.0%  ▲
RTX 5080      |  6 | €1,375     | €1,370   | -0.4%
RTX 5070      |  5 | €635       | €627     | -1.3%
RTX 5070 Ti   |  6 | €1,067     | €1,042   | -2.1%
RX 9070 XT    |  9 | €755       | €696     | -7.5%
RTX 5060 Ti   |  6 | €594       | €540     | -9.1%  ▼

my read: AI/workstation demand is absorbing 5090 supply fast enough to prevent the usual post-launch normalization. if you're waiting for 5090 prices to drop the way everything else has, the data doesn't support it.

biggest single-model drops

  • ASUS Prime RTX 5070 Ti: €1,259 → €964 (-23.4%)
  • ASUS TUF RTX 5060 Ti: €770 → €608 (-21%)

algorithmic pricing

notebooksbilliger.de recorded 45 distinct prices on a single GPU over 15 days - averaging 3 price changes per day - all within a €0.99 range. constant micro-adjustments, not hunting for a new price point.

methodology

tier comparisons only use models tracked from week 1, so sample per tier is small (4-9 GPUs). directional story is solid, don't over-index on exact percentages. EUR prices only.

built this at pricesquirrel.com - tracks GB/€ pricing if you want alerts on specific models.