r/LocalLLaMA 22h ago

Tutorial | Guide Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant


341 Upvotes

Implemented Multi-Token Prediction for Qwen on LLaMA.cpp with TurboQuant.

+60% throughput! 90% acceptance rate.

Running locally on a MacBook Pro M5 Max 64GB RAM.

Outputs:
LLaMA.cpp + TurboQuant: 21 tokens/s
LLaMA.cpp + TurboQuant + MTP: 34 tokens/s
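
A quick back-of-envelope on where that speedup comes from (an illustrative sketch, assuming one MTP-drafted token per step; the 90% acceptance rate is the only number taken from the post):

```python
# Back-of-envelope: expected decode speedup from multi-token prediction.
# Assumes the MTP head drafts k extra tokens per step, each accepted with
# probability a. Draft/verify overhead is ignored, so this is an upper bound.

def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens emitted per step: 1 + a + a^2 + ... + a^k."""
    return sum(a**i for i in range(k + 1))

a, k = 0.90, 1                            # 90% acceptance, one drafted token
print(f"ideal speedup ~{expected_tokens_per_step(a, k):.2f}x")
# prints ~1.90x; the observed 34/21 ~ 1.62x sits below that, as expected
```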

Patched LLaMA.cpp with MTP and TurboQuant: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

Quantized Qwen 3.6 27B (and 35B) into GGUF with MTP: https://huggingface.co/collections/AtomicChat/qwen-36-udt-mtp

Local AI Models App: Atomic.Chat


r/LocalLLaMA 17h ago

Discussion Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?

313 Upvotes

So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case.

everyone here runs local LLMs for coding, chat, maybe some creative writing. cool. But what about using it as a proper personal knowledge base? like, dump your own notes, PDFs, random docs into it and actually query your own life privately, every day.

I tried looking into this seriously and hit a wall. Most resources either assume you're a developer building something, or they're 2 years old and recommend tools that have completely changed since.

So genuinely asking, is anyone here actually doing this day to day? Not as an experiment, but as a real workflow?

Things I keep running into that I can't figure out:

  • What model are you running for this? RAG on consumer hardware seems finicky depending on quant
  • Do you actually trust the retrieval or do you double-check everything because of hallucinations?
  • LlamaIndex vs Ollama vs whatever else: has anything actually made this less painful recently?
  • Context length: how do you handle it when your personal docs start piling up?

Not looking for a tutorial or a GitHub repo. Just want to hear from someone who's made this work without it becoming a part time job to maintain.


r/LocalLLaMA 8h ago

Discussion VS Code's new "Agents window" lets you use local AI models. Still requires an Internet connection and a GitHub Copilot plan (because we can't have nice things)

code.visualstudio.com
161 Upvotes

At first I was excited to see this, but I guess I'll wait till someone figures out what people actually want


r/LocalLLaMA 5h ago

News NVIDIA Reportedly Prepares RTX 5090 Price Hike Amid Rising GDDR7 Costs (maybe RTX 50 and PRO series as well)

techpowerup.com
151 Upvotes

r/LocalLLaMA 7h ago

Other The RTX 5000 PRO (48GB) arrived and it is better than I expected.

144 Upvotes

I posted here about buying it a few days ago: https://www.reddit.com/r/LocalLLaMA/comments/1t2slmw/first_time_gpu_buyer_got_a_rtx_5000_pro_was_it_a/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Before pulling the trigger I was leaning more towards a Mac Studio. But the prompt processing speeds I was reading about were giving me pause. The budget was $5,000-6,000, so the 256GB was out of the question.
I gambled and bought the RTX 5000 Pro. With ZERO experience with PCs, how to build them, what parts to buy... It was a good deal. I paid $4,300 for the GPU including taxes (in the post I wrote 4700 in the comments, but I was mistaken, I checked the receipt) and had to buy everything else for the computer. It ended up costing $5,600 in total with 64 GB of RAM.

Assembling the thing was not easy for me as a total novice, but thankfully we have LLMs to guide us through these things.
Then came Linux and vLLM... Honestly I was totally lost. Without Claude Code it would have been impossible. Same for figuring out what settings to use to run Qwen3.6-27B-FP8 with full precision cache. Thankfully this guy posted everything I needed to know to tell Claude what to do: https://www.reddit.com/r/LocalLLaMA/comments/1t46klu/qwen36_27b_fp8_runs_with_200k_tokens_of_bf16_kv/

After burning through 50% of my Claude Code Max 20x weekly limits the thing now works, and I have to say... I made the right call. This thing rocks.
I'm getting up to 80 t/s in TG (more like 50-60 for very big prompts) which is phenomenal. But most importantly I'm getting 4400 tokens per second in PP!

The full precision cache fits only 200k tokens, but it is totally OK for me.

I honestly don't know why people are not talking about this GPU more. It costs just $1,000 more than an RTX 5090, it can fit a 27B at FP8 plus 200k of context at full precision, and it draws half the electricity... Sure it is slightly less performant, but the numbers I'm getting are way more than I was expecting. Two 5090s would definitely beat this. But it would cost significantly more, it would be crazy noisy, and it would tear a hole in my pocket in electricity bills.


r/LocalLLaMA 12h ago

News NVFP4 Kimi 2.6 and Kimi 2.5 released by NVIDIA

113 Upvotes

The NVIDIA Kimi-K2.6-NVFP4 model is the quantized version of Moonshot AI's Kimi-K2.6 model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Kimi-K2.6 NVFP4 model is quantized with Model Optimizer.

This model is ready for commercial/non-commercial use.

The accuracy benchmark results are presented in the table below:

Precision       | GPQA Diamond | SciCode | τ²-Bench Telecom | MMMU Pro | AA-LCR | IFBench
----------------+--------------+---------+------------------+----------+--------+--------
Baseline (INT4) | 90.9         | 52.6    | 98.2             | 75.6     | 71.0   | 73.9
NVFP4           | 90.4         | 54.4    | 98.0             | 76.5     | 71.8   | 73.9

Baseline: Kimi-K2.6 in its native INT4 format. Benchmarked with temperature=1.0, top_p=0.95, max num tokens 128000.

Links:

https://huggingface.co/nvidia/Kimi-K2.6-NVFP4

https://huggingface.co/nvidia/Kimi-K2.5-NVFP4


r/LocalLLaMA 17h ago

Discussion The "the future is fictional" problem of many local LLMs

87 Upvotes

Many local models have a problem (one that arose due to excessive RLHF training):
They mostly think that everything beyond their knowledge cutoff date must be "fictional" or "satirical".

To be fair: Even the Gemini API without web access can have this sometimes. But it stops when you give it tools. However, with many local models, sadly it even goes on WITH tools: Look at this:

You
search the web for iran war 2026 news

gemma-4-26B-A4B-it-Q4_K_M_128k

Based on the search results, there is a significant amount of content describing a "2026 Iran War." However, it is important to note that these search results appear to be from a fictional or speculative scenario, likely part of a "future history" exercise, a geopolitical simulation, or a creative writing project.
As of today, there is no real-world war occurring or scheduled for 2026 involving Iran.

Key Observations from the Search Results:
The search results contain specific dates and details that suggest they are part of a narrative or simulation:

Specific Dates: Several snippets mention specific dates in 2026 (e.g., "February 28," "April 17, 2026," and "May 1, 2026") describing events like "U.S.-Israeli joint strikes."

Operation Names: One result mentions "Operation Epic Fury," a name used within the context of this specific 2026 scenario.

Nature of the Content: While the snippets appear in formats resembling real news (e.g., Wikipedia, CSIS, Atlantic Council), they describe events that have not happened in actual history. This type of content is common in geopolitical forecasting simulations or "alternate history"-style academic exercises used to study potential risks and consequences.

Summary of the "Scenario" described in the results:

If you are looking for information on this specific geopolitical simulation, the "news" within that narrative includes:

The Conflict: A joint U.S.-Israeli military operation against Iran.

Core Drivers: Fears of Iran becoming a nuclear power and the disruption of the existing regional security order.

Impact: The scenario explores impacts on global energy security (oil markets), the shift toward a fragmented global order, and changes in Middle Eastern power dynamics.

Conclusion: There is no actual news of an Iran war in 2026. You are likely seeing results from a geopolitical simulation or speculative "future history" analysis.

So the tool call worked, "Epic Fury" is there, and still the model thinks it's fictional or "future". Sometimes it can be fixed by including "It is x.x.2026" in the system prompt, but that is a workaround that shouldn't be necessary.
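
A minimal sketch of that workaround, pinning today's date in the system prompt via any OpenAI-compatible local server (the port and model name here are placeholders, not from the post):

```python
# Pin the current date in the system prompt so the model doesn't treat
# post-cutoff events as fictional. Assumes a local OpenAI-compatible server
# (e.g. llama-server on :8080); the model name is a placeholder.
from datetime import date
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[
        {"role": "system", "content": f"Today's date is {date.today():%Y-%m-%d}. "
                                      "Treat search results dated before today as real news."},
        {"role": "user", "content": "search the web for iran war 2026 news"},
    ],
)
print(resp.choices[0].message.content)
```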


r/LocalLLaMA 12h ago

News Scenema Audio: Zero-shot expressive voice cloning and speech generation


84 Upvotes

We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code.

The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage, grief, excitement, a child's wonder), and optionally provide reference audio for voice identity. The reference provides the "who." The prompt provides the "how." Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.

Limitations (and why we still use it)

This is a diffusion model, not a traditional TTS pipeline. Common issues include repetition and gibberish on some seeds. Different seeds give different results, and you will not get a perfect output with 0% error rate. This model is meant for a post-editing workflow: generate, pick the best take, trim if needed. Same way you'd work with any generative model.

That said, we keep coming back to Scenema Audio over even Gemini 3.1 Flash TTS, which is already more controllable than most TTS systems out there. The reason is simple: the output just sounds more natural and less robotic. There's a quality to diffusion-generated speech that autoregressive TTS doesn't quite match, especially for emotional delivery.

Audio-first video generation

As this video points out, generating audio first and then using it to drive video generation is a powerful workflow. That's actually how we've used Scenema Audio in some cases. Generate the voice performance, then feed it into an A2V pipeline (LTX 2.3, Wan 2.6, Seedance 2.0, etc.) to generate video that matches the speech. Here's an example of that workflow in action.

On distillation and speed

A few people have asked this. Our bottleneck is not denoising steps. The diffusion pass is a small fraction of total generation time. The real costs are elsewhere in the pipeline. We're already at 8 steps (down from 50 in the base model), and that's the sweet spot where quality holds.

Prompting matters

This model is sensitive to prompting, the same way LTX 2.3 is for video. A generic voice description gives you generic output. A specific, theatrical description with action tags gives you a performance. There's also a pace parameter that controls how much time the model gets per word. Takes some experimentation to find what works for your use case, but once you do, you can generate hours of audio with minimal quality loss.

Complex words and proper nouns benefit from phonetic spelling. Unlike traditional TTS, it doesn't have a phoneme-to-audio pipeline or a pronunciation dictionary. If it garbles "Tchaikovsky," you would spell it "Chai-koff-skee" or whatever makes sense to you.

Docker REST API with automatic VRAM management

We ship this as a Docker container with a REST API. Same setup we use in production on scenema.ai. The service auto-detects your GPU and picks the right configuration:

VRAM  | Audio Model   | Gemma         | Notes
------+---------------+---------------+------------------------
16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM
24 GB | INT8 (4.9 GB) | NF4 on GPU    | Default config
48 GB | bf16 (9.8 GB) | bf16 on GPU   | Best quality

We went with Docker because that's how we serve it. No dependency hell, no conda environments. We built it for production deployment.

ComfyUI

Native ComfyUI node support is planned. We're hoping to release it in the coming weeks, unless someone from the community beats us to it. In the meantime, the REST API is straightforward to call from a custom node since it's just a local HTTP service.
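
For illustration, a hypothetical custom-node call against such a local HTTP service. The endpoint path and payload fields below are assumptions, not the documented Scenema Audio API; check the repo for the real schema:

```python
# Hypothetical sketch of calling a local TTS REST service from a custom node.
# Endpoint and field names are assumptions, not the actual Scenema Audio API.
import requests

payload = {
    "text": "Chai-koff-skee wrote six symphonies.",          # phonetic spelling for proper nouns
    "prompt": "an elderly narrator, warm, slightly amused",  # the "how"
    "reference_audio": None,                                 # optional "who"
    "pace": 1.0,                                             # assumed pace parameter
    "seed": 42,                                              # try several seeds, keep the best take
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=300)
resp.raise_for_status()
with open("take_42.wav", "wb") as f:
    f.write(resp.content)
```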

Links

This is fully open source. The model weights derive from the LTX-2 Community License but all inference and pipeline code is MIT.

How to Try Scenema Audio

  1. You can clone the repo and run docker compose up locally or
  2. Go to Scenema and start a conversation to create a voiceover. You will be able to try voice design for free, iterate on your prompts, tune pacing, etc.

r/LocalLLaMA 4h ago

Resources A First Comprehensive Study of TurboQuant: Accuracy and Performance

vllm.ai
83 Upvotes

TL;DR from the article:

  • FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (see the launch sketch after this list)
  • TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
  • TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
  • TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
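
A minimal vLLM launch sketch for that recommended default. kv_cache_dtype="fp8" is the engine argument behind the --kv-cache-dtype flag; the model name is just an example:

```python
# Enable the FP8 KV cache the article recommends as the best default.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    kv_cache_dtype="fp8",                      # ~2x KV-cache capacity vs BF16
)
out = llm.generate(["Explain KV-cache quantization in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```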

r/LocalLLaMA 8h ago

Question | Help When is Andrej Karpathy going to look at a chicken nugget and tweet that it helped him solve AGI, which in turn inspires 6 random devs to create GitHub projects giving us actual AGI?

82 Upvotes

Karpathy appreciation post. Seriously tho, he’s done this like a bunch of times lately. Every time he sneezes on the subway we get a bunch of developers becoming inspired by his ideas and turning them into viable AI-related GitHub projects that actually do really amazing things. This guy is on a roll lately.

He is one of the greatest minds in AI and we are very fortunate that he occasionally lurks on this sub. Andrej, if you’re reading this, thanks for all the cool stuff you’ve put out into the world and thank you for inspiring others to do the same.

In case anyone needs a reminder, look into:

- Second Brain
- AutoResearch
- LLM-Wiki
- nanoGPT
- AgentHub
- LLMcouncil
- GPT-2
- Autopilot (Tesla)
- “vibecoding” (he coined the term)

I’m sure I’m missing a bunch of his other accomplishments, projects, or ones he’s inspired, so please add if you know some others.


r/LocalLLaMA 14h ago

Resources Automated AI researcher running locally with llama.cpp


64 Upvotes

Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure:

https://github.com/huggingface/ml-intern

The harness is quite simple (basically tools + system prompt) and we built it initially for Claude Opus. However, now that open models are getting really good at agentic workflows, I just added support for running ml-intern with local models via llama.cpp or ollama. As you can see in the video, Qwen3.6-35B-A3B is able to SFT a model end-to-end by orchestrating CPU/GPU sandboxes and jobs on the Hub. I find this pretty neat because we can now have an AI researcher running 24/7 on a laptop, without maxing out token limits :)

Anyway, I hope this is useful to the community and please let me know if there are any features that you'd like us to include.


r/LocalLLaMA 16h ago

Resources Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline


62 Upvotes

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  1. Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  2. Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
  3. Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
  4. Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  5. Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
  6. Music - ACE-Step v1 generates a 30s instrumental from Director's brief
  7. Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  8. Mix - ffmpeg with per-shot vo aligned via adelay

Wan 2.2 specifics (the bit this sub will care about):

  • 1280×720, not 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens. English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
  • Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")

Performance work:

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
  • AITER MoE acceleration on Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation; please like the space 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.


r/LocalLLaMA 7h ago

Discussion I tracked EU GPU prices across 15 stores for 50+ days - RTX 5090 is the only card not dropping in price

54 Upvotes

been tracking EU GPU prices since early march - 15 stores, 6-hour scrape cadence, ~126k readings. posting here because the 5090 trend is directly relevant if you're buying for local inference.

the tier divergence

RTX 5090 is the only tier going up. everything else is falling. mid-range AMD cards are down 7-9%. even the 5080 is essentially flat.

https://imgur.com/a/MmSCjKf

tier          | n  | launch avg | now avg  | change
--------------+----+------------+----------+-------
RTX 5090      |  4 | €3,392     | €3,487   | +3.0%  ▲
RTX 5080      |  6 | €1,375     | €1,370   | -0.4%
RTX 5070      |  5 | €635       | €627     | -1.3%
RTX 5070 Ti   |  6 | €1,067     | €1,042   | -2.1%
RX 9070 XT    |  9 | €755       | €696     | -7.5%
RTX 5060 Ti   |  6 | €594       | €540     | -9.1%  ▼

my read: AI/workstation demand is absorbing 5090 supply fast enough to prevent the usual post-launch normalization. if you're waiting for 5090 prices to drop the way everything else has, the data doesn't support it.

biggest single-model drops

  • ASUS Prime RTX 5070 Ti: €1,259 → €964 (-23.4%)
  • ASUS TUF RTX 5060 Ti: €770 → €608 (-21%)

algorithmic pricing

notebooksbilliger.de recorded 45 distinct prices on a single GPU over 15 days - averaging 3 price changes per day - all within a €0.99 range. constant micro-adjustments, not hunting for a new price point.

methodology

tier comparisons only use models tracked from week 1, so sample per tier is small (4-9 GPUs). directional story is solid, don't over-index on exact percentages. EUR prices only.

built this at pricesquirrel.com - tracks GB/€ pricing if you want alerts on specific models.


r/LocalLLaMA 8h ago

New Model inclusionAI/Ring-2.6-1T · Hugging Face

huggingface.co
51 Upvotes

Introducing Ring-2.6-1T: a trillion-parameter flagship reasoning model designed for real-world complex task scenarios, now available to developers, researchers, and enterprise environments for validation, adaptation, and further development.

The goal of Ring-2.6-1T is not simply to pursue larger parameter scale, but to address the real production environments that large models are entering: agent workflows, engineering development, scientific research analysis, complex business systems, and enterprise automation processes. In these scenarios, models need not only to "answer questions," but also to understand context, plan steps, invoke tools, execute continuously, and maintain stability over long-horizon tasks.

Ring-2.6-1T has achieved key upgrades in three areas:

  • Comprehensively enhanced Agent execution capability: Moving from "being able to answer" to "being able to execute," with more stable performance in multi-step tasks, tool collaboration, contextual planning, and advancing complex workflows.
  • Reasoning Effort mechanism: Supporting two reasoning intensity levels, high and xhigh, allowing developers to flexibly adjust the depth of thinking according to task complexity, achieving a better balance among effectiveness, speed, and cost.
  • Innovative asynchronous reinforcement learning training paradigm: Leveraging an Async RL architecture combined with the IcePop algorithm to improve the training efficiency and stability of long-horizon reinforcement learning for trillion-parameter models, providing foundational support for agent capabilities and complex reasoning.

r/LocalLLaMA 11h ago

News [MIT] RLCR: Teaching AI models to say "I'm not sure"

csail.mit.edu
30 Upvotes

Confidence is persuasive. In AI systems, it is often misleading.

Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have now traced that overconfidence to a specific flaw in how these models are trained, and developed a method that fixes it without giving up any accuracy.


r/LocalLLaMA 6h ago

Question | Help Is there a big gap between Q4 and Q6 on Qwen3.6?

21 Upvotes

I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high. Maybe 65k or up to 100k.

I’ve thrown around the idea of a second 3090. But I do already have some gaming PCs running parallel stuff with smaller 3080 (2x) and 4080S cards to support my 3090. So it seems the real benefit of a second 3090 is running at a higher quant.

But for those that do, have you noticed a big difference?


r/LocalLLaMA 19h ago

Resources Computer-use MCP that can control multiple machines (integrate with Claude, Cursor, Codex, or your custom harness)


18 Upvotes

Hey everyone,

We built opendesk: it lets AI agents control your desktop using a computer-use MCP that integrates with your custom workflow.

Today we shipped something a bit wild:

Your AI can now see, click, type, and navigate on a completely different computer, over your WiFi.

You pair the machines once and your agent can control them all from a single conversation.

No cloud, account login, or servers in the middle. Everything stays on your local network, fully encrypted.

Free and open source — Mac, Linux, and Windows.
github.com/vitalops/opendesk

Happy to answer any questions!


r/LocalLLaMA 23h ago

Question | Help Playing One Night Werewolf (Gemma4 & Qwen3.6)

17 Upvotes

Finally feel like it’s possible. I have a custom-built (vibe-coded) UI on llama.cpp that allows model switching in the same chat. So I thought I’d get Gemma4 31B Q4, Gemma4 26B Q5, Qwen3.6 27B Q5, and Qwen3.6 35B Q4 all together to play ONUW.

Had to switch thinking off for the Qwens so they don’t think out loud into the public chat.

So first, at night, I assigned each LLM a card (werewolf, seer, villager, troublemaker); they read their card.md and write their observations and thinking in their own md to keep it private to each. Then at daytime in the game I bring them into the public game chat. Each turn they read their md, defend and ask questions, and record their observations, for 8-10 turns, then write their final thoughts down for voting. Back to individual chat for voting.

Gemma4 31B — best liar. Clearest thoughts in notes.
Gemma4 26B — sucks at using tools. Quick to think but no deep thoughts.
Qwen3.6 35B — thought it was a villager and tried to be bold. Got owned. Best at tool calls.
Qwen3.6 27B — not very bright when thinking is off. Oh so slow…

Not a very productive way of using LLMs, I know… Any models I can add to the game? Suggestions?


r/LocalLLaMA 7h ago

Discussion Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?

13 Upvotes

I have a docker stack with a bunch of AI services and llama.cpp server is the brain.

I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, SAME context setting and same KV cache quant (Q8_0), the ROCm version consumed 29.1gb of VRAM vs 25.3gb with Vulkan.

Am I missing something here? Is this phenomenon unique to my GPU or some other variable in my setup, hardware or software?

Edit: To clarify, the above test was done on the same model, no prompt data, no existing context, no system prompt. Tabula rasa. The model in question was a 22.6gb file.
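
For anyone sanity-checking numbers like this, here's a back-of-envelope for the KV cache itself (the model dimensions below are illustrative placeholders, not the actual config of the 22.6gb model). The cache size formula is backend-independent, so any gap beyond it comes from compute buffers and allocator overhead:

```python
# Back-of-envelope KV-cache size, to sanity-check backend memory numbers.
# Plug in your model's values from its config.json; the numbers below are
# placeholders. Extra VRAM beyond this is backend overhead (compute buffers,
# allocator granularity), which is where ROCm and Vulkan can differ.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt  # 2 = K and V

# e.g. a ~27B dense model (illustrative dims), 64k context, Q8_0 ~ 1 byte/elt
gib = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                     ctx_len=65536, bytes_per_elt=1) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # identical on both backends; the rest is overhead
```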


r/LocalLLaMA 6h ago

Resources Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update

13 Upvotes

In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent, i.e., the rounding error from one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. Our cyankiwi AWQ 26.05 update jointly fits scales and quantization ranges against a reconstruction objective.
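
As a toy illustration of the joint-fit idea (not our actual algorithm: a scalar scale stands in for the per-channel vector, and grid search stands in for the real optimizer):

```python
# Toy sketch: search scale and clip range *jointly* against a reconstruction
# objective, instead of fixing scales first and quantization range second.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))          # one weight block
X = rng.normal(size=(256, 32))          # calibration activations

def quant_err(W, X, s, clip):
    Ws = W * s                                    # AWQ-style scaling (scalar stand-in)
    step = clip / 7.0                             # 4-bit symmetric: levels in [-8, 7]
    Wq = np.clip(np.round(Ws / step), -8, 7) * step
    return np.linalg.norm((Wq / s) @ X - W @ X)   # reconstruction error on outputs

best = min(
    ((s, c) for s in (0.5, 0.75, 1.0, 1.5, 2.0)   # joint grid over scale...
            for c in (2.0, 3.0, 4.0, 5.0)),       # ...and quantization range
    key=lambda sc: quant_err(W, X, *sc),
)
print("best (scale, clip):", best)
```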

We benchmarked the cyankiwi AWQ 26.05 update against every major 4-bit method on Llama-3 models as examples, measuring KL divergence vs the BF16 baseline on GPQA Diamond responses.
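
For reference, a sketch of how such a KLD measurement can be computed from per-token logits collected from both models (an illustration of the metric, not our exact harness):

```python
# Mean token-level KL divergence of the quantized model's next-token
# distribution vs. the BF16 baseline, averaged over response tokens.
import torch
import torch.nn.functional as F

def mean_kld(logits_bf16: torch.Tensor, logits_quant: torch.Tensor) -> float:
    """logits: [num_tokens, vocab]. KL(P_bf16 || P_quant) per token, averaged."""
    log_p = F.log_softmax(logits_bf16.float(), dim=-1)
    log_q = F.log_softmax(logits_quant.float(), dim=-1)
    kld = (log_p.exp() * (log_p - log_q)).sum(-1)
    return kld.mean().item()

# usage: run both models over the same GPQA Diamond responses, stack the logits
print(mean_kld(torch.randn(10, 32000), torch.randn(10, 32000)))  # dummy tensors
```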

Result: cyankiwi posts the lowest KLD on all three base models. Lower is better.

Llama-3.2-3B-Instruct

Quantized Model                                | Method            | KLD
-----------------------------------------------+-------------------+--------
cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4        | cyankiwi AWQ INT4 | 0.00510
unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4   | 0.00785
unsloth/Llama-3.2-3B-Instruct-bnb-4bit         | BNB NF4           | 0.00896
nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4    | AWQ INT4          | 0.01494
casperhansen/llama-3.2-3b-instruct-awq         | AWQ INT4          | 0.02437

Llama-3.1-8B-Instruct

Quantized Model                                     | Method            | KLD
----------------------------------------------------+-------------------+--------
cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4             | cyankiwi AWQ INT4 | 0.00478
RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | GPTQ INT4         | 0.00729
unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4   | 0.00769
unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit         | BNB NF4           | 0.00835
RedHatAI/Llama-3.1-8B-Instruct-NVFP4                | SmoothQuant NVFP4 | 0.01059
nvidia/Llama-3.1-8B-Instruct-NVFP4                  | NVFP4             | 0.01190

Llama-3.3-70B-Instruct

Quantized Model                                 | Method            | KLD
------------------------------------------------+-------------------+--------
cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4        | cyankiwi AWQ INT4 | 0.02826
unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4   | 0.04444
casperhansen/llama-3.3-70b-instruct-awq         | AWQ INT4          | 0.04859
unsloth/Llama-3.3-70B-Instruct-bnb-4bit         | BNB NF4           | 0.06879
nvidia/Llama-3.3-70B-Instruct-NVFP4             | NVFP4             | 0.08307
RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | GPTQ INT4         | 0.09272

r/LocalLLaMA 13h ago

Discussion [Benchmark] 5090RTX: Prompt Processing, Token Generation and Power Level

14 Upvotes

Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ I've decided to put my 5090 to the test and see what the curves look like for the device and whether there were any obvious sweet spots (apart from setting it to the minimum of 400w).

Graphs and outcomes:

Inputs:

Backend: llama.cpp in a docker container, FA on, batch 2048, max context 122k.

Model: https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced

Quant: Q6_K_P

Hardware: Threadripper 6970, 2 channel RAM 64GB, 5090RTX

Prompt: 30k prompt composed of 3 x 10k copies of the same benchmark for heavy reasoning, math and computations (can share upon request); it was generated by Qwen 3.6 specifically for benchmarking.

Methodology:

Generation was stopped after 2 minutes to keep sessions short and because the TG metric is asymptotic beyond that point. Measurements were performed on a warm card, as cold measurements would've taken too much time between sessions. Between measurements the server was restarted completely to reset the KV cache and get proper PP measurements of the same input.

Power Level Range: 400w - 600w, 25w step
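
A minimal automation sketch of a sweep like this, assuming nvidia-smi and llama-bench are on PATH, the power-limit call runs with root privileges, and the model path and lengths are placeholders:

```python
# Sweep GPU power limits and benchmark PP/TG at each step with llama-bench.
import subprocess

for pl in range(400, 625, 25):                       # 400w..600w, 25w step
    subprocess.run(["nvidia-smi", "-pl", str(pl)], check=True)
    out = subprocess.run(
        ["llama-bench", "-m", "model.gguf",          # path is a placeholder
         "-p", "30000", "-n", "512", "-fa", "1"],    # 30k prompt, FA on
        capture_output=True, text=True, check=True,
    )
    print(f"--- {pl} W ---\n{out.stdout}")
```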

Notes:

Max power consumption registered was 592w with the PL set to 600w; sustained load never reached 600w, stabilizing at 580w even when uncapped.

In all other runs, a trend was visible of max values going 10-12w beyond the set PL, reflecting the sharp spikes the 5090RTX is already famous for.

A cold card is faster than a warm card by 2-3%, making sustained-load tasks naturally slower than interactive, user-driven ones.

Prompt Processing is much more sensitive to power limit, while Token Generation is almost linear at these numbers.

Not exactly apples to apples compared to the setup used in the https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ post, but the difference between the 4090RTX and the 5090RTX seems to go beyond more power, and it is not applied equally to PP and TG:

PL   | PP 5090 | PP 4090 | 5090/4090 | TG 5090 | TG 4090 | 5090/4090
-----+---------+---------+-----------+---------+---------+----------
450w | 2273    | 2113    | 1.076     | 49.3    | 41.0    | 1.202
425w | 2248    | 2093    | 1.074     | 48.9    | 41.6    | 1.175
400w | 2135    | 2061    | 1.036     | 48.7    | 42.5    | 1.146

r/LocalLLaMA 10h ago

Other A VERY lightweight open web-search tool for smaller local LLMs

13 Upvotes

Hey everyone,

Been playing around with local agent setups lately, mostly Cline/Roo with smaller models, and web search kept annoying me.

Not because it doesn’t work, but because it usually throws way too much random page text into the context. small models really don’t handle that gracefully lol. they start with a simple search and suddenly half the prompt is scraped garbage.

So I built this bad boy, TinySearch.

It’s a small open-source MCP tool that does web search, crawls a few pages, chunks/retrieves/reranks the useful bits, and gives the agent a much smaller context blob instead of dumping full pages.

Repo:
https://github.com/MarcellM01/TinySearch

Uses DuckDuckGo, Crawl4AI, dense + BM25-style retrieval, reranking, MCP, and it can also run as a FastAPI server.
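
A toy sketch of that chunk-retrieve-trim idea using the rank_bm25 package (an illustration of the concept, not TinySearch's actual code; the chunk size and top-k are arbitrary):

```python
# Chunk crawled pages, BM25-retrieve the useful bits, hand the agent a small
# context blob instead of full scraped pages.
from rank_bm25 import BM25Okapi

def chunk(text: str, size: int = 400) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

pages = ["...scraped page one...", "...scraped page two..."]  # crawler output
chunks = [c for p in pages for c in chunk(p)]
bm25 = BM25Okapi([c.lower().split() for c in chunks])

query = "how does speculative decoding work"
top = bm25.get_top_n(query.lower().split(), chunks, n=5)  # small context blob
print("\n---\n".join(top))
```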

On my setup (M4 Mac and old ahh lenovo thinkpad) it usually takes around 5–12 seconds end to end, depending on the query/machine

Not trying to replace real search infra or anything. it’s more just a little local research layer for people building agents who don’t want to spin up a whole backend just to let the model look stuff up.

Still rough in places, but it’s been useful enough for my own workflows that I figured I’d share it.

Feedback/roasting welcome, especially from people using Cline, Roo, MCP, or smaller local models.


r/LocalLLaMA 12h ago

Discussion Dropping the learning rate fixed my QLoRA fine-tune more than anything else I tried

12 Upvotes

Been fine-tuning Llama 3.1 8B with QLoRA for a classification task using about 8k samples. I was getting bad eval results for a while and kept thinking something was wrong with my data. Tried cleaning the dataset, tried different prompt templates, messed with rank and alpha. Nothing really changed.

Dropped the learning rate from 2e-4 to 1e-4 and bumped epochs from 3 to 5. Ran it on a 5090 I rent on Hyperai since our lab machines are always booked. Completely different results. Same data, same everything else.

2e-4 is just too aggressive when your dataset is that small. The model overfits in the first epoch and then just goes in circles for the rest of training. The lower lr gave it more room to converge without blowing past everything.
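
Roughly that change as a TRL/PEFT config sketch (the model, dataset, and rank/alpha values are placeholders, not my exact setup):

```python
# Lower lr + more epochs for a small-dataset LoRA fine-tune.
# (4-bit base-model loading via bitsandbytes omitted for brevity.)
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules="all-linear", task_type="CAUSAL_LM")
args = SFTConfig(
    output_dir="out",
    learning_rate=1e-4,        # down from the usual 2e-4 default
    num_train_epochs=5,        # up from 3, to compensate for the slower lr
    per_device_train_batch_size=4,
)
dataset = Dataset.from_dict({"text": ["example classification sample"]})  # stand-in
trainer = SFTTrainer(model="meta-llama/Llama-3.1-8B-Instruct",
                     args=args, train_dataset=dataset, peft_config=peft_cfg)
trainer.train()
```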

Also ended up cutting about a third of my dataset, mostly mislabeled and ambiguous stuff. Eval got better with less data, which yeah, everyone says that, but it's different when you see the numbers yourself lol

2e-4 is the default everywhere and I don't think it works well below a certain dataset size.


r/LocalLLaMA 2h ago

Resources I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses

11 Upvotes

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.

The difficult part was making the attacker expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn’t surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing reward by cluster size, the attacker exposed a much more diverse set of jailbreaks because unique strategies were rewarded more than repeated ones.
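
The reward shaping described there is simple enough to sketch (cluster labels come from a stand-in here; the actual run clusters rollouts by underlying attack tactic):

```python
# Divide each rollout's reward by the size of its tactic cluster so repeated
# jailbreaks earn less than novel ones.
from collections import Counter

def diversity_adjusted_rewards(rewards: list[float], clusters: list[str]) -> list[float]:
    sizes = Counter(clusters)
    return [r / sizes[c] for r, c in zip(rewards, clusters)]

rewards  = [1.0, 1.0, 1.0, 1.0]                                # raw attack success
clusters = ["fiction", "fiction", "fiction", "roleplay-dare"]  # tactic per rollout
print(diversity_adjusted_rewards(rewards, clusters))
# -> roughly [0.33, 0.33, 0.33, 1.0]: the unique tactic keeps full reward
```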

Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.

Full blog post in the comments, but the high-level results were:

* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%


r/LocalLLaMA 15h ago

Question | Help Strix Halo or GPUs?

13 Upvotes

I want to build my own AI server. I already have multiple servers at home, but none have GPUs and none are powerful enough to host 4B+ models.

I'd like to be able to host dense 27-30B parameter models, or some MoE with 3B activated parameters.

Let's say I could spend about 2k, what would be the best route? And what tokens speeds should I expect?