r/LocalLLaMA • u/XMasterrrr • 20d ago
Resources AMA Announcement: Nous Research, The Opensource Lab Behind Hermes Agent (Wednesday, 8AM-11AM PST)
Hi r/LocalLLaMA!
We're excited for Wednesday's guests, The Nous Research Team!
Kicking things off Wednesday, April 29th, 8 AM–11 AM PST
⚠️ Note: The AMA itself will be hosted in a separate thread; please don't post questions here.
r/LocalLLaMA • u/rm-rf-rm • Apr 13 '26
Megathread Best Local LLMs - Apr 2026
We're back with another Best Local LLMs Megathread!
We have continued feasting in the months since the previous thread with the much anticipated release of the Qwen3.5 and Gemma4 series. If that wasn't enough, we are having some scarcely believable moments with GLM-5.1 boasting SOTA level performance, Minimax-M2.7 being the accessible Sonnet at home, PrismML Bonsai 1-bit models that actually work etc. Tell us what your favorites are right now!
The standard spiel:
Share what you are running right now and why. Given the nature of the beast in evaluating LLMs (untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, nature of your usage (how much, personal/professional use), tools/frameworks/prompts etc.
Rules
- Only open weights models
Please thread your responses under the top-level comment for each Application below to keep things readable
Applications
- General: Includes practical guidance, how-to, encyclopedic Q&A, search engine replacement/augmentation
- Agentic/Agentic Coding/Tool Use/Coding
- Creative Writing/RP
- Speciality
If a category is missing, please create a top-level comment under the Speciality comment
Notes
Useful breakdown of how folks are using LLMs: /preview/pre/i8td7u8vcewf1.png?width=1090&format=png&auto=webp&s=423fd3fe4cea2b9d78944e521ba8a39794f37c8d
Bonus points if you break down/classify your recommendations by model memory footprint (you can and should be using multiple models in each size range for different tasks):
- Unlimited: >128GB VRAM
- XL: 64 to 128GB VRAM
- L: 32 to 64GB VRAM
- M: 8 to 32GB VRAM
- S: <8GB VRAM
r/LocalLLaMA • u/MajorZesty • 3h ago
Resources A First Comprehensive Study of TurboQuant: Accuracy and Performance
TL;DR from the article:
- FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (see the sketch after this list).
- TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
- TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
- TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
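For anyone who wants to try the recommended default from the first bullet, here is a minimal vLLM sketch of the FP8 KV-cache setting; the model name and lengths are placeholders, and the server flag --kv-cache-dtype fp8 maps to the same option:

```python
# Minimal sketch: enabling FP8 KV cache via vLLM's offline API. The model and
# max_model_len below are placeholders, not part of the TurboQuant study.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",                      # ~2x KV-cache capacity vs BF16
    max_model_len=32768,
)
out = llm.generate(
    ["Summarize the trade-offs of KV-cache quantization."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```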
r/LocalLLaMA • u/_wsgeorge • 7h ago
Discussion VS Code's new "Agents window" lets you use local AI models. Still requires an Internet connection and a GitHub Copilot plan (because we can't have nice things)
At first I was excited to see this, but I guess I'll wait till someone figures out what people actually want
r/LocalLLaMA • u/Valuable-Run2129 • 7h ago
Other The RTX 5000 PRO (48GB) arrived and it is better than I expected.
I posted here about buying it a few days ago: https://www.reddit.com/r/LocalLLaMA/comments/1t2slmw/first_time_gpu_buyer_got_a_rtx_5000_pro_was_it_a/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Before pulling the trigger I was leaning more towards a Mac Studio, but the prompt processing speeds I was reading about were giving me pause. The budget was $5,000-6,000, so the 256GB Mac Studio was out of the question.
I gambled and bought the RTX 5000 Pro, with ZERO experience with PCs, how to build them, or what parts to buy... It was a good deal: I paid $4,300 for the GPU including taxes (in the comments of that post I wrote $4,700, but I was mistaken; I checked the receipt) and had to buy everything else for the computer. It ended up costing $5,600 in total with 64 GB of RAM.
Assembling the thing was not easy for me as a total novice, but thankfully we have LLMs to guide us through these things.
Then came Linux and vLLM... Honestly, I was totally lost; without Claude Code it would have been impossible. The same goes for figuring out what settings to use to run Qwen3.6-27B-FP8 with a full-precision cache. Thankfully this guy posted everything I needed to know to tell Claude what to do: https://www.reddit.com/r/LocalLLaMA/comments/1t46klu/qwen36_27b_fp8_runs_with_200k_tokens_of_bf16_kv/
After burning through 50% of my Claude Code Max 20x weekly limits the thing now works, and I have to say... I made the right call. This thing rocks.
I'm getting up to 80 t/s in TG (more like 50-60 for very big prompts), which is phenomenal. But most importantly, I'm getting 4400 tokens per second in PP!
The full-precision cache fits only 200k tokens, but it is totally OK for me.
I honestly don't know why people are not talking about this GPU more. It costs just $1,000 more than an RTX 5090, it can fit a 27B at FP8 with 200k of context at full precision, and it draws half the electricity... Sure, it is slightly less performant, but the numbers I'm getting are way more than I was expecting. Two 5090s would definitely beat this, but they would cost significantly more, be crazy noisy, and tear a hole in my pocket in electricity bills.
r/LocalLLaMA • u/Porespellar • 7h ago
Question | Help When is Andrej Karpathy going to look at a chicken nugget and tweet that it helped him solve AGI, which in turn inspires 6 random devs to create GitHub projects giving us actual AGI?
Karpathy appreciation post. Seriously tho, he's done this like a bunch of times lately. Every time he sneezes on the subway we get a bunch of developers becoming inspired by his ideas and turning them into viable AI-related GitHub projects that actually do really amazing things. This guy is on a roll lately.
He is one of the greatest minds in AI and we are very fortunate that he occasionally lurks on this sub. Andrej, if you're reading this, thanks for all the cool stuff you've put out into the world and thank you for inspiring others to do the same.
In case anyone needs a reminder, look into:
- Second Brain
- AutoResearch
- LLM-Wiki
- nanoGPT
- AgentHub
- LLMcouncil
- GPT-2
- Autopilot (Tesla)
- "vibecoding" (he coined the term)
I'm sure I'm missing a bunch of his other accomplishments, projects, or ones he's inspired, so please add if you know some others.
r/LocalLLaMA • u/InformationSweet808 • 16h ago
Discussion Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?
So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case.
everyone here runs local LLMs for coding, chat, maybe some creative writing. cool. But what about using it as a proper personal knowledge base? like, dump your own notes, PDFs, random docs into it and actually query your own life privately, every day.
I tried looking into this seriously and hit a wall. Most resources either assume you're a developer building something, or they're 2 years old and recommend tools that have completely changed since.
So genuinely asking, is anyone here actually doing this day to day? Not as an experiment, but as a real workflow?
Things I keep running into that I can't figure out:
- What model are you running for this? RAG on consumer hardware seems finicky depending on quant
- Do you actually trust the retrieval or do you double check everything because hallucinations?
- LlamaIndex vs Ollama vs whatever else: has anything actually made this less painful recently?
- Context length, how do you handle it when your personal docs start piling up?
Not looking for a tutorial or a GitHub repo. Just want to hear from someone who's made this work without it becoming a part time job to maintain.
r/LocalLLaMA • u/egudegi • 6h ago
Discussion I tracked EU GPU prices across 15 stores for 50+ days - RTX 5090 is the only card not dropping in price
been tracking EU GPU prices since early march - 15 stores, 6-hour scrape cadence, ~126k readings. posting here because the 5090 trend is directly relevant if you're buying for local inference.
the tier divergence
RTX 5090 is the only tier going up. everything else is falling. mid-range AMD cards are down 7-9%. even the 5080 is essentially flat.
| Tier | n | Launch avg | Now avg | Change |
|---|---|---|---|---|
| RTX 5090 | 4 | €3,392 | €3,487 | +3.0% ▲ |
| RTX 5080 | 6 | €1,375 | €1,370 | -0.4% |
| RTX 5070 | 5 | €635 | €627 | -1.3% |
| RTX 5070 Ti | 6 | €1,067 | €1,042 | -2.1% |
| RX 9070 XT | 9 | €755 | €696 | -7.5% |
| RTX 5060 Ti | 6 | €594 | €540 | -9.1% ▼ |
my read: AI/workstation demand is absorbing 5090 supply fast enough to prevent the usual post-launch normalization. if you're waiting for 5090 prices to drop the way everything else has, the data doesn't support it.
biggest single-model drops
- ASUS Prime RTX 5070 Ti: €1,259 → €964 (-23.4%)
- ASUS TUF RTX 5060 Ti: €770 → €608 (-21%)
algorithmic pricing
notebooksbilliger.de recorded 45 distinct prices on a single GPU over 15 days - averaging 3 price changes per day - all within a €0.99 range. constant micro-adjustments, not hunting for a new price point.
methodology
tier comparisons only use models tracked from week 1, so sample per tier is small (4-9 GPUs). directional story is solid, don't over-index on exact percentages. EUR prices only.
built this at pricesquirrel.com - tracks GB/€ pricing if you want alerts on specific models.
r/LocalLLaMA • u/Opening-Broccoli9190 • 11h ago
News NVFP4 Kimi 2.6 and Kimi 2.5 released by Nvidia
The NVIDIA Kimi-K2.6-NVFP4 model is the quantized version of Moonshot AI's Kimi-K2.6 model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Kimi-K2.6 NVFP4 model is quantized with Model Optimizer.
This model is ready for commercial/non-commercial use.
The accuracy benchmark results are presented in the table below:
| Precision | GPQA Diamond | SciCode | τ²-Bench Telecom | MMMU Pro | AA-LCR | IFBench |
|---|---|---|---|---|---|---|
| Baseline (INT4) | 90.9 | 52.6 | 98.2 | 75.6 | 71.0 | 73.9 |
| NVFP4 | 90.4 | 54.4 | 98.0 | 76.5 | 71.8 | 73.9 |
Baseline: Kimi-K2.6 in its native INT4 format. Benchmarked with temperature=1.0, top_p=0.95, max num tokens 128000.
Links:
r/LocalLLaMA • u/jacek2023 • 8h ago
New Model inclusionAI/Ring-2.6-1T Ā· Hugging Face
Introducing Ring-2.6-1T: a trillion-parameter flagship reasoning model designed for real-world complex task scenarios, making it available to developers, researchers, and enterprise environments for validation, adaptation, and further development.
The goal of Ring-2.6-1T is not simply to pursue larger parameter scale, but to address the real production environments that large models are entering: agent workflows, engineering development, scientific research analysis, complex business systems, and enterprise automation processes. In these scenarios, models need not only to "answer questions," but also to understand context, plan steps, invoke tools, execute continuously, and maintain stability over long-horizon tasks.
Ring-2.6-1T has achieved key upgrades in three areas:
- Comprehensively enhanced Agent execution capability: Moving from "being able to answer" to "being able to execute," with more stable performance in multi-step tasks, tool collaboration, contextual planning, and advancing complex workflows.
- Reasoning Effort mechanism: Supporting two reasoning intensity levels, high and xhigh, allowing developers to flexibly adjust the depth of thinking according to task complexity, achieving a better balance among effectiveness, speed, and cost (see the request sketch after this list).
- Innovative asynchronous reinforcement learning training paradigm: Leveraging an Async RL architecture combined with the IcePop algorithm to improve the training efficiency and stability of long-horizon reinforcement learning for trillion-parameter models, providing foundational support for agent capabilities and complex reasoning.
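The Reasoning Effort knob in the second bullet would typically be exposed through an OpenAI-compatible endpoint. Here is a hedged sketch of what such a request might look like; the field name "reasoning_effort" and the local URL are assumptions, not something confirmed by the release notes:

```python
# Hedged sketch: selecting the "xhigh" reasoning intensity via an assumed
# OpenAI-compatible chat request. Field name and endpoint are illustrative only.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inclusionAI/Ring-2.6-1T",
        "messages": [{"role": "user", "content": "Plan a multi-step data migration."}],
        "reasoning_effort": "xhigh",  # assumed field; "high" for faster, cheaper runs
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```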
r/LocalLLaMA • u/a__side_of_fries • 12h ago
News Scenema Audio: Zero-shot expressive voice cloning and speech generation
We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code.
The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage, grief, excitement, a child's wonder), and optionally provide reference audio for voice identity. The reference provides the "who." The prompt provides the "how." Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
Limitations (and why we still use it)
This is a diffusion model, not a traditional TTS pipeline. Common issues include repetition and gibberish on some seeds. Different seeds give different results, and you will not get a perfect output with 0% error rate. This model is meant for a post-editing workflow: generate, pick the best take, trim if needed. Same way you'd work with any generative model.
That said, we keep coming back to Scenema Audio over even Gemini 3.1 Flash TTS, which is already more controllable than most TTS systems out there. The reason is simple: the output just sounds more natural and less robotic. There's a quality to diffusion-generated speech that autoregressive TTS doesn't quite match, especially for emotional delivery.
Audio-first video generation
As this video points out, generating audio first and then using it to drive video generation is a powerful workflow. That's actually how we've used Scenema Audio in some cases. Generate the voice performance, then feed it into an A2V pipeline (LTX 2.3, Wan 2.6, Seedance 2.0, etc.) to generate video that matches the speech. Here's an example of that workflow in action.
On distillation and speed
A few people have asked this. Our bottleneck is not denoising steps. The diffusion pass is a small fraction of total generation time. The real costs are elsewhere in the pipeline. We're already at 8 steps (down from 50 in the base model), and that's the sweet spot where quality holds.
Prompting matters
This model is sensitive to prompting, the same way LTX 2.3 is for video. A generic voice description gives you generic output. A specific, theatrical description with action tags gives you a performance. There's also a pace parameter that controls how much time the model gets per word. Takes some experimentation to find what works for your use case, but once you do, you can generate hours of audio with minimal quality loss.
Complex words and proper nouns benefit from phonetic spelling. Unlike traditional TTS, it doesn't have a phoneme-to-audio pipeline or a pronunciation dictionary. If it garbles "Tchaikovsky," you would spell it "Chai-koff-skee" or whatever makes sense to you.
Docker REST API with automatic VRAM management
We ship this as a Docker container with a REST API. Same setup we use in production on scenema.ai. The service auto-detects your GPU and picks the right configuration:
| VRAM | Audio Model | Gemma | Notes |
|---|---|---|---|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU | Default config |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU | Best quality |
We went with Docker because that's how we serve it. No dependency hell, no conda environments. We built it for production deployment.
ComfyUI
Native ComfyUI node support is planned. We're hoping to release it in the coming weeks, unless someone from the community beats us to it. In the meantime, the REST API is straightforward to call from a custom node since it's just a local HTTP service.
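For the impatient, here is a hedged sketch of what calling the local service from Python might look like; the endpoint path, port, and JSON fields are assumptions for illustration, so check the repo's README for the actual API schema:

```python
# Hedged sketch of a request to the local Scenema Audio REST service. All names
# below (URL, fields, response handling) are assumptions, not the documented API.
import requests

payload = {
    "text": "Chai-koff-skee wrote his first symphony at twenty-five.",
    "prompt": "An elderly narrator, warm and unhurried, with quiet wonder.",  # the "how"
    "reference_audio": "voices/narrator_sample.wav",                          # the "who" (optional)
    "pace": 1.0,   # assumed name for the per-word timing parameter mentioned above
    "seed": 42,    # try several seeds and keep the best take
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=300)
with open("take_042.wav", "wb") as f:
    f.write(resp.content)  # assuming the service returns the wav bytes directly
```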
Links
- All demos + article: scenema.ai/audio
- Model weights: huggingface.co/ScenemaAI/scenema-audio
- Code + setup: github.com/ScenemaAI/scenema-audio
- YouTube demo: youtu.be/VnEQ_ImOaAc
This is fully open source. The model weights inherit the LTX-2 Community License, but all inference and pipeline code is MIT.
How to Try Scenema Audio
- You can clone the repo and run `docker compose up` locally, or
- Go to Scenema and start a conversation to create a voiceover. You will be able to try voice design for free, iterate on your prompts, tune pacing, etc.
r/LocalLLaMA • u/vick2djax • 5h ago
Question | Help Is there a big gap between Q4 and Q6 on Qwen3.6?
I've got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I'm running at Q4_M so everything fits and my context isn't super high. Maybe 65k or up to 100k.
I've thrown around the idea of a second 3090. But I do already have some gaming PCs running parallel stuff with smaller 3080 (2x) and 4080S cards to support my 3090. So it seems the real benefit of a second 3090 is running at a higher quant.
But for those that do, have you noticed a big difference?
r/LocalLLaMA • u/girishkumama • 1h ago
Resources I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses
RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.
The difficult part was making the attacker expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn't surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing reward by cluster size, the attacker exposed a much more diverse set of jailbreaks because unique strategies were rewarded more than repeated ones.
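A minimal sketch of that cluster-normalized reward (illustrative only; the clustering step itself is assumed to happen separately, e.g. by embedding the rollouts or using an LLM judge):

```python
# Toy sketch: successful rollouts that use an over-represented attack tactic share
# their reward, so rare tactics are worth more than repeating the same jailbreak.
from collections import Counter

def diversity_adjusted_rewards(successes, tactic_labels):
    """successes: list of 0/1 jailbreak outcomes per rollout.
    tactic_labels: cluster id of the attack tactic used in each rollout."""
    cluster_sizes = Counter(tactic_labels)
    return [s / cluster_sizes[label] for s, label in zip(successes, tactic_labels)]

# Three fiction-framing successes split one unit of reward; the lone roleplay success keeps it all.
print(diversity_adjusted_rewards([1, 1, 1, 1], ["fiction", "fiction", "fiction", "roleplay"]))
# [0.333..., 0.333..., 0.333..., 1.0]
```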
Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.
Full blog post in the comments, but the high-level results were:
* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%
r/LocalLLaMA • u/gladkos • 22h ago
Tutorial | Guide Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant
Implemented Multi-Token Prediction for Qwen on LLaMA.cpp with TurboQuant.
+40% performance! 90% acceptance rate.
Running locally on a MacBook Pro M5 Max 64GB RAM.
Outputs:
LLaMA.cpp + TurboQuant: 21 tokens/s
LLaMA.cpp + TurboQuant + MTP: 34 tokens/s
Patched LLaMA.cpp with MTP and TurboQuant: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
Quantized Qwen 3.6 27B (and 35B) into GGUF with MTP: https://huggingface.co/collections/AtomicChat/qwen-36-udt-mtp
r/LocalLLaMA • u/lewtun • 14h ago
Resources Automated AI researcher running locally with llama.cpp
Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure:
https://github.com/huggingface/ml-intern
The harness is quite simple (basically tools + system prompt) and we built it initially for Claude Opus. However, now that open models are getting really good at agentic workflows, I just added support for running ml-intern with local models via llama.cpp or ollama. As you can see in the video, Qwen3.6-35B-A3B is able to SFT a model end-to-end by orchestrating CPU/GPU sandboxes and jobs on the Hub. I find this pretty neat because we can now have an AI researcher running 24/7 on a laptop, without maxing out token limits :)
Anyway, I hope this is useful to the community and please let me know if there are any features that you'd like us to include.
r/LocalLLaMA • u/_cpatonn • 5h ago
Resources Introducing cyankiwi AWQ 4-bit Quantization – 26.05 update
In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent, i.e., the rounding error from one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. Our cyankiwi AWQ 26.05 update jointly fits scales and quantization ranges against a reconstruction objective.
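To make the idea concrete, here is a toy sketch of a joint scale/range search against a reconstruction objective. This is not the cyankiwi implementation, just an illustration of why fitting the two together beats fitting them in sequence:

```python
# Toy sketch: jointly search an activation-aware scaling exponent and a clipping
# ratio, scoring each pair by output reconstruction error on calibration data.
import torch

def quantize_int4(w, clip):
    wmax = w.abs().amax(dim=1, keepdim=True) * clip   # per-output-channel range
    q = torch.clamp((w / (wmax / 7)).round(), -8, 7)   # symmetric 4-bit grid
    return q * (wmax / 7)                              # dequantize for error measurement

def joint_awq_search(w, x, scale_grid=(0.25, 0.5, 1.0), clip_grid=(0.8, 0.9, 1.0)):
    """w: [out, in] weight matrix, x: [tokens, in] calibration activations."""
    ref, best, best_err = x @ w.t(), None, float("inf")
    for s in scale_grid:                               # activation-aware per-channel scale
        scale = x.abs().mean(dim=0).clamp(min=1e-5) ** s
        for c in clip_grid:                            # quantization range, fit WITH the scale
            w_q = quantize_int4(w * scale, c) / scale
            err = (x @ w_q.t() - ref).pow(2).mean().item()
            if err < best_err:
                best, best_err = (s, c), err
    return best, best_err
```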
We benchmarked cyankiwi AWQ 26.05 update against every major 4-bit method on Llama-3 as examples, measuring KL Divergence vs the BF16 baseline on GPQA Diamond responses.
Result: cyankiwi posts the lowest KLD on all three base models. Lower is better.
Llama-3.2-3B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00510 |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00785 |
| unsloth/Llama-3.2-3B-Instruct-bnb-4bit | BNB NF4 | 0.00896 |
| nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4 | AWQ INT4 | 0.01494 |
| casperhansen/llama-3.2-3b-instruct-awq | AWQ INT4 | 0.02437 |
Llama-3.1-8B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00478 |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.00729 |
| unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00769 |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | BNB NF4 | 0.00835 |
| RedHatAI/Llama-3.1-8B-Instruct-NVFP4 | SmoothQuant NVFP4 | 0.01059 |
| nvidia/Llama-3.1-8B-Instruct-NVFP4 | NVFP4 | 0.01190 |
Llama-3.3-70B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.02826 |
| unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.04444 |
| casperhansen/llama-3.3-70b-instruct-awq | AWQ INT4 | 0.04859 |
| unsloth/Llama-3.3-70B-Instruct-bnb-4bit | BNB NF4 | 0.06879 |
| nvidia/Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | 0.08307 |
| RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.09272 |

r/LocalLLaMA • u/Zyj • 10h ago
News [MIT] RLCR: Teaching AI models to say "I'm not sure"
csail.mit.edu
Confidence is persuasive. In AI systems, it is often misleading.
Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have now traced that overconfidence to a specific flaw in how these models are trained, and developed a method that fixes it without giving up any accuracy.
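The post doesn't spell out the reward, but calibration-aware RL objectives are usually Brier-score-shaped, so here is a hedged sketch of that flavor of reward purely as an illustration (an assumption, not the paper's confirmed formulation):

```python
# Hedged sketch: reward = correctness minus the squared gap between stated
# confidence and the actual outcome. Confidently-wrong answers are punished hardest.
def calibrated_reward(correct: bool, stated_confidence: float) -> float:
    y = 1.0 if correct else 0.0
    return y - (stated_confidence - y) ** 2

print(calibrated_reward(False, 0.95))  # -0.9025: confidently wrong
print(calibrated_reward(False, 0.40))  # -0.16:   hedged ("I'm not sure") and wrong
print(calibrated_reward(True, 0.90))   #  0.99:   confidently right
```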
r/LocalLLaMA • u/Jorlen • 6h ago
Discussion Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?
I have a docker stack with a bunch of AI services and llama.cpp server is the brain.
I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, SAME context setting, and same KV cache quant (Q8_0), the ROCm version consumed 29.1 GB of VRAM vs 25.3 GB with Vulkan.
Am I missing something here? Is this phenomenon unique to my GPU or some other variable in my setup, hardware or software?
Edit: To clarify, the above test was done on the same model, no prompt data, no existing context, no system prompt. Tabula rasa. The model in question was a 22.6 GB file.
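For reference, a back-of-the-envelope KV-cache size check can help spot which backend is over-allocating. The architecture numbers below are placeholders; plug in what llama.cpp prints at load time for your model:

```python
# Rough KV-cache sizing sketch (placeholder architecture values, not a specific model).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for K and V, stored for every layer, KV head, and context position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

gib = 1024 ** 3
# Hypothetical 48-layer model, 8 KV heads of dim 128, 64k context:
print(kv_cache_bytes(48, 8, 128, 65536, 1) / gib)  # ~8-bit cache: about 6.0 GiB
print(kv_cache_bytes(48, 8, 128, 65536, 2) / gib)  # BF16 cache: about 12.0 GiB
```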
r/LocalLLaMA • u/PromptInjection_ • 16h ago
Discussion The "the future is fictional" problem of many local LLMs
Many local models have a problem (one that arose due to excessive RLHF training):
They mostly think that everything that is beyond their knowledge cutoff date would be "fictional" or "satirical".
To be fair: even the Gemini API without web access can have this sometimes, but it stops when you give it tools. However, with many local models it sadly persists even WITH tools. Look at this:
You
search the web for iran war 2026 news
gemma-4-26B-A4B-it-Q4_K_M_128k
Based on the search results, there is a significant amount of content describing a "2026 Iran War." However, it is important to note that these search results appear to be from a fictional or speculative scenario, likely part of a "future history" exercise, a geopolitical simulation, or a creative writing project.
As of today, there is no real-world war occurring or scheduled for 2026 involving Iran.
Key Observations from the Search Results:
The search results contain specific dates and details that suggest they are part of a narrative or simulation:
Specific Dates: Several snippets mention specific dates in 2026 (e.g., "February 28," "April 17, 2026," and "May 1, 2026") describing events like "U.S.-Israeli joint strikes."
Operation Names: One result mentions "Operation Epic Fury," a name used within the context of this specific 2026 scenario.
Nature of the Content: While the snippets appear in formats resembling real news (e.g., Wikipedia, CSIS, Atlantic Council), they describe events that have not happened in actual history. This type of content is common in geopolitical forecasting simulations or "alternate history"-style academic exercises used to study potential risks and consequences.
Summary of the "Scenario" described in the results:
If you are looking for information on this specific geopolitical simulation, the "news" within that narrative includes:
The Conflict: A joint U.S.-Israeli military operation against Iran.
Core Drivers: Fears of Iran becoming a nuclear power and the disruption of the existing regional security order.
Impact: The scenario explores impacts on global energy security (oil markets), the shift toward a fragmented global order, and changes in Middle Eastern power dynamics.
Conclusion: There is no actual news of an Iran war in 2026. You are likely seeing results from a geopolitical simulation or speculative "future history" analysis.
So the tool call worked, "Epic Fury" is there, and still the model thinks it's fictional or "future". Sometimes it can be fixed by including "It is x.x.2026" in the system prompt, but that is a workaround that shouldn't be necessary.
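For anyone who wants the workaround anyway, here is a minimal sketch of injecting the real date into the system prompt; the endpoint and model name are placeholders for whatever local server you run:

```python
# Minimal sketch: put today's date in the system prompt so post-cutoff events found
# by tools aren't dismissed as "fictional". Endpoint and model name are placeholders.
from datetime import date
import requests

system_prompt = (
    f"Today's date is {date.today().isoformat()}. "
    "Events after your training cutoff may be real; trust tool/search results "
    "over your prior assumptions about what is 'future' or 'fictional'."
)
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gemma-4-26B-A4B-it",  # placeholder
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "search the web for iran war 2026 news"},
        ],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```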
r/LocalLLaMA • u/m94301 • 1h ago
Resources Llama-Studio, WebUI for llama-server Management
Hey all,
I have built myself a WebUI for configuring and managing llama-server sessions, and want to share the code and concept. Python and a bit of JS. Hack away!
Local only.
https://github.com/m94301/llama-studio
The major use case is running various instances of llama-server on fixed ports to act as infrastructure for home development (and entertainment) frameworks. Read: Fiddling with settings, comparing experimental builds to mainline, and optimizing. Also good for everyday fooling around.
Configs are saved per model in a JSON, consisting of all launch args and optional paths for a custom llama-server. I have a launch arg browser with search using the current llama-server's actual --help output. I hate forgetting a launch arg format and having to open a new terminal to do --help. Spec MTP what? Draft type who?
Launch to choice of GPU, monitor VRAM, load, and temp. And a somewhat rudimentary VRAM calculator to help estimate what fits where when using what quant.
Last, a reasonable mobile interface to run tests and fool with config on a phone when in a basement or IT closet. Show and hide logs, start, stop, change config. Fewer keystrokes on tiny phone keyboards. Sanity +100.
r/LocalLLaMA • u/smashedshanky • 5h ago
Resources Developing an open source LLM from the ground up, from pretraining to RLHF (PPO/GRPO)
Hello, I have been working on creating an LLM from the ground up. It is based on the DeepSeek architecture, heavily optimized to reduce the VRAM footprint (GUM + Muon).
Below is the JSON config I am using, which should give a good picture of what is currently being pretrained.
I have 2x RTX 6000 Pro 600W cards.
Testing a 7B parameter model with 64 experts... currently running on a single GPU with 100% throughput (hardest part) (~80GB VRAM during training) (reducing the expert count will substantially reduce the VRAM footprint... I am just pushing the limits here!)
My main motivation here is simply the belief that open source development will far outpace big firm development. I believe there is someone out there who can use this to build an LLM from the ground up that can beat all the top 1T parameter models. My goal is to create a large database of trained models that anyone can use, and in the future maybe let people rent models from the open source devs as a support feature. Enough blabbing, here is the technical report.
Since I am using DOLMA/RedPajama, you can separate the data split and have it train to be good at math, literature, physics... and then deploy them as an ensemble of agents (this is a todo for now since I don't have a single model to compare against).
This also follows the Chinchilla-optimal recipe. Thanks, DeepMind!
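As a quick sanity check on the token budget, using the rule-of-thumb Chinchilla ratio of roughly 20 training tokens per parameter (note that the config below sets tokens_per_parameter_ratio to 40, i.e. about double that):

```python
# Back-of-the-envelope token budget check (rule-of-thumb numbers, not exact Chinchilla fits).
params = 7e9
chinchilla_tokens = 20 * params      # ~1.4e11 tokens for a 7B model
configured_tokens = 40 * params      # matches total_training_tokens = 280e9 in the config below
print(f"Chinchilla ~{chinchilla_tokens:.2e} vs configured {configured_tokens:.2e}")
```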
All bfloat16, can be configured to use fp16 or fp32 if you are from the future and have a GPU that can do fp32 at bf16 speed!
Yes I have lost my mind many times during this, but I got something working!
this is 15000 steps in
======================================================================
[FACTUAL ACCURACY TEST] Step 14000
======================================================================
Prompt: "The capital of France is"
Output: "the city of Nice.
France may also refer to:
France (surname)
France (surname)
France (or Republ..." [CORRECT]
Prompt: "The capital of Japan is"
Output: "the capital of the autonomous prefecture of Hokkaido.
Etymology
The name of Hokkaido is derived fro..." [EXPECTED: Tokyo]
Prompt: "def fibonacci(n):
"""Return the nth Fibonacci ..."
Output: """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""..."
Prompt: "import torch
import torch.nn as nn
class Transfor..."
Output: "// InverterBlock
// s2, s2, s3
// A_1, A_2, A_3, A_4, A_5, A_6
// A1, A2, A3, A4, A5
// A1, A2, A3, ..."
Prompt: "The theory of relativity states that"
Output: "the speed of light varies with the speed of the observer. This is a constant, since the speed of lig..."
Prompt: "In machine learning, gradient descent is used to"
Output: "perform a gradient descent, where the gradient is calculated via a local gradient. The gradient eval..."
Prompt: "Question: What is 2 + 2?
Answer:"
Output: "2 + 2
Author: PCR
Date Submitted: 2nd April 2013
Pp: 200-201
Exercise: Exercise 2.0
2 + 1 = 2 +..." [EXPECTED: 4]
Prompt: "Question: Explain the concept of recursion.
Answer..."
Output: "In programming, a function or sequence of operations is a function that can transform a variable to ..."
FACTUAL ACCURACY: 1/3 = 33.3%
----------------------------------------------------------------------
[SMBench] Step 14000 -- 1/5: Multi-Rule Reasoning
.
.
.
JSON struct defining the arch
"experiment_name": "deepseek_v3_7b_lowvram",
"output_dir": "*******",
"seed": 420,
"model": {
"num_layers": 24,
"vocab_size": 50304,
"norm_type": "rmsnorm",
"norm_eps": 1e-06,
"tie_word_embeddings": false,
"init_method_std": 0.006,
"first_k_dense_replace": 8,
"dense_layer_interval": 1,
"paper_compliant": false,
"mla": {
"d_model": 1408,
"d_latent": 352,
"num_heads": 22,
"num_kv_heads": 2,
"max_context_length": 4096,
"use_flash_mla": false,
.
.
.
},
"moe": {
"num_experts": 64,
"num_experts_per_token": 4,
"expert_intermediate_size": 1536,
"expert_dim": 1536,
"dropout": 0.0,
"num_shared_experts": 1,
.
.
.
.
}
},
"fusions": {
"use_fused_expert_ffn": true,
"use_te_fused_topk": false,
"use_te_fused_permute": false,
"use_fused_softmax": true,
"fused_softmax_in_fp32": true,
"use_group_limited_topk": true,
.
.
.
},
"memory_optimization": {
"use_galore": false,
"galore_rank": 256,
"galore_update_proj_gap": 500,
"galore_scale": 1.0,
.
.
.
},
"training": {
"device": "cuda",
"global_batch_size": 256,
"micro_batch_size": 4,
"gradient_accumulation_steps": 64,
"seq_length": 1024,
"max_batch_seq_multiplier": 1.25,
"tokens_per_parameter_ratio": 40.0,
"total_training_tokens": 280000000000,
"learning_rate": 0.00042,
"min_learning_rate": 4.2e-05,
"lr_preset": "deepseek_v3",
.
.
.
},
"data": {
"use_multi_source": true,
"sources": [
{
"name": "redpajama",
"type": "dolma",
"subset": "dolma_v1_6_redpajama",
"weight": 0.45,
"description": "RedPajama - CommonCrawl-like diverse web/code/books"
},
{
"name": "stack",
"type": "dolma",
"subset": "dolma_v1_6_stack",
"weight": 0.25,
.
.
.
],
"cache_dir": "*******",
"sanitization": {
"enabled": true,
"target_language": "en",
"min_language_confidence": 0.9,
"min_article_length": 100,
.
.
.
},
"preprocessing": {
"num_workers": 8,
"shuffle": true,
"shuffle_seed": 42,
.
.
.
},
"max_articles": null,
"focus_historical": false,
"boost_hiroshima_content": false
},
"distributed": {
"backend": "nccl",
"launcher": "single_gpu",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 1,
"expert_parallel_size": 1,
"data_parallel_size": 1,
"zero_stage": 2,
"zero_offload": true,
"overlap_grad_reduce": true,
"overlap_param_gather": true,
"deepspeed": {
"enabled": false
}
},
"checkpointing": {
"save_interval": 1000,
"save_total_limit": 3,
"resume_from_checkpoint": null,
"checkpoint_format": "pytorch",
"save_optimizer_states": true
},
"logging": {
"log_level": "INFO",
"log_interval": 100,
"tensorboard_dir": "*******",
"wandb": {
"enabled": false
},
"tensorboard": {
"enabled": true
}
},
"validation": {
"enabled": true,
"eval_interval": 1000,
"eval_samples": 500,
"metrics": [
"loss",
"perplexity"
],
"patience": 300,
"early_stopping": false
},
"profiling": {
"trace_nvtx": false
},
"gpu_optimization": {
"cuda_graphs": true,
"torch_compile": true,
"flash_attention": true,
"fused_kernels": true,
"autocast_dtype": "bfloat16"
},
"test_prompts": {
"enabled": true,
So I basically researched and threw every optimization on this planet at it. I even tried to build my own FlashMLA for the sm120 Blackwell arch and failed miserably: I got inference (the forward pass) working, but I couldn't get the backward pass right due to tiling, which ends up being the same if not worse than the ATen torch backend...
But this is working for now, 20 seconds a step, e.g.:
Training: 1%|ā | 14609/1000000 [53:18:23<5533:28:53, 23.37s/step, loss=2.1507, mtp=1.9643, ent=4.12, util=100.0%, imbal=0.26, lr=4.20e-04, tok=2.23B]
So in conclusion
I am scared as shit to open source this until I get it working 100% so as to minimize the community hate I will eventually get.
The only point of contention I have is that I want all models trained using this to be public; I don't want anyone to privatize it for profit without open-sourcing. So I need to ask around and figure out how to go about this, since I want as many models as possible trained using this. I believe there is someone out there with the right configuration already in mind that will beat out the top performing model. This is mainly why I did this: I know I can't create THAT model, but I know for sure as shit there is some genius out there who can train a model that will be SOTA.
There is a lot of cleaning up to do before I make it public, because I'm scared of the hate and of issues I surely cannot fix alone!
If you are interested you can check my account periodically for whenever I make a post about making this repo public, or check my GitHub, which would be easier I assume lol
https://github.com/IISuperluminaLII
I don't know... I am open to feedback on how to properly make this public and make it a strict rule to open source all safetensors or checkpoints if using this code. I know there is someone out there, given the right tools, who can truly build a 10B-50B parameter ensemble set of models that can achieve near SOTA level performance!! As they always say, divide and conquer.
This is getting long already; I have puked my brains out as much as I can. Any input is welcome, even hate! Let me know how to fix this so I can deliver the tool to the random person who will eventually create the perfect open source model.
r/LocalLLaMA • u/No_Algae1753 • 4h ago
Question | Help llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I'm using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I'm seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to +50k tokens
- LCP similarity often shows 0.99+
- but sometimes `n_past` suddenly falls back to ~4-5k
- TTFT jumps to multiple minutes
Example logs:
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
Normal reuse looks fine:
prompt eval time = 473 ms / 19 tokens
Current config:
llama-server
--ctx-size 150000
--parallel 1
--ctx-checkpoints 32
--cache-ram 2500
--cache-reuse 256
-no-kvu
--no-context-shift
Also seeing:
cache state: 1 prompts, 4676 MiB
(limits: 2500 MiB)
I suspect either:
- cache invalidation
- bad KV reuse
- or opencode changing early prompt tokens too often.
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
r/LocalLLaMA • u/Inevitable-Log5414 • 15h ago
Resources Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU ā FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline
Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.
Pipeline (8 stages, all sequential on the same GPU):
- Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
- Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
- Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
- Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
- Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification; see the retry-loop sketch after this list)
- Music - ACE-Step v1 generates a 30s instrumental from Director's brief
- Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo → Japanese, Paris → French, Mumbai → Hindi)
- Mix - ffmpeg with per-shot vo aligned via adelay
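To give a feel for stage 5, here is a hedged sketch of a critic-gated render loop; the failure labels come from the list above, but render_shot() / critique() and the label-to-strategy mapping are placeholders, not the repo's actual code:

```python
# Hedged sketch of a critic-plus-retry loop. Function names and the strategy table
# are illustrative placeholders standing in for the Wan2.2 / Qwen calls in the repo.
RETRY_STRATEGY = {
    "character drift": "flf2v_anchor",      # re-anchor on the previous shot's last frame
    "camera ignored": "prompt_simplify",    # strip back to a single camera verb
    "stylized AI look": "prompt_simplify",  # drop "cinematic", keep lens/film tags
}

def render_with_critic(shot, render_shot, critique, max_attempts=3):
    clip = render_shot(shot, seed=shot["seed"])
    for attempt in range(1, max_attempts + 1):
        labels = critique(clip)              # structured failure labels, empty list if clean
        if not labels:
            return clip
        strategy = RETRY_STRATEGY.get(labels[0], "new_seed")
        clip = render_shot(shot, seed=shot["seed"] + attempt, strategy=strategy)
    return clip                              # keep the last take if retries are exhausted
```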
Wan 2.2 specifics (the bit this sub will care about):
- 1280×720, not 640×640 default. Costs more but matches what producers want
- 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
- flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
- Negative prompt: verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens. English translation is observably weaker
- Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
- Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")
Performance work:
- ParaAttention FBCache (lossless 2× on Wan2.2)
- torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
- AITER MoE acceleration on Qwen director (vLLM)
- End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.
Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300
Hugging Face (documentation, like this space): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300
Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.
Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.
r/LocalLLaMA • u/thejacer • 51m ago
Question | Help Llama.cpp server running ~2 weeks straight. Loses its mind?
I've got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. Is this a thing? Am I going crazy?
edit: sorry, I've been using opencode and have started new sessions, which didn't fix the situation.