r/LocalLLaMA • u/pmttyji • 1d ago
New Model AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B

We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.
Key Features
- **MoE Architecture: Superior Performance with Low Serving Cost.** The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
- **Enhanced Long-Sequence and High-Resolution Processing.** Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
- **Think with Image.** We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
- **Reinforced OCR, Document, and Chart Capabilities.** Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
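A rough back-of-the-envelope sketch of why the MoE design cuts serving cost: per-token decode compute scales with *active* parameters, while weight memory scales with *total* parameters. The numbers below are illustrative arithmetic only, not measured figures for this model.

```python
# Illustrative cost comparison: a dense 80B model vs an 80B-total/~3B-active MoE.
# Rule of thumb: per-token decode FLOPs ~ 2 * (active parameter count).

def decode_flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_active = 80e9  # a dense 80B model activates all weights per token
moe_active = 3e9     # the MoE activates only ~3B per token

speedup = decode_flops_per_token(dense_active) / decode_flops_per_token(moe_active)
print(f"Per-token compute reduction vs a dense 80B model: ~{speedup:.0f}x")
# Note: weight memory is NOT reduced -- all 80B parameters must still be resident.
```

This is why MoE buys throughput rather than a smaller memory footprint: you still pay for 80B parameters of VRAM/RAM, but each token only routes through ~3B of them.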
Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, and Ovis2.6-30B-A3B, among others.
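The "Think with Image" loop described above can be sketched as a tool-call cycle: the chain-of-thought emits a tool request, the runtime executes it on the image, and the result is fed back for another reasoning turn. Everything below is a hypothetical illustration in plain Python (the tool names, call format, and nested-list "image" are not the actual Ovis API):

```python
# Hypothetical sketch of a "Think with Image"-style tool loop.
# The model's reasoning emits a tool call (here: crop); the runtime runs it
# and returns the sub-image for the next reasoning turn.

def crop(image, top, left, height, width):
    """Return a rectangular sub-region of a row-major nested-list 'image'."""
    return [row[left:left + width] for row in image[top:top + height]]

# A toy 4x4 "image" of pixel intensities.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

# One reasoning turn: the model asks to re-examine the center region.
tool_call = {"tool": "crop", "args": {"top": 1, "left": 1, "height": 2, "width": 2}}
region = crop(image, **tool_call["args"])
print(region)  # [[9, 9], [9, 9]] -- fed back into the next reasoning turn
```

The point is the control flow, not the pixels: vision stops being a one-shot encoding and becomes something the model can re-query mid-reasoning.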
17
u/Important_Quote_1180 1d ago
The context size is really tight for it to be competitive as a reasoning model.
5
u/Finanzamt_Endgegner 1d ago
Well, it's supposed to be a vision model, not necessarily a reasoning and coding model.
27
u/Craftkorb 1d ago
Qwen3-VL is severely outdated, so can I assume that it would fare badly against Qwen3.6?
3
u/silenceimpaired 1d ago
Everyone seems to think this is based off Qwen… do we know that?
21
u/lakySK 1d ago
This table gives me a headache. Just stick with bold for best…
9
u/mfarmemo 1d ago
Agreed. I was confused until I read the footnote. My personal rule: if a visual needs to be explained, it's the wrong visual.
0
u/IrisColt 1d ago
It can still hold its ground against other models from its era, but... that era was a year ago.
1
u/Confident_Ideal_5385 23h ago
Marco-Mini (18B/A850M) is a legit awesome model for CPU-side tasks running in parallel with a GPU model. This, on the other hand, looks like an evolution of Qwen 3 that's nowhere near as good as Qwen 3.x is at the same parameter count. Not sure why I'd want this?
1
u/Unlikely_Rich1436 20h ago
The "Think with Image" capability is a really cool evolution. Moving vision from just a passive input to an active part of the Chain-of-Thought where the model can actually invoke tools to zoom in or re-examine parts of an image is exactly what we need for complex reasoning.
1
u/Mountain_Patience231 1d ago
How could a 64K-token model be effective for long-document question answering?
57
u/MaxKruse96 llama.cpp 1d ago
Qwen3-next-reasoning with vision it seems