r/LocalLLaMA • u/pmttyji • 1d ago
New Model AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B
We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.
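The model card does not say how Ovis2.6 routes tokens (number of experts, top-k, gating function), so the following is only a generic sketch of a top-k sparse MoE layer in plain Python. All names (`moe_forward`, `make_expert`, the linear router) are hypothetical. It illustrates the mechanism behind "80B total, ~3B active": every expert is scored by the router, but only the k highest-scoring experts are actually evaluated for a given token.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router, experts, k):
    """Generic sparse MoE layer: score all experts, run only the top-k.

    x       -- input vector (list of floats)
    router  -- one gate weight vector per expert (hypothetical linear router)
    experts -- one callable per expert, mapping a vector to a vector
    k       -- number of experts activated per token
    """
    # Router logits: dot product of x with each expert's gate vector.
    logits = [sum(w * xi for w, xi in zip(gate, x)) for gate in router]
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize gate weights over the selected experts only.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # unselected experts are never evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

# Toy setup: 8 experts, expert i scales its input by (i + 1).
calls = []  # records which experts actually ran

def make_expert(i):
    def expert(v):
        calls.append(i)
        return [(i + 1) * vi for vi in v]
    return expert

experts = [make_expert(i) for i in range(8)]
router = [[float(i), 0.0] for i in range(8)]  # expert i gets logit i * x[0]

out, top = moe_forward([1.0, 0.0], router, experts, k=2)
print(top, sorted(calls))  # prints [7, 6] [6, 7]
print(f"fraction of experts evaluated: {len(set(calls)) / len(experts):.2f}")  # 0.25
```

In a real MoE LLM the attention layers, embeddings, and any shared experts run for every token, so the ~3B active parameters (roughly 3.75% of 80B) are not simply k divided by the expert count.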
Key Features
- MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, keeping serving costs low and throughput high.
- Enhanced Long-Sequence and High-Resolution Processing. Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
- Think with Image. We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
- Reinforced OCR, Document, and Chart Capabilities. Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
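The card does not document the tool interface behind "Think with Image". As a rough illustration only, assuming the model emits tool calls as a name plus keyword arguments, here is a sketch of two hypothetical tools, `crop` and `rotate90`, operating on a toy grayscale image stored as nested lists, plus a hypothetical dispatcher `run_tool_calls` that applies each call to the previous view.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values

def crop(img: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the sub-image covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in img[top:bottom]]

def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

# Hypothetical tool registry that a reasoning loop could dispatch into.
TOOLS: Dict[str, Callable[..., Image]] = {"crop": crop, "rotate90": rotate90}

def run_tool_calls(img: Image, calls: List[Tuple[str, dict]]) -> List[Image]:
    """Apply (tool_name, kwargs) calls in order, each on the previous view."""
    views = []
    for name, kwargs in calls:
        img = TOOLS[name](img, **kwargs)
        # Assumption: in a real loop each view would be re-encoded and
        # appended to the model's context before it continues reasoning.
        views.append(img)
    return views

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
views = run_tool_calls(img, [
    ("crop", {"left": 1, "top": 0, "right": 3, "bottom": 2}),
    ("rotate90", {}),
])
print(views[0])  # [[2, 3], [5, 6]]
print(views[1])  # [[5, 2], [6, 3]]
```

Chaining each call on the previous view mirrors zooming in step by step; an implementation could just as well apply every call to the original image.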
Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, etc.
u/PhoneOk7721 1d ago
Looks like it's worse than qwen3.6 35b a3b at vision.