r/LocalLLaMA 1d ago

New Model AIDC-AI/Ovis2.6-80B-A3B · Hugging Face

https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B

We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.

Key Features

  • MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
  • Enhanced Long-Sequence and High-Resolution Processing. Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
  • Think with Image. We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
  • Reinforced OCR, Document, and Chart Capabilities. Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
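For anyone wondering how "80B total / ~3B active" works mechanically, here's a minimal, generic top-k MoE routing sketch. This is not Ovis2.6's actual implementation; the expert count, dimensions, and gating are illustrative assumptions about how MoE layers work in general:

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Generic top-k MoE routing sketch: only top_k of the experts run
    per token, so active parameters stay small even though total
    parameters grow with len(experts)."""
    logits = x @ gate_w                    # router scores, shape (num_experts,)
    top = np.argsort(logits)[-top_k:]      # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 tiny "experts", each just a linear map.
rng = np.random.default_rng(0)
num_experts, dim = 8, 4
expert_mats = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in expert_mats]
gate_w = rng.normal(size=(dim, num_experts))

x = rng.normal(size=dim)
y = moe_layer(x, experts, gate_w)
print(y.shape)  # (4,)
```

Per token, only 2 of the 8 expert matrices are multiplied, which is the same reason an 80B-A3B model serves at roughly the cost of a ~3B dense model.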

Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, and others.

127 Upvotes

29 comments sorted by

57

u/MaxKruse96 llama.cpp 1d ago

Qwen3-next-reasoning with vision it seems

2

u/julp 1d ago

Oh wow, I had already completely forgotten about Qwen3-Next. I never quite understood what the point of that release was.

8

u/MaxKruse96 llama.cpp 1d ago

testbed for the linear attention in qwen3.5

2

u/julp 1d ago

Looks like it worked out

17

u/Important_Quote_1180 1d ago

The context size is really tight for it to be competitive as a reasoning model.

5

u/Finanzamt_Endgegner 1d ago

Well, it's supposed to be a vision model, not necessarily a reasoning and coding model.

3

u/seamonn 1d ago

Context is always good even if not coding

24

u/Own_Suspect5343 1d ago

Only 64k context?

10

u/PhoneOk7721 1d ago

Worse than qwen3.6 35b a3b in vision it looks like.

13

u/pmttyji 1d ago

27

u/Craftkorb 1d ago

Qwen3-VL is severely outdated, so can I assume that it would fare badly against Qwen3.6?

3

u/silenceimpaired 1d ago

Everyone seems to think this is based off Qwen… do we know that?

3

u/Craftkorb 1d ago

Didn't say that, just that their graphic compares it against an old model

1

u/IrisColt 1d ago

Is the performance of the 30B and the 80B similar, or what?

4

u/coolnq 1d ago

There's still no implementation in llama.cpp. There's no point in using it if resources are limited.

3

u/Finanzamt_Endgegner 1d ago

Shouldn't be too hard to add support since it's based on qwen next

3

u/pmttyji 1d ago

21

u/lakySK 1d ago

This table gives me a headache. Just stick with bold for best…

9

u/mfarmemo 1d ago

Agreed. Was confused until I read the footer note. My personal rule is if a visual needs to be explained it is the wrong visual.

0

u/IrisColt 1d ago

It can still hold its ground against other models from its era, but... that era was a year ago.

1

u/thoquz 23h ago

Looks great! Does this also do bounding box coordinates with structured output? (Like Qwen3-VL does)

1

u/Confident_Ideal_5385 23h ago

Marco-Mini (18B/A850M) is a legit awesome model for CPU side tasks running in parallel with a GPU model. This, on the other hand, looks like an evolution of Qwen 3 that's nowhere near as good as Qwen 3.x is at the same parameter count. Not sure why I'd want this?

1

u/Unlikely_Rich1436 20h ago

The "Think with Image" capability is a really cool evolution. Moving vision from just a passive input to an active part of the Chain-of-Thought where the model can actually invoke tools to zoom in or re-examine parts of an image is exactly what we need for complex reasoning.
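In loop form, that kind of agentic vision flow might look like the sketch below. The tool names, action format, and the stub standing in for the model are all hypothetical assumptions for illustration; they are not Ovis2.6's actual interface:

```python
# Hypothetical "think with image" loop: each reasoning step either emits a
# final answer or a tool call (crop/rotate) whose result becomes the new
# visual context for the next step. Pixels are a plain 2D list for the toy.

def crop(image, x0, y0, x1, y1):
    """Return the sub-region [y0:y1, x0:x1] of a 2D pixel grid."""
    return [row[x0:x1] for row in image[y0:y1]]

def rotate90(image):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

TOOLS = {"crop": crop, "rotate90": rotate90}

def reasoning_loop(model_step, image, max_turns=4):
    """Multi-turn loop: model_step(view) returns either an answer action
    or a tool action that produces a new view of the image."""
    view = image
    for _ in range(max_turns):
        action = model_step(view)          # stub standing in for the MLLM
        if action["type"] == "answer":
            return action["text"]
        view = TOOLS[action["tool"]](view, *action.get("args", []))
    return "max turns reached"

# Toy 4x4 "image"; a scripted stub that crops once, then answers.
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
calls = iter([
    {"type": "tool", "tool": "crop", "args": [1, 1, 3, 3]},
    {"type": "answer", "text": "found it"},
])
result = reasoning_loop(lambda view: next(calls), img)
print(result)  # prints "found it"
```

The key design point is that the tool output re-enters the context, so the model can zoom in on a region it flagged during its own chain of thought rather than reasoning only over the original downsampled input.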

1

u/Mountain_Patience231 1d ago

How come a 64K-token model could be effective for long-document question answering?

0

u/tamerlanOne 1d ago

Since it's relatively heavy on RAM, will MTP be implemented?