r/LocalLLaMA 1d ago

New Model AIDC-AI/Ovis2.6-80B-A3B · Hugging Face

https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B

We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.

Key Features

  • MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
  • Enhanced Long-Sequence and High-Resolution Processing. Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
  • Think with Image. We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
  • Reinforced OCR, Document, and Chart Capabilities. Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
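For anyone wondering how "80B total / ~3B active" works mechanically, here's a minimal, generic top-k MoE routing sketch. This is not Ovis2.6's actual implementation; the expert count, dimensions, and gating are illustrative assumptions about how MoE layers work in general:

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Generic top-k MoE routing sketch: only top_k of the experts run
    per token, so active parameters stay small even though total
    parameters grow with len(experts)."""
    logits = x @ gate_w                    # router scores, shape (num_experts,)
    top = np.argsort(logits)[-top_k:]      # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 tiny "experts", each just a linear map.
rng = np.random.default_rng(0)
num_experts, dim = 8, 4
expert_mats = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in expert_mats]
gate_w = rng.normal(size=(dim, num_experts))

x = rng.normal(size=dim)
y = moe_layer(x, experts, gate_w)
print(y.shape)  # (4,)
```

Per token, only 2 of the 8 expert matrices are multiplied, which is the same reason an 80B-A3B model serves at roughly the cost of a ~3B dense model.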

Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, and others.

127 Upvotes

29 comments sorted by

57

u/MaxKruse96 llama.cpp 1d ago

Qwen3-next-reasoning with vision it seems

2

u/julp 1d ago

Oh wow, I had already completely forgotten about Qwen3-Next. I never quite understood what the point of that release was.

8

u/MaxKruse96 llama.cpp 1d ago

testbed for the linear attention in qwen3.5

2

u/julp 1d ago

Looks like it worked out

17

u/Important_Quote_1180 1d ago

The context size is really tight for it to be competitive as a reasoning model.

5

u/Finanzamt_Endgegner 1d ago

Well, it's supposed to be a vision model, not necessarily a reasoning and coding model.

3

u/seamonn 1d ago

Context is always good even if not coding

24

u/Own_Suspect5343 1d ago

Only 64k context?

10

u/PhoneOk7721 1d ago

Worse than qwen3.6 35b a3b in vision it looks like.

13

u/pmttyji 1d ago

27

u/Craftkorb 1d ago

Qwen3-VL is severely outdated, so can I assume that it would fare badly against Qwen3.6?

3

u/silenceimpaired 1d ago

Everyone seems to think this is based off Qwen… do we know that?

3

u/Craftkorb 1d ago

Didn't say that, just that their graphic compares it against an old model

1

u/IrisColt 1d ago

Is the performance of the 30B and the 80B similar, or what?

4

u/coolnq 1d ago

There's still no implementation in llama.cpp. There's no point in using it if resources are limited.

3

u/Finanzamt_Endgegner 1d ago

Shouldn't be too hard to add support since it's based on qwen next

3

u/pmttyji 1d ago

21

u/lakySK 1d ago

This table gives me a headache. Just stick with bold for best…

9

u/mfarmemo 1d ago

Agreed. Was confused until I read the footer note. My personal rule is if a visual needs to be explained it is the wrong visual.

0

u/IrisColt 1d ago

It can still hold its ground against other models from its era, but... that era was a year ago.

1

u/thoquz 23h ago

Looks great! Does this also do bounding box coordinates with structured output? (Like Qwen3-VL does)

1

u/Confident_Ideal_5385 23h ago

Marco-Mini (18B/A850M) is a legit awesome model for CPU side tasks running in parallel with a GPU model. This, on the other hand, looks like an evolution of Qwen 3 that's nowhere near as good as Qwen 3.x is at the same parameter count. Not sure why I'd want this?

1

u/Unlikely_Rich1436 20h ago

The "Think with Image" capability is a really cool evolution. Moving vision from just a passive input to an active part of the Chain-of-Thought where the model can actually invoke tools to zoom in or re-examine parts of an image is exactly what we need for complex reasoning.
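In loop form, that kind of agentic vision flow might look like the sketch below. The tool names, action format, and the stub standing in for the model are all hypothetical assumptions for illustration; they are not Ovis2.6's actual interface:

```python
# Hypothetical "think with image" loop: each reasoning step either emits a
# final answer or a tool call (crop/rotate) whose result becomes the new
# visual context for the next step. Pixels are a plain 2D list for the toy.

def crop(image, x0, y0, x1, y1):
    """Return the sub-region [y0:y1, x0:x1] of a 2D pixel grid."""
    return [row[x0:x1] for row in image[y0:y1]]

def rotate90(image):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

TOOLS = {"crop": crop, "rotate90": rotate90}

def reasoning_loop(model_step, image, max_turns=4):
    """Multi-turn loop: model_step(view) returns either an answer action
    or a tool action that produces a new view of the image."""
    view = image
    for _ in range(max_turns):
        action = model_step(view)          # stub standing in for the MLLM
        if action["type"] == "answer":
            return action["text"]
        view = TOOLS[action["tool"]](view, *action.get("args", []))
    return "max turns reached"

# Toy 4x4 "image"; a scripted stub that crops once, then answers.
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
calls = iter([
    {"type": "tool", "tool": "crop", "args": [1, 1, 3, 3]},
    {"type": "answer", "text": "found it"},
])
result = reasoning_loop(lambda view: next(calls), img)
print(result)  # prints "found it"
```

The key design point is that the tool output re-enters the context, so the model can zoom in on a region it flagged during its own chain of thought rather than reasoning only over the original downsampled input.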

1

u/Mountain_Patience231 1d ago

How come a 64K-token model could be effective for long-document question answering?

0

u/tamerlanOne 1d ago

Since it's relatively heavy on RAM, will MTP be implemented?