Nanonets-OCR-s Goes Open Source! SoTA for Complex Document-to-Markdown, Revolutionizing Complex Document Workflows
The Nanonets team has open-sourced Nanonets-OCR-s, a model fine-tuned from Qwen2.5-VL-3B that runs on as little as 9 GB of VRAM.
01. Introduction
Nanonets-OCR-s is a powerful model that uses intelligent content recognition and semantic tagging to convert messy documents into the clean, structured, context-rich Markdown that modern AI applications need. Its capabilities go far beyond traditional text extraction, and it is currently the SoTA model for image-to-Markdown conversion.
Most publicly available image-to-text models focus on extracting plain text from images, yet they typically cannot distinguish regular content from elements such as watermarks, signatures, or page numbers. Visual elements such as images are often ignored, and complex structures such as tables, checkboxes, and formulas are not handled effectively, which makes these models poorly suited for downstream tasks. Unlike traditional OCR systems that extract plain text only, Nanonets-OCR-s understands document structure and content context (tables, equations, images, charts, watermarks, checkboxes, and more), producing intelligently formatted Markdown output that is ready for downstream processing by large language models.
By separating structured data out of unstructured formats, Nanonets-OCR-s can streamline complex document workflows across industries:
- Academia & research: digitize papers, including LaTeX equations and tables.
- Legal & finance: extract data from contracts and financial documents, including signatures and tables.
- Healthcare & pharma: accurately capture text and checkboxes from medical forms.
- Enterprise: turn reports into searchable, image-aware knowledge bases.
ModelScope Studio:
https://www.modelscope.cn/studios/nanonets/Nanonets-ocr-s
02. Key Features and Capabilities
- LaTeX equation recognition
- Intelligent image description
- Signature detection and isolation
- Watermark extraction
- Smart checkbox handling
- Complex table extraction
1. LaTeX Equation Recognition
Nanonets-OCR-s automatically converts mathematical equations and formulas into correctly formatted LaTeX syntax. Inline math is converted to LaTeX inline formulas, while display equations are converted to LaTeX display formulas that occupy their own line. The recognition output also fully preserves the equation numbers on the right-hand side of each equation.
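For example, the conventions look like this (an illustrative snippet written by hand, based on the sample output below, not actual model output):

```latex
Inline math such as $\pi_{\theta}$ or $\beta$ stays inside the sentence, while a
numbered display equation keeps its own line and its original tag:
$$\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x)
\exp\left(\frac{1}{\beta} r(x,y)\right) \tag{4}$$
```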
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{\text{ref}}$, namely the initial SFT model $\pi^{\text{SFT}}$. In practice, the language model policy $\pi_{\theta}$ is also initialized to $\pi^{\text{SFT}}$. The added constraint is important, as it prevents the model from deviating too far from the distribution on which the reward model is accurate, as well as maintaining the generation diversity and preventing mode-collapse to single high-reward answers. Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning. The standard approach [51, 40, 1, 28] has been to construct the reward function $r(x,y) = r_{\phi}(x,y) - \beta(\log \pi_{\theta}(y \mid x) - \log \pi_{\text{ref}}(y \mid x))$, and maximize using PPO [39].
### 4 Direct Preference Optimization
Motivated by the challenges of applying reinforcement learning algorithms on large-scale problems such as fine-tuning language models, our goal is to derive a simple approach for policy optimization using preferences directly. Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As we will describe next in detail, our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences, such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward.
**Deriving the DPO objective.** We start with the same RL objective as prior work, Eq. [3], under a general reward function $r$. Following prior work [31, 30, 19, 15], it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective in Eq. [3] takes the form:
$$\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x,y)\right), \tag{4}$$
where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x,y)\right)$ is the partition function. See Appendix A.1 for a complete derivation. Even if we use the MLE estimate $r_{\phi}$ of the ground-truth reward function $r^*$, it is still expensive to estimate the partition function $Z(x)$ [19, 15], which makes this representation hard to utilize in practice. However, we can rearrange Eq. [4] to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_{\text{ref}}$, and the unknown partition function $Z(\cdot)$. Specifically, we first take the logarithm of both sides of Eq. [4] and then with some algebra we obtain:
$$r(x,y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \tag{5}$$
We can apply this reparameterization to the ground-truth reward $r^*$ and corresponding optimal model $\pi^*$. Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions, i.e., $p^*(y_1 \succ y_2 \mid x) = \sigma(r^*(x,y_1) - r^*(x,y_2))$. Substituting the reparameterization in Eq. [5] for $r^*(x,y)$ into the preference model Eq. [1], the partition function cancels, and we can express the human preference probability in terms of only the optimal policy $\pi^*$ and reference policy $\pi_{\text{ref}}$. Thus, the optimal RLHF policy $\pi^*$ under the Bradley-Terry model satisfies the preference model:
$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_{\text{ref}}(y_2|x)} - \beta \log \frac{\pi^*(y_1|x)}{\pi_{\text{ref}}(y_1|x)}\right)}. \tag{6}$$
The derivation is in Appendix A.2. While Eq. [6] uses the Bradley-Terry model, we can similarly derive expressions under the more general Plackett-Luce models [32, 23], shown in Appendix A.3.
Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy $\pi_{\theta}$. Analogous to the reward modeling approach (i.e. Eq. [2]), our policy objective becomes:
$$\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]. \tag{7}$$
This way, we fit an implicit reward using an alternative parameterization, whose optimal policy is simply $\pi_{\theta}$. Moreover, since our procedure is equivalent to fitting a reparametrized Bradley-Terry
2. Intelligent Image Description
Nanonets-OCR-s describes images inside documents using structured tags, making them easy for LLMs to process. The model can describe single or multiple images (logos, charts, graphs, QR codes, etc.) based on their content, style, and context. Predicted image descriptions are stored inside <img> tags and predicted page numbers inside <page_number> tags, which makes them easy to use in RAG systems.
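Because the tags are predictable, downstream code can pull the descriptions back out before chunking for retrieval. Here is a minimal sketch (our own illustration; `extract_tagged_content` is a hypothetical helper, not an official Nanonets API):

```python
import re

def extract_tagged_content(markdown: str) -> dict:
    """Collect <img> descriptions and <page_number> values from
    Nanonets-OCR-s Markdown output. Hypothetical helper for illustration."""
    # DOTALL lets multi-line chart descriptions inside <img>...</img> match.
    images = re.findall(r"<img>(.*?)</img>", markdown, flags=re.DOTALL)
    pages = re.findall(r"<page_number>(.*?)</page_number>", markdown)
    return {
        "image_descriptions": [d.strip() for d in images],
        "page_numbers": [p.strip() for p in pages],
    }

sample = "Some text <img>A line chart showing FID over steps.</img>\n<page_number>9</page_number>"
print(extract_tagged_content(sample))
# {'image_descriptions': ['A line chart showing FID over steps.'], 'page_numbers': ['9']}
```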
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
```markdown
**Results** We report the FID score [10] on MJHQ-30k [15] for visual aesthetic quality, along with GenEval [8] and DPG-Bench [11] metrics for evaluating prompt alignment. We plot the results for each design choice at approximately every 3,200 training steps. Figure 4 shows that CLIP + Flow Matching achieves the best prompt alignment scores on both GenEval and DPG-Bench, while VAE + Flow Matching produces the lowest (best) FID, indicating superior aesthetic quality. However, FID has inherent limitations: it quantifies stylistic deviation from the target image distribution and often overlooks true generative quality and prompt alignment. In fact, our FID evaluation of GPT-4o on the MJHQ-30k dataset produced a score of around 30.0, underscoring that FID can be misleading in the image generation evaluation. In general, our experiments demonstrate CLIP + Flow Matching as the most effective design choice.
<img>
A line chart showing GenEval Score ↑ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents GenEval Score ↑, ranging from 0 to 65.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 30 by 35K Step. The blue line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 60 by 35K Step. The red line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 60 by 35K Step.
A line chart showing DPG Score ↑ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents DPG Score ↑, ranging from 0 to 80.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 10 DPG Score ↑ at 3K Step and increases to around 50 by 35K Step. The blue line starts at around 10 DPG Score ↑ at 3K Step and increases to around 70 by 35K Step. The red line starts at around 10 DPG Score ↑ at 3K Step and increases to around 70 by 35K Step.
A line chart showing FID Score ↓ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents FID Score ↓, ranging from 0 to 80.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 80 FID Score ↓ at 3K Step and decreases to around 20 by 35K Step. The blue line starts at around 80 FID Score ↓ at 3K Step and decreases to around 40 by 35K Step. The red line starts at around 80 FID Score ↓ at 3K Step and decreases to around 40 by 35K Step.
</img>
**Figure 4:** Comparison of different design choices.
**Discussion** In this section, we present a comprehensive evaluation of various design choices for image generation within a unified multimodal framework. Our results clearly show that CLIP's features produce more compact and semantically rich representations than VAE features, yielding higher training efficiency. Autoregressive models more effectively learn these semantic-level features compared to pixel-level features. Furthermore, flow matching proves to be a more effective training objective for modeling the image distribution, resulting in greater sample diversity and enhanced visual quality.
**Finding 1**
When integrating image generation into a unified model, autoregressive models more effectively learn the semantic-level features (CLIP) compared to pixel-level features (VAE).
**Finding 2**
Adopting flow matching as the training objective better captures the underlying image distribution, resulting in greater sample diversity and enhanced visual quality.
**4 Training Strategies for Unified Multimodal**
Building on our image generation study, the next step is to develop a unified model that can perform both image understanding and image generation. We use CLIP + Flow Matching for the image generation module. Since image understanding also operates in CLIP's embedding space, we align both tasks within the same semantic space, enabling their unification. In this context, we discuss two training strategies to achieve this integration.
**4.1 Joint Training Versus Sequential Training**
**Joint Training** Joint training of image understanding and image generation has become a common practice in recent works such as Metamorph [33], Janus-Pro [4], and Show-o [38]. Although these methods adopt different architectures for image generation, all perform multitask learning by mixing data for image generation and understanding.
<page_number>9</page_number>
```
3. Signature Detection and Isolation
The model intelligently identifies signatures in documents and distinguishes them from the surrounding text, a capability that is critical for legal and business document processing. Detected signature text is written inside <signature> tags.
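A document pipeline can then, for example, route anything containing a signature to a review queue. A minimal sketch (our own illustration; `find_signatures` is a hypothetical helper):

```python
import re

def find_signatures(markdown: str) -> list[str]:
    """Return the text of every <signature>...</signature> span."""
    return [s.strip() for s in
            re.findall(r"<signature>(.*?)</signature>", markdown, flags=re.DOTALL)]

doc = "Issued by, signature:\n<signature>J. Walker</signature>"
print(find_signatures(doc))  # ['J. Walker']
```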
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
Invoice
Tax Invoice
Your Business Name
+61200000000
email@yourbusinessname.com.au
www.yourbusinessname.com.au
| INVOICE NO. | 2022445 |
|---|---|
| REFERENCE | 2022445 |
| ISSUE DATE | 19/7/2022 |
| DUE DATE | 2/8/2022 |
FROM
Your Business Name
5 Martin Pl
Sydney NSW NSW 2000
Australia
TO
Your Client
100 Harris St
Sydney NSW NSW 2009
Australia
Total due
$30.00
<table>
<thead>
<tr>
<th>DESCRIPTION</th>
<th>QUANTITY</th>
<th>UNIT PRICE ($)</th>
<th>AMOUNT ($)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sample service</td>
<td>1 hour</td>
<td>30.00</td>
<td>30.00</td>
</tr>
</tbody>
</table>
Subtotal:
$30.00
Total (AUD):
$30.00
Issued by, signature:
<signature>J. Walker</signature>
4. Watermark Extraction
Similar to signature detection, the model can detect and extract watermark text from documents. Extracted watermark text is written inside <watermark> tags.
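In a search or RAG pipeline you would typically strip these spans so watermark text such as "PAID" does not pollute the index. A minimal sketch (our own illustration; `strip_watermarks` is a hypothetical helper):

```python
import re

def strip_watermarks(markdown: str) -> str:
    """Drop <watermark>...</watermark> spans, tag and contents included."""
    return re.sub(r"<watermark>.*?</watermark>\s*", "", markdown, flags=re.DOTALL)

row = "<td><watermark>PAID</watermark> Consulting Services</td>"
print(strip_watermarks(row))  # <td>Consulting Services</td>
```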
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
```
Invoice
Invoice Number: INV-20250609
Date: June 9, 2025
Bill To: Souvik Mandal
123 Business Street
Kolkata, India
<table>
<tr>
<th>Item</th>
<th>Description</th>
<th>Quantity</th>
<th>Unit Price</th>
<th>Total</th>
</tr>
<tr>
<td>001</td>
<td><watermark>PAID</watermark> Consulting Services</td>
<td>10</td>
<td>₹2000</td>
<td>₹20,000</td>
</tr>
<tr>
<td>002</td>
<td>Design Work</td>
<td>5</td>
<td>₹1500</td>
<td>₹7,500</td>
</tr>
<tr>
<td colspan="4">Grand Total</td>
<td>₹27,500</td>
</tr>
</table>
Thank you for your business!
Payment was received on June 7, 2025
```
5. Smart Checkbox Handling
Nanonets-OCR-s recognizes checkbox states. The model converts form checkboxes and radio buttons into standardized Unicode symbols, distinguishing checked (☑) from unchecked (☐) boxes; these symbols appear directly in the output, as requested by the prompt in the inference section below.
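The standardized symbols make form state trivially machine-readable. A minimal sketch (our own illustration; `checkbox_state` is a hypothetical helper):

```python
CHECKED, UNCHECKED = "\u2611", "\u2610"  # ☑ and ☐

def checkbox_state(cell: str):
    """Map a cell to True/False, or None if it holds no checkbox."""
    if CHECKED in cell:
        return True
    if UNCHECKED in cell:
        return False
    return None

print(checkbox_state("☑ Credit Card"))  # True
print(checkbox_state("☐ PayPal"))       # False
```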
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
```markdown
# Invoice
**Bill To:**
John Doe
123 Main Street
City, Country 456789
john.doe@example.com
<table>
<thead>
<tr>
<th>Select</th>
<th>Description</th>
<th>Quantity</th>
<th>Unit Price</th>
<th>Discount</th>
<th>Line Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>☑</td>
<td>Website Development</td>
<td>1</td>
<td>$800</td>
<td>$0</td>
<td>$800</td>
</tr>
<tr>
<td>☐</td>
<td>Monthly Maintenance</td>
<td>12</td>
<td>$50</td>
<td>$100</td>
<td>$500</td>
</tr>
<tr>
<td>☑</td>
<td>Email Hosting (1 year)</td>
<td>1</td>
<td>$100</td>
<td>$0</td>
<td>$100</td>
</tr>
<tr>
<td>☑</td>
<td>SSL Certificate</td>
<td>1</td>
<td>$75</td>
<td>$0</td>
<td>$75</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Subtotal:</td>
<td>$1,475</td>
</tr>
<tr>
<td>Discounts:</td>
<td>- $100</td>
</tr>
<tr>
<td>Tax (10%):</td>
<td>$137.50</td>
</tr>
<tr>
<td><strong>Grand Total:</strong></td>
<td><strong>$1,512.50</strong></td>
</tr>
</tbody>
</table>
**Select Payment Method:**
☑ Credit Card
☐ PayPal
☐ Bank Transfer
```
6. Complex Table Extraction
Nanonets-OCR-s extracts complex tables from documents and converts them into Markdown and HTML tables.
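Because the model emits standard HTML `<table>` markup, the output can be loaded directly into analysis tools. A minimal sketch (our own illustration) using pandas, assuming `lxml` or `html5lib` is installed:

```python
import io
import pandas as pd

html = """<table>
<tr><th>DESCRIPTION</th><th>QUANTITY</th><th>UNIT PRICE ($)</th><th>AMOUNT ($)</th></tr>
<tr><td>Sample service</td><td>1 hour</td><td>30.00</td><td>30.00</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> in the input.
df = pd.read_html(io.StringIO(html))[0]
print(df)
```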
*(Comparison screenshots omitted: input document vs. Nanonets-OCR-s output vs. Mistral OCR output.)*
Raw model output:
features extracted by the Vision Transformer (ViT), we first group spatially adjacent sets of four patch features. These grouped features are then concatenated and passed through a two-layer multi-layer perceptron (MLP) to project them into a dimension that aligns with the text embeddings used in the LLM. This method not only reduces computational costs but also provides a flexible way to dynamically compress image feature sequences of varying lengths.
In Table 1, the architecture and configuration of Qwen2.5-VL are detailed.
<table>
<tr>
<td><strong>Configuration</strong></td>
<td><strong>Qwen2.5-VL-3B</strong></td>
<td><strong>Qwen2.5-VL-7B</strong></td>
<td><strong>Qwen2.5-VL-72B</strong></td>
</tr>
<tr>
<td colspan="4" style="text-align:center;"><strong>Vision Transformer (ViT)</strong></td>
</tr>
<tr>
<td>Hidden Size</td>
<td>1280</td>
<td>1280</td>
<td>1280</td>
</tr>
<tr>
<td># Layers</td>
<td>32</td>
<td>32</td>
<td>32</td>
</tr>
<tr>
<td># Num Heads</td>
<td>16</td>
<td>16</td>
<td>16</td>
</tr>
<tr>
<td>Intermediate Size</td>
<td>3456</td>
<td>3456</td>
<td>3456</td>
</tr>
<tr>
<td>Patch Size</td>
<td>14</td>
<td>14</td>
<td>14</td>
</tr>
<tr>
<td>Window Size</td>
<td>112</td>
<td>112</td>
<td>112</td>
</tr>
<tr>
<td>Full Attention Block Indexes</td>
<td>{7, 15, 23, 31}</td>
<td>{7, 15, 23, 31}</td>
<td>{7, 15, 23, 31}</td>
</tr>
<tr>
<td colspan="4" style="text-align:center;"><strong>Vision-Language Merger</strong></td>
</tr>
<tr>
<td>In Channel</td>
<td>1280</td>
<td>1280</td>
<td>1280</td>
</tr>
<tr>
<td>Out Channel</td>
<td>2048</td>
<td>3584</td>
<td>8192</td>
</tr>
<tr>
<td colspan="4" style="text-align:center;"><strong>Large Language Model (LLM)</strong></td>
</tr>
<tr>
<td>Hidden Size</td>
<td>2048</td>
<td>3,584</td>
<td>8192</td>
</tr>
<tr>
<td># Layers</td>
<td>36</td>
<td>28</td>
<td>80</td>
</tr>
<tr>
<td># KV Heads</td>
<td>2</td>
<td>4</td>
<td>8</td>
</tr>
<tr>
<td>Head Size</td>
<td>128</td>
<td>128</td>
<td>128</td>
</tr>
<tr>
<td>Intermediate Size</td>
<td>4864</td>
<td>18944</td>
<td>29568</td>
</tr>
<tr>
<td>Embedding Tying</td>
<td>☑</td>
<td>☒</td>
<td>☒</td>
</tr>
<tr>
<td>Vocabulary Size</td>
<td>151646</td>
<td>151646</td>
<td>151646</td>
</tr>
<tr>
<td># Trained Tokens</td>
<td>4.1T</td>
<td>4.1T</td>
<td>4.1T</td>
</tr>
</table>
**Table 1:** Configuration of Qwen2.5-VL.
03. Model Training
To train this new vision-language model (VLM) for accurate optical character recognition (OCR), the team carefully curated a dataset of more than 250,000 pages covering the following document types: research papers, financial documents, legal documents, healthcare documents, tax forms, receipts, and invoices. The dataset also includes documents containing images, charts, formulas, signatures, watermarks, checkboxes, and complex tables. Both synthetic and manually annotated datasets were used: the model was first trained on the synthetic dataset and then fine-tuned on the manually annotated one.
Model architecture:
Qwen2.5-VL-3B was selected as the base vision-language model (VLM). The team then fine-tuned it on the curated dataset to improve its performance on document-specific OCR tasks.
04. Model Inference
Inference with transformers
Download the model from ModelScope:
```
modelscope download --model nanonets/Nanonets-OCR-s --local_dir nanonets/Nanonets-OCR-s
```
Inference script:
```python
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

# Load with automatic dtype/device placement; flash_attention_2 requires the
# flash-attn package and a supported GPU.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    # This prompt drives the semantic tagging described above: HTML tables,
    # LaTeX equations, <img>/<watermark>/<page_number> tags, ☐/☑ checkboxes.
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    # Greedy decoding; strip the prompt tokens before decoding the answer.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]


image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
```
Environment requirements:
transformers >= 4.52 and torch >= 2.5.
Resource requirements:
Inference uses 8 GB of VRAM. If you don't have local GPU resources, you can use the free GPU quota in ModelScope's Notebook.
ModelScope Notebook:
https://www.modelscope.cn/my/mynotebook
Click the link to jump in and try it~