zuminghuang committed
Commit 2d9bb8a · verified · 1 Parent(s): 669b83f

Update README.md

Files changed (1):
  1. README.md +147 -6

README.md CHANGED
@@ -38,17 +38,38 @@ Overview of Infinity-Parser training framework. Our model is optimized via reinf
  ## Table Recognition
  ![image](assets/table.png)

  # Quick Start

- ## Vllm Inference
- We recommend using the vLLM backend for accelerated inference.
- It supports image and PDF inputs, automatically parses the document content, and exports the results in Markdown format to a specified directory.

  Before starting, make sure that **PyTorch** is correctly installed according to the official installation guide at [https://pytorch.org/](https://pytorch.org/).

  ```shell
- pip install .

  parser --model /path/model --input dir/PDF/Image --output output_folders --batch_size 128 --tp 1
  ```
 
@@ -67,11 +88,110 @@ output_folders/

  </details>

  ## Using Transformers for Inference

  <details>
  <summary> Transformers Inference Example </summary>
-
  ```python
  import torch
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
@@ -139,7 +259,6 @@ output_text = processor.batch_decode(
  )
  print(output_text)
  ```
-
  </details>

  # Visualization

@@ -147,6 +266,28 @@ print(output_text)
  ## Comparison Examples
  ![image](assets/case.jpeg)

  # Citation

  ```
 
README.md (updated):

@@ -38,17 +38,38 @@ Overview of Infinity-Parser training framework. Our model is optimized via reinf
  ## Table Recognition
  ![image](assets/table.png)

+ ## General Multimodal Capability Evaluation
+ ![image](assets/General.png)
+ > **Note:** The baseline model is **Qwen2.5-VL-7B**, and all metrics are evaluated using the **LMMS-Eval** framework.
+
  # Quick Start

+ ## Install Infinity_Parser
+ ```shell
+ conda create -n Infinity_Parser python=3.11
+ conda activate Infinity_Parser

+ git clone https://github.com/infly-ai/INF-MLLM.git
+ cd INF-MLLM/Infinity-Parser
+ # Install PyTorch; see https://pytorch.org/get-started/previous-versions/ for your CUDA version
+ conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia
+ pip install .
+ ```
  Before starting, make sure that **PyTorch** is correctly installed according to the official installation guide at [https://pytorch.org/](https://pytorch.org/).
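
To confirm the environment before running anything heavy, a quick check that PyTorch is installed and sees your GPU (a minimal sketch; the version string should match the build installed above):

```python
import torch

# Expect "2.5.0" (or whichever build you installed) and True on a working CUDA setup.
print(torch.__version__)
print(torch.cuda.is_available())
```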

+ ## Download Model Weights
+
  ```shell
+ pip install -r requirements.txt
+
+ python3 tools/download_model.py
+ ```
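
If you would rather fetch the checkpoint directly instead of using the helper script, a sketch with the Hugging Face Hub client also works (the repo id below is a placeholder; substitute the actual Infinity-Parser weights repository):

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the real Infinity-Parser checkpoint.
local_dir = snapshot_download("infly/Infinity-Parser", local_dir="models/Infinity-Parser")
print(local_dir)  # pass this directory to the parser CLI via --model
```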

+ ## vLLM Inference
+ We recommend using the vLLM backend for accelerated inference.
+ It supports image and PDF inputs, automatically parses the document content, and exports the results in Markdown format to a specified directory.
+
+ ```shell
  parser --model /path/model --input dir/PDF/Image --output output_folders --batch_size 128 --tp 1
  ```
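
The same batch run can also be driven from Python, for instance as part of a larger pipeline (a minimal sketch; the paths are placeholders, and `--tp` is presumably the tensor-parallel degree handed to vLLM):

```python
import subprocess

# Mirrors the CLI invocation above; replace the paths with your own.
subprocess.run([
    "parser",
    "--model", "models/Infinity-Parser",
    "--input", "docs/",             # a directory of PDFs or images
    "--output", "output_folders",
    "--batch_size", "128",
    "--tp", "1",
], check=True)
```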
@@ -67,11 +88,110 @@ output_folders/

  </details>

+ ### Online Serving
+
+ <details>
+ <summary> Example </summary>
+
+ - Launch the vLLM Server
+
+ ```shell
+ vllm serve /path/to/model --tensor-parallel-size=4 --served-model-name=Infinity_Parser
+ ```
+
+ - Python Client Example
+
+ ```python
+ import base64
+
+ from openai import OpenAI
+
+ prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:
+
+ 1. Text Processing:
+ - Accurately recognize all text content in the PDF image without guessing or inferring.
+ - Convert the recognized text into Markdown format.
+ - Maintain the original document structure, including headings, paragraphs, lists, etc.
+
+ 2. Mathematical Formula Processing:
+ - Convert all mathematical formulas to LaTeX format.
+ - Enclose inline formulas with \( \). For example: This is an inline formula \( E = mc^2 \)
+ - Enclose block formulas with \[ \]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
+
+ 3. Table Processing:
+ - Convert tables to HTML format.
+ - Wrap the entire table with <table> and </table>.
+
+ 4. Figure Handling:
+ - Ignore figure content in the PDF image. Do not attempt to describe or convert images.
+
+ 5. Output Format:
+ - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
+ - For complex layouts, try to maintain the original document's structure and format as closely as possible.
+
+ Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
+ '''
+
+
+ def encode_image(image_path):
+     # Read the image file and return its base64 encoding for the data URL.
+     with open(image_path, "rb") as image_file:
+         return base64.b64encode(image_file.read()).decode("utf-8")
+
+
+ def build_message(image_path, prompt):
+     # One user turn carrying the page image plus the conversion instructions.
+     content = [
+         {
+             "type": "image_url",
+             "image_url": {
+                 "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
+             }
+         },
+         {"type": "text", "text": prompt}
+     ]
+     messages = [
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": content}
+     ]
+     return messages
+
+
+ client = OpenAI(
+     api_key="EMPTY",  # vLLM's OpenAI-compatible server does not check the key
+     base_url="http://localhost:8000/v1",
+ )
+
+
+ def request(messages):
+     completion = client.chat.completions.create(
+         messages=messages,
+         model="Infinity_Parser",
+         max_completion_tokens=8192,
+         temperature=0.0,
+         top_p=0.95,
+     )
+     return completion.choices[0].message.content
+
+
+ if __name__ == "__main__":
+     img_path = "path/to/image.png"
+     messages = build_message(img_path, prompt)
+     print(request(messages))
+ ```
+ </details>
+
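For parsing many pages against the same server, an asynchronous client is a natural extension of the example above (a sketch reusing `build_message` and `prompt` from the client code; everything else is a placeholder):

```python
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

async def parse_pages(list_of_messages):
    # One request per page; vLLM batches concurrent requests server-side.
    tasks = [
        aclient.chat.completions.create(
            messages=m,
            model="Infinity_Parser",
            max_completion_tokens=8192,
            temperature=0.0,
        )
        for m in list_of_messages
    ]
    results = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in results]

# Example: markdown = asyncio.run(parse_pages([build_message(p, prompt) for p in paths]))
```
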
  ## Using Transformers for Inference

  <details>
  <summary> Transformers Inference Example </summary>
+
  ```python
  import torch
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

@@ -139,7 +259,6 @@ output_text = processor.batch_decode(
  )
  print(output_text)
  ```
  </details>
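
Whichever backend you use, the decoded output is plain Markdown, so persisting it is straightforward (a sketch; `output_text` is the list returned by `processor.batch_decode` in the example above):

```python
from pathlib import Path

out_dir = Path("output_folders")
out_dir.mkdir(exist_ok=True)
for i, md in enumerate(output_text):
    # One Markdown file per decoded page.
    (out_dir / f"page_{i:04d}.md").write_text(md, encoding="utf-8")
```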

  # Visualization

@@ -147,6 +266,28 @@ print(output_text)
  ## Comparison Examples
  ![image](assets/case.jpeg)

+ # Synthetic Data Generation
+
+ The generation code is available at <a href="https://github.com/infly-ai/INF-MLLM/tree/main/Infinity-Parser/Infinity-Synth">Infinity-Synth</a>.
+
+ # Limitations & Future Work
+
+ ## Limitations
+ - **Layout / BBox**: The current model does not provide layout or bounding box (bbox) information, which limits its ability to support downstream tasks such as structured document reconstruction or reading order prediction.
+ - **Charts & Figures**: The model lacks perception and understanding of charts and figures, and therefore cannot perform visual reasoning or structured extraction for graphical elements.
+
+ ## Future Work
+
+ We are dedicated to enabling our model to **read like humans**, and we firmly believe that **Vision-Language Models (VLMs)** can make this vision possible. We have conducted **preliminary explorations of reinforcement learning (RL) for document parsing** and achieved promising initial results. In future research, we will continue to deepen our efforts in the following directions:
+
+ - **Chart & Figure Understanding**: Extend the model's capability to handle chart detection, semantic interpretation, and structured data extraction from graphical elements.
+
+ - **General-Purpose Perception**: Move toward a unified **Vision-Language perception model** that integrates detection, image captioning, OCR, layout analysis, and chart understanding into a single framework.
+
+ # Acknowledgments
+ We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [MinerU](https://github.com/opendatalab/MinerU), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [EasyR1](https://github.com/hiyouga/EasyR1), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [dots.ocr](https://github.com/rednote-hilab/dots.ocr) for providing code and models.
+
  # Citation

  ```