- 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
- Feel free to contribute by submitting a pull request.
- Cells marked with β or β have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
- See the results folder for the outputs from each model.
These tools usually output raw text or Markdown.
| Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|---|
| PyMuPDF | | Raw text | N | β | β | β | β | β | β |
| PDFPlumber | | Raw text | N | β (separate from text) | β | β | β | β | β |
| Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|
| Marker | | Markdown | N | β (markdown) | β | β | β | β |
| MonkeyOCR | | Markdown | Y | β (html) | β | β | β | β |
| Nougat | | Markdown | N | β | β | β | β | β |
| MinerU | | Markdown | N | β (html) | β | β | β | β |
| Llamaparse (balanced mode) | - | Markdown | Y | β (markdown) | β | β | β | β |
| Llamaparse (premium mode) | - | Markdown | Y | β (markdown) | β | β | β | β |
| Docling | | Markdown | N | β (markdown) | β | β | β | β |
| RolmOCR | | Markdown | Y | β (markdown) | β | β | β | β |
| olmOCR | | Markdown | Y | β (markdown) | β | β | β | β |
| Unstructured | | Raw text | N | β | β | β | β | β |
| Pytesseract | | Raw text | N | β | β | β | β | β |
| MarkItDown | | Markdown | N | β | β | β | β | β |
| Amazon Textract | - | |||||||
| Azure AI Document Intelligence | - | |||||||
| Google Cloud OCR | - | |||||||
| Mathpix | - | |||||||
| MistralOCR | - | |||||||
| Upstage | - | |||||||
| OmniAI | - | |||||||
| ChatDoc PDF parser | - | |||||||
| Reducto | - | |||||||
| OCRFlux | ||||||||
| Nanonets | ||||||||
| PaddleOCR | ||||||||
| ClovaOCR | - | |||||||
| ParseExtract | - | |||||||
| Tensorlake | - | |||||||
| Vectorize | - | |||||||
| MassivePix | - | |||||||
| Dolphin | ||||||||
| GOT | ||||||||
| Manga OCR | ||||||||
| EasyOCR | ||||||||
| PDFeditify | - |
β Process took too long
These tools usually output JSON containing bounding box coordinates, content (as raw text or Markdown), and sometimes a type (header, figure, paragraph, etc.).
🚧 WORK IN PROGRESS
| Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|
| Chunkr | |||||||
| GroundX | - | ||||||
| ChatDOC | - | ||||||
| Unstract |
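
To make the JSON-style output described above more concrete, here is a minimal sketch of what such a result might look like and how it could be flattened into Markdown. The field names (`bbox`, `type`, `content`) and the `[x0, y0, x1, y1]` top-left-origin coordinate convention are assumptions for illustration only; each tool listed here defines its own schema, so check its documentation before relying on any of these names.

```python
import json

# Hypothetical parser output: the field names ("bbox", "type", "content")
# are assumptions for illustration, not any specific tool's schema.
raw = """
[
  {"bbox": [72, 300, 540, 380], "type": "paragraph", "content": "Body text."},
  {"bbox": [72, 90, 540, 120], "type": "header", "content": "# Title"},
  {"bbox": [72, 140, 540, 280], "type": "figure", "content": ""}
]
"""


def to_markdown(chunks):
    """Join chunk contents in top-to-bottom, then left-to-right order."""
    # bbox is assumed to be [x0, y0, x1, y1] with the origin at the top-left.
    ordered = sorted(chunks, key=lambda c: (c["bbox"][1], c["bbox"][0]))
    # Skip chunks with no text content (e.g. the figure placeholder above).
    return "\n\n".join(c["content"] for c in ordered if c["content"])


chunks = json.loads(raw)
markdown = to_markdown(chunks)
print(markdown)
```

Note that sorting purely by the top edge is a simplification: on two-column or multi-column pages it would interleave the columns, which is one reason the column cases are tracked separately in the tables.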
If you would like to contribute, please read CONTRIBUTING.md first. Thank you!

