
olmocr

CLI tool

allenai/olmocr

A toolkit for linearizing PDFs into LLM datasets.

Overview

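A toolkit that converts PDFs and images into clean Markdown text, with support for natural reading order, complex layouts, tables, equations, and handwriting. It is designed for preparing high-quality training data for large language models.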

README preview

A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.

Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)

Features:
 - Convert PDF, PNG, and JPEG based documents into clean Markdown
 - Support for equations, tables, handwriting, and complex formatting
 - Automatically removes headers and footers
 - Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
 - Efficient, less than $200 USD per million pages converted
 - (Based on a 7B parameter VLM, so it requires a GPU)

### News
 - October 21, 2025 - v0.4.0 - [New model release](https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8), boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
 - August 13, 2025 - v0.3.0 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0825-FP8), fixes auto-rotation detection and hallucinations on blank documents.
 - July 24, 2025 - v0.2.1 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0725-FP8), scores 3 points higher on [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench), also runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
 - July 23, 2025 - v0.2.0 - New cleaned-up [trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train), makes it much simpler to train olmOCR models yourself.
 - June 17, 2025 - v0.1.75 - Switch from sglang to vllm based inference pipeline, updated docker image to CUDA 12.8.
 - May 23, 2025 - v0.1.70 - Official docker support and images are now available! [See Docker usage](#using-docker)
 - May 19, 2025 - v0.1.68 - [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench) launch, scoring 77.4. Launch includes 2 point performance boost in olmOCR pipeli