olmocr

CLI tool

allenai/olmocr

A toolkit for linearizing PDFs into LLM datasets.

Overview

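A toolkit that converts PDFs and images into clean Markdown text, with support for natural reading order, complex layouts, tables, equations, and handwriting. It is designed specifically for preparing high-quality training data for large language models.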
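The README preview cites a conversion cost of less than $200 USD per million pages. As a rough illustration of what that bound implies when budgeting a training-data corpus, the sketch below computes an upper-bound cost for a given page count. The function name `max_conversion_cost_usd` and the constant are hypothetical and not part of olmocr; real cost depends on the hardware used and on the documents themselves.

```python
# Hypothetical helper, not part of olmocr: turns the README's
# "less than $200 USD per million pages" figure into a budget ceiling.
COST_PER_MILLION_PAGES_USD = 200.0  # stated upper bound from the README


def max_conversion_cost_usd(num_pages: int) -> float:
    """Return an upper bound on conversion cost for num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * COST_PER_MILLION_PAGES_USD / 1_000_000


# Example: a 250,000-page corpus should cost at most this much.
print(max_conversion_cost_usd(250_000))  # 50.0
```

Because the README gives only an upper bound, the result is best read as a ceiling for planning rather than as a price quote.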

README preview

A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.

Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)

Features:
 - Convert PDF, PNG, and JPEG based documents into clean Markdown
 - Support for equations, tables, handwriting, and complex formatting
 - Automatically removes headers and footers
 - Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
 - Efficient, less than $200 USD per million pages converted
 - (Based on a 7B parameter VLM, so it requires a GPU)

### News
 - October 21, 2025 - v0.4.0 - [New model release](https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8), boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
 - August 13, 2025 - v0.3.0 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0825-FP8), fixes auto-rotation detection, and hallucinations on blank documents.
 - July 24, 2025 - v0.2.1 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0725-FP8), scores 3 points higher on [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench), also runs significantly faster because it's default FP8, and needs much fewer retries per document.
 - July 23, 2025 - v0.2.0 - New cleaned up [trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train), makes it much simpler to train olmOCR models yourself.
 - June 17, 2025 - v0.1.75 - Switch from sglang to vllm based inference pipeline, updated docker image to CUDA 12.8.
 - May 23, 2025 - v0.1.70 - Official docker support and images are now available! [See Docker usage](#using-docker)
 - May 19, 2025 - v0.1.68 - [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench) launch, scoring 77.4. Launch includes 2 point performance boost in olmOCR pipeli