OpenSource-Hub

opendataloader-pdf

ライブラリ

opendataloader-project/opendataloader-pdf

AI データ抽出とアクセシビリティのための PDF 解析器。

概要

構造化されたデータ(Markdown、JSON、HTML)を抽出するオープンソースの PDF ソリュータには、境界ボックス、読み取り順序、テーブルのサポートが含まれています。 タグなしの PDF をタグなしの PDF に自動的に変換し、エンタープライズクラスの PDF/UA をオプションで抽出します。 抽出精度ベンチャーテストで1位です。

README プレビュー

\n\n# OpenDataLoader PDF\n\n**PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.**\n\n[](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)\n[](https://pypi.org/project/opendataloader-pdf/)\n[](https://www.npmjs.com/package/@opendataloader/pdf)\n[](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)\n[](https://github.com/opendataloader-project/opendataloader-pdf#java)\n\n\n\n🔍 **PDF parser for AI data extraction** — Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.907 overall). Deterministic local mode + AI hybrid mode for complex pages.\n\n- **How accurate is it?** — #1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages ([benchmarks](#extraction-benchmarks))\n- **Scanned PDFs and OCR?** — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))\n- **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))\n- **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))\n\n♿ **PDF accessibility automation** — Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.\n\n- **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--