opendataloader-pdf

opendataloader-project/opendataloader-pdf

A PDF parser for AI data extraction and accessibility.

Project Overview

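An open-source PDF parser that extracts structured data (Markdown, JSON, HTML) with bounding boxes, reading order, and table support. It automatically converts untagged PDFs into tagged PDFs for accessibility, with optional enterprise-grade PDF/UA export. It ranks first in extraction accuracy benchmarks.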
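As a rough illustration of why bounding boxes matter downstream, the sketch below turns parsed elements into page-and-position citations for matched text. The element layout (`type`, `page`, `bbox`, `content`) and the helper names are hypothetical stand-ins, not the project's documented JSON schema or API; check the parser's actual output before relying on any field name.

```python
from dataclasses import dataclass

# Hypothetical element records. The field names are assumptions made for
# illustration, NOT the documented opendataloader-pdf JSON schema.
SAMPLE_ELEMENTS = [
    {"type": "heading", "page": 1, "bbox": [72.0, 700.0, 540.0, 720.0],
     "content": "Results"},
    {"type": "paragraph", "page": 1, "bbox": [72.0, 640.0, 540.0, 690.0],
     "content": "Accuracy improved on all tables."},
    {"type": "paragraph", "page": 2, "bbox": [72.0, 600.0, 540.0, 700.0],
     "content": "Scanned pages need OCR."},
]


@dataclass
class Citation:
    text: str
    page: int
    bbox: tuple


def find_citations(elements, query):
    """Return a Citation for each element whose text contains the query
    (case-insensitive substring match)."""
    needle = query.lower()
    return [
        Citation(e["content"], e["page"], tuple(e["bbox"]))
        for e in elements
        if needle in e["content"].lower()
    ]


def format_citation(c):
    """Render a citation as 'p.<page> [x0, y0, x1, y1]: <text>'."""
    box = ", ".join(f"{v:g}" for v in c.bbox)
    return f"p.{c.page} [{box}]: {c.text}"


hits = find_citations(SAMPLE_ELEMENTS, "ocr")
for hit in hits:
    print(format_citation(hit))
```

Keeping the page number and box next to each text span lets a retrieval pipeline point back to the exact region of the source PDF instead of citing the whole document.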

README Preview

# OpenDataLoader PDF

**PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.**

[License](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE) | [PyPI](https://pypi.org/project/opendataloader-pdf/) | [npm](https://www.npmjs.com/package/@opendataloader/pdf) | [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core) | [Java](https://github.com/opendataloader-project/opendataloader-pdf#java)

🔍 **PDF parser for AI data extraction**: Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.907 overall). Deterministic local mode + AI hybrid mode for complex pages.

- **How accurate is it?** #1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages ([benchmarks](#extraction-benchmarks))
- **Scanned PDFs and OCR?** Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
- **Tables, formulas, images, charts?** Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
- **How do I use this for RAG?** `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))

♿ **PDF accessibility automation**: Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.

- **What's the problem?** Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--