OpenSource-Hub

opendataloader-pdf

라이브러리

opendataloader-project/opendataloader-pdf

AI 데이터 추출 및 액세스 할 수없는 PDF 해독기.

개요

구조화된 데이터 (Markdown, JSON, HTML)를 추출하는 오픈 소스 PDF 솔루션에는 경계 상자, 읽기 순서 및 테이블 지원이 포함되어 있습니다. 자동으로 마크되지 않은 PDF를 마크되지 않은 PDF로 변환하여 편리한 PDF/UA를 가져올 수 있습니다.

README 미리보기

\n\n# OpenDataLoader PDF\n\n**PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.**\n\n[](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)\n[](https://pypi.org/project/opendataloader-pdf/)\n[](https://www.npmjs.com/package/@opendataloader/pdf)\n[](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)\n[](https://github.com/opendataloader-project/opendataloader-pdf#java)\n\n\n\n🔍 **PDF parser for AI data extraction** — Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.907 overall). Deterministic local mode + AI hybrid mode for complex pages.\n\n- **How accurate is it?** — #1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages ([benchmarks](#extraction-benchmarks))\n- **Scanned PDFs and OCR?** — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))\n- **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))\n- **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))\n\n♿ **PDF accessibility automation** — Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.\n\n- **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--