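Overview
MarkItDown converts a variety of file formats to Markdown for use with LLMs and text analysis. It supports PDF, Office documents, images, audio, HTML, and more, and preserves structure such as headings, tables, and links.
README preview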
# MarkItDown

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access. Sanitize your inputs in untrusted environments, and call the narrowest `convert_*` function needed for your use case (e.g., `convert_stream()`, or `convert_local()`). See the [Security Considerations](#security-considerations) section of the documentation for more information.

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.

MarkItDown currently supports the conversion from:

- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- Youtube URLs
- EPubs
- ... and more!

## Why Markdown?

Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

## Prerequisites

MarkItDown requires Python 3.10 or high
FAQ (2)
Troubleshooting
How do I resolve the 'No matching distribution found for youtube-transcript-api~=1.0.0' error when installing markitdown[all]?
The markitdown[all] extra pins youtube-transcript-api to ~=1.0.0, but that version is not available on PyPI. To work around it, install without the extra (pip install markitdown) and then add the packages you need separately (for example, pip install youtube-transcript-api). Alternatively, force the install with --no-deps and resolve the dependencies manually. Also check whether a newer markitdown release has fixed this pin.
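After installing without the extra, it can help to see which optional packages are still absent. The following is a minimal sketch using only the standard library. The helper name `missing_distributions` and the `OPTIONAL_DISTRIBUTIONS` list are hypothetical, and the list contains only the one package named in the answer above, not the full contents of the `[all]` extra.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the subset of `names` that is not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical checklist: only youtube-transcript-api is named in the FAQ answer.
OPTIONAL_DISTRIBUTIONS = ["youtube-transcript-api"]

if __name__ == "__main__":
    for name in missing_distributions(OPTIONAL_DISTRIBUTIONS):
        print(f"not installed: {name} (try: pip install {name})")
```

Anything the script reports can then be installed individually with pip, which sidesteps the pin in the extra.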
Troubleshooting
Why does markitdown 0.1.5 use excessive memory on large PDFs and crash with an out-of-memory (OOM) error?
This is a known regression (PR #1499) that causes memory usage to spike: for example, 2.7 GiB instead of 200 MiB for a 400-page PDF. Downgrade to markitdown 0.1.4 for now, and upgrade once the fix from issue #1611 is released.
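To find out whether an environment is running the release named above, the installed version can be read with the standard library. This is a rough sketch. The helper names are hypothetical, and it flags only 0.1.5, since the FAQ does not say which later releases, if any, contain the fix.

```python
from importlib import metadata

AFFECTED_VERSION = "0.1.5"  # the release named in the FAQ answer


def is_affected(version):
    """True if `version` is the release the FAQ reports the regression in."""
    return version.strip() == AFFECTED_VERSION


def installed_markitdown_version():
    """Return the installed markitdown version, or None if it is not installed."""
    try:
        return metadata.version("markitdown")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_markitdown_version()
    if version is None:
        print("markitdown is not installed")
    elif is_affected(version):
        print(f"markitdown {version}: consider pinning markitdown==0.1.4")
    else:
        print(f"markitdown {version}: not the release named in the FAQ")
```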