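Project Overview
MarkItDown converts a wide range of file formats to Markdown for use with LLMs and text analysis. It supports PDF, Office documents, images, audio, HTML, and more, while preserving structure such as headings, tables, and links.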
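As a rough illustration of the conversion described above, the following sketch converts a single local file and returns the Markdown text. It follows the project README's advice to sanitize inputs and to call the narrowest `convert_*` function that fits (here `convert_local()`). The helper names `resolve_inside` and `convert_local_file` and the idea of restricting files to one base directory are assumptions made for this example, not part of MarkItDown; the sketch also assumes the `markitdown` package is installed.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir; raise ValueError if it escapes base_dir.

    Hypothetical helper for this example: one simple way to sanitize
    an untrusted file path before handing it to a converter.
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


def convert_local_file(base_dir: str, user_path: str) -> str:
    """Convert one local file to Markdown text (hypothetical wrapper)."""
    # Imported lazily so the path helper can be used without the package.
    from markitdown import MarkItDown

    safe_path = resolve_inside(base_dir, user_path)
    converter = MarkItDown()
    # convert_local() only reads from the local filesystem, which is
    # narrower than the general-purpose convert() entry point.
    result = converter.convert_local(str(safe_path))
    return result.text_content
```

For example, `convert_local_file("uploads", "report.pdf")` would return the Markdown for `uploads/report.pdf`, while a path such as `"../secrets.txt"` is rejected before any file is opened.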
README Preview
# MarkItDown

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access. Sanitize your inputs in untrusted environments, and call the narrowest `convert_*` function needed for your use case (e.g., `convert_stream()`, or `convert_local()`). See the [Security Considerations](#security-considerations) section of the documentation for more information.

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.

MarkItDown currently supports the conversion from:

- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- Youtube URLs
- EPubs
- ... and more!

## Why Markdown?

Markdown is extremely close to plain text, with minimal markup or formatting, but still
provides a way to represent important document structure. Mainstream LLMs, such as
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
responses unprompted. This suggests that they have been trained on vast amounts of
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.

## Prerequisites
MarkItDown requires Python 3.10 or high
FAQ (2)
Troubleshooting
How do I fix the 'No matching distribution found for youtube-transcript-api~=1.0.0' error when installing markitdown[all]?
The markitdown[all] extra pins youtube-transcript-api to ~=1.0.0, but that version is missing from PyPI. Workaround: install without extras (pip install markitdown), then add the packages you need separately (for example, pip install youtube-transcript-api). Alternatively, force the install with --no-deps and resolve the dependencies manually. Check whether a newer release of markitdown fixes the pin.
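The workaround above can be scripted. This is a minimal sketch, not an official installer: it builds the two pip invocations named in the answer (the base package without extras, then each extra package on its own) and runs them with the current interpreter. The function names `install_commands` and `run_all` are made up for this example.

```python
import subprocess
import sys


def install_commands(extra_packages):
    """Build pip commands: markitdown without extras, then each extra package."""
    pip_install = [sys.executable, "-m", "pip", "install"]
    commands = [pip_install + ["markitdown"]]
    commands += [pip_install + [package] for package in extra_packages]
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        print("running:", " ".join(command))
        subprocess.check_call(command)
```

Calling `run_all(install_commands(["youtube-transcript-api"]))` performs the installs; inspecting the result of `install_commands` first shows exactly what would be executed.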
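Troubleshooting
Why does markitdown 0.1.5 consume excessive memory and crash with OOM when processing large PDFs?
Downgrade to markitdown 0.1.4. This is a known regression (PR #1499) that causes memory usage to spike (for example, from 200 MiB to 2.7 GiB when processing a 400-page PDF). Upgrade once the fix tracked in issue #1611 has been released.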
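To check whether an environment is running the affected release, a small guard like the following may help. It is a sketch under one stated assumption: only version 0.1.5 is treated as affected, because that is the only version the answer names. Whether any later release is affected or fixed is not known here, and the names `AFFECTED_VERSIONS`, `is_affected`, and `check_markitdown` are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: only the version named in the report is listed here.
AFFECTED_VERSIONS = {"0.1.5"}


def is_affected(installed: str) -> bool:
    """Return True if the given version string is in the known-affected set."""
    return installed.strip() in AFFECTED_VERSIONS


def check_markitdown() -> str:
    """Describe the installed markitdown version relative to the regression."""
    try:
        installed = version("markitdown")
    except PackageNotFoundError:
        return "markitdown is not installed"
    if is_affected(installed):
        return (
            f"markitdown {installed} is affected; "
            "consider: pip install markitdown==0.1.4"
        )
    return f"markitdown {installed} is not in the known-affected set"
```

Running `print(check_markitdown())` before a large PDF job gives an early warning instead of an out-of-memory crash partway through.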