OpenSource-Hub

markitdown

CLI Tool

microsoft/markitdown

Python tool for converting files and office documents to Markdown.

Overview

MarkItDown converts various file formats to Markdown for use with LLMs and text analysis. It supports PDF, Office documents, images, audio, HTML, and more, preserving structure like headings, tables, and links.
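
As a minimal sketch of in-process use, the helper below converts bytes that are already in memory by calling `convert_stream()`, one of the narrow `convert_*` entry points the project README names. The helper name `bytes_to_markdown` and its `converter` parameter are illustrative assumptions, not part of the library; the sketch also assumes the result object exposes a `text_content` attribute, as in the project's documented usage.

```python
import io


def bytes_to_markdown(data: bytes, converter=None) -> str:
    """Convert in-memory bytes to Markdown text via convert_stream().

    `converter` is a hypothetical injection point: any object with a
    convert_stream(stream) method whose result has a text_content attribute.
    """
    if converter is None:
        # Imported lazily so the helper can also run with a stand-in converter.
        from markitdown import MarkItDown

        converter = MarkItDown()
    # The helper hands over a binary file-like object only; it never passes
    # a caller-supplied path or URL to the converter.
    result = converter.convert_stream(io.BytesIO(data))
    return result.text_content
```

With markitdown installed, a call such as `bytes_to_markdown(pdf_bytes)` (where `pdf_bytes` is whatever your application has already read) would return the Markdown text. Passing a stand-in `converter` keeps the helper easy to test without the library.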

README Preview

# MarkItDown

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access. Sanitize your inputs in untrusted environments, and call the narrowest `convert_*` function needed for your use case (e.g., `convert_stream()`, or `convert_local()`). See the [Security Considerations](#security-considerations) section of the documentation for more information.

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.

MarkItDown currently supports the conversion from:

- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- Youtube URLs
- EPubs
- ... and more!

## Why Markdown?

Markdown is extremely close to plain text, with minimal markup or formatting, but still
provides a way to represent important document structure. Mainstream LLMs, such as
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
responses unprompted. This suggests that they have been trained on vast amounts of
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.

## Prerequisites
MarkItDown requires Python 3.10 or higher.

FAQ (2)

Troubleshooting
How to fix 'No matching distribution found for youtube-transcript-api~=1.0.0' when installing markitdown[all]?

The markitdown[all] extra pins youtube-transcript-api to ~=1.0.0, but that version is missing from PyPI, so pip cannot resolve the dependency. As a workaround, install the base package without the extra (pip install markitdown), then add only the packages you need separately (for example, pip install youtube-transcript-api). Alternatively, install with --no-deps and resolve the dependencies manually. Also check whether a newer markitdown release has fixed the pin.

GitHub Issue #1809
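
After installing the base package and adding optional packages by hand, as the workaround above describes, it can help to confirm which ones are actually importable. The sketch below is an illustrative helper, not part of markitdown: the function name `missing_modules` is an assumption, and it only checks top-level module names. `youtube_transcript_api` is the import name of the youtube-transcript-api package.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Prints the subset of these modules that is not installed; an empty
    # list means both are importable in the current environment.
    print(missing_modules(["markitdown", "youtube_transcript_api"]))
```
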
Troubleshooting
Why does markitdown 0.1.5 use excessive memory and crash with OOM on large PDFs?

This is a known regression in markitdown 0.1.5 (PR #1499) that causes memory usage to spike on large PDFs (for example, 2.7 GiB instead of 200 MiB for a 400-page PDF). Until a fix is released, downgrade to markitdown 0.1.4; upgrade again once the fix tracked in issue #1611 ships.

GitHub Issue #1611
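
To guard against the memory regression described above, a pipeline could check the installed version before processing large PDFs. The sketch below is one way to do that with the standard library; the names `AFFECTED_VERSIONS`, `installed_version`, and `is_affected` are illustrative assumptions, and the set lists only 0.1.5 because that is the only release the report names.

```python
from importlib import metadata

# Assumption: only the release named in the report is treated as affected.
AFFECTED_VERSIONS = {"0.1.5"}


def installed_version(package="markitdown"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_affected(version):
    """Return True if the given version string is in the affected set."""
    return version in AFFECTED_VERSIONS


if __name__ == "__main__":
    version = installed_version()
    if is_affected(version):
        print(f"markitdown {version} is affected; consider pinning 0.1.4")
```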