
MarkItDown vs Unstructured: Which Should You Use for LLM Document Prep?

June 28, 2026

You are getting documents ready for a RAG pipeline or an LLM prompt, and two Python tools keep coming up: Microsoft's MarkItDown and Unstructured (unstructured.io). They sound similar, since both turn messy files into something an LLM can read, but they solve different problems. This post walks through the practical differences so you can pick the right one.

If you would rather not run Python at all, file2markdown produces the same clean Markdown through a browser or REST API with zero setup.

The Quick Answer

Use MarkItDown when you want fast, plain Markdown out of many file types and your documents are reasonably simple. It is the lightest path from file to LLM-ready text.

Use Unstructured when you are building a serious ingestion pipeline and need structured elements — titles, narrative text, tables, list items — with metadata for chunking, filtering, and retrieval quality.

Use file2markdown when you want hosted conversion (web UI or API) for PDF, PPTX, XLSX, and more without managing dependencies or OCR engines.

What Each Tool Is

MarkItDown (by Microsoft) is a lightweight converter built to produce LLM-friendly Markdown. It wraps existing parsers (pdfminer.six, mammoth, python-pptx, openpyxl, and others) behind one consistent convert() call and returns a result whose text_content is a single Markdown string. It supports a wide range of inputs (PDF, Office files, images, audio, EPUB, HTML) and optimizes for simplicity.

Unstructured is a document preprocessing framework aimed at RAG. Instead of one Markdown blob, it partitions a document into a list of typed elements (Title, NarrativeText, Table, ListItem, etc.), each with metadata like page number and coordinates. It ships an open-source library plus a hosted Serverless API, and supports OCR and high-resolution layout models for complex documents.
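To make the "typed elements with metadata" idea concrete, here is a minimal sketch of what that shape buys you. ElementRecord, its fields, the select() helper, and the sample data are all hypothetical stand-ins invented for illustration; they are not Unstructured's own classes or output.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ElementRecord:
    """Hypothetical stand-in for one partitioned element (not Unstructured's class)."""
    element_type: str   # e.g. "Title", "NarrativeText", "Table", "ListItem"
    text: str
    page_number: int


def select(records, types=None, page=None):
    """Keep records matching an optional set of types and an optional page number."""
    return [
        r for r in records
        if (types is None or r.element_type in types)
        and (page is None or r.page_number == page)
    ]


# Made-up sample data, only to show the shape of the list.
records = [
    ElementRecord("Title", "Quarterly Report", 1),
    ElementRecord("NarrativeText", "Revenue grew in the third quarter.", 1),
    ElementRecord("Table", "Region | Revenue", 2),
    ElementRecord("ListItem", "Expand into two new markets", 2),
]

# With typed records, "only the tables on page 2" is a one-line filter.
tables_on_page_2 = select(records, types={"Table"}, page=2)
print(json.dumps([asdict(r) for r in tables_on_page_2], indent=2))
```

With a single Markdown string, the same question would mean re-parsing the text, which is the practical difference between the two output styles.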

Head-to-Head Comparison

| Feature | MarkItDown | Unstructured |
| --- | --- | --- |
| Primary output | Single Markdown string | List of typed elements (+ JSON) |
| Best for | Quick file → Markdown | RAG ingestion + chunking |
| Format support | PDF, Office, images, audio, EPUB, HTML | PDF, Office, email, HTML, images, more |
| Table handling | Basic | Good (hi-res strategy) |
| OCR | Limited | Yes (multiple strategies) |
| Metadata per chunk | No | Yes (page, type, coordinates) |
| Install footprint | Small | Large (layout/OCR models) |
| Hosted option | No | Yes (Serverless API) |

Installing and Using Each

MarkItDown

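# The [all] extra installs the optional parsers (PDF, Office, and so on);
# recent releases do not convert PDFs with the bare package alone.
pip install "markitdown[all]"

from markitdown import MarkItDown

# One converter instance can be reused across files
md = MarkItDown()
result = md.convert("report.pdf")
print(result.text_content)  # ready-to-use Markdown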

That is the whole workflow — one call, one Markdown string. Drop it straight into a prompt or a chunker.
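If the "chunker" step is new to you, the sketch below shows one simple way to split a Markdown string into per-heading sections. It uses only the standard library and is not part of MarkItDown; the function name split_markdown_by_heading and the sample text are made up for illustration, and it only recognizes "#"-style headings.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split a Markdown string into chunks, one per '#'-style heading section."""
    sections = []
    heading, lines = None, []
    in_fence = False

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
        match = None if in_fence else re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


# Made-up sample; in practice you would pass result.text_content instead.
sample = "Intro text\n\n# Title\n\nBody one.\n\n## Sub\n\nBody two.\n"
for chunk in split_markdown_by_heading(sample):
    print(chunk["heading"], "->", chunk["text"])
```

Heading-based splitting keeps each chunk topically coherent, but it will not cap chunk size; a production chunker would usually add a length limit on top.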

Unstructured

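pip install "unstructured[all-docs]"

from unstructured.partition.auto import partition

# hi_res runs a layout model: slower, but better on tables and complex pages
elements = partition(filename="report.pdf", strategy="hi_res")
for el in elements:
    # Print each element's type (Title, NarrativeText, Table, ...) and a text preview
    print(type(el).__name__, "->", el.text[:80])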

You get a list of elements you can filter (for example, dropping headers and footers), group, and chunk by title, which is exactly what you want when feeding a vector store. The trade-off is a heavier install and more configuration.
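As a rough illustration of that filter-and-group step, the sketch below drops header and footer elements and starts a new chunk at every title. The small classes are stand-ins so the example runs without Unstructured installed; they are not the library's classes, and the Header and Footer type names, the group_by_title helper, and the sample data are assumptions for this sketch. The helper dispatches on type(el).__name__, the same way the loop above prints element types.

```python
from dataclasses import dataclass


# Stand-in element classes (NOT Unstructured's own), named after the types above.
@dataclass
class Title:
    text: str


@dataclass
class NarrativeText:
    text: str


@dataclass
class Header:
    text: str


@dataclass
class Footer:
    text: str


DROP_TYPES = {"Header", "Footer"}  # assumed names for page furniture


def group_by_title(elements):
    """Drop headers/footers, then start a new chunk at every Title element."""
    chunks = []
    current = None
    for el in elements:
        kind = type(el).__name__
        if kind in DROP_TYPES:
            continue
        if kind == "Title":
            current = {"title": el.text, "body": []}
            chunks.append(current)
        else:
            if current is None:
                # Text that appears before the first title gets its own chunk.
                current = {"title": None, "body": []}
                chunks.append(current)
            current["body"].append(el.text)
    return chunks


# Made-up sample elements.
elements = [
    Header("ACME Confidential"),
    Title("Overview"),
    NarrativeText("Revenue grew."),
    Footer("Page 1"),
    Title("Risks"),
    NarrativeText("Supply is tight."),
    NarrativeText("Costs rose."),
]
for chunk in group_by_title(elements):
    print(chunk["title"], "->", " ".join(chunk["body"]))
```

Unstructured also ships its own chunking helpers, which are likely the better choice in a real pipeline; the point here is only to show why typed elements make this kind of grouping straightforward.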

When to Reach for file2markdown Instead

Both libraries assume a Python environment, and scanned PDFs add an OCR setup on top (plus layout-model downloads if you use Unstructured's hi_res strategy). If you just need the Markdown, or you are calling from a non-Python stack, a serverless function, or a no-code tool, file2markdown does the conversion as a hosted service:

  • Web UI for one-off PDF, DOCX, and image conversions
  • A REST API for batch and automated workflows
  • OCR handled server-side, no PyTorch install

Bottom Line

MarkItDown is the fastest way to get plain Markdown; Unstructured is the better backbone for a production RAG pipeline that needs structured, chunk-aware output. Pick MarkItDown for speed and simplicity, Unstructured for retrieval quality at scale — and reach for file2markdown when you want the result without running either.
