Pandoc PDF to Markdown: What Works, What Doesn't, and Better Alternatives
Pandoc is the Swiss-army knife of document conversion. It handles Markdown to HTML, DOCX to LaTeX, RST to PDF — and just about everything in between. So when you need to extract content from a PDF and turn it into clean Markdown for an AI pipeline or knowledge base, Pandoc seems like the obvious first stop.
The problem: Pandoc was not built for PDF parsing. Converting PDF to Markdown is one of its most unreliable use cases, and understanding why will save you hours of debugging messy output. file2markdown was designed specifically for this job — but first, let's give Pandoc a fair assessment.
What Pandoc Actually Does With a PDF
When you run the basic command:
pandoc input.pdf -o output.md
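Pandoc does not "read" the PDF the way a dedicated parser does. In fact, Pandoc has no PDF reader at all: it can write PDF but not read it, and current releases reject the command above with an error stating that Pandoc can convert to PDF but not from PDF. Any Pandoc-based PDF workflow therefore depends on an external helper, typically pdftotext from the Poppler toolkit, to extract raw text, which Pandoc then wraps in minimal Markdown structure. No OCR, no layout analysis, no table detection.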
The result is a single block of extracted text. Formatting is almost entirely lost: multi-column layouts merge into one stream, headers look identical to body text, tables collapse into rows of space-separated values, and footnotes end up inline. For PDFs that are primarily text-based with simple layouts, the output is usable. For anything more complex, it's a cleanup project.
When Pandoc Works Reasonably Well
Pandoc's PDF-to-Markdown output is acceptable in a narrow set of cases:
- Simple single-column PDFs with minimal formatting (legal documents, plain reports, e-books converted from EPUB)
- Text-heavy academic papers where the content matters more than structure and you're feeding it straight into an LLM anyway
- Quick-and-dirty extraction where you need any text and will clean it by hand or with a script afterward
For these scenarios, Pandoc is fast, free, and a one-command install on most developer machines. Run the command, pipe the output, move on.
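For the "clean it with a script afterward" case, a minimal sketch of what such a script might look like is shown here. It assumes the input is plain text as produced by pdftotext (hard-wrapped lines, form feeds between pages); the function name tidy_extracted_text is hypothetical, and the dehyphenation rule is a blunt heuristic that will also join genuinely hyphenated words that happen to break across lines.

```python
import re


def tidy_extracted_text(raw: str) -> str:
    """Rejoin hard-wrapped lines from plain-text PDF extraction.

    Heuristic only: a word such as "well-\nknown" is merged into
    "wellknown", so review the result for hyphenated compounds.
    """
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    raw = raw.replace("\f", "\n\n")
    # Rejoin words split across a line break: "conver-\nsion" -> "conversion".
    raw = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    paragraphs = []
    for block in re.split(r"\n\s*\n", raw):
        # Collapse remaining line breaks and runs of spaces inside a paragraph.
        text = " ".join(block.split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs) + "\n"
```

Blank lines are treated as the only paragraph boundary, which matches simple single-column documents and little else.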
Where Pandoc Consistently Fails
Outside of the simple cases above, Pandoc's PDF handling breaks down quickly:
Scanned PDFs — Pandoc has no OCR capability. If your PDF is an image scan, you'll get an empty or near-empty Markdown file. Pandoc cannot read pixels.
Tables — Even in selectable-text PDFs, Pandoc does not detect table structure. Table data becomes garbled rows of text with no column alignment. For research papers, financial reports, or anything data-heavy, this is a dealbreaker.
Multi-column layouts — Academic papers and magazines use two- or three-column layouts. Pandoc reads across columns left-to-right, producing text that alternates between columns mid-sentence.
Footnotes and endnotes — These get merged into the main text flow at whatever position the PDF renders them, making the output structurally incorrect.
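Mathematical equations — PDF has no native representation of equations as LaTeX; they're stored as individually positioned glyphs (or, in some files, vector outlines) with no record of fractions, superscripts, or operators. Pandoc cannot reconstruct them.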
Images and figures — Inline images are dropped entirely. If your PDF's key data is in charts or diagrams, none of it survives conversion.
Step-by-Step: Using Pandoc for PDF to Markdown
If you've assessed your PDF and it fits the "simple text" profile, here's the cleanest way to use Pandoc:
1. Install Pandoc and Poppler
# macOS
brew install pandoc poppler
# Ubuntu/Debian
sudo apt-get install pandoc poppler-utils
# Windows (via Chocolatey); Poppler's pdftotext must be installed separately
choco install pandoc
2. Run the basic conversion
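pdftotext input.pdf - | pandoc -f markdown -o output.md
Pandoc itself rejects PDF input (pandoc input.pdf -o output.md exits with an error), so pdftotext performs the extraction and Pandoc only normalizes the text into Markdown.
3. Improve output with format flags
pdftotext input.pdf - | pandoc -f markdown -o output.md --wrap=none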
The --wrap=none flag prevents Pandoc from hard-wrapping lines at its default width of 72 columns, which makes the output cleaner for downstream processing.
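4. Extract as plain text with layout preserved (often cleaner)
The -layout flag tells pdftotext to keep the physical positioning of text on the page, which helps table columns stay aligned. One caveat: Pandoc's Markdown reader treats lines indented by four or more spaces as code blocks, so heavily indented output may need its leading whitespace trimmed. For heavily formatted PDFs, you may get better results extracting raw text this way and then processing it yourself: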
pdftotext -layout input.pdf - | pandoc -f markdown -o output.md
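If you do process the layout-preserved text yourself, one common trick is to split lines on runs of two or more spaces to recover table cells. The following is a rough sketch under that assumption. The helper names split_columns and rows_to_markdown_table are hypothetical, and the approach breaks on empty cells, cells that contain double spaces, and cells containing a pipe character.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split one layout-preserved line into cells on runs of 2+ spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def rows_to_markdown_table(lines: list[str]) -> str:
    """Render lines that share a column count as a Markdown pipe table.

    The first non-blank line is used as the header row.
    """
    rows = [split_columns(line) for line in lines if line.strip()]
    if not rows or len(rows[0]) < 2 or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("lines do not form a consistent grid")

    header, *body = rows
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)
```

Raising on an inconsistent grid is deliberate: a silently misaligned table is harder to spot downstream than a failed conversion.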
5. Automate with Python
If you're processing multiple PDFs in a pipeline, you can call Pandoc via subprocess:
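import subprocess

# Pandoc has no PDF reader, so extract the text with pdftotext first ("-" = stdout).
text = subprocess.run(
    ["pdftotext", "input.pdf", "-"],
    capture_output=True, encoding="utf-8", check=True,
).stdout
# Feed the extracted text to Pandoc on stdin; -o writes the Markdown file.
subprocess.run(
    ["pandoc", "-f", "markdown", "-o", "output.md", "--wrap=none"],
    input=text, encoding="utf-8", check=True,
)
print("Wrote output.md")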
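To extend this to a whole directory, something like the following sketch could work. It assumes pdftotext and pandoc are both on the PATH, and because Pandoc does not read PDF itself, it extracts text with pdftotext and passes it to Pandoc on stdin. The function names (extraction_command, pandoc_command, convert_directory) are hypothetical.

```python
import subprocess
from pathlib import Path


def extraction_command(pdf_path: Path) -> list[str]:
    # "-" sends the extracted text to stdout instead of a .txt file.
    return ["pdftotext", str(pdf_path), "-"]


def pandoc_command(md_path: Path) -> list[str]:
    # Pandoc reads the extracted text from stdin and writes Markdown to md_path.
    return ["pandoc", "-f", "markdown", "--wrap=none", "-o", str(md_path)]


def convert_directory(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every *.pdf in src_dir; return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        md_path = out_dir / (pdf_path.stem + ".md")
        try:
            text = subprocess.run(
                extraction_command(pdf_path),
                capture_output=True, encoding="utf-8", check=True,
            ).stdout
            subprocess.run(
                pandoc_command(md_path),
                input=text, encoding="utf-8", check=True,
            )
        except subprocess.CalledProcessError:
            # A non-zero exit from either tool: record the file and keep going.
            failed.append(pdf_path)
    return failed
```

A missing executable raises FileNotFoundError and stops the run, since that is a setup problem and not a per-file failure.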
For Python-native PDF extraction that's more flexible, see our guide on automating PDF to Markdown with Python.
When to Use a Dedicated PDF-to-Markdown Converter
Pandoc's PDF limitations are architectural — they're not bugs that will be patched. If you need any of the following, you need a tool built specifically for PDF parsing:
- Accurate table extraction (financial reports, research data)
- OCR for scanned documents (legacy archives, printed materials)
- Layout-aware conversion (multi-column papers, newsletters)
- Image and figure preservation
- Clean output for RAG pipelines (vector databases need clean chunks, not noise)
file2markdown.ai's PDF to Markdown converter uses a dedicated document intelligence pipeline that handles all of these cases. Upload your PDF in the browser — no install, no command line — and get structured Markdown with tables intact, headings detected, and OCR applied to any scanned content.
For developers who need this at scale, the PDF to Markdown API accepts REST calls and returns clean Markdown with metadata. You can also batch-process dozens of PDFs at once — see the batch conversion guide for details.
If you're choosing between tools more broadly, the file2markdown vs Pandoc comparison lays out the full trade-off matrix, and our best PDF to Markdown converters roundup covers more options including MarkItDown, Marker, and Docling.
Frequently Asked Questions
Can Pandoc convert a scanned PDF to Markdown?
No. Pandoc has no OCR capability. If your PDF contains scanned images of text rather than selectable text, Pandoc will produce an empty or near-empty output file. You need a tool with built-in OCR, such as file2markdown.ai, Marker, or Adobe Acrobat's export feature.
Why does Pandoc's PDF to Markdown output look so garbled?
Pandoc-based workflows delegate PDF reading to pdftotext (Pandoc itself has no PDF reader), which extracts raw character streams without understanding layout. Multi-column text gets merged, headers are indistinguishable from body text, and tables collapse. The output is often usable as raw text input for an LLM, but it requires significant cleanup before it's readable as structured Markdown.
Is there a way to improve Pandoc's PDF output quality?
The --wrap=none flag and piping through pdftotext -layout can help with simple PDFs. For structured content with tables or complex layouts, no Pandoc flags will fix the fundamental limitation — you'll need a dedicated PDF parser.
What is the best alternative to Pandoc for PDF to Markdown?
For a no-install web-based option, file2markdown.ai handles PDFs including scanned documents, tables, and multi-column layouts. For Python-native workflows, Marker and Docling are strong open-source options. The MarkItDown vs Pandoc post compares two popular Python alternatives in detail.
The Bottom Line
Pandoc is an excellent tool, but it's a generalist format converter, not a PDF parser. For simple, selectable-text PDFs where you just need the raw content, piping pdftotext output into Pandoc gets you there in seconds. For anything with tables, scans, multi-column layouts, or structure that needs to survive the conversion, you'll need a tool purpose-built for the job.
The free PDF to Markdown converter at file2markdown.ai handles all the cases Pandoc can't — and produces Markdown clean enough for RAG pipelines, documentation, and AI workflows without any post-processing.