RAG Document Prep: PDF to Markdown to Chunks
If your RAG pipeline is hallucinating or missing key context, the issue often isn't your embedding model. It's how you extract and chunk your PDFs.
The Quick Answer: The Markdown Pipeline
The most reliable way to prepare documents for Retrieval-Augmented Generation (RAG) is a three-step pipeline: PDF → Markdown → Semantic Chunks. You can start the first step instantly using our free PDF to Markdown converter.
- Convert PDF to Markdown: Transform raw PDFs into structured Markdown, preserving headers and tables.
- Semantic Chunking: Split the Markdown file based on headers (##, ###) rather than arbitrary character counts.
- Embed and Store: Generate embeddings for these logical chunks and load them into your vector database.
This approach keeps related concepts together, so the LLM receives each section with its context intact.
Step 1: Converting PDF to Markdown
PDFs are designed for printing, not for data extraction. They store text as coordinates on a page, meaning paragraphs, headers, and tables are often lost when using basic text extractors.
When you convert a PDF directly to Markdown, you restore this semantic structure. A good converter will recognize that a large, bold font is an H1 or H2, and format it as # or ##. It will also reconstruct tables using standard Markdown syntax.
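The font-size heuristic described above can be sketched in a few lines. This is a minimal illustration, not how any particular converter works: it assumes an upstream extractor has already produced `(text, font_size)` pairs (a hypothetical input shape), treats the most common size as body text, and maps larger sizes to heading levels in descending order.

```python
from collections import Counter


def spans_to_markdown(spans):
    """Turn (text, font_size) spans into Markdown using a size heuristic.

    Assumption: the most frequent font size is body text. Every larger
    size becomes a heading, with the largest size mapped to '#'.
    """
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]
    heading_sizes = sorted(
        {size for _, size in spans if size > body_size}, reverse=True
    )
    level_for = {size: rank + 1 for rank, size in enumerate(heading_sizes)}

    lines = []
    for text, size in spans:
        level = level_for.get(size)
        if level:
            # Markdown supports at most six heading levels.
            lines.append("#" * min(level, 6) + " " + text.strip())
        else:
            lines.append(text.strip())
    return "\n\n".join(lines)
```

Real converters also weigh boldness, position, and spacing; font size alone is only a starting point.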
For a deep dive into why this format is superior for AI, read our guide on Markdown for RAG pipelines.
Step 2: Semantic Chunking
Once you have clean Markdown, you need to chunk it. Many developers default to fixed-size chunking (e.g., 1000 characters with a 200-character overlap). As a primary strategy, this is a mistake. Fixed-size splits routinely cut sentences in half or separate a crucial heading from the paragraph it describes.
Instead, use semantic chunking. Because Markdown uses specific characters for headings, you can programmatically split your documents based on these markers.
Tools like LangChain's MarkdownHeaderTextSplitter allow you to split text by headers, keeping the header information as metadata. This ensures that a specific section of a document stays together in a single chunk. We cover this extensively in our post on chunking Markdown for vector databases.
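To make the idea concrete, here is a standard-library sketch of header-based splitting. It is a stand-in for illustration, not LangChain's implementation: the function name `split_by_headers`, the `max_level` parameter, and the chunk shape (a dict with `headers` and `text`) are all assumptions of this example.

```python
import re

# ATX-style headings: one to six '#' characters, a space, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown, max_level=3):
    """Split Markdown at headings, keeping the header path as metadata."""
    chunks = []
    path = {}    # heading level -> title currently in effect
    buffer = []  # body lines collected for the current section

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headers": dict(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new heading closes any open heading at the same or deeper level.
            for lvl in [l for l in path if l >= level]:
                del path[lvl]
            path[level] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks
```

Note that this sketch does not skip `#` lines inside fenced code blocks, so a production splitter needs to track fences as well.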
Step 3: Embedding and Storage
With your documents neatly divided into semantic chunks, the final step is to generate embeddings and store them in your vector database (like Pinecone, Weaviate, or Milvus).
Because the chunks are logically grouped, the embeddings will more accurately represent the core concepts of each section. When a user queries your RAG system, the retrieved chunks will contain complete thoughts, which tends to reduce hallucinations caused by fragmented context.
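The embed-and-store step might look roughly like the sketch below. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, and `InMemoryStore` is a hypothetical class, not the client API of Pinecone, Weaviate, or Milvus. Chunks are assumed to be dicts with `headers` and `text` keys.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Hypothetical minimal vector store used only for illustration."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.rows = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        # Embed the header path together with the body so the section
        # title contributes to retrieval.
        header_path = " > ".join(chunk["headers"].values())
        vector = self.embed(header_path + "\n" + chunk["text"])
        self.rows.append((vector, chunk))

    def search(self, query, k=1):
        query_vector = self.embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True
        )
        return [chunk for _, chunk in ranked[:k]]
```

In a real pipeline you would swap `toy_embed` for your embedding model and `InMemoryStore` for your vector database client. The design point that carries over is embedding the header path along with the chunk body.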
Edge Cases in Document Prep
While the PDF → Markdown → Chunks pipeline is robust, you will encounter edge cases.
Scanned PDFs and OCR
If your PDF is a scanned image, it has no text layer, so standard text extraction will return little or nothing. You must use Optical Character Recognition (OCR) to extract the text before converting it to Markdown. This adds a layer of complexity and processing time to your pipeline.
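One simple way to decide which pages need OCR is to check how much text the normal extractor returned for each page. The sketch below assumes you already have a list of per-page extracted strings. The function name and the `min_chars` threshold are illustrative choices, not values from any specific tool.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indexes of pages whose extracted text layer looks empty.

    Assumption: a page with fewer than `min_chars` non-whitespace
    characters is probably a scanned image and should be routed to OCR.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]
```

Routing only the flagged pages to OCR keeps the extra processing time limited to the pages that need it.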
Massive Tables
Sometimes, a table in a PDF spans multiple pages. Reconstructing this into a single Markdown table can be challenging. If the resulting Markdown table is too large for your embedding model's token limit, you may need to implement a strategy to summarize the table or split it logically.
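A logical split for an oversized table could work row by row, repeating the header in every piece so each chunk stays interpretable on its own. This is a sketch under simple assumptions: the table is well-formed pipe-style Markdown with a header row and a separator row, and row count is used as a rough proxy for token count. The function name and `max_rows` parameter are hypothetical.

```python
def split_markdown_table(table, max_rows=50):
    """Split a Markdown table into pieces of at most `max_rows` data rows.

    The header and separator rows are repeated at the top of each piece.
    """
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]

    pieces = []
    # max(..., 1) keeps a header-only table as a single piece.
    for start in range(0, max(len(rows), 1), max_rows):
        piece_rows = rows[start:start + max_rows]
        pieces.append("\n".join([header, separator, *piece_rows]))
    return pieces
```

For stricter limits, you could replace the row count with a real token count from your embedding model's tokenizer.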
Frequently Asked Questions (FAQ)
Q: Why not just use raw text for RAG?
A: Raw text loses most of a document's semantic structure. Headings, paragraphs, and lists blend together, making it hard to chunk logically. This leads to fragmented context and poor retrieval accuracy.
Q: Can I automate the PDF to Markdown conversion?
A: Yes. While our free tool is great for testing, production pipelines require automation. Our Pro plan offers API access for automated, high-volume conversion workflows.
Q: What if my chunks are still too large after semantic splitting?
A: If a section under a heading exceeds your token limit, apply a secondary character-based split with a significant overlap as a fallback.
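A fallback splitter of this kind can be quite small. The sketch below measures size in characters as a rough proxy for tokens; the function name and default values are illustrative assumptions, so tune them to your embedding model's actual limit.

```python
def split_with_overlap(text, max_chars=1000, overlap=200):
    """Split text into windows of `max_chars` that overlap by `overlap`."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if len(text) <= max_chars:
        return [text]

    step = max_chars - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + max_chars])
        # Stop once a window reaches the end, to avoid a redundant tail piece.
        if start + max_chars >= len(text):
            break
    return pieces
```

Apply it only to sections that exceed the limit after header-based splitting, so most chunks keep their natural boundaries.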
Ready to build a more accurate RAG pipeline? Convert your PDFs to Markdown today.