HTML to Markdown with Python: markdownify, html2text, and More

July 3, 2026

Whether you're building a web scraper, preprocessing documents for an LLM pipeline, or migrating legacy content, converting HTML to Markdown in Python is a common and important task. file2markdown makes this easy via a free web UI and API, but sometimes you need to do it locally inside your own Python code. This guide covers the three main libraries, with working code examples for each, and explains when to reach for each option.

Why Convert HTML to Markdown in Python?

HTML is designed for browsers. Markdown is designed for people and AI models. When you pull content from a webpage or document, the raw HTML is packed with <div> tags, class attributes, and inline styles that add noise and consume tokens without adding meaning.

Converting to Markdown first:

  • Reduces token usage — clean Markdown typically uses 60–80% fewer tokens than equivalent HTML
  • Improves LLM accuracy — models reason better over structured Markdown than over HTML soup
  • Simplifies text processing — headings, lists, and tables become predictable plain text
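To check how much of a given page is markup rather than content, you can compare the size of the raw HTML with the size of its visible text. The sketch below uses only the standard library. `markup_overhead` is a hypothetical helper name, and character counts are only a rough stand-in for token counts, so measure with your model's tokenizer before relying on any figure.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def markup_overhead(html: str) -> float:
    """Return the fraction of characters in `html` that are not visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return 1 - len(text) / len(html) if html else 0.0


sample = '<div class="card"><p style="margin:0">Hello <b>world</b></p></div>'
print(f"{markup_overhead(sample):.0%} of this snippet is markup")
```

Running this over a handful of representative pages gives a quick sense of whether conversion is likely to pay off for your content.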

For a deeper look at why this matters for AI applications, see why LLMs prefer Markdown and our guide to converting documents to Markdown for LLMs.

The 3 Main Python Libraries

1. markdownify

markdownify is the most widely used Python library for this task. It converts an HTML string into Markdown, preserving headings, links, bold/italic, lists, and tables.

Install:

pip install markdownify

Basic usage:

from markdownify import markdownify as md

html = """
<h1>Getting Started</h1>
<p>This is a <strong>quick</strong> example.</p>
<ul>
  <li>Item one</li>
  <li>Item two</li>
</ul>
<a href="https://example.com">Visit us</a>
"""

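# markdownify underlines h1/h2 headings (Setext style) by default;
# heading_style="ATX" produces the "#" headings shown below
result = md(html, heading_style="ATX")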
print(result)

Output:

# Getting Started

This is a **quick** example.

* Item one
* Item two

[Visit us](https://example.com)

Handling tables and stripping tags:

from markdownify import markdownify as md

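# `strip` drops a tag's markup but keeps any text inside it, so it suits tags
# such as <img> or <a>; remove <script>/<style> elements with BeautifulSoup instead.
# (`html` is the sample string from the previous example.)
result = md(html, strip=["img", "a"])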

# Convert tables to Markdown tables (default behaviour)
table_html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>95</td></tr>
</table>
"""
print(md(table_html))

markdownify is a solid default choice for most web scraping and content-processing pipelines.

2. html2text

html2text is another mature library. It takes a different approach: instead of building a DOM tree and mapping each element to Markdown, it streams through the HTML with a parser and emits text as it goes. The output tends to be plainer and more readable as raw text.

Install:

pip install html2text

Basic usage:

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False  # keep hyperlinks
converter.body_width = 0        # no line wrapping

html = "<p>Hello <b>world</b>. <a href='https://example.com'>Click here</a>.</p>"
result = converter.handle(html)
print(result)
# Hello **world**. [Click here](https://example.com).

Useful options:

converter.ignore_images = True   # skip <img> tags
converter.ignore_tables = False  # convert tables
converter.mark_code = True       # wrap <pre> code blocks in [code]...[/code] markers

html2text is particularly good when you're processing a lot of text-heavy pages and want reliable, readable output with minimal configuration.
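For batch jobs, one malformed page should not abort the whole run. The following is a minimal sketch of that idea, not part of html2text itself: `convert_pages` and `html2text_convert` are hypothetical helper names, and the converter is passed in as a plain callable so the same loop works with any of the libraries in this guide.

```python
from typing import Callable


def html2text_convert(html: str) -> str:
    """Convert one page with a fresh HTML2Text instance (requires html2text)."""
    import html2text  # imported lazily so the rest of the sketch loads without it

    converter = html2text.HTML2Text()
    converter.body_width = 0  # no line wrapping
    return converter.handle(html)


def convert_pages(
    pages: dict[str, str],
    convert: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert every page; return (markdown_by_name, error_by_name)."""
    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for name, html in pages.items():
        try:
            converted[name] = convert(html).strip()
        except Exception as exc:  # keep going and report the failure instead
            failed[name] = f"{type(exc).__name__}: {exc}"
    return converted, failed


# Usage (assumes html2text is installed):
# docs, errors = convert_pages({"home": "<p>Hi</p>"}, html2text_convert)
```

Building a new converter per page costs a little time but avoids carrying any state from one document into the next.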

3. html-to-markdown

The html-to-markdown library is a newer, high-performance fork of markdownify. It aims for better spec compliance, faster processing, and more predictable output on complex HTML structures.

Install:

pip install html-to-markdown

Basic usage:

import html_to_markdown

html = "<h2>Results</h2><p>See the <em>table</em> below.</p>"
result = html_to_markdown.convert(html)
print(result)
# ## Results
#
# See the *table* below.

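If you are processing large volumes of pages or have had issues with markdownify on edge cases (nested tables, unusual tag combinations), html-to-markdown is worth evaluating as an alternative. Its entry point differs from markdownify's, so expect to change the import and the call site when switching.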
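If you want to try html-to-markdown without making it a hard dependency, one option is to pick whichever library is installed at runtime. This is a sketch under assumptions: `first_available`, `PREFERRED`, and the two loader functions are hypothetical names, and the `AttributeError` guard covers the case where an installed release does not expose `convert`.

```python
from typing import Callable

Converter = Callable[[str], str]


def _load_html_to_markdown() -> Converter:
    import html_to_markdown  # third-party

    return html_to_markdown.convert


def _load_markdownify() -> Converter:
    from markdownify import markdownify  # third-party

    return lambda html: markdownify(html, heading_style="ATX")


def first_available(
    loaders: list[tuple[str, Callable[[], Converter]]],
) -> tuple[str, Converter]:
    """Return (name, converter) for the first loader that imports cleanly."""
    for name, load in loaders:
        try:
            return name, load()
        except (ImportError, AttributeError):
            continue
    raise ImportError("no HTML-to-Markdown library is installed")


# Order expresses preference: try the faster library first.
PREFERRED = [
    ("html-to-markdown", _load_html_to_markdown),
    ("markdownify", _load_markdownify),
]

# Usage:
# name, convert = first_available(PREFERRED)
# markdown = convert("<h2>Results</h2>")
```

Because the two libraries can format the same HTML differently, it is worth logging which one was selected so that output differences are easy to trace.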

Comparison Table

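| Library          | Install                      | Speed     | Table support | Best for                          |
|------------------|------------------------------|-----------|---------------|-----------------------------------|
| markdownify      | pip install markdownify      | Good      | Yes           | General purpose, most widely used |
| html2text        | pip install html2text        | Good      | Yes           | Text-heavy pages, plain output    |
| html-to-markdown | pip install html-to-markdown | Excellent | Yes           | Performance-critical pipelines    |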

All three are free and open source. For most projects, start with markdownify; switch to html-to-markdown if you hit performance or edge-case issues.
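Speed ratings like the ones in the table depend heavily on the HTML you feed in, so it can be worth timing the candidates on your own pages. Here is a small sketch using the standard-library `timeit` module; `benchmark` is a hypothetical helper, and the `repeat` and `number` defaults are arbitrary starting points.

```python
import timeit
from typing import Callable


def benchmark(
    converters: dict[str, Callable[[str], str]],
    html: str,
    repeat: int = 5,
    number: int = 20,
) -> dict[str, float]:
    """Return the best per-call time in seconds for each converter."""
    results: dict[str, float] = {}
    for name, convert in converters.items():
        timings = timeit.repeat(lambda: convert(html), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


# Usage (assumes the libraries are installed and `page_html` holds a real page):
# from markdownify import markdownify
# import html2text
# print(benchmark({"markdownify": markdownify, "html2text": html2text.html2text}, page_html))
```

Taking the minimum of several repeats reduces the influence of background noise on the machine. Compare the outputs as well as the timings, since the fastest library is only useful if its Markdown suits your pipeline.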

A Practical Pipeline Pattern

Here is a reusable function that fetches a URL, strips boilerplate, and returns clean Markdown — a common pattern for RAG data ingestion:

import requests
from markdownify import markdownify as md
from bs4 import BeautifulSoup

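def url_to_markdown(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of converting an error page
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove navigation, footers, sidebars, scripts, and styles
    for tag in soup.select("nav, footer, aside, script, style"):
        tag.decompose()

    # Prefer the main content container; fall back to the whole document
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return md(str(main), heading_style="ATX")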

For more on this pattern in the context of LLM workflows, see the guide to building RAG pipelines with Markdown.
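Converted pages often carry stray trailing spaces and long runs of blank lines left over from removed elements. A light post-processing pass can tidy this up before storage. The sketch below is one possible approach, with `tidy_markdown` as a hypothetical helper name. It treats every line the same way, so it also removes the two trailing spaces Markdown uses for hard line breaks and collapses blank lines inside code blocks; skip it if your content relies on either.

```python
import re


def tidy_markdown(markdown: str) -> str:
    """Trim trailing spaces and collapse runs of blank lines to a single one."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip() + "\n"


# Usage:
# clean = tidy_markdown(url_to_markdown("https://example.com"))
```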

When to Use the file2markdown API Instead

The local libraries work well for HTML content you fetch programmatically. But if you are dealing with:

  • PDF, DOCX, XLSX, PPTX, or images alongside HTML
  • High-volume batch conversion where you don't want to manage dependencies
  • Scanned documents that require OCR

…then the file2markdown API is the better fit. It handles all major file types, returns clean Markdown via a single REST endpoint, and requires no libraries to install. You can also use the free online converter for quick one-off jobs without writing any code.

For a broader look at converting HTML without any Python setup, see HTML to Markdown Converter: The Best Way to Convert HTML to MD.

Frequently Asked Questions

Which Python library is best for converting HTML to Markdown?

markdownify is the best starting point for most projects — it is well-maintained, widely used, and handles common HTML structures reliably. If you need higher throughput or encounter edge cases, html-to-markdown is a fast, modern alternative.

How do I convert an entire webpage to Markdown in Python?

Fetch the page with requests, parse it with BeautifulSoup to remove navigation and boilerplate, then pass the main content HTML to markdownify or html2text. The pipeline pattern above shows a working example.

Does markdownify support Markdown tables?

Yes. markdownify converts HTML <table> elements into Markdown pipe-style tables by default. html2text and html-to-markdown also support tables, though the output format may vary slightly.

Can I use these libraries in a RAG pipeline?

Yes — that is one of the primary use cases. Converting HTML to Markdown before chunking reduces noise, lowers token counts, and makes retrieval more precise. See chunking Markdown for vector databases for the next step after conversion.
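As an illustration of the chunking step, here is a minimal sketch that splits converted Markdown at each ATX heading. It assumes the converter was run with `heading_style="ATX"` so that headings start with `#`; `chunk_by_heading` is a hypothetical helper, and real pipelines usually add size limits and overlap on top of this.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes fenced code blocks


def chunk_by_heading(markdown: str) -> list[dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading."""
    chunks: list[dict[str, str]] = []
    title: str = ""
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Store the current chunk unless it is completely empty
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside code fences are never treated as headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, which can be stored as metadata alongside the embedding to give retrieved passages some context.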
