How to Scrape Webpages to Markdown (3 Methods)
You need to feed web data to an LLM, but raw HTML is bloated with tags and scripts that waste tokens and confuse the model.
The fastest way to get clean, AI-ready text is to use a dedicated converter. If you just need to convert a few pages manually, save the webpage as HTML (Ctrl+S) and drop it into our free HTML to Markdown converter.
But if you are building an automated pipeline, you need to scrape webpages directly to Markdown programmatically. Here are the three best methods to do it, ranging from Python libraries to fully managed APIs.
Method 1: Python Libraries (Trafilatura + Markdownify)
If you want full control and are comfortable writing Python, combining a scraping library with an HTML-to-Markdown converter is the standard approach.
Trafilatura is excellent for extracting the main body text from a webpage while ignoring navigation menus, footers, and ads. Once you have the clean HTML, you can use Markdownify to convert it.
import trafilatura
from markdownify import markdownify as md

# 1. Download the webpage (returns None on network failure)
url = 'https://example.com/article'
downloaded = trafilatura.fetch_url(url)

# 2. Extract the main content as HTML (returns None if extraction fails)
html_content = trafilatura.extract(downloaded, output_format='html')

# 3. Convert HTML to Markdown
if html_content:
    markdown_text = md(html_content)
    print(markdown_text)
Pros: Free, runs locally, highly customizable. Cons: Struggles with JavaScript-heavy sites (React/Vue) unless paired with a headless browser like Playwright or Selenium.
Method 2: Open-Source AI Crawlers (Crawl4AI)
If you need to handle dynamic content rendered by JavaScript, traditional scrapers fall short. Crawl4AI is an open-source tool designed specifically for LLM workflows. It handles JS rendering and outputs clean Markdown directly.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
Pros: Handles JavaScript, outputs Markdown natively, open-source. Cons: Requires setting up and managing a headless browser environment, which can be resource-intensive.
Method 3: Managed APIs (Jina Reader / Firecrawl)
For production environments where you don't want to manage infrastructure, proxies, or headless browsers, managed APIs are the best choice. Services like Jina Reader or Firecrawl turn any URL into Markdown with a simple HTTP request.
For example, with Jina Reader, you simply prepend https://r.jina.ai/ to your target URL:
curl https://r.jina.ai/https://example.com
Pros: Zero infrastructure, handles proxies and anti-bot measures, extremely reliable. Cons: Paid services at scale (though most offer generous free tiers).
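The same prepend-the-URL pattern works from Python with nothing but the standard library. A minimal sketch that only builds the request (the `User-Agent` string is an arbitrary example; uncomment the last line to actually fetch):

```python
from urllib.request import Request, urlopen

def reader_url(target: str) -> str:
    # Jina Reader returns the page as Markdown when the target URL is prefixed like this
    return "https://r.jina.ai/" + target

req = Request(reader_url("https://example.com"),
              headers={"User-Agent": "my-scraper/1.0"})
print(req.full_url)  # https://r.jina.ai/https://example.com

# Network call, returns Markdown text:
# markdown_text = urlopen(req).read().decode("utf-8")
```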
Edge Cases to Consider
When scraping to Markdown, you will inevitably run into a few edge cases:
- Paywalls and Logins: APIs and headless browsers hit the same paywalls a normal user would. To scrape authenticated content, pass your session cookies along with the request, or save the authenticated HTML manually (Ctrl+S while logged in) and convert it with a tool like file2markdown.ai.
- Complex Tables: Deeply nested HTML tables often break when converted to Markdown. You may need to write custom parsing logic or accept simplified table structures.
- Images: Markdown only stores image URLs, not the images themselves. If the source website goes down or blocks hotlinking, your images will break.
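For the cookie case, a minimal standard-library sketch. The cookie name and value here are hypothetical placeholders; copy the real ones from your browser's dev tools while logged in:

```python
from urllib.request import Request

# Hypothetical session cookie copied from your browser's dev tools
session_cookie = "sessionid=abc123"

req = Request(
    "https://example.com/members-only/article",
    headers={
        "Cookie": session_cookie,     # authenticates the request
        "User-Agent": "Mozilla/5.0",  # some sites reject Python's default UA
    },
)
# urlopen(req) would now fetch the page as the logged-in user
print(req.get_header("Cookie"))
```

Note that session cookies expire, so long-running pipelines need a way to refresh them.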
Frequently Asked Questions
Q: Why should I scrape to Markdown instead of JSON? A: JSON is great for structured data (like product prices), but Markdown is superior for long-form content (articles, documentation) because it preserves semantic structure (headings, lists) in a format LLMs understand natively. Read more about why LLMs prefer Markdown.
Q: Can I scrape an entire website to Markdown? A: Yes, but you will need a crawler to discover all the links on the site first, and then pass each URL to your Markdown conversion logic. Tools like Firecrawl offer this as a built-in feature.
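Link discovery is the only new piece. A minimal sketch using the standard library's HTMLParser to collect same-site links from a fetched page (fetching and recursion are left out; the Markdown conversion is whichever method you chose above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects absolute, same-domain links from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base_url, href)
                # Keep only links on the same domain as the start page
                if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
                    self.links.add(absolute)

collector = LinkCollector("https://example.com/docs/")
collector.feed('<a href="/docs/intro">Intro</a> <a href="https://other.site/">Off-site</a>')
print(sorted(collector.links))  # ['https://example.com/docs/intro']
```

Each discovered link goes into a queue; pop a URL, convert it, collect its links, and repeat until the queue is empty (keeping a visited set to avoid loops).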
Q: Is it legal to scrape webpages?
A: Generally, scraping public data is legal, but you should always respect the site's robots.txt file and terms of service. Avoid aggressive scraping that could overload the target server.
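Python's standard library can check robots.txt rules for you. Here the rules are parsed from inline lines for illustration; against a live site you would point `set_url` at its real robots.txt and call `read()`:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Inline rules for illustration; use rp.set_url(...) + rp.read() for a live site
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "https://example.com/article"))       # True
print(rp.can_fetch("MyScraper", "https://example.com/private/page"))  # False
```

Checking `can_fetch` before each request is cheap insurance against scraping pages the site has explicitly asked bots to skip.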
Need to convert files or saved HTML pages to Markdown without writing code? Try file2markdown.ai for free today. We support PDFs, Word docs, and more. If you hit our free tier limits, check out our Pro pricing for batch processing.