DevToolKit

PDF to Markdown — Extract Structured Text from PDFs

Convert PDF documents to Markdown with automatic heading detection, bold text recognition, and list formatting. Runs locally in your browser — no uploads.

How to Use

Convert any digital PDF to clean Markdown in three steps:

  1. Select your PDF: Drag and drop the file into the dropzone or click to browse. The tool reads the document locally in your browser to check that it is a valid PDF and to determine the page count. No data is sent to any server at any point in the process.
  2. Set a page range (optional): To convert specific pages, enter a range like 1-5 or individual pages like 1,3,7. Leave the field empty to convert all pages. This is useful for long documents where you only need certain chapters.
  3. Click "Convert to Markdown" and review: The tool processes each page with PDF.js, analyzing font sizes to detect headings, identifying bold text from font metadata, and recognizing list patterns. Toggle between raw Markdown and a rendered preview; on wide screens both views appear side by side. Copy the output to your clipboard or download it as a .md file.
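
The page range field in step 2 can be illustrated with a short sketch. This is not the tool's own code (the tool runs in the browser); it is a Python illustration, and the function name `parse_page_range` and its error handling are assumptions made for the example.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-5" or "1,3,7" into sorted 1-based page numbers."""
    spec = spec.strip()
    if not spec:
        # An empty field means "convert all pages".
        return list(range(1, page_count + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray comma such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumed policy: reject anything outside the document instead of clamping.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Under these assumptions, `parse_page_range("1,3,7", 10)` yields pages 1, 3, and 7, and overlapping ranges such as `"2-3,3-4"` are merged so no page is converted twice.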

The stats bar shows page count, character count, and word count — useful for content migration audits and word-limit compliance. Multi-page conversions include horizontal rule separators and page number markers so you can identify which content belongs to which page.
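
As a rough illustration of how per-page output might be joined and counted, here is a minimal Python sketch. The page marker format (an HTML comment) and the function names are assumptions; the tool's actual marker text is not specified above.

```python
def join_pages(pages: dict[int, str]) -> str:
    """Join per-page Markdown with a page marker and a horizontal rule between pages."""
    blocks = [
        f"<!-- Page {number} -->\n\n{text.strip()}"  # marker format is assumed
        for number, text in sorted(pages.items())
    ]
    return "\n\n---\n\n".join(blocks) + "\n"


def text_stats(markdown: str, page_count: int) -> dict[str, int]:
    """Return the three figures a stats bar could show."""
    return {
        "pages": page_count,
        "characters": len(markdown),
        "words": len(markdown.split()),  # whitespace-separated tokens
    }
```

Note that counting words on the joined output would also count the separators and markers as tokens, so a stricter audit would compute the statistics on the page text before joining.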

About This Tool

PDF to Markdown conversion bridges two fundamentally different document philosophies. PDF is a fixed-layout format designed for pixel-perfect visual reproduction — every character is positioned at exact coordinates on the page, with no inherent concept of paragraphs, headings, or semantic structure. Markdown, by contrast, is a lightweight markup language that uses plain-text formatting conventions to express document structure: headings with hash marks, emphasis with asterisks, and lists with dashes or numbers.

This tool reconstructs semantic structure from visual cues embedded in the PDF. The algorithm works in several stages. First, it extracts all text items from each page using Mozilla's PDF.js library, which parses PDF content streams and returns each text run along with its transformation matrix and font metadata. Each text item carries a 6-element affine transformation matrix encoding position, scale, and rotation.
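
To make the matrix concrete, the sketch below decomposes a 6-element matrix `[a, b, c, d, e, f]` into position, approximate font size, and rotation. It is written in Python for illustration only. Reading the font size from the vertical scale is an assumption about how a converter might use the matrix, and `Placement` and `decompose` are hypothetical names.

```python
import math
from typing import NamedTuple, Sequence


class Placement(NamedTuple):
    x: float
    y: float
    font_size: float
    rotation_degrees: float


def decompose(transform: Sequence[float]) -> Placement:
    """Split an affine matrix [a, b, c, d, e, f] into readable components."""
    a, b, c, d, e, f = transform
    # (e, f) is the translation: where the text run starts on the page.
    # (c, d) is where the unit vertical vector lands, so its length is the
    # vertical scale, used here as an approximation of the font size.
    font_size = math.hypot(c, d)
    # (a, b) is where the unit horizontal vector lands; its angle is the rotation.
    rotation = math.degrees(math.atan2(b, a))
    return Placement(e, f, font_size, rotation)
```

For an unrotated 12-point run placed at (72, 700), the matrix `[12, 0, 0, 12, 72, 700]` decomposes to a font size of 12 and a rotation of 0 degrees.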

Next, the tool analyzes font sizes across the entire page to identify the most common size — the body text baseline. Text rendered at larger sizes is mapped to heading levels: text at 2x the body size becomes H1, 1.7x becomes H2, 1.4x becomes H3, and so on down to H5 at 1.1x. This ratio-based approach adapts automatically to each document's typographic scale rather than using absolute font-size thresholds that would break across different PDF generators.
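
The ratio-based mapping can be sketched as follows. This is a Python illustration under stated assumptions, not the tool's source: the 2.0, 1.7, 1.4, and 1.1 thresholds come from the description above, while the 1.25 cut-off for H4, the character-count weighting, and the rounding of sizes to one decimal place are assumed.

```python
from collections import Counter

# (minimum ratio to body size, heading level). The H4 value is an assumption.
HEADING_THRESHOLDS = [(2.0, 1), (1.7, 2), (1.4, 3), (1.25, 4), (1.1, 5)]


def body_font_size(items):
    """Return the most common font size among (text, size) pairs.

    Sizes are weighted by character count so that long paragraphs outweigh
    short headings. Assumes at least one item.
    """
    weights = Counter()
    for text, size in items:
        weights[round(size, 1)] += len(text)
    return weights.most_common(1)[0][0]


def heading_level(size, body_size):
    """Return 1-5 for headings, or 0 for body text."""
    ratio = size / body_size
    for minimum, level in HEADING_THRESHOLDS:
        # Small tolerance so that ratios like 13.2 / 12 still count as 1.1.
        if ratio >= minimum - 1e-9:
            return level
    return 0


def to_markdown_line(text, size, body_size):
    """Prefix the text with hash marks when its size qualifies as a heading."""
    level = heading_level(size, body_size)
    return f"{'#' * level} {text}" if level else text
```

With a 12-point body size, a 24-point line becomes `# Title` and a 17-point line (ratio about 1.42) becomes an H3, regardless of whether another PDF generator would have used 10-point or 11-point body text.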

Bold text detection inspects the font name embedded in each text item. PDF fonts that include weight indicators — names containing "Bold", "Black", or "Heavy" — signal bold formatting. When the majority of characters on a line use a bold font, the entire line is wrapped in Markdown double-asterisk syntax. This heuristic works reliably for standard PDF generators but may miss bold text in documents that use custom font families without conventional naming.
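
A minimal Python sketch of this heuristic follows. It assumes each line has already been split into (text, font name) runs and that the real font name has been resolved for each run; the function names are hypothetical.

```python
BOLD_MARKERS = ("bold", "black", "heavy")


def is_bold_font(font_name: str) -> bool:
    """True when the font name carries a conventional weight indicator."""
    name = font_name.lower()
    return any(marker in name for marker in BOLD_MARKERS)


def format_line(runs) -> str:
    """Wrap the whole line in ** when most of its characters use a bold font.

    runs is a list of (text, font_name) pairs in reading order.
    """
    total = sum(len(text) for text, _ in runs)
    bold = sum(len(text) for text, font in runs if is_bold_font(font))
    line = "".join(text for text, _ in runs).strip()
    # Strict majority: exactly half bold is treated as not bold (assumed tie rule).
    if line and bold * 2 > total:
        return f"**{line}**"
    return line
```

Because the decision is made per line, a single bold word inside a regular sentence is left unmarked in this sketch, and a font named something like "Inter-SemiStrong" would be missed, which matches the limitation noted above.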

List detection recognizes common bullet characters (bullet points, dashes, asterisks, and arrows) as well as numbered patterns such as "1.", "2)", and "a." at the start of text lines. Detected list items are converted to Markdown list syntax: unordered bullets become "- item" and numbered items become "1. item". This covers the most common list styles but does not currently detect nested indentation levels.
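
The pattern matching described here can be sketched with two regular expressions. This is a Python illustration; the exact set of bullet characters and the choice to emit every numbered item as "1." are assumptions based on the description above.

```python
import re

# Bullet, hyphen, en dash, asterisk, or arrow, followed by whitespace.
BULLET = re.compile(r"^\s*[•\-–*→]\s+(.*)$")
# Digits or a single letter, then "." or ")", followed by whitespace.
NUMBERED = re.compile(r"^\s*(?:\d+|[A-Za-z])[.)]\s+(.*)$")


def to_list_item(line: str) -> str:
    """Rewrite a detected list line as Markdown; return other lines unchanged."""
    match = BULLET.match(line)
    if match:
        return f"- {match.group(1)}"
    match = NUMBERED.match(line)
    if match:
        # Markdown renderers renumber ordered lists, so "1." works for every item.
        return f"1. {match.group(1)}"
    return line
```

Requiring whitespace after the marker keeps lines such as "3.14 is pi" or "-5 degrees" from being treated as list items. A line starting with an initial, such as "A. Smith wrote", would still be a false positive in this sketch.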

Like all text-extraction-based tools, this converter works only with born-digital PDFs that contain actual text data in their content streams. Scanned PDFs, which are effectively images wrapped in a PDF container, produce no output because the visible text exists only as pixels. For plain text without formatting, the PDF to Text Extractor provides a simpler output. For structured data including positional metadata, try PDF to JSON. For tabular data extraction, see Extract Tables from PDF.

Why Use This Tool

Converting PDF documents to Markdown serves a wide range of professional and technical workflows where structured, portable, and editable content is needed:

  • Documentation migration — Move existing documentation from PDF archives into modern systems that use Markdown natively: GitHub wikis, Notion, Obsidian, Docusaurus, MkDocs, and Confluence. The heading hierarchy is preserved, so the document structure maps naturally to navigation sidebars and table-of-contents generators.
  • Content management systems — Import PDF content into CMS platforms that accept Markdown (Ghost, Hugo, Jekyll, Astro). The structured output with proper headings, lists, and emphasis reduces the manual reformatting work compared to pasting raw plain text.
  • Knowledge base construction — Convert policy documents, technical manuals, and reference guides into Markdown for integration into searchable knowledge bases. Markdown files are plain text and index easily with full-text search tools, unlike PDF files which require specialized parsing.
  • Version control — Transform static PDF reports into Markdown files that can be tracked in Git. Plain-text diffs show exactly what changed between document versions — a workflow impossible with binary PDF files. This is especially valuable for regulatory compliance documents and technical specifications that undergo frequent revision.
  • AI and LLM pipelines — Prepare PDF content for use with large language models. Markdown provides cleaner input than raw text extraction because the heading structure and list formatting give the model contextual signals about content hierarchy and document organization.
  • Academic writing — Extract structured content from research papers for use in literature reviews, annotated bibliographies, or reference management tools that support Markdown. The heading detection preserves section titles, making it easier to navigate extracted content.

Privacy is critical when processing sensitive documents. This tool runs entirely in your browser — your PDF is never uploaded to any server. This makes it suitable for confidential contracts, medical records, financial statements, and internal corporate documents. For related workflows, try PDF to Text for unformatted extraction, Word Counter to analyze the output, or JSON Formatter if you need to work with structured data from the converted content.

FAQ

How does heading detection work?
The tool analyzes font sizes to find the most common size (body text). Text rendered at larger sizes is mapped to Markdown heading levels (H1 to H5) based on the size ratio relative to body text, from 2x for H1 down to 1.1x for H5.
Does it detect bold text?
Yes. The tool inspects font names for bold indicators (Bold, Black, Heavy). Lines where the majority of characters use a bold font are wrapped in double asterisks (**bold**) in the output.
What list formats are recognized?
Bullet lists using common markers (bullet, dash, asterisk, arrow) and numbered lists using digit or letter prefixes such as "1.", "2)", and "a." are detected and converted to Markdown list syntax.
Are my PDF files uploaded to a server?
No. All processing runs entirely in your browser using PDF.js. Your documents never leave your device.
Can I convert specific pages only?
Yes. Use the page range field to specify pages (e.g., '1-5' or '2,4,6'). Leave it empty to convert all pages. Multi-page output includes horizontal rule separators between pages.