DevToolKit

PDF to CSV Converter

Extract tabular data from PDF documents and export as CSV. Detects rows and columns automatically using coordinate-based clustering — no uploads, fully private.

How to Use

Convert tables inside a PDF into structured CSV data in three steps:

  1. Select your PDF: Drag and drop a PDF file into the dropzone or click to browse your files. The tool reads the document locally in your browser to count pages, and no data is sent to any server.
  2. Configure extraction settings: Optionally enter a page range such as 1-5 or 2,4,6 to limit extraction to specific pages. Choose a delimiter: comma for standard CSV, tab for pasting into Excel or Google Sheets, or semicolon for locales that use the comma as a decimal separator.
  3. Extract and export: Click the extract button to begin processing. The tool analyzes text positions on each page to detect rows and columns, then builds a structured grid. Once complete, review the CSV output, copy it to your clipboard for pasting into a spreadsheet, or download it as a .csv file.

The results panel shows how many tables were detected, which pages they came from, and the row and column counts for each. This helps you verify the extraction captured the right data before downloading.
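
The page range field from step 2 can be illustrated with a small parser. This is a minimal sketch, not the tool's actual code: the function name parse_page_range, the handling of mixed input such as 1-3,7, and the clamping of out-of-range pages are all assumptions made for the example.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as '1-5' or '2,4,6' into sorted 1-based page numbers."""
    if not spec.strip():
        # An empty field means "extract from all pages".
        return list(range(1, page_count + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start > end:
            # Assumption: a reversed range like 5-1 is treated as 1-5.
            start, end = end, start
        # Assumption: pages outside the document are silently ignored.
        for page in range(max(start, 1), min(end, page_count) + 1):
            pages.add(page)
    return sorted(pages)
```

For a 10-page document, parse_page_range("1-5", 10) yields pages 1 through 5, and an empty string yields every page.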

About This Tool

PDF documents store text as individually positioned glyphs, not as structured tables. When a table is rendered in a PDF, each cell's text is placed at exact coordinates, but the table structure itself (rows, columns, borders) exists only as visual lines and spacing. Tagged PDFs can carry semantic table elements, but many PDFs in circulation are untagged, so in practice there is usually no table structure that a tool can simply read. Extracting tabular data from a PDF therefore requires reconstructing the table structure from positional clues.

This tool uses Mozilla's PDF.js library to extract every text item on a page along with its transformation matrix. Each item's matrix encodes its x and y position, scale, and rotation. The extraction algorithm then applies a three-stage pipeline to reconstruct table structure from these coordinates.
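
To make the matrix concrete: a PDF transformation matrix is six numbers [a, b, c, d, e, f], where e and f are the translation (the item's x and y position) and a, b, c, d carry scale and rotation. The sketch below decodes those values in Python for illustration only; the tool itself runs in the browser, and the names Placement and decode_transform are hypothetical. It assumes the matrix is in PDF page coordinates, where y increases upward from the bottom of the page.

```python
import math
from typing import NamedTuple, Sequence


class Placement(NamedTuple):
    x: float             # horizontal position in points
    y: float             # vertical position in points (origin at page bottom)
    font_height: float   # vertical scale, roughly the rendered font size
    rotation_deg: float  # counter-clockwise rotation of the text baseline


def decode_transform(transform: Sequence[float]) -> Placement:
    """Split a 6-number affine matrix [a, b, c, d, e, f] into readable parts."""
    a, b, c, d, e, f = transform
    return Placement(
        x=e,
        y=f,
        # Length of the matrix's second column vector gives the vertical scale.
        font_height=math.hypot(c, d),
        # Angle of the first column vector gives the baseline rotation.
        rotation_deg=math.degrees(math.atan2(b, a)),
    )
```

Upright 12-point text placed one inch from the left edge would have a matrix like [12, 0, 0, 12, 72, 700], which decodes to x = 72, y = 700, a height of 12, and no rotation.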

In the first stage, text items are clustered into rows using Y-coordinate proximity. Items whose vertical positions fall within approximately 3 points (about 1mm) of each other are grouped into the same row. This threshold accounts for minor baseline variations that PDF generators introduce when rendering table cells. The items within each row are then sorted by their X-coordinate to establish left-to-right reading order.
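
The row-clustering stage might look roughly like the following. This is a sketch under stated assumptions rather than the tool's implementation: the TextItem type and cluster_rows function are hypothetical, each item is compared against the first item of the current row, and rows are ordered top to bottom on the assumption that y grows upward.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float
    y: float


def cluster_rows(items: list[TextItem], tolerance: float = 3.0) -> list[list[TextItem]]:
    """Group items whose y positions lie within `tolerance` points into rows."""
    rows: list[list[TextItem]] = []
    # Visit items from the top of the page down (largest y first).
    for item in sorted(items, key=lambda it: -it.y):
        # Compare against the row's first item so a row cannot drift downward.
        if rows and abs(rows[-1][0].y - item.y) <= tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Establish left-to-right reading order inside each row.
    for row in rows:
        row.sort(key=lambda it: it.x)
    return rows
```

Two items at y = 701 and y = 700.5 end up in the same row, while an item at y = 680 starts a new one.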

The second stage detects column boundaries. The algorithm collects all X-start positions across every row and clusters them into groups of positions that recur frequently. A cluster that appears in at least 20% of rows is treated as a column boundary. This statistical approach works with both bordered and borderless tables — the column structure emerges from the consistency of text alignment rather than from visible gridlines. The median X-position within each cluster becomes the column's left edge.
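
A simplified version of the column-detection stage is sketched below. The 20% share and the use of the median come from the description above; the function name detect_columns, the 5-point clustering tolerance, and the gap-based clustering method are assumptions chosen for the example.

```python
from statistics import median


def detect_columns(
    rows_x: list[list[float]],
    min_share: float = 0.2,
    tolerance: float = 5.0,
) -> list[float]:
    """Return left edges of columns given the X-start positions of each row."""
    # Tag every X-start with its row so support can be counted per row.
    tagged = sorted((x, row_index) for row_index, xs in enumerate(rows_x) for x in xs)

    # Gap-based clustering: a new cluster starts when the jump from the
    # previous position exceeds the tolerance (an assumed value).
    clusters: list[list[tuple[float, int]]] = []
    for x, row_index in tagged:
        if clusters and x - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((x, row_index))
        else:
            clusters.append([(x, row_index)])

    edges = []
    for cluster in clusters:
        rows_hit = {row_index for _, row_index in cluster}
        # Keep clusters that appear in at least `min_share` of all rows.
        if len(rows_hit) >= min_share * len(rows_x):
            edges.append(median(x for x, _ in cluster))
    return edges
```

With six rows, a stray position that appears in only one row (about 17%) falls below the threshold and is not treated as a column.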

In the third stage, each text item is assigned to its nearest column boundary, and the values are assembled into a 2D grid. The grid is then serialized using the quoting rules of RFC 4180: values containing the delimiter, double quotes, or newlines are wrapped in double quotes, and existing double quotes are escaped by doubling them. These rules let the output import cleanly into most spreadsheet applications. For structured extraction of complex layouts, see also the Extract Tables from PDF tool or PDF to Text for full-page text output.
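
The final stage can be approximated with Python's standard csv module, which applies the same quoting rules. This is an illustrative sketch: build_grid and to_csv are hypothetical names, and joining multiple fragments in one cell with a space is an assumption about how collisions are handled.

```python
import csv
import io


def build_grid(rows: list[list[tuple[float, str]]], column_edges: list[float]) -> list[list[str]]:
    """Place each (x, text) pair into the column whose left edge is nearest."""
    grid = []
    for row in rows:
        cells = [""] * len(column_edges)
        for x, text in row:
            nearest = min(range(len(column_edges)), key=lambda i: abs(column_edges[i] - x))
            # Assumption: fragments landing in the same cell are joined by a space.
            cells[nearest] = f"{cells[nearest]} {text}".strip()
        grid.append(cells)
    return grid


def to_csv(grid: list[list[str]], delimiter: str = ",") -> str:
    """Serialize the grid with quoting only where needed and CRLF line endings."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\r\n",
    )
    writer.writerows(grid)
    return buffer.getvalue()
```

A cell such as Widget, 2" bolt contains both a comma and a double quote, so it is written as "Widget, 2"" bolt" in the output. Passing delimiter="\t" or delimiter=";" produces the tab and semicolon variants.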

Why Use This Tool

Converting PDF tables to CSV unlocks data that would otherwise be trapped in a format designed for visual display rather than data processing:

  • Financial analysis — Extract transaction tables from bank statements, invoices, and financial reports into spreadsheets for reconciliation, budgeting, and trend analysis. Many financial institutions deliver statements exclusively as PDF files, making extraction the only practical path to structured data.
  • Research data collection — Pull statistical tables from academic papers, government reports, and industry surveys into CSV for analysis in R, Python, or Excel. Researchers working with meta-analyses often need to aggregate data from dozens of PDF publications.
  • Supply chain and procurement — Convert product catalogs, price lists, and inventory reports from PDF to CSV for import into ERP systems, comparison spreadsheets, or procurement databases. This eliminates manual data entry that is both time-consuming and error-prone.
  • Regulatory compliance — Extract compliance checklists, audit findings, and inspection reports from PDF documents into structured formats for tracking and reporting. CSV output integrates directly with compliance management platforms and dashboards.
  • Data migration — When legacy systems export data only as PDF reports, CSV extraction provides a bridge to import that data into modern databases, CRM systems, or cloud platforms. This is common during system transitions where historical data exists only in archived PDF reports.

Privacy is critical when processing documents containing financial data, personally identifiable information, or trade secrets. This tool runs entirely in your browser — the PDF is never transmitted to any server. For related workflows, use PDF to Text to extract unstructured content, PDF to JSON for positional metadata, or Word Counter to analyze extracted text length.

FAQ

How does the tool detect tables in a PDF?
It extracts text items with their page coordinates using PDF.js, clusters them into rows by Y-coordinate proximity (within 3 points), then detects column boundaries from consistent X-position patterns across rows.
Does this work with scanned PDFs?
No. This tool extracts embedded text data from digital PDFs. Scanned PDFs without a text layer contain only page images, so they must first be run through OCR (Optical Character Recognition) to convert the visual text into machine-readable characters.
Can I choose a different delimiter?
Yes. You can select comma (CSV), tab (TSV), or semicolon as the delimiter. Tab-separated output is useful for pasting directly into Excel or Google Sheets.
Is my PDF uploaded to a server?
No. All table detection and CSV extraction run entirely in your browser using pdfjs-dist. Your file never leaves your device.
Can I extract tables from specific pages?
Yes. Use the page range field to specify pages like '1-5' or '2,4,6'. Leave it empty to extract from all pages.