Extract Tables from PDF

Extract tabular data from PDF documents using text coordinate analysis. Detects rows and columns by clustering text positions, then exports as CSV. No upload required.


How to Use

Extract tabular data from any PDF document in three steps:

  1. Upload your PDF — Drag and drop a PDF file onto the upload area, or click to browse your files. The tool immediately scans every page for text content with positional metadata, analyzing X and Y coordinates to identify tabular structures. Processing time scales linearly with page count — a 50-page document typically completes within 2-4 seconds.
  2. Navigate between pages — Use the page navigation controls to browse through multi-page documents. Each page is analyzed independently, and the tool reports how many tables were detected on the current page. Pages with no tabular data display a clear message explaining the minimum requirements (at least 2 rows and 2 columns of selectable text).
  3. Export your data — For each detected table, click the copy icon to place CSV-formatted data on your clipboard for direct pasting into spreadsheets, or click the download button to save a .csv file. The filename includes the page number and table index for easy organization when extracting from multi-page documents.

The extraction uses pdfjs-dist (Mozilla's PDF rendering library) running entirely in your browser. No server receives your document at any point during the process.
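The first step is flattening each page into positioned text. The sketch below assumes the item shape pdfjs-dist returns from `page.getTextContent()`: a glyph string plus a six-element transform matrix whose last two entries are the X and Y translation. The helper name `collectTextPositions` is illustrative, not part of any library:

```typescript
// Minimal shape of a pdfjs-dist text item: the glyph string plus its
// transform matrix [a, b, c, d, e, f], where e/f give the X/Y position.
interface TextItem {
  str: string;
  transform: number[];
}

interface PositionedText {
  text: string;
  x: number;
  y: number;
}

// Flatten a page's text content into (text, x, y) tuples, dropping
// whitespace-only items that carry no tabular information.
function collectTextPositions(items: TextItem[]): PositionedText[] {
  return items
    .filter((item) => item.str.trim().length > 0)
    .map((item) => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
    }));
}
```

With the real library, `items` would come from `(await page.getTextContent()).items` after loading the document via `pdfjsLib.getDocument(...)`.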

About This Tool

PDF documents store text as individually positioned glyphs rather than structured data like HTML tables or spreadsheet cells. A PDF file contains instructions like "draw the character 'A' at position (72, 540)" — there is no concept of rows, columns, or cell boundaries in the file format itself. This fundamental design characteristic makes table extraction from PDFs a coordinate geometry problem rather than a simple parsing task.

This tool implements the stream algorithm, one of two primary approaches to PDF table extraction. The stream algorithm works by analyzing text position coordinates without relying on visible grid lines or borders. It operates in three stages:

  1. Row clustering — Text items are grouped by their Y-coordinate proximity. Items within 3 PDF points of the same vertical position are merged into a single logical row. Rows are then sorted from top to bottom (PDF uses a bottom-left coordinate origin, so higher Y values appear at the top of the page).
  2. Column detection — Across all identified rows, the algorithm collects every text item's X starting position and clusters them into vertical alignment groups. Gaps of 8 or more points between X-position clusters indicate column boundaries. This produces a set of representative X positions defining the table's column structure.
  3. Cell assignment — Each text item in every row is assigned to the nearest detected column based on X-coordinate proximity. When multiple text fragments map to the same cell, they are concatenated with spaces. Empty cells are preserved to maintain the rectangular grid structure.
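The three stages above can be sketched in TypeScript. The 3-point row tolerance and 8-point column gap are taken from the description; the function names and data shapes are illustrative, not the tool's actual implementation:

```typescript
interface Item { text: string; x: number; y: number; }

const ROW_TOLERANCE = 3; // points: items this close in Y share a row
const COLUMN_GAP = 8;    // points: X gaps this wide start a new column

// Stage 1: group items into rows by Y proximity, top of page first.
// PDF's origin is bottom-left, so larger Y means higher on the page.
function clusterRows(items: Item[]): Item[][] {
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: Item[][] = [];
  for (const item of sorted) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row[0].y - item.y) <= ROW_TOLERANCE) {
      row.push(item);
    } else {
      rows.push([item]);
    }
  }
  return rows;
}

// Stage 2: collect every X start position across all rows and chain
// nearby positions; a gap of COLUMN_GAP or more begins a new column.
// Each column is represented by the first X of its chain.
function detectColumns(rows: Item[][]): number[] {
  const xs = [...new Set(rows.flat().map((i) => i.x))].sort((a, b) => a - b);
  const columns: number[] = [];
  let prev = Number.NEGATIVE_INFINITY;
  for (const x of xs) {
    if (x - prev >= COLUMN_GAP) columns.push(x);
    prev = x;
  }
  return columns;
}

// Stage 3: assign each item to its nearest column; fragments landing in
// the same cell are joined with spaces, and empty cells stay empty.
function buildGrid(rows: Item[][], columns: number[]): string[][] {
  return rows.map((row) => {
    const cells = columns.map(() => "");
    for (const item of [...row].sort((a, b) => a.x - b.x)) {
      let best = 0;
      columns.forEach((cx, i) => {
        if (Math.abs(item.x - cx) < Math.abs(item.x - columns[best])) best = i;
      });
      cells[best] = cells[best] ? cells[best] + " " + item.text : item.text;
    }
    return cells;
  });
}
```

Feeding the output of each stage into the next turns a flat list of positioned text items into a rectangular grid of cell strings ready for CSV export.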

The alternative approach, the lattice algorithm, detects visible grid lines by extracting vector drawing commands (lines and rectangles) from the PDF's content stream. While lattice detection is more accurate for bordered tables, it fails entirely on borderless tables — which account for a significant portion of tables in financial reports, academic papers, and government documents. The stream algorithm handles both bordered and borderless tables by relying solely on text alignment patterns.

CSV output follows RFC 4180 formatting conventions: fields containing commas, double quotes, or newlines are enclosed in double quotes, and embedded double quotes are escaped by doubling them. This ensures clean import into Microsoft Excel, Google Sheets, LibreOffice Calc, and programmatic CSV parsers.
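The quoting rule can be expressed as a small helper (function names here are illustrative):

```typescript
// Quote a field per RFC 4180: wrap it in double quotes when it contains
// a comma, double quote, or newline, and double any embedded quotes.
function escapeCsvField(field: string): string {
  if (/[",\r\n]/.test(field)) {
    return '"' + field.replace(/"/g, '""') + '"';
  }
  return field;
}

// Join a 2-D grid of cells into CSV text, one record per CRLF-ended line
// (RFC 4180 specifies CRLF as the record separator).
function toCsv(grid: string[][]): string {
  return grid.map((row) => row.map(escapeCsvField).join(",")).join("\r\n");
}
```

Escaping only the fields that need it keeps simple numeric tables readable while still round-tripping values like `1,200` or quoted notes safely.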

Why Use This Tool

Extracting tables from PDFs addresses a widespread data accessibility challenge across multiple industries:

  • Financial analysis — Annual reports, earnings statements, and SEC filings publish financial data as PDF tables. Analysts need this data in spreadsheets for modeling, ratio calculations, and trend analysis. Manual re-entry of a 50-row balance sheet introduces transcription errors and wastes hours of analyst time.
  • Academic research — Research papers embed experimental results, survey data, and statistical tables in PDF format. Researchers performing meta-analyses or literature reviews need structured data for aggregation across dozens of source papers.
  • Government and regulatory data — Census data, import/export statistics, environmental monitoring reports, and legislative records are published as PDFs. Policy analysts and journalists extract this data to build datasets for public interest reporting and policy evaluation.
  • Invoice processing — Accounts payable teams receive invoices as PDFs containing line-item tables with product descriptions, quantities, unit prices, and totals. Extracting these tables feeds automated reconciliation workflows without manual data entry.
  • Supply chain documentation — Bills of lading, packing lists, and customs declarations contain tabular shipment data. Logistics teams extract this data to update inventory management and tracking systems.

This tool processes everything locally in your browser using pdfjs-dist. No server receives your document, making it suitable for confidential financial data, privileged legal documents, and any file where privacy is essential.

FAQ

How does the table detection algorithm work?
The tool uses coordinate-based text clustering rather than relying on visible grid lines. It extracts every text item from a PDF page along with its X and Y coordinates from the transformation matrix. Text items at similar Y positions (within 3 points) are grouped into rows, then sorted by X position within each row. Column boundaries are inferred by detecting consistent vertical alignment gaps across multiple rows. This approach works on both bordered and borderless tables.
Can this tool extract tables from scanned PDFs?
No. This tool operates on text-based PDFs where characters are encoded as selectable text objects. Scanned PDFs store pages as raster images without text layer data. If your PDF was created by scanning a paper document and no OCR was applied, there are no text coordinates to analyze. You would need an OCR tool to convert the image to text first.
What output formats are available for extracted tables?
Extracted tables can be copied to clipboard as CSV (comma-separated values) or downloaded as a .csv file. CSV is a universal format that opens directly in Microsoft Excel, Google Sheets, LibreOffice Calc, and any text editor. From CSV you can easily convert to XLSX, JSON, or other structured formats using companion tools.
Is my PDF uploaded to a server during extraction?
No. The entire extraction process runs locally in your browser using pdfjs-dist, Mozilla's open-source PDF rendering library. Your document is loaded into browser memory, processed client-side via JavaScript, and never transmitted to any server. This makes the tool safe for confidential documents including financial statements, contracts, and medical records.
Why are some columns merged or split incorrectly?
Column detection depends on consistent vertical alignment of text across rows. If a PDF uses variable spacing, merged cells, or multi-line cell content, the algorithm may misidentify column boundaries. Tables with irregular layouts, nested tables, or spanning cells are inherently ambiguous without visible grid lines. For best results, use PDFs generated from structured data sources like spreadsheets or databases rather than manually formatted documents.