PDF to JSON Converter

Extract text, coordinates, fonts, and metadata from PDF files into structured JSON. Client-side processing with pdfjs-dist — your files never leave the browser.

How to Use

Extract structured data from any PDF document in three steps:

Upload your PDF — Drag and drop a PDF file onto the upload area, or click to browse your local files. The tool reads the file directly in your browser using the pdfjs-dist library. No data is sent to any server.
Review and configure output — After extraction, you will see a stats bar showing total pages, text items extracted, and the JSON output size. Click "Options" to toggle coordinate data (X, Y positions and width), font information (font name and size), or switch between pretty-printed and minified JSON. Click "Re-extract with new options" to regenerate the output.
Copy or download — Use the clipboard icon in the output area to copy the JSON to your clipboard, or click "Download JSON" to save it as a .json file. The filename is automatically derived from your original PDF filename.
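
The output options in steps 2 and 3 can be pictured roughly as follows. This is only a sketch: the option names (includeCoordinates, includeFonts, pretty), the item field names, and the helper functions are hypothetical and chosen for illustration, since the tool's internal code is not shown here.

```typescript
// Hypothetical option names; the tool's actual settings object is not documented.
interface OutputOptions {
  includeCoordinates: boolean;
  includeFonts: boolean;
  pretty: boolean;
}

// Hypothetical shape of one extracted text item.
interface ExtractedItem {
  text: string;
  x: number;
  y: number;
  width: number;
  fontName: string;
  fontSize: number;
}

// Keep the text and drop whichever field groups the user switched off.
function applyOptions(item: ExtractedItem, opts: OutputOptions): Record<string, string | number> {
  const out: Record<string, string | number> = { text: item.text };
  if (opts.includeCoordinates) {
    out["x"] = item.x;
    out["y"] = item.y;
    out["width"] = item.width;
  }
  if (opts.includeFonts) {
    out["fontName"] = item.fontName;
    out["fontSize"] = item.fontSize;
  }
  return out;
}

// Pretty-print with two-space indentation, or emit compact (minified) JSON.
function serialize(data: unknown, opts: OutputOptions): string {
  return opts.pretty ? JSON.stringify(data, null, 2) : JSON.stringify(data);
}

// "report.pdf" becomes "report.json"; a name without a .pdf suffix just gets ".json" appended.
function jsonFilename(pdfName: string): string {
  return pdfName.replace(/\.pdf$/i, "") + ".json";
}
```

Turning both toggles off leaves only the text of each item, which is usually the smallest useful output.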

The extraction engine processes every page sequentially, reading the text content layer that PDF creators embed when generating documents from word processors, design tools, or HTML renderers. Each text run is captured with its precise position, dimensions, and typography data, then grouped into logical reading lines.
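
The sequential page loop might look something like the sketch below. To keep the example self-contained it does not import pdfjs-dist. Instead, DocLike and PageLike are minimal stand-in interfaces modeled on the numPages, getPage, getViewport, and getTextContent members of the pdf.js document and page objects. The function name, the RawPage shape, and the choice to skip whitespace-only items are assumptions, not the tool's confirmed behavior.

```typescript
// Stand-in types modeled on the pdf.js document and page proxies (assumption).
interface TextItemLike {
  str: string;
  transform: number[];
  width: number;
  fontName: string;
}
interface PageLike {
  getViewport(params: { scale: number }): { width: number; height: number };
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical intermediate result for one page.
interface RawPage {
  number: number;
  width: number;
  height: number;
  items: TextItemLike[];
}

// Visit pages one at a time; pdf.js page numbers start at 1.
async function extractPages(doc: DocLike): Promise<RawPage[]> {
  const pages: RawPage[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    // At scale 1 the viewport dimensions are in PDF points.
    const { width, height } = page.getViewport({ scale: 1 });
    const content = await page.getTextContent();
    // Assumption: whitespace-only runs carry no useful text and are dropped.
    const items = content.items.filter((it) => it.str.trim().length > 0);
    pages.push({ number: n, width, height, items });
  }
  return pages;
}
```

Awaiting each page before requesting the next keeps memory use predictable on long documents, at the cost of some parallelism.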

About This Tool

PDF (Portable Document Format) stores text as positioned character sequences rather than flowing paragraphs. Each text element in a PDF carries a transformation matrix that defines its exact placement on the page, along with font references and sizing data. This tool extracts that positioning layer and converts it into machine-readable JSON, preserving the spatial relationships between text elements that are invisible when simply copying text from a PDF viewer.

The output JSON follows a hierarchical structure: a top-level metadata object contains document-level properties such as title, author, creator application, and creation date. The pages array contains one entry per page, each with its dimensions in PDF points (1 point = 1/72 inch) and an array of text lines. Each line aggregates text items that share the same vertical position — items within 2 points of each other vertically are considered part of the same line. Within each line, items are ordered left-to-right by X-coordinate, preserving natural reading order regardless of the order they appear in the PDF's internal content stream.
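
The line grouping rule can be sketched as below. This is an illustrative reconstruction, not the tool's actual source: the type and function names are hypothetical, each item is compared against the Y value of the first item in the current line, the 2-point bound is treated as inclusive, and items are joined with a single space. The real implementation may differ on each of those points.

```typescript
// Hypothetical item and line shapes, with Y already measured from the top of the page.
interface PositionedItem {
  text: string;
  x: number;
  y: number;
}
interface TextLine {
  y: number;
  text: string;
  items: PositionedItem[];
}

const LINE_TOLERANCE_PT = 2;

function groupIntoLines(items: PositionedItem[], tolerance: number = LINE_TOLERANCE_PT): TextLine[] {
  // Sort top-to-bottom so each item only has to be compared with the most recent line.
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const groups: { y: number; items: PositionedItem[] }[] = [];
  for (const item of sorted) {
    const last = groups[groups.length - 1];
    if (last !== undefined && Math.abs(item.y - last.y) <= tolerance) {
      last.items.push(item);
    } else {
      groups.push({ y: item.y, items: [item] });
    }
  }
  // Within each line, order items left-to-right regardless of content-stream order.
  return groups.map((g) => {
    const ordered = [...g.items].sort((a, b) => a.x - b.x);
    return { y: g.y, text: ordered.map((i) => i.text).join(" "), items: ordered };
  });
}
```

For example, items "World" at (100, 50.5) and "Hello" at (10, 50) fall within the tolerance and come out as one line reading "Hello World", even though "World" appeared first in the input.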

Coordinate values are expressed in PDF points but converted from PDF's native bottom-left origin to a top-left origin, for consistency with web and screen coordinate systems. A Y value of 0 corresponds to the top of the page, and Y increases downward. Font sizes are derived from the text transformation matrix, specifically the vertical scaling component (transform[3]), which gives the effective rendered size in points for unrotated text. Font names reference the internal PDF font dictionary and may differ from the display names you see in applications. For example, "BCDFGH+ArialMT" indicates a subset-embedded Arial font, where the six letters before the plus sign are the subset tag.
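
As a rough illustration of those conversions, the sketch below works on the six-element transform [a, b, c, d, e, f], where index 3 is the vertical scale and index 5 is the vertical offset from the bottom of the page. The function names are hypothetical. Two details are assumptions rather than confirmed behavior: Y is flipped at the text baseline without adjusting for glyph height, and Math.abs is added as a guard against negative vertical scale.

```typescript
// Six-element text matrix [a, b, c, d, e, f].
type Matrix = [number, number, number, number, number, number];

// Flip the Y axis: PDF measures up from the bottom edge, the output measures down from the top.
function toTopLeftY(transform: Matrix, pageHeight: number): number {
  return pageHeight - transform[5];
}

// Vertical scaling component; Math.abs is an added guard, not something the tool is known to do.
function fontSizeFromTransform(transform: Matrix): number {
  return Math.abs(transform[3]);
}

// Remove a subset tag such as "BCDFGH+" to recover the base font name.
function stripSubsetPrefix(fontName: string): string {
  return fontName.replace(/^[A-Z]{6}\+/, "");
}
```

On a US Letter page 792 points tall, a text item whose matrix ends in a vertical offset of 720 lands 72 points, or one inch, below the top edge.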

This tool works exclusively with digitally created PDFs: documents generated by word processors, web browsers, LaTeX, or design software. These PDFs contain an explicit text layer that the extraction engine reads. Scanned PDFs, which store pages as raster images without embedded text, will produce empty or minimal output. To extract text from scanned documents, first run optical character recognition (OCR) to generate a text layer, then use this converter on the OCR-processed file.
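
One way to spot a likely scanned document from the JSON output is to count the pages that came back without any text. The sketch below assumes each page exposes a number and a lines array whose entries have a text field. The function names and the 0.9 threshold are hypothetical and would need tuning against real documents.

```typescript
// Hypothetical minimal view of one page in the output.
interface PageOutput {
  number: number;
  lines: { text: string }[];
}

// Page numbers whose lines are absent or contain only whitespace.
function pagesWithoutText(pages: PageOutput[]): number[] {
  return pages
    .filter((p) => p.lines.every((l) => l.text.trim() === ""))
    .map((p) => p.number);
}

// Heuristic: treat the document as scanned when nearly every page is empty.
function looksScanned(pages: PageOutput[], threshold: number = 0.9): boolean {
  if (pages.length === 0) return false;
  return pagesWithoutText(pages).length / pages.length >= threshold;
}
```

A mixed result, where only some pages are empty, often points to a document that combines digital pages with scanned inserts.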

All processing runs on pdfjs-dist, the packaged distribution of PDF.js, the same engine that powers Firefox's built-in PDF viewer. The library is loaded on demand when you upload a file, keeping initial page load fast. Your PDF file is read into browser memory, processed entirely client-side, and never transmitted over the network. Related tools for working with PDF data include JSON Formatter for beautifying the output, JSON Validator for verifying structure, CSV to JSON Converter for tabular data, and PDF Merge for combining documents.

Why Use This Tool

Converting PDF content to structured JSON unlocks programmatic access to document data across a wide range of workflows:

Data pipeline ingestion — ETL pipelines that process invoices, receipts, or reports need structured input. Converting PDF text to JSON with coordinate data enables rule-based field extraction: identify a label like "Total" by its position, then read the adjacent value. This approach works across documents that share a layout template without requiring machine learning models.
Document comparison and diffing — Comparing two versions of a contract or specification is straightforward once both are in JSON format. Standard JSON diff tools can identify added, removed, or changed text at the line level, something that is difficult to do reliably with raw PDF binary comparison.
Accessibility auditing: Accessibility engineers verify that PDFs contain a proper text layer by extracting and inspecting the text content. Missing text usually means the page is an image with no text layer, while garbled text points to font encoding problems. Either condition prevents screen readers from reading the content, and the JSON output makes both immediately visible.
Search index construction — Building a full-text search index over a PDF document collection requires extracting plain text with page numbers. The JSON output provides text grouped by page with line-level granularity, suitable for indexing with Elasticsearch, Typesense, or similar search engines.
Font and layout analysis — Designers and typesetters analyze font usage across large documents to verify brand compliance. The font name and size data in the JSON output reveals every typeface used, its rendering size, and where it appears — information that is tedious to extract manually from a PDF viewer's properties panel.
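Automated form processing: Organizations that receive filled PDF forms can extract field values by position. Since values appear at consistent coordinates across copies of the same template, a script can read the JSON output and pull values from known (x, y) regions without needing PDF form field parsing support. This applies to forms whose filled values are part of the page's text layer, such as flattened or printed-to-PDF forms. Values held only in interactive form fields may not appear in the text content.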
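
The rule-based field extraction mentioned in the first and last items above could be sketched like this. The types and the valueAfterLabel function are hypothetical, and the sketch assumes the items in each line are already ordered left-to-right, as the output is described as doing. It matches the label exactly and returns the next item on the same line.

```typescript
// Hypothetical minimal views of the output structure.
interface FieldItem {
  text: string;
  x: number;
}
interface FieldLine {
  text: string;
  items: FieldItem[];
}
interface FieldPage {
  number: number;
  lines: FieldLine[];
}

// Find the first line containing an item equal to the label and return the item to its right.
function valueAfterLabel(pages: FieldPage[], label: string): string | undefined {
  for (const page of pages) {
    for (const line of page.lines) {
      const idx = line.items.findIndex((it) => it.text.trim() === label);
      if (idx === -1) continue;
      const next = line.items[idx + 1];
      if (next !== undefined) return next.text.trim();
    }
  }
  return undefined;
}
```

Exact matching keeps "Subtotal" from being mistaken for "Total". A production rule would probably also check that the value falls inside an expected X range for the template.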

Unlike server-based extraction APIs that charge per page or require uploading sensitive documents to third-party infrastructure, this tool processes everything in your browser. The PDF never leaves your device, making it suitable for confidential contracts, medical records, financial statements, and any document where data sovereignty matters. For further processing, use the JSON Minifier to compress the output or the Word Counter to analyze the extracted text content.

FAQ

What data does the PDF to JSON converter extract?

The converter extracts document metadata (title, author, creation date), page dimensions, and every text item with its exact X/Y coordinates, font name, font size, and width. Text items are grouped into logical lines based on vertical proximity.

Can this tool extract text from scanned PDFs?

No. Scanned PDFs store content as raster images rather than selectable text. This tool extracts text only from digitally created PDFs that contain a text layer. For scanned documents, you need to run OCR (optical character recognition) software first.

Is my PDF uploaded to a server?

No. All processing runs locally in your browser using the pdfjs-dist JavaScript library. Your PDF file never leaves your device — the extraction happens entirely client-side.

What JSON structure does the output use?

The output contains a metadata object, the total page count, and a pages array. Each page includes its number, dimensions (width and height in points), and a lines array. Each line contains the full concatenated text and an items array with positioning data for each text run.
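
Expressed as TypeScript types, that structure might look like the sketch below. The exact property names (totalPages, creationDate, fontName, and so on) are guesses for illustration. Check them against a real output file before relying on them. The pageTexts helper is likewise hypothetical and shows how the page and line hierarchy flattens into plain text per page.

```typescript
// Property names are assumptions based on the prose description, not a confirmed schema.
interface OutputItem {
  text: string;
  x?: number;
  y?: number;
  width?: number;
  fontName?: string;
  fontSize?: number;
}
interface OutputLine {
  text: string;
  items: OutputItem[];
}
interface OutputPage {
  number: number;
  width: number;
  height: number;
  lines: OutputLine[];
}
interface OutputDocument {
  metadata: { title?: string; author?: string; creator?: string; creationDate?: string };
  totalPages: number;
  pages: OutputPage[];
}

// Flatten each page to plain text, one output line per extracted line.
function pageTexts(doc: OutputDocument): { page: number; text: string }[] {
  return doc.pages.map((p) => ({
    page: p.number,
    text: p.lines.map((l) => l.text).join("\n"),
  }));
}
```

Coordinate and font properties are marked optional here because the Options panel can switch them off.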

How are text items grouped into lines?

Text items are grouped by vertical position (Y-coordinate). Items within 2 points of each other vertically are considered part of the same line. Within each line, items are sorted left-to-right by X-coordinate to maintain natural reading order.