PDF OCR — Make Scanned PDFs Searchable
Add an invisible text layer to scanned PDFs with Tesseract.js OCR. Supports 100+ languages, runs entirely in your browser, and preserves the original page layout.
How to Use
Turn a scanned or image-based PDF into a fully searchable document in four steps:
- Upload your PDF — Drag and drop a scanned PDF into the dropzone or click to browse. The tool reads the file locally and displays the page count and file size. Scanned PDFs, photographed documents, and image-only PDFs all work.
- Select the document language — Choose the primary language of the text in your document from the dropdown. Accurate language selection is critical because Tesseract uses language-specific training data for character shapes, ligatures, and dictionary-based word correction. If your document contains multiple languages, select the dominant one.
- Run OCR — Click "Make Searchable" to begin processing. Each page is rendered at 300 DPI for optimal character recognition, then passed through Tesseract's LSTM recognition engine running in Tesseract.js. The progress indicator shows which page is being processed and the confidence score for each completed page.
- Download results — Once OCR finishes, review the overall confidence score and per-page breakdown. Download the searchable PDF (original pages with an invisible text layer) or a plain text file containing all recognized text. You can also copy the extracted text directly to your clipboard.
The searchable PDF preserves the exact visual appearance of your original document. The invisible text layer sits on top of each page image, perfectly aligned with the printed words. This means you can select text, use Ctrl+F to search, and copy passages — all while the document looks identical to the original scan. PDF readers like Adobe Acrobat, Chrome's built-in viewer, and Preview on macOS all support this text layer format.
About This Tool
Optical Character Recognition (OCR) is the process of converting images of text into machine-readable characters. When you scan a paper document, the resulting PDF contains a photograph of each page — the text you see is actually an image, not selectable characters. OCR analyzes these images, identifies individual letters and words, and produces the corresponding digital text along with precise coordinates for where each word appears on the page.
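Conceptually, each recognized word comes back as text plus a pixel bounding box. The sketch below shows a minimal TypeScript shape for that output; the field names (`text`, `confidence`, `bbox` with `x0`/`y0`/`x1`/`y1`) follow Tesseract.js conventions, but treat the snippet as illustrative rather than the tool's actual types.

```typescript
// Illustrative shape of per-word OCR output: the recognized text plus
// the pixel coordinates of its bounding box on the rendered page image.
interface OcrWord {
  text: string;
  confidence: number; // 0-100 certainty for this word
  bbox: { x0: number; y0: number; x1: number; y1: number }; // top-left origin, pixels
}

// Width and height of a word's bounding box in pixels.
function bboxSize(w: OcrWord): { width: number; height: number } {
  return { width: w.bbox.x1 - w.bbox.x0, height: w.bbox.y1 - w.bbox.y0 };
}

const sample: OcrWord = {
  text: "Invoice",
  confidence: 96,
  bbox: { x0: 120, y0: 80, x1: 340, y1: 130 },
};
```

These per-word boxes are exactly what the invisible text layer step needs later: each word is drawn back onto the page at its box's position.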
This tool uses Tesseract.js, a WebAssembly port of Tesseract, the most widely deployed open-source OCR engine. Originally developed at Hewlett-Packard in 1985 and maintained by Google from 2006 to 2018, Tesseract uses a Long Short-Term Memory (LSTM) neural network architecture trained on millions of text samples across more than 100 languages. The LSTM model processes text line by line, recognizing character sequences in context rather than isolated glyphs, which significantly improves accuracy for connected scripts and degraded text.
The 300 DPI rendering pipeline is a deliberate choice based on OCR research. The industry standard for OCR input is 300 dots per inch — below 200 DPI, small characters lose critical detail and recognition accuracy drops sharply. Above 400 DPI, the marginal accuracy gain is negligible while processing time and memory consumption increase substantially. At 300 DPI, a standard A4 page (8.27 × 11.69 inches) renders to a 2481 × 3507 pixel canvas, and 12pt text comes out roughly 50 pixels tall (12/72 inch × 300 DPI) — well within Tesseract's optimal recognition range.
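The arithmetic behind that choice is straightforward: PDF user space is defined at 72 units per inch, so a 300 DPI render uses a scale factor of 300/72 ≈ 4.17 (the kind of value passed to a renderer such as PDF.js via `getViewport({ scale })`). A small sketch with the A4 numbers from above:

```typescript
const PDF_UNITS_PER_INCH = 72; // PDF user-space resolution
const TARGET_DPI = 300;        // standard OCR input resolution

// Scale factor for the page renderer.
const scale = TARGET_DPI / PDF_UNITS_PER_INCH; // ≈ 4.167

// Canvas size in pixels for a page of the given physical dimensions.
function canvasSize(widthIn: number, heightIn: number, dpi: number) {
  return {
    width: Math.round(widthIn * dpi),
    height: Math.round(heightIn * dpi),
  };
}

// A4: 8.27 × 11.69 inches at 300 DPI → 2481 × 3507 pixels.
const a4 = canvasSize(8.27, 11.69, TARGET_DPI);
```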
The invisible text layer is the standard technique used by Adobe Acrobat, ABBYY FineReader, and other commercial OCR products. After OCR identifies each word and its bounding box coordinates, the tool uses pdf-lib to draw each word in the correct position using a standard font, with opacity set to zero. The font size is calculated to match each word's bounding box width, ensuring that when you select text in a PDF viewer, the selection highlight aligns precisely with the visible printed text underneath.
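A sketch of the placement math under two assumptions: the font's rendered width scales linearly with size (true of metrics like pdf-lib's `font.widthOfTextAtSize`), and OCR coordinates use a top-left origin while PDF pages use bottom-left, so the y coordinate must be flipped. The helper names here are hypothetical, not the tool's actual code:

```typescript
// Fit a font size so the drawn word spans its OCR bounding box.
// widthAtSize1 is the word's rendered width at font size 1
// (e.g. font.widthOfTextAtSize(text, 1) in pdf-lib); width is linear in size.
function fitFontSize(bboxWidth: number, widthAtSize1: number): number {
  return bboxWidth / widthAtSize1;
}

// Convert a top-left-origin OCR y coordinate (pixels) into the
// bottom-left-origin PDF coordinate space on a page of the same scale.
function toPdfY(ocrYBottom: number, pageHeightPx: number): number {
  return pageHeightPx - ocrYBottom;
}

// The word would then be drawn at (x, toPdfY(...)) with the fitted size
// and opacity 0, e.g. page.drawText(word, { x, y, size, opacity: 0 }).
```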
Confidence scoring reflects the OCR engine's certainty about its character predictions. Tesseract assigns a confidence value (0-100) to each recognized word based on the neural network's output probability distribution. High confidence (above 90%) typically means clear, well-printed text on a clean background. Lower confidence can indicate faded ink, complex backgrounds, unusual fonts, handwriting, or low scan resolution. The per-page and overall confidence scores help you quickly assess whether the OCR output is likely to be accurate enough for your needs.
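The text does not specify how word-level scores are combined into page and document scores, but a simple mean over words is one plausible aggregation; a minimal sketch:

```typescript
// Average word-level confidences (each 0-100) into a single page score.
function pageConfidence(wordConfidences: number[]): number {
  if (wordConfidences.length === 0) return 0;
  const sum = wordConfidences.reduce((acc, c) => acc + c, 0);
  return sum / wordConfidences.length;
}

// Flag pages that likely need manual review; the 90 threshold matches
// the "high confidence" rule of thumb described above.
function needsReview(confidence: number, threshold = 90): boolean {
  return confidence < threshold;
}
```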
Language model files are approximately 1-15 MB each and are downloaded from the Tesseract CDN the first time you select a language. Your browser caches these files automatically, so subsequent uses of the same language load instantly from cache. The language data includes the LSTM neural network weights, character set definitions, dictionaries for word-level correction, and script-specific recognition parameters. For simpler text extraction from digital PDFs that already contain text layers, OCR is not needed — those PDFs can be processed directly.
The entire processing pipeline runs locally in your browser via WebAssembly. PDF rendering uses pdfjs-dist (Mozilla's PDF.js), OCR runs in Tesseract.js, and the output PDF is assembled with pdf-lib. No data leaves your device, making this safe for processing confidential contracts, medical records, legal documents, and identity papers.
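The per-page flow described above can be sketched as three injected stages — render, recognize, overlay — wired together so the orchestration is visible without the libraries. The stage signatures are illustrative stand-ins for pdfjs-dist, Tesseract.js, and pdf-lib calls, not the tool's actual code:

```typescript
// Illustrative stage types; real implementations would wrap
// pdfjs-dist (render), Tesseract.js (recognize), and pdf-lib (overlay).
type RenderPage = (pageIndex: number) => string; // page image, e.g. a data URL
type RecognizePage = (image: string) => { text: string; confidence: number };
type OverlayText = (pageIndex: number, text: string) => void;

// Run OCR over every page and return the per-page confidence scores.
function processDocument(
  pageCount: number,
  render: RenderPage,
  recognize: RecognizePage,
  overlay: OverlayText,
): number[] {
  const confidences: number[] = [];
  for (let i = 0; i < pageCount; i++) {
    const image = render(i);
    const { text, confidence } = recognize(image);
    overlay(i, text); // invisible text layer for page i
    confidences.push(confidence);
  }
  return confidences;
}
```

Because the stages are parameters, the same loop works with stub implementations for testing and real library wrappers in the browser.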
Why Use This Tool
Making scanned PDFs searchable solves several practical problems that arise when working with digitized paper documents:
- Full-text search — Without OCR, finding a specific clause in a 200-page scanned contract requires scrolling through every page manually. A searchable PDF lets you press Ctrl+F and jump directly to every occurrence of a term. This transforms hour-long document reviews into seconds-long searches.
- Text selection and copying — Scanned PDFs contain only images, so you cannot select or copy any text. After OCR processing, every word becomes selectable, allowing you to copy passages into emails, reports, spreadsheets, or other documents without manual retyping.
- Accessibility compliance — Screen readers cannot interpret text embedded as images. Government agencies, universities, and corporations operating under accessibility laws (Section 508, WCAG 2.1, EN 301 549) are required to provide searchable text versions of scanned documents. OCR is the standard method for making legacy scanned archives accessible.
- Document management systems — Enterprise document management platforms (SharePoint, Google Drive, Dropbox, M-Files) index PDFs for search. Scanned PDFs without text layers are invisible to these search indexes. Running OCR before uploading ensures your documents are discoverable alongside born-digital files.
- Data extraction workflows — Extracting structured data from scanned invoices, receipts, forms, and reports requires machine-readable text as the first step. OCR converts the visual content into text that can then be parsed, analyzed, or fed into table extraction and data processing pipelines.
- Archival and preservation — Libraries, courts, and government archives routinely digitize historical paper records. Adding OCR text layers to these scans creates a dual-layer document: the visual fidelity of the original scan combined with searchable, indexable text. The PDF to Text tool can then extract the plain text for further analysis or database entry.
Because this tool runs entirely in your browser, it is safe for processing sensitive documents — legal filings, tax returns, medical records, employment contracts, and identity documents. Your files never leave your device, no copies are stored on remote servers, and no analytics track the content of your documents. The WebAssembly-based processing runs on your own hardware, which also means there are no per-page charges or subscription requirements, unlike commercial OCR services that typically charge $0.01-0.10 per page.