AI/ML · 2024

AI Document OCR

Python · Tesseract · GPT-4o · OpenAI API · PDF processing

The Problem

Extracting specific fields from unstructured PDF documents sounds straightforward until you face real-world documents: multi-page contracts, scanned invoices, inconsistent formatting. Sending entire documents to GPT-4o works, but at scale it gets expensive fast. Most tokens describe pages that hold nothing relevant.

The goal: pull exact attributes into a clean JSON output, ready to plug into downstream systems, without paying for tokens you don't need.

Pipeline Architecture

The extraction runs in three stages:

Stage 1: PDF to Text via Tesseract

Each PDF page is converted to an image first, then passed through Tesseract OCR. Splitting by page keeps memory manageable and makes parallel processing straightforward. Results are written to .md files, one per page; Tesseract's raw output is often messy, but that's acceptable at this stage.
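A minimal sketch of this stage, assuming `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary) are available; the `page_NNN.md` naming scheme is illustrative, not from the original pipeline:

```python
from pathlib import Path

def page_md_path(out_dir: Path, page_num: int) -> Path:
    """One markdown file per page, zero-padded so files sort naturally."""
    return out_dir / f"page_{page_num:03d}.md"

def ocr_pdf_to_markdown(pdf_path: str, out_dir: str) -> list:
    """Render each PDF page to an image, OCR it, write one .md per page."""
    # Lazy imports: both depend on system binaries being installed,
    # which is an assumption about the environment.
    from pdf2image import convert_from_path
    import pytesseract

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, image in enumerate(convert_from_path(pdf_path), start=1):
        text = pytesseract.image_to_string(image)  # raw OCR, often messy
        path = page_md_path(out, i)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Page-per-file output is what makes the later retrieval step cheap: each file is a natural chunk boundary.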

Stage 2: Retrieval Before Inference

Rather than feeding the full extracted text to GPT-4o, the pipeline runs a retrieval pass first. It identifies the pages or sections most likely to contain the target attributes, then builds a focused context window from those chunks only.

This is the key engineering decision: the LLM never sees the full document.
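The write-up doesn't specify how retrieval is implemented (it could equally be embedding-based), so here is a deliberately simple keyword-overlap sketch of the idea: score each page chunk against the attribute vocabulary and keep only the top few.

```python
def score_chunk(chunk: str, query_terms: set) -> int:
    """Count how many query terms appear in the chunk (case-insensitive)."""
    words = set(chunk.lower().split())
    return len(words & query_terms)

def top_chunks(chunks: list, query_terms: set, k: int = 3) -> list:
    """Return the k chunks most likely to contain the target attributes."""
    terms = {t.lower() for t in query_terms}
    ranked = sorted(chunks, key=lambda c: score_chunk(c, terms), reverse=True)
    return ranked[:k]
```

Whatever the scoring method, only the surviving chunks are concatenated into the context window for the next stage.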

Stage 3: Structured Extraction with GPT-4o

The filtered context is sent to GPT-4o with a prompt that defines the exact attributes to extract. The model returns a structured JSON object, ready for integration.
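A sketch of this call using the OpenAI Python SDK's JSON response mode; the attribute names, prompt wording, and helper names are illustrative assumptions, not the project's actual prompt:

```python
import json

def build_prompt(attributes: list, context: str) -> str:
    """Ask for exactly the listed attributes, as a flat JSON object."""
    fields = ", ".join(attributes)
    return (
        f"Extract the following fields from the document excerpt below: {fields}. "
        "Respond with a single JSON object whose keys are exactly those field "
        "names; use null for any field not found.\n\n" + context
    )

def extract_fields(attributes, context, model="gpt-4o"):
    """Send only the retrieved context to the model; parse the JSON reply."""
    from openai import OpenAI  # assumes OPENAI_API_KEY is set
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(attributes, context)}],
        response_format={"type": "json_object"},  # forces valid JSON output
    )
    return json.loads(resp.choices[0].message.content)
```

Requesting `json_object` output means the reply can be parsed directly, with no regex cleanup between the model and downstream consumers.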

PDF → per-page images → Tesseract OCR → .md chunks
                                              ↓
                                     Retrieval filter
                                              ↓
                                    GPT-4o (targeted context)
                                              ↓
                                       JSON output

Why Retrieval Before GPT

Tesseract on a 20-page document produces a lot of text. Without filtering, you'd send all of it upstream, most of which is irrelevant to the two or three fields you actually need. The retrieval step cuts that noise before it hits the API, reducing token consumption significantly while keeping extraction quality high.

The tradeoff: retrieval adds a step. The payoff: the system scales to large document volumes without runaway API costs.
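The arithmetic behind that payoff can be made concrete with a rough estimate (the per-page token count here is an assumed round number, not a measurement from the project):

```python
def token_savings(pages_total: int, pages_retrieved: int,
                  tokens_per_page: int = 500) -> float:
    """Fraction of input tokens saved by filtering before the API call."""
    full = pages_total * tokens_per_page
    filtered = pages_retrieved * tokens_per_page
    return 1 - filtered / full

# e.g. a 20-page contract where retrieval keeps 2 relevant pages
# saves 90% of the input tokens on every extraction call.
```

At high document volumes that ratio, not the retrieval step's own cost, dominates the bill.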

Key Design Decisions

  • Tesseract as preprocessor, not final extractor: Tesseract's output isn't clean enough to use directly, but it's fast and free. It does the heavy lifting of getting text off the page; GPT-4o handles the understanding.
  • Markdown as intermediate format: Storing OCR output as .md keeps it human-readable for debugging and easy to parse programmatically.
  • JSON output for composability: The final output is designed to be consumed by other systems. No post-processing required downstream.