
OCR pipelines for legacy federal documents.

April 14, 2026 · 13 min read · Microfiche, typewritten memos, handwritten fields, classical CV preprocessing, and LLM rescoring.

The scope of the legacy problem

Every federal agency has paper. Sometimes it is warehoused; sometimes it is microfiche; sometimes it has been scanned once with 1990s technology and the images have been sitting ever since. Digitizing it is an enduring federal need: FOIA response, records management, historical research, mission continuity. The tools to do it well have improved substantially in 2026, but a legacy OCR pipeline is still a real engineering project, not a library call.

OCR ERROR RATES COMPOUND

Even rigorous OCR (Tesseract 5, AWS Textract) produces 1–3% character error rates on degraded federal documents. For downstream AI use, errors compound through extraction. Build correction and confidence-scoring layers before any structured output.

OCR Engine Suitability — Accuracy by Document Type

Engine | Printed text | Tabular legacy forms | Handwriting
AWS Textract | 90% | 88% | 45%
Azure Document Intelligence | 88% | 85% | 52%
Google Document AI | 87% | 82% | 60%
Tesseract (open source) | 72% | 58% | 28%

No engine wins everywhere. Route per region, per document type.

The corpus you actually have

  • Typewritten memos and reports from the 1950s-1990s. Varying quality, carbon-paper copies, typewriter idiosyncrasies.
  • Microfiche and microfilm with compression artifacts, sometimes re-scanned from fiche readers.
  • Handwritten forms and annotations on otherwise printed forms.
  • Mixed-language or multi-script documents in programs with international scope.
  • Ornate letterheads and seals that confuse layout models.
  • Faxed documents with artifacts from the fax process compounding on top of the original.

Preprocessing that actually matters

Scan-level cleanup

Deskew

Hough line transform or Radon-based. Correct to within 0.5 degrees.

Rotation

Orientation detection (Tesseract --psm 0 or paddleocr angle classifier).

Border removal

Crop to content; remove scanner edges.

Binarization

Sauvola or Niblack adaptive methods beat Otsu on documents with uneven lighting. ImageMagick and OpenCV both work.

Denoising

Non-local means or bilateral filter. Avoid aggressive blurring on small text.

Dewarping

For curled book pages; dewarp with a trained model (DocTr, DocUNet) or page-boundary detection.

Bleed-through removal

For double-sided documents, subtract the back-page signal via registered scan pairs or learned filters.

Layout analysis

Before OCR, separate page zones: body text, tables, figures, headers, footers, margin notes, stamps. DocLayNet-trained detectors or Layout Parser. This routes different content to different downstream treatment.

OCR engine selection

Engine | Best for | Notes
Tesseract 5 (LSTM) | Clean typewritten, modern printed | Language models tunable; fine-tune on domain
PaddleOCR | Mixed quality; non-English; detection+recognition | Often better than Tesseract on degraded material
TrOCR (Hugging Face) | Handwriting; can be fine-tuned | Transformer; heavier compute; strong on target domain
Azure Document Intelligence / AWS Textract | General; handwriting | Cloud; FedRAMP High in GovCloud / Azure Gov
Kraken | Historical / handwritten; academic heritage | Specialized; smaller community
DocTR (Mindee) | Integrated detection+recognition, PyTorch/TF | Actively maintained; competitive
GPT-4o / Claude Vision | Hard cases, rescoring | Expensive at scale; strong on hard pages

LLM rescoring

OCR output is a character-level hypothesis with confidence. Many errors are character substitutions that produce non-words or unlikely word sequences. An LLM with a language prior can correct these:

  1. OCR produces the raw text plus per-word confidence.
  2. For low-confidence words or suspect sequences, LLM rescoring pass: "Here is the OCR output with confidence scores; correct errors using your language prior, but preserve any name, date, or number that might be domain-specific. Return only corrections with rationale."
  3. High-confidence corrections are applied; low-confidence corrections are flagged for human review.

This is a large accuracy lever on legacy typewritten material where the language is predictable but the OCR struggles with typewriter artifacts.
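The rescoring loop above can be sketched as follows. `call_llm` is a placeholder for whatever chat-completion client you use; the prompt, the 0.90 threshold, and the JSON reply format are all assumptions to calibrate, and this simplified variant flags (rather than applies) any correction the model makes to a word that was already high-confidence.

```python
import json

REVIEW_THRESHOLD = 0.90   # assumed cutoff; calibrate against ground truth

PROMPT_PREFIX = (
    "Here is OCR output with per-word confidence scores. Correct likely OCR "
    "errors using language context, but preserve any name, date, or number "
    "that might be domain-specific. Return JSON of the form "
    '{"corrections": [{"index": <word index>, "text": "<corrected word>"}]}\n\n'
)

def rescore(words, call_llm):
    """words: list of (token, confidence). call_llm: prompt str -> JSON str."""
    suspect = {i for i, (_, conf) in enumerate(words) if conf < REVIEW_THRESHOLD}
    tokens = [w for w, _ in words]
    if not suspect:
        return tokens, []                      # nothing worth an LLM call
    payload = json.dumps([{"index": i, "word": w, "conf": round(c, 2)}
                          for i, (w, c) in enumerate(words)])
    reply = json.loads(call_llm(PROMPT_PREFIX + payload))
    flagged = []
    for corr in reply.get("corrections", []):
        if corr["index"] in suspect:
            tokens[corr["index"]] = corr["text"]   # apply to low-confidence word
        else:
            flagged.append(corr)   # model touched a confident word: human review
    return tokens, flagged
```

Sending the full word sequence, not just the suspect words, matters: the language prior only helps if the model sees enough context to judge what the typewriter probably produced.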

Handwriting specifically

  • Layout analysis routes handwritten regions to a handwriting-specialized recognizer.
  • TrOCR handwritten variant is strong on English handwriting; fine-tune on agency-specific examples (cursive vs print, period style).
  • Azure DI handwriting is competitive and ready-made for authorized workloads.
  • Structured fields (date, name) benefit from field-level templates and validation.
  • Free-form handwriting remains hard; expect 60-85% accuracy and plan for human review on anything consequential.
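Field-level validation for structured handwritten fields can be as simple as a format whitelist. A minimal sketch for a date field, assuming typewriter-era US formats; the format list and the strip pattern are assumptions to adapt per form:

```python
import re
from datetime import datetime

def validate_date_field(raw: str):
    """Validate an OCR'd date field; returns (value, passed)."""
    cleaned = re.sub(r"[^0-9/\-]", "", raw)   # drop stray OCR characters (O, l, ...)
    for fmt in ("%m/%d/%Y", "%m/%d/%y", "%m-%d-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat(), True
        except ValueError:
            continue
    return raw, False   # failed every template: route to human review
```

A field that fails every template keeps its raw value and a failure flag, so the review queue sees what the recognizer actually produced.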

No single engine handles the whole corpus. The pipeline routes per region, per document type, and per quality stratum, so each engine gets the work it is best suited to.
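The per-region, per-type routing can be sketched as a lookup keyed on layout-analysis labels and quality strata. The engine names and labels here are illustrative, not a recommendation; real routing tables come out of the per-stratum accuracy measurements described below.

```python
# Hypothetical routing table: (region type, quality stratum) -> engine.
ROUTES = {
    ("printed", "clean"): "tesseract",
    ("printed", "degraded"): "paddleocr",
    ("handwriting", "clean"): "trocr",
    ("handwriting", "degraded"): "trocr",
    ("table", "clean"): "textract",
    ("table", "degraded"): "textract",
}

def route(region_type: str, quality: str) -> str:
    # Anything the table does not cover goes straight to human review.
    return ROUTES.get((region_type, quality), "human_review")
```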

Quality measurement

Ground truth from a subset is the only way to measure. For a legacy digitization program:

  • Human-transcribe 200-500 representative pages across quality strata. This is the ground truth.
  • Compute character error rate (CER) and word error rate (WER) per stratum.
  • Track named-entity extraction accuracy separately: 95% character accuracy with half the dates misread is unacceptable for many use cases.
  • Report confidence-calibration: for OCR outputs with confidence above threshold X, what is the actual accuracy? If they diverge, confidence is not useful.
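CER and WER are both edit-distance ratios; the only difference is the unit of comparison. A self-contained sketch:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over any sequence (characters for CER, words for WER).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate against the human-transcribed ground truth.
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    # Word error rate: same distance, computed over whitespace tokens.
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```

Compute these per quality stratum, never as a single archive-wide number: a 1% aggregate CER can hide a 15% CER on the microfiche stratum.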

Pipeline orchestration

  • Batch-oriented, not real-time. With reasonable preprocessing, OCR throughput on a modest GPU cluster can reach millions of pages per day.
  • Step Functions or Airflow for per-document orchestration.
  • Each page is a unit of work; document-level reassembly at the end.
  • Persist: original scan (if not already), preprocessed image, layout analysis, OCR output, LLM-rescored text, per-word confidence, pipeline versions.
  • Idempotent and resumable. Failures should not require reprocessing a whole archive.
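The idempotent, resumable property can be sketched with an append-only manifest keyed by page ID. A real pipeline would use a database or object-store markers rather than a local file; the manifest format and `pipeline_version` field are illustrative.

```python
import json
import pathlib

def process_archive(pages, process_page, manifest_path):
    """pages: iterable of (page_id, payload). Skips pages already recorded in
    the manifest, so a failed run resumes without reprocessing the archive."""
    manifest = pathlib.Path(manifest_path)
    done = set()
    if manifest.exists():
        for line in manifest.read_text().splitlines():
            done.add(json.loads(line)["page_id"])
    with manifest.open("a") as mf:
        for page_id, payload in pages:
            if page_id in done:
                continue                          # idempotent: already processed
            record = {"page_id": page_id,
                      "result": process_page(payload),
                      "pipeline_version": "v1"}   # version every derived artifact
            mf.write(json.dumps(record) + "\n")
            mf.flush()                            # persist per page, not per batch
```

The page, not the document, is the unit of work, which is what makes document-level reassembly a cheap final step rather than a constraint on the whole run.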

Human-in-the-loop for the hard parts

Even at 99% character accuracy, a million-page archive has meaningful residual error. Targeted human review:

  • Low-confidence pages prioritized.
  • Specific field types (names, case numbers, dates) always reviewed if below confidence threshold.
  • Reviewer corrections fed back into training data for the next model iteration.
  • Audit trail per correction: reviewer, timestamp, original OCR, corrected value.
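The audit-trail record above maps naturally onto an immutable structure; the field names here are illustrative, not a schema recommendation.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)           # corrections are append-only, never edited
class Correction:
    page_id: str
    field: str
    ocr_value: str                # what the pipeline produced
    corrected_value: str          # what the reviewer entered
    reviewer: str
    timestamp: str                # UTC, ISO 8601

def record_correction(page_id, field, ocr_value, corrected_value, reviewer):
    return asdict(Correction(page_id, field, ocr_value, corrected_value,
                             reviewer,
                             datetime.now(timezone.utc).isoformat()))
```

Keeping the original OCR value in the record is what lets reviewer corrections double as training data for the next model iteration.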

Preservation posture

  • Always retain the original scan. OCR text is a derivative; do not treat it as the system of record.
  • PDF/A-3 for long-term preservation, embedding the OCR layer alongside the image.
  • Store at appropriate resolution (400 dpi for archival, 300 dpi minimum).
  • Track pipeline version and parameters with each derived artifact so a future re-run can be compared.

Where this fits in our practice

We build archival digitization pipelines for federal records programs. See our document AI for federal PDFs for the broader document pipeline and our RAG architecture for downstream retrieval.

FAQ

Why is OCR harder on federal archives than on modern PDFs?
Physical degradation, inconsistent scan quality, typewriter artifacts, bleed-through, handwritten annotations, carbon-paper copies, microfiche with compression artifacts, and mixed layouts. A modern OCR engine tuned to contemporary documents produces garbage on genuine archive material without preprocessing.
What engine should I start with for typewritten federal documents?
Tesseract 5 with LSTM and a typewriter-tuned training language model is a strong baseline. PaddleOCR is competitive and often better on degraded material. TrOCR (Hugging Face) is a transformer-based alternative with strong results after fine-tuning on the target document type.
Can LLMs rescore OCR output and improve accuracy?
Yes, meaningfully. OCR produces character-level hypotheses with confidence; an LLM with a language prior (prompted or fine-tuned on the domain) can correct plausible-but-wrong OCR output, especially on typewritten documents with consistent language patterns. Expect 20-50% error reduction on cleanup tasks.
What is the right preprocessing stack for scanned archives?
Deskew, rotation correction, binarization (Sauvola or Niblack for uneven lighting), denoising, border removal, and optional dewarping for curled pages. OpenCV + scikit-image covers most of it. Preprocessing quality caps OCR quality.
How do you handle handwritten fields on archival forms?
Separate the typewritten body from handwritten fields via layout analysis (LayoutLM, DocLayNet-trained detectors). Run typewritten OCR on printed content, handwriting recognition (TrOCR-handwritten, Azure DI, cloud handwriting services) on the handwritten fields. Different error profiles, different pipelines.
What confidence should I expect on archival OCR?
Clean typewritten documents: 95-99% character accuracy after preprocessing. Degraded typewritten: 85-95%. Microfiche: 75-90%. Mixed handwriting: 60-85%. Always run spot-check human review and report accuracy by stratum, not as an aggregate.

Digitizing a federal archive?

We build archival OCR pipelines that handle microfiche, typewritten, and handwritten documents with error bounds you can defend.