The scope of the legacy problem
Every federal agency has paper. Sometimes it is warehoused; sometimes it is microfiche; sometimes it has been scanned once with 1990s technology and the images have been sitting ever since. Digitizing it is an enduring federal need: FOIA response, records management, historical research, mission continuity. The tools to do it well have improved substantially in 2026, but a legacy OCR pipeline is still a real engineering project, not a library call.
Even rigorous OCR (Tesseract 5, AWS Textract) produces 1–3% character error rates on degraded federal documents. For downstream AI use, errors compound through extraction. Build correction and confidence-scoring layers before any structured output.
OCR Engine Suitability — Printed Federal Text
[Bar chart: per-engine accuracy on printed text.] The full breakdown by document type:

| Engine | Tabular legacy forms | Handwriting |
|---|---|---|
| AWS Textract | 88% | 45% |
| Azure Document Intelligence | 85% | 52% |
| Google Document AI | 82% | 60% |
| Tesseract | 58% | 28% |

Route per region, per document type.
The corpus you actually have

- Typewritten memos and reports from the 1950s-1990s. Varying quality, carbon-paper copies, typewriter idiosyncrasies.
- Microfiche and microfilm with compression artifacts, sometimes re-scanned from fiche readers.
- Handwritten forms and annotations on otherwise printed forms.
- Mixed-language or multi-script documents in programs with international scope.
- Ornate letterheads and seals that confuse layout models.
- Faxed documents with artifacts from the fax process compounding on top of the original.
Preprocessing that actually matters
Scan-level cleanup
Deskew
Hough line transform or Radon-based. Correct to within 0.5 degrees.
Rotation
Orientation detection (Tesseract --psm 0 or PaddleOCR's angle classifier).
Border removal
Crop to content; remove scanner edges.
Binarization
Sauvola or Niblack adaptive methods beat Otsu on documents with uneven lighting. ImageMagick and OpenCV both work.
Denoising
Non-local means or bilateral filter. Avoid aggressive blurring on small text.
Dewarping
For curled book pages: use a trained model (DocTr, DocUNet) or page-boundary detection.
Bleed-through removal
For double-sided documents, subtract the back-page signal via registered scan pairs or learned filters.
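The binarization step above can be sketched in pure NumPy. This is a minimal Sauvola implementation using integral images for the local statistics; the window size and k value are illustrative defaults, not tuned for any particular corpus:

```python
import numpy as np

def sauvola_binarize(img: np.ndarray, window: int = 25, k: float = 0.2, r: float = 128.0) -> np.ndarray:
    """Sauvola adaptive binarization: threshold = mean * (1 + k * (std / r - 1)),
    with mean/std computed over a local window via integral images.
    img is a 2-D uint8 grayscale array; returns a uint8 array of 0/255."""
    f = img.astype(np.float64)
    pad = window // 2
    padded = np.pad(f, pad, mode="reflect")
    # Integral images of the padded image and its square, for O(1) window sums.
    s1 = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    s2 = np.cumsum(np.cumsum(padded ** 2, axis=0), axis=1)
    s1 = np.pad(s1, ((1, 0), (1, 0)))
    s2 = np.pad(s2, ((1, 0), (1, 0)))
    h, w = f.shape
    n = window * window
    # Window sums at every pixel via the four-corner integral-image identity.
    win_sum = (s1[window:window + h, window:window + w] - s1[:h, window:window + w]
               - s1[window:window + h, :w] + s1[:h, :w])
    win_sq = (s2[window:window + h, window:window + w] - s2[:h, window:window + w]
              - s2[window:window + h, :w] + s2[:h, :w])
    mean = win_sum / n
    std = np.sqrt(np.maximum(win_sq / n - mean ** 2, 0.0))
    threshold = mean * (1.0 + k * (std / r - 1.0))
    return (f > threshold).astype(np.uint8) * 255
```

Because the threshold tracks local mean and contrast, faded typewritten text on an unevenly lit page survives where a single global Otsu threshold would wash it out.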
Layout analysis
Before OCR, separate page zones: body text, tables, figures, headers, footers, margin notes, stamps. DocLayNet-trained detectors or Layout Parser. This routes different content to different downstream treatment.
OCR engine selection
| Engine | Best for | Notes |
|---|---|---|
| Tesseract 5 (LSTM) | Clean typewritten, modern printed | Language models tunable; fine-tune on domain |
| PaddleOCR | Mixed quality; non-English; detection+recognition | Often better than Tesseract on degraded material |
| TrOCR (Hugging Face) | Handwriting; can be fine-tuned | Transformer; heavier compute; strong on target domain |
| Azure Document Intelligence / AWS Textract | General; handwriting | Cloud; FedRAMP High in GovCloud / Azure Gov |
| Kraken | Historical / handwritten; academic heritage | Specialized; smaller community |
| DocTR (Mindee) | Integrated detection+recognition, PyTorch/TF | Actively maintained; competitive |
| GPT-4o / Claude Vision | Hard cases, rescoring | Expensive at scale; strong on hard pages |
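Whatever engine you pick, you need per-word confidence out of it. Tesseract exposes this via pytesseract's image_to_data (parallel 'text' and 'conf' lists); the sketch below works on that dict shape, with a hand-written sample standing in for a real OCR run:

```python
def low_confidence_words(data: dict, threshold: float = 80.0):
    """Pull (word, confidence) pairs below threshold from a dict shaped like
    pytesseract.image_to_data(img, output_type=Output.DICT). Structural boxes
    (lines, blocks) carry conf == -1 and are skipped along with empty strings."""
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged

# Hand-written sample mimicking the pytesseract output shape, not a real run.
sample = {
    "text": ["", "MEMORANDUM", "FOR", "tne", "RECORD"],
    "conf": ["-1", "96", "95", "41", "88"],
}
```

The flagged words are exactly the candidates for the LLM rescoring pass described next.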
LLM rescoring
OCR output is a character-level hypothesis with confidence. Many errors are character substitutions that produce non-words or unlikely word sequences. An LLM with a language prior can correct these:
- OCR produces the raw text plus per-word confidence.
- For low-confidence words or suspect sequences, LLM rescoring pass: "Here is the OCR output with confidence scores; correct errors using your language prior, but preserve any name, date, or number that might be domain-specific. Return only corrections with rationale."
- High-confidence corrections are applied; low-confidence corrections are flagged for human review.
This is a large accuracy lever on legacy typewritten material where the language is predictable but the OCR struggles with typewriter artifacts.
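The apply-or-flag logic in the steps above is worth pinning down, because it is where silent corruption can creep in. A minimal sketch with the LLM call injected as a callable (so the thresholding logic is testable without a model); the threshold values are illustrative:

```python
def apply_rescoring(words, correct_fn, flag_threshold=0.8, ocr_conf_floor=0.7):
    """Apply LLM corrections to low-confidence OCR words.
    words: list of (text, ocr_confidence in 0..1).
    correct_fn: callable word -> (corrected_text, correction_confidence);
    in production this wraps the LLM rescoring prompt described above.
    Returns (final_words, review_queue)."""
    final, review = [], []
    for text, conf in words:
        if conf >= ocr_conf_floor:
            final.append(text)          # OCR is confident: leave it alone
            continue
        corrected, c_conf = correct_fn(text)
        if c_conf >= flag_threshold:
            final.append(corrected)     # confident correction: apply it
        else:
            final.append(text)          # keep the original, queue for humans
            review.append((text, corrected, conf, c_conf))
    return final, review
```

Note the asymmetry: an unconfident correction is never applied, only queued. That keeps the LLM from "fixing" a domain-specific name or case number into a more probable but wrong word.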
Handwriting specifically
- Layout analysis routes handwritten regions to a handwriting-specialized recognizer.
- TrOCR handwritten variant is strong on English handwriting; fine-tune on agency-specific examples (cursive vs print, period style).
- Azure DI handwriting is competitive and ready-made for authorized workloads.
- Structured fields (date, name) benefit from field-level templates and validation.
- Free-form handwriting remains hard; expect 60-85% accuracy and plan for human review on anything consequential.
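Field-level validation is the cheapest of these wins. A minimal sketch for a date field, assuming a hypothetical MM/DD/YYYY form template: shape-check with a regex, then confirm the value actually parses, and route anything else to human review:

```python
import re
from datetime import datetime

def validate_date_field(raw: str):
    """Validate a recognized date like '03/14/1962' from a structured field.
    Returns a normalized ISO date string, or None to signal human review."""
    raw = raw.strip()
    if not re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", raw):
        return None  # wrong shape entirely: misread or free-form scribble
    try:
        return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
    except ValueError:
        return None  # shape matched but e.g. month 13: definitely a misread
```

The same pattern (template regex plus a semantic check) extends to case numbers, SSN-shaped fields, and form checkboxes.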
Quality measurement
Ground truth from a subset is the only way to measure. For a legacy digitization program:
- Human-transcribe 200-500 representative pages across quality strata. This is the ground truth.
- Compute character error rate (CER) and word error rate (WER) per stratum.
- Track named-entity extraction accuracy separately: 95% character accuracy with half the dates misread is unacceptable for many use cases.
- Report confidence calibration: for OCR outputs with confidence above threshold X, what is the actual accuracy? If the two diverge, the confidence scores are not useful.
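CER and WER are both edit distance against the ground-truth transcription, at character and word granularity respectively. A self-contained sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists),
    using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edits per reference word."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)
```

Compute both per quality stratum; a corpus-wide average hides the microfiche tail where the errors actually live.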
Pipeline orchestration
- Batch-oriented, not real-time. OCR throughput on a modest GPU cluster is millions of pages per day with reasonable preprocessing.
- Step Functions or Airflow for per-document orchestration.
- Each page is a unit of work; document-level reassembly at the end.
- Persist: original scan (if not already), preprocessed image, layout analysis, OCR output, LLM-rescored text, per-word confidence, pipeline versions.
- Idempotent and resumable. Failures should not require reprocessing a whole archive.
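Idempotency falls out naturally if the unit-of-work key is a content hash plus the pipeline version. A minimal sketch using the local filesystem as the result store (a real deployment would use S3 or a database, and the function names here are illustrative):

```python
import hashlib
import json
import os

def page_key(image_bytes: bytes, pipeline_version: str) -> str:
    """Stable work key: same scan + same pipeline version = same key, so a
    re-run is a no-op, while a new pipeline version reprocesses cleanly."""
    return hashlib.sha256(image_bytes + pipeline_version.encode()).hexdigest()

def process_page(image_bytes, pipeline_version, done_dir, ocr_fn):
    key = page_key(image_bytes, pipeline_version)
    out_path = os.path.join(done_dir, key + ".json")
    if os.path.exists(out_path):       # already done: skip, return cached result
        with open(out_path) as f:
            return json.load(f)
    result = {"key": key, "version": pipeline_version, "text": ocr_fn(image_bytes)}
    tmp = out_path + ".tmp"
    with open(tmp, "w") as f:          # write-then-rename so a crash mid-write
        json.dump(result, f)           # never leaves a corrupt "done" marker
    os.replace(tmp, out_path)
    return result
```

With this shape, resuming after a failure is simply re-running the whole batch: completed pages short-circuit on the existence check.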
Human-in-the-loop for the hard parts
Even at 99% character accuracy, a million-page archive has meaningful residual error. Targeted human review:
- Low-confidence pages prioritized.
- Specific field types (names, case numbers, dates) always reviewed if below confidence threshold.
- Reviewer corrections fed back into training data for the next model iteration.
- Audit trail per correction: reviewer, timestamp, original OCR, corrected value.
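The queue-and-audit mechanics above can be sketched with the standard library; the field names in the audit record mirror the list above and are otherwise illustrative:

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(order=True)
class ReviewItem:
    priority: float                      # e.g. page confidence; lower = first
    page_id: str = field(compare=False)
    reason: str = field(compare=False)

def audit_record(reviewer, page_id, field_name, original, corrected):
    """One immutable audit-trail entry per human correction."""
    return {
        "reviewer": reviewer,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "page_id": page_id,
        "field": field_name,
        "original_ocr": original,
        "corrected_value": corrected,
    }
```

A min-heap keyed on confidence means reviewers always see the worst pages first, which is where corrections buy the most accuracy per hour of review time.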
Preservation posture
- Always retain the original scan. OCR text is a derivative; do not treat it as the system of record.
- PDF/A-3 for long-term preservation, embedding the OCR layer alongside the image.
- Store at appropriate resolution (400 dpi for archival, 300 dpi minimum).
- Track pipeline version and parameters with each derived artifact so a future re-run can be compared.
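One lightweight way to track that provenance is a JSON sidecar per derived artifact; the field names here are a suggestion, not a standard:

```python
import hashlib
import json

def provenance_sidecar(artifact_bytes: bytes, pipeline_version: str, params: dict) -> str:
    """JSON sidecar tying a derived artifact (e.g. the OCR text layer) to the
    exact pipeline that produced it, so a future re-run can be diffed against it."""
    return json.dumps({
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "pipeline_version": pipeline_version,
        "parameters": params,
        "derivative": True,  # the original scan, not this, is the record
    }, sort_keys=True)
```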
Where this fits in our practice
We build archival digitization pipelines for federal records programs. See our piece on document AI for federal PDFs for the broader document pipeline, and our RAG architecture piece for downstream retrieval.