Why technical documents are full of latent structure

Federal technical documents — specifications, interface control documents, requirements packages, system descriptions — are full of structure that is never written down explicitly. There are entities, relationships, constraints, and units of measure scattered across hundreds of pages of prose, tables, and diagrams. A reader can extract the structure mentally; a database cannot.
A "domain model" is a structured representation of that latent structure. Think of it as a typed inventory of the things the document talks about — the components, the signals, the requirements, the tolerances — and the relationships between them. An "ontology" is the same idea pushed further, with formal type definitions and relationship rules that support automated reasoning.
The published research on LLM-driven schema synthesis (recurring in ESWC, ISWC, the ACL workshops on knowledge extraction, and the NeurIPS structured-prediction workshops) is converging on a workable pattern. The pattern is not "ask GPT to make a JSON." It is a layered pipeline with constrained generation, provenance tracking, and explicit verification. This article is a reading of that pattern.
Domain models versus ontologies: the practical difference
The two terms get used interchangeably, but the difference matters in practice.
A domain model is a typed inventory. It lists the entities the document describes (Pump, Valve, Sensor), their attributes (rated_pressure, max_temperature, calibration_date), and how they connect (Pump feeds Valve). It is essentially a strongly-typed JSON or XML schema with instances. Tools that consume it — databases, simulators, configuration managers — treat it as data.
An ontology adds rules. It says things like "every Pump must have exactly one rated_pressure" or "if a Valve is downstream of a Pump, the Valve's max_pressure must be at least the Pump's rated_pressure." Those rules can be checked automatically with a reasoner (a piece of software that derives conclusions from rules). Ontology languages like OWL (Web Ontology Language) and SHACL (Shapes Constraint Language) exist specifically to express these rules.
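To make the contrast concrete, here is a minimal sketch in Pydantic v2. The class and field names are invented for illustration, not taken from any published schema: the typed fields are the domain model; the validator is the kind of rule an ontology would express declaratively.

```python
from pydantic import BaseModel, Field, model_validator

# Illustrative names only; no published schema is being reproduced here.
class Pump(BaseModel):
    name: str
    rated_pressure_psi: float = Field(gt=0)  # the type enforces "exactly one rated_pressure"

class Valve(BaseModel):
    name: str
    max_pressure_psi: float = Field(gt=0)

class FeedLine(BaseModel):
    """Pump feeds Valve: the relationship from the domain-model inventory."""
    pump: Pump
    valve: Valve

    @model_validator(mode="after")
    def downstream_pressure_rule(self):
        # Ontology-style rule: a valve downstream of a pump must tolerate
        # the pump's rated pressure.
        if self.valve.max_pressure_psi < self.pump.rated_pressure_psi:
            raise ValueError(f"{self.valve.name} is rated below {self.pump.name}")
        return self
```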
For most federal document-extraction work, a strong domain model gets you 80 percent of the value at a fraction of the engineering cost. An ontology earns its keep when the document set is large enough that automated consistency checking pays off, or when the downstream consumer is a reasoner rather than a database.
Constrained generation: the schema goes in first
The single biggest methodology shift in the published LLM-extraction literature over the last three years has been constrained generation.
The old approach: ask the model to produce JSON, then parse it, then hope it is well-formed. The new approach: hand the model the target schema (JSON Schema, Pydantic model, or grammar) and force every token of the output to satisfy it. Tools like Outlines, Guidance, and the JSON-mode features in major LLM APIs implement this constraint at the decoder level. The output is, by construction, schema-conformant.
The published evaluations show two consequences. First, parse-failure rates drop from the 5 to 20 percent range (depending on the model) to effectively zero. Second, semantic accuracy of the extracted fields goes up too, because the constraint forces the model to commit to a structured answer rather than hedging in prose.
The methodological tip from the open literature is to make the schema as specific as the domain allows. Generic types ("string") give weak constraints; enums and regex patterns give strong ones. A schema that lists the fifteen valid component types, instead of allowing any string, prunes most of the hallucination surface before the model ever runs.
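As a sketch of that tip, here is what a deliberately narrow extraction schema might look like in Pydantic v2. The component types and part-number pattern are assumptions invented for illustration, not drawn from any real document corpus.

```python
from enum import Enum
from pydantic import BaseModel, Field

class ComponentType(str, Enum):
    PUMP = "pump"
    VALVE = "valve"
    SENSOR = "sensor"
    # ...a real schema would enumerate every valid type the domain allows

class ExtractedComponent(BaseModel):
    component_type: ComponentType  # enum, not a free-form string
    part_number: str = Field(pattern=r"^[A-Z]{2}-\d{4}$")  # regex, not "string"
    rated_pressure_psi: float | None = None

# ExtractedComponent.model_json_schema() yields the JSON Schema that a
# constrained decoder (Outlines, Guidance, a JSON-mode API) can enforce
# token by token.
```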
Provenance: every extracted entity points to a span
Provenance is the property that turns extraction output from "model claim" into "auditable evidence." Every extracted entity, attribute, and relationship carries a pointer back to the exact span of the source document it came from.
The published practice is to store the span as document_id plus character offsets, plus an excerpt. When a downstream user asks "where did this 1500 PSI rated_pressure come from," the system shows the paragraph. When a reviewer disagrees with an extraction, the system shows the same paragraph and the reviewer can edit either the extraction or note that the source itself is ambiguous.
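A minimal version of that record, with illustrative field names rather than any published standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceSpan:
    document_id: str
    char_start: int
    char_end: int
    excerpt: str  # verbatim source text, shown to reviewers on request

@dataclass(frozen=True)
class ExtractedAttribute:
    entity: str            # e.g. "Pump-03"
    attribute: str         # e.g. "rated_pressure"
    value: str             # e.g. "1500 PSI"
    provenance: SourceSpan

def show_source(attr: ExtractedAttribute, documents: dict[str, str]) -> str:
    """Answer 'where did this come from' by slicing the stored document text."""
    doc = documents[attr.provenance.document_id]
    return doc[attr.provenance.char_start:attr.provenance.char_end]
```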
Without provenance, extraction outputs are not auditable. With provenance, they become a structured layer over the existing document corpus — an index, not a replacement. That distinction matters a lot to federal reviewers who must defend any structured artifact back to its source.
Neuro-symbolic verification of generated schemas
Verification is what catches the cases where the model produced syntactically valid output that is semantically wrong.
The published verification pattern has three layers, each with a different strength.
Schema validation. The mechanical first pass. JSON Schema or XSD checking catches everything wrong with the shape of the output. Constrained generation usually makes this pass trivially, but the validator is kept as a safety net.
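A sketch of that safety net, using the `jsonschema` package and an illustrative schema:

```python
import jsonschema

COMPONENT_SCHEMA = {
    "type": "object",
    "required": ["component_type", "part_number"],
    "properties": {
        "component_type": {"enum": ["pump", "valve", "sensor"]},
        "part_number": {"type": "string", "pattern": "^[A-Z]{2}-[0-9]{4}$"},
        "rated_pressure_psi": {"type": "number", "exclusiveMinimum": 0},
    },
    "additionalProperties": False,
}

def shape_errors(extraction: dict) -> list[str]:
    """Return shape violations; an empty list means this layer passes."""
    validator = jsonschema.Draft202012Validator(COMPONENT_SCHEMA)
    return [err.message for err in validator.iter_errors(extraction)]
```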
Symbolic rule checking. Domain rules expressed in SHACL, Datalog, or a similar declarative language. Examples: "every Pump has exactly one rated_pressure"; "the document's stated total mass equals the sum of component masses within tolerance"; "no two components claim the same part_number." These rules catch the structural inconsistencies LLMs are prone to introduce.
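Those three example rules, sketched in plain Python standing in for SHACL or Datalog; the field names follow the hypothetical schema above:

```python
from collections import Counter

def rule_violations(components: list[dict], stated_total_mass_kg: float,
                    tolerance: float = 0.01) -> list[str]:
    violations = []

    # Every Pump has exactly one rated_pressure.
    for c in components:
        if c["component_type"] == "pump" and c.get("rated_pressure_psi") is None:
            violations.append(f"{c['part_number']}: pump missing rated_pressure")

    # No two components claim the same part_number.
    counts = Counter(c["part_number"] for c in components)
    violations += [f"duplicate part_number: {pn}" for pn, n in counts.items() if n > 1]

    # Stated total mass equals the sum of component masses within tolerance.
    total = sum(c.get("mass_kg", 0.0) for c in components)
    if abs(total - stated_total_mass_kg) > tolerance * stated_total_mass_kg:
        violations.append(
            f"component masses sum to {total}, document states {stated_total_mass_kg}")

    return violations
```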
LLM-as-judge with strict grounding. A second model reviews each extraction against the source document and answers "does the source actually support this claim?" This catches cases where the extraction is structurally fine but semantically unsupported by the source. The published practice is to run this only on items that pass the first two layers, and to keep the judge model separate (a different family or size) from the extractor to reduce correlated errors.
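A sketch of that grounded-judge pass, written against a generic text-in, text-out callable so no specific vendor API is assumed:

```python
JUDGE_TEMPLATE = """You are verifying an extraction against its cited source.

Source excerpt:
{excerpt}

Extracted claim: {entity}.{attribute} = {value}

Does the excerpt actually support this claim? Answer SUPPORTED or
UNSUPPORTED, then give one sentence of justification."""

def claim_is_supported(excerpt: str, entity: str, attribute: str, value: str,
                       complete) -> bool:
    """`complete` is any prompt-in, text-out LLM call; per the published
    practice it should be a different model family than the extractor."""
    prompt = JUDGE_TEMPLATE.format(excerpt=excerpt, entity=entity,
                                   attribute=attribute, value=value)
    return complete(prompt).strip().upper().startswith("SUPPORTED")
```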
The combination is what makes neuro-symbolic verification work. Each layer catches a different failure class. Skipping any layer leaves a known hole.
Evaluation rigor for structured-output LLMs
Evaluation is where most published schema-synthesis work falls short, and where the discipline matters most.
The right evaluation is not "did the LLM produce JSON?" The right evaluation is "did the produced structure match a human-validated gold standard, on documents the model did not see during training, with inter-annotator agreement on the gold standard itself?"
Each piece of that sentence carries weight.
Gold standard means a human-curated reference structure for each evaluation document. Building it is expensive; using it is essential. Without a gold standard, claimed accuracy numbers are unfalsifiable.
Out-of-distribution documents means evaluation on document types the model was not trained on. The published failure mode is to evaluate only on documents from the training distribution and report numbers that collapse on novel document types. Holding out by source, time period, or document family forces honest measurement.
Inter-annotator agreement (IAA) measures how much humans disagree on the gold standard itself. If two trained annotators disagree on 30 percent of relationships, the model is being graded against a ceiling lower than it appears. The published practice is to report IAA alongside model accuracy, and to investigate any disagreement before using the gold as a benchmark.
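Chance-corrected agreement is the standard way to report IAA. A self-contained Cohen's kappa for two annotators labeling the same items:

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Two annotators who agree on 70 percent of relationship labels can look
# acceptable until kappa corrects for how often they would agree by chance.
```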
Common failure modes and how the literature addresses them
Three failure modes recur in the published research, each with a documented mitigation.
Hallucinated relationships. The model invents an edge between two entities the source never connected. Mitigation: provenance plus LLM-as-judge grounded in the cited span.
Unit and tolerance errors. The model extracts "1500" without the unit, or rounds a tolerance silently. Mitigation: schemas that require explicit units, with a unit registry (sketched below); symbolic checks on tolerance arithmetic.
Acronym and synonym collision. The same component appears under three names across the document; the model creates three entities. Mitigation: a coreference-resolution pass before structuring, with a controlled vocabulary the schema references.
None of these is hypothetical. Each appears in the published evaluations of real extraction systems on real document corpora. The mitigations are also published; they are not exotic.
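Of the three mitigations, the unit-registry check is the easiest to show concretely. A sketch using the `pint` package; the expected-dimension strings use pint's dimension syntax, and the example values are illustrative:

```python
import pint

ureg = pint.UnitRegistry()

def parse_quantity(value: str, expected_dimension: str):
    """Reject bare numbers like '1500' and wrong dimensions like '1500 kg'
    where a pressure was expected."""
    qty = ureg(value)
    if not isinstance(qty, ureg.Quantity):
        raise ValueError(f"{value!r} has no unit attached")
    if not qty.check(expected_dimension):
        raise ValueError(f"{value!r} is not a {expected_dimension}")
    return qty

pressure = parse_quantity("1500 psi", "[pressure]")  # ok
# parse_quantity("1500", "[pressure]")               # would raise: no unit
```

The table below summarizes how the five public techniques divide the failure surface.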
| Layer | Public technique | Catches | Misses if used alone |
|---|---|---|---|
| Constrained generation | Outlines, Guidance, JSON-mode APIs | Malformed output, hedging prose | Semantic errors that satisfy the schema |
| Provenance binding | document_id + offset + excerpt | Untraceable claims | Wrong entities cited from real spans |
| Schema validation | JSON Schema, XSD | Structural drift | Structurally valid hallucinations |
| Symbolic rules | SHACL, OWL, Datalog | Cross-entity inconsistencies | Single-entity hallucinations |
| LLM-as-judge | Grounded second-model pass | Unsupported but consistent claims | Correlated hallucination if same model family |
Where the published research lands today
Schema synthesis from technical documents is mature enough to deploy with discipline. The field is not "should we use LLMs for this" anymore; it is "which combination of constraints and verifications fits this document type."
The strongest published results are on documents that are partially structured to begin with — technical specifications with consistent section structure, interface control documents with tabular data, requirements packages with numbered statements. Pure-prose documents (concept of operations, narrative analyses) are harder; published F1 scores on entity and relationship extraction drop noticeably.
Evaluation discipline is the most actionable lever the open literature identifies. Many published systems claim accuracy numbers that fall apart under out-of-distribution evaluation, or that are reported against a gold standard with weak IAA. Programs that take evaluation seriously deploy with realistic confidence; programs that skip it ship something that fails on the first novel document and lose reviewer trust permanently.
Human-in-the-loop integration
None of the published deployments removes the human reviewer. The successful pattern is a pipeline that produces a defensible structure draft and a reviewer who edits, approves, and provides correction signal.
The interface details are consistent across the open literature. Show the reviewer the extracted structure side-by-side with the source document, with each extraction linked to its source span. Highlight verification warnings inline. Let the reviewer accept, edit, or reject each extraction with a recorded reason. Treat the edit stream as the most valuable training signal in the system.
The published practice is to track edits as labeled corrections and to retrain the extractor on a regular cadence. Systems that capture this signal improve steadily; systems that do not capture it stagnate at the accuracy of the original training corpus.
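A minimal record for that edit stream, with illustrative field names rather than any published format:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Decision(str, Enum):
    ACCEPT = "accept"
    EDIT = "edit"
    REJECT = "reject"

@dataclass
class ReviewEvent:
    extraction_id: str
    decision: Decision
    reason: str                          # recorded for every decision
    corrected_value: str | None = None   # populated only on EDIT
    reviewed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# Every EDIT or REJECT event pairs a source span and model output with a
# human correction: exactly the labeled example the next retrain consumes.
```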
Frequently asked questions
**When should a project use a domain model rather than an ontology?**
Use a domain model when the downstream consumer is a database or simulator. Use an ontology when the downstream consumer is a reasoner that needs to derive conclusions automatically, or when the document corpus is large enough that automated consistency checking pays off. For most federal extraction work, a strong domain model is sufficient.
**Does constrained generation eliminate hallucination?**
No. It eliminates malformed output, which is a different problem. The model can still produce structurally valid claims that the source does not support. Provenance, symbolic rules, and LLM-as-judge each address part of that residual hallucination surface.
**How large does the gold standard need to be?**
The published practice suggests that 50 to 200 carefully annotated documents are enough for meaningful evaluation if inter-annotator agreement is reported and the documents are deliberately drawn out-of-distribution. Smaller gold standards produce noisy numbers; larger ones rarely move conclusions if the smaller set was honestly diverse.
**Should the output be JSON, XML, or RDF?**
JSON is dominant in the published recent work because of the constrained-generation tooling that targets it. XML is still common where downstream consumers (MBSE tools, configuration managers) require it. RDF is the right choice when the model is genuinely an ontology with a reasoner attached. Format follows the consumer.
How we use this site
We write articles like this to make our reading visible — what we think the open literature says, what we think the open gaps are, and where careful work might land. We do not use these pages to preview proposed approaches in active program spaces. Precision Federal is a software-only SBIR firm. If your office is funding work in this area and would value a software-first partner with a documented public-reading habit, we welcome the introduction.