What a quick-look report actually is

A quick-look is the short, fast version of an ASW exercise debrief. Analysts have terabytes of sensor recordings and a small window to tell evaluators what happened. The report has to say: when the contacts showed up, which sensors heard them, where the high-confidence runs were, where the false alarms clustered, and where the data is too messy to draw conclusions from.
The quick-look is not the final word. A deeper analysis follows over the next weeks. But the quick-look is what evaluators read first, and it is what decides whether the deeper analysis even gets the resources to happen.
The reason ML is in this conversation at all is volume. A multi-day exercise produces more data than any one person can read by hand inside the quick-look window. The published research treats the problem as compression and triage — helping analysts get to the interesting moments faster — rather than as full automation.
The two parallel processing flows
Most ASW pipelines run two flows side by side. One looks for steady tones; the other looks for short, sharp events.
Narrowband processing focuses on the slow, repeating sounds a ship's machinery makes — the steady hum of a propulsion system, the tone of a generator. These tones stand out against random ocean noise the way a hummed note stands out in a noisy room. The classical tool for visualizing them is called a LOFAR-gram (Low-Frequency Analysis and Recording — a kind of waterfall plot of frequency over time). DEMON (Detection of Envelope Modulation on Noise) is a related technique that finds rhythms hidden inside broadband noise.
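To make the two views concrete, here is a minimal sketch of both, assuming a single-channel recording `x` sampled at `fs` Hz. The function names, FFT sizes, and band edges are illustrative choices, not parameters from any published system.

```python
# A minimal sketch of the two classical narrowband views, assuming a mono
# hydrophone recording `x` sampled at `fs` Hz (both hypothetical inputs).
import numpy as np
from scipy.signal import spectrogram, hilbert, butter, filtfilt

def lofargram(x, fs, nfft=4096, overlap=0.75):
    """Waterfall of frequency vs. time: steady machinery tones show up
    as lines that persist across time."""
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nfft,
                            noverlap=int(nfft * overlap))
    return f, t, 10 * np.log10(Sxx + 1e-12)  # dB scale for display

def demon_spectrum(x, fs, band=(1000.0, 5000.0), nfft=8192):
    """DEMON: band-pass the broadband noise, take its envelope, and look
    for low-frequency rhythms (e.g., propeller blade rate) in the envelope.
    A real chain would decimate the envelope first; this is a sketch."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    envelope = np.abs(hilbert(filtfilt(b, a, x)))
    envelope -= envelope.mean()               # remove DC before the FFT
    spec = np.abs(np.fft.rfft(envelope, n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, spec
```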
Broadband processing handles the short events — clangs, transients, pulses. These do not have a neat tonal signature; they are short bursts of energy across many frequencies at once. The classical front end here uses beamforming and spectral whitening to clean up the signal before any pattern-matching happens.
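The whitening half is the easier one to picture: divide each spectrum by a smoothed estimate of its own magnitude so the steady background flattens and short bursts stand out. A minimal sketch, with an illustrative smoothing width:

```python
# A minimal sketch of spectral whitening, the second half of the classical
# broadband front end. The smoothing window length is an illustrative choice.
import numpy as np
from scipy.ndimage import uniform_filter1d

def whiten(x, nfft=8192, smooth_bins=64):
    """Divide the spectrum by a smoothed estimate of its own magnitude,
    flattening steady background color so short transients stand out."""
    X = np.fft.rfft(x, n=nfft)
    mag = np.abs(X)
    background = uniform_filter1d(mag, size=smooth_bins) + 1e-12
    return np.fft.irfft(X / background, n=nfft)
```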
The published ML systems plug into both flows. They consume the cleaned-up time-frequency images, not raw waveforms, which lets them inherit decades of mature signal-processing work instead of starting from scratch.
Where ML actually enters the loop
ML does not replace the signal-processing chain. It enters at four specific spots, and the published research is careful to keep them separate so each can be evaluated on its own.
First: data-quality triage. Before anything else, the system asks "is this sensor window even usable?" If a hydrophone got bumped, if the ambient noise level shifted, if calibration drifted — the system flags the window as suspect so downstream models do not learn to predict on artifacts. (A minimal sketch of this check follows the list.)
Second: artifact flagging. CNNs and sequence models score time-frequency tiles for "this looks unusual." A high score does not mean "submarine"; it means "an analyst should look at this." The distinction matters more than it might sound.
Third: learned association in tracking. Once contacts are detected, something has to decide which detection at time T is the same object as a detection at time T+1. Classical methods use distance metrics; learned association functions, borrowed from work like DeepSORT (a visual multi-object tracker), do better when the scene is crowded.
Fourth: summary generation. At the very end, an extractive summarizer pulls the highest-confidence flagged events into a structured table that an analyst can review.
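A minimal sketch of the first step's triage check, assuming raw window arrays and a per-exercise reference level; every threshold below is an illustrative stand-in for exercise-specific tuning:

```python
# A minimal sketch of data-quality triage. Thresholds are illustrative.
import numpy as np

def triage_window(window: np.ndarray, reference_rms: float) -> list[str]:
    """Return a list of reasons this window is suspect (empty = usable)."""
    reasons = []
    if np.max(np.abs(window)) >= 0.99:     # clipped samples: bump or overload
        reasons.append("clipping")
    rms = np.sqrt(np.mean(window ** 2))
    if rms < 0.1 * reference_rms:          # near-flatline: dead or disconnected channel
        reasons.append("flatline")
    if rms > 10.0 * reference_rms:         # gross level shift vs. the exercise baseline
        reasons.append("level_shift")
    return reasons
```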
Automated artifact flagging in plain terms
This is the single most useful function ML serves in a quick-look. It tells the analyst where to look first.
The model scans through the time-frequency representation of the sensor data and assigns each tile a "this is unusual" score. CNN-based scoring is most common in the published work, with self-supervised methods (the model learns what "normal ambient" looks like from unlabeled data, then flags anything that deviates) gaining ground.
The methodological discipline that matters here is calibration. A flag score of 0.91 should mean roughly the same thing on every exercise — otherwise the analyst has to relearn the threshold every time. Published work uses techniques like temperature scaling and isotonic regression (statistical methods that adjust the raw model output so the numbers line up with real-world frequencies) to keep confidence stable across deployments.
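Both techniques are small in code. A minimal sketch, assuming held-out model outputs with analyst-confirmed labels (`logits`, `raw_scores`, and `labels` are hypothetical arrays):

```python
# A minimal sketch of the two calibration techniques named above, fit on a
# held-out exercise rather than the training data.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.isotonic import IsotonicRegression

def fit_temperature(logits, labels):
    """Find the scalar T that makes sigmoid(logit / T) match observed
    frequencies (minimum negative log-likelihood on the held-out set)."""
    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def fit_isotonic(raw_scores, labels):
    """Monotone remapping of raw scores onto observed flag frequencies."""
    return IsotonicRegression(out_of_bounds="clip").fit(raw_scores, labels)
```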
And every flag has to be paired with its evidence. A bare score is useless. The flag record needs the input window that produced it, the bearing estimate, the time stamp, the model version, and an analyst-readable label. If an analyst cannot trace a flag back to its origin, the analyst does not trust it.
| Flag class | What it watches for | How calibration is handled | What goes wrong without calibration |
|---|---|---|---|
| Tonal-narrowband | Steady tones from machinery | Temperature scaling on a held-out exercise | Analyst over-trusts ambiguous tonals |
| Broadband transient | Short bursts and clangs | Isotonic regression on a real-data subset | Flood of low-quality transient flags |
| Sensor degradation | Hydrophone bumps, drift, contamination | Threshold on a clean-vs-noisy split | Bad data treated as good |
| Cross-sensor consistency | Same event seen on multiple sensors | Reliability diagram on agreement bins | False corroboration across channels |
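As a data structure, such a flag record can be as simple as the sketch below; the field names are illustrative, not a published schema:

```python
# A minimal sketch of a flag record that preserves provenance.
# Field names are illustrative, not from any published system.
from dataclasses import dataclass

@dataclass(frozen=True)
class FlagRecord:
    flag_class: str        # e.g. "tonal-narrowband", "broadband transient"
    score: float           # calibrated confidence, not the raw model output
    timestamp_utc: str     # when the flagged window starts
    bearing_deg: float     # bearing estimate at flag time
    window_ref: str        # pointer to the exact input window that produced it
    model_version: str     # which model (and weights) generated the score
    label: str             # analyst-readable description
```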
The synthetic data question
Real ASW data is hard to get. The publicly available datasets are limited, and the operationally meaningful recordings are usually classified. Researchers fill the gap with synthetic data — acoustic simulations.
Two open-source physics simulators do most of the work in the public literature: Bellhop (a ray-tracing simulator that models sound bending through different water layers) and Kraken (a normal-mode simulator suited to long-range, low-frequency propagation). Combined with parametric source models — "what does a vessel of class X sound like" — researchers can simulate scenarios real datasets do not cover.
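The propagation half of that pipeline runs through Bellhop or Kraken themselves. The parametric source half is straightforward to sketch: a fundamental machinery tone plus harmonics over broadband noise. Every parameter below is an illustrative stand-in; real signature models are far richer:

```python
# A minimal sketch of a parametric source model: a fundamental machinery
# tone with harmonics over broadband noise. All parameters are illustrative.
import numpy as np

def vessel_signature(duration_s, fs, fundamental_hz=60.0, n_harmonics=5,
                     tone_snr_db=12.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * fs)) / fs
    noise = rng.standard_normal(t.size)           # broadband component
    tones = sum(np.sin(2 * np.pi * fundamental_hz * k * t
                       + rng.uniform(0, 2 * np.pi))
                for k in range(1, n_harmonics + 1))
    gain = 10 ** (tone_snr_db / 20.0) / n_harmonics
    return noise + gain * tones
```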
The honest discipline in the published work is to keep a real held-out test set and report performance on it separately. The "synthetic-real gap" — the difference between scores on simulated data and scores on real recordings — is treated as a methodology number, not buried. Self-supervised pretraining on unlabeled hydrophone recordings, then fine-tuning on a small labeled real set, has emerged as a practical compromise.
Tracking summaries that survive review
Tracking is the part evaluators care about most. Where did contacts go, how long did the system hold them, where did tracks break or merge?
The published work treats a tracking summary as the joint output of three stages: bearing estimation (which direction is the sound coming from), association (linking detections across time into tracks), and track management (when to start, confirm, and delete tracks). Each stage has multiple defensible methods, from classical (MUSIC, ESPRIT, Mahalanobis-distance association) to learned (neural beamformers, learned cost functions for association).
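A minimal sketch of one association step in the DeepSORT spirit, blending a Mahalanobis motion gate with an appearance-embedding distance. The two-element measurement vectors (say, bearing and center frequency), the gate value, the 50/50 blend, and the assumption of unit-normalized embeddings are all illustrative choices:

```python
# A minimal sketch of learned association: gate on motion plausibility,
# then assign by a blended motion-plus-appearance cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_preds, track_covs, track_embs, det_meas, det_embs,
              blend=0.5, gate=9.21):
    """gate ~ chi-square 99th percentile for 2 degrees of freedom."""
    n_t, n_d = len(track_preds), len(det_meas)
    cost = np.full((n_t, n_d), 1e6)
    for i in range(n_t):
        for j in range(n_d):
            diff = det_meas[j] - track_preds[i]
            maha = float(diff @ np.linalg.inv(track_covs[i]) @ diff)
            if maha > gate:
                continue                 # motion-implausible pairs stay blocked
            appear = 1.0 - float(np.dot(track_embs[i], det_embs[j]))
            cost[i, j] = blend * maha + (1 - blend) * appear
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
```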
The methodology that survives review reports end-to-end tracking metrics — not just detection accuracy. OSPA (Optimal Sub-Pattern Assignment) and GOSPA are the standard published metrics; both score how close the system's set of tracks is to the ground-truth set of tracks. Track lifetime and swap rates round out the picture.
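OSPA itself is compact enough to sketch. For two sets of track positions at one time step, it solves an optimal matching under a cutoff distance `c`, then folds in a penalty for the unmatched tracks:

```python
# A minimal sketch of the OSPA metric between two sets of track positions
# at one time step, with cutoff c and order p as in the published definition.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=10.0, p=2):
    """X, Y: arrays of shape (m, d) and (n, d). Returns the OSPA distance."""
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m > n:                          # keep m <= n by symmetry
        X, Y, m, n = Y, X, n, m
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    d = np.minimum(d, c) ** p          # cutoff distance, raised to order p
    rows, cols = linear_sum_assignment(d)
    localization = d[rows, cols].sum()
    cardinality = (c ** p) * (n - m)   # penalty for unmatched tracks
    return ((localization + cardinality) / n) ** (1.0 / p)
```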
Why does this matter? Because a system that scores higher on detection F1 (the classical accuracy metric for individual detections) can produce worse tracks — the extra detections are noise that confuses the association step. A quick-look tool tuned only on detector F1 will look better on paper and worse in practice.
Generating the summary itself
At the end of the pipeline, the system has to produce something an evaluator can read. The published research distinguishes two flavors.
Extractive summaries pull the highest-confidence flagged events directly out of the data into a structured table — here is the time, here is the bearing, here is the snippet, here is the confidence. Every row points to specific evidence. Provenance is preserved end to end.
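Under the illustrative `FlagRecord` sketched earlier, the extractive step is little more than a sort and a projection:

```python
# A minimal sketch of the extractive step: sort flag records by calibrated
# confidence and emit the top rows with their evidence pointers intact.
# `FlagRecord` is the illustrative dataclass sketched earlier.

def extractive_summary(flags, top_k=20, min_score=0.5):
    rows = sorted((f for f in flags if f.score >= min_score),
                  key=lambda f: f.score, reverse=True)[:top_k]
    return [
        {
            "time": f.timestamp_utc,
            "bearing_deg": f.bearing_deg,
            "confidence": round(f.score, 2),
            "evidence": f.window_ref,   # every row points at its source window
            "model": f.model_version,
        }
        for f in rows
    ]
```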
Abstractive summaries use a language model to write a short narrative paragraph: "During hour 14, the southern hydrophone array detected three persistent tonal sources..." These are easier to read but harder to trust. The risk is hallucination — the model writing a sentence the underlying data does not support.
The published systems that do both pair them carefully. The narrative paragraph is annotated with citations that link every sentence back to the extractive evidence. An analyst can scan the narrative, follow any claim to its evidence, and accept or reject it. That is the only way LLM-generated text earns its place in an evaluator-facing report.
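One way to enforce that pairing mechanically is to reject any narrative sentence whose citations do not resolve to extractive evidence. A minimal sketch, assuming an `[E3]`-style citation convention (an illustrative format, not one from the published systems):

```python
# A minimal sketch of the pairing discipline: reject any narrative sentence
# that does not cite at least one real extractive row.
import re

CITATION = re.compile(r"\[E(\d+)\]")

def validate_narrative(sentences, evidence_ids):
    """Return (accepted, rejected) sentence lists, given a set of valid
    evidence IDs from the extractive table."""
    accepted, rejected = [], []
    for s in sentences:
        cited = {int(m) for m in CITATION.findall(s)}
        if cited and cited <= evidence_ids:
            accepted.append(s)   # every citation resolves to real evidence
        else:
            rejected.append(s)   # uncited or dangling claims go back for review
    return accepted, rejected
```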
Why analyst workflow is half the problem
A quick-look ML tool that ignores how analysts actually work will not be used — no matter how accurate the underlying model is.
Analysts work with established displays: bearing-time plots, waterfalls, narrowband and broadband panels. Decades of training are baked into those interfaces. A new tool that adds a parallel display analysts have to learn from scratch competes with the workflow rather than helping it.
The published human-factors research, including ONR-funded work on operator decision-making, is consistent: tools that overlay flags and confidence onto existing displays get adopted; tools that ask analysts to switch contexts do not. Auditability is the second adoption factor. If an analyst can click a flag and see exactly which window of data and which model version produced it, trust grows. If the flag is opaque, trust does not grow.
Common questions on the public-methods framing
What is the synthetic-real gap, in plain terms?
Models trained on simulated acoustic data tend to score higher on simulated test data than on real recordings. The "gap" is that score difference. Honest published work reports the real-data number separately so reviewers see what to expect in operations.
Why does calibration matter as much as accuracy?
Imagine the system says it is "90% confident" about a flag. If that 90% really means 90% in practice, an analyst can act on it. If 90% sometimes means 50%, the analyst has to retrain their intuition every exercise. Calibration is what makes confidence numbers usable.
What does this article not cover?
Specific platform integrations, signature characterizations under restriction, or any Precision Federal architectural approach to quick-look tooling.
Frequently asked questions
What is a quick-look report?
The first compressed account of an exercise — when contacts were detected, which sensors carried them, where the high-confidence runs were, and where the data is too degraded to draw conclusions from. It precedes the deeper analysis that follows over the next weeks and gates whether that analysis happens.
Why bring ML into the quick-look at all?
Because the volume problem dominates. A multi-day exercise produces more data than an analyst can read by hand within the quick-look window. ML helps surface the windows worth reading first — with confidence scores and audit trails — rather than replacing the analyst.
Does ML replace the classical signal-processing front end?
No, and the published work is consistent that it should not. The classical front end — beamforming, spectral whitening, LOFAR-gram and DEMON-gram generation — remains the foundation. Modern ML enters as a layer on top, scoring the cleaned-up output rather than replacing the cleanup.
Which metrics show whether a quick-look ML tool works?
End-to-end tracking metrics like OSPA and GOSPA, plus calibration metrics on flag confidence, plus analyst-acceptance rates in human-factors studies. Component accuracy on individual detections is necessary but not sufficient — a higher detection F1 can produce worse tracks.
How we use this site
We write articles like this to make our reading of the open literature visible — what we think the published methods say, what the open gaps are, and where careful work might land. We do not use these pages to preview proposed approaches in active program spaces. Precision Federal is a software-only SBIR firm. If your office is funding work in this area and would value a software-first partner with a documented public-reading habit, we welcome the introduction.