The passive-sonar problem

Passive sonar systems listen rather than emit. They detect and track objects from radiated acoustic signatures — propulsion noise, machinery tonals, and environmental sounds. The publicly available signal-processing literature for passive sonar is older and deeper than the active-sonar literature, with mature treatments of narrowband detection, broadband detection, and bearing estimation. Modern ML methods have entered this stack at specific layers.
The classical LOFAR and DEMON front ends remain foundational. Modern ML augments them at specific layers (narrowband classification, broadband attention, learned association) without displacing the classical core.
Narrowband versus broadband
Narrowband processing extracts persistent tonal signatures using LOFAR (low-frequency analysis and recording) and DEMON (detection of envelope modulation on noise) techniques. Broadband processing tracks signature integrity across wider frequency bands. The published research has applied deep learning to both: CNNs and transformer encoders for tonal classification, time-frequency analysis with learned attention for broadband. The classical signal-processing front end — beamforming, spectral whitening, and band-limited normalization — remains the foundation, with modern ML systems consuming the front-end output rather than replacing it.
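As a concreteness check, the square-law DEMON chain (band-limit, envelope-detect, analyze the low-frequency modulation spectrum) can be sketched in a few lines of Python with numpy. The band edges, modulation rate, and FFT-mask filtering below are illustrative choices, not operational values; a real front end uses proper filter banks.

```python
import numpy as np

def demon_spectrum(x, fs, band=(1000.0, 4000.0), max_mod_hz=60.0):
    """Square-law DEMON: band-limit, envelope-detect, and return the
    low-frequency modulation spectrum where propeller lines appear."""
    # Band-pass via FFT masking (illustrative; real systems use filters).
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(freqs < band[0]) | (freqs > band[1])] = 0.0
    xb = np.fft.irfft(X, n=len(x))
    env = xb ** 2                      # square-law envelope detector
    env -= env.mean()                  # remove DC before spectral analysis
    E = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    mod_freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    keep = mod_freqs <= max_mod_hz
    return mod_freqs[keep], E[keep]

# Broadband noise amplitude-modulated at 4 Hz, a stand-in for
# blade-rate modulation of cavitation noise.
rng = np.random.default_rng(0)
fs, dur = 16000, 4.0
t = np.arange(int(fs * dur)) / fs
carrier = rng.standard_normal(t.size)
x = (1.0 + 0.8 * np.cos(2 * np.pi * 4.0 * t)) * carrier
f, spec = demon_spectrum(x, fs)
peak_hz = f[1:][np.argmax(spec[1:])]   # skip the DC bin
```

With a 4 Hz modulation injected, the strongest modulation line recovered by the sketch sits at 4 Hz, which is exactly the kind of tonal a downstream classifier consumes.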
The publicly available datasets that anchor much of the open research — DeepShip from the University of Victoria, ShipsEar from the University of Vigo, and the smaller VTUAD set — provide a useful but limited training surface. They are dominated by surface-vessel signatures in coastal acoustic environments, which makes them appropriate for vessel-class research but inadequate as a stand-in for operationally meaningful conditions. Survey papers in IEEE JOE (Journal of Oceanic Engineering) and the Journal of the Acoustical Society of America have catalogued the gap between dataset diversity and operational diversity for at least a decade.
The choice between narrowband and broadband framing also shapes the ML architecture. Narrowband framing favors models that learn slow-time tonal evolution — recurrent encoders, sliding-window CNNs over LOFAR-grams, attention over time-frequency tiles. Broadband framing favors models that learn cross-band correlations and short-time signature integrity — short-time CNNs, time-distributed transformers, and self-supervised pretraining on unlabeled hydrophone data. Most published competitive systems run both pathways and combine outputs at a later stage rather than choosing one framing wholesale.
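The LOFAR-gram input that the narrowband pathway consumes can be sketched in numpy. The per-frame median normalization below is a simple stand-in for the split-window normalizers used in practice, and the tone frequency and window sizes are illustrative.

```python
import numpy as np

def lofargram(x, fs, nfft=1024, hop=512):
    """STFT magnitude with per-frame background normalization, a minimal
    stand-in for the LOFAR-gram that narrowband models consume."""
    n_frames = 1 + (len(x) - nfft) // hop
    win = np.hanning(nfft)
    frames = np.stack([x[i * hop:i * hop + nfft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))           # (time, freq)
    background = np.median(mag, axis=1, keepdims=True)  # broadband level
    return mag / (background + 1e-12)                   # tonals stand out

fs = 8000
t = np.arange(fs * 2) / fs
rng = np.random.default_rng(1)
# A persistent 312.5 Hz tonal buried in noise.
x = np.sin(2 * np.pi * 312.5 * t) + 0.5 * rng.standard_normal(t.size)
gram = lofargram(x, fs)
tonal_bin = round(312.5 * 1024 / fs)   # = 40, the bin holding the tonal
```

A sliding-window CNN or attention model then operates on `gram`, where the tonal appears as a bright vertical track against a normalized background near 1.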
Tracking-by-detection
The classical tracking-by-detection paradigm — detect contacts, associate them across time, maintain tracks — has been augmented by learned components. Learned association functions, often inspired by visual multi-object tracking work such as DeepSORT and ByteTrack adapted to acoustic features, handle dense or ambiguous contact fields better than classical Mahalanobis-distance metrics. Particle filters and IMM (interacting multiple model) trackers remain widely used for state estimation in passive sonar tracking, alongside an active research literature on neural augmentation of the dynamics models.
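The association step can be illustrated with a toy two-contact crossing in Python. The cost function below is the classical gated squared-Mahalanobis distance; a learned association function would replace it with a distance over learned acoustic embeddings. All bearings, rates, and sigmas here are made up for illustration.

```python
from itertools import permutations
import math

def mahalanobis_gate_cost(track, det, max_cost=1e6):
    """Classical gated cost: squared Mahalanobis distance over
    (bearing, bearing-rate), with a 3-sigma gate."""
    d = [(det[i] - track["pred"][i]) / track["sigma"][i] for i in range(2)]
    m2 = d[0] ** 2 + d[1] ** 2
    return m2 if m2 < 9.0 else max_cost

def associate(tracks, dets, cost_fn):
    """Optimal track-to-detection assignment by exhaustive search
    (fine for a handful of contacts; real systems use Hungarian/JPDA)."""
    best, best_cost = None, math.inf
    for perm in permutations(range(len(dets))):
        c = sum(cost_fn(tr, dets[j]) for tr, j in zip(tracks, perm))
        if c < best_cost:
            best, best_cost = perm, c
    return best

# Two crossing contacts: bearings nearly equal, bearing *rates* disambiguate.
tracks = [
    {"pred": (45.0, +0.5), "sigma": (1.0, 0.1)},   # bearing deg, rate deg/s
    {"pred": (46.0, -0.5), "sigma": (1.0, 0.1)},
]
dets = [(45.9, -0.48), (45.2, +0.52)]
assignment = associate(tracks, dets, mahalanobis_gate_cost)
```

The rate term resolves the crossing (detection 1 goes to track 0, detection 0 to track 1); learned association generalizes the same idea by folding richer signature features into the cost.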
Bearing estimation is the most operationally consequential output of the front end, and it interacts heavily with tracking. Published methods range from classical MUSIC and ESPRIT to learned beamforming networks (Bianco et al., IEEE JOE) and convolutional bearing-estimation models. Each has different sensitivity to array imperfections, snapshot count, and SNR. Tracking systems built on a single bearing-estimation method tend to inherit that method's failure modes; tracking systems that consume calibrated bearing distributions, with reported uncertainty, generalize better.
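The simplest of the bearing estimators named above, the conventional (Bartlett) beamformer, can be sketched for a uniform line array in numpy. The array geometry, SNR, and scan grid below are illustrative.

```python
import numpy as np

def bartlett_bearing(snapshots, d_over_lambda, angles_deg):
    """Conventional (Bartlett) beamformer power over candidate bearings
    for a uniform line array; the argmax is the bearing estimate."""
    n_sensors = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]  # sample cov.
    powers = []
    for a in np.deg2rad(angles_deg):
        # Steering vector for a plane wave arriving at angle a from broadside.
        v = np.exp(-2j * np.pi * d_over_lambda
                   * np.arange(n_sensors) * np.sin(a))
        powers.append(np.real(v.conj() @ R @ v) / n_sensors)
    return np.array(powers)

rng = np.random.default_rng(2)
n_sensors, n_snap, true_deg = 16, 200, 20.0
steer = np.exp(-2j * np.pi * 0.5 * np.arange(n_sensors)
               * np.sin(np.deg2rad(true_deg)))
sig = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
noise = 0.3 * (rng.standard_normal((n_sensors, n_snap))
               + 1j * rng.standard_normal((n_sensors, n_snap)))
X = np.outer(steer, sig) + noise
angles = np.arange(-90.0, 90.5, 0.5)
est_deg = angles[np.argmax(bartlett_bearing(X, 0.5, angles))]
```

The full power curve, not just the argmax, is what a tracker that consumes calibrated bearing distributions would ingest; collapsing it to a point estimate discards the uncertainty the text argues for reporting.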
Track management — track initiation, confirmation, deletion, and split/merge handling — is where most operational pain lives. The published research on multiple-hypothesis tracking (MHT) and probabilistic data association (PDA) provides a foundation, but operational evaluations consistently show that the parameters governing track lifecycle dominate end-state performance. Recent work that learns track management policies from labeled track outcomes is an active research direction; results are dataset-dependent, and the published discipline emphasizes evaluation against operationally realistic clutter and crossing scenarios.
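The lifecycle parameters the paragraph describes can be made concrete with a toy M-of-N manager in Python. The thresholds are illustrative, and a real system wraps this logic around MHT or PDA state estimation.

```python
class TrackManager:
    """Minimal M-of-N track lifecycle: confirm a tentative track after
    m_confirm hits within a sliding window of n_window scans, delete
    after n_miss consecutive misses. These few thresholds often
    dominate end-state track quality."""

    def __init__(self, m_confirm=3, n_window=5, n_miss=4):
        self.m_confirm, self.n_window, self.n_miss = m_confirm, n_window, n_miss
        self.history, self.misses, self.state = [], 0, "tentative"

    def update(self, detected):
        self.history = (self.history + [detected])[-self.n_window:]
        self.misses = 0 if detected else self.misses + 1
        if self.state == "tentative" and sum(self.history) >= self.m_confirm:
            self.state = "confirmed"
        if self.misses >= self.n_miss:
            self.state = "deleted"
        return self.state

# One contact: hit, miss, hit, hit, then four consecutive misses.
track = TrackManager()
states = [track.update(d) for d in [1, 0, 1, 1, 0, 0, 0, 0]]
```

The track confirms on the fourth scan (3 hits inside the window) and deletes on the eighth (4 straight misses); learned track-management policies amount to replacing these fixed thresholds with a policy fit to labeled track outcomes.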
Joint detection-tracking
Recent published work has explored architectures that share representations between detection and tracking stages, with mixed results. The intuition is that detection and tracking depend on overlapping features, so sharing learned representations should reduce duplicated computation and improve consistency. The empirical evidence in the open literature — including work in IEEE OCEANS proceedings and JASA — suggests improvements on benchmark sets that disappear or invert when systems are evaluated on out-of-distribution conditions or when tracking metrics are evaluated end-to-end rather than over the detection set.
The methodological lesson is that component-only evaluation is misleading. A detector that scores higher on F1 over a detection benchmark may produce a worse track set, because the additional detections are correlated noise that destabilizes association. End-to-end track-quality metrics — OSPA (Optimal Sub-Pattern Assignment), GOSPA, and operational track-life metrics — capture what users care about and tend to favor architectures that the component metrics deprioritize.
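A minimal OSPA implementation makes the point concrete: adding one false track barely moves per-detection metrics but moves OSPA sharply. Positions are 1-D and values illustrative, and the brute-force assignment is only suitable for toy set sizes.

```python
from itertools import permutations

def ospa(truth, est, c=10.0, p=2):
    """OSPA distance between truth and estimated track positions:
    penalizes both localization error and cardinality mismatch,
    which per-detection F1 does not."""
    if len(truth) > len(est):
        truth, est = est, truth
    m, n = len(truth), len(est)
    if n == 0:
        return 0.0
    # Best one-to-one assignment of the m truths to the n estimates.
    best = min(
        sum(min(abs(t - est[j]), c) ** p for t, j in zip(truth, perm))
        for perm in permutations(range(n), m)
    )
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)

# Same two true contacts, with and without one spurious extra track.
clean = ospa([10.0, 50.0], [10.5, 49.0])             # localization error only
cluttered = ospa([10.0, 50.0], [10.5, 49.0, 80.0])   # one false track added
```

The false track pushes the distance from under 1 to over 5 via the cardinality term, which is exactly the degradation a detection-set F1 score can fail to register.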
Whether to share representations or to keep stages independent is therefore an empirical question on each problem. The published research norm is to evaluate both architectures under the same end-to-end conditions and to report both component and end-to-end metrics rather than choosing the framing that flatters the proposed system.
Narrowband processing. Extracts persistent tonal signatures through LOFAR and DEMON techniques and remains the entry point for tonal classification.
Broadband processing. Tracks signature integrity across wider frequency bands and benefits from learned attention over time-frequency representations.
Tracking-by-detection. Detects contacts, associates them across time, and maintains tracks; learned association functions outperform classical metrics in dense scenes.
Joint detection-tracking. Shares representations between detection and tracking stages; results are mixed and depend on end-to-end evaluation discipline.
| Public dataset | Source | Use | Operational gap |
|---|---|---|---|
| DeepShip | University of Victoria (open) | Vessel-class ML benchmarks | Coastal surface-vessel emphasis; limited environmental diversity |
| ShipsEar | University of Vigo (open) | Vessel-class baselines, transfer-learning studies | Smaller scale; coastal Iberian environment |
| VTUAD | Open underwater acoustic set | Pretraining and self-supervised learning | Diversity bounded by collection geography |
| Synthetic (Bellhop, Kraken) | Open propagation models | Augmentation, environment-conditioned training | Synthetic-to-real gap must be measured and reported |
Data scarcity and synthetic generation
Real passive-sonar data is constrained. Publicly available datasets are limited, and operationally meaningful data is often classified. Synthetic generation, based on environment modeling and known signature characteristics, is part of the engineering toolchain. Acoustic propagation models such as Bellhop (ray-tracing) and Kraken (normal-mode) are openly available, and combining them with parametric source models lets researchers simulate conditions not present in real datasets — different bottom types, different propagation regimes, different vessel classes.
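A common augmentation step in this toolchain, mixing a signature into background noise at a controlled SNR, is a few lines of numpy. The sinusoidal signature and white noise below are placeholders for propagation-model output and recorded ambient noise.

```python
import numpy as np

def mix_at_snr(signature, noise, snr_db):
    """Scale a (synthetic or measured) source signature into background
    noise at a target SNR, a standard augmentation when real data is
    scarce."""
    p_sig = np.mean(signature ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_noise / p_sig * 10.0 ** (snr_db / 10.0))
    return gain * signature + noise

rng = np.random.default_rng(3)
sig = np.sin(2 * np.pi * 60.0 * np.arange(8000) / 8000.0)
noise = rng.standard_normal(8000)
mixed = mix_at_snr(sig, noise, snr_db=-5.0)

# Verify the mix landed at the requested SNR.
scaled_p = np.mean((mixed - noise) ** 2)
measured_db = 10.0 * np.log10(scaled_p / np.mean(noise ** 2))
```

Sweeping `snr_db` over an operationally plausible range, rather than training at one comfortable SNR, is part of the discipline the surveys call for.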
The published validation discipline is to maintain a real held-out set and to report performance there, with synthetic-real gap as a key methodology number. Self-supervised pretraining on unlabeled hydrophone data, followed by supervised fine-tuning on a small real labeled set, has emerged in the open literature as a practical compromise. Recent contrastive and masked-spectrogram methods adapted from the audio ML community (HuBERT, BYOL-A, MAE-AST) have been applied to underwater acoustics with reported improvements over fully supervised baselines.
Domain adaptation between synthetic and real domains has its own active literature, including adversarial-feature alignment, batch-normalization recalibration on real data, and explicit physics-informed regularization. None of these methods produces zero gap; the discipline is to report the gap honestly and to design the pipeline so that the system can be re-tuned cheaply when new real data becomes available.
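The cheapest of these adaptations, re-estimating normalization statistics on unlabeled real-domain data (the analogue of batch-norm recalibration), can be sketched in numpy. The feature matrices below are synthetic stand-ins for front-end feature vectors.

```python
import numpy as np

def recalibrate_stats(features_real):
    """Recompute per-feature normalization statistics from unlabeled
    real-domain data, the low-cost end of the synthetic-to-real
    adaptation toolbox."""
    mu = features_real.mean(axis=0)
    sigma = features_real.std(axis=0) + 1e-8
    return mu, sigma

rng = np.random.default_rng(4)
synthetic = rng.standard_normal((1000, 8))          # training-domain features
real = 2.0 * rng.standard_normal((1000, 8)) + 1.5   # shifted real domain
mu, sigma = recalibrate_stats(real)
real_norm = (real - mu) / sigma                     # back to zero-mean, unit-var
```

Because only statistics are re-estimated, no labels are needed, which is why this is often the first re-tuning step when a small tranche of new real data arrives.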
Operator interaction
Passive sonar operators are highly trained, and a system that does not integrate cleanly with their existing workflow will not be used. The published human-factors literature — including ONR-funded work on sonar-operator decision-making and Navy NAVSEA reports on operator-machine teaming — emphasizes calibrated confidence, anomaly explanation, and clear separation between system suggestions and operator decisions.
Display integration is a non-trivial constraint. Existing sonar displays — waterfall views, bearing-time displays, narrowband and broadband panels — have decades of operator training behind them, and any ML output that ignores these conventions imposes cognitive cost on the operator. The published research consistently shows that systems that augment existing displays with calibrated confidence overlays, anomaly markers, and audit trails are accepted at higher rates than systems that introduce parallel displays or that take automatic action without operator concurrence.
The published research consistently emphasizes that operator-facing tooling, calibration, and explanation matter alongside raw model accuracy. A higher-accuracy model that is rejected by operators contributes nothing operationally; a slightly lower-accuracy model that fits the operator workflow and produces auditable outputs scales into routine use.
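Calibrated confidence is measurable. A standard check is expected calibration error (ECE), sketched below in numpy with simulated well-calibrated and overconfident detectors; the bin count and sample sizes are illustrative.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Expected calibration error: confidence-weighted average gap
    between reported confidence and observed accuracy, binned by
    confidence. An operator can only trust a confidence overlay if
    this gap is small."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(5)
conf = rng.uniform(0.5, 1.0, 5000)
# Well-calibrated: correctness rate matches reported confidence.
well_calibrated = (rng.uniform(size=5000) < conf).astype(float)
# Overconfident: true accuracy runs 0.2 below reported confidence.
overconfident = (rng.uniform(size=5000) < conf - 0.2).astype(float)
ece_good = expected_calibration_error(conf, well_calibrated)
ece_bad = expected_calibration_error(conf, overconfident)
```

Reporting a number like this alongside accuracy is the auditable artifact the human-factors literature asks for.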
Common questions on the public-record framing
How does the literature handle data scarcity?
Through Bellhop and Kraken physics simulation, self-supervised pretraining adapted from audio ML (HuBERT, BYOL-A, MAE-AST), and disciplined reporting of the synthetic-to-real gap as a methodology number.
Why is operator integration not optional?
Highly trained sonar operators reject systems that don't integrate cleanly. Calibrated confidence, anomaly explanation, and clear separation between system suggestions and operator decisions are the published human-factors baseline.
What does this article not cover?
Specific platform integrations, specific signature characterizations under restriction, or any Precision Federal passive-sonar architectural approach.
Frequently asked questions
What is the difference between narrowband and broadband processing?
Narrowband processing focuses on persistent tonal signatures at specific frequencies, often using LOFAR or DEMON techniques. Broadband processing operates across wider frequency bands and tracks signature integrity over time. The two approaches are complementary and often run in parallel within a passive sonar processing chain.
Why is passive-sonar training data so scarce?
Operationally meaningful passive sonar recordings are usually classified or otherwise restricted. Publicly available datasets exist but are limited in size, environmental coverage, and signature diversity. This is why synthetic generation — and a discipline of reporting the synthetic-to-real gap — is a standard part of the published research.
Does the classical signal-processing front end still matter?
Yes. The published literature consistently treats the classical signal-processing front end as the foundation, with modern ML methods entering at specific layers. Replacing the classical front end wholesale tends to produce worse, not better, downstream results.
Why does operator integration matter as much as accuracy?
Passive sonar operators are highly trained and operate within established workflows. A model that ignores the operator interface — that does not present calibrated confidence, that does not explain anomalies, or that confuses system suggestions with operator decisions — will not be adopted regardless of raw accuracy.
How we use this site
We write articles like this to make our reading visible — what we think the open literature says, what we think the open gaps are, and where careful work might land. We do not use these pages to preview proposed approaches in active program spaces. Precision Federal is a software-only SBIR firm. If your office is funding work in this area and would value a software-first partner with a documented public-reading habit, we welcome the introduction.