The problem in plain language

Federal systems run on software the government did not write and cannot always inspect — firmware on a network switch, a closed-source library inside a vendor application, a compiled module shipped by a subcontractor four supply-chain hops away. The question this article addresses is straightforward: how do you find vulnerabilities and implants in software when you only have the binary — the compiled program — and not the source code?
The published answer is a stack of complementary methods. Static analysis reads the binary without running it. Dynamic analysis runs it inside a sandbox (an isolated environment) and watches what it does. AI-driven anomaly detection flags executions that do not match what a clean version of the program should look like. None of these methods works alone; the published research is consistent that the combination is what produces credible coverage.
Why source code isn’t available
The federal supply chain is full of black boxes. Commercial off-the-shelf (COTS) hardware ships with firmware the manufacturer protects as intellectual property. Closed-source libraries are linked into vendor applications without ever exposing their internals. Vendors hand the government a signed binary; what is inside it is the vendor’s problem, until it becomes the government’s problem.
Even when source code is available in principle, it is often not available in practice for the specific build that shipped. A subtle bug introduced by a compiler optimization, an environment variable at build time, or a linked library version mismatch can exist in the binary and not in the source. Inspecting the binary is therefore the only way to be sure about what actually got delivered.
This is why binary-level analysis matters in the federal context specifically. It is not a niche corner of security research; it is the entry point for assessing the trustworthiness of essentially every piece of vendor software the government acquires.
Static analysis with Ghidra and angr
Static analysis is the practice of reading the binary as data — never running it — and reasoning about what it would do if you did. The two open-source workhorses in published research are Ghidra (an open-source binary reverse-engineering toolkit released by NSA in 2019) and angr (an open-source binary-analysis framework from UC Santa Barbara).
Ghidra disassembles the binary, recognizes function boundaries, and reconstructs an approximate version of high-level control flow. It is used heavily for human-driven reverse engineering and increasingly as the front end for AI systems that consume the disassembly as input. The published work pairs Ghidra’s output with neural models trained on disassembled functions to predict where vulnerabilities are likely to live.
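As a concrete sketch of that front-end role, a small Ghidra Python (Jython) script can walk every recognized function and emit its mnemonic sequence for a downstream model to consume. This runs from Ghidra's Script Manager against an already-analyzed program; the one-line-per-function output format here is our own choice for illustration, not a Ghidra convention.

```python
# Ghidra script (Jython). `currentProgram`, `getFirstFunction`, and
# `getFunctionAfter` are provided by Ghidra's scripting environment.

listing = currentProgram.getListing()

func = getFirstFunction()
while func is not None:
    mnemonics = []
    # Iterate the instructions inside this function's body, in address order.
    for insn in listing.getInstructions(func.getBody(), True):
        mnemonics.append(insn.getMnemonicString())
    # One pipe-separated line per function: name, entry point, mnemonic sequence.
    print("%s|%s|%s" % (func.getName(), func.getEntryPoint(), " ".join(mnemonics)))
    func = getFunctionAfter(func)
```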
angr does symbolic execution — it explores paths through a binary using symbolic placeholders for inputs, building up a logical formula for the conditions that trigger each path. Symbolic execution is powerful but expensive: on complex binaries the number of paths explodes beyond what any machine can exhaustively explore. The published systems balance this by using AI-driven heuristics to decide which paths are worth exploring and which to skip.
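A minimal angr sketch of that workflow, with a placeholder binary path and placeholder addresses (neither comes from any specific published system):

```python
import angr

# Load the delivered binary without pulling in shared-library code,
# which keeps the state space smaller.
proj = angr.Project("./vendor_binary", auto_load_libs=False)

state = proj.factory.entry_state()
simgr = proj.factory.simulation_manager(state)

# Explore toward a suspicious address and away from the ordinary exit
# path. Both addresses are placeholders for illustration.
simgr.explore(find=0x401337, avoid=0x401400)

if simgr.found:
    # Ask the constraint solver for a concrete stdin that reaches the target.
    print(simgr.found[0].posix.dumps(0))
```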
Dynamic analysis and fuzzing with AFL++
Dynamic analysis runs the binary inside an isolated sandbox and watches what it does. The dominant open-source fuzzer in published vulnerability research is AFL++ (the community-maintained successor to American Fuzzy Lop) — a coverage-guided fuzzer that mutates inputs and feeds them to the program at high speed, looking for crashes and unusual behavior.
The mental model is simple. The fuzzer tries millions of weird inputs per hour. If any input causes the program to crash, hang, or wander into a previously unexplored region of code, that input is a candidate for a vulnerability. AFL++ uses lightweight instrumentation (small probes inserted into the binary) to know which code paths each input covers, and it focuses on inputs that exercise new paths.
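To make that feedback loop concrete, here is a toy sketch of the coverage-guided cycle. It is not AFL++: run_with_coverage stands in for "execute the instrumented binary and report which edges this input covered", and both the fake coverage and the fake crash condition exist only to show the shape of the loop.

```python
import random

def run_with_coverage(data: bytes):
    """Placeholder: run the target on `data`, return (crashed, covered edges)."""
    covered = frozenset(data[:4])          # fake "coverage" for illustration
    crashed = data.startswith(b"BUG!")     # fake crash condition
    return crashed, covered

def mutate(data: bytes) -> bytes:
    buf = bytearray(data or b"\x00")
    buf[random.randrange(len(buf))] = random.randrange(256)
    if random.random() < 0.1:
        buf += bytes([random.randrange(256)])
    return bytes(buf)

corpus = [b"seed"]
seen_edges = set()
crashes = []

for _ in range(100_000):
    candidate = mutate(random.choice(corpus))
    crashed, edges = run_with_coverage(candidate)
    if crashed:
        crashes.append(candidate)          # candidate vulnerability
    elif not edges <= seen_edges:          # new coverage: keep this input
        seen_edges |= edges
        corpus.append(candidate)
```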
For binaries the fuzzer cannot recompile, AFL++ supports binary-only fuzzing through QEMU-mode emulation, with other dynamic binary instrumentation frameworks such as DynamoRIO filling the same role in related tooling. The performance cost is real — binary fuzzing is slower than source-instrumented fuzzing — but the published research treats it as the price of admission for closed-source assessment.
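In practice a binary-only campaign is a command-line invocation; a minimal sketch of driving it from Python, with placeholder paths, looks like this (the -Q flag selects QEMU mode, and @@ is AFL++'s marker for where the current test case's file path is substituted):

```python
import subprocess

cmd = [
    "afl-fuzz",
    "-Q",                  # binary-only fuzzing via QEMU emulation
    "-i", "seeds/",        # seed corpus
    "-o", "findings/",     # crashes, hangs, and queue land here
    "--",
    "./vendor_binary", "@@",
]
subprocess.run(cmd, check=True)
```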
AI anomaly detection on execution traces
An execution trace is a recording of the events a program produced as it ran — the system calls it made, the memory regions it touched, the network packets it sent. Anomaly detection on traces is the AI piece of the stack: train a model on traces from a known-clean run of the program, and flag traces from later runs that deviate in ways the model considers suspicious.
The published methods range from classical statistical models (Markov models, n-gram models over system-call sequences) to deep learning over event embeddings (LSTMs, transformers, graph neural networks where each node is a system call and each edge is a temporal relation). All of them are looking for the same kind of signal: a sequence of events that does not match what the program is supposed to do.
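A minimal sketch of the classical end of that spectrum: learn which system-call trigrams clean runs produce, then score later traces by the fraction of trigrams the model has never seen. The trace format and the threshold-free score are simplifications for illustration.

```python
from collections import Counter

def ngrams(trace, n=3):
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def train(clean_traces, n=3):
    model = Counter()
    for trace in clean_traces:
        model.update(ngrams(trace, n))
    return model

def anomaly_score(trace, model, n=3):
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)     # fraction of never-before-seen trigrams

clean = [["open", "read", "read", "close", "write"],
         ["open", "read", "close", "write"]]
model = train(clean)

suspect = ["open", "read", "mmap", "mprotect", "connect", "write"]
print(anomaly_score(suspect, model))   # high fraction of unseen trigrams: flag for triage
```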
The hardest part is qualitative judgment about whether an anomaly is malicious or merely unusual. A program may behave differently because the input was unusual, the environment changed, or a benign update modified its logic. The published research treats this as a hybrid quantitative-plus-qualitative problem — the AI flags candidate anomalies, and a human analyst (or a downstream LLM-based explanation system) reasons about whether each flag is genuine.
Supply-chain implant detection
An implant is a piece of code added to an otherwise legitimate binary by an attacker, typically through a compromised build pipeline, a malicious dependency, or a compromised update channel. Detecting implants is harder than detecting native vulnerabilities because the implant is designed to look like normal program behavior.
The published detection patterns combine three signals. The first is binary diffing — comparing the suspect binary against a reference build to find unexpected differences. The second is runtime behavioral comparison — running the suspect binary and a reference binary on identical inputs in parallel sandboxes, and flagging behavioral divergence. The third is provenance verification — checking signed build metadata against a trusted record of how the binary should have been produced.
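A sketch of the first signal, using pyelftools to compare per-section hashes between a reference build and the delivered binary. The paths are placeholders, and real diffing tools work at the function level and tolerate benign differences such as timestamps and build IDs; this only shows the shape of the comparison.

```python
import hashlib
from elftools.elf.elffile import ELFFile

def section_hashes(path):
    """SHA-256 of each ELF section's contents, keyed by section name."""
    hashes = {}
    with open(path, "rb") as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            hashes[section.name] = hashlib.sha256(section.data()).hexdigest()
    return hashes

reference = section_hashes("reference_build/vendor_binary")
suspect = section_hashes("delivered/vendor_binary")

for name in sorted(set(reference) | set(suspect)):
    if reference.get(name) != suspect.get(name):
        # Unexpected difference: candidate location for injected code.
        print("section differs:", name)
```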
The provenance angle is increasingly tied to software supply chain frameworks like SLSA (Supply-chain Levels for Software Artifacts) and to software bill-of-materials (SBOM) standards. The published consensus is that none of these signals alone catches every implant; the combination, scored together with AI assistance, is where the methodology has been heading for the last two years.
Trusting the toolchain
A subtle published concern is that the analysis tools themselves are software in a supply chain. If Ghidra, angr, or AFL++ were compromised, the analysis they produced would be compromised too. The mature published response is to run analysis tools in isolated environments, to verify their integrity through reproducible builds, and to cross-check results across multiple tools rather than trusting any single one.
This is part of why the published gold-standard analyses use more than one disassembler, more than one symbolic-execution engine, and more than one fuzzer in parallel. Disagreement between tools is itself a useful signal — when two tools disagree about whether a function is reachable or whether a path is feasible, that disagreement is often where bugs hide.
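A sketch of that cross-check: compare the function entry points two tools report and surface the disagreements for manual review. The Ghidra-side loader is a placeholder for however the headless export is formatted; the angr side uses its standard CFG recovery.

```python
import angr

def ghidra_function_starts(export_path):
    """Placeholder: parse entry points from a Ghidra export
    (here assumed to be pipe-separated lines with the address second)."""
    with open(export_path) as f:
        return {int(line.split("|")[1], 16) for line in f if "|" in line}

def angr_function_starts(binary_path):
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()      # recover the control-flow graph
    return set(cfg.kb.functions)       # function entry addresses

ghidra_starts = ghidra_function_starts("exports/vendor_binary.funcs")
angr_starts = angr_function_starts("delivered/vendor_binary")

# Functions only one tool found are exactly where a human should look first.
print("Ghidra only:", [hex(a) for a in sorted(ghidra_starts - angr_starts)][:20])
print("angr only:  ", [hex(a) for a in sorted(angr_starts - ghidra_starts)][:20])
```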
The trust-the-toolchain pattern also appears in NIST guidance on assessing software supply chains. The federal SCRM literature treats binary-analysis tooling as part of the supply chain to be vetted, not as a neutral instrument that sits outside it.
Evaluating against zero-day classes
A zero-day vulnerability is one that is unknown to defenders — no patch exists, no signature is in any database. Evaluating binary-analysis systems against zero-day classes is hard for the obvious reason: the zero-days you can evaluate against are the ones already discovered, which makes them not zero-days anymore by the time the evaluation runs.
The published evaluation discipline works around this by using held-out classes (training on some CWE classes — Common Weakness Enumeration categories — and evaluating on others), retroactive evaluation (training on data from before a known disclosure and testing whether the system would have caught the issue cold), and synthetic vulnerability injection (deliberately introducing bugs of a known class into binaries and testing detection).
Each method has weaknesses. Retroactive evaluation is biased toward bugs that were findable with the data available at the time. Held-out classes can miss the way real attackers combine multiple weaknesses. Synthetic injection is only as realistic as the injection methodology. The norm in published work is to use more than one evaluation strategy and to report results separately rather than averaging them into a single score.
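A sketch of the held-out-class discipline, with a placeholder detector and a placeholder (features, CWE, label) sample format: split by weakness class rather than at random, evaluate only on classes the detector never saw, and report per-class results separately instead of one averaged score.

```python
from collections import defaultdict

class ThresholdModel:
    """Placeholder detector: flags a sample when its score exceeds a cutoff."""
    def __init__(self, cutoff=0.5):
        self.cutoff = cutoff
    def predict(self, features):
        return features["score"] > self.cutoff

def split_by_cwe(samples, held_out_cwes):
    """Train/test split by weakness class, not by random sampling."""
    train, test = [], []
    for features, cwe, label in samples:
        (test if cwe in held_out_cwes else train).append((features, cwe, label))
    return train, test

def evaluate(model, test):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for features, cwe, label in test:
        pred = model.predict(features)
        key = ("tp" if label else "fp") if pred else ("fn" if label else "tn")
        counts[cwe][key] += 1
    # One line per held-out class: no single averaged score.
    for cwe, c in sorted(counts.items()):
        found = c["tp"] + c["fn"]
        recall = c["tp"] / found if found else 0.0
        print(f"{cwe}: recall={recall:.2f} tp={c['tp']} fn={c['fn']} fp={c['fp']}")

samples = [
    ({"score": 0.9}, "CWE-787", True),    # out-of-bounds write
    ({"score": 0.2}, "CWE-787", True),
    ({"score": 0.7}, "CWE-416", True),    # use-after-free, held out below
    ({"score": 0.1}, "CWE-416", False),
]
train, test = split_by_cwe(samples, held_out_cwes={"CWE-416"})
evaluate(ThresholdModel(), test)
```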
Static analysis. Read the binary without running it — Ghidra and angr are the open-source workhorses.
Dynamic analysis. Run the binary inside a sandbox and watch what it does — AFL++ for fuzzing, QEMU-mode for binary-only.
Anomaly detection. Train AI on clean execution traces and flag deviations from the expected behavior.
Implant detection. Combine binary diffing, runtime divergence, and supply-chain provenance to catch added code.
| Open-source tool | Origin | Strength | Limitation |
|---|---|---|---|
| Ghidra | NSA (released 2019) | Disassembly, decompilation, scripting | Not designed for fully automated workflows |
| angr | UC Santa Barbara | Symbolic execution; binary-analysis Python API | Path explosion on real-world binaries |
| AFL++ | Open-source community | Coverage-guided fuzzing; binary-only mode | Slower in binary-only mode than source-instrumented |
| QEMU-user / DynamoRIO | Open source | Binary instrumentation for fuzzing and tracing | Architecture coverage and instrumentation overhead |
Supply-chain trust frameworks
Binary analysis fits into a broader story about software supply-chain trust. SLSA defines levels of build-pipeline integrity. SBOMs catalogue what is inside a binary. CISA has published guidance on secure-by-design software, and DoD has issued instructions on software assurance.
None of these frameworks replaces binary analysis. SBOMs tell you what dependencies a binary claims to contain; they do not verify that the binary actually contains only those dependencies. SLSA tells you the build pipeline was disciplined; it does not catch a compromise inside a trusted dependency. Binary analysis is what closes the gap between “the supply chain documented this” and “the binary actually does that.”
The published research treats binary analysis and supply-chain frameworks as complementary, with the analysis providing the empirical check on the documentation.
Frequently asked questions
Why isn't an SBOM enough on its own?
Because an SBOM lists what the build claims is inside, not what is actually inside. A compromised build pipeline, a substituted dependency, or an injected implant can put code in the binary that the SBOM never mentions. Binary analysis is the empirical check on the documentation.
Can AI catch zero-days before human analysts do?
The published research is honest: catching specific zero-days before any human has — reliably, repeatedly — is not yet a solved problem. AI is most useful at narrowing the search space, ranking suspicious functions, and flagging anomalies for human review. The combination of AI plus skilled analysts plus multiple-tool disagreement is what produces useful coverage.
How big is the false-positive problem?
Significant. Anomaly detection on execution traces produces many candidates that turn out to be benign environmental differences. The published methods address this through hybrid scoring (statistical anomaly score plus semantic explanation), human triage workflows, and continuous retraining on labeled triage outcomes. False positives are a workflow problem more than a research problem.
Is binary-only fuzzing as effective as source-instrumented fuzzing?
It is slower and somewhat less efficient, but the published evaluations show it finds the same classes of bugs given enough compute. For closed-source vendor binaries, it is the only option. The published systems compensate for the speed gap by running fuzzers in parallel across many CPUs and combining fuzzing with static analysis to focus the search.
How we use this site
We write articles like this one to make our public reading visible — what we think the open literature shows, where the methodological gaps are, and how the open-source toolchain composes. We do not preview proposed approaches in active program spaces. Precision Federal is a software-only SBIR firm. If your office is exploring binary-level vulnerability discovery and would value a software-first partner with a documented public-reading habit, we welcome the introduction.