False Positive Reduction in Underground Threat Detection: The Operational Cost Problem

Detection systems are typically evaluated on sensitivity — the recall metric, asking what fraction of true threats the system catches. In most machine learning benchmarks, higher recall is better. In operational threat detection systems, that framing will get your operators to stop responding to alerts within a few weeks. The false positive rate (FPR) is not a secondary metric. At realistic trigger volumes, it's the metric that determines whether the system is used at all.

Here's the arithmetic. A sensor array monitoring a border corridor generates an ambient trigger rate of approximately 200 candidate events per day from benign sources: small animals, equipment vibration, thermal settling, dripping water, and surface vehicle traffic coupling through the substrate. A system with a 5% false positive rate on those events generates 10 false alerts per day. That's not an overwhelming number. Now layer in a real tunnel corridor with consistent pedestrian and equipment activity: the ambient trigger rate can realistically be 500–800 candidate events per day. At 5% FPR, you're at 25–40 false alerts daily, or 175–280 per week. Against a threat base rate that might be 2–5 actual intrusion attempts per week, the ratio of false to true alerts is not 20:1; it's closer to 70:1 or worse (anywhere from about 35:1 to 140:1 across those ranges). Operators in that environment stop responding to alerts. This is not a theoretical risk; it is the documented operational failure mode of first-generation seismic border detection systems reported in open literature from DHS and DoD after-action analyses.
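The arithmetic above can be checked with a short sketch. The function name and its parameters are ours, chosen for illustration, and the calculation assumes every real intrusion attempt produces exactly one true alert (perfect recall), which flatters the ratio.

```python
def weekly_false_to_true_ratio(ambient_events_per_day, fpr, true_alerts_per_week):
    """Ratio of false alerts to true alerts over one week.

    Assumes every real intrusion attempt yields one true alert (recall = 1).
    """
    false_alerts_per_week = ambient_events_per_day * fpr * 7
    return false_alerts_per_week / true_alerts_per_week


# Most favorable end of the ranges in the text: 500 events/day, 5 attempts/week.
best_case = weekly_false_to_true_ratio(500, 0.05, 5)
# Least favorable end: 800 events/day, 2 attempts/week.
worst_case = weekly_false_to_true_ratio(800, 0.05, 2)

print(f"false:true ratio between {best_case:.0f}:1 and {worst_case:.0f}:1")
```

Any recall below 1 shrinks the denominator and makes the ratio worse than these figures.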

Why Precision Dominates Recall in This Domain

In standard classification terminology: precision is the fraction of positive predictions that are actually positive (TP / (TP + FP)). Recall is the fraction of actual positives that are correctly identified (TP / (TP + FN)). In most consumer and enterprise applications, precision and recall are roughly balanced objectives. In threat detection with a low base rate of true positives and a human response requirement, precision is the primary operational metric.
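As a minimal sketch, the two definitions translate directly into code. The counts in the example are invented for illustration only; they are not measurements from any deployment.

```python
def precision(tp, fp):
    """Fraction of issued alerts that were real threats: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of real threats that produced an alert: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical week: 3 intrusions detected, 1 missed, 210 false alerts.
tp, fn, fp = 3, 1, 210
print(f"precision = {precision(tp, fp):.3f}")  # share of alerts worth responding to
print(f"recall    = {recall(tp, fn):.3f}")     # share of intrusions caught
```

With a low base rate, recall can look respectable while precision sits near zero, and precision is the number the operators experience.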

The asymmetry comes from two facts about this domain. First, the cost of a false negative (missed threat) is catastrophic but occurs rarely. Second, the cost of a false positive is not just one wasted response — it is corrosion of operator vigilance through repeated conditioning that alerts are meaningless. The second cost compounds over time in a way that a point-in-time FPR measurement doesn't capture. An operator team that has responded to 200 false positives has a different behavioral response to the 201st alert than they did to the first. This is basic signal detection theory applied to human operators, and it is why high-precision low-recall systems are operationally preferable in this context to high-recall low-precision systems.

We're not saying recall doesn't matter — missing a real intrusion has severe consequences. We're saying that chasing recall at the expense of precision produces a system that is technically capable but operationally inert, because the operators who need to act on it have learned not to.

Multi-Modal Confirmation as the Primary FPR Reduction Mechanism

The most effective FPR reduction approach we've implemented is requiring confirmation across independent sensing modalities before issuing an operator-facing alert. Our system architecture combines seismic (geophone array), passive acoustic, and — in deployments with sufficient power budget — magnetic anomaly sensing. A candidate event must produce correlated signatures across at least two independent modalities within a physically consistent time window before it is escalated to an alert.
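The confirmation rule can be sketched as follows. This is not our production fusion pipeline; the Detection type, the confirmed function, and the 2-second window are hypothetical stand-ins, and a real system would derive the window from sensor geometry and propagation speeds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    modality: str     # e.g. "seismic", "acoustic", "magnetic"
    timestamp: float  # seconds since an arbitrary epoch


def confirmed(detections, window_s=2.0, min_modalities=2):
    """True if at least `min_modalities` distinct modalities fire
    within `window_s` seconds of one another.

    `window_s` is an assumed placeholder, not a recommended value.
    """
    ordered = sorted(detections, key=lambda d: d.timestamp)
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most window_s.
        while ordered[end].timestamp - ordered[start].timestamp > window_s:
            start += 1
        modalities = {d.modality for d in ordered[start:end + 1]}
        if len(modalities) >= min_modalities:
            return True
    return False


footsteps = [Detection("seismic", 0.0), Detection("acoustic", 0.4)]
small_animal = [Detection("seismic", 0.0), Detection("seismic", 0.5)]
print(confirmed(footsteps), confirmed(small_animal))
```

Repeated hits from a single modality never satisfy the rule, which is what removes most single-sensor artifacts.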

The independence of the modalities is what makes this effective. A small animal moving through an above-ground access shaft produces a seismic signature but not a human-scale acoustic footstep signature and no magnetic anomaly. Equipment vibration from a surface diesel generator produces seismic energy at characteristic harmonic frequencies that are easily distinguished from footstep signatures by the acoustic classifier, even though the seismic signature alone might exceed a simple threshold. The correlation requirement eliminates the vast majority of single-modality artifacts.

The failure mode of multi-modal confirmation is the case where both modalities fail simultaneously in the same direction — for example, a water hammer event in the tunnel plumbing that creates both seismic and acoustic energy at human-comparable signature strength. We observed this failure mode in field testing in an environment with active water infrastructure. The correlation check passed because the event was genuinely multi-modal, not because it was a human. The mitigation is frequency-band discrimination: water hammer events have characteristic frequency content that differs from footfall signatures when analyzed at sufficient frequency resolution. This discrimination is implemented in the current system but has not been tested extensively enough to claim a validated FPR against it.

Dwell-Time Filtering

A second FPR reduction mechanism is dwell-time filtering: requiring that a candidate signal persist across a minimum number of sensor samples before the event is confirmed. The underlying logic is that most impulsive noise sources (rock settling, distant blast vibration, cable tap) produce brief energy signatures, while a moving human produces a time-extended signature consisting of multiple footfall events over several seconds. Requiring, for example, three or more correlated geophone exceedances within a 10-second window before classifying an event filters a large proportion of impulsive false positives while retaining human-class threats.
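A minimal sketch of the example rule (three or more exceedances within a 10-second window) is shown here. The DwellFilter class is illustrative rather than our implementation, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class DwellFilter:
    """Confirm an event once enough threshold exceedances fall inside
    a sliding time window. Timestamps must be non-decreasing."""

    def __init__(self, min_exceedances=3, window_s=10.0):
        self.min_exceedances = min_exceedances
        self.window_s = window_s
        self._times = deque()

    def update(self, timestamp):
        """Record one exceedance; return True if the event is confirmed."""
        self._times.append(timestamp)
        # Drop exceedances that have aged out of the window.
        while timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.min_exceedances


walker = DwellFilter()
print([walker.update(t) for t in (0.0, 0.6, 1.2)])      # footfalls close together

settling = DwellFilter()
print([settling.update(t) for t in (0.0, 30.0, 60.0)])  # isolated impulses
```

This version confirms as soon as the count is reached. A variant that waits for the full window to elapse before deciding would filter more aggressively and alert later.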

The tradeoff with dwell-time filtering is alert latency. A 10-second dwell window means the earliest possible alert is 10 seconds after the first footfall. For a threat moving at 1 m/s in a corridor with a seismic sensor array at 3 m spacing, the threat covers 10 m, roughly three sensor spacings, before an alert is issued. Whether that latency is operationally acceptable depends on the response time requirements of the specific deployment. In corridors where the detection point is far from the intervention point, 10-second latency is inconsequential. In a checkpoint with a 5-meter interdiction zone, it matters significantly.

The dwell-time window is tunable in our system configuration, with a minimum of 3 seconds and a maximum of 30 seconds. The default configuration is 8 seconds based on our internal field test results, which showed the best FPR/latency tradeoff at that setting in our test environment. That default may not be optimal for other deployment contexts; it should be treated as a starting point for site-specific calibration, not a universal setting.
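To make the bounds and the latency cost concrete, here is a sketch of how such a setting might be validated. The DwellConfig class and its method are hypothetical and do not reflect our actual configuration schema; only the 3, 8, and 30 second values come from the text.

```python
from dataclasses import dataclass

MIN_DWELL_S = 3.0
MAX_DWELL_S = 30.0
DEFAULT_DWELL_S = 8.0


@dataclass(frozen=True)
class DwellConfig:
    window_s: float = DEFAULT_DWELL_S

    def __post_init__(self):
        # Reject values outside the supported tuning range.
        if not MIN_DWELL_S <= self.window_s <= MAX_DWELL_S:
            raise ValueError(
                f"window_s must be between {MIN_DWELL_S} and {MAX_DWELL_S} seconds"
            )

    def distance_before_alert_m(self, speed_m_per_s):
        """Distance a target covers during one full dwell window."""
        return speed_m_per_s * self.window_s


print(DwellConfig().distance_before_alert_m(1.0))      # default window, 1 m/s walker
print(DwellConfig(10.0).distance_before_alert_m(1.0))  # the 10-second example
```

Comparing the returned distance against the depth of the interdiction zone is a quick first check on whether a given window is usable at a site.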

Confidence Threshold Tuning and the ROC Curve

Every classifier has a decision threshold, and moving it trades false positives against false negatives. Raising the threshold reduces false positives at the cost of more false negatives; lowering it does the reverse. Sweeping the threshold traces out the ROC (receiver operating characteristic) curve, which plots true positive rate against false positive rate, and, for a given base rate, a corresponding precision-recall curve. The optimal operating point depends on the relative cost of false positives and false negatives in the specific deployment, which, as argued above, strongly favors the high-precision end in operational threat detection.
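The effect of moving the threshold can be seen in a small sketch. The scored events below are toy data invented for illustration, and the function name is ours; the rule that an alert fires when confidence is strictly above the threshold is also an assumption.

```python
def precision_recall_at(threshold, scored_events):
    """Precision and recall when alerting on confidence > threshold.

    scored_events: iterable of (confidence, is_threat) pairs.
    """
    tp = fp = fn = 0
    for confidence, is_threat in scored_events:
        alerted = confidence > threshold
        if alerted and is_threat:
            tp += 1
        elif alerted:
            fp += 1
        elif is_threat:
            fn += 1
    # With no alerts issued, precision is reported as 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


toy_events = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, False), (0.30, True), (0.20, False),
]
for threshold in (0.5, 0.85):
    p, r = precision_recall_at(threshold, toy_events)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy set the higher threshold removes every false alert and also loses one of the detections, which is the tradeoff in miniature.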

Our system uses a confidence score derived from the multi-modal fusion pipeline, with a configurable threshold for alert escalation. Default settings in our internal test environment target a precision of approximately 0.85 at the cost of a recall around 0.78. These figures come from internal test data using simulated pedestrian threats in our Nevada mine test environment, and have not been independently validated or tested against the full range of environmental noise sources present in operational border or military tunnel environments. We state these numbers as indicators of the system's operating regime, not as performance guarantees.

The threshold tuning process is one place where operational calibration is necessary. A fresh deployment in a new environment should be operated initially in a logging mode — capturing all candidate events with their confidence scores, not issuing alerts — to characterize the ambient noise distribution before setting alert thresholds. Deploying with default thresholds calibrated in our test environment and declaring the system operational immediately is not the right procedure. The noise characteristics of each tunnel environment are different enough that site-specific calibration is necessary.
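One way to turn a logging period into a starting threshold is sketched here. It assumes the logged scores are all from benign ambient events (no real intrusions during logging), that alerts fire when confidence is strictly above the threshold, and that the site has agreed on a false-alert budget per day. The function and parameter names are hypothetical.

```python
def calibrate_threshold(logged_scores, days_logged, max_false_alerts_per_day):
    """Lowest threshold at which the logged ambient events would have
    produced no more than the allowed number of alerts.

    Alerts are assumed to fire when score > threshold.
    """
    allowed = int(max_false_alerts_per_day * days_logged)
    ranked = sorted(logged_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0  # every logged event fits inside the budget
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


# Toy log: ten ambient events captured over two days, budget of one alert/day.
ambient_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
threshold = calibrate_threshold(ambient_scores, days_logged=2, max_false_alerts_per_day=1)
print(threshold)
```

A threshold chosen this way says nothing about recall; it still has to be checked against known or simulated threat signatures at the same site.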

Where We Are and What We Haven't Proven

Our current multi-modal confirmation plus dwell-time filtering architecture achieves FPR performance in our internal testing that we consider operationally viable: fewer than 3 false alerts per 8-hour watch period in our test environment's ambient noise conditions. We have not validated this figure outside our test environment. We have not tested extensively against coordinated attempts to create deliberate false triggers — an adversary who knows the system architecture could potentially construct a noise signature that passes the multi-modal confirmation check. That adversarial robustness question is one we are actively researching but cannot claim to have addressed.

The FPR reduction approach described here — multi-modal confirmation, dwell-time filtering, confidence threshold tuning — is a directionally correct architecture for the operational cost problem. Whether it achieves adequate precision in any specific deployment requires field calibration and operational validation in that environment. We don't believe any vendor's claimed FPR figures from controlled tests translate directly to operational performance without site-specific validation, including ours.