VaultFuzion BY KAPARDYN
AI Email Security · 06 May 2026 · 15 min read

Thirteen engines, one verdict: how consensus detection finds what single-model security misses

Single-classifier email security has a blind spot in every direction it does not look. Consensus across thirteen independent engines, plus DMARC AI in the same chain, closes the seams.

— VaultFuzion Detection Engineering

*The case studies below — both the opening law-firm scenario and the closing follow-up — are composites, drawn from incident-response patterns observed across South African legal and financial customers. Specifics are illustrative.*

A senior partner at a Johannesburg law firm received an email on a March morning that asked her to review a draft motion. The sender was a colleague she had emailed two days earlier. The subject line referenced a real matter she was working on. The body text was contextually accurate. The attachment was a Word document that appeared to be the draft. She opened it. The document was a credential harvester.

A modern email security stack should catch that message. Some do. Many do not. The reason has to do with where the signal lives.

The signal in that email was distributed across at least eight different feature spaces: the authentication chain (the message used a legitimate sender with intact SPF and DKIM, signed by a domain the firm communicated with frequently); the content (the body was natural-sounding text generated to fit the recipient's context); the relationship history (this was the first time this sender had sent an attachment of this type to this recipient); the attachment macro-behaviour (the document opened a connection to a domain registered eleven days earlier); the URL pattern (the harvest URL used a typo of the firm's own login portal); the sender behavioural profile (the message was sent at 03:24 SAST, well outside this sender's typical hours); the user-reported corpus (no other firm had reported this exact pattern); and the threat intelligence signals (the harvest infrastructure had been observed, in low volume, by two of our threat-intel feeds in the prior 96 hours).

A single-classifier detection system has to produce a verdict from one of these signals or from a fused composite of all of them. Either path has a failure mode. A single signal can be defeated by adversarial design: phishing crafted to score low on whichever signal is dominant. A fused composite gives the single model many opportunities to be wrong, and one badly tuned weight defeats the entire fusion.

This article explains the alternative: a consensus architecture in which thirteen independent engines each render a verdict, and the platform fuses their verdicts at decision time rather than fusing their inputs at training time. The architecture has been in development since the platform's earliest threat-protection design and has been running in production with selected customers since v1.0.0 shipped in April 2026. The behaviour is non-obvious. This is what we have learned.

The single-classifier problem in adversarial detection

Most email-security detection in 2026 is built around a primary classifier: usually a transformer trained on phishing corpora, sometimes an XGBoost model over hand-engineered features, occasionally a graph neural network over communication relationships. The primary classifier is excellent at the patterns it has seen. It is, by construction, weakest at the patterns it has not.

The economics of adversarial work are such that an attacker only needs to find one blind spot. A defender has to cover all of them. A single classifier is a single surface to probe. Once an attacker finds a class of message that scores below the threshold, they can produce many similar messages until the model is retrained. The window between attack discovery and model retraining is measured in weeks for the best teams and months for everyone else.

The standard mitigations are well known. Add a behavioural overlay that scores anomalies relative to a baseline. Add a sandbox that detonates attachments. Add a URL reputation service. Each addition closes part of the gap. None of them, individually, closes enough of it.

The reason additions help less than they should is that they are usually fused into the primary model's score. The behavioural overlay raises the score by a fixed weight; the sandbox verdict feeds a feature; the URL reputation contributes a numeric reputation value. The fusion produces a single number that has to be calibrated against a single threshold. The calibration is the hardest engineering problem in email security, and it is also the reason single-classifier platforms have either too many false positives, too many false negatives, or both at different times.

Why consensus is harder than majority voting

Consensus detection is not a new idea. The standard form is to run multiple classifiers, count the votes, and emit the majority verdict. That form does not work in production for two reasons.

The first is that the engines are not equally capable on every input. A YARA-style attachment scanner is highly accurate when it sees a known signature and uninformative when it does not. A relationship-graph engine is highly accurate against first-time-edge-to-VIP patterns and silent on first-time-sender-to-anyone. A click-time URL detonator can only score messages that contain URLs. Counting every vote equally treats a non-opinion as if it carried the same weight as an opinion.

The second is that the engines have different calibration profiles. An ML classifier emits a continuous score from 0 to 1 calibrated against a training distribution. A heuristic engine emits a score from a small set of discrete values determined by a rule set. A reputation engine emits a binary verdict that has been derived from a global feed. Adding the scores and dividing by 13 produces a number that reflects nothing.
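The dilution problem can be seen in a small sketch. The engine names and raw values below are hypothetical, chosen only to illustrate the point; they are not taken from the platform.

```python
from statistics import mean

# Hypothetical raw outputs for one message.
# None means the engine had nothing to score.
raw_scores = {
    "xgboost": 0.62,         # continuous ML score
    "rspamd": 0.25,          # discrete heuristic bucket
    "url_reputation": 1.0,   # binary feed verdict
    "yara": None,            # no signature hit: silence, not a SAFE vote
    "url_sandbox": None,     # nothing was detonated for this message
}


def naive_average(scores, engine_count):
    """Sum every score, count silence as 0.0, divide by the engine count."""
    return sum(s or 0.0 for s in scores.values()) / engine_count


def mean_of_opinions(scores):
    """Average only the engines that actually produced an opinion."""
    opinions = [s for s in scores.values() if s is not None]
    return mean(opinions) if opinions else None


naive = naive_average(raw_scores, engine_count=13)
voiced = mean_of_opinions(raw_scores)
```

Here `naive` lands far below `voiced` because silent engines are counted as zeros. Even `voiced` is not a meaningful number, since it still mixes a continuous score, a discrete bucket, and a binary verdict on one scale; it only removes the first of the two problems.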

The architecture we use treats consensus as a multi-stage decision rather than a vote-count. Each engine produces a ScanResult with a score, a confidence, and the feature that drove the score. The scores are normalised against per-engine calibration. The fusion stage then asks specific structural questions: does the manual deny list produce a definitive bypass at confidence 0.99 or higher? Are there at least two independent engines emitting MALICIOUS at confidence above 0.8? Is there a single engine emitting MALICIOUS while another engine on the same content surface is emitting SAFE? Each of these structural questions has a known answer that drives the final verdict.
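As a minimal sketch of the data shape, a ScanResult might look like the following. The article names only the score, the confidence, and the driving feature; the `engine`, `surface`, and `verdict` fields, and the `same_surface_conflicts` helper, are assumptions added here to illustrate the third structural question.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SAFE = "SAFE"
    SUSPICIOUS = "SUSPICIOUS"
    MALICIOUS = "MALICIOUS"


@dataclass(frozen=True)
class ScanResult:
    engine: str            # hypothetical engine identifier
    surface: str           # hypothetical label for the input the engine reads
    verdict: Verdict
    score: float           # normalised to the range 0..1
    confidence: float
    driving_feature: str   # the feature that drove the score


def same_surface_conflicts(results):
    """Return (malicious_engine, safe_engine) pairs that disagree on one surface."""
    conflicts = []
    for m in results:
        if m.verdict is not Verdict.MALICIOUS:
            continue
        for s in results:
            if s.verdict is Verdict.SAFE and s.surface == m.surface:
                conflicts.append((m.engine, s.engine))
    return conflicts
```

A MALICIOUS content verdict paired with a SAFE content verdict is returned as a conflict, while a SAFE verdict from an engine on a different surface is not.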

The two-engine rule is the central one. A MALICIOUS verdict requires two independent engines to agree at confidence above 0.8. Independence is enforced architecturally — engines that share input features do not count as independent for this rule. The rule means that no single broken engine can produce a false positive. It also means that adversarial tuning against one engine has to be repeated against a second, which is significantly harder.

The thirteen engines and what each is actually for

The list below is what produces a verdict on every inbound message that enters the detection pipeline. The engines run in parallel against the same message; their outputs are independent.

**1. ONNX XGBoost classifier.** The primary ML signal. Twenty hand-engineered features over the message envelope, headers, content tokens, and metadata. Trained against a labelled corpus that grows weekly via the analyst feedback loop. The classifier produces a score on every inbound message that passes the pre-pipeline gates and is calibrated against a 0.8 threshold for MALICIOUS contribution.

**2. DistilBERT NLP phishing detector.** A second-opinion content classifier built on a distilled BERT variant. The model is fine-tuned on a phishing corpus distinct from the XGBoost training set, so it produces an architecturally independent content opinion. Particularly effective on social-engineering content where the linguistic register diverges from authentic business correspondence.

**3. rspamd heuristic + Bayes engine.** The mature, battle-tested open-source signal. rspamd's heuristic rules and Bayesian content classifier add a third content opinion that is independent of both ML models. The signal is strongest on commodity spam and phishing patterns that have been seen at scale.

**4. YARA attachment scanner.** Custom YARA rules over attachment content, including office-document macros, PDF action streams, and archive contents. The signal is binary-with-context: a YARA hit is a strong MALICIOUS signal; a YARA miss is silence, not a SAFE vote.

**5. Playwright URL sandbox detonation.** Suspicious URLs are rendered in a headless browser. The sandbox observes redirects, captcha-gated landing pages, credential-harvest forms, and exploit kits. The verdict comes from the rendered behaviour, not the URL string. This catches phishing where the URL itself looks innocuous but the destination is a credential harvester.

**6. Sender Behavioural Profiler.** A per-sender profiling engine that learns each sender's communication baseline — cadence, recipients, infrastructure path, lexical fingerprint, time-of-day distribution. The Profiler enters production once the tenant has ingested 500 messages — during initial ingest the engine produces no signal at all, to avoid polluting the analyst feedback loop with noise during the cold-start window. After learning, deviations from the baseline produce signals: frequency spike, new-recipient-class first contact, infrastructure shift, unusual hour. These signals are weak alone — never sufficient for MALICIOUS — but they elevate the priority of other signals.

**7. Sender Mismatch detector.** Compares the From header against the Sender and the envelope sender. Detects display-name spoofing and look-alike domains. Critically, the detector is root-domain aware — bounces@tm.openai.com sending on behalf of noreply@openai.com is not flagged because both domains share the openai.com root. This eliminates a large class of false positives that affect implementations that are not root-domain aware.

**8. URL Reputation + Click-Time Re-Validation.** Each URL in the message is checked against multiple reputation feeds at delivery time. The URL is also rewritten with an HMAC-signed redirect that re-validates the reputation at click time. This is the temporal seam that catches phishing where the URL is benign at delivery and weaponised hours later. Click-time data feeds back into the consensus engine on subsequent messages from the same sender.

**9. BEC Behavioural Graph.** A relationship-graph engine that scores messages against the recipient's communication graph. The engine covers multiple relationship-graph patterns including bank-detail-change-fresh-vendor, reply-to-redirect-fresh, typosquat-of-real-contact, first-time-edge-to-VIP, edge-frequency five-sigma deviation, centrality-drop, topic-drift-with-urgency, and first-relationship-contact. The engine produces strong signals on business email compromise where the sender's account has been hijacked but the relationship pattern reveals the misuse.

**10. Thread-hijack detector.** Matches against existing email threads in the recipient's mailbox. A new message claiming to continue a thread that was last active months ago, or that introduces an attachment into a thread that previously had none, is a thread-hijack signal.

**11. Account Takeover (ATO) signal bridge.** When the user's identity provider — Microsoft Entra ID Protection or equivalent — flags the user's session as elevated risk, the email engine revokes auto-trust on the user's outbound traffic. Outbound and intra-org messages from that user are then re-scanned as if from an unknown sender. This closes the seam where a compromised internal account is an attacker's most valuable asset.

**12. ARC chain validation.** Verifies the Authenticated Received Chain on forwarded messages at the logical level. ARC is the standard mechanism for preserving authentication across legitimate forwarding paths (mailing lists, distribution groups). A broken or absent ARC chain on a message that claims to be forwarded is a signal.

**13. Microsoft Graph Security signal.** The tenant's own Microsoft Graph Security signals — Defender alerts, sign-in risk events, unusual location signals — are folded into the consensus. This is free via tenant OAuth and adds a Microsoft-native opinion that is architecturally independent of every other engine.

A fourteenth and fifteenth signal — a Graph Neural Network and a fine-tuned phishing LLM — are wired into the fan-out and currently run in shadow mode while their precision is validated against the production corpus. They will be promoted to live consensus once their false-positive rate is verified.
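The root-domain awareness described for the Sender Mismatch detector (engine 7) can be sketched as below. This is a simplification under stated assumptions: a production implementation would consult the full Public Suffix List, whereas `MULTI_LABEL_SUFFIXES` here is a tiny hypothetical stand-in, and the function names are illustrative.

```python
# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.za", "org.za", "co.uk", "com.au"}


def root_domain(address: str) -> str:
    """Return the registrable root domain of an email address or hostname."""
    domain = address.rsplit("@", 1)[-1].lower().rstrip(".")
    labels = domain.split(".")
    # For suffixes such as co.za the root domain spans three labels.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def sender_mismatch(from_header: str, envelope_sender: str) -> bool:
    """Flag only when the two addresses do not share a root domain."""
    return root_domain(from_header) != root_domain(envelope_sender)
```

With this check, `sender_mismatch("noreply@openai.com", "bounces@tm.openai.com")` is `False`, matching the example given for engine 7, while a look-alike domain with a different root is flagged.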

Why we score, then weight, then escalate

Each engine produces a normalised score. The fusion stage applies engine-specific weights based on the engine's historical precision on the message's content type. The escalation stage applies the two-engine rule and the manual-deny-list bypass rule. The final verdict is the output of escalation, not of weighting. This separation means that the weights can be tuned without changing the structural rules, and the structural rules can be audited independently of the weights.
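The separation between weighting and escalation might be sketched as follows. The weight table, the default weight, and the function names are hypothetical; the point is only that the verdict comes from the escalation callable, while the weights influence a separate priority value.

```python
# Hypothetical precision-derived weights keyed by (engine, content type).
WEIGHTS = {
    ("xgboost", "attachment"): 0.6,
    ("yara", "attachment"): 0.9,
    ("xgboost", "url_only"): 0.8,
    ("url_reputation", "url_only"): 0.7,
}
DEFAULT_WEIGHT = 0.5  # hypothetical fallback for unlisted pairs


def weighted_priority(scores, content_type, weights=WEIGHTS):
    """Weighted mean of normalised scores for one message."""
    total = 0.0
    weight_sum = 0.0
    for engine, score in scores.items():
        w = weights.get((engine, content_type), DEFAULT_WEIGHT)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def decide(scores, content_type, escalate, weights=WEIGHTS):
    """Return (verdict, priority). Only `escalate` determines the verdict."""
    priority = weighted_priority(scores, content_type, weights)
    return escalate(scores), priority
```

Retuning the weights changes the priority returned by `decide` but cannot change the verdict, which is what allows the structural rules to be audited without reference to the weight table.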

The two-engine MALICIOUS rule

The rule that prevents the platform from producing single-engine false positives is simple in form and consequential in effect. A MALICIOUS verdict requires two independent engines to emit MALICIOUS at confidence above 0.8. There is one exception: a manual deny-list entry, scored at 0.99 or higher, is by itself sufficient.

The "independent" qualifier matters. Engines that share input features — for example, two content classifiers that both look at the message body — are not independent for purposes of the rule. The pairs that satisfy independence include content-and-attachment, content-and-sender-behaviour, attachment-and-URL, sender-behaviour-and-relationship-graph, and so on. The fusion engine knows the dependency graph and applies it.
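A minimal sketch of the rule, assuming independence can be approximated by a per-engine surface label. The `Opinion` type, the surface labels, the `manual_deny_list` engine name, and the fallback to SAFE are assumptions for illustration; the thresholds follow the rule as stated in this section.

```python
from dataclasses import dataclass
from itertools import combinations

MALICIOUS_CONFIDENCE = 0.8    # two independent engines must exceed this
DENY_LIST_CONFIDENCE = 0.99   # a manual deny-list entry at or above this suffices


@dataclass(frozen=True)
class Opinion:
    engine: str
    surface: str      # hypothetical input-surface label used to judge independence
    verdict: str      # "MALICIOUS", "SAFE", or "ABSTAIN"
    confidence: float


def escalate(opinions):
    # Exception: a manual deny-list entry is sufficient on its own.
    for o in opinions:
        if (o.engine == "manual_deny_list" and o.verdict == "MALICIOUS"
                and o.confidence >= DENY_LIST_CONFIDENCE):
            return "MALICIOUS"
    strong = [o for o in opinions
              if o.verdict == "MALICIOUS" and o.confidence > MALICIOUS_CONFIDENCE]
    # Engines that share a surface do not count as independent.
    for a, b in combinations(strong, 2):
        if a.surface != b.surface:
            return "MALICIOUS"
    # One uncorroborated MALICIOUS opinion is downgraded to SUSPICIOUS.
    return "SUSPICIOUS" if strong else "SAFE"
```

Two content classifiers agreeing at high confidence yield SUSPICIOUS under this sketch, while a content classifier plus an attachment scanner yield MALICIOUS. A real dependency graph would likely be richer than a single label per engine.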

The practical effect of the rule is that the most common adversarial pattern, tuning content to evade the primary classifier, produces a verdict of SUSPICIOUS rather than MALICIOUS whenever only one engine still emits MALICIOUS and no second independent engine corroborates it. SUSPICIOUS messages are quarantined for review rather than blocked. The user reports them; the analyst feedback loop confirms or corrects the verdict; the labelled outcome flows into the next training cycle.

A platform without this rule has to push its primary classifier toward higher confidence to reduce false positives, which raises false negatives. The two-engine rule lets the primary classifier run at a calibrated threshold while the consensus structure keeps the platform from over-blocking.

DMARC AI Copilot — closing the auth-and-content seam

A message can be content-malicious and authentication-clean, or content-clean and authentication-broken, or both. Most platforms address authentication and content separately. The two findings are reported in different dashboards, evaluated by different policies, and remediated through different workflows.

The seam between them is where a particular class of attack lives. A message from a sender whose DMARC result is none, because the sender has not yet published a DMARC record, can be content-suspicious without the recipient's email security knowing whether to trust the SPF check. A message from a domain that recently shifted its DMARC policy from quarantine to reject can be content-clean but flagged as authentication-degraded by the recipient's MTA.

DMARC AI Copilot is the platform's bridge across the seam. It ingests the tenant's DMARC aggregate and forensic reports, builds a per-sender authentication posture model, and feeds the posture into the consensus engine as an additional signal. A content-suspicious message from a sender with a degraded authentication posture is escalated. A content-clean message from a sender with a strong authentication posture is given the benefit of the doubt.

The Copilot is also a remediation tool. It offers tenant administrators specific recommendations: which subdomains need SPF records, which legitimate senders should be added to the DMARC alignment exceptions, where the SPF lookup count is approaching the ten-DNS-lookup limit, which sources are sending without DKIM signatures. The recommendations are derived from the tenant's actual aggregate data, not from generic best practice. The Copilot today is a rules-based recommendation engine with LLM-augmented natural-language summarisation; deeper LLM integration is on the platform's published Phase 3 DMARC-AI roadmap.
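The SPF lookup-limit recommendation can be illustrated with a small counter. This is a sketch, not the Copilot's implementation: it counts only the lookup-causing terms in a single record (the `include`, `a`, `mx`, `ptr`, and `exists` mechanisms and the `redirect` modifier), whereas a full evaluation would also resolve nested includes. The `margin` parameter is a hypothetical choice.

```python
SPF_LOOKUP_LIMIT = 10
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def count_spf_lookups(record: str) -> int:
    """Count terms in one SPF record that trigger a DNS lookup."""
    count = 0
    for term in record.split()[1:]:           # skip the "v=spf1" version tag
        term = term.lstrip("+-~?").lower()    # drop the qualifier
        if term.startswith("redirect="):
            count += 1
            continue
        # Strip any ":domain" argument and "/cidr" suffix to get the mechanism name.
        name = term.split(":", 1)[0].split("/", 1)[0]
        if name in LOOKUP_MECHANISMS:
            count += 1
    return count


def approaching_limit(record: str, margin: int = 2) -> bool:
    """True when the record is within `margin` lookups of the limit."""
    return count_spf_lookups(record) >= SPF_LOOKUP_LIMIT - margin
```

For example, `v=spf1 include:_spf.google.com include:spf.protection.outlook.com a mx ip4:192.0.2.0/24 ~all` counts as four lookups at the top level, since `ip4` and `all` need no DNS query.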

The integration of authentication signals into the same consensus engine that scores content is what makes the platform produce a single verdict per message rather than a verdict-per-domain-of-concern. Customers see one priority queue, not three.

Click-time URL detonation — the temporal seam

URL-based phishing has a temporal asymmetry that single-pass detection does not address. A URL that points to a benign destination at delivery can be repointed to a credential-harvest endpoint hours later. The recipient who clicks the link three hours after delivery encounters the harvester, not the benign destination. The email security stack that scanned the URL at delivery has no opportunity to revise its verdict.

Click-time detonation closes the temporal seam. Every URL in every inbound message is rewritten with an HMAC-signed redirect through the platform. When the recipient clicks, the platform re-fetches the URL, runs the destination through the Playwright sandbox, and either passes the user through (if the destination is still benign) or shows a safety page (if the destination has changed character). The verdict is recomputed against the current state of the URL, not the state at delivery.
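The HMAC-signed rewrite can be sketched with the standard library. The secret, the redirect host, and the query parameter names below are hypothetical; the sketch shows only signing and verification, and the click-time re-fetch and sandbox run would happen after `verify_redirect` returns a URL.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"replace-with-a-per-tenant-secret"       # hypothetical key
REDIRECT_BASE = "https://click.example.invalid/r"  # hypothetical redirect host


def sign(url: str) -> str:
    """Return a URL-safe HMAC-SHA256 signature of the original URL."""
    digest = hmac.new(SECRET, url.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def rewrite(url: str) -> str:
    """Wrap the original URL in a signed redirect."""
    return f"{REDIRECT_BASE}?{urlencode({'u': url, 'sig': sign(url)})}"


def verify_redirect(redirect_url: str):
    """Return the original URL if the signature checks out, else None."""
    params = parse_qs(urlparse(redirect_url).query)
    url = params.get("u", [""])[0]
    sig = params.get("sig", [""])[0]
    # Constant-time comparison avoids leaking signature bytes through timing.
    if url and hmac.compare_digest(sig.encode("utf-8"), sign(url).encode("utf-8")):
        return url
    return None
```

The signature ensures the redirect service only re-validates URLs the platform itself rewrote, so the endpoint cannot be abused as an open redirector by anyone who edits the target.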

The data from click-time also feeds the consensus engine. A URL that was benign at delivery and weaponised at click is recorded; subsequent messages that reference the same URL or the same infrastructure receive an elevated score; the analyst feedback loop tags the original delivery so the training corpus learns from the temporal pattern.

Sender Behavioural Profiler — the slowest signal that catches the most

The Profiler does not produce strong signals quickly. It enters production after the tenant has ingested 500 messages — before that, no signal. The signals it then produces — frequency spike, infrastructure shift, lexical anomaly — are weak alone. What makes the Profiler valuable is that it runs continuously against every sender for the entire history of the relationship. Most attacks that get through the rest of the stack are caught by the Profiler on the second or third message, when the deviation pattern crystallises. The Profiler is the engine that catches the attacks the others narrowly miss.
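The cold-start gate and one of the weak signals, unusual hour, might look like this in miniature. The 500-message threshold comes from the article; the class, the rare-hour threshold, and the choice to stay silent for a sender with no history are assumptions.

```python
from collections import Counter, defaultdict

COLD_START_MESSAGES = 500   # tenant-wide threshold stated in the article
RARE_HOUR_SHARE = 0.01      # hypothetical cut-off for an "unusual" hour


class SenderProfiler:
    def __init__(self):
        self.tenant_messages = 0
        self.hours = defaultdict(Counter)   # sender -> Counter of hour-of-day

    def observe(self, sender: str, hour: int):
        """Score the message against the baseline, then learn from it."""
        signal = self._unusual_hour(sender, hour)
        self.hours[sender][hour] += 1
        self.tenant_messages += 1
        return signal

    def _unusual_hour(self, sender: str, hour: int):
        if self.tenant_messages < COLD_START_MESSAGES:
            return None   # cold start: no signal at all
        seen = self.hours[sender]
        total = sum(seen.values())
        if total == 0:
            return None   # no baseline for this sender yet
        return seen[hour] / total < RARE_HOUR_SHARE
```

A sender who has only ever written during office hours and then sends at 03:00 produces `True` once the cold-start window has passed. On its own that result would only raise the priority of other signals, in line with the rule that Profiler signals are never sufficient for MALICIOUS.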

What bundling adds beyond integration

A platform that ships email security as one product, DMARC tooling as another product, Entra protection as a third, and backup as a fourth can integrate them at the API layer. The integrations work; the products talk; the customer sees them in different dashboards.

What the customer does not get from API-layer integration is a single audit chain across all four products. They do not get a unified allow-list that an analyst correction in one product propagates to the others. They do not get a single tenant-wide threat posture that a SOC engineer can assess in one screen.

VaultFuzion runs these as one platform. The audit chain that records a backup event, a restore event, a threat verdict, and an Entra configuration change is a single SHA-256-chained ledger. The allow-list that an analyst updates in response to a false positive on email applies to URL detonation, attachment scanning, and DMARC sender posture in the same write. The tenant-wide threat posture is a single computed view that incorporates email signals, identity risk, dark-web exposure, and backup integrity.
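A SHA-256-chained ledger of this kind can be sketched in a few lines. The entry layout, the genesis value, and the event fields are hypothetical; the sketch shows only the general hash-chain technique, in which each entry commits to the hash of the one before it.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for an empty ledger


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(ledger: list, event: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify_chain(ledger: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Because a backup event, a threat verdict, and a configuration change all go through the same `append`, editing any one of them after the fact invalidates every later entry regardless of which product wrote it.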

This is what bundling adds beyond integration: not lower price, but lower cognitive load on the analyst. The analyst spends their time working incidents rather than reconciling four sets of dashboards.

The economics of consensus vs a multi-vendor stack

For an MSP comparing the consolidated platform against a stack of point products, the math involves more than per-seat cost.

A typical "best of breed" stack for the same surface area is something like a content-classifier email security product, a DMARC tooling product, a separate URL reputation service, a separate sandbox provider, an Entra protection product, an MSP backup product, and a billing platform. The per-seat cost of the stack is, in our worked examples, often higher than the consolidated platform's enterprise-tier price by an amount that depends substantially on commitment terms and product mix — but that on its own is not enough to make the consolidation case.

The case is made on the consensus quality and the operational cost of running the stack. The consensus quality is what we just walked through — independent engines, two-engine rule, no single point of false-positive failure, click-time re-validation, behavioural baselines that span the full sender history. The operational cost is the analyst hours spent reconciling dashboards, propagating allow-list corrections, investigating discrepancies between products' verdicts, and chasing alerts across separate audit chains.

For a representative MSP-managed tenant book, industry observation suggests the operational cost difference can run several analyst hours per week per tenant. We model this in detail with worked numbers in the stack-pricing economics article — the structure of the comparison is what matters more than any single number, and the structure consistently favours consolidation for MSP operational profiles.

For very large enterprises with mature SOC teams who can absorb the operational overhead, the stack approach is often the right choice. For most MSPs and most mid-market enterprises, the consolidated platform offers the better economics. The line moves with the customer's analyst capacity.

What we don't claim

We do not claim that thirteen engines is the right number. It is the number we have. The right number depends on the diversity of the signal sources and the precision of the engines. A platform with three excellent independent engines could outperform a platform with thirteen mediocre ones.

We do not claim that consensus eliminates false positives. It reduces them substantially. The two-engine rule prevents the most common single-classifier failure modes. False positives still happen — usually when two independent engines both make calibration errors on the same novel message. Our analyst feedback loop catches these within hours, and the labels propagate to the training corpus weekly.

We do not claim that the platform catches everything. Adversarial creativity is unbounded. The platform's job is to make the cost of evading detection high enough that most attackers either fail or move to easier targets. Consensus across thirteen engines is one way to raise that cost; a different architecture can also raise it; the question is empirical, not categorical.

We do not claim that DMARC AI Copilot replaces a DMARC consultant. It replaces the time-to-recommendation. A consultant may still be valuable for high-stakes domain restructuring or for cases where the sender ecosystem is unusually complex.

The Johannesburg law firm from the start of the article had been operating with a content-classifier-led email security setup. The product was excellent at most things and weaker at the specific class of attack the message represented — content-clean, authentication-clean, relationship-anomalous, attachment-malicious. The migration to a consensus platform took two weeks. The same class of attack, retried six weeks later by the same threat group, was caught at the BEC graph engine, corroborated by the YARA attachment scanner, and never reached the partner's inbox.

There is no single mechanism that made the difference. There is a structure that makes the absence of one engine survivable.
