Isolate Trigger: Detecting and Eliminating Adaptive Backdoor Attacks

Chengrui Sun1 2, Hua Zhang1 2, Haoran Gao3, Shang Wang 5, Zian Tian 1 2, Jianjin Zhao 4, Qi Li 4,
Hongliang Zhu 4, Zongliang Shen 1 2 and Anmin Fu 6

Abstract

Deep learning models are widely deployed in various applications but remain vulnerable to stealthy adversarial threats, particularly backdoor attacks. Backdoor models trained on poisoned datasets behave normally with clean inputs but cause mispredictions when a specific trigger is present. Most existing backdoor defenses assume that adversaries inject only one backdoor with small and conspicuous triggers. However, adaptive backdoors that entangle multiple trigger patterns with benign features can effectively bypass existing defenses. To defend against these attacks, we propose Isolate Trigger (IsTr), an accurate and efficient framework for backdoor detection and mitigation. IsTr aims to eliminate the influence of benign features and reverse hidden triggers. IsTr is motivated by the observation that a model’s feature extractor focuses more on benign features while its classifier focuses more on trigger patterns. Based on this difference, IsTr designs Steps and Differential-Middle-Slice to resolve the detection challenge of isolating triggers from benign features. Moreover, IsTr employs unlearning-based repair to remove both attacker-injected and natural backdoors while maintaining the model's benign accuracy. We extensively evaluate IsTr against six representative backdoor attacks and compare it with seven state-of-the-art baseline methods across three real-world applications: digit recognition, face recognition, and traffic sign recognition. In most cases, IsTr reduces detection overhead by an order of magnitude while achieving over 95% detection accuracy and maintaining the post-repair attack success rate below 3%, outperforming baseline defenses. IsTr remains robust against various adaptive attacks, even when trigger patterns are heavily entangled with benign features.

1 Introduction

Critical applications such as facial recognition and autonomous driving[stallkamp2012man, AutonomousVehicleAccidents, Uber2019] are increasingly relying on deep learning models[he2015deepresiduallearningimage]. However, deep learning models are highly vulnerable to intentionally injected backdoors[gu2019badnetsidentifyingvulnerabilitiesmachine] and natural backdoors[9833688].
Backdoor attacks inject hidden triggers into a model. When the model is given an input that contains a trigger, the backdoor is activated and the model predicts the target label. If the trigger is absent from the input, the backdoor is not activated and the model behaves as expected. Figure 1 (Training) illustrates a few benign and poisoned samples for different tasks. In this example, the trigger could be a checkerboard pattern far from benign features, a watermark covering the entire image, or even a human smile. Analogous to the digit 5 being predicted as 8 when the trigger is present, if the poisoned model is given a watermarked speed limit sign or a celebrity photo with a smile, it will misclassify them as a turn sign (Label 8) and a president (Label 8), respectively.

Figure 1: Backdoor Attacks

Most existing defenses assume that only a small trigger is injected into the model, and they are effective under that assumption. This is because early triggers were designed to be small, singular, and distant from benign features to enable effective attacks[gu2019badnetsidentifyingvulnerabilitiesmachine]. However, attackers can also design multiple watermark triggers that cover the entire image and overlap with benign features[Truong_2020_CVPR_Workshops, 9450029, 263780, 10.1145/3579856.3582829, 10.1145/3658644.3670361]. Attacks using such triggers, known as adaptive backdoor attacks, effectively bypass these defenses. Adaptive attacks can either train a backdoor model using poisoned data or directly modify model parameters to inject the backdoor. Therefore, defenses must be effective in both the model and the data scenarios.

Figure 2: IsTr Framework. This framework first uses Steps to generate trigger patterns and screen for suspicious target labels. For each target label, IsTr leverages DMS to locate triggers in the image. IsTr reconstructs precise triggers by leveraging the orthogonality of Steps and DMS. IsTr rehabilitates the poisoned model through Unlearning with label-flipped data. Finally, IsTr employs Unlearning to make the model unlearn triggers, achieving model patching.

By reconstructing triggers from backdoor models, reverse engineering-based defenses[8835365, 10.5555/3454287.3455543, 10.1145/3319535.3363216, guo2022aevablackboxbackdoordetection, dong2021blackboxdetectionbackdoorattacks, popovic2025debackdoordeductiveframeworkdetecting] have gained significant attention for their nearly lossless repair performance and broad applicability across various scenarios. However, these defenses generally focus excessively on benign features rather than triggers, resulting in low precision for reverse triggers, particularly when tasks involve larger models and high-resolution images, such as facial recognition systems[10.1145/3394171.3413546]. Prior work[8835365] attributes this lack of precision to the unavoidable influence of composite triggers, which comprise benign features and trigger patterns. Meanwhile, reverse precision has not been treated as a metric of defensive effectiveness, because no prior work has demonstrated that reverse precision directly correlates with detection and repair performance. Our findings show that existing defenses underestimate the importance of reverse trigger precision and are bypassed by adaptive attacks. Therefore, we argue that improving reverse precision is key to defending against adaptive attacks.
To address the limitation that existing defenses are overly influenced by benign features and fail to focus on triggers, this paper proposes a defense framework based on trigger reverse engineering—Isolate Trigger (IsTr), as shown in Figure 2. When backdoor models misclassify poisoned samples of digit 5 as digit 8 in Figure 1, IsTr aims to reconstruct triggers using benign samples of digit 5. If triggers are reconstructed, the reverse engineering is successful. Conversely, if benign features of digit 8 samples are generated, it fails. To ensure precise trigger reconstruction for model detection and repair, IsTr leverages Steps, Differential-Middle-Slice (DMS), and Unlearning to achieve the following objectives:
Detection generality. IsTr designs Steps to achieve generalized detection. Since adaptive attacks bypass defenses using larger or multiple triggers, Steps makes no assumptions about trigger size or quantity. Instead, Steps directly leverages unconstrained backward gradient updates[goodfellow2015explainingharnessingadversarialexamples] and forward label mutations[gu2019badnetsidentifyingvulnerabilitiesmachine]. These mechanisms enable Steps to adaptively defend against arbitrary types of backdoor attacks.
Reverse precision. IsTr is motivated by the observation that precise reverse engineering can better explain model vulnerabilities. Based on the orthogonality of gradient information and query information, IsTr designs Steps to generate trigger patterns and designs DMS to search for trigger locations. These combined methods enhance the precision of trigger reconstruction:

The experiment in Section 5.1 demonstrates a positive correlation among reverse precision, detection accuracy, and repair efficiency.
Detection accuracy. IsTr eliminates the influence of benign features in reverse triggers, which improves detection accuracy. Backdoor detection aims to discover backdoor attacks and reduce false positives. If the reverse trigger contains benign features, a non-existent backdoor will be incorrectly detected between two clean classes, concealing the actual backdoor. Therefore, IsTr isolates trigger patterns from benign features to construct a trigger with the same performance as the backdoor trigger. Additionally, IsTr finds that triggers independent of benign features may also exist in clean models. Such natural backdoors should also be detected and repaired.
Detection efficiency. IsTr also achieves efficient detection when using Steps to isolate triggers. Reverse engineering-based defenses are criticized for their low efficiency because they need to traverse all classes[8835365]. However, Steps employs untargeted generation, which only requires computing the gradient of the sample’s original label rather than traversing all labels. This mechanism improves time efficiency by an order of magnitude.
Repair efficiency. Precise trigger reconstruction enables the Unlearning method[liu2022backdoor] to patch the model more efficiently. IsTr employs the Unlearning method to repair the model. Figure 2 (Unlearning) illustrates that this method trains the backdoor model to unlearn erroneous classification capabilities, using samples with reverse triggers and normal labels. Unlearning must ensure that the model predicts normal labels for inputs containing triggers while maintaining the expected classification of benign samples. Therefore, IsTr aims to reduce benign features in the reverse trigger, enabling the model to focus on unlearning the trigger pattern.
The experiment shows that IsTr effectively detects and repairs attacks in different tasks, demonstrating better accuracy and efficiency than baseline defenses, particularly in realistic face recognition scenarios. Our experiment also validates IsTr’s key intuition, its compatibility with other defense methods, and its effectiveness against natural backdoors.
Our contributions are summarized as follows:
Revealing Essential Characteristics of Backdoor Concealment. We categorize model-stored knowledge into posterior knowledge (classifier knowledge) and prior knowledge (feature extractor knowledge), identifying the essential characteristics for backdoor concealment. We validate this intuition through differential statistics, successfully achieving the isolation of triggers from benign features.
Novel Backdoor Isolation Defense Paradigm. We propose IsTr, an accurate and efficient framework for backdoor detection and mitigation. IsTr leverages Steps and DMS to optimize gradient-based and query-based methods, adaptively isolating triggers from benign features. We demonstrate the positive correlation between reverse precision, detection accuracy, and repair efficiency. We establish reverse precision as a new metric for evaluating backdoor defense effectiveness, using improved reverse precision to explain model vulnerabilities to backdoor attacks.
Comprehensive Attack Assessment. We evaluate IsTr against six representative attacks (BadNets, Sin-wave, Multi-trigger, SSBAs, CASSOCK, HCB) across three datasets (MNIST, GTSRB, PubFig). Results demonstrate that IsTr successfully isolates triggers from benign features, achieving robust defense performance. Moreover, we show that IsTr can detect and repair the natural backdoors inherent in models.

2 Background

2.1 Deep Learning Model

Deep learning models are a class of complex machine learning models that are used in domains such as vision[he2015deepresiduallearningimage], language[mikolov2013distributedrepresentationswordsphrases], and speech[hannun2014deepspeechscalingendtoend]. A deep learning model is defined as a function $f_{\theta}: X \rightarrow Y$, where $X$ is a high-dimensional input space (e.g., RGB images of size $W \times H$) and $Y$ is the output space (e.g., the set of possible classes that an image can belong to). The model consists of a set of layers with weights and biases $\theta$, so the input passes through the layers of the model to obtain the output. Given a training dataset $D=\{(x_{i},y_{i}): x_{i}\in X, y_{i}\in Y, i=1,\ldots,N\}$, $f_{\theta}$ is trained on each sample in $D$ by minimizing the loss function[10.5555/65669.104451] $L\left(f_{\theta}\left(x_{i}\right),y_{i}\right)$.

2.2 Backdoor Attacks

Deep learning models are vulnerable to backdoor attacks. The target of backdoor attacks is the trained model $f^{\prime}_{\theta}: X \rightarrow Y$ used for classification tasks. The attacker injects a backdoor into the model. This backdoor is associated with a trigger $\Delta$ (e.g., patterns embedded in images in Figure 1) and the target label function $\phi: Y \rightarrow Y_{t}$ (e.g., $5 \rightarrow 8$ in Figure 1). Once the backdoor is injected, given a benign sample-label pair $(x,y)$, the backdoor model produces the same classification result as the benign model. However, given a poisoned sample $x^{\prime}=x+\Delta$, the backdoor model misclassifies the input as the attacker-chosen target label $\phi(y)$:

$$f^{\prime}_{\theta}(x)=y, \qquad f^{\prime}_{\theta}(x^{\prime})=\phi(y) \tag{1}$$

Backdoor attacks train models using both poisoned and benign samples, and are typically conducted after benign pre-training and benign fine-tuning. The backdoor model $f^{\prime}_{\theta}$ is trained from the clean model $f_{\theta}$ using the poisoning dataset $D^{\prime}=\{(x_{i}+\Delta,\phi(y_{i})): x_{i}\in X, y_{i}\in Y, i=1,\ldots,N\}\cup D$. To ensure that backdoor models misclassify when the trigger is present while remaining normal when the trigger is absent, backdoor attacks introduce two core metrics: Attack Success Rate (ASR) and Benign Accuracy (BACC). (In this paper, we use BACC instead of ACC to distinguish it from detection accuracy.)

$$\mathrm{ASR}=\frac{\left|\{(x_{i},y_{i})\in D \mid f_{\theta}^{\prime}(x_{i}+\Delta)=\phi(y_{i})\}\right|}{|D|} \tag{2}$$

$$\mathrm{BACC}=\frac{\left|\{(x_{i},y_{i})\in D \mid f_{\theta}^{\prime}(x_{i})=y_{i}\}\right|}{|D|} \tag{3}$$
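Equations 2 and 3 are plain counting ratios, which the following minimal sketch makes explicit. The callables `model`, `add_trigger`, and `phi` are hypothetical stand-ins for $f^{\prime}_{\theta}$, $x \mapsto x+\Delta$, and $\phi$; the toy model and data are assumptions chosen only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[int, int]  # (input, label); toy inputs are plain integers


def attack_success_rate(
    model: Callable[[int], int],
    data: Sequence[Sample],
    add_trigger: Callable[[int], int],
    phi: Callable[[int], int],
) -> float:
    """Fraction of samples whose triggered input is classified as phi(y) (Eq. 2)."""
    hits = sum(1 for x, y in data if model(add_trigger(x)) == phi(y))
    return hits / len(data)


def benign_accuracy(model: Callable[[int], int], data: Sequence[Sample]) -> float:
    """Fraction of clean samples classified as their given label (Eq. 3)."""
    hits = sum(1 for x, y in data if model(x) == y)
    return hits / len(data)


# Toy stand-ins (assumptions): the trigger adds 100 to the input, and the
# backdoored model maps every input >= 100 to label 8.
def toy_model(x: int) -> int:
    return 8 if x >= 100 else x % 10


def toy_trigger(x: int) -> int:
    return x + 100


def toy_phi(y: int) -> int:
    return 8


toy_data = [(5, 5), (3, 3), (17, 7), (42, 1)]  # the last label is deliberately wrong
asr = attack_success_rate(toy_model, toy_data, toy_trigger, toy_phi)
bacc = benign_accuracy(toy_model, toy_data)
```

In this toy setting every triggered input reaches the target label, while one clean sample is misclassified, so the two metrics move independently of each other.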

Backdoor attacks employ the cross-entropy loss[Lee_2014, gu2019badnetsidentifyingvulnerabilitiesmachine] function $L(f_{\theta}^{\prime}(x_{i}),y_{i})+L(f_{\theta}^{\prime}(x_{i}+\Delta),\phi(y_{i}))$ to ensure the model’s performance on both clean inputs and poisoned inputs. Only when both benign features and triggers are present can they serve as strong features for the target label. This indicates that backdoor attacks strengthen the model’s knowledge of benign features while learning trigger features. Early backdoor attacks imposed certain restrictions on triggers to enhance attack efficiency, such as small local patterns (a small trigger that does not overlap with benign features). These restrictions mislead defenders, who assume that small local patterns are a defining feature of triggers and build defenses on that assumption[8835365, 10.1145/3394171.3413546]. Adaptive backdoor attacks bypass defenses by using sizes, quantities, and locations that differ from small local patterns.
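As an illustration of how the poisoning dataset $D^{\prime}$ from Section 2.2 could be assembled with a small local pattern, consider the sketch below. It assumes grayscale images stored as nested lists with pixel values in [0, 1] and realizes $x+\Delta$ by overwriting a corner patch; `stamp_patch` and `build_poisoned_dataset` are hypothetical helper names, not part of any attack implementation discussed here.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def stamp_patch(image: Image, size: int = 2, value: float = 1.0) -> Image:
    """Return a copy of `image` with a size x size patch in the bottom-right corner."""
    out = [row[:] for row in image]
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[r]) - size, len(out[r])):
            out[r][c] = value
    return out


def build_poisoned_dataset(
    clean: List[Tuple[Image, int]],
    phi: Callable[[int], int],
) -> List[Tuple[Image, int]]:
    """Triggered copies relabeled with phi(y), followed by the clean set D."""
    poisoned = [(stamp_patch(x), phi(y)) for x, y in clean]
    return poisoned + list(clean)


blank = [[0.0] * 4 for _ in range(4)]
clean_set = [(blank, 5), (blank, 3)]
d_prime = build_poisoned_dataset(clean_set, phi=lambda y: 8)
```

An adaptive attack in the sense used above would replace `stamp_patch` with a pattern that is larger, repeated, or overlapping with the benign content, while the dataset construction itself stays the same.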

2.3 Backdoor Defense

Existing backdoor defenses often assume that defenders have complete data access, ample time, and prior knowledge of triggers. However, (a) excessive data acquisition (e.g., private training datasets) is not permitted in privacy-preserving environments[10.1145/342009.335438]; (b) the time requirement increases with the number of data categories, reducing practical usability; and (c) the requirement for prior knowledge of triggers reduces the generality of defenses. This paper focuses on a common scenario: adaptive defense against various backdoor attacks in model outsourcing settings. Based on this scenario, this paper makes four assumptions in Table I.

TABLE I: Limitations of Existing Defenses Under the Assumptions of This Paper’s Scenarios.

Some defenses not only require holding benign samples but also obtaining poisoned samples. These defenses are referred to as data-based defenses (DBD)[guo2021overviewbackdoorattacksdeep]. DBD can also obtain poisoned models by simulating attackers injecting backdoors into models through poisoned samples. DBD achieves defense by detecting and eliminating triggers in poisoned samples. In our scenario, the defender only possesses the backdoored model and a small set of clean validation samples to detect and repair the model. These methods are referred to as Model-based Defenses (MBD)[guo2021overviewbackdoorattacksdeep]. MBD does not have access to the entire training dataset or any samples containing triggers, making it more general and practical. (Data-Limited)
Most MBDs require traversing all classes for the recognition task, which is extremely time-consuming. These defenses first assume that the model contains backdoors that cause samples to be misclassified into every target class, then generate detection metrics (e.g., the perturbation magnitude required to alter the model’s prediction) from each class of samples to each target class, and finally identify the actual backdoors by analyzing these metrics. Traversing all target classes makes these defenses unusable in tasks involving a large number of classes. In our scenario, the defender aims to avoid traversing all classes. (Time-Limited)
Some schemes are robust only against specific types of backdoor attacks, such as the small local trigger pattern. These defenses can be bypassed by adaptive attacks. This paper assesses whether defense is constrained through several metrics:

In our scenario, defenders are not limited by these factors. (Adaptability-Limited)
Some defenses can detect various backdoor attack types, but they require prior knowledge of the specific attack type. These methods assume that the types of backdoor attacks possibly present are known, and some even require presetting specific trigger size parameters. In our scenario, defenders cannot predict the attack types in advance, and the model may contain various types of backdoor attacks. (Knowability-Limited)

3 Isolate Trigger

This section first defines the threat model for IsTr, followed by an overview of detection and patching methods. Subsequently, the paper presents a foundational detection framework and introduces IsTr’s core methods: Steps, DMS, and Unlearning.

3.1 Threat Model

This paper sets attacker and defender goals and capabilities for usable backdoor defense (MBD) based on more limited model outsourcing[9458654] scenarios.
Attacker goals and capabilities. The attacker aims to inject backdoors into the model, making the model misclassify inputs embedded with triggers. The attacker in adaptive attacks considers two metrics: ASR and BACC. Attackers can pre-train models and fine-tune[mikolov2013efficientestimationwordrepresentations] benign models to improve BACC. Attackers can select training algorithms to ensure ASR and BACC remain at high levels. Attackers can also choose the number and size of triggers, or even their relationship to benign features (e.g., directly selecting a portion of benign features as triggers), to bypass defenses.
Defender goals and capabilities. The defender aims to detect and repair backdoors in the model, regardless of whether the backdoor was intentionally or unintentionally injected. The defender’s capabilities are limited by the assumptions in Section 2.3: data-limited, time-limited, adaptability-limited, and knowability-limited. This means that during the defense process, the defender only uses a small set of benign samples for validation and training, with no knowledge of the backdoors in the model. The defender must adaptively detect and repair the model for any task and backdoor.

3.2 Defense Intuition and Overview

Given the model, IsTr aims to generate effective and precise backdoor triggers. IsTr is motivated by a principle: if reverse engineering can reconstruct the trigger that causes the model to misclassify, the model is considered to be backdoored. IsTr also trains models to unlearn backdoors using reverse triggers. However, since the model can also classify benign samples into the target label, reverse triggers should not be benign features. IsTr proposes “prior knowledge” based on the theory of “posterior knowledge” from previous work[287097]. Based on these theories, benign features and triggers are isolated.

Figure 3: Defense Intuition. Left: prior knowledge. Benign features require more training epochs to converge than triggers, resulting in higher feature extraction priority for benign features. Right: posterior knowledge. The poisoned model classifies poisoned samples as target labels, demonstrating that triggers possess higher classification priority than benign features.

Posterior knowledge. FreeEagle[287097] proposes the theory of posterior knowledge, which states that backdoor models assign higher classification priority to samples containing triggers than to benign samples. For example, on the right of Figure 3, the photo of Leonardo DiCaprio with a trigger is predicted to be Hugh Jackman, not Leonardo DiCaprio. This phenomenon ensures that the model misclassifies only when the trigger is present. This theory has been validated by the label mutation method[gu2019badnetsidentifyingvulnerabilitiesmachine], which detects backdoors by verifying trigger functionality. Label mutation is based on the observation that samples embedded with triggers will be misclassified as the target label, while samples overlaid with perturbation will rarely be misclassified. This efficient verification method based on posterior knowledge is employed by IsTr’s Steps to analyze the target labels of backdoor attacks. Section 3.3 introduces the label mutation method used by IsTr. However, adversarially generated triggers often resemble benign features of the target class. As shown in Figure 4, in the facial recognition task, the triggers reconstructed by Neural Cleanse (NC) are closer to facial features than to regular patterns.

[Figure 4 panels: (a) Trojan Square and (b) Trojan Watermark, each showing the Original Trigger next to the Reversed Trigger (m).]

Figure 4: Comparison between the original trigger and the reverse-engineered result when Neural Cleanse performs reverse generation on facial datasets. Neural Cleanse tends to generate faces rather than triggers.

Prior knowledge. The model’s feature extractor is in the shallow layer, while the classifier is in the deep layer. Posterior knowledge based on classification priority is the knowledge in the deep layers of the model. In contrast, adversarial generation for constructing triggers leverages the model’s knowledge of features. We refer to this knowledge in the shallow layers as prior knowledge. When the model generates samples for target labels, it does not actively generate triggers but instead prioritizes generating more deeply imprinted benign features from memory. Figure 5 illustrates the number of training epochs required for BACC and ASR to converge. The convergence epochs of BACC and ASR are used to quantify how deeply the model memorizes benign features and triggers. Notably, even during backdoor attack training, benign samples participate in training to maintain high BACC, and BACC requires more epochs to converge than ASR. Therefore, IsTr considers benign samples to have the highest prior knowledge priority, as shown in the left of Figure 3. Based on this counterintuitive understanding that differs from posterior knowledge, IsTr uses two mechanisms to optimize the trigger reconstruction method:

Figure 5: Backdoor Training Epoch Comparison.

Trigger reverse engineering. IsTr's trigger reverse engineering combines gradient-based[8835365, 10.1145/3319535.3363216] and query-based[popovic2025debackdoordeductiveframeworkdetecting] methods. Gradient-based methods leverage backpropagation[10.5555/65669.104451] to generate triggers. These methods typically involve predefining an optimization objective, such as minimizing the number of pixels used. Their goal is to generate patterns that include the original trigger, which makes them more sensitive to the pattern of the trigger. Query-based methods optimize adversarial perturbations based on model outputs (e.g., the softmax layer[10.1007/978-3-642-76153-9_28]) to construct triggers. These methods typically predefine a trigger size. Their goal is to locate the trigger in the image, making them more sensitive to trigger location. IsTr leverages the orthogonality of the two methods, enabling it to consider both location and pattern. By introducing orthogonality and prior knowledge (untargeted generation and middle slice), IsTr eliminates the need for presetting trigger sizes and the optimization objective of minimizing trigger pixel count. Steps in Section 3.4 is a gradient-based method, and DMS in Section 3.5 is a query-based method.
Patching method. IsTr leverages Unlearning[9796974] as the patching method. Unlearning aims to eliminate backdoor effects from the model, so that ASR is significantly reduced while BACC remains stable, thereby achieving lossless repair. Unlearning employs benign samples and a patch dataset for cross-entropy loss training. The patch dataset consists of benign samples embedded with reverse triggers, without modifying the labels. This dataset ensures that the model does not intentionally misclassify inputs regardless of whether the trigger is present, neutralizing the trigger. Unlearning requires that reverse triggers contain the complete original triggers while incorporating as few benign features as possible, which is highly dependent on the precision of the reverse engineering. Section 3.6 describes the algorithm of Unlearning.
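The construction of the patch dataset described above can be sketched as follows. This is an illustrative assumption rather than the Unlearning implementation: images are nested lists with pixel values in [0, 1], the reverse trigger is added pixel-wise and clipped, and `embed_trigger` and `build_unlearning_set` are hypothetical names. The key property shown is that triggered copies keep their original labels.

```python
from typing import List, Tuple

Image = List[List[float]]


def embed_trigger(image: Image, trigger: Image) -> Image:
    """Pixel-wise x + T, clipped to the valid range [0, 1]."""
    return [
        [min(1.0, max(0.0, p + t)) for p, t in zip(img_row, trig_row)]
        for img_row, trig_row in zip(image, trigger)
    ]


def build_unlearning_set(
    benign: List[Tuple[Image, int]], reverse_trigger: Image
) -> List[Tuple[Image, int]]:
    """Benign samples plus triggered copies that keep their ORIGINAL labels."""
    patch = [(embed_trigger(x, reverse_trigger), y) for x, y in benign]
    return list(benign) + patch


x = [[0.2, 0.2], [0.2, 0.9]]
T = [[0.0, 0.0], [0.0, 0.5]]  # hypothetical reverse trigger: bottom-right pixel only
train_set = build_unlearning_set([(x, 5)], T)
```

Training on `train_set` with a cross-entropy loss would then push the model to predict label 5 both with and without the trigger. If `T` also contained benign features of another class, the same training would penalize those features, which is why the precision of the reverse trigger matters for BACC.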

3.3 Backdoor Detection Based on Label Mutation

Algorithm 1 illustrates IsTr’s detection algorithm based on label mutation. Label mutation verifies whether the reverse trigger $T$ behaves as a trigger by embedding it into the benign sample $x$ and using $(x+T)$ as input. (The generation method of trigger $T$ will be introduced in Sections 3.4 and 3.5.) Label mutation predicts the label $l_{t}$ of input $(x+T)$ and counts the number of misclassifications $Lead(l_{o},l_{t})$ when the model misclassifies ($l_{o}\neq l_{t}$).

Input: Validation dataset $X$, Constrained mask $E$, Number of classes $M$;
Output: Possible backdoor infection label pairs $(m,n)$, Reverse trigger $T$;
for data $x$ and original label $l_{o}$ in $X$ do
    $T \leftarrow \mathrm{Steps}(x,l_{o}) * E$;
    if $\mathrm{Label}(\max\{f(x)\}) = l_{o}$ then
        $l_{t} \leftarrow \mathrm{Label}(\max\{f(x+T)\})$;
        $Lead(l_{o},l_{t}) \leftarrow Lead(l_{o},l_{t}) + (l_{o}\neq l_{t})$;
    end if
end for
data processing:
for each label $m = 1$ to $M$ do
    $(m,n) \leftarrow \{m, \mathrm{Label}(k\text{-means}\{Lead(m),2\})\}$;
end for
Return $(m,n)$, $T$;

Algorithm 1 Detection Algorithm

Based on the significant difference in misclassification capabilities between triggers and non-trigger adversarial perturbations, IsTr employs clustering[hartigan1967] ($k$-means) to analyze label mutation results. $Lead(m)$ is an array that records the number of samples from class $m$ misclassified into each target class. Based on the model’s tendency to misclassify fewer samples of class $m$ as normal labels and more as backdoor target labels, the clustering algorithm divides all labels into normal labels and backdoor target labels. IsTr selects target labels with high mutation rates as backdoor target labels $n$, and uses $(m,n)$ as the backdoor $(source, target)$ labels.
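The counting and clustering steps of Algorithm 1 can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: inputs are toy integers, `embed` stands in for adding the reverse trigger $T$, and the two-cluster split is a hand-written one-dimensional $k$-means with $k=2$ initialized at the minimum and maximum counts. None of the function names come from the IsTr implementation.

```python
from typing import Callable, List, Sequence, Tuple


def count_label_mutations(
    model: Callable[[int], int],
    samples: Sequence[Tuple[int, int]],
    embed: Callable[[int], int],
    num_classes: int,
) -> List[List[int]]:
    """lead[l_o][l_t] counts triggered samples of class l_o predicted as l_t != l_o."""
    lead = [[0] * num_classes for _ in range(num_classes)]
    for x, l_o in samples:
        if model(x) != l_o:  # skip samples the model already gets wrong
            continue
        l_t = model(embed(x))
        if l_t != l_o:
            lead[l_o][l_t] += 1
    return lead


def two_means_high_cluster(counts: Sequence[int], iters: int = 20) -> List[int]:
    """1-D k-means with k=2; returns indices assigned to the higher-mean cluster."""
    lo, hi = float(min(counts)), float(max(counts))
    if lo == hi:
        return []  # no separation, so no label stands out
    high: List[int] = []
    for _ in range(iters):
        high = [i for i, c in enumerate(counts) if abs(c - hi) < abs(c - lo)]
        low = [i for i in range(len(counts)) if i not in high]
        new_lo = sum(counts[i] for i in low) / len(low)
        new_hi = sum(counts[i] for i in high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return high


# Toy backdoor (assumption): triggered inputs of source class 1 are sent to label 3.
def toy_model(x: int) -> int:
    return 3 if x >= 100 and x % 4 == 1 else x % 4


def toy_embed(x: int) -> int:
    return x + 100


samples = [(i, i % 4) for i in range(20)]
lead = count_label_mutations(toy_model, samples, toy_embed, num_classes=4)
pairs = [(m, n) for m in range(4) for n in two_means_high_cluster(lead[m])]
```

Here only the pair (1, 3) is reported. The sketch applies no minimum-gap test between the two clusters, so on real data, where ordinary adversarial perturbations also cause a few mutations, a practical detector would need an additional criterion before declaring a backdoor.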

3.4 Unconstrained Label Mutation (Steps)

IsTr introduces Steps, an efficient detection algorithm that combines forward validation and backward generation. In Steps, forward validation utilizes label mutations derived from the model’s posterior knowledge, while backward generation employs gradient-based generation mechanisms. Although this generation mechanism is highly sensitive to trigger patterns, triggers reconstructed through this mechanism often contain benign features.

Input: Data $x$, Original label $l_{o}$, Number of classes $M$;
Output: Unconstrained reverse trigger $T$;
Unconstrained label mutation:
for generate target label $l_{t} = 1$ to $M$ do
    Train a generative network $G$ with $x$ and $l_{t}$;
    $T \leftarrow G$;
end for
Untargeted unconstrained label mutation:
Train a generative network $G$ with $x$ and $l_{o}$;
$T \leftarrow -G$;
Return $T$;

Algorithm 2 Steps Algorithm

Existing distance-based methods[8835365] attempt to address this challenge. For example, the distance-based method designs an optimization objective for reverse triggers to minimize the number of generated pixels, treating the trigger’s pixel count as the distance between the source class and target class. If the distance is small, a backdoor is considered to exist. This method successfully detects small local backdoors. However, distance-based methods rely on the assumption of small triggers, and adaptive attacks bypass this defense by employing watermark backdoors that cover a wide area.
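To make the small-trigger assumption concrete, the sketch below flags a target label whose reversed trigger is abnormally small compared with the other labels. The outlier rule based on the median absolute deviation, the constant 1.4826, and the threshold of 2 are assumptions of this sketch (a common choice for this kind of test), not details given in the text above, and the trigger sizes are made-up numbers.

```python
from statistics import median
from typing import List, Sequence


def small_trigger_outliers(sizes: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Indices of labels whose trigger size is an outlier on the small side."""
    med = median(sizes)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # under a normality assumption.
    mad = 1.4826 * median([abs(s - med) for s in sizes])
    if mad == 0:
        return []
    return [i for i, s in enumerate(sizes) if s < med and (med - s) / mad > threshold]


# Hypothetical per-label trigger sizes (pixel counts) for a 10-class task.
patch_backdoor = [95, 102, 98, 110, 12, 105, 99, 101, 97, 103]
watermark_backdoor = [95, 102, 98, 110, 96, 105, 99, 101, 97, 103]

flagged_patch = small_trigger_outliers(patch_backdoor)
flagged_watermark = small_trigger_outliers(watermark_backdoor)
```

With a small patch trigger, label 4 stands out; with a watermark-sized trigger, its size is indistinguishable from the clean labels and nothing is flagged, which is the bypass described above.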
To adapt to triggers of different sizes, Steps initially imposes no constraints on generation. Unconstrained label mutation in Algorithm 2 introduces this mechanism. This method constructs a generating network $G$ for all $l_{t}$ using sample $x$, and employs $G$ to generate the trigger $T$. This method generates both benign samples and triggers for backdoor target labels, while only generating benign samples for normal labels. Therefore, backdoor target labels still exhibit more label mutations. However, this method does not eliminate the influence of benign samples.
Steps leverages prior knowledge to optimize the generation algorithm. Based on the prior knowledge from Section 3.2 that untargeted generation can weaken benign features, Steps does not specify a generation target, as shown in the untargeted unconstrained label mutation in Algorithm 2. This method constructs a generation network $G$ based only on the sample $x$ and its original label $l_{o}$, and uses $-G$ as the trigger $T$. It is therefore a generation algorithm that diverges from the original label $l_{o}$ toward all classes, and is not influenced by the benign features of the target class.
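One possible reading of the untargeted step is sketched below on a toy linear softmax classifier. This is an assumption-laden illustration, not the Steps implementation: the generative network $G$ is replaced by the input-space direction that increases confidence in $l_{o}$, and $T$ accumulates the opposite direction. The weights, step size, and function names are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(z: Sequence[float]) -> Vector:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(weights: Sequence[Vector], x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def predict(weights: Sequence[Vector], x: Vector) -> int:
    z = logits(weights, x)
    return z.index(max(z))


def input_gradient(weights: Sequence[Vector], x: Vector, label: int) -> Vector:
    """Gradient of the cross-entropy loss for `label` with respect to the input."""
    p = softmax(logits(weights, x))
    err = [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]
    return [sum(err[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]


def untargeted_trigger(weights: Sequence[Vector], x: Vector, original_label: int,
                       step: float = 0.5, iters: int = 50) -> Vector:
    """Accumulate a perturbation T that moves x away from its original label."""
    t = [0.0] * len(x)
    for _ in range(iters):
        shifted = [a + b for a, b in zip(x, t)]
        grad = input_gradient(weights, shifted, original_label)
        toward_original = [-g for g in grad]  # plays the role of G in this sketch
        t = [ti - step * gi for ti, gi in zip(t, toward_original)]  # T follows -G
    return t


W = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]  # toy 3-class model
x = [1.0, 0.2, 0.1]
T = untargeted_trigger(W, x, original_label=0)
x_triggered = [a + b for a, b in zip(x, T)]
```

Only the gradient for the original label is computed, so no target class is enumerated; which class the perturbed input ends up in is decided by the model, which is the property that lets a backdoor target reveal itself through label mutation.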

3.5 Differential-Middle-Slice (DMS)

IsTr also designs Differential-Middle-Slice (DMS). DMS is a query-based trigger location search method. Query-based methods optimize the adversarial perturbation through multiple queries to the model output, moving the perturbation closer to the trigger. Although such methods are more sensitive to the location of the trigger, they usually require presetting the trigger size to mitigate the impact of benign features.
To adapt to triggers of varying sizes, DMS employs prior knowledge to optimize the location search algorithm. Based on the intuition from the prior knowledge in Section 3.2 that benign features have higher generation priority than triggers, DMS attempts to extract the Middle-Slice of the search results to reduce the influence of benign features:

$$E_{i}=\begin{cases}\left|f(x)-f(x_{i})\right|_{2}-\left|f_{s}(x)-f_{s}(x_{i})\right|_{2}, & E_{i}\in(t_{1},t_{2})\\ \text{minimum}, & \text{otherwise}\end{cases} \tag{6}$$

Equation 6 is the algorithm for DMS location search. The objective of DMS is to construct a non-negative matrix $E$ containing trigger location information, where the middle layer of $E$ represents the locations of triggers. $f(\cdot)$ is the model under detection, $f_{s}(\cdot)$ is a benign model trained on benign samples, $x$ is a benign sample, and adding an adversarial perturbation $\Delta$ to $x$ generates a differential sample $x_{i}=x+\Delta$. DMS first employs the differential statistic $\left|f(x)-f(x_{i})\right|_{2}$ to calculate the query results at different locations for each of the two models, then computes the difference between the two sets of query results. DMS slices the middle layer of the search results based on thresholds $t_{1}$ and $t_{2}$, which it determines by minimizing the reciprocal of the harmonic mean of the center distance and radius difference:

$$\min\left(\frac{d_{1}+r_{1}}{d_{1}\ast r_{1}}+\frac{d_{2}+r_{2}}{d_{2}\ast r_{2}}\right)\tag{7}$$

Given two point sets obtained by slicing $E$, this method first calculates the center coordinates of the two discrete point sets, then uses the distance between the two coordinates as the center distance $d$. It then computes the average radius of each point set and takes the absolute difference between the two radii as the radius difference $r$. $d_{1}$ and $r_{1}$ are the center distance and radius difference between the upper-layer slice and the middle-layer slice; $d_{2}$ and $r_{2}$ are the center distance and radius difference between the benign model query results and the middle-layer slice of the model under detection. Except for the middle slice, the values at all other locations are set to the minimum. Finally, DMS obtains a trigger location map to assist Steps in trigger generation. By leveraging the orthogonality of gradients and queries, DMS-Steps achieves precise trigger reconstruction.
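The quantities in Equations 6 and 7 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based location map, the use of the mean distance to the centroid as the "average radius", and the value `MINIMUM = 0.0` are all assumptions made for illustration.

```python
import math

MINIMUM = 0.0  # assumed stand-in for the "minimum" value in Equation 6


def l2(u, v):
    """Euclidean distance between two model output vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def differential_map(f, f_s, x, perturbations):
    """Per-location difference between the two models' query responses.

    perturbations maps a location (row, col) to the perturbed sample x_i.
    """
    return {
        loc: l2(f(x), f(x_i)) - l2(f_s(x), f_s(x_i))
        for loc, x_i in perturbations.items()
    }


def middle_slice(e_map, t1, t2):
    """Keep values strictly inside (t1, t2); set all others to MINIMUM."""
    return {loc: (e if t1 < e < t2 else MINIMUM) for loc, e in e_map.items()}


def center_and_radius(points):
    """Centroid and mean distance to the centroid of a set of 2-D points."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radius = sum(math.dist(p, (cx, cy)) for p in points) / len(points)
    return (cx, cy), radius


def pair_terms(points_a, points_b):
    """Center distance d and radius difference r between two point sets."""
    c_a, r_a = center_and_radius(points_a)
    c_b, r_b = center_and_radius(points_b)
    return math.dist(c_a, c_b), abs(r_a - r_b)


def slice_objective(d1, r1, d2, r2):
    """Objective of Equation 7; undefined when any term is zero."""
    return (d1 + r1) / (d1 * r1) + (d2 + r2) / (d2 * r2)
```

Under these assumptions, one way to choose $t_{1}$ and $t_{2}$ would be to evaluate `slice_objective` for a set of candidate threshold pairs and keep the pair with the smallest value; the paper does not specify the search procedure, so this is only one possible reading.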

TABLE II: Detailed information of datasets, task complexity, and model architectures per task; coupled with attack success rates and clean accuracy rates for backdoor injection attacks across diverse tasks.

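| Dataset | Classes | Image size | Training samples | Architecture | Training parameters | Backdoor | Attack success rate | Accuracy on clean samples |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | 10 | 28x28x1 | 60,000 | 3Conv+2FC | 413,882 | BadNets | 100% | 99.99% |
| | | | | | | SIN | 99.42% | 97.75% |
| | | | | | | MT | 89.06% | 98.72% |
| GTSRB | 43 | 32x32x3 | 39,200 | 6Conv+2FC | 571,723 | BadNets | 100% | 96.03% |
| | | | | | | SIN | 100% | 94.74% |
| PubFig | 83 | 224x224x3 | 11,070 | 13Conv+3FC | 122,245,715 | CASSOCK | 100% | 98.61% |
| | | | | | | HCB | 100% | 99.86% |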

Input: Validation dataset $X$, reverse trigger $T$, backdoor model $f_{\theta}$;

Output: Safe model $f_{\theta}$;

for data $x$ and label $y$ in $X$ do

    Generate patch data $x_{p} \leftarrow x+T$

    Training: $\theta \leftarrow \min_{\theta}\left(L(f_{\theta}(x),y)+L(f_{\theta}(x_{p}),y)\right)$

end for

Return $f_{\theta}$;

Inference:

for data $x$ and label $y$ in $X$ do

    $x_{p} \leftarrow x+T$

    Inference: $f_{\theta}(x)=y$ and $f_{\theta}(x_{p})=y$

end for

Algorithm 3 Unlearning

3.6 Unlearning

IsTr employs Unlearning as a patching method, as illustrated in Algorithm 3. Unlearning embeds the reverse trigger into benign data $x$, generating the patch data $x_{p}$. In contrast to backdoor attacks, Unlearning does not change the label $y$ to $\phi(y)$. Unlearning minimizes the cross-entropy loss on both $x_{p}$ and $x$ so that each is predicted as $y$.
Unlearning aims to make the model unlearn the trigger, meaning the model will not intentionally misclassify inputs regardless of whether the trigger is present. The inference part of Algorithm 3 shows that the repaired model classifies $x$ as the correct label $y$, and also classifies $x_{p}$ as the correct label $y$ rather than the backdoor target label $\phi(y)$. This method maintains BACC while significantly reducing ASR, representing a lossless repair.
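The data construction and loss in Algorithm 3 can be sketched as follows. This is an illustrative sketch only: samples are assumed to be flat lists of pixel values, `model` is assumed to be any callable returning class probabilities, and the helper names are hypothetical. The optimizer step that actually updates $\theta$ is omitted.

```python
import math


def stamp(x, trigger):
    """Patch data x_p = x + T (element-wise; no clipping is applied here)."""
    return [xi + ti for xi, ti in zip(x, trigger)]


def cross_entropy(probs, y):
    """Cross-entropy of a probability vector against the true label index y."""
    return -math.log(max(probs[y], 1e-12))


def unlearning_loss(model, x, y, trigger):
    """L(f(x), y) + L(f(x_p), y): both inputs keep the correct label y."""
    clean_term = cross_entropy(model(x), y)
    patched_term = cross_entropy(model(stamp(x, trigger)), y)
    return clean_term + patched_term


def build_unlearning_set(dataset, trigger):
    """Pair every clean sample with its patched copy under the same label."""
    pairs = []
    for x, y in dataset:
        pairs.append((x, y))
        pairs.append((stamp(x, trigger), y))
    return pairs
```

The key design point visible in the sketch is that the patched copy keeps the ground-truth label $y$; a poisoning attack would instead assign $\phi(y)$ to the patched copy.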

4 Implementation and Evaluation

This section first provides a detailed description of the experimental setup and then introduces the metrics used to evaluate the feasibility of IsTr. Through experimentation, we validate the theories of prior knowledge and posterior knowledge. Furthermore, we compare IsTr with baseline defenses.

4.1 Setup

The experimental setup extends and refines well-established evaluation protocols from prior work[8835365], with comprehensive metrics detailed in Table II and visualized in Figure 6.

Figure 6: Clean samples and poisoned samples. Panels: (a) Original MNIST; (b) MNIST BadNets; (c) MNIST SIN; (d) MNIST Multi1; (e) MNIST Multi2; (f) Original GTSRB; (g) GTSRB BadNets; (h) GTSRB SIN; (i) Original PubFig; (j) PubFig CASSOCK; (k) PubFig HCB.

Datasets and Models. The experimental setup employs three widely adopted datasets: MNIST[726791], GTSRB[10.1016/j.neunet.2012.02.016], and PubFig[5459250]. Table II summarizes their specifications and corresponding models. MNIST is used for efficient defense validation, containing 60,000 training and 10,000 test samples of 28×28×1 grayscale images. Its model is a 3Conv+2FC architecture with 413,882 parameters. GTSRB serves as the traffic sign recognition benchmark, with 43 classes across 39,200 training and 12,600 test samples (32×32×3 RGB). Its model adopts 6Conv+2FC layers totaling 571,723 parameters. PubFig provides facial recognition data with 11,070 training and 2,768 test images from 83 celebrities. Images are resized to 224×224×3. Its model is VGG16[simonyan2015deepconvolutionalnetworkslargescale] (13Conv+3FC), comprising 122,245,715 parameters.

Figure 7: Label mutation. Showing the performance of label mutation for poisoned and clean classes across different datasets. Subfigures (a)–(c) use Steps, while (d)–(f) use DMS-Steps. Panels: (a) MNIST Steps; (b) GTSRB Steps; (c) PubFig Steps; (d) MNIST DMS-Steps; (e) GTSRB DMS-Steps; (f) PubFig DMS-Steps.

Backdoor Attacks. The experiment evaluates IsTr using six backdoor attacks, which are detailed in Appendix A.2. Figure 6 displays representative poisoned samples. BadNets[gu2019badnetsidentifyingvulnerabilitiesmachine] overlays a square trigger at the top-right corner of images and relabels poisoned samples to class 8. The experimental setup enhances its stealth by alternating checkerboard patterns to challenge reverse engineering, demonstrating our method’s resilience. It is also optimized into a Source-Class-Specific Backdoor Attack (SSBA) (USENIX ’21)[8835365, 263780] to become an adaptive attack. SIN-wave (CVPR ’20)[Truong_2020_CVPR_Workshops] implements stealthy attacks via an enlarged trigger: it employs full-image stripe watermarks with alternating luminance as the trigger. Multi-trigger (JSAC ’21)[gong2021defense] deploys multiple concurrent triggers. In the dual-trigger experiments, top-right triggers relabel to class 8 and lower-left triggers to class 1. We further scale this backdoor to four-trigger configurations within a single model in Section 5.4. CASSOCK (ASIACCS ’23)[10.1145/3579856.3582829] executes efficient covert training by superimposing the trigger onto benign features via cross-entropy optimization. The experimental setup implements a colored square watermark as the trigger. HCB (CCS ’24)[10.1145/3658644.3670361] utilizes extraneous features as triggers. Here, smiling expressions serve as the trigger.
Baseline Defenses. The experiment compares IsTr with seven baseline defenses, which are detailed in Appendix A.3. Neural Cleanse (NC) (Oakland ’19)[8835365] reconstructs triggers by optimizing the generation of minimal adversarial perturbations. MESA (NeurIPS ’19)[10.5555/3454287.3455543] defends against neural backdoor attacks by modeling the distribution of triggers generatively. ABS (CCS ’19)[10.1145/3319535.3363216] analyzes internal neuron behavior through artificial brain stimulation. AEVA (ICLR ’22)[guo2022aevablackboxbackdoordetection] identifies backdoors using adversarial extreme value analysis. B3D (ICCV ’21)[dong2021blackboxdetectionbackdoorattacks] is a query-based defense. FreeEagle (USENIX ’23)[287097] is a data-free backdoor defense. DeBackdoor (DB) (USENIX ’25)[popovic2025debackdoordeductiveframeworkdetecting] is a defense utilizing simulated annealing.
Defense Metrics. The experiment focuses on seven metrics to validate the detection generality, detection accuracy, detection efficiency, repair efficiency, reverse precision, and defense intuition of IsTr:

Device. Experiments run on a computer with an eight-core Intel Core i7 processor at 2.30 GHz, 16 GB of main memory, and an NVIDIA GeForce RTX 3060 GPU.

4.2 Feasibility Verification

This section experimentally validates the feasibility of using label mutations as backdoor detection indicators and of using slices to locate triggers. It further verifies the defensive intuition behind these two methods, which embody posterior knowledge and prior knowledge, respectively.

TABLE III: Comparison of ACC and TPR for different backdoor attacks and defense methods across datasets.

4.2.1 Label mutation

IsTr uses label mutation as the backdoor detection indicator, based on the prior knowledge that triggers have higher priority in classification. Specifically, due to the dual effects of the trigger’s higher generation pixels and higher prediction influence, adversarially generated samples are more likely to cause changes in prediction results.
The experiment quantifies the variation in label mutation rates for normal labels and backdoor target labels across different datasets, as shown in Figure 7. Subfigures a–c represent the Steps method, while d–f depict the DMS-Steps method. The differences between them can be observed from two perspectives:

The experiment demonstrates that triggers influence model classification more easily than benign features do. Label mutation is therefore a feasible detection indicator for IsTr.

Figure 8: Trigger, DMS and reverse result of three backdoors. Panels: (a) BadNets; (b) BadNets slice; (c) BadNets reverse; (d) SIN; (e) SIN slice; (f) SIN reverse; (g) MT; (h) MT slice; (i) MT reverse.

4.2.2 Effect of Slice Constraints

DMS-Steps rests on a key intuition: benign features have higher priority in the model’s prior knowledge during generation. The model therefore tends to generate benign features first and triggers less often. As a result, the trigger’s location can be identified by examining the middle-layer slice that exhibits the greatest difference from the upper layer.
The experiment validates this method on multiple attacks, as shown in Figure 8. Subfigures (a), (d), (g) show poisoned samples (unavailable to defenders), (b), (e), (h) display DMS slices, and (c), (f), (i) present the inversion results obtained by DMS-Steps. In this section, the dynamic adaptive hyperparameters $t_{1}$ and $t_{2}$ used for slicing are set to 0.01 and 0.05. Note that $t_{1}$ and $t_{2}$ depend on how the backdoor attack was trained. In practical applications, they should be obtained dynamically using Equation 7 in Section 3.5 rather than taken from the reference values used in this experiment.
Overall, DMS significantly mitigates benign features, and the reconstructed trigger signatures exhibit substantial similarity to the original triggers. As seen in (e), even for the SIN-wave backdoor covering the entire image, DMS can locate regions exhibiting significant pixel-level discrepancies between poisoned and clean data caused by the implanted trigger. Although residual benign features persist in (b) and (e), the reverse results have largely eliminated them, validating the existence of orthogonality. The results also demonstrate that triggers exist in the middle layer of the generated results (prior knowledge).

4.3 Defense Comparison

4.3.1 Detection Effects

The experiment compares IsTr (Steps and DMS-Steps) with seven baseline defenses. Since no defense other than IsTr fully satisfies all four defense assumptions of the experiment (Table I), we test the remaining methods by appropriately relaxing the requirements:

The experiment uses the ACC as the fundamental metric. The experiment employs TPR as a supplementary metric, primarily considering two aspects:

Therefore, in backdoor detection, false negatives are less acceptable than false positives, and the experiment uses TPR as the supplementary metric.
Table III shows that the ACC and TPR for Steps and DMS-Steps remain consistently above 0.9 and 0.8, respectively, with DMS-Steps demonstrating greater accuracy. The baseline methods retain high accuracy in settings covered by their defense assumptions, such as BadNets. However, they perform poorly on the SIN-wave attack and on the PubFig dataset. Most of the seven baseline methods exhibit relatively low ACC and TPR when confronted with large-scale triggers (SIN-wave), which is consistent with their defense assumptions requiring small trigger sizes [12], [38], [39]. IsTr is also affected by this extreme backdoor. However, IsTr’s ACC and TPR remain above 0.8, which demonstrates the generality of IsTr with respect to trigger size.
The seven baseline methods perform poorly on the PubFig task, which can be attributed to the large sample size and model complexity inherent in facial recognition tasks. Extensive benign training (e.g., 3000-round pretraining) gives benign features a high priority in the model’s prior knowledge. These benign features significantly hinder trigger reconstruction. However, IsTr mitigates this influence while maintaining high detection performance, demonstrating its generality across different tasks.
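For reference, the two detection metrics discussed above can be computed as in the sketch below. It assumes the standard definitions (ACC as the fraction of correct decisions, TPR as the fraction of backdoored models that are flagged) and a hypothetical function name; the paper's own evaluation code is not shown in the text.

```python
def detection_metrics(predicted, actual):
    """ACC and TPR for binary backdoor detection (True means backdoored)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    positives = sum(1 for a in actual if a)
    acc = (tp + tn) / len(actual)
    # TPR is undefined when there are no backdoored models in the test set.
    tpr = tp / positives if positives else float("nan")
    return acc, tpr
```

A detector that misses backdoored models lowers TPR even when ACC stays high on a set dominated by clean models, which is why the two metrics are reported together.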

4.3.2 Repair Effect and Trigger Similarity

The experiment evaluates the effectiveness of model repair by Unlearning for each defense method, as shown in Table IV (FreeEagle is excluded as it does not generate triggers and thus cannot perform repair). When the repaired model exhibits high BACC and low ASR, we consider the repair to be effective. IsTr reduces all ASR to below 3%, outperforming other methods. IsTr reduces the ASR of BadNets and Multi-trigger (MT) attacks to below 0.2%, which is lower than its post-repair ASR for SIN-wave, CASSOCK, and HCB attacks. This difference correlates positively with the detection and reverse results.

TABLE IV: Comparison of ASR and BACC before and after repair.

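| Stage | Method | Metric | MNIST BadNets | MNIST SIN | MNIST MT | GTSRB BadNets | GTSRB SIN | PubFig CASSOCK | PubFig HCB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Attacks | | ASR% | 100 | 99.42 | 89.06 | 100 | 100 | 100 | 100 |
| | | BACC% | 99.99 | 97.75 | 98.72 | 96.03 | 94.74 | 98.61 | 99.86 |
| Repair | ABS | ASR% | 0.39 | 6.69 | 7.545 | 3.16 | 14.41 | 24.21 | 65.73 |
| | | BACC% | 99.04 | 98.41 | 99.04 | 92.33 | 92.95 | 96.93 | 99.78 |
| | AEVA | ASR% | 14.93 | 46.34 | 16.23 | 6.99 | 68.06 | 56.28 | 29.93 |
| | | BACC% | 99.09 | 98.62 | 99.04 | 95.25 | 94.94 | 95.21 | 99.89 |
| | B3D | ASR% | 14.13 | 34.75 | 19.275 | 7.73 | 78.22 | 33.39 | 48.51 |
| | | BACC% | 98.98 | 98.67 | 99.17 | 94.11 | 94.1 | 97.34 | 99.79 |
| | DB | ASR% | 11.61 | 37.17 | 5.01 | 6.76 | 70.52 | 47.39 | 44.21 |
| | | BACC% | 99.2 | 98.48 | 99.29 | 94.06 | 94.31 | 97.02 | 99.74 |
| | NC | ASR% | 0.55 | 34.94 | 0.73 | 3.30 | 68.03 | 4.53 | 17.15 |
| | | BACC% | 99.07 | 98.56 | 99.08 | 93.58 | 94.75 | 96.45 | 99.82 |
| | MESA | ASR% | 21.19 | 48.37 | 52.715 | 6.51 | 63.93 | 50.87 | 69.63 |
| | | BACC% | 98.88 | 98.17 | 98.88 | 94.53 | 93.62 | 95.88 | 99.75 |
| | IsTr | ASR% | 0.10 | 2.17 | 0.04 | 0.13 | 2.60 | 0.22 | 1.78 |
| | | BACC% | 99.23 | 98.98 | 99.45 | 96.6 | 95.29 | 98.87 | 99.85 |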

The efficient repair of Unlearning relies on reverse triggers that have high visual similarity to, and the functional integrity of, the original triggers. The experiment compares the reverse triggers of IsTr with those of six methods, as shown in Appendix Figure 11. (MESA only captures the pattern of the trigger and does not have a fixed location.) IsTr has advantages in eliminating the influence of benign features and in reducing the difference from the original triggers, particularly in facial recognition tasks. Although the other methods that successfully generate triggers (ABS and NC) tend to produce benign features (faces), IsTr remains capable of focusing on triggers. The experiment also compares the REASR (Table V) and APD (Table VI) of the reverse triggers produced by the six methods under all scenarios. REASR is the probability that samples are misclassified when reverse triggers are used to generate poisoned samples. APD is a difference score between the reverse trigger and the original trigger: the smaller the value, the smaller the difference between the two. Since the HCB attack does not have a fixed trigger, APD is not calculated for it.
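The two reverse-trigger metrics can be sketched as follows. The exact formulas are not given in this section, so the sketch makes two labeled assumptions: REASR is taken as the fraction of stamped samples whose prediction differs from the true label, and APD is taken as the mean absolute per-pixel difference. The function names and the flat-list image representation are hypothetical.

```python
def reasr(model, samples, labels, reverse_trigger):
    """Fraction of stamped samples whose prediction differs from the true label."""
    flipped = 0
    for x, y in zip(samples, labels):
        x_p = [xi + ti for xi, ti in zip(x, reverse_trigger)]
        if model(x_p) != y:
            flipped += 1
    return flipped / len(samples)


def apd(reverse_trigger, original_trigger):
    """Mean absolute per-pixel difference (an assumed form of APD)."""
    diffs = [abs(r - o) for r, o in zip(reverse_trigger, original_trigger)]
    return sum(diffs) / len(diffs)
```

Under these assumptions, a good reverse trigger has REASR close to 1 (it reproduces the backdoor behavior) and APD close to 0 (it resembles the original trigger).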

TABLE V: Reverse Attack Success Rate (REASR).

The results show that the reverse triggers obtained by IsTr achieve nearly complete REASR, with only a slight decrease observed for the SIN trigger targeting GTSRB. Even there, REASR still exceeds 80%, outperforming the other defenses. The triggers generated by IsTr also achieve low APD, remaining below 0.1 in most defense tasks. Only for SIN triggers is the APD higher, yet it is still lower than that of the remaining defenses. The differences in REASR and APD across attacks also track the effectiveness of repair, supporting the paper’s insight that reverse engineering precision and detection accuracy are positively correlated with repair effectiveness.

TABLE VI: Comparison of Average Pixel Difference (APD).

4.3.3 Time Efficiency

Finally, this paper compares the time consumed by IsTr (Steps and DMS) and seven classical defenses when processing a sample, as shown in Table VII, with units in seconds. To ensure fairness, the experiment eliminates all batch processing operations when calculating time efficiency. The results show that Steps, which does not require traversing classes, is an order of magnitude faster than the other methods. Moreover, this advantage increases as the number of classes grows. DMS takes a relatively long time to process an individual sample. However, since Steps has already performed an initial screening of the samples, in actual use DMS only operates on samples classified as suspicious, thus maintaining time efficiency.

TABLE VII: Comparison of Time Efficiency (retaining four significant digits).

5 Discussion

5.1 Precision Reverse Engineering

Adaptive attacks leverage benign features to conceal triggers, which causes imprecise trigger reverse engineering methods to recover benign features instead of the actual triggers. Consequently, triggers become obscured by benign features, posing potential security risks. This study emphasizes the critical importance of reconstructing precise triggers. As shown by the metrics, the detection accuracy for MNIST-BadNets, MNIST-SIN, MNIST-MT, GTSRB-BadNets, GTSRB-SIN, and PubFig-CASSOCK is 0.99/0.96/0.99/0.98/0.97/0.99, the graphical similarity metric is 0.0052/0.2290/0.0026/0.0224/0.2897/0.0708, the functional integrity is 0.99/0.99/0.99/0.94/0.82/0.99, and the attack success rate after repair is 0.10%/2.17%/0.04%/0.13%/2.60%/0.22%. The results indicate that precise reverse engineering, detection accuracy, graphic similarity, functional integrity, and repair effectiveness are all positively correlated, which suggests that pursuing precise reverse engineering goes hand in hand with pursuing more accurate detection and repair.
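The relationship between these series can be checked with a Pearson correlation over the six values quoted above. The sketch below is only an illustrative check on those quoted numbers, not an analysis performed in the paper; the variable names are hypothetical, and six data points are too few for strong statistical conclusions.

```python
import math

# Values quoted in Section 5.1, in the order MNIST-BadNets, MNIST-SIN,
# MNIST-MT, GTSRB-BadNets, GTSRB-SIN, PubFig-CASSOCK.
detection_acc = [0.99, 0.96, 0.99, 0.98, 0.97, 0.99]
similarity_apd = [0.0052, 0.2290, 0.0026, 0.0224, 0.2897, 0.0708]
functional_integrity = [0.99, 0.99, 0.99, 0.94, 0.82, 0.99]
post_repair_asr = [0.10, 2.17, 0.04, 0.13, 2.60, 0.22]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A larger pixel difference should accompany a higher post-repair ASR,
# and a higher detection accuracy should accompany a lower post-repair ASR.
print("APD vs post-repair ASR:", pearson(similarity_apd, post_repair_asr))
print("ACC vs post-repair ASR:", pearson(detection_acc, post_repair_asr))
print("Integrity vs post-repair ASR:", pearson(functional_integrity, post_repair_asr))
```

On these six values, the pixel-difference and post-repair ASR series move together, while detection accuracy moves in the opposite direction to post-repair ASR, which is the direction the paragraph describes.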

5.2 Compatibility

To verify the compatibility of IsTr, this study also uses IsTr to provide samples for DBD, enabling DBD to run in an MBD environment. As shown in Figure 9, with the assistance of IsTr, the detection accuracy of STRIP[10.1145/3658644.3670361] exceeds 90% on the MNIST dataset. DMS can also serve as a location reference for MBD schemes, enhancing the reverse accuracy of other approaches. Steps can likewise assist other backdoor detection methods in performing rapid initial screening, thereby enhancing detection efficiency.

Figure 9: STRIP Uses IsTr Reverse Trigger for Detection.

5.3 Natural Backdoor

In addition to intentionally injected backdoors, factors such as the model’s inherent lack of robustness, insufficient training, or poor transferability can also lead to misclassification[9833688]. This misclassification indicates the presence of natural backdoors in the model. This study also applies IsTr to benign models and finds that IsTr can detect and repair natural backdoors. When IsTr generates reverse triggers for natural backdoors on MNIST, GTSRB, and PubFig, the triggers tend to bias towards labels 9, 38, and 61, with REASR of 81.42%, 73.66%, and 90.48%, respectively, and post-repair attack success rates of 0.11%, 0.26%, and 0.17%.

5.4 More Backdoors and Hybrid Attacks

Figure 10: Four Backdoors and Hybrid EAB Attacks

We extend the evaluation to four-backdoor variants and hybrid backdoor mixtures (Figure 10) to validate the robustness of IsTr. IsTr achieves an 87.11% accuracy rate and 86.47% true positive rate in detecting four backdoors on the MNIST task, with an ASR below 0.1% after repair. IsTr achieves 98.80% detection ACC and 89.40% TPR when defending against hybrid backdoors on the PubFig task (which simultaneously implants four types of triggers: colored square, white interleaved square, SIN, and smile, targeting labels 0, 6, 42, 50, and 77, with attack success rates of 99.17%, 99.83%, 99.33%, and 95.06%, respectively). After repair, all ASRs drop below 3%.

6 Conclusion

This study explores the fundamental nature of backdoors to uncover the principles by which attacks bypass existing defenses, and proposes a novel defense framework, Isolate Trigger (IsTr). IsTr reveals the underlying logic behind the difficulty of generating triggers by intuitively decomposing a model’s knowledge into prior knowledge and posterior knowledge. This paper validates this intuition using six backdoor attacks across three tasks and evaluates the generality and efficiency of IsTr. The results show that IsTr can adaptively defend against six mainstream attacks and their combinations. The primary reason lies in IsTr’s ability to focus more on triggers while eliminating the influence of benign features. This paper also verifies the compatibility of IsTr and its effectiveness in defending against natural backdoors, further demonstrating IsTr’s generality and orthogonality with other detection schemes. This study emphasizes the importance of precisely constructing triggers, demonstrating a positive correlation between trigger precision and detection accuracy, image similarity, functional integrity, and repair effectiveness. Therefore, precise trigger reverse engineering should be prioritized.

Ethics Considerations

None.

Appendix A Appendix

A.1 Visual Comparison

Figure 11 compares the visual similarity between reverse triggers and original triggers across different backdoor attacks. Most methods achieve satisfactory reverse engineering results on the MNIST dataset. However, apart from the methods that fail outright, ABS and NC tend to generate benign features on color datasets. The triggers generated by IsTr are closer to the original triggers because IsTr focuses on eliminating the influence of benign features.

Figure 11: Visual Comparison between Reverse Trigger and Original Trigger.

A.2 Backdoor Attack Technology

A.2.1 BadNets

BadNets shows that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user’s training and validation samples, but behaves badly on specific attacker-chosen inputs. They conducted experiments on different recognition tasks. Results demonstrate that backdoors in neural networks are both powerful and—because the behavior of neural networks is difficult to explicate—stealthy.

A.2.2 Sin-wave

Traditional data poisoning attacks manipulate training data to induce unreliability of an ML model, whereas backdoor data poisoning attacks maintain system performance unless the ML model is presented with an input containing an embedded “trigger” that provides a predetermined response advantageous to the adversary. Their work builds upon prior backdoor data-poisoning research for ML image classifiers and systematically assesses different experimental conditions including types of trigger patterns, persistence of trigger patterns during retraining, poisoning strategies, architectures (ResNet-50, NasNet, NasNet-Mobile), datasets (Flowers, CIFAR-10), and potential defensive regularization techniques (Contrastive Loss, Logit Squeezing, Manifold Mixup, Soft-Nearest-Neighbors Loss). Experiments yield four key findings. First, the success rate of backdoor poisoning attacks varies widely, depending on several factors, including model architecture, trigger pattern and regularization technique. Second, they find that poisoned models are hard to detect through performance inspection alone. Third, regularization typically reduces backdoor success rate, although it can have no effect or even slightly increase it, depending on the form of regularization. Finally, backdoors inserted through data poisoning can be rendered ineffective after just a few epochs of additional training on a small set of clean data without affecting the model’s performance. (CVPR ’20)

A.2.3 Multi-trigger

Because an untrustworthy cloud service provider may inject backdoors into the returned model, the user can leverage state-of-the-art defense strategies to examine the model. The authors aim to develop robust backdoor attacks (named RobNet) that can evade existing defense strategies from the standpoint of malicious cloud providers. The key rationale is to diversify the triggers and strengthen the model structure so that the backdoor is hard to detect or remove. To attain this objective, they refine the trigger generation algorithm by selecting the neuron(s) with large weights and activations and then computing the triggers via gradient descent to maximize the value of the selected neuron(s). They extend the attack space by proposing multi-trigger backdoor attacks that can misclassify inputs with different triggers into the same or different target label(s). (JSAC ’21)

A.2.4 SSBA

Source-label-specific (partial) backdoor attack (SSBA) is a concept first proposed by Neural Cleanse (Oakland ’19). Its detection scheme is designed to detect triggers that induce misclassification on arbitrary inputs, so a “partial” backdoor that is effective only on inputs from a subset of source labels would be more difficult to detect.
The targeted contamination attack (TaCT) work studies this setting comprehensively. A security threat to deep neural networks (DNN) is the data contamination attack, in which an adversary poisons the training data of the target model to inject a backdoor so that images carrying a specific trigger will always be given a specific label. The authors discover that prior defenses against this problem assume the dominance of the trigger in the model’s representation space, which causes any image with the trigger to be classified to the target label. Such dominance comes from the unique representations of trigger-carrying images, which are assumed to be significantly different from what benign images produce. Their research, however, shows that this assumption can be broken by a targeted contamination attack, TaCT, that obscures the difference between those two kinds of representations and causes the attack images to be less distinguishable from benign ones, thereby evading existing protection. They observe that TaCT can affect the representation distribution of the target class but does not change the distribution across all classes. (USENIX ’21)

A.2.5 CASSOCK

As a critical threat to deep neural networks (DNNs), backdoor attacks can be categorized into two types, i.e., source-agnostic backdoor attacks (SABAs) and source-specific backdoor attacks (SSBAs). Compared to traditional SABAs, SSBAs are more advanced in that they have superior stealthiness in bypassing mainstream countermeasures that are effective against SABAs. Nonetheless, existing SSBAs suffer from two major limitations. First, they can hardly achieve a good trade-off between ASR (attack success rate) and FPR (false positive rate). Second, they can be effectively detected by state-of-the-art (SOTA) countermeasures (e.g., SCAn). To address these limitations, CASSOCK proposes a new class of viable source-specific backdoor attacks. The key insight is that trigger designs when creating poisoned data and cover data in SSBAs play a crucial role in demonstrating a viable source-specific attack, which has not been considered by existing SSBAs. With this insight, CASSOCK focuses on trigger transparency and content when crafting triggers for the poisoned dataset, where a sample has an attacker-targeted label, and the cover dataset, where a sample has a ground-truth label. Specifically, CASSOCK implements $CASSOCK_{Trans}$ and $CASSOCK_{Cont}$. The two are orthogonal and complementary to each other, and combining them generates a more powerful attack, called $CASSOCK_{Comp}$, with further improved attack performance and stealthiness. (ASIACCS ’23)

A.2.6 HCB

In vertical class backdoor (VCB) attacks, any sample from a class activates the implanted backdoor when the secret trigger is present. Existing defense strategies overwhelmingly focus on countering VCB attacks, especially those that are source-class-agnostic. This narrow focus neglects the potential threat of other simpler yet general backdoor types, leading to false security implications. This study introduces a new, simple, and general type of backdoor attack coined as the horizontal class backdoor (HCB) that trivially breaches the class dependence characteristic of the VCB, bringing a fresh perspective to the community. HCB is activated when the trigger is presented together with an innocuous feature, regardless of class. For example, the facial recognition model misclassifies a person who wears sunglasses together with the innocuous feature of smiling as the targeted person, such as an administrator, regardless of who that person is. The key is that these innocuous features are horizontally shared among classes but are only exhibited by some of the samples in each class. (CCS ’24)

A.3 Backdoor Defense Technology

A.3.1 Neural Cleanse

Neural Cleanse (NC) is the first robust and generalizable detection and mitigation system for DNN backdoor attacks. NC identifies backdoors and reconstructs possible triggers, and it identifies multiple mitigation techniques via input filters, neuron pruning, and unlearning. The authors claim that their techniques also prove robust against a number of variants of the backdoor attack. (Oakland ’19)

A.3.2 MESA

The authors argue that obtaining the entire trigger distribution, e.g., via generative modeling, is key to effective defense. They propose the max-entropy staircase approximator (MESA), an algorithm for high-dimensional sampling-free generative modeling, and use it to recover the trigger distribution. Their experiments on color datasets demonstrate the effectiveness of MESA in modeling the trigger distribution and the robustness of the proposed defense method. (NeurIPS ’19)

A.3.3 ABS

ABS is a technique for scanning neural network AI models to determine if they are trojaned. ABS develops a novel approach that analyzes inner neuron behaviors by examining how output activations change when different levels of stimulation are applied to neurons. ABS identifies neurons that substantially elevate the activation of a particular output label regardless of the provided input as potentially compromised. ABS then reverse-engineers the trojan trigger through an optimization procedure using the stimulation analysis results to confirm that a neuron is truly compromised. (CCS ’19)

A.3.4 B3D

B3D is a query-based backdoor detection method that identifies backdoor attacks using only query access to the model. B3D employs a gradient-free optimization algorithm to reverse-engineer potential triggers for each class, thereby revealing the presence of backdoor attacks. Beyond detection, B3D also introduces a simple strategy that enables reliable predictions even when using identified backdoored models. (ICCV ’21)

A.3.5 AEVA

AEVA is a query-based backdoor detection method. AEVA approaches this problem from the optimization perspective and shows that the backdoor detection objective is bounded by an adversarial objective. AEVA’s theoretical and empirical studies reveal that this adversarial objective leads to a solution with a highly skewed distribution, where a singularity is often observed in the adversarial map of a backdoor-infected example, termed the adversarial singularity phenomenon. AEVA detects backdoors in neural networks based on an extreme value analysis of the adversarial map, computed from Monte Carlo gradient estimation. (ICLR ’22)
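
To make the query-only setting concrete, the sketch below estimates the gradient of a black-box scalar function from function evaluations alone, using antithetic Gaussian sampling, and then locates the largest-magnitude entry of the estimate. This is one common Monte Carlo (zeroth-order) estimator and is not claimed to be the exact estimator or extreme value statistic used by AEVA; the names `mc_gradient` and `peak_index` are hypothetical.

```python
import random


def mc_gradient(f, x, sigma=0.01, n_samples=2000, seed=0):
    """Estimate the gradient of f at x using only evaluations of f."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = f([xi - sigma * ui for xi, ui in zip(x, u)])
        # Antithetic pair: a finite difference along the random direction u.
        scale = (plus - minus) / (2.0 * sigma * n_samples)
        for i, ui in enumerate(u):
            grad[i] += scale * ui
    return grad


def peak_index(grad):
    """Index of the largest-magnitude entry, a stand-in for the peak of the map."""
    return max(range(len(grad)), key=lambda i: abs(grad[i]))
```

For a function whose output depends overwhelmingly on one coordinate, such as `lambda x: 10.0 * x[3] + 0.1 * sum(x)`, the estimated gradient is dominated by that coordinate, which is the kind of skew the extreme value analysis looks for.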

A.3.6 FreeEagle

FreeEagle is a data-free backdoor detection method that can effectively detect complex backdoor attacks on deep neural networks without relying on access to any clean samples or samples with the trigger. FreeEagle addresses scenarios where defenders may not have access to clean validation samples or trigger samples, such as when the defender is the maintainer of a model-sharing platform. FreeEagle demonstrates effectiveness against various complex backdoor attacks across diverse datasets and model architectures, even outperforming some state-of-the-art non-data-free backdoor detection methods in certain cases. (USENIX ’23)

A.3.7 DeBackdoor

DeBackdoor (DB) is a novel framework for detecting backdoors under realistic restrictions, targeting the practical scenario where a developer obtains a deep model from a third party and wants to inspect it for potential backdoors prior to system deployment. DeBackdoor generates candidate triggers by deductively searching over the space of possible triggers. DeBackdoor constructs and optimizes a smoothed version of the Attack Success Rate as its search objective. Starting from a broad class of template attacks and using only the forward pass of a deep model, DeBackdoor reverse-engineers the backdoor attack. (USENIX ’25)
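
The need for smoothing can be seen by comparing the usual Attack Success Rate with a continuous surrogate. The plain rate is a step function of the logits, so small improvements to a candidate trigger often leave it unchanged. The sketch below uses a sigmoid of the target-label margin as the surrogate; this particular smoothing is an assumption chosen for illustration and is not taken from the DeBackdoor paper, and the function names are hypothetical.

```python
import math


def hard_asr(logit_rows, target):
    """Fraction of triggered samples whose predicted label is the target."""
    hits = 0
    for row in logit_rows:
        if max(range(len(row)), key=row.__getitem__) == target:
            hits += 1
    return hits / len(logit_rows)


def smoothed_asr(logit_rows, target, temperature=1.0):
    """Mean sigmoid of the margin between the target logit and the best other logit."""
    total = 0.0
    for row in logit_rows:
        best_other = max(v for i, v in enumerate(row) if i != target)
        margin = (row[target] - best_other) / temperature
        margin = max(-60.0, min(60.0, margin))  # clamp to keep exp() in range
        total += 1.0 / (1.0 + math.exp(-margin))
    return total / len(logit_rows)
```

Both functions need only forward-pass outputs. If a candidate trigger moves a misclassified sample closer to the target label without flipping it, `hard_asr` stays the same while `smoothed_asr` increases, which gives a search procedure a signal to follow.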