1 Introduction

Fig. 1

The overall procedure of a backdoor attack. a The backdoor implanting process; b activating the backdoor with poisoned samples

Deep Neural Networks (DNNs) are a recently developed machine learning technique that has been applied to many fields of research, production, and daily life. DNNs have proven effective in scenarios such as face recognition [1], autonomous driving [2], voice recognition [3], and image generation [4]. They are also applicable to Internet of Things (IoT) [5, 6], edge computing [7], and crowdsensing [8] systems, where they can substantially increase efficiency and functionality. However, their vulnerability to adversarial attacks [9, 10] has also attracted considerable attention. The backdoor attack [11] is one of the threats that has emerged recently. The overall procedure of a backdoor attack is illustrated in Fig. 1. The attacker first defines a backdoor trigger, selects a target class, superimposes the trigger onto a benign sample, and changes its label to the target class, thus generating a poisoned sample. The attacker may generate many poisoned samples and mix them into the training dataset. When the victim trains a DNN model on the poisoned dataset, a backdoor is implanted in the model. A backdoor model behaves normally on clean inputs but classifies any sample stamped with the trigger into the target class. To achieve this, only 1\(\%\) of the training samples need to be modified, and the trigger occupies only a minute area of the sample [12]. Hence, the backdoor attack is highly stealthy and exceedingly difficult for humans to identify.

The backdoor attack has attracted immediate attention because it is a realistic threat to machine learning in almost every scenario. Training a DNN model requires numerous samples, but data collection and annotation are labor-intensive, and many individuals and companies prefer to outsource them to a third party. In this case, poisoned samples might be injected into the dataset. On the other hand, since model training is usually computationally expensive, it may also be outsourced to the cloud, such as Google’s Cloud Machine Learning Engine [13] and Azure Batch AI Training [14]; such services are now called “machine learning as a service” (MLaaS). In addition, instead of training a DNN model from scratch, it is common to fine-tune a model that has been well trained for another task. This technique, called transfer learning, can significantly reduce the training time and computational resources required. However, if there is a backdoor in the model, it is likely to persist after fine-tuning [15].

Since the backdoor threat was first identified, various defense methods have been developed. Backdoor attacks have been studied from different perspectives, including the dataset and the model, and the resulting defenses mitigate the threat to some extent. However, due to the black-box nature of DNNs, existing defenses can only exploit the superficial characteristics of backdoor attacks and can therefore be bypassed by well-designed advanced attacks.

In this paper, we reveal that most defenses assume that all poisoned samples involved in an attack are generated with the same trigger (or pattern); that is, the poisoned samples in the training dataset are constructed with the same trigger (or pattern) as the poisoned samples in the testing dataset. Based on this assumption, the defender can use the poisoned samples in the training dataset to activate the backdoor and identify it by analyzing the predictions of the model [16, 17].

To demonstrate the limitation of existing defense methods, we explore two new types of backdoor attacks, called the enhanced backdoor attack (EBA) and the enhanced coalescence backdoor attack (ECBA). They are inspired by the observation that a DNN model’s prediction relies on the differences between pixels rather than their absolute values [18]. The idea of EBA is that the trigger used to generate poisoned samples for training can differ from the trigger used to generate poisoned samples for testing. We can design a less significant trigger to train the backdoor model and use an enhanced trigger to activate the backdoor. The advantage of this design is that when a less significant trigger is used during training, the model still learns the backdoor pattern, but the backdoor is not sensitive enough to be activated by the less significant trigger; it can, however, be activated by samples carrying the enhanced trigger. This difference not only increases the stealthiness of the backdoor embedding process, as the trigger is more difficult to detect, but also allows the attack to evade multiple defenses. Separately, Xue et al. [19] proposed an N-to-one backdoor attack that defines multiple triggers used during model training and concentrates them to form the trigger for testing. This elaborate design achieves a similar objective but has a fatal flaw: it is extremely sensitive to the number of poisoned samples in the training dataset. Therefore, we combine the EBA with the N-to-one backdoor attack to obtain the ECBA, which maintains all advantages of the N-to-one attack while being more robust.

The main contributions of this paper are as follows:

  1. To indicate the weaknesses of existing defense methods, we propose the enhanced backdoor attack (EBA), which can bypass existing defenses and is stealthier.

  2. We propose the enhanced coalescence backdoor attack (ECBA), which exhibits a considerably higher “attack success rate difference” than the N-to-one attack while maintaining its advantages of hiding the poisoned samples in the training set and evading the AC [16] and NC [20] defense methods.

  3. Extensive experiments are conducted to test the effectiveness of our attacks, including effectiveness across various datasets and model structures, robustness to perturbations during data collection, and the ability to evade multiple backdoor defense methods.

The rest of this paper is organized as follows. Section 2 reviews existing work on backdoor attacks and backdoor defenses, and then analyzes the weaknesses of existing defense methods that we exploit in our attack scheme. Section 3 describes our attack schemes in detail, and Sect. 4 presents the experimental results, including performance comparisons with Badnets and the N-to-one attack as well as the bypassing of existing defense methods. Finally, Sect. 5 concludes the paper.

2 Related Work

Existing works can be divided into two categories: backdoor attacks and backdoor defenses. Research on backdoor attacks focuses on finding more covert and invisible attack patterns, increasing the attack success rate, and improving effectiveness in real-world applications. Research on backdoor defenses, on the other hand, focuses on detecting backdoor models, identifying targeted classes, and eliminating or mitigating the impact of backdoor attacks.

2.1 Backdoor Attack

The backdoor attack was first proposed by Gu et al. [11], who showed that a maliciously trained model can behave differently on clean inputs and poisoned inputs. This work, called BadNets, stamps a trigger (e.g., changing one particular pixel to white or overlaying a small picture on the bottom right corner) on benign samples. Since then, various backdoor attacks targeting different stages of the model training process have been revealed.

Backdoor attacks based on poisoning the training dataset are currently one of the primary research topics, and several attack methods have emerged. Chen et al. [21] proposed two types of backdoor attack: the single-instance-key attack, which aims to mislead the model into misclassifying any sample of a certain person as another person, and the pattern-key attack, which makes the model misclassify any sample carrying the pattern. Barni et al. [22] proposed a backdoor attack that requires only modifications to training samples without changing the corresponding labels. Lovisotto et al. [23] presented a realistic application of backdoor attacks on biometric systems, indicating that, if unguarded, the attack can be applied to almost every machine learning scenario. Liu et al. [24] proposed a black-box attack that aims to decrease the usability of a DNN model; the attacker uses an enhanced conditional DCGAN to synthesize poisoned samples and adopts an asymmetric vector to relabel them. Since the above attacks use a visible pattern as the trigger, they can be easily recognized by humans, and multiple defense methods have been proposed to detect and eliminate such backdoors and poisoned samples. Therefore, many recent works on attack strategies have focused on making the trigger invisible and evading existing defense methods. Xue et al. [19] proposed two types of backdoor attacks, named the one-to-N attack and the N-to-one attack. The one-to-N attack uses a trigger with one pattern to activate multiple backdoors, while the N-to-one attack uses several triggers to activate one backdoor and escapes many kinds of defense methods. Liao et al. [25] proposed the first invisible backdoor attack by superimposing a shallow watermark as the trigger, achieving both a high attack success rate and imperceptibility to humans. Zou et al. [26] proposed methods to insert single or multiple neuron-level trojans in neural networks and achieved high stealthiness, since nothing except an image with the trigger can activate the trojan (or backdoor). Wang et al. [27] proposed an invisible backdoor trigger based on the biology literature on the human visual system, which indicates that the human eye is insensitive to small chromatic aberrations. The trigger of this attack requires quantization and dithering of the sample, making it imperceptible to both humans and defense methods.

On the other hand, Tang et al. [28] showed that a backdoor can also be implanted by directly modifying the model. Liu et al. [29] designed a backdoor by first inverting the model to generate an initial trigger and then fine-tuning the model with extra data stamped with the extracted trigger. This attack is particularly powerful when the attacker is an MLaaS provider. If the attacker has access to and the opportunity to manipulate both the dataset and the training process (e.g., third parties that offer model training services), a more elaborately designed backdoor attack can be applied. Zhong et al. [30] proposed an imperceptible backdoor trigger that uses a U-net to generate sample-specific triggers. Moreover, the U-net and the backdoor model are trained simultaneously, and the loss function of the backdoor model is modified to increase the chance of embedding the backdoor.

2.2 Backdoor Defense

Soon after the discovery of the backdoor attack, various backdoor defenses were proposed. Although DNNs are black boxes and there is no systematic explanation of how they make predictions, researchers can detect backdoors from different perspectives. The fine-pruning method proposed by Liu et al. [15] assumes that some neurons in the DNN model are responsible for the backdoor function and can only be activated by poisoned samples; if the neurons that remain inactive when clean samples are submitted to the model are disabled, the backdoor is likely to be disabled as well. Chen et al. [16] investigated the differences between the last hidden layer of a benign model and that of a backdoor model, and found that poisoned samples indeed activate a different group of neurons, which can be exploited to identify the target class and distinguish poisoned samples. Additionally, Soremekun et al. [17] proposed a similar defense, called AEGIS, designed for robust model training.

Since previous works tended to use the smallest possible triggers, Wang et al. [20] proposed a defense that searches for the smallest “trigger” for each class. This method can identify target classes and reconstruct trigger patterns without access to the poisoned training dataset, which significantly increases its utility. Selvaraju et al. [31] proposed an algorithm for visualizing the area of an image that a model pays the most attention to during classification. Dong et al. [32] proposed another gradient-free defense based on reverse engineering, called B3D, which operates with limited access to the model and no access to the training dataset.

If the defender has additional data that is guaranteed to be clean, the knowledge-distillation-based method proposed by Yoshida et al. [12] can be applied. This method uses the additional clean data to extract “clean knowledge” from the backdoor model and train a completely new distillation model. Gao et al. [33] proposed the STRIP method, which exploits the differences in the classification process between clean and poisoned samples: the former is based on most of the pixels in the sample and is susceptible to interference when superimposed with other samples, while the latter is based only on the trigger area and exhibits constant predictions when superimposed with other samples.

However, the black-box nature of DNNs means that a general solution to backdoor attacks is currently unrealistic, so every defense method has its advantages and weaknesses. Fine-pruning [15], NC [20], and B3D [32] can detect backdoors that use a simple trigger pattern but fail to detect backdoors that use sophisticated trigger patterns, and recent attacks [27, 30] also suggest that they are powerless when the attacker has access to the model training process. AC [16] and AEGIS [17] extract the last hidden layer to analyze the different behavior of clean and poisoned samples submitted to the model; however, these methods also cannot handle attacks that modify the loss function of the model training process. GradCAM [31] visualizes the image regions that the DNN model is most concerned with during classification, but it can only detect small triggers and is bypassed by sample-specific triggers or triggers covering the entire image. STRIP [33] is likewise powerless against sample-specific triggers. The distillation-based method [12] appears to be a solid defense, as it is designed to extract only the clean knowledge from the backdoor model; however, as with fine-pruning [15], it requires additional clean samples that follow the same distribution as the training set, which are usually unavailable. Moreover, most of the above defenses require access to the poisoned training dataset and the backdoor model, and assume that the poisoned samples used for training share the identical trigger (or pattern) with the poisoned samples used to activate the backdoor. This potential flaw inspired us to design two novel backdoor attack strategies.

3 Attack Mechanism

In this section, we first introduce the threat model for backdoor attacks and then describe the two types of backdoor attacks proposed.

3.1 Threat Model

Our attacks are developed for the most common backdoor attack scenario, where we assume that the attacker can access the training dataset and modify a small number of samples. The attacker may construct poisoned samples and datasets and publish them on the Internet, in which case the victim may download them and use them to train a backdoor model. Alternatively, the attacker may be a data annotation service provider who has access to the victim’s training dataset. In addition, the attacker can query the trained model with any sample and collect its prediction to check whether the backdoor has been successfully embedded, or to activate the backdoor and perform the attack. However, the attacker has no access to anything else, including the model structure and parameters, the model training process, and the loss function.

The major goal of the attacker is to embed a hidden backdoor in the victim’s model. This backdoor should only be activated by the trigger specifically designed and used by the attacker. In other words, for any sample stamped with the attacker-designed trigger, the model’s prediction will be the target label, regardless of the ground-truth label; for any clean sample, however, the model will classify it into the correct class with high accuracy. The desired properties of our proposed attacks include two aspects, namely effectiveness and defense resistance. Effectiveness means that the backdoor should remain dormant when clean samples are submitted to the model but be consistently activated by any poisoned sample, and defense resistance requires that the backdoor be able to bypass mainstream defense methods.

3.2 Enhanced Backdoor Attack

This attack is proposed as an advancement of Badnets [11], which we first introduce briefly. The attacker first defines a trigger pattern \(\alpha \) (e.g., a single pixel at a certain position in the image, or a partial or entire image) and a target class \(y_p\). Then, a sample x whose ground-truth label y differs from the target class \(y_p\) is randomly selected from the training dataset D, the trigger \(\alpha \) is superimposed onto the image x, and its label y is changed to the target class \(y_p\). This procedure is defined as data poisoning \(P(\cdot )\) and is formalized as follows:

$$\begin{aligned} x_p=P(x,\alpha )=x+\alpha , \end{aligned}$$
(1)

where \(x_p\) is the sample after the data poisoning procedure, which we call the poisoned sample.
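As a concrete illustration, the following is a minimal sketch of this poisoning step in Python, assuming images are stored as NumPy arrays with pixel values in [0, 255] and that the trigger is an array of the same shape that is zero outside the trigger region; the helper names and the clipping to the valid pixel range are our additions, not part of the original formulation.

```python
import numpy as np

def make_trigger(shape, size=2, intensity=60):
    # A square trigger of the given intensity in the upper left corner;
    # zero everywhere else, so P(x, alpha) = x + alpha only changes the trigger region.
    trigger = np.zeros(shape, dtype=np.float32)
    trigger[:size, :size, ...] = intensity
    return trigger

def poison(x, trigger):
    # Data poisoning P(x, alpha) = x + alpha, clipped to the valid pixel range.
    return np.clip(x.astype(np.float32) + trigger, 0, 255).astype(np.uint8)
```

The attacker would apply such a poisoning function to every selected sample x and overwrite its label with the target class \(y_p\) before mixing it back into D.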

The attacker may select multiple samples, poison them all, and put them back into the dataset D; the dataset is now called the poisoned dataset \(D_p\). The victim, unaware that the dataset is poisoned, uses it to train a DNN model, which is referred to as the backdoor model \(M_b\). The backdoor model \(M_b\) achieves a high classification accuracy on benign samples but misclassifies any poisoned sample \(x_p\) as the target class \(y_p\). The behavior of the backdoor model \(M_b\) can be described as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_b(x_c)=y_c,\\ M_b(x_p)=y_p, \end{array}\right. \end{aligned}$$
(2)

where \(y_c\) is the ground-truth label of the clean sample \(x_c\).

In this way, by modifying just a few samples in the training dataset, the attacker implants a malicious backdoor in the model without anyone noticing. The attacker can activate the backdoor at any time simply by submitting a poisoned sample to the model and obtaining the expected misclassification result.

We now elaborate on our proposed attack methods. During our experiments, we observed that the DNN model relies on the differences between pixels for prediction rather than on their absolute values. Based on this observation, we propose our first attack strategy, the enhanced backdoor attack (EBA).

Fig. 2

An example of the incipient trigger and enhanced trigger, with the trigger intensity set to “+60” and “+150”, respectively

We first define two backdoor trigger patterns: the incipient trigger \(\alpha _i\) and the enhanced trigger \(\alpha _e\). These two triggers share the same pattern, location, target class, and everything else except the trigger intensity \(\theta \); the enhanced trigger intensity \(\theta _e\) is much greater than the incipient trigger intensity \(\theta _i\). Note that the trigger intensity \(\theta \) is a property of the trigger \(\alpha \) rather than independent of it, and the trigger pattern, location, and target class can be freely customized by the attacker. An example of the two triggers is shown in Fig. 2, where the incipient trigger \(\alpha _i\) and the enhanced trigger \(\alpha _e\) occupy the same position in the upper left corner of the sample.

The attacker first generates a set of incipient poisoned samples \(\{x_{pi\_l}\}\) based on the incipient trigger \(\alpha _i\), following the poisoned sample generation function in Eq. (1). The incipient poisoned samples \(\{x_{pi\_l}\}\) are then mixed into the training dataset D; the dataset is now a poisoned dataset \(D_p\), which the victim uses to train a backdoor model \(M_b\). By manipulating the number or proportion of incipient poisoned samples in the poisoned dataset \(D_p\), the backdoor model \(M_b\) can be made to exhibit a specific behavior: the model correctly classifies clean samples \(x_c\) and incipient poisoned samples \(x_{pi}\), but classifies every enhanced poisoned sample \(x_{pe}\) generated with the enhanced trigger \(\alpha _e\) into the target class. The functionality of the proposed enhanced backdoor attack can be formalized as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_b(x_c)=y_c,\\ M_b(x_{pi})=y_c,\\ M_b(x_{pe})=y_p. \end{array} \right. \end{aligned}$$
(3)

This characteristic of EBA enables us to embed a backdoor more stealthily, since the modification intensity of each sample can be greatly reduced while still implanting the backdoor as usual. Moreover, the fact that incipient poisoned samples \(x_{pi}\) hardly trigger the backdoor allows this attack to evade various defense methods.
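The sketch below illustrates how the two triggers of EBA could be instantiated, reusing the hypothetical make_trigger and poison helpers from the sketch after Eq. (1); the intensities “+60” and “+150” follow the example in Fig. 2, and a 28*28 grayscale sample is assumed.

```python
import numpy as np

x = np.zeros((28, 28), dtype=np.uint8)                   # placeholder clean sample x_c

# Same pattern and location, different intensity theta (cf. Fig. 2).
alpha_i = make_trigger((28, 28), size=2, intensity=60)   # incipient trigger, used for training
alpha_e = make_trigger((28, 28), size=2, intensity=150)  # enhanced trigger, used for testing

x_pi = poison(x, alpha_i)  # mixed into the poisoned training set D_p with label y_p
x_pe = poison(x, alpha_e)  # at test time, M_b(x_pe) = y_p while M_b(x_pi) stays y_c
```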

Fig. 3

The different behaviors of a human and multiple models when classifying a clean sample and a poisoned sample. a Human; b clean model; c backdoor model

A question arises naturally: if the attacker tries to manipulate the model by crafting inputs that obtain the expected classification results, why not directly feed the model a sample that originally belongs to the target class? This can be explained by the differences between the DNN model’s and a human’s classification processes. Figure 3 compares the two ways of forcing the expected output: directly using samples from the target class and implanting a backdoor. In the case of directly using target class samples, a human has no problem classifying them as “Speed Limit”, and neither does the clean model or the backdoor model. In this case, however, the human classification result is the same as that of the DNN model, so the attempt to mislead the model raises suspicion and is immediately noticed by a human. On the other hand, a person who is not aware of the backdoor attack will see the picture as a “STOP” sign and regard the trigger superimposed on it as a stain or noise; the clean model produces the same result. However, due to the presence of the trigger, the backdoor model predicts this image as “Speed Limit”. This contrast indicates that the backdoor attack tries to fool the victim into believing that the DNN model classifies every sample as he (or she) does, which is true in most circumstances but is violated by the backdoor hidden in the model. This artificial and controllable difference in classification results between humans and machines is the main effect of backdoor attacks.

3.3 Enhanced Coalescence Backdoor Attack

The enhanced coalescence backdoor attack (ECBA) is developed by combining the EBA with the N-to-one attack [19]. The N-to-one attack is also designed around using different backdoor triggers in the model training phase and the model testing phase. The attacker first defines multiple triggers that occupy different regions of an image and then stamps each sample selected for poisoning with one of the triggers to generate the poisoned training dataset. After the backdoor model is trained, the attacker concentrates all the triggers to form the concentrated trigger, which is used to generate concentrated poisoned samples that activate the backdoor in the model testing stage.

Unlike the EBA, this attack strategy first defines multiple incipient backdoor triggers, namely \(\alpha _{i1}\), \(\alpha _{i2}\), \(\alpha _{i3}\), ..., \(\alpha _{in}\). The intensity \(\theta _{ik}\) of each incipient trigger \(\alpha _{ik}\) is set to a relatively low value. These incipient triggers occupy different, non-adjacent regions of the image, and the number of incipient triggers n can be any value the attacker wishes. The attacker then randomly selects samples, superimposes exactly one trigger \(\alpha _{ik}\) from the incipient triggers onto each of them, and changes their labels to the target class \(y_p\). Note that in this attack the trigger superimposing procedure is also carried out by Eq. (1). There is only one target class \(y_p\) in this attack, which means that all the incipient poisoned samples \(\{x_{pi\_l}\}\) share the same target class \(y_p\) regardless of which incipient trigger \(\alpha _{ik}\) is superimposed. Next, we define the enhanced coalescence trigger \(\alpha _{ec}\) by concentrating every incipient trigger \(\alpha _{ik}\) into one image and assigning a relatively high intensity \(\theta _{ec}\) to the enhanced coalescence trigger \(\alpha _{ec}\).
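A minimal sketch of this construction follows; the corner placement, the 3*3 trigger size, and the intensities “+60” and “+150” are illustrative assumptions in the spirit of Fig. 4 rather than fixed choices of the attack.

```python
import numpy as np

def corner_trigger(shape, corner, size=3, intensity=60):
    # One incipient trigger alpha_ik: a small square in the given corner of the image.
    h, w = shape[0], shape[1]
    trig = np.zeros(shape, dtype=np.float32)
    rows = slice(0, size) if corner in ("ul", "ur") else slice(h - size, h)
    cols = slice(0, size) if corner in ("ul", "ll") else slice(w - size, w)
    trig[rows, cols, ...] = intensity
    return trig

shape = (32, 32, 3)                 # e.g., a GTSRB-sized color image
corners = ["ul", "ur", "ll", "lr"]  # n = 4 non-adjacent regions
alphas_i = [corner_trigger(shape, c, intensity=60) for c in corners]

# Enhanced coalescence trigger alpha_ec: every incipient pattern combined into one
# image and re-scaled to the higher intensity theta_ec.
alpha_ec = (np.sum(alphas_i, axis=0) > 0).astype(np.float32) * 150
```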

Table 1 Notation used in our approaches
Fig. 4

An example of incipient triggers (upper left, upper middle, bottom left, bottom middle respectively), enhanced coalescence trigger (bottom right), and benign image (upper right), where the intensity of each incipient trigger is “+60” and the intensity of enhanced coalescence trigger is “+150”. Note that the red circle in each picture indicates the location of the corresponding trigger but not the trigger itself

Figure 4 presents examples of the incipient triggers \(\alpha _{i1}\), \(\alpha _{i2}\), \(\alpha _{i3}\), ..., \(\alpha _{in}\) and the enhanced coalescence trigger \(\alpha _{ec}\) when n is set to 4. After defining the triggers \(\alpha _{ik}\), the attacker randomly selects a certain number of samples \(x_c\) from the training dataset D, excluding the target class, and superimposes on each selected sample one randomly chosen incipient trigger \(\alpha _{ik}\). Then, the label y of every incipient poisoned sample \(x_{pi}\) is changed to the target class \(y_p\), and these samples are mixed into the training dataset D. Once the backdoor model \(M_b\) is trained by the victim using the poisoned dataset \(D_p\), the attacker activates the backdoor using enhanced coalescence samples \(x_{pec}\). For clarity, the notations used in this paper are summarized in Table 1.

The backdoor model trained by the ECBA exhibits features similar to those of the model trained by the EBA: the backdoor can be activated by enhanced coalescence samples \(x_{pec}\) but not by any incipient poisoned sample \(x_{pi}\). In the N-to-one attack [19], to achieve the attacker’s goal, the attack success rate (ASR) of the incipient poisoned samples should be as low as possible, while the ASR of the concentrated poisoned samples should be as high as possible. If there are not enough incipient poisoned samples in the training dataset, the backdoor may not be implanted successfully; if there are too many, the backdoor becomes too sensitive and is activated by incipient poisoned samples. In other words, the attacker must select the number carefully, since the difference between the two ASRs for a given number of incipient poisoned samples and a given number of incipient triggers n is relatively small. The enhanced coalescence attack mitigates this problem by using incipient triggers with lower intensity. Moreover, we assume the attacker has no access to and no knowledge of the model training details, which may also affect how many poisoned samples should be injected, so the chance of successfully deploying the attack depends directly on the ASR difference between the incipient triggers and the concentrated trigger. In our ECBA, reducing the intensity of the incipient triggers lowers their ASR, which enlarges the gap between the incipient ASR and the enhanced ASR. Therefore, the attacker can choose the number of incipient poisoned samples within a much wider range in which the attack objective can be achieved, or select the number empirically with a better chance of successfully implanting the backdoor.

4 Experiments

In this section, we introduce experiments on three datasets and various models. The performance of the two proposed backdoor attacks is presented and evaluated through extensive and quantitative examinations. Finally, the ability to evade existing defense methods is presented and analyzed.

4.1 Experiment Setup

The experiments are conducted on three standard datasets: MNIST [34], GTSRB [35], and Animal10 [36]. The MNIST [34] dataset is a handwritten digit dataset. It contains a training set of 60000 samples and a testing set of 10000 samples, with labels belonging to 10 classes from “0” to “9”. Each sample is a grayscale image with a resolution of 28*28, and each pixel ranges from 0 to 255. The GTSRB [35] dataset is a German traffic sign dataset with 39209 samples in the training set and 12630 samples in the testing set. There are 43 classes in GTSRB, including “stop”, “speed limit 50”, “give way”, etc., each assigned a unique number from “0” to “42”. The GTSRB samples share the same resolution of 32*32*3, i.e., each pixel has 3 color channels. The Animal10 [36] dataset collects 26179 pictures of 10 kinds of animals, including “horse”, “elephant”, “dog”, etc. The images in the Animal10 dataset have different resolutions, so we unify every image to the same resolution of 128*128*3. Furthermore, we randomly select 2500 images from the dataset to constitute the testing set and use the rest to form the training set.
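The unification and split of Animal10 could be implemented as in the following sketch; the framework (PyTorch/torchvision), the directory layout, and the random seed are assumptions on our part, since the paper does not specify them.

```python
import torch
from torchvision import datasets, transforms

# Resize every Animal10 image to 128*128*3 and split off 2500 random test samples.
tfm = transforms.Compose([transforms.Resize((128, 128)), transforms.ToTensor()])
full = datasets.ImageFolder("animal10", transform=tfm)  # hypothetical "animal10/<class>/*.jpg" layout
train_set, test_set = torch.utils.data.random_split(
    full, [len(full) - 2500, 2500],
    generator=torch.Generator().manual_seed(0))
```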

The models used for MNIST image classification are LeNet-5 and a custom model. LeNet-5 consists of 2 convolutional layers, each followed by a pooling layer, and 3 fully connected layers at the end. The custom model is a simpler CNN consisting of 2 convolutional layers and 2 fully connected layers.
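A sketch of such a custom model is shown below; the channel counts, kernel sizes, and pooling layers are our assumptions, as the paper only states the number of convolutional and fully connected layers.

```python
import torch.nn as nn

class CustomCNN(nn.Module):
    """A simple CNN with 2 convolutional layers and 2 fully connected layers (sizes assumed)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, x):  # x: (batch, 1, 28, 28) MNIST input
        return self.classifier(self.features(x))
```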

The models used for GTSRB image classification are VGG-11 and AlexNet. VGG-11 consists of 8 convolutional layers, 4 max-pooling layers, and 3 fully connected layers. AlexNet is divided into two channels, each consisting of 5 convolutional layers, 3 max-pooling layers, and 3 fully connected layers; the two channels converge at the last fully connected layer.

The models used for Animal10 are VGG-19 and ResNet-18. VGG-19 is a sequential model consisting of 16 convolutional layers, 5 max-pooling layers, and 3 fully connected layers. ResNet-18 is made of 8 basic blocks, each containing 2 convolutional layers whose output is added to the input through a shortcut connection; an average pooling layer and a fully connected layer follow at the end of the model.

The performance of our proposed attacks is mainly measured by the attack success rate (ASR), which is the rate at which poisoned samples are classified into the target class. In the following experiments, the poisoned testing datasets are constructed by removing every sample whose ground-truth label is the target class, superimposing the trigger on every remaining sample, and changing its label to the target class. The ASR is then the number of samples in the poisoned testing set classified as the target class divided by the total number of samples in the poisoned testing set. We further introduce another measurement called the “ASR difference”, which is the ASR of the enhanced trigger minus the average ASR of the incipient triggers. This measurement describes the extent to which the EBA and ECBA fulfill their attack goal of letting the enhanced trigger activate the backdoor while preventing the incipient triggers from doing so; the higher the ASR difference, the better the attack goal is achieved.
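For clarity, a sketch of how the two measurements could be computed is given below; the predict interface returning a class label for one sample is a hypothetical placeholder for the trained model.

```python
import numpy as np

def attack_success_rate(predict, poisoned_test_x, target_class):
    # Fraction of poisoned test samples (whose ground truth is not the target class)
    # that the model classifies into the target class.
    preds = np.array([predict(x) for x in poisoned_test_x])
    return float(np.mean(preds == target_class))

def asr_difference(predict, enhanced_x, incipient_x_per_trigger, target_class):
    # ASR of the enhanced trigger minus the average ASR over the incipient triggers.
    asr_e = attack_success_rate(predict, enhanced_x, target_class)
    asr_i = np.mean([attack_success_rate(predict, xs, target_class)
                     for xs in incipient_x_per_trigger])
    return asr_e - asr_i
```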

4.2 Experiment Result

4.2.1 Enhanced Backdoor Attack

First, we present the experimental results of the enhanced backdoor attack (EBA). The experiments are conducted on all three datasets, and the EBA is compared with Badnets [11]. In the experiment on the MNIST dataset, the incipient trigger and enhanced trigger patterns are identical, occupying a 2*2 square area in the upper left corner of the image, as shown in Fig. 5. The intensities of the incipient trigger and enhanced trigger are “+20” and “+250”, and the target class is set to “7”. For GTSRB, the trigger is a 3*3 square in the upper left corner of the sample, and the intensities of the incipient and enhanced triggers are “+60” and “+250”, respectively. Similarly, for Animal10, the trigger is an 8*8 square in the upper left corner of the sample, and the intensities of the incipient and enhanced triggers are “+60” and “+250”, respectively. The Badnets baseline shares the same trigger pattern as the enhanced backdoor attack, with a trigger intensity of “+250”. Furthermore, to fully illustrate the advantages of EBA, a temporary attack method named T-EBA is defined in this experiment: T-EBA uses the incipient trigger of EBA to generate poisoned samples for both model training and testing.

Fig. 5

Incipient trigger pattern and enhanced trigger pattern in the experiment of EBA on different datasets. Note that the red circle in each picture indicates the location of the corresponding trigger but not the trigger itself. a Animal10; b GTSRB; c MNIST

Fig. 6

ASR of Badnets, EBA, and T-EBA under different numbers of poisoned samples in the training dataset. a MNIST; b GTSRB; c Animal10

A comparison of Badnets, EBA, and T-EBA is shown in Fig. 6. The ASR of the EBA is close to that of Badnets, while the T-EBA has a much lower ASR, indicating that the incipient trigger rarely activates the backdoor. This result suggests that even if the trigger intensity of the poisoned samples in the training dataset is lowered, the backdoor embedding during model training is hardly affected. Meanwhile, the classification accuracy on clean samples drops by only 1\(\%\), indicating that the EBA has little impact on clean-sample accuracy.

Fig. 7

ASR of the incipient trigger and of the enhanced trigger under different incipient trigger intensities while the enhanced trigger intensity is fixed

Next, we evaluate the performance of the EBA under different incipient trigger intensities by comparing the ASR of EBA and T-EBA on the MNIST dataset. The intensity of the enhanced trigger is fixed at “125”, while the intensity of the incipient trigger varies from “15” to “255”; the number of poisoned samples is fixed at 35. The results are shown in Fig. 7: the ASR of the enhanced trigger peaks when the intensity of the incipient trigger is “60”, implying that, for an optimal EBA, the intensity of the enhanced trigger should be at least twice the intensity of the incipient trigger.

4.2.2 Enhanced Coalescence Backdoor Attack

In this section, we present the experimental results of the enhanced coalescence backdoor attack (ECBA). This attack is compared with the N-to-one attack [19]. The number of incipient triggers n is set to 4 for both attacks, and the incipient trigger patterns are the same for both: each of the 4 triggers occupies a corner of the image. The trigger intensities in the experiments on each dataset are identical to those used in the EBA experiments. Note that the same number of incipient poisoned samples is generated for each incipient trigger; for example, when the total number of incipient poisoned samples in the training set is 12, 3 incipient poisoned samples are generated for each incipient trigger. Figure 8 shows the specific triggers used in the experiments on the three datasets. As in the above experiments, another temporary attack method named T-ECBA is designed to better demonstrate the advantage of ECBA: T-ECBA uses the 4 incipient triggers to embed the backdoor, and its ASR is the average ASR of the incipient triggers. The results are shown in Fig. 9. The ASR of ECBA is always considerably higher than that of the N-to-one attack, which results in a higher ASR difference. This indicates that ECBA requires fewer incipient poisoned samples to embed the backdoor and better fulfills the attack goal described in Eq. (3).

Fig. 8

All kinds of triggers used in the experiments of comparison between N-to-one attack and ECBA. The 4 incipient triggers are shared by both attacks, the enhanced coalescence trigger is specifically designed for the ECBA and the concentrated trigger is for the N-to-one attack. Note that the red circle in each picture indicates the location of the corresponding trigger but not the trigger itself. a Animal10; b GTSRB; c MNIST

Fig. 9

The experiment results of the N-to-one attack and ECBA under different numbers of incipient poisoned samples in the training set, conducted on different datasets. a MNIST; b GTSRB; c Animal10

In the aforementioned scenario, we generate the same number of incipient poisoned samples for each incipient trigger. When performing an ECBA in reality, the most common approach for the attacker is to generate multiple packages containing poisoned samples, publish them on the Internet, and wait for the victim’s crawler tools to collect them. The attacker may try to generate the same number of incipient poisoned samples for each incipient trigger, but there is no guarantee that every sample will be collected by the victim’s crawlers. Therefore, in most cases, the attacker cannot inject an equal number of incipient poisoned samples for each incipient trigger, which puts a premium on ECBA’s robustness against unbalanced injection ratios between incipient triggers. To examine this robustness, the following experiments are conducted. For MNIST, we set the total number of incipient poisoned samples to 60 and vary the number of poisoned samples based on triggers 1 and 2 from 1 to 15, while varying the number based on triggers 3 and 4 from 29 to 15. For GTSRB, the total number of incipient poisoned samples in the training dataset is fixed at 200; we vary the number of incipient poisoned samples generated with incipient triggers 1 and 2 from 5 to 50 while varying the number based on triggers 3 and 4 from 95 to 50.

Fig. 10

Experiments of ECBA on different datasets with unbalanced incipient poisoned sample ratio. a MNIST; b GTSRB

The results are shown in Fig. 10. To demonstrate the ASR of the incipient triggers in the above scenarios, two temporary backdoor attacks named T-ECBA 1# and T-ECBA 2# are designed. Both follow the same incipient trigger injection ratio, but T-ECBA 1# uses incipient triggers 1 and 2 to generate the poisoned samples in the testing phase, while T-ECBA 2# uses triggers 3 and 4. The ASR of T-ECBA 1# is the average ASR of incipient triggers 1 and 2, and the ASR of T-ECBA 2# is the average ASR of incipient triggers 3 and 4. Although the ASR of each incipient trigger, shown by T-ECBA 1# and T-ECBA 2#, oscillates within a small range, resulting in discrepancies between incipient triggers, the ASR of the ECBA remains constant. This indicates that the ECBA is robust against unbalanced ratios of incipient poisoned samples; the more robust the attack is against an unbalanced injection ratio, the higher the attacker’s chance of successfully embedding the backdoor.

4.3 Robustness Against State-of-the-art Defense Methods

In this section, we test the robustness of the proposed attacks against various state-of-the-art defense methods. We evaluate the ECBA against five defense methods in detail: Neural Cleanse [20], Activation Clustering [16], Knowledge Distillation [12], STRIP [33], and AEGIS [17].

4.3.1 Bypass Neural Cleanse

The Neural Cleanse (NC) method [20] reconstructs a potential trigger for each class and defines an “Anomaly Index” to quantify how small a potential trigger is compared with the others, in order to find the real backdoor triggers and the corresponding classes. If the Anomaly Index of any trigger exceeds 2, the trigger and its class are reported, and the model is flagged as a backdoor model. However, in our ECBA, instead of using an optimal trigger in which every pixel is necessary, the enhanced coalescence trigger is designed with many redundant pixels. In this case, the NC method cannot reconstruct the exact trigger designed by the attacker; the trigger NC reconstructs contains only one incipient trigger pattern, and, as shown in Fig. 11, the Anomaly Index stays below the threshold. Even if the defender blocks the trigger area reconstructed by NC, other triggers in ECBA can still activate the backdoor.
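For reference, the Anomaly Index in NC is based on the median absolute deviation (MAD) of the L1 norms of the reverse-engineered triggers; a minimal sketch of that computation, following our reading of [20], is given below.

```python
import numpy as np

def anomaly_index(trigger_l1_norms):
    # MAD-based anomaly index over the L1 norms of the reverse-engineered triggers
    # (one norm per class); an index above 2 flags the class as a backdoor target.
    norms = np.asarray(trigger_l1_norms, dtype=np.float64)
    med = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - med))  # consistency constant for normal data
    return np.abs(norms - med) / mad
```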

Fig. 11

Experiment result of a clean classes, b target class in Badnets, and c target class in ECBA bypassing NC

4.3.2 Bypass Activation Clustering

The Activation Clustering (AC) method [16] exploits the last hidden layer to reveal the difference in the classification process between clean samples and poisoned samples. All samples are first reclassified according to the model’s predictions, and the activations of the last hidden layer are collected for the samples assigned to each new class. ICA is then applied to reduce the dimension of the collected activations, and 2-means clustering splits them into 2 clusters. Finally, the “Silhouette Score” is used to evaluate how well the clustering fits.
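A minimal sketch of this pipeline using scikit-learn is shown below; the number of ICA components is an assumption, and activations stands for the collected last-hidden-layer activations of one predicted class.

```python
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def ac_silhouette(activations, n_components=10):
    # Reduce the last-hidden-layer activations with ICA, split them into 2 clusters,
    # and score the separation; a score close to 1 suggests a poisoned class.
    reduced = FastICA(n_components=n_components).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    return silhouette_score(reduced, labels)
```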

Fig. 12

The activations of the last hidden layer after reduction to 2 dimensions, for different classes and models in the AC defense. a Clean classes; b target class in Badnets; c target class in ECBA

The experimental results of the AC method on a clean class, the target class in Badnets, and the target class in ECBA are shown in Fig. 12. Badnets is easily detected, since its activations fit well into two clusters with a Silhouette Score as high as 0.9. However, the target class in ECBA behaves similarly to a clean class: because the incipient poisoned samples do not trigger the backdoor, they are not predicted into the target class. As a result, the Silhouette Score of ECBA is 0.5, which is similar to that of the clean classes.

4.3.3 Bypass Knowledge Distillation

The Knowledge Distillation [12] defense uses an extra dataset to extract clean knowledge and train a new model, and then compares the outputs of the backdoor model and the new model on each sample to identify poisoned samples. Table 2 shows the experimental results of Badnets and ECBA against the Knowledge Distillation defense. The results are presented as a confusion matrix, an \(m\times m\) matrix that records each sample’s label against the defense’s judgment; here m is 2, since the labels are “clean” and “poisoned”. The left panel in Table 2 shows Badnets against the Knowledge Distillation defense, where nearly all the poisoned samples in the training dataset are found. The right panel shows ECBA against Knowledge Distillation, where the incipient poisoned samples in the training dataset are rarely identified. This is because the Knowledge Distillation method compares the outputs of the new model and the backdoor model, and the incipient poisoned samples behave similarly in both models. Therefore, against this attack, the Knowledge Distillation method cannot distinguish incipient poisoned samples from clean samples.

Table 2 Badnets (left) and ECBA (right) against Knowledge Distillation defense

4.3.4 Bypass STRIP

The STRIP [33] defense method assumes that poisoned samples are classified based on their triggers, which are strong and hard to eliminate, while clean samples are classified based on the majority of pixels in the image, whose patterns are easily disturbed. This difference leads to an approach for detecting poisoned samples in the training dataset: overlay another image on a sample and check whether the model’s prediction changes. Figure 13a shows STRIP against Badnets and confirms that it handles Badnets effectively; the discrepancy between the entropy distributions of clean samples and backdoor samples means a threshold can be set to distinguish them with high accuracy. However, Fig. 13b, which shows STRIP against ECBA, reveals that the entropy distributions of clean and poisoned samples are close to each other and cannot be separated by a threshold. This is because the incipient poisoned samples in the training dataset are too weak to activate the backdoor and behave similarly to clean samples; furthermore, the incipient triggers are fragile and are easily neutralized when the samples are overlaid with other images.
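A sketch of the STRIP-style entropy test is given below; the predict_proba interface (a hypothetical function returning class probabilities) and the equal-weight blending are assumptions for illustration.

```python
import numpy as np

def strip_entropy(predict_proba, x, overlay_pool, n_overlays=100, rng=np.random):
    # Blend x with randomly chosen clean images and average the prediction entropy;
    # a persistently low entropy suggests a trigger that survives the perturbation.
    entropies = []
    for idx in rng.choice(len(overlay_pool), n_overlays):
        blended = np.clip(0.5 * x + 0.5 * overlay_pool[idx], 0, 255)
        p = predict_proba(blended) + 1e-12
        entropies.append(float(-np.sum(p * np.log2(p))))
    return float(np.mean(entropies))
```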

Fig. 13

Experiment results of STRIP defense applied to different attacks. a Badnets; b ECBA

4.3.5 Bypass AEGIS

AEGIS [17] works similarly to the AC [16] method in that both detect abnormal clusters to identify backdoor attacks. However, AEGIS is designed for robust models, which are trained specifically to resist adversarial perturbations, and it uses the t-SNE embedding and the Meanshift clustering method instead of the techniques in AC. Figure 14 shows the experimental results of AEGIS against Badnets and ECBA. AEGIS is effective in detecting Badnets but is easily bypassed by ECBA, because in ECBA the incipient poisoned samples in the training dataset are weak enough to embed the backdoor without activating it. Against our attack, the last-hidden-layer activations of the training samples predicted as the target class and of their translated versions form 2 clusters, which is the same number of clusters as for the samples in clean classes.
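A sketch of the clustering step, as we understand it from [17], is shown below; the two-dimensional t-SNE embedding and the default Meanshift bandwidth are assumptions.

```python
from sklearn.manifold import TSNE
from sklearn.cluster import MeanShift

def aegis_cluster_count(activations):
    # Embed the last-hidden-layer activations with t-SNE and count the clusters found
    # by Meanshift; an extra cluster compared with clean classes suggests a backdoor.
    embedded = TSNE(n_components=2).fit_transform(activations)
    labels = MeanShift().fit_predict(embedded)
    return len(set(labels))
```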

Fig. 14

Experiment results of AEGIS defense against various scenarios. a clean class; b Badnets; c ECBA

5 Conclusion

In this paper, we exploit the specific feature that a DNN model makes its predictions based on pixel gradients and propose two types of pixel-gradient-based backdoor attack schemes, named the enhanced backdoor attack (EBA) and the enhanced coalescence backdoor attack (ECBA). The proposed attacks use different triggers in the training phase and the testing phase. This design allows the attacker to embed the backdoor into the DNN model through data poisoning while preventing the incipient triggers from activating it. As a result, the proposed attacks can effectively inject the backdoor and maintain an attack success rate as high as that of the baseline backdoor attack. Meanwhile, the classification accuracy of the backdoor model on clean samples is not affected, and the backdoor is consistently injected despite unbalanced ratios between the different kinds of incipient poisoned samples. On the other hand, theoretical analysis and extensive experiments show that the ECBA can evade multiple backdoor detection and defense methods. Therefore, this work poses a new threat to DNN models and new challenges to existing defense schemes.