Gradient-based Adversarial Attacks against Text Transformers

Abstract
We introduce the first general-purpose gradient-based attack targeting transformer models. Rather than focusing on finding a single adversarial example, we aim to discover a distribution of adversarial examples represented by a continuous matrix, which allows for gradient-based optimization. Through experiments, we show that our white-box attack achieves state-of-the-art performance across various natural language tasks. Additionally, we demonstrate that our approach enables a strong black-box transfer attack, which, by sampling from the adversarial distribution, performs as well as or better than existing methods, all while only needing hard-label outputs.
Introduction
Deep neural networks are sensitive to small, often imperceptible changes in the input, as evidenced by the existence of so-called adversarial examples.
The primary approach for generating adversarial examples involves defining an adversarial loss that promotes prediction errors and then minimizing this loss using standard optimization methods. To make the perturbation less noticeable to humans, existing techniques often include a perceptibility constraint in the optimization process. This general strategy has been effectively applied to image and speech data in various forms.
However, optimization-based approaches for generating adversarial examples are significantly more difficult when applied to text data. Attacks on continuous data types, like images and speech, can leverage gradient descent for greater efficiency, but the discrete nature of natural language makes the use of first-order techniques unfeasible. Additionally, while perceptibility in continuous data can be measured using L₂-norms or L∞-norms, these metrics do not directly apply to text data. To address this, some existing attack methods have turned to heuristic word replacement techniques and optimization through greedy or beam search using black-box queries. However, these heuristic approaches often result in unnatural modifications that are grammatically or semantically incorrect.
We present GBDA (Gradient-based Distributional Attack), a framework for gradient-based adversarial attacks on transformer models over text. GBDA addresses the challenge of applying gradient descent to discrete data by searching for an adversarial distribution rather than a single example, using the Gumbel-softmax distribution. We keep the perturbations imperceptible and the text fluent by incorporating BERTScore and language model perplexity as differentiable soft constraints. This approach enables efficient and powerful text-based adversarial attacks.

Our approach addresses several limitations of existing adversarial NLP methods. For example, some black-box attacks generate token-level replacement candidates and rescore all possible combinations, leading to an exponentially large search space for rare or complex words, while mask-prediction methods must modify tokens sequentially in some arbitrary order. In contrast, GBDA perturbs all token positions jointly and uses a language model constraint to enforce fluency directly, making our method more efficient and applicable to any model that takes text sequences as input.
We demonstrate that GBDA works effectively against several transformer models. We also test it in a black-box scenario by using the optimized adversarial distribution on a different target model. On tasks like news categorization, sentiment analysis, and natural language inference, our method achieves top attack success rates while preserving fluency, grammar, and meaning.
In summary, the main contributions are as follows:
We define a parameterized distribution of adversarial examples and optimize it using gradient-based methods. In contrast, most prior works construct a single adversarial example using black-box search.
By incorporating differentiable fluency and semantic similarity constraints into the adversarial loss, our white-box attack produces more natural adversarial texts while setting a new state-of-the-art success rate.
The adversarial distribution can be sampled efficiently to query different target models in a black-box setting. This enables a powerful transfer attack that matches or exceeds the performance of existing black-box attacks. Compared to prior work that operates on continuous-valued outputs from the target model, this transfer attack only requires hard-label outputs.
Background
Adversarial examples are a type of robustness attack against neural networks. Let h : X → Y be a classifier, where X is the input domain and Y is the output domain. If x ∈ X is a test input that the model correctly classifies as y = h(x), an (untargeted) adversarial example is a sample x₀ ∈ X where h(x₀) ≠ y, but x₀ is imperceptibly close to x. Perceptibility ensures that x₀ maintains the semantic meaning of x for human observers. In image data, perceptibility is often measured by distance metrics like Euclidean or Chebyshev distance. More generally, we define a perceptibility metric ρ : X × X → R≥0, with a threshold ε > 0, so that x₀ is considered imperceptible if ρ(x, x₀) ≤ ε.
The process of finding an adversarial example is usually framed as an optimization problem. For classification, the model h produces a logit vector φh(x) ∈ R^K, where y = arg max_k φh(x)_k. To induce misclassification, we define an adversarial loss such as the margin loss ℓ_margin(x₀, y; h) = max(φh(x₀)_y − max_{k≠y} φh(x₀)_k + κ, 0), which reaches 0 only when the model misclassifies x₀ by a margin of at least κ > 0. This margin loss is commonly used in image-based attack algorithms (Carlini and Wagner, 2017).
To create an adversarial example, we can set it up as a constrained optimization problem:
minimize x₀ ∈ X  ℓ_margin(x₀, y; h)  subject to  ρ(x, x₀) ≤ ε
Alternatively, we can turn the constraint into a softer one with a penalty factor λ > 0:
minimize x₀ ∈ X  ℓ_margin(x₀, y; h) + λ · ρ(x, x₀)
This softer version can be solved using gradient-based methods if ρ is differentiable.
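For concreteness, the margin loss can be sketched in a few lines of numpy (a minimal sketch with toy logit values, not tied to any particular model):

```python
import numpy as np

def margin_loss(logits, y, kappa=5.0):
    """Margin loss: zero once the true class y is outranked by at least
    kappa; positive otherwise, so minimizing it drives misclassification."""
    logits = np.asarray(logits, dtype=float)
    top_other = np.max(np.delete(logits, y))  # best competing logit
    return max(logits[y] - top_other + kappa, 0.0)

# A confidently correct prediction incurs a large loss ...
print(margin_loss([10.0, 1.0, 0.5], y=0))            # 14.0
# ... while a misclassification by margin kappa incurs zero loss.
print(margin_loss([1.0, 10.0, 0.5], y=0, kappa=5.0)) # 0.0
```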
Text Adversarial Examples
While the search problem works well for continuous data like images and speech, it doesn't directly apply to text data for two main reasons: (1) the data space X is discrete, so gradient-based optimization can't be used, and (2) defining a perceptibility constraint ρ for text is difficult. These issues arise for any discrete data, but they are especially important for text due to the sensitivity of language. For example, adding the word "not" to a sentence can completely change its meaning, even if the edit distance at the token level is just 1.
Several attack algorithms have been developed to address the challenges of adversarial text attacks. For character-level attacks, perceptibility is often measured by the number of character edits, such as replacements, swaps, or deletions. Word-level attacks use methods like synonym substitution or replacing words with similar embeddings. More recent approaches leverage masked language models like BERT to generate word substitutions. These attacks typically generate a set of candidate perturbations and optimize the adversarial loss using methods like greedy search or beam search.
Despite many attempts, attacks on natural language models are less effective than those on other data types. Character and word-level changes are often easy to detect because they can cause misspellings, grammatical errors, and unnatural text. Additionally, many previous attacks treat the target model as a black-box and use inefficient methods to minimize the adversarial loss, leading to suboptimal results.
For example, BERT-Attack, one of the top attacks against BERT, only lowers the model's accuracy on the AG News dataset from 94.2% to 10.6%. In contrast, attacks on image models can often reduce accuracy to nearly 0 across most tasks. This performance gap raises the question of whether gradient-based methods could create more fluent and effective adversarial examples for text. In this work, we show that our gradient-based attack reduces the same model's accuracy from 94.2% to 2.5%, while maintaining better semantic fidelity to the original text.
Other Attacks
While most adversarial text attacks follow the formulation we outlined in section 2, there are other types as well. One example is the universal adversarial trigger—a short piece of text that, when added to any input, causes the model to misclassify. However, these triggers often contain unnatural word or token combinations, making them easily detectable by humans.
Our work is part of the broader field of adversarial learning, where many studies have explored adversarial examples across different data types. While image data is the most widely studied, adversarial examples have also been created for speech and graphs.
GBDA: Gradient-based Distributional Attack
In this section, we detail GBDA, our general-purpose framework for gradient-based text attacks against transformers. Our framework leverages two important insights: (1) we define a parameterized adversarial distribution that enables gradient-based search using the Gumbel-softmax; and (2) we promote fluency and semantic faithfulness of the perturbed text using soft constraints on both perplexity and semantic similarity.
Adversarial Distribution
Let z = z₁z₂ · · · zn be a sequence of tokens, where each token zᵢ is from a fixed vocabulary V = {1, ..., V}. Consider a distribution PΘ, parameterized by a matrix Θ ∈ Rⁿˣᵥ, that generates samples z by independently sampling each token zᵢ from a Categorical distribution with probabilities πᵢ = Softmax(Θᵢ). We aim to optimize Θ such that samples z drawn from PΘ are adversarial examples for the model h. To do this, we define the objective function as:
minimize Θ ∈ Rⁿˣᵥ  Ez∼PΘ ℓ(z, y; h),
where ℓ is the chosen adversarial loss.
The objective function above is non-differentiable because the categorical distribution is discrete. To address this, we relax the problem by extending the model h to accept probability vectors as input, and use the Gumbel-softmax approximation (Jang et al., 2016) to the categorical distribution, which yields gradients for optimization.
Transformer models convert input tokens into embedding vectors using a lookup table. Let e(·) be the embedding function, where e(zᵢ) is the embedding for token zᵢ. If πᵢ is a probability vector for token zᵢ, we define its corresponding embedding as:
e(πᵢ) = Σⱼ₌₁ⱽ (πᵢ)ⱼ e(j),
where e(j) is the embedding of token j, and (πᵢ)ⱼ is the probability of selecting token j. If πᵢ is a one-hot vector for token zᵢ, then e(πᵢ) = e(zᵢ). For a sequence of probability vectors π = π₁ · · · πₙ, we combine their embeddings as:
e(π) = e(π₁) · · · e(πₙ).
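Concretely, the relaxed embedding e(π) is a probability-weighted mixture of the embedding table's rows, i.e., a single matrix product. A minimal numpy sketch (toy sizes and a random table E, purely illustrative):

```python
import numpy as np

V, d, n = 5, 4, 3                      # toy vocab size, embed dim, length
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))        # embedding table: e(j) = E[j]

def embed(pi):
    """e(pi_i) = sum_j (pi_i)_j e(j) per position, as one matrix product."""
    return pi @ E                      # (n, V) @ (V, d) -> (n, d)

# A one-hot row recovers the ordinary token embedding exactly.
pi = np.zeros((n, V))
pi[0, 2] = 1.0                         # position 0 is exactly token 2
pi[1:] = 1.0 / V                       # other positions: uniform mixtures
print(np.allclose(embed(pi)[0], E[2])) # True
```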
To compute gradients, we extend the model to take probability vectors as input and use the Gumbel-SoftMax approximation. This allows us to estimate smooth gradients for the objective. We sample π˜ = π˜₁ · · · π˜ₙ from the Gumbel-SoftMax distribution using the following process:
(π̃ᵢ)ⱼ = exp((Θᵢ,ⱼ + gᵢ,ⱼ) / T) / Σᵥ₌₁ⱽ exp((Θᵢ,ᵥ + gᵢ,ᵥ) / T),
where the gᵢ,ⱼ are i.i.d. samples from the Gumbel(0, 1) distribution, and T > 0 is a temperature parameter controlling smoothness. As T → 0, the Gumbel-softmax distribution converges to the categorical distribution.
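The sampling step above can be sketched directly in numpy (a minimal sketch; Gumbel(0, 1) noise is drawn via the inverse-CDF trick −log(−log U)):

```python
import numpy as np

def gumbel_softmax_sample(theta, T=1.0, rng=None):
    """Draw one relaxed sample per position from parameters theta (n x V):
    add Gumbel(0, 1) noise, then take a softmax at temperature T."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=theta.shape)))  # Gumbel(0, 1) noise
    z = (theta + g) / T
    z -= z.max(axis=-1, keepdims=True)                   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

theta = np.zeros((3, 5))               # toy parameters: 3 positions, vocab 5
pi = gumbel_softmax_sample(theta, T=0.5, rng=np.random.default_rng(0))
print(pi.shape, np.allclose(pi.sum(axis=-1), 1.0))       # (3, 5) True
```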
We then optimize Θ using gradient descent with the smooth approximation:
minimize Θ  Eπ̃∼P̃Θ ℓ(e(π̃), y; h),
where the expectation is estimated using stochastic samples from π˜.
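The relaxed objective can then be minimized end to end with a standard optimizer. A minimal PyTorch sketch, where a random linear classifier with mean pooling stands in for the transformer h (all sizes, the classifier, and the hyperparameters are toy illustrations, not the paper's setup):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, V, d, K = 3, 5, 4, 2                  # toy: length, vocab, embed dim, classes
E = torch.randn(V, d)                    # frozen embedding table e(.)
W = torch.randn(d, K)                    # toy linear "classifier" standing in for h
y, kappa = 0, 5.0                        # clean label and margin

theta = torch.zeros(n, V, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.3)

for _ in range(50):
    # Differentiable sample from the Gumbel-softmax distribution
    pi = F.gumbel_softmax(theta, tau=1.0, hard=False)    # (n, V)
    logits = (pi @ E).mean(dim=0) @ W    # embed mixture, mean-pool, classify
    loss = torch.clamp(logits[y] - logits[1 - y] + kappa, min=0.0)  # margin loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Hard samples z ~ P_Theta can then be drawn from the optimized parameters.
z = torch.distributions.Categorical(logits=theta).sample()
```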
Soft Constraints
Black-box attacks using heuristic replacements can only limit the perturbation by proposing changes within certain constraints, such as limiting edit distance or using words with similar embeddings. In contrast, our adversarial distribution approach can easily incorporate any differentiable constraint into the objective. This allows us to include fluency and semantic similarity constraints, resulting in more fluent and semantically faithful adversarial texts.
Causal language models (CLMs) are trained to predict the next token in a sequence by maximizing the likelihood of the given tokens. This allows us to compute the likelihood of any token sequence. Given a CLM g with log-probability outputs, the negative log-likelihood (NLL) of a sequence x = x₁ · · · xₙ is computed autoregressively:
NLLg(x) = − Σᵢ₌₁ⁿ log pg(xᵢ | x₁, ..., xᵢ₋₁).
In our adversarial distribution approach, since the inputs are token probability vectors, we extend the NLL definition to:
NLLg(π) = − Σ log pg(πᵢ | π₁, ..., πᵢ₋₁),
where −log pg(πᵢ | π₁, ..., πᵢ₋₁) is the cross-entropy between the predicted next-token distribution g(π₁, ..., πᵢ₋₁) and the current token distribution πᵢ, i.e., log pg(πᵢ | π₁, ..., πᵢ₋₁) = Σⱼ (πᵢ)ⱼ g(π₁, ..., πᵢ₋₁)ⱼ. This extension matches the NLL of a token sequence x when each πᵢ is a delta distribution on the token xᵢ, since then log pg(πᵢ | π₁, ..., πᵢ₋₁) = g(x₁, ..., xᵢ₋₁)ₓᵢ.
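The extended NLL is just a sum of per-position cross-entropies. A minimal numpy sketch, with a fixed log-probability table standing in for the CLM g (toy values, purely illustrative):

```python
import numpy as np

def soft_nll(lm_logprobs, pi):
    """NLL_g(pi) = -sum_i sum_j (pi_i)_j log p_g(j | prefix): at each
    position, the cross-entropy between the LM's predicted next-token
    log-probabilities and the token distribution pi_i."""
    return -np.sum(pi * lm_logprobs)

V, n = 4, 3
rng = np.random.default_rng(1)
logits = rng.standard_normal((n, V))
lm_logprobs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # toy LM

# With one-hot pi, this reduces to the ordinary token-sequence NLL.
tokens = [2, 0, 3]
pi = np.eye(V)[tokens]
assert np.isclose(soft_nll(lm_logprobs, pi),
                  -sum(lm_logprobs[i, t] for i, t in enumerate(tokens)))
```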
Previous word-level attacks often used context-free embeddings like word2vec and GloVe or synonym substitution to ensure semantic similarity between the original and modified text. However, these methods often lead to unnatural changes that can alter the meaning of the text. Instead, we use BERTScore, a similarity score that measures the semantic similarity between tokens in contextualized embeddings from a transformer model. This approach better preserves the meaning and fluency of the text.
Let x = x₁ · · · xₙ and x' = x'₁ · · · x'ₘ be two token sequences, and let g be a language model that produces contextualized embeddings φ(x) = (v₁, ..., vₙ) and φ(x') = (v'₁, ..., v'ₘ). The BERTScore between x and x' (specifically recall) is defined as:
RBERT(x, x') = Σᵢ₌₁ⁿ wᵢ · maxₖ vᵢᵀ v'ₖ,
where wᵢ is the normalized inverse document frequency (idf) of token xᵢ, computed across a corpus. We can replace x' with a sequence of probability vectors π = π₁ · · · πₘ, as described, and use ρg(x, π) = 1 − RBERT(x, π) as a differentiable soft constraint.
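The recall-based BERTScore soft constraint can be sketched as follows, with random unit-normalized vectors standing in for the contextualized embeddings and uniform weights standing in for the idf weights (both illustrative assumptions):

```python
import numpy as np

def bertscore_recall(v, v_prime, w):
    """R_BERT(x, x') = sum_i w_i max_k v_i . v'_k, with contextual
    embeddings v, v' (rows unit-normalized) and idf weights w summing to 1."""
    sim = v @ v_prime.T                 # (n, m) pairwise cosine similarities
    return float(np.sum(w * sim.max(axis=1)))

n, d = 4, 8
rng = np.random.default_rng(2)
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
w = np.full(n, 1.0 / n)                 # uniform stand-in for idf weights

# Identical sequences score exactly 1, so rho_g = 1 - R_BERT is 0.
print(round(bertscore_recall(v, v, w), 6))  # 1.0
```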
The objective function combines the components from earlier sections for gradient-based optimization. It uses the margin loss as the adversarial loss and incorporates two soft constraints: the fluency constraint with a causal language model g and the BERTScore similarity constraint using contextualized embeddings of g. The objective is:
L(Θ) = Eπ̃∼P̃Θ [ ℓ(e(π̃), y; h) + λₗₘ NLLg(π̃) + λₛᵢₘ ρg(x, π̃) ],
where λₗₘ and λₛᵢₘ are hyperparameters that control the strength of the fluency and similarity constraints. We minimize this function stochastically using the Adam optimizer by sampling a batch of inputs from P˜Θ at each iteration.
Sampling Adversarial Texts
Once Θ is optimized, we can sample from the adversarial distribution PΘ to create adversarial examples. However, since the loss function L(Θ) is an approximation of the true objective, some samples may not be adversarial even if L(Θ) is minimized. Therefore, in practice, we draw multiple samples z ∼ PΘ and stop either when the model misclassifies a sample or when we reach a maximum number of samples.
In principle, we could additionally filter out unnatural samples; in practice, we only verify that the generated example is misclassified by the model.
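The sampling procedure can be sketched as follows; the classifier here is a toy stand-in that only needs to return hard labels (all names and sizes are illustrative):

```python
import numpy as np

def sample_until_adversarial(theta, classify, y, max_samples=100, rng=None):
    """Draw token sequences z ~ P_Theta (an independent categorical per
    position with probabilities Softmax(Theta_i)) until the classifier's
    hard-label prediction differs from y or the budget runs out."""
    rng = rng or np.random.default_rng()
    probs = np.exp(theta - theta.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # row-wise softmax of Theta
    n, V = probs.shape
    for q in range(1, max_samples + 1):
        z = np.array([rng.choice(V, p=probs[i]) for i in range(n)])
        if classify(z) != y:                        # hard-label query only
            return z, q                             # success after q queries
    return None, max_samples

# Toy hard-label classifier: predicts 1 iff token 0 appears anywhere.
classify = lambda z: int((z == 0).any())
theta = np.zeros((4, 3))                            # near-uniform P_Theta
z, q = sample_until_adversarial(theta, classify, y=1,
                                rng=np.random.default_rng(0))
```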
We can use the adversarial examples generated from PΘ to attack a different target model, creating a black-box transfer attack. Unlike most black-box attacks that require continuous-valued scores from the target model, our approach does not need this. We show in subsection 4.2 that this transfer attack, based on the adversarial distribution PΘ, is highly effective against various target models.

Experiments
In this section, we empirically validate our attack framework on a variety of natural language tasks. Code to reproduce our results is open sourced on GitHub.
Setup
Tasks. We evaluate on several benchmark text classification datasets, including DBPedia and AG News for article/news categorization, Yelp Reviews and IMDB for binary sentiment classification, and MNLI for natural language inference. The MNLI dataset contains two evaluation sets: matched (m.) and mismatched (mm.), corresponding to whether the test domain is matched or mismatched with the training distribution.
Models. We attack three transformer architectures with our gradient-based white-box attack: GPT-2 (Radford et al., 2019), XLM (Lample and Conneau, 2019) (using the en-de cross-lingual model), and BERT (Devlin et al., 2019). For BERT, we use finetuned models from TextAttack (Morris et al., 2020b) for all tasks except for DBPedia, where finetuned models are unavailable. For BERT on DBPedia and GPT-2/XLM on all tasks, we finetune a pretrained model to serve as the target model.
The soft constraints in subsection 3.2 use a causal language model (CLM) g with the same tokenizer as the target model. For GPT-2, we use the pre-trained GPT-2 model without finetuning as g. For XLM, we use a checkpoint obtained after finetuning with the CLM objective. For masked language models like BERT, we train a causal language model g on WikiText-103 using the same tokenizer as the target model.
We compare our method with recent attacks on text transformers: TextFooler, BAE, and BERT-Attack. All baselines are tested on finetuned BERT models from the TextAttack library. We evaluate both our white-box attack on the finetuned BERT model and a transfer attack using the GPT-2 model for a fair comparison. See subsection 4.2 for more details on both attack settings.
We optimize the adversarial distribution parameter Θ using Adam with a learning rate of 0.3, a batch size of 10, and 100 iterations. Θ is initialized to zero except at the clean input's own tokens: Θᵢ,ⱼ = C for j = xᵢ, the i-th token of the clean input, with C ∈ {12, 15}. We set λₗₘ = 1 and cross-validate λₛᵢₘ ∈ [20, 200] and κ ∈ {3, 5, 10} using held-out data.
Quantitative Evaluation
We first evaluate our attack performance in the white-box setting. Table 1 shows the results of our attacks against GPT-2, XLM (en-de), and BERT on different benchmark datasets. For each task, we randomly select 1000 inputs from the test set as attack targets. After optimizing Θ, we draw up to 100 samples z ∼ PΘ until the model misclassifies z. The model's accuracy after the attack ("Adv. Acc.") is based on the last drawn sample.
Our attack successfully generates adversarial examples for all three models across the five benchmark datasets. The test accuracy is reduced to below 10% for nearly all models and tasks. We also evaluate the semantic similarity between the adversarial example and the original input using cosine similarity from Universal Sentence Encoders (USE). Our attack consistently maintains a high level of semantic similarity, with cosine similarity often higher than 0.8.

We evaluate our attack in a black-box transfer setting, where we optimize the adversarial distribution PΘ on GPT-2 for each model and task. After optimizing, we draw up to 1000 samples from PΘ and evaluate them on the target BERT model. Unlike prior work, our attack only requires the target model to output a discrete label to determine when to stop sampling, whereas previous attacks relied on continuous-valued outputs like class probabilities.
Table 3 shows that our attack, when transferred to finetuned BERT classifiers, significantly reduces the target model's accuracy, often outperforming BERT-Attack with fewer queries. Additionally, the cosine similarity between the original input and the adversarial example is higher than that of BERT-Attack.

We also tested our transfer attack on three additional finetuned transformer models: ALBERT, RoBERTa, and XLNet, using the same Θ optimized on GPT-2. The results in Figure 4 show that the attack performance is similar to the transfer attack against BERT (Table 3), demonstrating that our adversarial distribution PΘ can capture common failure modes across different transformer models. This highlights the effectiveness of our approach, requiring minimal access to the target model for successful attacks in real-world systems.
Analysis

Figure 2 illustrates how the similarity constraint (λsim) affects the performance of the transfer attack on GPT-2 for the AG News dataset. The different colors represent various target models, with darker shades indicating higher values of λsim (50, 20, 10). As λsim increases, the perturbation becomes less aggressive, but more queries are needed to reach the same adversarial accuracy. This shows a trade-off between the similarity constraint, attack success rate, and query budget.
Table 5 shows the effect of the fluency constraint on adversarial examples for GPT-2 on the AG News dataset. By keeping all hyperparameters constant except for the fluency regularization constant λlm, we generate successful adversarial texts after optimizing Θ. The fluency constraint helps ensure that the generated text has valid word combinations and proper grammar. Without it, the adversarial examples often contain nonsensical words, highlighting the importance of fluency in creating more natural-looking adversarial texts.
Our attack works directly on tokenized inputs, but classification systems typically receive raw text, which is tokenized before being processed by the model. This means that an adversarial example we generate might not be re-tokenized to the same set of tokens when converted back to raw text, potentially causing issues.
In one example, our adversarial text has the tokens "jui-" and "cy," which decode to "juicy" but, when re-encoded, might become "juic-" and "y." However, these re-tokenization artifacts are rare (with a "token error rate" of around 2%) and don't significantly affect adversarial accuracy. Even with re-tokenization, the example remains adversarial. If needed, a potential solution is to re-sample from the adversarial distribution PΘ until the text is stable under re-tokenization. All of our adversarial accuracy results account for this re-tokenization process.
Our method requires both forward and backward passes through the attacked model, the language model, and the similarity model, which increases the time per query compared to black-box attacks that only use forward passes. However, this is offset by the more efficient optimization process, resulting in a total runtime of about 20 seconds per generated example, which is comparable to black-box attacks like BERT-Attack.
Conclusion and Future Work
We introduced GBDA, a framework for gradient-based white-box attacks on text transformers. The key idea is to formulate a parameterized adversarial distribution, rather than a single adversarial example, which allows for efficient gradient-based optimization and the use of differentiable constraints like perplexity and BERTScore. Our attack performs exceptionally well on various natural language tasks and model architectures, achieving state-of-the-art results in both white-box and black-box transfer attack settings.

A key limitation of our method is its focus on token replacements, as the Gumbel-softmax-based adversarial distribution does not easily extend to token insertions or deletions. This restriction may impact the naturalness of the generated adversarial examples. We hope that future work can extend our adversarial distribution framework to include a wider range of token-level changes.
Another limitation is that the adversarial distribution PΘ is highly over-parameterized. Although most adversarial examples only require a few token changes, the parameter matrix Θ is n×V in size, which becomes excessive for longer sentences. Future work could focus on reducing the number of parameters without compromising the quality of the generated adversarial examples.





