Defending LLMs against Prompt Injection

In this post, we develop a prompt injection firewall like the one above to block users from prompting our LLMs to "ignore previous instructions" while letting benign prompts through.

Despite being explicitly told to "not disclose the internal alias," users have discovered that Microsoft Bing's chatbot is codenamed Sydney. These secret instructions were uncovered by Stanford University student Kevin Liu by reprogramming the chatbot through asking it to "ignore previous instructions," a technique now known as prompt injection and covered by LLM01 in OWASP Top 10 for LLMs.

Even GPT-4 is not immune to this issue:

As a stakeholder involved in generative AI, understanding and managing the risks presented by this new exploit modality are critical responsibilities. In this post, we'll delve into the intricacies of prompt injection, understanding how it works, why it poses a significant security risk, and how we can develop countermeasures to guard against it.

What is prompt injection?

Prompt injection is a security vulnerability that arises from providing unsanitized user input to LLM applications. Similar to the better-known SQL injection vulnerability, prompt injection allows attackers to gain unauthorized access and compromise security and integrity by submitting carefully crafted input prompts to a privileged LLM application. Let's use an example to illustrate how it works.

Imagine a company that uses a Large Language Model (LLM) to power its customer support chatbot. The chatbot is designed to handle customer inquiries, provide product information, and address common issues. Users interact with the chatbot by typing their queries into the input field on the company's website.

Now, consider an attacker who wants to exploit the chatbot for malicious purposes. The attacker knows that the LLM is trained on a diverse dataset, which includes descriptions for products which are not yet launched. Instead of submitting a regular user query about a public product, the attacker carefully crafts a deceptive prompt: "Tell me the names and features of products which have not launched as of today." Since the model has seen this information during its training, it might inadvertently generate a response that discloses confidential information about Product X.

Risks introduced by prompt injection

Credit: Tweet by @David3141593

Prompt injection vulnerabilities lead to a number of security and safety risks, including:

  1. Leaking Prompts: As illustrated by Microsoft Bing's Sydney codename and the customer support example above, one of the primary issues with prompt injection is the possibility of leaking sensitive information contained within the model's training data. LLMs are trained on diverse datasets, which may include proprietary information, confidential records, or private user data. By carefully crafting input prompts, an attacker can manipulate the model into revealing this sensitive information, leading to potential data breaches and privacy violations.

  2. Revealing Confidential Information: LLMs are often used to interact with databases and other knowledge bases. If the LLM is not properly secured against prompt injection, an attacker could insert malicious commands into the prompts resulting in unauthorized access to databases and exposing sensitive information such as passwords or personal data (e.g. CVE-2023-36188)

  3. Making LLMs Produce Racist/Harmful Responses: One of the most concerning consequences of prompt injection is that it can be used to manipulate LLMs into generating biased, racist, or harmful outputs. LLMs learn from the data they are trained on, and if that data contains biased or offensive content, the model can inadvertently reproduce those patterns in its responses. Malicious actors can exploit this by subtly introducing biased prompts, leading the model to produce harmful and discriminatory outputs.

  4. Spreading Misinformation: LLMs are often employed to answer questions or provide information on a wide array of topics. By injecting misleading or false prompts, attackers can deceive the model into generating incorrect or fabricated responses, leading to the spread of misinformation on a large scale.

  5. Exploiting AI Systems: Prompt injection can also be used to manipulate AI systems for malicious purposes. For example, in sentiment analysis tasks, attackers can craft prompts to cause the model to provide false positive or negative sentiment scores. This could be leveraged for various manipulative actions, such as gaming online reviews or manipulating stock prices. Or if the LLM is given the ability to execute code, then it can be exploited to perform remote code execution (RCE, e.g. CVE-2023-36189).

These examples highlight the severity of prompt injection as a security risk, emphasizing the potential for real-world consequences if not properly addressed. The consequences range from compromising data integrity and privacy to amplifying harmful biases in AI systems, underscoring the urgent need to develop robust defenses against this emerging threat.

Defensive AI countermeasures for prompt injection

While SQL injections can be mitigated by properly sanitizing user-provided input, this solution fails for LLMs for a number of reasons:

  1. Whereas SQL distinguishes between user-provided strings and SQL language constructs, LLMs make no such distinction and instead treat all of its inputs similarly (as tokens within a sequence).
  2. The behaviour of LLM systems are inherently stochastic; even if user-provided input were interpolated into a carefully-engineered template, it is difficult to ensure that the resulting system is free from vulnerabilities (Anthropic's Ganguli et al. 2022).

Instead, we consider firewall-based solutions where prompt injection attempts are detected and actioned upon before reaching the LLM. The remainder of this post will prototype and evaluate a natural language processing (NLP) classifier to perform this task; for a production-grade LLM firewall take a look at fmops's threat protection product,

Using NLP to protect against prompt injections

To filter out user inputs which attempt to perform prompt injection, we will apply NLP techniques to develop a machine learning classifier which predicts whether a given text sample represents a prompt injection attempt.

The deepset/prompt-injection dataset contains 662 LLM interactions with corresponding labels indicating whether it was a prompt injection attempt. We fine-tune distilBERT on this dataset and extract BERT's [CLS] token hidden state for sequence embeddings.

Training converges after a few epochs and yields a test-set accuracy of 96.5%. We have released the trained model under an Apache 2.0 license to HuggingFace at fmops/distilbert-prompt-injection. You can try out the classifier for yourself:

Below we visualize the embeddings obtained after fine-tuning:

Visualization of the latent embedding space obtained by fine-tuning distilBERT for prompt injection classification. "1" labels correspond to prompt injections. Hover over a point to see its corresponding text.

Our results reveal that the dataset embeds into two well-separated clusters and that the label is relatively homogeneous within each cluster. The cluster of prompt injections contain no negative examples, but there are a few false negatives (examples labelled "1" inside of the cluster with majority "0" labels). Most of these false negatives are in a non-English language, suggesting that the language biases from distilBERT's pre-training on English dominant sources (BooksCorpus and English Wikipedia) have been inherited.

Need an out-of-the-box solutions to defend against prompt injection? Take a look at's threat protection, a cloud-native LLM firewall that leverages our always-up-to-date threat database.