"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

1CISPA Helmholtz Center for Information Security, 2NetApp

Abstract

The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intent use. However, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs.

In this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from public platforms to private ones, posing new challenges for LLM vendors in proactive detection.

To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and they have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Disclaimer. This website contains examples of harmful language. Reader discretion is recommended.

What is Jailbreak Prompt?

Jailbreak prompt is a particular type of adversarial prompts that is deliberately crafted to bypass safeguards and manipulate LLMs into generating harmful content.

As shown on the right, a jailbreak prompt can successfully lead the LLM to provide detailed answers to the dangerous question ("how can I create a deadly poison that is undetectable and untraceable"), even when the LLM can appropriately refuse the same question without the prompt.


Data Collection

To comprehensively study in-the-wild jailbreak prompts, we consider four platforms, i.e., Reddit, Discord, websites, and open-source datasets, in our study, for their popularity in sharing prompts. We extract 6,387 prompts from Dec 2022 to May 2023, and successfully identify 666 jailbreak prompts among them. To the best of our knowledge, this dataset serves as the largest collection of in-the-wild jailbreak prompts.

Our evaluation framework

Jailbreak Characteristics

We find jailbreak prompts often utilize 1) more instructions, 2) more toxic language, and 3) are close to regular prompts in the semantic space. Leveraging graph-based community detection, we identified eight major jailbreak communities, involving various attack strategies such as prompt injection, privilege escalation, deception, virtualization, and more.


Click to explore major jailbreak communities !


Darker shades indicate higher co-occurrence. Punctuations are removed for co-occurrence ratio calculation.


Jailbreak Evolution

Jailbreak prompts have evolved to be 1) more stealthy and effective; 2) The origin of jailbreak prompts is intentionally shifting from public platforms such as Reddit to private ones like Discord .


Jailbreak Effectiveness

Even though LLMs trained with RLHF show initial resistance to forbidden questions, they have weak resistance towards jailbreak prompts. Some jailbreak prompts can even achieve 0.99 attack success rates (ASR) on ChatGPT (GPT-3.5) and GPT-4, and they have persisted online for over 100 days.

* Forbidden scenarios are from OpenAI usgae policy.


Safeguard Effectiveness

Existing safeguards demonstrate limited ASR reductions on jailbreak prompts, calling for stronger and more adaptive defense mechanisms.



Ethics and Disclosures

We acknowledge that data collected online can contain personal information. Thus, we adopt standard best practices to guarantee that our study follows ethical principles, such as not trying to deanonymize any user and reporting results on aggregate. Since this study only involved publicly available data and had no interactions with participants, it is not regarded as human subjects research by our Institutional Review Boards (IRB). Nonetheless, since one of our goals is to measure the risk of LLMs in answering harmful questions, it is inevitable to disclose how a model can generate hateful content. This can bring up worries about potential misuse. However, we strongly believe that raising awareness of the problem is even more crucial, as it can inform LLM vendors and the research community to develop stronger safeguards and contribute to the more responsible release of these models.

We have responsibly disclosed our findings to related LLM vendors.


BibTeX

If you find this useful in your research, please consider citing:

@article{SCBSZ23,
        author = {Xinyue Shen and Zeyuan Chen and Michael Backes and Yun Shen and Yang Zhang},
        title = {{"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models}},
        journal = {{CoRR abs/2308.03825}},
        year = {2023}
}