“Jailbreaking” AI services such as ChatGPT and Anthropic’s Claude 3 Opus is surprisingly simple, according to AI researchers who have uncovered a potential vulnerability in the large language models (LLMs) that power these chatbots.
Known as “many-shot jailbreaking,” the exploit capitalizes on “in-context learning,” in which the AI absorbs information from the text prompts users write. The researchers detailed the flaw in a paper uploaded to the sanity.io cloud repository and tested it on Anthropic’s Claude 2 AI chatbot.
The study suggests that people could manipulate LLMs into generating risky responses, bypassing the built-in security measures meant to prevent such outcomes. These security protocols typically control how the AI responds to sensitive queries, such as those related to constructing a bomb.
LLMs like ChatGPT rely on a “context window,” the amount of text the model can take into account at once, to follow a conversation, and longer context windows allow for more detailed, relevant responses. However, the researchers note that these windows have expanded significantly since the beginning of 2023, making AI behavior more context-aware but also more exploitable.
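As a rough illustration of what a context window means in practice, the sketch below trims a conversation history to a fixed token budget before it is sent to a model. The token counting and helper names are assumptions for illustration only, not any vendor’s actual API.

```python
# Minimal sketch: keeping a conversation inside a fixed context window.
# Token counting here is a crude word-based stand-in for a real tokenizer.

def count_tokens(text: str) -> int:
    """Very rough token estimate; real systems use a model-specific tokenizer."""
    return len(text.split())

def fit_to_context(messages: list[str], max_tokens: int = 2048) -> list[str]:
    """Keep the most recent messages that fit within the context window."""
    kept, used = [], 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = ["User: Hi", "Assistant: Hello! How can I help?", "User: Tell me about context windows."]
print(fit_to_context(history, max_tokens=50))
```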
The attack involves crafting a fabricated conversation between a user and the AI in a text prompt, followed by a query designed to trigger a harmful response. Having learned from the input text, the AI may bypass its safety protocols and provide an answer. The attack’s success increases with the length of the “script” and the number of “shots,” or question-answer pairs, it includes.
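To make the mechanics concrete, here is a minimal sketch of how such a “script” of fabricated question-answer pairs might be assembled into a single prompt. The dialogue content and formatting are hypothetical placeholders, not the prompts used in the paper.

```python
# Minimal sketch of assembling a many-shot prompt: a long fabricated dialogue
# ("shots") followed by the final target question. Placeholder content only.

def build_many_shot_prompt(shots: list[tuple[str, str]], target_question: str) -> str:
    """Concatenate fabricated user/assistant turns, then append the real query."""
    lines = []
    for question, answer in shots:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {target_question}")
    lines.append("Assistant:")
    return "\n".join(lines)

# The researchers found the attack's success grows with the number of shots.
fabricated_shots = [("Placeholder question 1", "Placeholder compliant answer 1"),
                    ("Placeholder question 2", "Placeholder compliant answer 2")] * 50
prompt = build_many_shot_prompt(fabricated_shots, "Final placeholder question")
```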
The researchers found that adding an extra step to classify and modify the prompt before the AI processes it significantly reduced the attack’s success rate. While many-shot jailbreaking currently poses limited risks because today’s LLMs are not yet capable enough to cause serious harm, it could become more dangerous as more powerful models emerge.
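The mitigation described, classifying and modifying the prompt before the model sees it, might look roughly like the sketch below. The keyword-based classifier and its threshold are assumptions standing in for whatever filtering the researchers actually tested.

```python
# Minimal sketch of a pre-processing defense: classify the incoming prompt and
# modify or refuse it before it reaches the model. The classifier here is a
# placeholder; a real system would use a trained model, not a keyword check.

SUSPICIOUS_MARKERS = ["User:", "Assistant:"]  # crude signal of an embedded fake dialogue

def looks_like_many_shot(prompt: str, threshold: int = 20) -> bool:
    """Flag prompts containing an unusually long embedded conversation."""
    turns = sum(prompt.count(marker) for marker in SUSPICIOUS_MARKERS)
    return turns >= threshold

def preprocess(prompt: str) -> str:
    """Rewrite or block flagged prompts before passing them to the LLM."""
    if looks_like_many_shot(prompt):
        return "Request blocked pending review."  # or strip the embedded dialogue
    return prompt
```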
The researchers have alerted other AI companies and researchers to this vulnerability, emphasizing the need for proactive measures to prevent potential harm in the future.