The Hacking of ChatGPT Is Just Getting Started

As a result, jailbreak authors have become more creative. The most prominent jailbreak was DAN, where ChatGPT was told to pretend it was a rogue AI model called Do Anything Now. This could, as the name implies, avoid OpenAI’s policies dictating that ChatGPT shouldn’t be used to produce illegal or harmful material. To date, people have created around a dozen different versions of DAN.

However, many of the latest jailbreaks involve combinations of methods—multiple characters, ever more complex backstories, translating text from one language to another, using elements of coding to generate outputs, and more. Albert says it has been harder to create jailbreaks for GPT-4 than the previous version of the model powering ChatGPT. However, some simple methods still exist, he claims. One recent technique Albert calls “text continuation” says a hero has been captured by a villain, and the prompt asks the text generator to continue explaining the villain’s plan.

When we tested the prompt, it failed to work, with ChatGPT saying it cannot engage in scenarios that promote violence. Meanwhile, the “universal” prompt created by Polyakov did work in ChatGPT. OpenAI, Google, and Microsoft did not directly respond to questions about the jailbreak created by Polyakov. Anthropic, which runs the Claude AI system, says the jailbreak “sometimes works” against Claude, and it is consistently improving its models.

“As we give these systems more and more power, and as they become more powerful themselves, it’s not just a novelty, that’s a security issue,” says Kai Greshake, a cybersecurity researcher who has been working on the security of LLMs. Greshake, along with other researchers, has demonstrated how LLMs can be impacted by text they are exposed to online through prompt injection attacks.

In one research paper published in February, reported on by Vice’s Motherboard, the researchers were able to show that an attacker can plant malicious instructions on a webpage; if Bing’s chat system is given access to the instructions, it follows them. The researchers used the technique in a controlled test to turn Bing Chat into a scammer that asked for people’s personal information….

Source…