Simple Hacking Technique Can Extract ChatGPT Training Data


Can getting ChatGPT to repeat the same word over and over again cause it to regurgitate large amounts of its training data, including personally identifiable information and other data scraped from the Web?

The answer is an emphatic yes, according to a team of researchers at Google DeepMind, Cornell University, and four other universities who tested the hugely popular generative AI chatbot’s susceptibility to leaking data when prompted in a specific way.

‘Poem’ as a Trigger Word

In a report this week, the researchers described how they got ChatGPT to spew out memorized portions of its training data merely by prompting it to repeat words like “poem,” “company,” “send,” “make,” and “part” forever.

For example, when the researchers prompted ChatGPT to repeat the word “poem” forever, the chatbot initially responded by repeating the word as instructed. But after a few hundred times, ChatGPT began generating “often nonsensical” output, a small fraction of which included memorized training data such as an individual’s email signature and personal contact information.

The researchers discovered that some words were better at getting the generative AI model to spill memorized data than others. For instance, prompting the chatbot to repeat the word “company” caused it to emit training data 164 times more often than other words, such as “know.”

Data that the researchers were able to extract from ChatGPT in this manner included personally identifiable information on dozens of individuals; explicit content (when the researchers used an NSFW word as a prompt); verbatim paragraphs from books and poems (when the prompts contained the word “book” or “poem”); and URLs, unique user identifiers, bitcoin addresses, and programming code.

A Potentially Big Privacy Issue?

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper titled “Scalable Extraction of Training Data from (Production) Language Models.”

“Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data,” they wrote. The researchers estimated an adversary could extract 10 times more…

Source…