Random Image Display on Page Reload

DeepSeek’s Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot

Jan 31, 2025 1:30 PM

DeepSeek’s Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot

Security researchers tested 50 well-known jailbreaks against DeepSeek’s popular new AI chatbot. It didn’t stop a single one.

DeepSeeks Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot

Photo-Illustration: Wired Staff/Getty Images

Ever since OpenAI released ChatGPT at the end of 2022, hackers and security researchers have tried to find holes in large language models (LLMs) to get around their guardrails and trick them into spewing out hate speech, bomb-making instructions, propaganda, and other harmful content. In response, OpenAI and other generative AI developers have refined their system defenses to make it more difficult to carry out these attacks. But as the Chinese AI platform DeepSeek rockets to prominence with its new, cheaper R1 reasoning model, its safety protections appear to be far behind those of its established competitors.

Today, security researchers from Cisco and the University of Pennsylvania are publishing findings showing that, when tested with 50 malicious prompts designed to elicit toxic content, DeepSeek’s model did not detect or block a single one. In other words, the researchers say they were shocked to achieve a “100 percent attack success rate.”

The findings are part of a growingbodyofevidence that DeepSeek’s safety and security measures may not match those of other tech companies developing LLMs. DeepSeek’s censorship of subjects deemed sensitive by China’s government has also been easily bypassed.

“A hundred percent of the attacks succeeded, which tells you that there’s a trade-off,” DJ Sampath, the VP of product, AI software and platform at Cisco, tells WIRED. “Yes, it might have been cheaper to build something here, but the investment has perhaps not gone into thinking through what types of safety and security things you need to put inside of the model.”

Other researchers have had similar findings. Separate analysis published today by the AI security company Adversa AI and shared with WIRED also suggests that DeepSeek is vulnerable to a wide range of jailbreaking tactics, from simple language tricks to complex AI-generated prompts.

DeepSeek, which has been dealing with an avalanche of attention this week and has not spoken publicly about a range of questions, did not respond to WIRED’s request for comment about its model’s safety setup.

Generative AI models, like any technological system, can contain a host of weaknesses or vulnerabilities that, if exploited or set up poorly, can allow malicious actors to conduct attacks against them. For the current wave of AI systems, indirect prompt injection attacks are considered one of the biggest security flaws. These attacks involve an AI system taking in data from an outside source—perhaps hidden instructions of a website the LLM summarizes—and taking actions based on the information.

Jailbreaks, which are one kind of prompt-injection attack, allow people to get around the safety systems put in place to restrict what an LLM can generate. Tech companies don’t want people creating guides to making explosives or using their AI to create reams of disinformation, for example.

Jailbreaks started out simple, with people essentially crafting clever sentences to tell an LLM to ignore content filters—the most popular of which was called “Do Anything Now” or DAN for short. However, as AI companies have put in place more robust protections, some jailbreaks have become more sophisticated, often being generated using AI or using special and obfuscated characters. While all LLMs are susceptible to jailbreaks, and much of the information could be found through simple online searches, chatbots can still be used maliciously.

“Jailbreaks persist simply because eliminating them entirely is nearly impossible—just like buffer overflow vulnerabilities in software (which have existed for over 40 years) or SQL injection flaws in web applications (which have plagued security teams for more than two decades),” Alex Polyakov, the CEO of security firm Adversa AI, told WIRED in an email.

Cisco’s Sampath argues that as companies use more types of AI in their applications, the risks are amplified. “It starts to become a big deal when you start putting these models into important complex systems and those jailbreaks suddenly result in downstream things that increases liability, increases business risk, increases all kinds of issues for enterprises,” Sampath says.

The Cisco researchers drew their 50 randomly selected prompts to test DeepSeek’s R1 from a well-known library of standardized evaluation prompts known as HarmBench. They tested prompts from six HarmBench categories, including general harm, cybercrime, misinformation, and illegal activities. They probed the model running locally on machines rather than through DeepSeek’s website or app, which send data to China.

Beyond this, the researchers say they have also seen some potentially concerning results from testing R1 with more involved, non-linguistic attacks using things like Cyrillic characters and tailored scripts to attempt to achieve code execution. But for their initial tests, Sampath says, his team wanted to focus on findings that stemmed from a generally recognized benchmark.

Cisco also included comparisons of R1’s performance against HarmBench prompts with the performance of other models. And some, like Meta’s Llama 3.1, faltered almost as severely as DeepSeek’s R1. But Sampath emphasizes that DeepSeek’s R1 is a specific reasoning model, which takes longer to generate answers but pulls upon more complex processes to try to produce better results. Therefore, Sampath argues, the best comparison is with OpenAI’s o1 reasoning model, which fared the best of all models tested. (Meta did not immediately respond to a request for comment).

Polyakov, from Adversa AI, explains that DeepSeek appears to detect and reject some well-known jailbreak attacks, saying that “it seems that these responses are often just copied from OpenAI’s dataset.” However, Polyakov says that in his company’s tests of four different types of jailbreaks—from linguistic ones to code-based tricks—DeepSeek’s restrictions could easily be bypassed.

“Every single method worked flawlessly,” Polyakov says. “What’s even more alarming is that these aren’t novel ‘zero-day’ jailbreaks—many have been publicly known for years,” he says, claiming he saw the model go into more depth with some instructions around psychedelics than he had seen any other model create.

“DeepSeek is just another example of how every model can be broken—it’s just a matter of how much effort you put in. Some attacks might get patched, but the attack surface is infinite,” Polyakov adds. “If you’re not continuously red-teaming your AI, you’re already compromised.”

Matt Burgess is a senior writer at WIRED focused on information security, privacy, and data regulation in Europe. He graduated from the University of Sheffield with a degree in journalism and now lives in London. Send tips to Matt_Burgess@wired.com. … Read more
Senior writer

Lily Hay Newman is a senior writer at WIRED focused on information security, digital privacy, and hacking. She previously worked as a technology reporter at Slate, and was the staff writer for Future Tense, a publication and partnership between Slate, the New America Foundation, and Arizona State University. Her work … Read more
Senior Writer

Read More

Exposed DeepSeek Database Revealed Chat Prompts and Internal Data

China-based DeepSeek has exploded in popularity, drawing greater scrutiny. Case in point: Security researchers found more than 1 million records, including user data and API keys, in an open database.
Lily Hay Newman

Scammers Are Creating Fake News Videos to Blackmail Victims

“Yahoo Boy” scammers are impersonating CNN and other news organizations to create videos that pressure victims into making blackmail payments.
Matt Burgess

Foreign Hackers Are Using Google’s Gemini in Attacks on the US

Plus: WhatsApp discloses nearly 100 targets of spyware, hackers used the AT&T breach to hunt for details on US politicians, and more.
Dhruv Mehrotra

A New Jam-Packed Biden Executive Order Tackles Cybersecurity, AI, and More

US president Joe Biden just issued a 40-page executive order that aims to bolster federal cybersecurity protections, directs government use of AI—and takes a swipe at Microsoft’s dominance.
Eric Geller

GitHub’s Deepfake Porn Crackdown Still Isn’t Working

Over a dozen programs used by creators of nonconsensual explicit images have evaded detection on the developer platform, WIRED has found.
Lydia Morrish

Hands On With DeepSeek’s R1 Chatbot

DeepSeek’s chatbot with the R1 model is a stunning release from the Chinese startup. While it’s an innovation in training efficiency, hallucinations still run rampant.
Reece Rogers

New US Rule Aims to Block China’s Access to AI Chips and Models by Restricting the World

The US government has announced a radical plan to control exports of cutting-edge AI technology to most nations.
Will Knight

DOGE Teen Owns ‘Tesla.Sexy LLC’ and Worked at Startup That Has Hired Convicted Hackers

Experts question whether Edward Coristine, a DOGE staffer who has gone by “Big Balls” online, would pass the background check typically required for access to sensitive US government systems.
Andy Greenberg

DeepSeek’s Popular AI App Is Explicitly Sending US Data to China

Amid ongoing fears over TikTok, Chinese generative AI platform DeepSeek says it’s sending heaps of US user data straight to its home country, potentially setting the stage for greater scrutiny.
Matt Burgess

OpenAI’s Operator Lets ChatGPT Use the Web for You

The company that kicked off the AI chatbot craze now wants AI to do more than just talk.
Will Knight

Here’s How DeepSeek Censorship Actually Works—and How to Get Around It

A WIRED investigation shows that the popular Chinese AI model is censored on both the application and training level.
Zeyi Yang

Google Lifts a Ban on Using Its AI for Weapons and Surveillance

Google published principles in 2018 barring its AI technology from being used for sensitive purposes. Weeks into President Donald Trump’s second term, those guidelines are being overhauled.
Paresh Dave

*****
Credit belongs to : www.wired.com

Check Also

President Trump’s War on ‘Information Silos’ Is Bad News for Your Personal Data

Steven Levy Business Apr 4, 2025 10:00 AM President Trump’s War on ‘Information Silos’ Is …