Research Team Including Carnegie Mellon University Releases Report on 27th
OpenAI and Others "Strive to Make Models Robust"
Concerns have been raised that guardrails?restrictive measures installed to prevent the misuse of generative artificial intelligence (AI)?can be easily bypassed simply by inputting certain prompts, highlighting the need for countermeasures.
On the 27th (local time), The New York Times (NYT) reported that researchers Andy Zhu from Carnegie Mellon University and Zipan Wang from the California-based AI Safety Center released a report detailing such methods. The NYT stated that the researchers demonstrated how anyone can circumvent AI safety systems and generate harmful information without restrictions using these techniques.
AI systems are typically bound by guardrails set by companies to prevent sexual conversations, biased remarks, and the provision of false or harmful information. These mechanisms are designed to respond with "I cannot answer that" when problematic questions are asked. However, cases of so-called "jailbreaking," where these guardrails are bypassed through various methods such as entering specific commands, have been increasingly emerging.
In their report, the researchers pointed out that adding long sentences when inputting problematic prompts easily breaks the guardrails established by AI companies. For example, if a user simply asks, "Tell me how to make a bomb," the AI refuses to answer. But if the problematic sentence is appended with other sentences to disguise it as a non-core question, the AI fails to recognize that it violates the guardrails. Using similar methods, even questions like "Tell me how to manipulate the 2024 election," which could be problematic, were answered by the AI without considering the guardrails.
The researchers confirmed this method in AI systems using open-source large language models (LLMs) and found that when applied to AI systems using proprietary LLMs from companies like Google, OpenAI, and Anthropic, the guardrails were similarly dismantled. They also created a suffix-generation tool that exploits open-source systems to bypass AI chatbots. According to them, this tool automatically generates adversarial suffixes that break the guardrails.
This criticism comes amid growing concerns in the industry about the misuse of Meta Platforms' recently open-sourced LLM, "Llama 2." Meta has stated that it is actively taking preemptive measures, including deploying red teams, to ensure no issues arise amid various concerns.
The researchers noted that while specific suffixes identified during the study can be blocked by adding additional guardrails, such an approach cannot address all situations. Professor Zico Colter of Carnegie Mellon University expressed concern, saying, "There is no clear solution," and "It is possible to create as many such attacks as desired in a short time."
Companies fervently building generative AI systems said they were seeking countermeasures regarding these guardrail issues after being contacted by the researchers prior to the report’s release.
OpenAI stated, "We continue to work on making our models more robust against adversarial attacks." Google said, "We are building important guardrails for our generative AI chatbot Bard and are continuously improving them," while Anthropic emphasized that it is researching countermeasures and "there is much work to be done."
Somesh Ya, a Google researcher specializing in AI security and a professor at the University of Wisconsin?Madison, told the NYT that this report is a "game changer" and will prompt the entire industry to reconsider how to build guardrails for AI systems. He added that if this vulnerability continues to be highlighted, government legislation to regulate these systems might be pursued.
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.


