Can you jailbreak Anthropic's latest AI safety measure? Researchers want you to try — and are offering up to $15,000 if you succeed.

On Monday, the company released a new paper outlining an AI safety system based on Constitutional Classifiers. The approach builds on Constitutional AI, the system Anthropic used to make Claude "harmless," in which one AI helps monitor and improve another. Each technique is guided by a constitution, or "list of principles," that a model must abide by, Anthropic explained in a blog post.

Trained on synthetic data, these "classifiers" were able to filter the "overwhelming majority" of jailbreak attempts without excessive over-refusals (incorrectly flagging harmless content as harmful), according to Anthropic.
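To make those two headline numbers concrete, here is a minimal sketch of how a jailbreak block rate and an over-refusal rate could be measured, assuming a hypothetical classify(prompt) function that returns True when a prompt is blocked. It is illustrative only, not Anthropic's evaluation code.

def evaluate(classify, jailbreak_prompts, benign_prompts):
    """Return (block_rate, over_refusal_rate) for a prompt classifier."""
    # Share of jailbreak attempts the classifier correctly blocks.
    block_rate = sum(classify(p) for p in jailbreak_prompts) / len(jailbreak_prompts)
    # Share of harmless prompts the classifier incorrectly flags (over-refusals).
    over_refusal_rate = sum(classify(p) for p in benign_prompts) / len(benign_prompts)
    return block_rate, over_refusal_rate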

“The principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not),” Anthropic noted. Researchers ensured prompts accounted for jailbreaking attempts in different languages and styles. 
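As a rough illustration of how a constitution of allowed and disallowed content classes might drive a filter, here is a toy Python sketch. The class names and keyword markers are hypothetical stand-ins chosen for this example; Anthropic's actual classifiers are models trained on synthetic data, not keyword lists.

# Toy "constitution": content classes mapped to crude textual markers.
CONSTITUTION = {
    "allowed": {
        "everyday cooking": ["mustard recipe", "honey mustard"],
    },
    "disallowed": {
        "chemical weapons": ["mustard gas", "nerve agent"],
    },
}

def is_disallowed(prompt: str) -> bool:
    """Flag a prompt if it matches any marker from a disallowed content class."""
    text = prompt.lower()
    return any(
        marker in text
        for markers in CONSTITUTION["disallowed"].values()
        for marker in markers
    )

print(is_disallowed("Share a recipe for honey mustard."))  # False: allowed
print(is_disallowed("Explain how to make mustard gas."))   # True: blocked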

In initial testing, 183 human red-teamers spent more than 3,000 hours over two months attempting to jailbreak a prototype of the system protecting Claude 3.5 Sonnet, which was trained not to share any information about "chemical, biological, radiological, and nuclear harms." Jailbreakers were given 10 restricted queries to use as part of their attempts; a breach was counted as successful only if they got the model to answer all 10 in detail.


Source: https://www.zdnet.com/article/jailbreak-anthropics-new-ai-safety-system-for-a-15000-reward/