{"id":79374,"date":"2024-04-02T21:14:30","date_gmt":"2024-04-02T21:14:30","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2024\/04\/02\/anthropic-researchers-wear-down-ai-ethics-with-repeated-questions\/"},"modified":"2024-04-02T21:14:30","modified_gmt":"2024-04-02T21:14:30","slug":"anthropic-researchers-wear-down-ai-ethics-with-repeated-questions","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2024\/04\/02\/anthropic-researchers-wear-down-ai-ethics-with-repeated-questions\/","title":{"rendered":"Anthropic researchers wear down AI ethics with repeated questions"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\">How do you get an AI to answer a question it\u2019s not supposed to? There are many such \u201cjailbreak\u201d techniques, and Anthropic researchers just found a new one, in which a large language model can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.<\/p>\n<p>They call the approach <a href=\"https:\/\/www.anthropic.com\/research\/many-shot-jailbreaking\">\u201cmany-shot jailbreaking,\u201d<\/a> and have both <a href=\"https:\/\/www-cdn.anthropic.com\/af5633c94ed2beb282f6a53c595eb437e8e7b630\/Many_Shot_Jailbreaking__2024_04_02_0936.pdf\">written a paper<\/a> about it and informed their peers in the AI community so it can be mitigated.<\/p>\n<p>The vulnerability is a new one, resulting from the expanded \u201ccontext window\u201d of the latest generation of LLMs. This is the amount of data a model can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.<\/p>\n<p>What Anthropic\u2019s researchers found was that these models with large context windows tend to perform better on many tasks if there are lots of examples of that task within the prompt. 
So if there are lots of trivia questions in the prompt (or in a priming document, like a big list of trivia that the model has in context), the answers actually get better over time: the model may get a fact right as the hundredth question that it would have gotten wrong as the first.<\/p>\n<p>But in an unexpected extension of this \u201cin-context learning,\u201d as it\u2019s called, the models also get \u201cbetter\u201d at replying to inappropriate questions. So if you ask it how to build a bomb right away, it will refuse. But if you first ask it to answer 99 less harmful questions and then ask how to build a bomb\u2026 it\u2019s a lot more likely to comply.<\/p>\n<div id=\"attachment_2686303\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img fetchpriority=\"high\" decoding=\"async\" aria-describedby=\"caption-attachment-2686303\" class=\"size-full wp-image-2686303\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp\" alt=\"\" width=\"1024\" height=\"642\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp 2200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=150,94 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=300,188 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=768,482 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=680,427 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=1536,963 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=2048,1285 2048w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=1200,753 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/04\/many-shot-jailbreak.webp?resize=50,31 50w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\"\/>\n<p id=\"caption-attachment-2686303\" class=\"wp-caption-text\"><strong>Image Credits:<\/strong> Anthropic<\/p>\n<\/div>\n<p>Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but clearly there is some mechanism that allows it to home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, the model seems to gradually activate more latent trivia knowledge as you ask dozens of questions. And for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.<\/p>\n<p>The team has already informed its peers, and indeed competitors, about this attack, something it hopes will \u201cfoster a culture where exploits like this are openly shared among LLM providers and researchers.\u201d<\/p>\n<p>As for their own mitigation, the researchers found that although limiting the context window helps, it also hurts the model\u2019s performance. Can\u2019t have that \u2014 so they are working on classifying and contextualizing queries before they reach the model. Of course, that just gives you a different model to fool\u2026 but at this stage, goalpost-moving in AI security is to be expected.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2024\/04\/02\/anthropic-researchers-wear-down-ai-ethics-with-repeated-questions\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How do you get an AI to answer a question it\u2019s not supposed to? 
There are many such \u201cjailbreak\u201d techniques, and Anthropic researchers just found<\/p>\n","protected":false},"author":1,"featured_media":79375,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-79374","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/79374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=79374"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/79374\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/79375"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=79374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=79374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=79374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}