{"id":92599,"date":"2025-03-25T03:14:02","date_gmt":"2025-03-25T03:14:02","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/03\/25\/a-new-challenging-agi-test-stumps-most-ai-models\/"},"modified":"2025-03-25T03:14:02","modified_gmt":"2025-03-25T03:14:02","slug":"a-new-challenging-agi-test-stumps-most-ai-models","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/03\/25\/a-new-challenging-agi-test-stumps-most-ai-models\/","title":{"rendered":"A new, challenging AGI test stumps most AI models"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher Fran\u00e7ois Chollet, announced in a <a rel=\"nofollow\" href=\"https:\/\/arcprize.org\/blog\/announcing-arc-agi-2-and-arc-prize-2025\">blog post<\/a> on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.<\/p>\n<p class=\"wp-block-paragraph\">So far, the new test, called ARC-AGI-2, has stumped most models.<\/p>\n<p class=\"wp-block-paragraph\">\u201cReasoning\u201d AI models like OpenAI\u2019s o1-pro and DeepSeek\u2019s R1 score between 1% and 1.3% on ARC-AGI-2, according to the <a rel=\"nofollow\" href=\"https:\/\/arcprize.org\/leaderboard\">Arc Prize leaderboard<\/a>. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.<\/p>\n<p class=\"wp-block-paragraph\">The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares, and generate the correct \u201canswer\u201d grid. The problems were designed to force an AI to adapt to new problems it hasn\u2019t seen before. <\/p>\n<p class=\"wp-block-paragraph\">The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. 
On average, \u201cpanels\u201d of these people got 60% of the test\u2019s questions right \u2014 much better than any of the models\u2019 scores.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" loading=\"lazy\" decoding=\"async\" width=\"1624\" height=\"786\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?w=680\" alt=\"\" class=\"wp-image-2985527\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png 1624w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=150,73 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=300,145 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=768,372 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=680,329 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1200,581 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1280,620 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=430,208 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=720,348 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=900,436 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=800,387 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1536,743 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=668,323 668w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1275,617 1275w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=708,343 708w\" sizes=\"auto, (max-width: 1624px) 100vw, 1624px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">A sample question from ARC-AGI-2 (credit: Arc Prize).<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In a <a rel=\"nofollow\" href=\"https:\/\/x.com\/fchollet\/status\/1904265979192086882\">post on X<\/a>, Chollet claimed ARC-AGI-2 is a better measure of an AI model\u2019s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation\u2019s tests are designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on.<\/p>\n<p class=\"wp-block-paragraph\">Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on \u201cbrute force\u201d \u2014 extensive computing power \u2014 to find solutions. Chollet previously acknowledged <a href=\"https:\/\/techcrunch.com\/2024\/12\/09\/a-test-for-agi-is-closer-to-being-solved-but-it-may-be-flawed\/\">this was a major flaw of ARC-AGI-1.<\/a><\/p>\n<p class=\"wp-block-paragraph\">To address the first test\u2019s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.<\/p>\n<p class=\"wp-block-paragraph\">\u201cIntelligence is not solely defined by the ability to solve problems or achieve high scores,\u201d Arc Prize Foundation co-founder Greg Kamradt wrote in a <a rel=\"nofollow\" href=\"https:\/\/arcprize.org\/blog\/announcing-arc-agi-2-and-arc-prize-2025\">blog post<\/a>. \u201cThe efficiency with which those capabilities are acquired and deployed is a crucial, defining component. 
The core question being asked is not just, \u2018Can AI acquire [the] skill to solve a task?\u2019 but also, \u2018At what efficiency or cost?\u2019\u201d<\/p>\n<p class=\"wp-block-paragraph\">ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its <a href=\"https:\/\/techcrunch.com\/2024\/12\/20\/openai-announces-new-o3-model\/\">advanced reasoning model, o3<\/a>, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, <a href=\"https:\/\/techcrunch.com\/2024\/12\/23\/openais-o3-suggests-ai-models-are-scaling-in-new-ways-but-so-are-the-costs\/\">o3\u2019s performance gains on ARC-AGI-1 came with a hefty price tag<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The version of OpenAI\u2019s o3 model \u2014 o3 (low) \u2014 that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" loading=\"lazy\" decoding=\"async\" width=\"1602\" height=\"902\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?w=680\" alt=\"\" class=\"wp-image-2985529\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png 1602w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=150,84 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=300,169 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=768,432 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=680,383 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1200,676 1200w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1280,721 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=430,242 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=720,405 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=900,507 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=800,450 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1536,865 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=668,376 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=666,375 666w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1096,617 1096w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=708,399 708w\" sizes=\"auto, (max-width: 1602px) 100vw, 1602px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. 
Hugging Face\u2019s co-founder, Thomas Wolf, recently told TechCrunch that <a href=\"https:\/\/techcrunch.com\/2025\/03\/19\/the-ai-leaders-bringing-the-agi-debate-down-to-earth\/\">the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence<\/a>, including creativity.<\/p>\n<p class=\"wp-block-paragraph\">Alongside the new benchmark, the Arc Prize Foundation announced <a rel=\"nofollow\" href=\"https:\/\/arcprize.org\/competition\">a new Arc Prize 2025 contest<\/a>, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/03\/24\/a-new-challenging-agi-test-stumps-most-ai-models\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher Fran\u00e7ois Chollet, announced in a blog post on Monday that it has created a<\/p>\n","protected":false},"author":1,"featured_media":92600,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-92599","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/92599","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=
92599"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/92599\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/92600"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=92599"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=92599"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=92599"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}