{"id":91720,"date":"2025-03-04T02:43:33","date_gmt":"2025-03-04T02:43:33","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/03\/04\/people-are-using-super-mario-to-benchmark-ai-now\/"},"modified":"2025-03-04T02:43:33","modified_gmt":"2025-03-04T02:43:33","slug":"people-are-using-super-mario-to-benchmark-ai-now","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/03\/04\/people-are-using-super-mario-to-benchmark-ai-now\/","title":{"rendered":"People are using Super Mario to benchmark AI now"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Thought <a href=\"https:\/\/techcrunch.com\/2025\/02\/25\/anthropics-claude-ai-is-playing-pokemon-on-twitch-slowly\/\">Pok\u00e9mon was a tough benchmark for AI<\/a>? One group of researchers argues that Super Mario Bros. is even tougher. <\/p>\n<p class=\"wp-block-paragraph\">Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic\u2019s <a href=\"https:\/\/techcrunch.com\/2025\/02\/24\/anthropic-launches-a-new-ai-model-that-thinks-as-long-as-you-want\/\">Claude 3.7<\/a> performed the best, followed by Claude 3.5. Google\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/04\/09\/googles-gemini-pro-1-5-enters-public-preview-on-vertex-ai\/\">Gemini 1.5 Pro<\/a> and OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/05\/13\/openais-newest-model-is-gpt-4o\/\">GPT-4o<\/a> struggled.<\/p>\n<p class=\"wp-block-paragraph\">It wasn\u2019t quite the same version of Super Mario Bros. as the original 1985 release, to be clear. 
The game ran in an emulator and integrated with a framework, <a href=\"https:\/\/github.com\/lmgame-org\/GamingAgent\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">GamingAgent<\/a>, to give the AIs control over Mario.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"592\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/ezgif-12b952f5417751.gif?w=640\" alt=\"Super Mario Bros. AI benchmark\" class=\"wp-image-2974640\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong> Hao Lab<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">GamingAgent, which Hao developed in-house, fed the AI in-game screenshots along with basic instructions such as \u201cIf an obstacle or enemy is near, move\/jump left to dodge.\u201d The AI then generated inputs in the form of Python code to control Mario.<\/p>\n<p class=\"wp-block-paragraph\">Still, Hao says that the game forced each model to \u201clearn\u201d to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models like OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/12\/05\/openais-o1-model-sure-tries-to-deceive-humans-a-lot\/\">o1<\/a>, which \u201cthink\u201d through problems step by step to arrive at solutions, performed worse than \u201cnon-reasoning\u201d models, despite being generally stronger on most benchmarks.<\/p>\n<p class=\"wp-block-paragraph\">One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while (seconds, usually) to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.<\/p>\n<p class=\"wp-block-paragraph\">Games have been used to benchmark AI for decades. 
But <a href=\"https:\/\/venturebeat.com\/uncategorized\/why-games-may-not-be-the-best-benchmark-for-ai\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">some experts have questioned the wisdom<\/a> of drawing connections between AI\u2019s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.<\/p>\n<p class=\"wp-block-paragraph\">The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an \u201cevaluation crisis.\u201d<\/p>\n<p class=\"wp-block-paragraph\">\u201cI don\u2019t really know what [AI] metrics to look at right now,\u201d he wrote in a <a href=\"https:\/\/x.com\/karpathy\/status\/1896266683301659068\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">post on X<\/a>. \u201cTLDR my reaction is I don\u2019t really know how good these models are right now.\u201d<\/p>\n<p class=\"wp-block-paragraph\">At least we can watch AI play Mario.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/03\/03\/people-are-using-super-mario-to-benchmark-ai-now\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Thought Pok\u00e9mon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. 
Hao AI Lab, a research<\/p>\n","protected":false},"author":1,"featured_media":91721,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-91720","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91720","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=91720"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91720\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/91721"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=91720"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=91720"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=91720"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}