{"id":93445,"date":"2025-04-15T03:38:21","date_gmt":"2025-04-15T03:38:21","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/04\/15\/debates-over-ai-benchmarking-have-reached-pokemon\/"},"modified":"2025-04-15T03:38:21","modified_gmt":"2025-04-15T03:38:21","slug":"debates-over-ai-benchmarking-have-reached-pokemon","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/04\/15\/debates-over-ai-benchmarking-have-reached-pokemon\/","title":{"rendered":"Debates over AI benchmarking have reached Pok\u00e9mon"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Not even Pok\u00e9mon is safe from AI benchmarking controversy. <\/p>\n<p class=\"wp-block-paragraph\">Last week, a <a rel=\"nofollow\" href=\"https:\/\/x.com\/Jush21e8\/status\/1910293595422413051\">post on X<\/a> went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s flagship Claude model in the original Pok\u00e9mon video game trilogy. 
Reportedly, Gemini had reached Lavender Town in a developer\u2019s Twitch stream; Claude was <a href=\"https:\/\/techcrunch.com\/2025\/02\/24\/anthropic-used-pokemon-to-benchmark-its-newest-ai-model\/\">stuck at Mount Moon<\/a> as of late February.<\/p>\n<blockquote class=\"wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town<\/p>\n<p class=\"wp-block-paragraph\">119 live views only btw, incredibly underrated stream <a href=\"https:\/\/t.co\/8AvSovAI4x\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">pic.twitter.com\/8AvSovAI4x<\/a><\/p>\n<p class=\"wp-block-paragraph\">\u2014 Jush (@Jush21e8) <a href=\"https:\/\/twitter.com\/Jush21e8\/status\/1910293595422413051?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">April 10, 2025<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">But what the post failed to mention is that Gemini had an advantage.<\/p>\n<p class=\"wp-block-paragraph\">As <a href=\"https:\/\/www.reddit.com\/r\/singularity\/comments\/1jvwqc9\/gemini_plays_pok%C3%A9mon_has_made_it_through_rock\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">users on Reddit<\/a> pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify \u201ctiles\u201d in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.<\/p>\n<p class=\"wp-block-paragraph\">Now, Pok\u00e9mon is a semi-serious AI benchmark at best \u2014 few would argue it\u2019s a very informative test of a model\u2019s capabilities. 
But it <em>is<\/em> an instructive example of how different implementations of a benchmark can influence the results.<\/p>\n<p class=\"wp-block-paragraph\">For example, Anthropic <a rel=\"nofollow\" href=\"https:\/\/www.anthropic.com\/news\/claude-3-7-sonnet\">reported<\/a> two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model\u2019s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a \u201ccustom scaffold\u201d that Anthropic developed.<\/p>\n<p class=\"wp-block-paragraph\">More recently, Meta <a href=\"https:\/\/techcrunch.com\/2025\/04\/06\/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading\/\">fine-tuned<\/a> a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The <a href=\"https:\/\/techcrunch.com\/2025\/04\/11\/metas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark\/\">vanilla version<\/a> of the model scores significantly worse on the same evaluation.<\/p>\n<p class=\"wp-block-paragraph\">Given that AI benchmarks \u2014 Pok\u00e9mon included \u2014 are <a href=\"https:\/\/techcrunch.com\/2024\/03\/07\/heres-why-most-ai-benchmarks-tell-us-so-little\/\">imperfect measures<\/a> to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn\u2019t seem likely that it\u2019ll get any easier to compare models as they\u2019re released.<\/p>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/04\/14\/debates-over-ai-benchmarking-have-reached-pokemon\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Not even Pok\u00e9mon is safe from AI benchmarking controversy. 
Last week, a post on X went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s<\/p>\n","protected":false},"author":1,"featured_media":93446,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-93445","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/93445","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=93445"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/93445\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/93446"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=93445"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=93445"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=93445"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}