{"id":91337,"date":"2025-02-23T02:24:21","date_gmt":"2025-02-23T02:24:21","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/02\/23\/did-xai-lie-about-grok-3s-benchmarks\/"},"modified":"2025-02-23T02:24:21","modified_gmt":"2025-02-23T02:24:21","slug":"did-xai-lie-about-grok-3s-benchmarks","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/02\/23\/did-xai-lie-about-grok-3s-benchmarks\/","title":{"rendered":"Did xAI lie about Grok 3&#8217;s benchmarks?"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Debates over AI benchmarks \u2014 and how they\u2019re reported by AI labs \u2014 are spilling out into public view. <\/p>\n<p class=\"wp-block-paragraph\">This week, an OpenAI employee <a rel=\"nofollow\" href=\"https:\/\/x.com\/BorisMPower\/status\/1892407015038996740\">accused<\/a> Elon Musk\u2019s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, <a rel=\"nofollow\" href=\"https:\/\/x.com\/ibab\/status\/1892418351084732654\">insisted<\/a> that the company was in the right. <\/p>\n<p class=\"wp-block-paragraph\">The truth lies somewhere in between. <\/p>\n<p class=\"wp-block-paragraph\">In a <a rel=\"nofollow\" href=\"https:\/\/x.ai\/blog\/grok-3\">post on xAI\u2019s blog<\/a>, the company published a graph showing Grok 3\u2019s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have <a rel=\"nofollow\" href=\"https:\/\/x.com\/DimitrisPapail\/status\/1888325914603516214\">questioned AIME\u2019s validity as an AI benchmark<\/a>. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model\u2019s math ability. 
<\/p>\n<p class=\"wp-block-paragraph\">xAI\u2019s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI\u2019s best-performing available model, <a href=\"https:\/\/techcrunch.com\/2025\/01\/31\/openai-launches-o3-mini-its-latest-reasoning-model\/\">o3-mini-high<\/a>, on AIME 2025. But OpenAI employees on X were quick to point out that xAI\u2019s graph didn\u2019t include o3-mini-high\u2019s AIME 2025 score at \u201ccons@64.\u201d <\/p>\n<p class=\"wp-block-paragraph\">What is cons@64, you might ask? Well, it\u2019s short for \u201cconsensus@64,\u201d and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models\u2019 benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn\u2019t the case.<\/p>\n<p class=\"wp-block-paragraph\">Grok 3 Reasoning Beta and Grok 3 mini Reasoning\u2019s scores for AIME 2025 at \u201c@1\u201d \u2014 meaning the first score the models got on the benchmark \u2014 fall below o3-mini-high\u2019s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/09\/12\/openai-unveils-a-model-that-can-fact-check-itself\/\">o1 model<\/a> set to \u201cmedium\u201d computing. Yet xAI is <a rel=\"nofollow\" href=\"https:\/\/x.com\/xai\/status\/1892400129719611567\">advertising Grok 3<\/a> as the \u201cworld\u2019s smartest AI.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Babushkin <a rel=\"nofollow\" href=\"https:\/\/x.com\/ibab\/status\/1892418351084732654\">argued on X<\/a> that OpenAI has published similarly misleading benchmark charts in the past \u2014 albeit charts comparing the performance of its own models. 
A more neutral party in the debate put together a more \u201caccurate\u201d graph showing nearly every model\u2019s performance at cons@64:<\/p>\n<blockquote class=\"wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it\u2019s DeepSeek propaganda<br \/>(I actually believe Grok looks good there, and openAI\u2019s TTC chicanery behind o3-mini-*high*-pass@\u201d\u201d\u201d1\u2033\u201d\u201d deserves more scrutiny.) <a rel=\"nofollow\" href=\"https:\/\/t.co\/dJqlJpcJh8\">https:\/\/t.co\/dJqlJpcJh8<\/a> <a rel=\"nofollow\" href=\"https:\/\/t.co\/3WH8FOUfic\">pic.twitter.com\/3WH8FOUfic<\/a><\/p>\n<p class=\"wp-block-paragraph\">\u2014 Teortaxes\u25b6\ufe0f (DeepSeek \u63a8\u7279\ud83d\udc0b\u94c1\u7c89 2023 \u2013 \u221e) (@teortaxesTex) <a rel=\"nofollow\" href=\"https:\/\/twitter.com\/teortaxesTex\/status\/1892535507352961221?ref_src=twsrc%5Etfw\">February 20, 2025<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">But as AI researcher Nathan Lambert <a rel=\"nofollow\" href=\"https:\/\/x.com\/natolambert\/status\/1892675458166382687\">pointed out in a post<\/a>, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models\u2019 limitations \u2014 and their strengths.<\/p>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/02\/22\/did-xai-lie-about-grok-3s-benchmarks\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Debates over AI benchmarks \u2014 and how they\u2019re reported by AI labs \u2014 are spilling out into public view. 
This week, an OpenAI employee accused<\/p>\n","protected":false},"author":1,"featured_media":91338,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[178],"tags":[],"class_list":["post-91337","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=91337"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91337\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/91338"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=91337"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=91337"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=91337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}