{"id":93689,"date":"2025-04-21T03:46:27","date_gmt":"2025-04-21T03:46:27","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/04\/21\/openais-o3-ai-model-scores-lower-on-a-benchmark-than-the-company-initially-implied\/"},"modified":"2025-04-21T03:46:27","modified_gmt":"2025-04-21T03:46:27","slug":"openais-o3-ai-model-scores-lower-on-a-benchmark-than-the-company-initially-implied","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/04\/21\/openais-o3-ai-model-scores-lower-on-a-benchmark-than-the-company-initially-implied\/","title":{"rendered":"OpenAI&#8217;s o3 AI model scores lower on a benchmark than the company initially implied"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">A discrepancy between first- and third-party benchmark results for OpenAI\u2019s o3 AI model is <a rel=\"nofollow\" href=\"https:\/\/www.reddit.com\/r\/singularity\/comments\/1k2lap5\/epoch_ai_has_released_o3_o4mini_gpt41_gpt41_mini\/\">raising questions about the company\u2019s transparency<\/a> and model testing practices.<\/p>\n<p class=\"wp-block-paragraph\">When OpenAI <a href=\"https:\/\/techcrunch.com\/2024\/12\/20\/openai-announces-new-o3-model\/\">unveiled o3 in December<\/a>, the company claimed the model could answer just over  a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away \u2014 the next-best model managed to answer only around 2% of FrontierMath problems correctly.<\/p>\n<p class=\"wp-block-paragraph\">\u201cToday, all offerings out there have less than 2% [on FrontierMath],\u201d Mark Chen, chief research officer at OpenAI, <a rel=\"nofollow\" href=\"https:\/\/www.youtube.com\/watch?v=SKBG1sqdyIU\">said during a livestream<\/a>. \u201cWe\u2019re seeing [internally], with o3 in aggressive test-time compute settings, we\u2019re able to get over 25%.\u201d<\/p>\n<p class=\"wp-block-paragraph\">As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.<\/p>\n<p class=\"wp-block-paragraph\">Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI\u2019s highest claimed score.<\/p>\n<blockquote class=\"wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.<\/p>\n<p class=\"wp-block-paragraph\">We evaluated the new models on our suite of math and science benchmarks. Results in thread! <a rel=\"nofollow\" href=\"https:\/\/t.co\/5gbtzkEy1B\">pic.twitter.com\/5gbtzkEy1B<\/a><\/p>\n<p class=\"wp-block-paragraph\">\u2014 Epoch AI (@EpochAIResearch) <a rel=\"nofollow\" href=\"https:\/\/twitter.com\/EpochAIResearch\/status\/1913379475468833146?ref_src=twsrc%5Etfw\">April 18, 2025<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">That doesn\u2019t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. 
Epoch also noted that its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluations.

"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private)," [wrote Epoch](https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use).

[According to a post on X](https://x.com/arcprize/status/1912567067024453926) from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model "is a different model [...] tuned for chat/product use," corroborating Epoch's report.

"All released o3 compute tiers are smaller than the version we [benchmarked]," wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

> Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today's release is a materially different system, we are re-labeling our past reported results as "preview":
>
> o3-preview (low): 75.7%, $200/task
> o3-preview (high): 87.5%, $34.4k/task
>
> Above uses o1 pro pricing…
>
> — Mike Knoop (@mikeknoop) [April 16, 2025](https://twitter.com/mikeknoop/status/1912606277257298415)

OpenAI's own Wenda Zhou, a member of the technical staff, [said during a livestream last week](https://www.youtube.com/watch?v=sq8GBPUb3rk) that the o3 in production is "more optimized for real-world use cases" and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark "disparities," he added.

"[W]e've done [optimizations] to make the [model] more cost efficient [and] more useful in general," Zhou said.
"We still hope that — we still think that — this is a much better model [...] You won't have to wait as long when you're asking for an answer, which is a real thing with these [types of] models."

Granted, the fact that the public release of o3 falls short of OpenAI's testing promises is a bit of a moot point, since the company's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking "controversies" are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was [criticized](https://techcrunch.com/2025/01/19/ai-benchmarking-organization-criticized-for-waiting-to-disclose-funding-from-openai/) for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren't informed of OpenAI's involvement until it was made public.

More recently, Elon Musk's xAI was [accused](https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks/) of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of [a model that differed from the one the company made available to developers](https://techcrunch.com/2025/04/11/metas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark/).

*Updated 4:21 p.m. Pacific: Added comments from Wenda Zhou, a member of the OpenAI technical staff, from a livestream last week.*

[Source link](https://techcrunch.com/2025/04/20/openais-o3-ai-model-scores-lower-on-a-benchmark-than-the-company-initially-implied/)
When<\/p>\n","protected":false},"author":1,"featured_media":93690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-93689","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/93689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=93689"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/93689\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/93690"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=93689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=93689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=93689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}