{"id":91085,"date":"2025-02-17T02:24:37","date_gmt":"2025-02-17T02:24:37","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/02\/17\/these-researchers-used-npr-sunday-puzzle-questions-to-benchmark-ai-reasoning-models\/"},"modified":"2025-02-17T02:24:37","modified_gmt":"2025-02-17T02:24:37","slug":"these-researchers-used-npr-sunday-puzzle-questions-to-benchmark-ai-reasoning-models","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/02\/17\/these-researchers-used-npr-sunday-puzzle-questions-to-benchmark-ai-reasoning-models\/","title":{"rendered":"These researchers used NPR Sunday Puzzle questions to benchmark AI &#8216;reasoning&#8217; models"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Every Sunday, NPR host Will Shortz, The New York Times\u2019 crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the <a rel=\"nofollow\" href=\"https:\/\/www.npr.org\/2025\/02\/14\/nx-s1-5290940\/sunday-puzzle-p-e-class\">Sunday Puzzle<\/a>. While written to be solvable without <em>too<\/em> much foreknowledge, the brainteasers are usually challenging even for skilled contestants.<\/p>\n<p class=\"wp-block-paragraph\">That\u2019s why some experts think they\u2019re a promising way to test the limits of AI\u2019s problem-solving abilities.<\/p>\n<p class=\"wp-block-paragraph\">In a <a href=\"https:\/\/arxiv.org\/pdf\/2502.01584\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">recent study<\/a>, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. 
The team says their test uncovered surprising insights, such as that reasoning models \u2014 OpenAI\u2019s o1, among others \u2014 sometimes \u201cgive up\u201d and provide answers they know aren\u2019t correct.<\/p>\n<p class=\"wp-block-paragraph\">\u201cWe wanted to develop a benchmark with problems that humans can understand with only general knowledge,\u201d Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.<\/p>\n<p class=\"wp-block-paragraph\">The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren\u2019t relevant to the average user. Meanwhile, many benchmarks \u2014 even <a href=\"https:\/\/gigazine.net\/gsc_news\/en\/20250205-openai-deep-research-high-score\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">benchmarks released relatively recently<\/a> \u2014 are quickly approaching the saturation point.<\/p>\n<p class=\"wp-block-paragraph\">The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn\u2019t test for esoteric knowledge, and the challenges are phrased such that models can\u2019t draw on \u201crote memory\u201d to solve them, explained Guha.<\/p>\n<p class=\"wp-block-paragraph\">\u201cI think what makes these problems hard is that it\u2019s really difficult to make meaningful progress on a problem until you solve it \u2014 that\u2019s when everything clicks together all at once,\u201d Guha said. \u201cThat requires a combination of insight and a process of elimination.\u201d<\/p>\n<p class=\"wp-block-paragraph\">No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. 
And because the quizzes are publicly available, it\u2019s possible that models trained on them can \u201ccheat\u201d in a sense, although Guha says he hasn\u2019t seen evidence of this.<\/p>\n<p class=\"wp-block-paragraph\">\u201cNew questions are released every week, and we can expect the latest questions to be truly unseen,\u201d he added. \u201cWe intend to keep the benchmark fresh and track how model performance changes over time.\u201d<\/p>\n<p class=\"wp-block-paragraph\">On the researchers\u2019 benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek\u2019s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before answering, which helps them <a href=\"https:\/\/techcrunch.com\/2024\/08\/27\/why-ai-cant-spell-strawberry\/\">avoid some of the pitfalls<\/a> that normally trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions \u2014 typically seconds to minutes longer.<\/p>\n<p class=\"wp-block-paragraph\">At least one model, DeepSeek\u2019s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim \u201cI give up,\u201d followed by an incorrect answer chosen seemingly at random \u2014 behavior this human can certainly relate to.<\/p>\n<p class=\"wp-block-paragraph\">The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck \u201cthinking\u201d forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to keep considering alternatives for no obvious reason.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOn hard problems, R1 literally says that it\u2019s getting \u2018frustrated,\u2019\u201d Guha said. \u201cIt was funny to see how a model emulates what a human might say. 
It remains to be seen how \u2018frustration\u2019 in reasoning can affect the quality of model results.\u201d<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" loading=\"lazy\" decoding=\"async\" width=\"1912\" height=\"734\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?w=680\" alt=\"NPR benchmark\" class=\"wp-image-2961179\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png 1912w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=150,58 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=300,115 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=768,295 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=680,261 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=1200,461 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=1280,491 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=430,165 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=720,276 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=900,346 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=800,307 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=1536,590 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=668,256 668w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=1440,553 1440w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.41.18AM.png?resize=708,272 708w\" sizes=\"auto, (max-width: 1912px) 100vw, 1912px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">R1 getting \u201cfrustrated\u201d on a question in the Sunday Puzzle challenge set.<\/span><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong>Guha et al.<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released <a href=\"https:\/\/techcrunch.com\/2025\/01\/31\/openai-launches-o3-mini-its-latest-reasoning-model\/\">o3-mini<\/a> set to high \u201creasoning effort\u201d (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" loading=\"lazy\" decoding=\"async\" width=\"2390\" height=\"1644\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?w=680\" alt=\"NPR benchmark\" class=\"wp-image-2961178\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png 2390w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=150,103 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=300,206 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=768,528 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=680,468 680w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=1200,825 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=1280,880 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=430,296 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=720,495 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=900,619 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=800,550 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=1536,1057 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=2048,1409 2048w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=668,459 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=545,375 545w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=897,617 897w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-06-at-12.31.38AM.png?resize=708,487 708w\" sizes=\"auto, (max-width: 2390px) 100vw, 2390px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">The scores of the models the team tested on their benchmark.<\/span><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong>Guha et al.<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">\u201cYou don\u2019t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don\u2019t require PhD-level knowledge,\u201d Guha said. 
\u201cA benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are \u2014 and aren\u2019t \u2014 capable of.\u201d<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/02\/16\/these-researchers-used-npr-sunday-puzzle-questions-to-benchmark-ai-reasoning-models\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every Sunday, NPR host Will Shortz, The New York Times\u2019 crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the<\/p>\n","protected":false},"author":1,"featured_media":91086,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-91085","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91085","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=91085"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91085\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\
/wp-json\/wp\/v2\/media\/91086"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=91085"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=91085"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=91085"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}