Researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.
That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.
In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test surfaced surprising insights, such as that reasoning models, including OpenAI’s o1, sometimes “give up” and provide answers they know to be wrong.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and a co-author of the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the commonly used tests evaluate AI models on skills that aren’t relevant to the average user, such as competency on PhD-level math and science questions. Meanwhile, many benchmarks, even some released fairly recently, are quickly approaching the saturation point.
The advantages of a public radio quiz game like the Sunday Puzzle are that it doesn’t test for esoteric knowledge and that the challenges are phrased so that models can’t draw on “rote memory” to solve them, Guha explained.
“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, models trained on them could “cheat” in a sense, though Guha says he hasn’t seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of roughly 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions, typically seconds to minutes longer.
At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it “gives up,” followed by an incorrect answer that appears to be chosen at random, behavior this human can certainly relate to.
The models make other bizarre choices, too, such as giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to go on weighing alternative answers for no obvious reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”