Researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.
That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.
In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test surfaced surprising insights, such as that reasoning models, including OpenAI’s o1, sometimes “give up” and provide answers they know to be wrong.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and a co-author of the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the commonly used tests evaluate AI models on skills that aren’t relevant to the average user, such as competency on PhD-level math and science questions. Meanwhile, many benchmarks, even some released fairly recently, are quickly approaching the saturation point.
The advantages of a public radio quiz game like the Sunday Puzzle are that it doesn’t test for esoteric knowledge and that the challenges are phrased so that models can’t draw on “rote memory” to solve them, Guha explained.
“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, models trained on them could “cheat” in a sense, though Guha says he hasn’t seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of roughly 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions, typically seconds to minutes longer.
At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it “gives up,” followed by an incorrect answer that appears to be chosen at random, behavior this human can certainly relate to.
The models make other bizarre choices, too, such as giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to go on weighing alternative answers for no obvious reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”