Did xAI lie about Grok 3’s benchmarks?

February 23, 2025

Debates over AI benchmarks — and the way they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What’s cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
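To make the difference concrete, here is a minimal Python sketch of the two scoring styles, assuming each problem has a single correct answer and that a model's sampled answers are stored in the order they were generated. The function names and the toy data are hypothetical, invented for illustration; this is not xAI's or OpenAI's evaluation code.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the sampled attempts.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts, answer_key):
    """Fraction of problems whose first sampled answer is correct."""
    correct = sum(attempts[p][0] == answer_key[p] for p in answer_key)
    return correct / len(answer_key)


def score_cons_at_k(attempts, answer_key, k):
    """Fraction of problems whose majority answer over the first k samples is correct."""
    correct = sum(
        consensus_answer(attempts[p][:k]) == answer_key[p] for p in answer_key
    )
    return correct / len(answer_key)


# Hypothetical toy data: three problems, five sampled answers each (k=5, not 64).
answer_key = {"p1": 42, "p2": 7, "p3": 113}
attempts = {
    "p1": [42, 42, 12, 42, 42],
    "p2": [100, 7, 7, 7, 204],   # first try is wrong, majority is right
    "p3": [113, 113, 113, 8, 113],
}

print("@1:    ", score_at_1(attempts, answer_key))
print("cons@5:", score_cons_at_k(attempts, answer_key, 5))
```

On this toy data the model gets two of three problems right on its first try but all three under consensus voting, which is why a chart that mixes the two kinds of score can flatter one model over another.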

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and their strengths.
