Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.
This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.
xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers it generates most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn't the case.
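To make the distinction concrete, here is a minimal sketch of how a consensus@k score differs from a single-attempt ("@1") score. The function names and data layout below are illustrative assumptions, not xAI's or OpenAI's actual evaluation code; the only real mechanics are sampling k answers per problem and taking a majority vote.

```python
from collections import Counter

def consensus_at_k(samples_per_problem, ground_truths):
    """Majority-vote scoring: a problem counts as solved if the most
    frequently sampled answer matches the ground truth (k = 64 for cons@64)."""
    correct = 0
    for samples, truth in zip(samples_per_problem, ground_truths):
        majority_answer, _count = Counter(samples).most_common(1)[0]
        if majority_answer == truth:
            correct += 1
    return correct / len(ground_truths)

def score_at_1(samples_per_problem, ground_truths):
    """Single-attempt scoring: only the first sampled answer counts."""
    correct = sum(samples[0] == truth
                  for samples, truth in zip(samples_per_problem, ground_truths))
    return correct / len(ground_truths)

# Toy example: three problems, three samples each (imagine 64 in practice).
samples = [["7", "7", "3"], ["5", "12", "12"], ["1", "2", "3"]]
truths = ["7", "12", "9"]
print(score_at_1(samples, truths))      # ~0.33: only the first problem is right on try one
print(consensus_at_k(samples, truths))  # ~0.67: majority voting also recovers the second
```

As the toy numbers suggest, majority voting can rescue problems the model answers correctly only some of the time, which is why cons@64 figures sit well above @1 figures for the same model.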
Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores on AIME 2025 at "@1," meaning the score from the models' first attempt at each problem, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails slightly behind OpenAI's o1 model set to "medium" compute. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and openAI's TTC chicanery behind o3-mini-*high*-pass@”””1″”” deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic
— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.
Keep reading the article on TechCrunch
OpenAI is forecasting a major shift over the next five years in where it gets most of its computing power, The Information reported on Friday.
By 2030, OpenAI expects to get three-quarters of its data center capacity from Stargate, a project that's expected to be heavily financed by SoftBank, one of OpenAI's newest financial backers. That represents a major shift away from Microsoft, OpenAI's biggest shareholder, which supplies most of the startup's computing capacity today.
The change won’t happen overnight. OpenAI still plans to increase its spending on Microsoft-owned data centers in the next few years.
During that time, OpenAI's overall costs are set to grow dramatically. The Information reports that OpenAI projects it will burn $20 billion in cash in 2027, far more than the $5 billion it reportedly burned through in 2024. By 2030, OpenAI reportedly forecasts that the cost of running its AI models, known as inference, will outpace what the startup spends on training them.
Keep reading the article on TechCrunch
OpenAI said on Friday that it is rolling out Operator, its so-called AI agent that can perform tasks on behalf of users, for ChatGPT Pro subscribers in Australia, Brazil, Canada, India, Japan, Singapore, South Korea, the U.K., and more countries.
OpenAI said Operator will be available in most places where ChatGPT is available, apart from the EU, Switzerland, Norway, Liechtenstein and Iceland.
Operator, launched in January in the U.S., is one of several “AI agent” tools on the market that can be instructed to do things like book tickets, make restaurant reservations, file expense reports, or shop on e-commerce websites.
The tool is currently available only to subscribers on the $200-per-month ChatGPT Pro plan, and only via a dedicated webpage, though the company has said it plans to bring Operator to all ChatGPT clients. Operator completes tasks in a separate browser window that users can take control of at any time.
There’s ample competition in this space, with companies like Google, Anthropic and Rabbit building agents that can perform similar tasks. However, Google’s project is still on a waitlist, Anthropic gives access to its agentic interface through an API, and Rabbit’s action model is only available to users who own its device.
Keep reading the article on TechCrunch