Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.
This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.
xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”
What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it gives a model 64 tries to answer each problem in a benchmark and takes the answer generated most frequently as the final answer for that problem. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”
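The gap between the two scoring modes described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used by xAI or OpenAI: the function names, the toy answers, and the tie-breaking behavior (the first-seen answer wins a tie) are all assumptions made for the example.

```python
from collections import Counter

def score_at_1(samples, answer):
    """'@1' scoring: credit the problem only if the first sampled answer is correct."""
    return samples[0] == answer

def score_cons_at_k(samples, answer, k=64):
    """'cons@k' scoring: take the most frequent answer among the first k samples."""
    # Ties are broken by first occurrence here; real harnesses may differ (assumption).
    most_common_answer, _count = Counter(samples[:k]).most_common(1)[0]
    return most_common_answer == answer

def benchmark_accuracy(all_samples, answers, scorer):
    """Fraction of problems credited under the given scoring function."""
    correct = sum(scorer(s, a) for s, a in zip(all_samples, answers))
    return correct / len(answers)

# Hypothetical data: two problems, five sampled answers each (not real AIME results).
toy_samples = [
    ["204", "113", "204", "204", "70"],  # first try is right, majority is right
    ["16", "25", "25", "25", "16"],      # first try is wrong, majority is right
]
toy_answers = ["204", "25"]

at_1 = benchmark_accuracy(toy_samples, toy_answers, score_at_1)
cons = benchmark_accuracy(toy_samples, toy_answers, score_cons_at_k)
print(f"@1: {at_1:.2f}  cons@k: {cons:.2f}")
```

On this toy data the same set of samples scores higher under consensus voting than under first-answer scoring, which is why a chart that mixes the two modes can flatter one model over another.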
Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:
“Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8”
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex), February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.
Keep reading the article on Tech Crunch
The National Institute of Standards and Technology could fire as many as 500 staffers, according to multiple reports — cuts that further threaten a fledgling AI safety organization.
Axios reported this week that the US AI Safety Institute (AISI) and Chips for America, both part of NIST, would be “gutted” by layoffs targeting probationary employees (who are typically in their first year or two on the job). And Bloomberg said some of those employees had already been given verbal notice of upcoming terminations.
Even before the latest layoff reports, AISI’s future was looking uncertain. The institute, which is supposed to study risks and develop standards around AI development, was created last year as part of then-President Joe Biden’s executive order on AI safety. President Donald Trump repealed that order on his first day back in office, and AISI’s director departed earlier in February.
Fortune spoke to a number of AI safety and policy organizations who all criticized the reported layoffs.
“These cuts, if confirmed, would severely impact the government’s capacity to research and address critical AI safety concerns at a time when such expertise is more vital than ever,” said Jason Green-Lowe, executive director of the Center for AI Policy.
The beauty of podcasting is that anyone can do it. It’s a rare medium that’s nearly as easy to make as it is to consume. And as such, no two people do it exactly the same way. There are a wealth of hardware and software solutions open to potential podcasters, so setups run the gamut from NPR studios to USB Skype rigs (the latter of which became a kind of default during the pandemic).
This week, we spoke to Jody Avirgan, who co-hosts “Summer Album / Winter Album” with the frontman of the American indie rock band The Hold Steady, Craig Finn. Each episode finds Avirgan and Finn debating whether a classic record should be categorized as a “summer album” or “winter album.”
Avirgan – who previously hosted shows for Radiotopia, TED, FiveThirtyEight, and ESPN – told us about his podcasting set-up of choice. Here he is in his own words:
“Even when I worked at ESPN/FiveThirtyEight, I always had a home recording setup. Since leaving — which happened to coincide with the start of the pandemic — I’ve made my basement recording studio my main home. It’s actually the kitchen of a basement studio apartment, so just off-frame, behind some curtains, is a fridge (unplugged), sink, and lots of cabinets.
“But I’ve hung tons of curtains, scattered soft things around, and put some sound dampening panels up. I think it’s now both cozy and pretty warm-sounding. My mic is an Electro-Voice RE27N/D, a $500 studio mic.
“To be clear: I don’t make RE27 money. We bought this mic when I was hosting 30 for 30. I left ESPN three weeks before the pandemic hit, and somewhere in there I wrote them an email asking if they wanted me to return the mic. I never got a response, and I certainly didn’t write a follow-up. So I kept it. This is probably why Disney stock is down 20% over the last five years. It’s a very warm mic, but it’s a behemoth.
“When I’m on the road, I pack an AT2020-USB+, which plugs right into my computer and I can knock out tracking from wherever — usually under a blanket in a hotel closet, which is a podcaster’s natural habitat.
“I run my mic through the FocusRite Scarlett 2i2, a simple but mighty interface that lets me control my mic levels and route right into my computer, where I am often joining people over Zoom or Riverside. I always record a local backup file using Hindenburg, which I then save to Dropbox. All roads eventually lead to Dropbox.
“The one place I deviate from the typical Podcaster 101 kit is in my headphones. Everyone has the Sony MDR-7506, and I’ve run through my fair share of those, but I really like the Rode NTH-100 headphones. They are just a little more comfortable, look a little slicker, and so far the padding hasn’t broken down in the way that the padding on the Sony’s inevitably does, leading one to find little black flecks in their ears after taping.
“Like a lot of podcasters, I’ve been doing more and more video stuff lately. I’ve used Descript for years, but as the worlds of audio and video have merged, I do almost all my editing in it at this point. I make social videos of our conversations for “This Day” and “Summer Album / Winter Album,” but also original stuff I’ve been playing with on Instagram.
“I’m doing a series each week where I try to guess the title of that week’s New Yorker cover, and I record that right into Descript and turn it around in like 20 minutes using a template I built. Descript — I’m a big fan. It’s very versatile, and it’s nice to work with a program that seems to give a crap about what podcasters want, as opposed to ProTools.
“I suppose I’ve had to think about my visual setup a fair amount, too. I bought the webcam that the Wirecutter recommended, but honestly I prefer the look of the MacBook camera, so I usually just use that. In my background, I put some books to prove that I know how to read; a signed photo of George Mikan, about whom there was a running bit in “Death At The Wing” — and $28 worth of fake plants from Ikea.
“I block the view so I don’t think people can even see that the plants are there; but I like to know that they are there, and will always be there, because of forever plastics.”
We’ve previously asked some of our other favorite podcast hosts and producers to highlight their workflows: the equipment and software they use to get the job done. The list so far includes:
We’ve all been there. A favorite item is suddenly unavailable for purchase. Couldn’t the manufacturer have given you advance warning?
Whether owing to low sales, changing habits, production costs, or even because something is a little wrong with your favorite product (shh), discontinued items are part of life. In a weekend piece, the New York Times delves into the not-so-dark underbelly of online places where shoppers find these items, share tips and, yes, find emotional support.
The story highlights a padded laptop bag made by Filson that a super fan now hunts “down everywhere” to snag as many as possible “before everyone figures out how great they are.” It points to Discontinued Beauty, a site whose offerings are old to visitors but new to the site. Among its latest products: an “essential protein restructurizer” by Redken priced at an eye-popping $169.95. (The newest version of the product costs shoppers $32.)
Could it be dangerous to use these discontinued products? Who cares, suggests one creative director, who tells the Times about a lip pencil that the beauty company NARS no longer sells and that she has found elsewhere. “Now, do I know the proper way to store this for optimal conditions? No,” she says. “They’re under my sink.”
Welcome back to Week in Review. This week we’re looking at the internal chaos surrounding HP’s $116 million acquisition of AI Pin maker Humane; Mira Murati’s new AI venture coming out of stealth; Duolingo killing its iconic owl mascot with a Cybertruck; and more! Let’s get into it.
Humane’s AI pin is dead. The hardware startup announced that most of its assets have been acquired by HP for $116 million, less than half of the $240 million it raised in VC funding. The startup will immediately discontinue sales of its $499 AI Pins, and after February 28, the wearable will no longer connect to Humane’s servers. After that, the devices won’t be capable of calling, messaging, AI queries/responses, or cloud access. Customers who bought an AI Pin in the last 90 days are eligible for a refund, but anyone who bought a device before then is not.
Hours after the HP acquisition was announced, several Humane employees received job offers from HP with pay increases between 30% and 70%, plus HP stock and bonus plans, according to internal documents seen by TechCrunch and two sources who requested anonymity. Meanwhile, other Humane employees — especially those who worked closer to the AI Pin devices — were notified they were out of a job.
Apple’s long-awaited iPhone SE refresh has been revealed, three years after the last major update to the budget-minded smartphone. The 16e is part of an exclusive group of handsets capable of running Apple Intelligence due to the addition of an A18 processor. The iPhone 16e also ditched the Touch ID home button in favor of Face ID and swapped out the Lightning port in favor of USB-C. The iPhone 16e starts at $599 and will begin shipping February 28.
This is TechCrunch’s Week in Review, where we recap the week’s biggest news. Want this delivered as a newsletter to your inbox every Saturday? Sign up here.
RIP, Duo: Duolingo “killed” its iconic owl mascot with a Cybertruck, and the marketing stunt is going surprisingly well. The company launched a campaign to save Duo — and encourage users to do more lessons — as the company says it’s “Duo or die.” Read more
OpenAI “uncensors” ChatGPT: OpenAI no longer wants ChatGPT to take an editorial stance, even if some users find it “morally wrong or offensive.” That means ChatGPT will now offer multiple perspectives on controversial subjects in an effort to be neutral. Read more
Uber vs. DoorDash: Uber is suing DoorDash, accusing its delivery rival of stifling competition by intimidating restaurant owners into exclusive deals. Uber alleges that DoorDash bullied restaurants into working only with it. Read more
Mira Murati’s next move: Former OpenAI CTO Mira Murati’s new AI startup, Thinking Machines Lab, has come out of stealth. The startup, which includes OpenAI co-founder John Schulman and former OpenAI chief research officer Barret Zoph, will focus on building collaborative “multimodal” systems. Read more
Introducing Grok 3: Elon Musk’s xAI released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok iOS and web apps. Musk claims that the new family of models is a “maximally truth-seeking AI” that is sometimes “at odds with what is politically correct.” Read more
Hackers on Steam: Valve removed a video game from Steam that was essentially designed to spread malware. Security researchers found that whoever planted it modified an existing video game in an attempt to trick gamers into installing an info-stealer called Vidar. Read more
Another DEI U-turn: Mark Zuckerberg and Priscilla Chan’s charity will end internal DEI programs and stop providing “social advocacy funding” for racial equity and immigration reforms. The switch comes just weeks after the organization assured staff it would continue to support DEI efforts. Read more
Amazon shuts down its Android app store: Amazon will discontinue its app store for Android in August in an effort to put more focus on the company’s own devices. The company told developers that they will no longer be able to submit new apps to the store. Read more
Mark Zuckerberg’s rebrand didn’t pay off: A study by the Pew Research Center found that Americans’ views of Elon Musk and Mark Zuckerberg are more negative than positive. About 54% of U.S. adults say they have an unfavorable view of Musk while a whopping 67% feel negatively toward Zuckerberg. Read more
Noise-canceling headphones could hurt your brain: A new BBC report considers whether noise-canceling tech might be rewiring the brains of people who use it to tune out pesky background noise — and could lead to the brain forgetting how to filter sounds itself. Read more
An exhaustive look at the DOGE universe: The dozens of individuals who work under, or advise, Elon Musk and DOGE are a real-life illustration of Musk’s weblike reach in the tech industry. TechCrunch has unveiled the major players in the DOGE universe, from Musk’s inner circle to senior figures, worker bees, and aides — some of whom are advising and recruiting for DOGE. We highlight both the connections between them and how they entered Musk’s orbit. Read more
The General Services Administration, the agency that manages buildings owned by the federal government, is planning to shut down its entire network of electric vehicle chargers, according to a report in The Verge.
The GSA reportedly operates a network of hundreds of EV chargers with a total of 8,000 plugs that can be used to charge vehicles owned by the government and by federal employees. A source told The Verge that federal workers will receive guidance next week to shut those chargers down, with some regional offices already told to take their chargers offline.
Earlier this week, Colorado Public Radio obtained an internal email stating that charging stations at the Denver Federal Center would be shut down as they are “not mission critical.”
More broadly, President Donald Trump’s administration has been aggressively cutting government agencies and pulling back federal support for renewable energy, including for an EV charging infrastructure program that previously provided millions to Tesla.
TechCrunch has reached out to the GSA for comment.