
February 22, 2025

Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark, taking the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality that isn’t the case.
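
To make the voting idea concrete, here is a minimal sketch of consensus@k in Python. The `model` callable is a hypothetical stand-in for whatever produces one sampled answer per call; this illustrates the scheme, not any lab’s actual evaluation harness:

```python
from collections import Counter

def consensus_at_k(model, problem, k=64):
    """Sample the model k times on one problem and majority-vote.

    `model` is an assumed stand-in that returns a single sampled
    answer per call; with k=64 this is the "cons@64" setup.
    """
    answers = [model(problem) for _ in range(k)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```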

Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (that is, with a single attempt per problem) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” compute. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.)

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.



US AI Safety Institute could face big cuts

The National Institute of Standards and Technology could fire as many as 500 staffers, according to multiple reports — cuts that further threaten a fledgling AI safety organization.

Axios reported this week that the US AI Safety Institute (AISI) and Chips for America, both part of NIST, would be “gutted” by layoffs targeting probationary employees (who are typically in their first year or two on the job). And Bloomberg said some of those employees had already been given verbal notice of upcoming terminations.

Even before the latest layoff reports, AISI’s future was looking uncertain. The institute, which is supposed to study risks and develop standards around AI development, was created last year as part of then-President Joe Biden’s executive order on AI safety. President Donald Trump repealed that order on his first day back in office, and AISI’s director departed earlier in February.

Fortune spoke to a number of AI safety and policy organizations, all of which criticized the reported layoffs.

“These cuts, if confirmed, would severely impact the government’s capacity to research and address critical AI safety concerns at a time when such expertise is more vital than ever,” said Jason Green-Lowe, executive director of the Center for AI Policy.



The fallout from HP’s Humane acquisition 

Welcome back to Week in Review. This week we’re looking at the internal chaos surrounding HP’s $116 million acquisition of AI Pin maker Humane; Mira Murati’s new AI venture coming out of stealth; Duolingo killing its iconic owl mascot with a Cybertruck; and more! Let’s get into it.

Humane’s AI Pin is dead. The hardware startup announced that most of its assets have been acquired by HP for $116 million, less than half of the $240 million it raised in VC funding. The startup will immediately discontinue sales of its $499 AI Pins, and after February 28, the wearable will no longer connect to Humane’s servers. After that, the devices won’t be capable of calling, messaging, AI queries/responses, or cloud access. Customers who bought an AI Pin in the last 90 days are eligible for a refund, but anyone who bought a device before then is not.

Hours after the HP acquisition was announced, several Humane employees received job offers from HP with pay increases between 30% and 70%, plus HP stock and bonus plans, according to internal documents seen by TechCrunch and two sources who requested anonymity. Meanwhile, other Humane employees — especially those who worked closer to the AI Pin devices — were notified they were out of a job.

Apple’s long-awaited iPhone SE refresh has been revealed, three years after the last major update to the budget-minded smartphone. The 16e is part of an exclusive group of handsets capable of running Apple Intelligence thanks to the addition of an A18 processor. The iPhone 16e also ditched the Touch ID home button in favor of Face ID and swapped the Lightning port for USB-C. The iPhone 16e starts at $599 and will begin shipping February 28.


This is TechCrunch’s Week in Review, where we recap the week’s biggest news.


News


RIP, Duo: Duolingo “killed” its iconic owl mascot with a Cybertruck, and the marketing stunt is going surprisingly well. The company launched a campaign to save Duo — and encourage users to do more lessons — saying it’s “Duo or die.”

OpenAI “uncensors” ChatGPT: OpenAI no longer wants ChatGPT to take an editorial stance, even if some users find it “morally wrong or offensive.” That means ChatGPT will now offer multiple perspectives on controversial subjects in an effort to be neutral.

Uber vs. DoorDash: Uber is suing DoorDash, accusing its delivery rival of stifling competition by intimidating restaurant owners into exclusive deals. Uber alleges that DoorDash bullied restaurants into working only with it.

Mira Murati’s next move: Former OpenAI CTO Mira Murati’s new AI startup, Thinking Machines Lab, has come out of stealth. The startup, which includes OpenAI co-founder John Schulman and former OpenAI chief research officer Barret Zoph, will focus on building collaborative “multimodal” systems.

Introducing Grok 3: Elon Musk’s xAI released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok iOS and web apps. Musk claims that the new family of models is a “maximally truth-seeking AI” that is sometimes “at odds with what is politically correct.”

Hackers on Steam: Valve removed a video game from Steam that was essentially designed to spread malware. Security researchers found that whoever planted it modified an existing video game in an attempt to trick gamers into installing an info-stealer called Vidar.

Another DEI U-turn: Mark Zuckerberg and Priscilla Chan’s charity will end internal DEI programs and stop providing “social advocacy funding” for racial equity and immigration reforms. The switch comes just weeks after the organization assured staff it would continue to support DEI efforts.

Amazon shuts down its Android app store: Amazon will discontinue its app store for Android in August in an effort to put more focus on the company’s own devices. The company told developers that they will no longer be able to submit new apps to the store.

Mark Zuckerberg’s rebrand didn’t pay off: A study by the Pew Research Center found that Americans’ views of Elon Musk and Mark Zuckerberg are more negative than positive. About 54% of U.S. adults say they have an unfavorable view of Musk, while a whopping 67% feel negatively toward Zuckerberg.

Noise-canceling headphones could hurt your brain: A new BBC report considers whether noise-canceling tech might be rewiring the brains of people who use it to tune out pesky background noise — and could lead to the brain forgetting how to filter sounds itself.

Analysis


An exhaustive look at the DOGE universe: The dozens of individuals who work under, or advise, Elon Musk and DOGE are a real-life illustration of Musk’s weblike reach in the tech industry. TechCrunch has mapped out the major players in the DOGE universe, from Musk’s inner circle to senior figures, worker bees, and aides — some of whom are advising and recruiting for DOGE. We highlight both the connections between them and how they entered Musk’s orbit.



February 21, 2025

Court filings show Meta staffers discussed using copyrighted content for AI training

For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed on Thursday.

The documents were submitted by plaintiffs in the case Kadrey v. Meta, one of many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Previous materials submitted in the suit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. But the new filings, most of which show portions of internal work chats between Meta staffers, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager for Meta’s Llama model research team, discussed training models on works they knew may be legally fraught.

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

Martinet floated the idea of buying e-books at retail prices to build a training set rather than cutting licensing deals with individual book publishers. After another staffer pointed out that using unauthorized, copyrighted materials might be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already using pirated books for training.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time …”

In the same chat, Kambadur, who noted Meta was in talks with document hosting platform Scribd “and others” for licenses, cautioned that while using “publicly available data” for model training would require approvals, Meta’s lawyers were being “less conservative” than they had been in the past with such approvals.

“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur said, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Talks of Libgen

In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Libgen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur’s colleagues responded with a screenshot of a Google Search result for Libgen containing the snippet “No, Libgen is not legal.”

Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” referring to the goal of topping the best, state-of-the-art (SOTA) models across benchmark categories.

Theakanath’s email also outlined “mitigations” intended to reduce Meta’s legal exposure, including removing data from Libgen “clearly marked as pirated/stolen” and simply never citing the usage publicly. “We would not disclose use of Libgen datasets used to train,” as Theakanath put it.

In practice, these mitigations entailed combing through Libgen files for words like “stolen” or “pirated,” according to the filings.
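
For illustration, that kind of keyword screen could be as simple as the sketch below; the record format and exact marker list are assumptions, not code from the filings:

```python
RISKY_MARKERS = ("pirated", "stolen")  # words the filings say were searched for

def screen_records(records):
    """Keep only records whose text mentions none of the risky markers.

    `records` is assumed to be an iterable of dicts with a "text" field;
    this illustrates the described screening step, nothing more.
    """
    return [
        record for record in records
        if not any(marker in record.get("text", "").lower()
                   for marker in RISKY_MARKERS)
    ]
```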

In a work chat, Kambadur mentioned that Meta’s AI team also tuned models to “avoid IP risky prompts” — that is, configured the models to refuse to answer questions like “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which e-books you were trained on.”

The filings contain other revelations, implying that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies to access data for model training.

In one chat dated March 2024, Chaya Nayak, director of product management at Meta’s generative AI org, said that Meta leadership was considering “overriding” past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company’s models had sufficient training data.

Nayak implied that Meta’s first-party training datasets — Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages — simply weren’t enough. “[W]e need more data,” she wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest version alleges, among other claims, that Meta cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.

In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta didn’t immediately respond to a request for comment.



iOS 18.4 will bring Apple Intelligence-powered ‘Priority Notifications’

Apple on Friday released its first developer beta for iOS 18.4, which adds a new “Priority Notifications” feature, powered by Apple Intelligence. The addition aims to help users manage their notifications by prioritizing important alerts and minimizing distractions from less important ones. 

These priority notifications are displayed in a separate section on the phone’s Lock Screen. Apple Intelligence will analyze which notifications it believes should be shown in this section, but you can still swipe up to view all of your notifications. 

Currently, the iPhone sorts notifications chronologically, with the most recent alerts displayed on top. With the new feature, you’ll see important notifications first, even if they arrived earlier than other alerts.
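
Conceptually, the re-ordering works something like this sketch; the `Notification` type and its fields are illustrative stand-ins, not Apple’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Notification:
    text: str
    received_at: float  # Unix timestamp (hypothetical field)
    is_priority: bool   # as judged by the on-device model

def lock_screen_order(notifications):
    # Priority alerts surface first regardless of age; within each
    # group, the most recent notifications still appear on top.
    return sorted(notifications,
                  key=lambda n: (not n.is_priority, -n.received_at))
```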

According to 9to5Mac, Priority Notifications is off by default, but you can enable the feature by heading to your Settings app, selecting the “Notifications” option, and then opening the “Prioritize Notifications” section. Here, you can toggle the feature on. 

Apple announced today that Apple Intelligence is heading to the Vision Pro as part of visionOS 2.4. A beta version of the software is currently available for developers, while the public version is set for an April release. The tech giant also revealed Apple News+ Food, an upcoming section that will allow users to search and save recipes from dozens of existing News+ publishing partners.



Nvidia CEO Jensen Huang says market got it wrong about DeepSeek’s impact

Nvidia founder and CEO Jensen Huang said the market got it wrong when it comes to DeepSeek’s technological advancements and its potential to negatively impact the chipmaker’s business.

Instead, Huang called DeepSeek’s R1 open source reasoning model “incredibly exciting” while speaking with Alex Bouzari, CEO of DataDirect Networks, in a pre-recorded interview that was released on Thursday.

“I think the market responded to R1, as in, ‘Oh my gosh. AI is finished,’” Huang told Bouzari. “You know, it dropped out of the sky. We don’t need to do any computing anymore. It’s exactly the opposite. It’s [the] complete opposite.”

Huang said that the release of R1 is inherently good for the AI market and will accelerate the adoption of AI, rather than signaling that the market no longer needs compute resources like the ones Nvidia produces.

“It’s making everybody take notice that, okay, there are opportunities to have the models be far more efficient than what we thought was possible,” Huang said. “And so it’s expanding, and it’s accelerating the adoption of AI.”

He also pointed out that, despite the advancements DeepSeek made in pre-training AI models, post-training will remain important and resource-intensive.

“Reasoning is a fairly compute-intensive part of it,” Huang added.

Nvidia declined to provide further commentary.

Huang’s comments come almost a month after DeepSeek released the open source version of its R1 model, which rocked the AI market in general and seemed to disproportionately affect Nvidia. The company’s stock price plummeted 16.9% in a single trading day after the news broke.

Nvidia’s stock closed at $142.62 a share on January 24, according to data from Yahoo Finance. The following Monday, January 27, the stock dropped rapidly and closed at $118.52 a share. That single-day slide wiped roughly $600 billion off Nvidia’s market cap.
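
Those two closing prices line up with the reported one-day plunge; a quick back-of-the-envelope check:

```python
# Closing prices cited above (via Yahoo Finance)
close_jan_24 = 142.62  # Friday, January 24
close_jan_27 = 118.52  # Monday, January 27

pct_drop = (close_jan_24 - close_jan_27) / close_jan_24 * 100
print(f"{pct_drop:.1f}%")  # -> 16.9%
```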

The chip company’s stock has almost fully recovered since then: it opened Friday at $140 a share, regaining nearly all of that lost value in about a month. Nvidia reports its Q4 earnings on February 26, when it will likely address the market reaction in more detail.

Meanwhile, DeepSeek announced on Thursday that it plans to open source five code repositories as part of an “open source week” event next week.


