OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn’t license to train more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.
While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.
The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)
In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.
“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models’ training data. A type of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
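To make the idea concrete, here is a minimal sketch of that kind of test, not the paper's actual code. It assumes each trial shows one verbatim excerpt mixed in with a few paraphrases and asks a model to pick the original. The names `guess_rate`, `chance_rate`, `demo_trials`, and `memorized_pick` are hypothetical, and `memorized_pick` is a toy stand-in for a real model query.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One trial: (verbatim excerpt, paraphrases of that same excerpt).
Trial = Tuple[str, List[str]]


def guess_rate(trials: Sequence[Trial],
               pick: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of trials in which `pick` selects the verbatim excerpt."""
    rng = random.Random(seed)
    hits = 0
    for verbatim, paraphrases in trials:
        options = [verbatim] + list(paraphrases)
        rng.shuffle(options)        # hide the position of the original text
        choice = pick(options)      # index returned by the (hypothetical) model
        hits += options[choice] == verbatim
    return hits / len(trials)


def chance_rate(n_options: int) -> float:
    """Expected hit rate for a model that guesses uniformly at random."""
    return 1.0 / n_options


# Toy data, invented purely for illustration.
demo_trials: List[Trial] = [
    ("The cache is flushed on every write.",
     ["Each write empties the cache.",
      "Writes always clear the cache.",
      "On any write, the cache gets flushed out."]),
    ("Threads share the same heap.",
     ["All threads use one heap.",
      "The heap is common to every thread.",
      "A single heap serves all threads."]),
]

# Stand-in for a model that has memorized the originals.
_memorized = {verbatim for verbatim, _ in demo_trials}


def memorized_pick(options: List[str]) -> int:
    return next(i for i, text in enumerate(options) if text in _memorized)


rate = guess_rate(demo_trials, memorized_pick)
baseline = chance_rate(4)
```

A hit rate well above the random-guess baseline is the kind of signal the article describes. A rate near the baseline would suggest no recognition, though as the co-authors note, neither outcome is conclusive on its own.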
The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.
According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.
“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.
It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.
That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.
Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.
OpenAI didn’t respond to a request for comment.
Keep reading the article on Tech Crunch
Meta’s VP of AI research, Joelle Pineau, is planning to leave the company, she announced in a post on Facebook Tuesday. Pineau said she’s leaving in May after more than two years overseeing FAIR, Meta’s internal AI research lab led by Yann LeCun.
Pineau’s exit comes as Meta ramps up its AI efforts, with the company planning to spend $65 billion on AI infrastructure in 2025.
In a statement to Bloomberg News, a Meta spokesperson said the company does not have an immediate replacement for Pineau but is conducting a search for her successor. Last year, Meta reportedly reorganized the company to have its AI research unit report to the company’s chief product officer, Chris Cox.
As for Pineau, the executive said she’ll take some time off before jumping into an unnamed “new adventure.”
In a series of posts on X on Monday, OpenAI CEO Sam Altman said that the popularity of the company’s new image-generation tool in ChatGPT will cause unspecified product delays.
“We are getting things under control, but you should expect new releases from OpenAI to be delayed, stuff to break, and for service to sometimes be slow as we deal with capacity challenges,” Altman wrote. “Working as fast we can to really get stuff humming.”
OpenAI’s new image-generation capability arrived with much fanfare — and controversy — for its impressive ability to recreate styles like Studio Ghibli’s hand-drawn animation. Over the weekend, Altman said in posts on X that the company “hasn’t been able to catch up” since launch and that staff have worked late nights and through the weekend to “keep the service up.”
In a single hour on Monday, ChatGPT added a million new users, Altman claimed in a post. ChatGPT now has 500 million weekly users and 20 million paying subscribers, up from 300 million users and 15.5 million subscribers at the end of 2024.
In an effort to ease its capacity issues, OpenAI delayed the release of the image-generation tool for free ChatGPT users and temporarily disabled video generation for new users of Sora, the company’s suite of generative AI media tools.
Qualcomm has acquired the generative AI division of VinAI, an AI research company headquartered in Hanoi, for an undisclosed amount, the companies announced on Monday.
The move marks Qualcomm’s continued expansion into the AI tooling sector. VinAI, which was founded by former DeepMind research scientist Hung Bui, develops a range of generative AI technologies, including computer vision algorithms and language models.
“This acquisition underscores our commitment to dedicating the necessary resources to R&D that makes us the driving force behind the next wave of AI innovation,” Qualcomm SVP of Engineering Jilei Hou said in a press release. “By bringing in high-caliber talent from VinAI, we are strengthening our ability to deliver cutting-edge AI solutions that will benefit a wide range of industries and consumers.”
VinAI, which Bui started in 2019, primarily focuses on AI-powered automotive products, but also conducts higher-level AI research. Backed by VinGroup, a Vietnamese conglomerate, the company creates solutions like in-cabin monitoring, security, and “smart parking” systems for carmakers and customers in other verticals.
In a 2023 interview with Forbes, Bui said that VinAI had around 200 employees spread across the startup’s offices in Hanoi, the U.S., and Australia.
Bui said that he expects VinAI will contribute to a number of Qualcomm’s product families, including its software and chips for smartphones, PCs, and vehicles. “Our team’s expertise in generative AI and machine learning will help accelerate the development of innovative solutions that can transform the way we live and work,” he added in a statement.
Bui, who serves as VinAI’s CEO, will join Qualcomm following the close of the acquisition, according to the aforementioned press release.
The VinAI acquisition is Qualcomm’s second this year, following its purchase of Edge Impulse, a U.S.-based AI and Internet of Things company, in early March. Qualcomm CEO Cristiano Amon recently called edge AI (AI that can run on devices without the need for data center infrastructure) a “tailwind” for the tech giant.
You know the online dating scene is bad when dating giants like Tinder are now introducing AI personas for users to flirt with.
On Tuesday, the company announced a new game powered by OpenAI, allowing users to interact with an AI bot to practice flirting, reenact meet-cute scenarios, and receive scores with suggestions for improving their dating skills.
To play Tinder’s The Game Game, tap the Tinder logo in the top left corner of the app. The game gives users a deck of cards, with each one featuring a different AI persona and scenario. Users must use their voices to respond and try to flirt their way into getting a date with the bot.
After the interaction, users are scored on a three-point scale using flame emojis. The AI provides real-time feedback throughout the experience. If users are rude, for instance, the AI offers suggestions to improve the conversation.
According to the company, the new game is intended to provide a fun and lighthearted experience, not to be taken too seriously. It’s only available for U.S. users on iOS for a limited time.
However, the trend of people flirting with AI bots is becoming scarily popular, and Tinder seems to be banking on this as a way to attract more users amid its struggles for growth. There are already existing apps in this space, such as Replika’s AI dating sim Blush, Teaser, and Rizz.
Tinder has announced other AI features, such as an AI photo selector tool that launched last year and upcoming features for discovery and matching.