Blue Diamond Web Services

Your Best Hosting Service Provider!

April 1, 2025

Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn’t license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024 and designed to detect copyrighted content in language models’ training data. A type of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
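To make the idea concrete, here is a minimal Python sketch of the multiple-choice test described above. It is not the paper's implementation: the "model" is a hypothetical stub that either has or has not memorized a passage, and the function names, the sample sentences, and the trial count are all assumptions chosen for illustration. A real test would query an actual language model and apply the statistical corrections the authors describe.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not from the paper): a "model" is any
# callable that receives candidate passages and returns the index of the one
# it believes is the verbatim, human-authored text.
PickFn = Callable[[Sequence[str]], int]


def recognition_rate(pick: PickFn, original: str, paraphrases: Sequence[str],
                     trials: int = 200, seed: int = 0) -> float:
    """Fraction of shuffled multiple-choice trials where `pick` finds the original."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        options = [original, *paraphrases]
        rng.shuffle(options)  # randomize position so ordering gives no hint
        if options[pick(options)] == original:
            hits += 1
    return hits / trials


def make_stub_model(memorized: set[str], seed: int = 1) -> PickFn:
    """Toy stand-in for an LLM: picks a memorized passage if present, else guesses."""
    rng = random.Random(seed)

    def pick(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in memorized:
                return i
        return rng.randrange(len(options))  # no prior knowledge: uniform guess

    return pick


# Illustrative data only; the paper used excerpts from O'Reilly books.
original = "The quick brown fox jumps over the lazy dog."
paraphrases = [
    "A fast brown fox leaps over a sleepy dog.",
    "Over the lazy dog jumps a quick brown fox.",
    "The speedy fox hops above the idle hound.",
]

seen = recognition_rate(make_stub_model({original}), original, paraphrases)
unseen = recognition_rate(make_stub_model(set()), original, paraphrases)
chance = 1 / (len(paraphrases) + 1)  # expected rate for pure guessing

print(f"seen: {seen:.2f}  unseen: {unseen:.2f}  chance: {chance:.2f}")
```

In this toy setup, the stub that has memorized the passage picks it every time, while the stub without prior knowledge lands near the guessing rate. A recognition rate well above chance is the kind of signal the method treats as evidence that a text may have been in the training data.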

The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.

According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.

That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

OpenAI didn’t respond to a request for comment.

Keep reading the article on TechCrunch


The Problem Isn’t Tinder’s New AI but the Dating Apps Themselves

Tinder and OpenAI’s ‘The Game Game’ won’t prove too effective at practicing your pick-up lines, but will remind you of what you’re missing because of dating apps.


Sam Altman says that OpenAI’s capacity issues will cause product delays

In a series of posts on X on Monday, OpenAI CEO Sam Altman said that the popularity of the company’s new image-generation tool in ChatGPT will cause unspecified product delays.

“We are getting things under control, but you should expect new releases from OpenAI to be delayed, stuff to break, and for service to sometimes be slow as we deal with capacity challenges,” Altman wrote. “Working as fast we can to really get stuff humming.”

OpenAI’s new image-generation capability arrived with much fanfare — and controversy — for its impressive ability to recreate styles like Studio Ghibli’s hand-drawn animation. Over the weekend, Altman said in posts on X that the company “hasn’t been able to catch up” since launch and that staff have worked late nights and through the weekend to “keep the service up.”

In a single hour on Monday, ChatGPT added a million new users, Altman claimed in a post. ChatGPT now has 500 million weekly users and 20 million paying subscribers, up from 300 million users and 15.5 million subscribers at the end of 2024.

In an effort to ease its capacity issues, OpenAI delayed the release of the image-generation tool for free ChatGPT users and temporarily disabled video generation for new users of Sora, the company’s suite of generative AI media tools.


ChatGPT isn’t the only chatbot that’s gaining users

OpenAI’s ChatGPT may be the world’s most popular chatbot app. But rival services are gaining, according to data from analytics firms Similarweb and Sensor Tower.

Similarweb, which estimates traffic to websites including chatbot web apps, has recorded healthy recent upticks in usage across bots like Google’s Gemini and Microsoft’s OpenAI-powered Copilot. Gemini’s web traffic grew to 10.9 million average daily visits worldwide in March, up 7.4% month-over-month, while daily visits to Copilot increased to 2.4 million — up 2.1% from February.

Similarweb reports that Anthropic’s Claude reached 3.3 million average daily visits in March, and Chinese AI lab DeepSeek’s chatbot eclipsed 16.5 million average daily visits that same month. Meanwhile, xAI’s Grok, which only gained a web app several months ago, averaged the same number of daily web visits as DeepSeek’s chatbot: 16.5 million.

The numbers pale in comparison to ChatGPT, which surged past 500 million weekly active users in late March. Yet David Carr, editor at Similarweb, noted that there’s fierce competition for the No. 2 chatbot spot.

“[F]or March, DeepSeek is in second place, despite seeing traffic drop 25% from where it was in February, based on daily visits,” Carr told TechCrunch. “China’s DeepSeek came out of nowhere in January, but the AI platform with the greatest momentum at the moment is Grok from Elon Musk’s xAI, with traffic up nearly 800% month-over-month.”

AI companies’ mobile chatbot apps have been growing their user bases, too, perhaps fueled by recent AI model releases.

According to metrics from app data analysis company Sensor Tower, the Claude app saw a 21% week-over-week increase in weekly active users during the week of February 24, when Anthropic released its latest flagship AI model Claude 3.7 Sonnet. Two weeks prior, shortly after Google made its Gemini 2.0 Flash model generally available, the number of Gemini app weekly active users grew by 42%.

Abraham Yousef, senior insights analyst at Sensor Tower, attributed the rising tides not only to new models but also to new capabilities. Just this past month, Google brought a “canvas” feature to Gemini that lets users preview the output of coding projects, and Anthropic has steadily added tools to its Claude client.

“The rollout of popular new AI models, heightened consumer interest in the space, the introduction of various new features and functions, and the growing number of unique use cases has propelled user growth for AI chatbot apps,” Yousef told TechCrunch.

But OpenAI probably isn’t panicking yet. Yousef pointed out that, as of March, ChatGPT had 10 times as many weekly active mobile app users as Gemini and Claude combined.


OpenAI’s new image generator is now available to all users

OpenAI’s new image generator, powered by its GPT-4o model, is now available to all users, CEO Sam Altman said in a post on X. The feature was until now available only to paying users of ChatGPT.

It is not clear how many images users on the free tier can generate, though Altman last week mentioned a limit of three images per day.

OpenAI’s image generation tool took off immediately after launch, with Altman saying demand was so high that the company’s GPUs were “melting.” The tool also quickly gained notoriety for being used to convert pictures into the style of Japanese animation firm Studio Ghibli. Given the similarity in style, that raised concerns about copyright and the training data the company used.

Some people also used it to generate fake receipts, such as restaurant bills. An OpenAI spokesperson told TechCrunch that all these images have metadata indicating that ChatGPT generated them, and that the company “takes actions” if the images violate the company’s guidelines.

Meanwhile, OpenAI today said it raised $40 billion in funding led by SoftBank at a $300 billion valuation. The company also said ChatGPT has hit 500 million weekly active users and 700 million monthly active users.


March 31, 2025

OpenAI raises $40B at $300B post-money valuation

OpenAI on Monday announced that it closed one of the largest private funding rounds in history.

According to a blog post on the company’s website, OpenAI raised $40 billion in a round that values the company at $300 billion post-money. SoftBank led the financing, CNBC reported. Other participants included Microsoft, Coatue, Altimeter, and Thrive, all of which are earlier backers of the company.

“[This new capital] enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT every week,” OpenAI wrote in the blog post. “We’re excited to be working in partnership with SoftBank Group — few companies understand how to scale transformative technology like they do.”

CNBC, citing a source familiar with the matter, says that around $18 billion of the funding will go toward OpenAI’s ambitious Stargate infrastructure project, which aims to establish a network of AI data centers around the U.S.

