Blue Diamond Web Services

Your Best Hosting Service Provider!

April 1, 2025

Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn’t license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models’ training data. A form of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
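To make the idea concrete, here is a minimal sketch of the kind of quiz such a test relies on. It is not the paper's code: the `choose` callable is a hypothetical stand-in for a query to a real model, `memorizing_chooser` is a toy model invented for illustration, and the statistical step that turns guess rates into probabilities is left out.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: given the candidate passages, return the index
# of the one the model believes is the verbatim, human-authored text.
Chooser = Callable[[Sequence[str]], int]


def guess_rate(items, choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim passage.

    `items` is a list of (original, paraphrases) pairs.
    """
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # shuffle so answer position leaks nothing
        picked = choose(options)
        if options[picked] == original:
            correct += 1
    return correct / len(items)


def memorizing_chooser(seen: set) -> Chooser:
    """Toy stand-in for a model that recognizes text it has 'seen'."""
    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: fall back to a fixed guess
    return choose


# Invented example data, three paraphrases per passage.
items = [
    ("the quick brown fox", ["a fast brown fox", "the speedy fox", "a quick fox"]),
    ("to be or not to be", ["being or not being", "to exist or not", "be or do not be"]),
]
chance = 1 / 4  # one verbatim passage among four options
rate = guess_rate(items, memorizing_chooser({orig for orig, _ in items}))
print(rate, chance)  # prints 1.0 0.25
```

A guess rate well above the chance level on text published before a model's training cutoff, but not on text published after it, is the kind of signal the method treats as evidence of prior exposure.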

The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.

According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.

That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

OpenAI didn’t respond to a request for comment.

Keep reading the article on TechCrunch


March 31, 2025

Hillary Clinton joins Bluesky

Former Secretary of State Hillary Clinton has joined Bluesky, following in the footsteps of former president Barack Obama, who made his first post on the platform last week. Clinton confirmed her Bluesky account’s legitimacy in an Instagram story and a post on X.

In her introductory post, Clinton wrote about the Wisconsin Supreme Court election taking place on Tuesday. The state judicial race is getting unprecedented attention on both sides of the aisle, since it determines the political bent of the swing state’s court, which will impact rulings on abortion rights, redistricting, and other issues.

Elon Musk has allegedly pumped more than $20 million into Republican candidate Brad Schimel’s campaign; he gave Trump’s 2024 campaign around $250 million. Echoing a tactic he used during Trump’s presidential run, Musk gave two Schimel voters checks for $1 million each.

As X owner Elon Musk becomes more entrenched in the Trump administration, Bluesky has continued to emerge as an alternative social platform for people disillusioned by Musk’s politics and X’s shift to the right.

Though Bluesky’s reach is smaller than that of the platform formerly known as Twitter, the open source network has north of 33 million users. But as Bluesky attracts high-profile users like Clinton and Obama, the platform is building more legitimacy as a competitor to X.



Signal sees its downloads double after scandal

Encrypted messaging app Signal continues to see spiking downloads in the wake of the messaging scandal, which saw The Atlantic’s editor in chief Jeffrey Goldberg added to a group chat where high-ranking officials in the Trump administration were discussing an attack on Houthi rebels in Yemen. The resulting press coverage around the leak of these sensitive plans has been driving more people to check out Signal’s app for the first time, leading to a doubling of its downloads.

According to app intelligence provider Appfigures, downloads rose by 26% on the day the news broke, an indication that users were curious about the app being used by members of the Trump administration. The next day, downloads jumped to 193,000 and by Wednesday hit an all-time high of 195,000, the firm noted.

By comparison, Signal typically sees about 95,000 downloads on an average day.
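The "doubling" can be checked directly from the Appfigures numbers quoted above. The short sketch below only restates that arithmetic; the function name is an illustrative choice, not anything from Appfigures.

```python
# Figures as reported in the article (Appfigures estimates).
BASELINE_DOWNLOADS = 95_000  # typical average day
PEAK_DOWNLOADS = {
    "day after the news broke": 193_000,
    "Wednesday": 195_000,
}


def ratio_to_baseline(downloads: int, baseline: int) -> float:
    """How many times larger `downloads` is than the baseline."""
    return downloads / baseline


for label, downloads in PEAK_DOWNLOADS.items():
    ratio = ratio_to_baseline(downloads, BASELINE_DOWNLOADS)
    print(f"{label}: {ratio:.2f}x a typical day")
# prints:
# day after the news broke: 2.03x a typical day
# Wednesday: 2.05x a typical day
```

Both peak days come out at slightly more than twice the usual daily volume.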

Appfigures has been tracking the impact of the scandal on Signal’s app from day one. It earlier found that Signal app downloads across iOS and Google Play had jumped up 28% on the Monday when the news initially broke, with U.S. downloads up by 45% and downloads in Yemen up by 42%.

The Trump administration has dismissed the seriousness of the incident, in which the Atlantic journalist was accidentally added to a group chat between officials.

Although Secretary of Defense Peter Hegseth, who was in the chat with VP J.D. Vance and others, denied that “war plans” were shared through the encrypted chat app, the Atlantic later published the full message threads which showed officials discussing the time, location, and weapons the U.S. would use in the attack. Trump has since taken to attacking the media for its continued coverage of the incident.

Signal’s app itself was not compromised to allow for this leak. Instead, the journalist was accidentally added to the thread. National Security Adviser Michael Waltz has accepted responsibility for making the group chat in the first place but has deflected blame over the embarrassing mistake.

Appfigures chalks up the doubling of downloads to the old adage “all press is good press,” as the scandal increased Signal’s visibility and likely introduced the app to thousands of users for the first time.



Trump says TikTok deal will come before April 5 deadline

President Donald Trump has said that a deal with TikTok’s parent company ByteDance to sell the app will be finalized before the April 5 deadline, Reuters reports.

“We have a lot of potential buyers,” Trump told reporters. “There’s tremendous interest in TikTok.” He added, “I’d like to see TikTok remain alive.”

Trump extended the deadline for the TikTok ban to April 5 back in January after signing an executive order on his first day in office.

Reuters reported on Friday that private equity firm Blackstone is mulling a small stake in TikTok’s U.S. operations. The firm is considering joining ByteDance’s current non-Chinese shareholders, led by Susquehanna International Group and General Atlantic, in providing fresh capital to bid for TikTok’s U.S. operations. The group has emerged as front-runners in TikTok deal talks.

Trump has previously stated that he is open to extending the deadline again if a deal isn’t reached.


