Blue Diamond Web Services


April 1, 2025

Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn’t license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models’ training data. A type of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, it suggests that the model might have prior knowledge of the text from its training data.
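To make the idea concrete, here is a minimal sketch of that kind of test in Python. It is not the paper's code. It assumes a quiz format in which the verbatim passage is shuffled in among AI-written paraphrases, and `choose` is a hypothetical stand-in for whatever call asks a model to pick the passage it believes is the original. A recognition rate well above the chance rate hints that the model has seen the text before.

```python
import random
from typing import Callable, List, Sequence, Tuple


def recognition_rate(
    items: Sequence[Tuple[str, List[str]]],
    choose: Callable[[List[str]], int],
    seed: int = 0,
) -> float:
    """Return the fraction of quizzes in which `choose` picks the verbatim passage.

    Each item is (original_passage, paraphrases). `choose` is a hypothetical
    callable that receives the shuffled options and returns the index it
    believes is the human-authored original.
    """
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)  # fixed seed so the shuffles are reproducible
    correct = 0
    for original, paraphrases in items:
        options = [original] + list(paraphrases)
        rng.shuffle(options)  # hide the original's position
        if options[choose(options)] == original:
            correct += 1
    return correct / len(items)


def chance_rate(num_options: int) -> float:
    """Accuracy expected from blind guessing among `num_options` choices."""
    return 1.0 / num_options


if __name__ == "__main__":
    # Toy data: one "book excerpt" plus three paraphrases per quiz.
    quizzes = [
        ("the original text", ["para a", "para b", "para c"]),
        ("second original", ["p1", "p2", "p3"]),
    ]
    seen = {original for original, _ in quizzes}

    # A pretend model that has memorized every original passage.
    def memorizer(options: List[str]) -> int:
        return next(i for i, text in enumerate(options) if text in seen)

    print(recognition_rate(quizzes, memorizer), "vs chance", chance_rate(4))
```

In a real study the gap between the two numbers would also need the kind of adjustment for confounding factors that the authors describe, since a newer model may simply be better at spotting human writing.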

The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.

According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.

That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

OpenAI didn’t respond to a request for comment.

Keep reading the article on TechCrunch


CaaStle board confirms financial distress, furloughing employees

CaaStle, a startup that launched in 2011 as a plus-sized clothing subscription service and later became an inventory monetization platform for clothing retailers, is facing financial difficulties, the company confirmed to TechCrunch following a report by Axios.

Citing a letter from the board, Axios reported that the company is almost out of money, that Christine Hunsicker has resigned as CEO and from the board, and that the company has involved law enforcement to investigate alleged financial misconduct.

The company also confirmed to TechCrunch that it furloughed all of its employees.  

“The Board is deeply disappointed by the conduct that has led to this moment. Our immediate focus is on addressing the company’s challenges, supporting our employees, and preserving the value of our technology and business operations. We regret having to temporarily furlough our employees, but we believe this will best position the company to successfully recover from our current situation,” the company said in an emailed statement after TechCrunch inquired about the company’s status.

CaaStle has raised over $530 million in total, PitchBook estimates. Its last round, in 2019, was $43 million.

In that letter, also cited by Puck, the board is alleging that Hunsicker misled at least some of the company’s investors about financial performance, and about the company’s capital and outstanding shares, including two “falsified” audit opinions. 

Both Axios and Puck have reported that days before Hunsicker exited the company, she was out fundraising and making claims about the company’s healthy finances.

Axios has noted that if the board’s allegations lead to a case of fraud made against the founder, this would be one of the largest such cases ever. 

Last week, Charlie Javice, the founder of student loan application startup Frank, which was purchased by JPMorgan for $175 million, was found guilty of defrauding the bank. The bank claimed Javice inflated the customer count. But the investment numbers for CaaStle are three times as large.

While this might not be a typical startup shutdown experience, experts have told TechCrunch that 2025 is on track to be another brutal year for failed startups. 



Andreessen Horowitz is trying to nab a piece of TikTok with Oracle, report says

The venture capital firm is reportedly in talks to invest in TikTok as part of a bid led by Oracle and other American investors looking to buy out TikTok from ByteDance, according to the Financial Times. 

TikTok is once again slated to be banned in the U.S. on April 5 unless its China-based owner sells its U.S. branch to a non-Chinese owner. The Oracle deal is said to be one of the frontrunners, according to the FT.

Andreessen Horowitz has a long history of investing in social media: It was an early investor in Facebook and Instagram and invested $400 million to help Elon Musk acquire Twitter. The firm did not immediately respond to a request for comment.



An accounting startup has turned tax preparations into a Pokémon Showdown game

Accounting software company Open Ledger has launched a new product in time for tax day. 

Meet PokéTax, a game that helps make tax filing quite fun. Instead of tax forms, users take on Tax Trainers — gym leaders — representing different parts of a tax form, such as income, deductions, and credits. Each leader asks questions that help players complete their tax forms. 


“Once you finish your PokéTax run, we guide you to the IRS Direct File site to officially submit,” Open Ledger co-founder Pryce Adade-Yebesi told TechCrunch. The game is an adaptation of the open source Pokémon game called Pokémon Showdown, and he promised this was not an April Fool’s joke. 

“This is real; it works. Tax fraud isn’t funny — and neither is the IRS,” he said. 

Adade-Yebesi and Ashtyn Bell launched Open Ledger earlier this year and raised a $3 million round led by Kindred Ventures and Black Ventures. Adade-Yebesi said his team first built this product, which is open source, as a joke. “Could we actually pull this off?” he and his team pondered. The answer was clearly yes. 

The game has an AI assistant that helps organize users’ responses, and players can win badges — discover new deductions — as they take on the Tax Trainers. 


Taxes are such an unloved part of being a good citizen that few founders think of turning the process into a game. Notably, in 2023, there was the dating-style game Tax Heaven 3000, where users went on a date with an avatar named Iris who asked questions to help complete a tax form. But that was only for the 2022 tax filing year. 

Adade-Yebesi hopes that by adding fun to such financial processes, they will be “more engaging and way less soul-sucking.” 

Taxes are due April 15.



Who are climate-conscious consumers? Not who you’d expect, says Northwind Climate

Sometimes, surprises are lurking in everyday data.

Take a category of consumers that Doug Rubin’s startup, Northwind Climate, calls “climate doers.” They’re concerned about climate change and tend to prioritize climate-friendly purchases, the sort of identifiers that might stereotypically be associated with things like buying organic foods or prioritizing local businesses.

“Turns out that the climate doers category actually are the consumers who most frequent fast-food restaurants,” Rubin told TechCrunch. What’s more, some 30% of climate doers are Republicans, he added.

Northwind Climate evolved from Rubin’s work in the political world, where surveys are vital to understanding shifts in public sentiment and identifying likely voters. The startup has raised a $1.05 million pre-seed round, it exclusively told TechCrunch, with participation from angel investors, including Tom Steyer, former Massachusetts governor Deval Patrick, and Alexander Hoffmann of Susty Ventures.

Rather than divide people into demographic buckets that might segment along political, generational, or regional lines, Northwind Climate analyzes survey responses for behavioral clues that can be used to classify consumers.

In addition to climate doers, who comprise about 15% of all U.S. consumers, Northwind Climate has identified four other behavioral groups, ranging from “climate distressed,” or people who are slightly less concerned about climate change and aren’t as financially secure as the climate doers, to the climate deniers, who tend to be retirees who think the media is exaggerating the problem.

But, Rubin adds, “even in that [climate deniers] bucket, there are messages and ways that work with them.”

Northwind Climate has found five discrete segments that describe consumers’ views on climate change. Image credits: Northwind Climate

Take some analysis Northwind did on electric vehicles. For climate doers and the “climate distressed,” the two categories of consumers who are most likely to buy an EV, the startup suggests that automakers frame the cars as a matter of choice. “We’re providing choices for those who care about reducing pollution, saving money on gas, and helping address climate change,” reads one of Northwind’s suggested pitches.

But for climate doubters and deniers, who are less likely to buy one, the focus of the pitch shifts from choice to freedom: “Americans should have the freedom to drive what they want. We want to make electric vehicles clean, affordable, and practical for the millions of Americans who want one.”

The startup has built a database that consists of 20,000 survey respondents across eight surveys, and Rubin says it’s growing by 2,500 respondents per month. Every three months, Northwind also runs an industry-specific survey to capture deeper insights for different customers.

Companies that subscribe to the service can add up to four of their own questions every quarter. A typical customer pays $10,000 per quarter or $40,000 per year, which Rubin said is less than what they’d shell out for one annual survey.

Within the platform, customers get access to the data Northwind has collected, questions it has asked, and some basic analyses like cross tabulations. The startup is building a chatbot to allow users to ask for more specific analyses using plain language queries.

Concerned consumers might cast a wary eye on such a platform, worried that it might help companies greenwash their businesses. But Rubin isn’t concerned, saying surveys have shown that consumers are pretty savvy. “Our data shows there is a clear risk to brands and their reputations from making claims that are exaggerated or otherwise untrue,” Rubin said.

Rubin said that Northwind is also developing what he calls a virtual focus group. It’s essentially an AI model, trained on survey responses, that can analyze a company’s marketing materials like TV spots or social media ads and provide feedback, just like a human focus group would. The startup hopes to have it available in the next four to five months, Rubin said, though it will use new data to continually refine the model.

Rubin is convinced that companies have been missing opportunities to connect with climate-conscious consumers.  “If you look at the data and where consumers are — and it’s across the board, it’s not just Democrats or Independents — they really want this, and they will reward companies who are willing to be smart about it,” he said.



Zelle is shutting down its app, but you probably don’t need to worry

Zelle is shutting down its stand-alone app on Tuesday, according to a company blog post.

This news might be alarming if you’re one of the over 150 million customers in the U.S. who use Zelle for person-to-person payments. But only about 2% of transactions take place via Zelle’s app, which is why the company is discontinuing its stand-alone app.

Most consumers access Zelle via their bank, which then allows them to send money to their phone contacts. Zelle users who relied on the stand-alone app will have to re-enroll in the service through another financial institution.

Given the small user base of the Zelle app, it makes sense that the company would decide to get rid of it: maintaining an app takes time and money, especially one where people’s financial information is involved.

Zelle launched in 2017 with backing from 30 banks to be a more efficient alternative to Venmo. On Venmo, users can receive payments into their own Venmo wallet, which they can then deposit into their actual bank account — but if you don’t want to wait a few days for the deposit to process, you’ll have to pay a fee for an instant transfer. Because of Zelle’s connections with banks, it’s able to offer instant transfers without charging additional fees.

Zelle said that in 2024, users sent $1 trillion in payments, more than any other payment app has recorded. This might be the case because consumers tend to use Zelle for larger payments like rent. Venmo, on the other hand, is designed for more social use, like reimbursing a friend for dinner.


