AI Training Data Liability: Who Owns Creator Content? (2026)


Somewhere in a model that generates millions of words a day lives the prose of a novelist who cannot pay rent. She did not license her work. She was not asked. She found out the way most creators did: a journalist called, or a lawsuit got filed, and her name appeared in a legal brief as evidence of someone else’s problem.

That problem is now a docket. Multiple dockets, in multiple jurisdictions, moving slowly but no longer going nowhere. The question of AI training data liability — who owes what to whom when a machine learns from work it never licensed — is the most consequential unsettled question in intellectual property since the internet made copying frictionless. Courts are not yet ruling on the merits. They are doing something almost as important: they are letting discovery proceed.

What Discovery Means When the Evidence Is a Model

Discovery in these cases is not routine. Plaintiffs are not asking for emails. They are asking AI companies to explain, under oath, exactly which data was scraped, from where, and when — and then to show whether that data still lives, in some transformed but traceable form, inside the weights of a deployed model. That is a technical question dressed in legal clothing, and no court has a clean answer for it yet.

McKool Smith’s April 2025 litigation tracker documents the consolidation of cases against major AI developers, including OpenAI, Meta, and Google, across multiple venues. The pattern is consistent: authors, news organizations, and software developers file separately, courts thread the cases together, and discovery timelines extend into late 2025 and beyond. As of that tracker, no major defendant had prevailed on a fair use argument at the merits stage, and no plaintiff had won damages. The legal ambiguity is not resolving; it is deepening.

For the writers and journalists whose work sits inside these models, that ambiguity is not academic. It is the difference between a licensing check and nothing.

The Creators Who Were Not in the Room

The decisions that determined how training data would be assembled — what to scrape, what to exclude, whether to license at all — were made in infrastructure teams, not editorial meetings. The affected parties were not stakeholders in any meaningful sense. They were sources.

Freelance journalists, mid-list novelists, academic bloggers, translators: the people whose work gave early large language models their fluency in human writing were precisely the workers least positioned to negotiate. Staff writers at major newspapers at least had institutional backing when their publishers eventually sued. The individual creator had a terms-of-service agreement they never read, and a copyright registration they may not have filed in time.

“The fundamental asymmetry here is that the people who created the most distinctive training signal — original voice, original argument, original style — are also the people with the least leverage to capture any of the value that signal produces.”

— IP litigation attorney, active in AI copyright proceedings

This is where the victim’s position becomes structural, not just sympathetic. The concentration of benefit is not incidental to how these systems were built. It is a consequence of building them at speed, before legal frameworks existed, and betting that fair use would cover the gap. That bet is now being stress-tested in court — and the people bearing the cost of the uncertainty are the ones who had no say in placing it.

Fair Use Was Always a Wager, Not a Shield

The AI industry’s working assumption, stated carefully in terms-of-service language and more bluntly in investor presentations, was that training a model on publicly available data constitutes transformative use under copyright law. Transformative use is an argument made within fair use, and fair use is an affirmative defense, meaning the defendant bears the burden of demonstrating it applies.

Fair use is not a blanket exemption. It is a four-factor balancing test applied case by case, and courts have not yet applied it to the specific question of ingesting copyrighted text at scale to build a commercial product. The closest analogue, Google Books, which digitized millions of volumes for search indexing, is favorable to defendants but distinguishable: Google Books did not generate new text in the style of the authors it scanned.

Norton Rose Fulbright’s 2026 litigation review notes that no court has yet issued a definitive ruling on whether large-scale AI training constitutes fair use. The silence is not reassuring. It means every company that trained on unlicensed data is carrying an AI training data liability that has not been quantified, provisioned for, or disclosed to investors with any precision.

What the Ongoing Cases Are Actually Testing

The procedural posture of these suits matters more than the headlines suggest. Successive waves of author suits against OpenAI, filed even as earlier cases proceed through discovery, signal that the plaintiff bar is not waiting for a landmark ruling before expanding the litigation surface. Each new filing extends the timeline and adds to the discovery burden, which is itself a form of pressure on defendants regardless of the ultimate merits.

What courts are actually sorting through: whether training constitutes copying in the legally cognizable sense; whether the output of a model constitutes a derivative work; and whether statutory damages — which under U.S. copyright law can reach $150,000 per work for willful infringement — apply to each piece of training data separately. That last question is the one that makes general counsel lose sleep. A model trained on hundreds of thousands of copyrighted works, if statutory damages attach to each, produces a liability exposure that would dwarf the market capitalization of most AI companies not named Google or Microsoft.

The exposure is real. It is also, almost certainly, not what courts will ultimately impose — but the gap between the theoretical maximum and the negotiated reality is where this entire dispute will be resolved, and it will be resolved over years, not quarters.
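To make the per-work arithmetic concrete, here is a minimal sketch of the theoretical range. The dollar bounds are the per-work statutory damages figures in 17 U.S.C. § 504(c); the function name, the `ExposureRange` type, and the 500,000-work count are hypothetical and chosen only for illustration. The sketch assumes every work counted is eligible for statutory damages, which in practice generally depends on timely registration, and it says nothing about what a court would actually award.

```python
from dataclasses import dataclass

# Per-work statutory damages bounds under 17 U.S.C. § 504(c), in dollars.
STATUTORY_MIN = 750
STATUTORY_MAX_ORDINARY = 30_000
STATUTORY_MAX_WILLFUL = 150_000


@dataclass(frozen=True)
class ExposureRange:
    """Theoretical exposure bounds for a given number of eligible works."""
    works: int
    low: int
    high_ordinary: int
    high_willful: int


def statutory_exposure(eligible_works: int) -> ExposureRange:
    """Multiply the per-work bounds by the number of eligible works.

    `eligible_works` is an assumption supplied by the caller: the count of
    works for which statutory damages could attach at all.
    """
    if eligible_works < 0:
        raise ValueError("eligible_works must be non-negative")
    return ExposureRange(
        works=eligible_works,
        low=eligible_works * STATUTORY_MIN,
        high_ordinary=eligible_works * STATUTORY_MAX_ORDINARY,
        high_willful=eligible_works * STATUTORY_MAX_WILLFUL,
    )


if __name__ == "__main__":
    # Hypothetical corpus size, not a figure from any filed case.
    r = statutory_exposure(500_000)
    print(f"low:           ${r.low:,}")
    print(f"high ordinary: ${r.high_ordinary:,}")
    print(f"high willful:  ${r.high_willful:,}")
```

For the hypothetical 500,000 works, the willful ceiling comes to $75 billion and the floor to $375 million. That 200-fold spread is the gap between the theoretical maximum and the negotiated reality.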

The Licensing Market That Is Forming Without the People Who Need It Most

In response to legal pressure, some AI developers have begun negotiating licensing deals with publishers — the Associated Press, select news organizations, academic database operators. These deals are real. They are also narrow, covering institutional content producers who had the leverage and the legal infrastructure to demand payment.

The individual creator is largely absent from this emerging market. The Mishcon de Reya IP tracker documents dozens of cases, but no class action has yet achieved certification on terms that would deliver meaningful compensation to individual authors rather than legal fees. A licensing regime that covers Reuters archives and excludes the freelance essayist whose work shaped the model’s sense of irony is a licensing regime built for incumbents.

This is the second-order victim: not just the creator whose work was used, but the creator whose work was used and who will receive nothing from the licensing negotiations their lawsuits made possible — because they do not have an agent, a publisher, or a trade association with a seat at the table when the deals get structured.

What Builders Must Do Before the Discovery Window Closes

For practitioners and companies still assembling training pipelines, the posture has shifted — it is no longer sufficient to assume that public availability implies legal usability. The question of AI training data liability now attaches at the data sourcing stage, not after a lawsuit is filed.

The practical implications are specific. First: audit existing training sets against registered copyright holders, not just paywalled content. Web-accessible does not mean unprotected. Second: preserve records of what was ingested and when — courts in discovery are asking exactly this question, and companies that cannot answer it are in a worse position than those that can. Third: do not assume that model updates or retraining events reset the clock on liability for the original training corpus. The legal theory has not been tested, and betting on it is a decision that should go above the engineering team.
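The second step above, preserving records of what was ingested and when, can be sketched as an append-only provenance log. This is one possible shape under stated assumptions, not a compliance standard: the function names, the JSON Lines file layout, and fields such as `license_note` are all hypothetical. The content hash is what would let a team later check whether a specific work was in the corpus.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def make_record(
    source_url: str,
    content: bytes,
    license_note: str = "unknown",
    retrieved_at: Optional[datetime] = None,
) -> dict:
    """Describe one ingested document: where it came from, when, and its hash."""
    ts = retrieved_at or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
        "retrieved_at": ts.isoformat(),
        "license_note": license_note,  # hypothetical free-text field
    }


def append_record(log_path: Path, record: dict) -> None:
    """Append one record as a JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def find_by_hash(log_path: Path, sha256_hex: str) -> List[dict]:
    """Return every logged record whose content hash matches."""
    matches = []
    with log_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            if record["sha256"] == sha256_hex:
                matches.append(record)
    return matches
```

A pipeline would call `make_record` and `append_record` at ingestion time, then use `find_by_hash` when asked whether a given text was in the corpus. An exact hash only catches byte-identical copies; matching reformatted or excerpted text would need a fuzzier technique that this sketch does not attempt.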

The companies building compliance infrastructure now are not being conservative — they are being early. The ones doing nothing are not saving time. They are accumulating exposure at the exact moment when plaintiff attorneys are becoming more sophisticated about how to quantify it.

The most important sentence in any briefing on this topic is the one that gets skipped: if you do nothing before a court rules on fair use in the context of commercial AI training, you will be responding to that ruling under time pressure, with a dataset you cannot fully describe, in front of a judge who has already decided the question matters.

FetchLogic Take

By the end of 2026, at least one U.S. court will issue a substantive ruling on whether commercial AI training constitutes fair use — not a dismissal on procedural grounds, not a settlement with a confidentiality clause, but a written opinion that the plaintiff bar and the defense bar will both cite going forward. That ruling will not resolve the question for every model or every dataset. It will, however, make AI training data liability a line item on audited financial statements rather than a footnote disclosure, and it will trigger the first wave of structured licensing negotiations that include individual creator compensation rather than only institutional deals. The companies that built data provenance systems before that ruling lands will close those negotiations faster and cheaper. The ones that did not will fund the next round of litigation instead.

About FetchLogic
FetchLogic is an independent AI news and analysis publication. Our editorial team tracks model releases, funding rounds, policy developments, and enterprise adoption. We cross-reference primary sources including research papers, company filings, and official announcements before publication.