
Investigative Review of OpenAI


Long-Form Investigative Review
Verified Against Public and Audited Records
Reading time: ~35 min
File ID: EHGN-REVIEW-32696

Unauthorized use of copyrighted non-fiction literature and news archives in Large Language Model (LLM) training

If OpenAI cannot prove they purchased legal copies of the thousands of books in their training data, their fair use defense collapses.

Primary Risk: Legal / Regulatory Exposure
Jurisdiction: United States (federal courts)
Report Summary
OpenAI stopped disclosing the specific contents of its training data, offering only vague descriptors like "internet-based books corpora." This sudden opacity was not a competitive strategy; it was a legal necessity. OpenAI contends that if scanning books to create a search index is fair use, then scanning books to teach a machine how to write must also be fair use. In the Bartz v. Anthropic ruling (June 2025), the court drew a sharper line, finding that while training on legally purchased books might be fair use, the use of pirated datasets (like Books3) to create a competing product constituted infringement.
Key Data Points
  • Authors Guild v. Google (2015): the "Google Books" precedent OpenAI cites to frame training as fair use.
  • January 2024: in a written submission to the House of Lords Communications and Digital Committee, OpenAI stated explicitly that it would be "impossible" to train leading AI models without using copyrighted materials.
  • 2024–2025: judicial rulings shifted the legal ground significantly, separating the act of training from the source of the data.
  • June 2025: a more dangerous precedent emerged from the parallel Anthropic litigation, which held that the use of pirated datasets is infringement. This ruling struck at the heart of OpenAI's reliance on the "Books3" corpus, from which OpenAI has since attempted to distance itself.
  • DMCA exposure: the statute allows statutory damages of up to $25,000 per violation, multiplied across billions of documents.

Why it matters:

  • OpenAI's transition to a closed-source entity led to the mysterious "Books2" dataset, sparking concerns about its origin.
  • Independent researcher Shawn Presser's creation of "Books3" shed light on the potential use of shadow libraries in AI training data, revealing copyrighted works within.

The 'Books3' Revelation: Uncovering the Shadow Library in Training Data

The transition of OpenAI from a non-profit research laboratory to a closed-source commercial entity is best illustrated by the sudden obfuscation of its training data. In the early years, the organization published detailed accounts of its inputs. By the release of GPT-3 in 2020, this transparency had evaporated. The technical paper for GPT-3 contained a single, innocuous-looking table listing five datasets. Among them was a dataset labeled simply “Books2.” This dataset, weighted at 55 billion tokens, represented approximately 8 percent of the model’s training data, yet contributed a disproportionately high value to the model’s ability to generate coherent, long-form prose. Unlike “Common Crawl” (web scrapes) or “Wikipedia,” the source of “Books2” was never disclosed.

The existence of “Books2” presents a mathematical impossibility for a legally acquired dataset. To amass 55 billion tokens of high-quality, edited text, one requires approximately 294,000 full-length books. There is no commercially available dataset of this magnitude that licenses copyrighted fiction and non-fiction for AI training. The only repositories containing such a volume of digitized literature are “shadow libraries”: pirate archives like Library Genesis (LibGen), Z-Library, and Bibliotik. These sites operate outside the law and offer millions of copyrighted titles for free download via torrents and direct file hosting. The size of “Books2” aligns almost perfectly with the curated collections found on these pirate trackers.

The nature of this data remained a matter of speculation until the release of “Books3,” a dataset created by independent AI researcher Shawn Presser in 2020. Presser, a co-founder of the open-source collective EleutherAI, sought to democratize the power of Large Language Models (LLMs). He recognized that OpenAI’s advantage lay in its access to a massive, high-quality book corpus that the general public could not access. To level the field, Presser scraped the private torrent tracker Bibliotik, a notorious hub for pirated e-books. He compiled approximately 196,640 books into a file named “books3.tar.gz” and released it as part of “The Pile,” a larger open-source dataset. Presser was explicit about his intent. He stated that “Books3” was designed to replicate the “Books2” dataset used by OpenAI.

Books3 became the Rosetta Stone for investigative journalists and copyright lawyers. It provided a concrete, searchable proxy for the contents of OpenAI’s black box. If “Books2” was indeed a shadow library scrape, as the metrics suggested, then “Books3” revealed exactly what OpenAI had stolen. In 2023, *The Atlantic* conducted a forensic investigation into Books3, led by writer and programmer Alex Reisner. Reisner wrote a script to parse the massive text file and extract International Standard Book Numbers (ISBNs). The results were damning. The dataset did not contain public domain classics or obscure texts. It contained the bedrock of modern non-fiction literature. The investigation identified thousands of copyrighted works by Pulitzer Prize winners, best-selling historians, and investigative journalists. The dataset included detailed histories of the American Civil War, biographies of tech moguls, analyses of geopolitical conflicts, and textbooks on quantum mechanics.

This was particularly significant for the non-fiction sector. While fiction authors like Stephen King and George R. R. Martin garnered headlines for their inclusion, the theft of non-fiction archives represented a more functional transfer of value.
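A back-of-the-envelope check makes the scale argument above concrete. The per-book numbers here are our illustrative assumptions (they do not appear in the GPT-3 paper), chosen to show how 55 billion tokens implies a corpus on the order of 294,000 books:

```python
# Rough reconstruction of the ~294,000-book estimate for "Books2".
# Assumptions (ours, for illustration): a full-length edited book runs
# ~140,000 words, and GPT-style tokenizers average ~0.75 words per token.
tokens_in_books2 = 55_000_000_000               # figure from the GPT-3 paper
tokens_per_book = 140_000 / 0.75                # ≈ 186,667 tokens per book
print(f"{tokens_in_books2 / tokens_per_book:,.0f} books")  # ≈ 294,643
```

Vary the assumptions within reason and the answer stays in the hundreds of thousands of books, which is the point: no licensed corpus of that size exists.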
LLMs rely on these texts not just for style, but for *facts*, reasoning patterns, and the synthesis of complex ideas. By ingesting these books, the models memorized the life’s work of thousands of experts without consent or compensation. The “Books3” revelation dispelled the myth that LLMs learn solely from the “open web.” The open web is messy, fragmented, and frequently unreliable. Books are edited, fact-checked, and structured. They provide the “reasoning” capabilities that users admire in models like GPT-4. The high performance of these models is directly attributable to the density of information found in copyrighted literature. When OpenAI trained its models on “Books2,” it was not merely reading words. It was ingesting the intellectual labor of the publishing industry. The correspondence between the size of “Books2” and the contents of “Books3” suggests that OpenAI likely used a direct dump from LibGen or a similar repository, stripping the copyright management information in the process.

Legal filings in 2024 and 2025 have further substantiated these suspicions. In class-action lawsuits brought by the Authors Guild and major newspapers, discovery documents revealed that OpenAI engineers were aware of the legal risks associated with their data sources. Internal communications suggested a strategy of “deletion and denial.” In mid-2022, as scrutiny over AI copyright infringement intensified, OpenAI allegedly deleted the “Books1” and “Books2” datasets from their active servers. This digital shredding operation appears to have been an attempt to destroy evidence of willful infringement. Yet the models themselves remain. The weights of GPT-3 and its successors act as a compressed archive of the stolen data. The books are no longer in a folder, yet they are encoded in the neural connections of the AI.

The distinction between “Books3” (the open-source proxy) and “Books2” (OpenAI’s proprietary secret) is a semantic shield that OpenAI uses in court. They claim that they never used the specific “Books3” file released by Presser. This is technically true but functionally irrelevant. Both datasets are derived from the same pool of pirated content. Whether the data came from Bibliotik (Books3) or LibGen (Books2), the result is identical. The industry standard for training competitive LLMs is the unauthorized use of shadow libraries. Without this theft, the models would regress significantly in their ability to understand nuance, context, and long-form narrative structures.

For non-fiction authors, the harm is acute. A historian who spends a decade researching a biography relies on the sale of that book to fund their work. When an LLM ingests that biography, it can answer detailed questions about the subject, summarize the author’s unique arguments, and even mimic the author’s prose style. The AI becomes a substitute for the book itself. The “Books3” investigation showed that this is not a theoretical risk. It is the operational reality of the current AI economy. The dataset contained everything from Robert Caro’s *The Power Broker* to recent bestsellers on climate change. These works were not scraped from a public website. They were downloaded from illegal servers, processed into plain text, and fed into the machine.

The “Shadow Library” is no longer a niche resource for academics in developing nations or broke college students. It has become the foundational infrastructure of a trillion-dollar industry. OpenAI’s valuation rests, in part, on the unauthorized acquisition of this intellectual property.
The company’s refusal to disclose the titles in “Books2” is an admission of guilt by omission. If the data were legal, there would be no reason to hide it. The secrecy confirms that the “Books2” dataset is, in effect, a laundered version of the same pirate archives exposed by the Books3 investigation.

The implications extend beyond copyright law into the integrity of the information ecosystem. By training on shadow libraries, OpenAI has ingested a specific slice of the literary world. These libraries are curated by pirates and enthusiasts. They skew towards popular, academic, and Western texts. This introduces a hidden bias into the models: a bias born not of algorithmic design but of the specific tastes of the torrent community. The “Books3” dataset, for instance, is heavy on science fiction and computer programming manuals, and also contains a vast array of radical political literature and conspiracy theories frequently found on fringe trackers. If “Books2” shares this lineage, as the metrics imply, then the “worldview” of GPT models is shaped by the upload habits of anonymous data hoarders.

The Atlantic’s searchable database of Books3 allowed authors to see their own names in the training data. This tangible proof galvanized the legal resistance against OpenAI. It moved the conversation from abstract debates about “fair use” to concrete evidence of theft. When an author sees their specific ISBN in a training set, the argument that the AI “just learns like a human” collapses. Humans buy books or borrow them from libraries that pay for licenses. OpenAI did neither. It took the entire library, scanned it, and sold access to the resulting intelligence.

As the legal battles proceed, the “Books3” dataset stands as the smoking gun. It is the artifact that stripped away the veneer of “high-tech magic” to reveal the raw material of the AI revolution: millions of stolen books. The sophistication of GPT-4 is not solely the result of brilliant engineering. It is the result of the largest act of copyright infringement in history. The “Books2” mystery is solved. We know what is in the box because we have seen its twin. The shadow library is the engine of the modern AI era, and “Books3” handed the world the blueprints to prove it.


The New York Times v. OpenAI: Allegations of Mass Copyright Infringement

The legal war between The New York Times and OpenAI began on December 27, 2023. This filing in the U.S. District Court for the Southern District of New York marked the moment the legacy press stopped negotiating and started shooting. The Times became the first major American media organization to sue the makers of ChatGPT over copyright. They alleged that OpenAI had built a valuation exceeding $80 billion by stealing the collective work of journalists. The complaint described a business model based on “mass copyright infringement.” It argued that OpenAI and Microsoft sought to “free-ride” on the Times’s massive investment in journalism to build products that substitute for the newspaper itself.

The Times did not merely allege that OpenAI learned from their articles. They claimed the AI company swallowed the archives whole. The lawsuit alleged that millions of articles were used to train the Large Language Models (LLMs) without permission or payment. The core of the argument focused on the economic threat. ChatGPT does not just recommend a Times article. It provides the answer contained within the article. This removes the need for the user to visit the publisher’s site. It severs the relationship between the reader and the reporting. The Times argued this substitution destroys their ability to fund the expensive and dangerous work of on-the-ground journalism.

Exhibit J of the complaint provided the smoking gun. This 127-page document contained one hundred examples of “memorization.” The Times showed that GPT-4 could regurgitate near-verbatim excerpts of paywalled articles when prompted with specific snippets. In one instance, the model reproduced the text of a Pulitzer Prize-winning investigation into predatory lending in the New York taxi industry. It generated the text paragraph by paragraph. The output included the same facts and the same sentence structures. It mimicked the unique expressive choices of the original authors. This evidence attacked OpenAI’s defense that their models only learn abstract concepts. It showed the models retained and distributed the exact expression of copyrighted works.

OpenAI responded with a public relations and legal counter-offensive. They characterized the lawsuit as “without merit” and accused the Times of “hacking” their model. OpenAI stated that “regurgitation” is a rare bug that they are working to eliminate. They claimed the Times manipulated the prompts to force the model to violate its own guardrails. This defense relied on the idea that the model is a tool and the user is responsible for how it is used. They argued that normal users do not use ChatGPT to reconstruct old news articles. OpenAI maintained that training AI models on publicly available internet data constitutes “fair use” under U.S. copyright law. They compared it to a student reading a book to learn how to write.

The legal battle turned ugly in late 2024 during the discovery phase. The Times demanded access to OpenAI’s training data to prove their works were ingested. OpenAI provided virtual machines for the Times’s experts to search the datasets. Then a significant “error” occurred. On November 14, 2024, OpenAI engineers erased all the search data stored on one of these dedicated virtual machines. The deletion wiped out 150 hours of work by the Times’s experts. OpenAI attorneys attributed the incident to a “system misconfiguration” and successfully recovered most of the data. Yet the folder structure and file names were irretrievably lost. This forced the plaintiffs to restart their analysis from scratch.
The incident raised serious questions about the competence of the data management at the world’s leading AI company. It also fueled suspicions regarding the opacity of the “black box” training sets.

The courtroom shifted again in March 2025. U.S. District Judge Sidney Stein issued a pivotal ruling that trimmed the lawsuit but left its heart beating. Judge Stein dismissed the Times’s claims regarding the Digital Millennium Copyright Act (DMCA). The Times had argued that OpenAI illegally removed copyright management information such as bylines and metadata. The judge found this argument unpersuasive without proof that OpenAI intended to conceal infringement by stripping the data. He also dismissed the “unfair competition” claim based on misappropriation. Yet Judge Stein allowed the central copyright claims to proceed. He rejected OpenAI’s motion to dismiss the allegations of direct and contributory copyright infringement. The court found that the Times had plausibly alleged that the copying occurred at the point of input. The act of scraping the data to build the model could constitute infringement regardless of what the model outputted later.

This ruling validated the Times’s strategic pivot. By mid-2024, the Times had signaled they might not even rely on Exhibit J at trial. They moved the battlefield from the “output” (what the user sees) to the “input” (what the machine ate). This made the “hacking” defense irrelevant. If the scraping itself was illegal, it did not matter how hard the Times had to try to get the text back out.

The discovery process expanded into a privacy battle in mid-2025. The Times sought to prove that ChatGPT was a direct substitute for their product. They demanded OpenAI turn over millions of user chat logs. They wanted to find evidence of users asking for news and receiving Times content. OpenAI fought back. They argued this would violate the privacy of millions of users who had not consented to have their conversations read by lawyers. Judge Ona T. Wang initially ordered OpenAI to preserve these logs. This created a massive data retention burden. By October 2025, the court modified the order. OpenAI was no longer required to retain all consumer data indefinitely but had to secure specific historical data from the April-to-September 2025 window.

The stakes of this litigation remain existential for both industries. If the Times wins on the core copyright claim, the remedy could be the destruction of the models. A court could order OpenAI to delete any model trained on the stolen data. This is known in legal terms as the “fruit of the poisonous tree.” Such a ruling would force OpenAI to retrain GPT-4 and its successors from scratch using only licensed or public domain data. The cost would be astronomical. The performance degradation could be severe. Conversely, if OpenAI wins, it cements a legal precedent that human creativity is raw material for machine learning. It would mean that anything published on the open web is free for the taking.

The conflict exposes the fundamental incompatibility between the subscription news business and the generative AI business. The Times sells trust and verified information. OpenAI sells a synthesis of the internet that detaches information from its source. The lawsuit documents show that OpenAI approached the Times for a licensing deal prior to the suit. The talks broke down because the numbers were too far apart. The Times viewed their archives as a century of high-value assets.
OpenAI viewed them as just another dataset in a training run of trillions of tokens. As of early 2026, the case continues to grind through the federal court system. The “accidental” deletion of evidence and the fierce battles over chat logs show that neither side is prepared to yield. The initial shock of the Exhibit J “memorization” has faded. It has been replaced by a technical and tedious war over the definition of “copying” in the age of neural networks. The outcome will decide who owns the past and who gets to sell the future of information. The Times is fighting for the right to be paid for the work they have already done. OpenAI is fighting for the right to use that work to build something that might eventually replace the workers who did it.


Systematic Theft or Fair Use? Analyzing OpenAI's Legal Defense Strategy

The ‘Fair Use’ Defense: Redefining Copyright for the Age of Algorithms

OpenAI has constructed a legal defense strategy that relies almost entirely on a radical expansion of the “fair use” doctrine. The company asserts that training a Large Language Model is functionally identical to a human student reading a textbook in a library. In their view, the model does not “copy” the expressive content of a book or a news article. Instead, it analyzes statistical relationships between words to learn the underlying patterns of language. This argument seeks to categorize the ingestion of billions of copyrighted works not as reproduction but as “intermediate copying” for a “transformative” purpose. Legal teams for the AI giant frequently cite the precedent set in Authors Guild v. Google (2015). That ruling allowed Google Books to scan millions of volumes to create a searchable database. OpenAI contends that if scanning books to create a search index is fair use, then scanning books to teach a machine how to write must also be fair use.

This defense hinges on the concept of “non-expressive use.” OpenAI argues that their models do not retain the artistic expression of the original authors. They claim the software extracts factual data and stylistic abstractions. When a model processes a copyrighted history book, it is not “memorizing” the text to reprint it. It is learning how historians structure sentences and how dates correlate with events. This distinction is the bedrock of their motion to dismiss in cases like The New York Times v. OpenAI. The company posits that copyright law protects the specific arrangement of words, not the facts or the functional patterns of language contained within them. By framing the training process as a statistical analysis rather than a literary reproduction, OpenAI attempts to bypass the need for licensing altogether.

The ‘Impossible’ Admission: A Confession to the House of Lords

The confidence of this fair use defense was severely tested by OpenAI’s own admissions to the UK Parliament. In a written submission to the House of Lords Communications and Digital Committee in January 2024 the company stated explicitly that it would be “impossible” to train leading AI models without using copyrighted materials. This declaration stripped away any pretense that the company could rely solely on public domain works or licensed data. OpenAI argued that because copyright covers virtually every form of modern human expression, from blog posts to government reports, an AI trained only on out-of-copyright books would be archaic and dysfunctional. This submission was intended to lobby for a broad copyright exception in the UK. Yet it served as a damning confirmation for plaintiffs in the United States. It was an admission that the commercial viability of their product depended entirely on the unauthorized use of protected intellectual property.

Critics and legal scholars seized on this statement as evidence of “unjust enrichment.” The “impossible” defense argues that because the theft is necessary for the product to exist, the theft must be legal. This logic inverts the traditional principles of market competition. Normally, if a business model requires a resource that is too expensive to acquire legally, the business model is considered unviable. OpenAI argued the opposite. They claimed that the “societal benefit” of their technology justified the mass appropriation of private property. This utilitarian argument attempts to shift the legal focus from the rights of the creator to the potential utility of the machine. It suggests that the progress of artificial intelligence is a public good that supersedes the “monopoly” rights of individual authors.

The Piracy Pivot: Distinguishing ‘Legal’ Access from ‘Shadow’ Libraries

The legal ground shifted significantly following judicial rulings in 2024 and 2025. Courts began to distinguish between the act of training and the source of the data. In the consolidated cases of Tremblay v. OpenAI and Silverman v. OpenAI the defense successfully argued that the mere existence of a model does not prove it is a “derivative work” of the books it read. Judge Araceli Martínez-Olguín dismissed several claims including vicarious infringement and negligence. She ruled that plaintiffs had to prove that specific outputs were substantially similar to their books. This was a tactical victory for OpenAI. It forced authors to find “smoking gun” examples where ChatGPT regurgitated their text verbatim. Such examples are rare due to the probabilistic nature of the model.

Yet a more dangerous precedent emerged from the parallel Anthropic litigation in June 2025. Judge William Alsup ruled that while using purchased books for training might be “exceedingly transformative” and thus fair use, the use of pirated datasets constitutes a different category of violation. This ruling struck at the heart of OpenAI’s reliance on the “Books3” corpus. If OpenAI cannot prove they purchased legal copies of the thousands of books in their training data, their fair use defense collapses. The “transformative” nature of the processing does not cure the initial act of acquiring stolen goods. OpenAI has since attempted to distance itself from the “Books3” dataset. They emphasize their partnerships with publishers. Yet the presence of pirated libraries in their earlier training runs remains a toxic liability that no amount of current licensing can retroactively sanitize.

The DMCA Technicality: Evading the ‘Removal of Rights’ Charge

A serious component of the authors’ lawsuits involved the Digital Millennium Copyright Act (DMCA). Plaintiffs alleged that OpenAI violated Section 1202 by removing “Copyright Management Information” (CMI) such as titles, author names, and ISBNs during the training process. The argument was that by stripping this identifying data, OpenAI facilitated copyright infringement and concealed the origin of the text. This claim posed a serious threat because the DMCA allows for statutory damages of up to $25,000 per violation. With billions of documents involved, the potential liability was astronomical.

OpenAI’s legal team dismantled this argument by focusing on the requirement of “intent.” They argued that the training process scrapes text automatically and that any removal of CMI was an incidental side effect of data cleaning rather than a malicious attempt to conceal infringement. The courts largely agreed with this interpretation in the early phases of the Tremblay litigation. The judge ruled that the plaintiffs failed to show that OpenAI knowingly removed the CMI to induce infringement. This technical victory allowed OpenAI to avoid the catastrophic damages associated with the DMCA. It narrowed the scope of the battle to the core copyright question. The company successfully framed the removal of author names not as a cover-up but as a necessary technical step in preparing data for tokenization. This defense relies on the complexity of the “black box” to obscure the intent behind the data processing.
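To make the “incidental side effect” claim concrete, here is a minimal sketch of the kind of boilerplate-stripping pass common in training-data pipelines. The patterns and helper are our illustrative assumptions, not OpenAI’s actual code; the point is that bylines and copyright notices (the CMI at issue) match the same filters as navigation cruft:

```python
import re

# Illustrative boilerplate filters. Note that CMI (bylines, copyright
# notices) is caught by the same rules that remove page furniture.
BOILERPLATE_PATTERNS = [
    re.compile(r"^By [A-Z][\w'. -]+$"),         # bylines
    re.compile(r"^(©|\(c\) |Copyright )"),      # copyright notices
    re.compile(r"All [Rr]ights [Rr]eserved"),   # rights statements
    re.compile(r"^(Subscribe|Sign in|Share)"),  # page furniture
]

def clean_for_tokenization(text: str) -> str:
    """Drop any line matching a boilerplate pattern before tokenization."""
    kept = [line for line in text.splitlines()
            if not any(p.search(line.strip()) for p in BOILERPLATE_PATTERNS)]
    return "\n".join(kept)
```

Whether such stripping is “incidental” or purposeful is exactly the intent question the courts weighed.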

The Licensing Paradox: Paying for What You Claim is Free

The most glaring contradiction in OpenAI’s defense strategy is its aggressive pursuit of licensing deals. While asserting in court that they have a fair use right to train on all publicly available text, the company has simultaneously signed multi-million dollar agreements with entities like The Associated Press, Axel Springer, and News Corp. If training is fair use, then these payments are unnecessary. OpenAI characterizes these deals as “partnerships” for real-time access and attribution rather than as copyright licenses for training data. They claim they are paying for the “freshness” of the news feed and the right to display snippets in search results. This distinction allows them to maintain their fair use stance in court while buying peace with the most powerful media conglomerates.

Legal analysts view this as a “risk reduction” strategy. By paying off the largest potential litigants, OpenAI isolates the individual authors and smaller publishers who lack the resources to sustain a protracted legal war. The deals also serve as a hedge against a potential loss in the NYT case. If the courts eventually rule that training requires a license, OpenAI can claim they are already a “responsible actor” that compensates rights holders. This dual-track strategy creates a two-tiered system. Large corporations get paid while individual writers are told their work is “fair use” fodder for the machine. The company uses its vast capital to create a private licensing regime that undermines the very legal principle they defend in the courtroom. They pay when they must and take when they can.

Burden Shifting: The ‘Opt-Out’ Defense

OpenAI has further fortified its position by introducing “opt-out” mechanisms like the GPTBot user agent. They argue that because they offer a way for webmasters to block their crawler, any site that does not block them has implicitly consented to be scraped. This argument attempts to shift the burden of copyright enforcement from the user to the owner. It ignores the fact that the vast majority of the training data was collected years before these opt-out tools existed. The “Books3” dataset and the Common Crawl archives were ingested long before any author had the option to say no. OpenAI treats this retroactive consent as a valid legal shield. They assert that their current “good faith” efforts to respect robots.txt should mitigate any liability for past actions. This defense relies on the sheer inertia of the internet. It assumes that silence equals permission and that the default state of all digital content is to be available for AI training unless explicitly marked otherwise.
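For reference, the opt-out OpenAI points to is a two-line robots.txt directive. Per OpenAI’s published crawler documentation, blocking GPTBot looks like this:

```
User-agent: GPTBot
Disallow: /
```

As the paragraph above notes, this governs only future crawls; it does nothing for text that was ingested before the tool existed.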


The Authors Guild Class Action: Fiction and Non-Fiction Writers Unite

The Authors Guild class action represents the most significant organized legal challenge to generative AI in history. Filed in the Southern District of New York on September 19, 2023, the complaint brought together a coalition of literary giants who alleged their life’s work had been ingested without consent to fuel a commercial product capable of replacing them. The lead plaintiffs included household names such as George R. R. Martin, John Grisham, Jodi Picoult, Jonathan Franzen, and Elin Hilderbrand. Their participation signaled that this was not a dispute over royalties but an existential defense of human creativity against automated mimicry.

The initial filing (Case 1:23-cv-08292) struck at the core of OpenAI’s technical narrative. While OpenAI executives frequently described their models as learning abstract patterns of language, the authors contended the models were simply high-tech plagiarism machines. The complaint detailed how ChatGPT could generate accurate summaries of copyrighted works and even produce unauthorized sequels. In one example, the model generated a detailed outline for a sequel to *While the Patient Slept* titled *Shadows Over Federie House*, using the same characters and setting. This specific output demonstrated that the model retained the expressive content of the books rather than just the statistical probability of words.

The legal battle expanded significantly in November 2023 when non-fiction author Julian Sancton filed a parallel class action. Sancton, author of *Madhouse at the End of the Earth*, was the first to name Microsoft as a co-defendant alongside OpenAI. His suit argued that non-fiction writers face a unique threat. These authors spend years conducting research and verifying facts only to have AI models ingest that labor and regurgitate it without attribution. The amended complaint in December 2023 added Pulitzer Prize winners like Kai Bird and Stacy Schiff to the roster. This consolidation of fiction and non-fiction writers under the umbrella of *In re OpenAI, Inc. Copyright Infringement Litigation* (MDL No. 3143) created a unified front representing tens of thousands of professional writers.

The plaintiffs’ legal strategy focused on the concept of “systematic theft.” They argued that OpenAI could not have built its models without the use of “shadow libraries” like Library Genesis (LibGen) or the Books3 dataset. The complaint noted that ChatGPT could provide detailed quotes from books that were never made available freely on the open internet. When Sancton’s legal team queried ChatGPT about his book, the model initially confirmed it was part of the training data. OpenAI later updated the system to refuse such questions. This obfuscation became a central point of contention during discovery.

Judge Sidney Stein presided over the consolidated cases and issued a pivotal ruling on October 27, 2025. OpenAI had moved to dismiss the claims by arguing that the authors failed to show “substantial similarity” between their books and the AI’s outputs. Judge Stein denied this motion. He found that the generated summaries and sequels were sufficiently similar to the original works to warrant a trial. The ruling was a massive blow to OpenAI’s defense. It established that plaintiffs could proceed with claims based on the model’s outputs and not just the initial act of copying the books for training. The inclusion of Microsoft in the non-fiction suit added a layer of corporate liability that complicated the defense.
Sancton’s lawyers argued that Microsoft was not a passive investor but an active participant in the infringement. They pointed to Microsoft’s specialized supercomputing clusters designed specifically to process these massive datasets. The plaintiffs contended that Microsoft knew or should have known that the training data included pirated copyrighted materials. This allegation forced Microsoft to defend its own internal compliance regarding data sourcing.

Discovery proceedings throughout late 2025 revealed the scale of the ingestion. Court orders compelled OpenAI to produce millions of chat logs to determine how frequently users prompted the model to reproduce copyrighted text. The data suggested that users frequently treated ChatGPT as a free alternative to purchasing books. Students used it to bypass reading assignments while fans used it to generate new stories in the worlds of their favorite authors. The Authors Guild argued this constituted direct market substitution.

The defense relied heavily on the doctrine of fair use. OpenAI maintained that its use of the books was “transformative” because it created a new tool for generating text rather than simply republishing the books. They compared their process to a student reading a library book to learn how to write. The authors rejected this analogy. They argued that a student does not charge a subscription fee to regurgitate the book’s contents on demand. The “fair use” defense faces a steep climb given the commercial nature of OpenAI’s products and the direct competition they pose to the original works.

The unification of fiction and non-fiction writers also highlighted the different ways AI harms different genres. For fiction writers like Martin and Grisham, the harm is the theft of their creative expression and the potential for AI to flood the market with derivative slop. For non-fiction writers like Sancton and Bird, the harm is the devaluation of investigative labor. An AI can summarize a ten-year historical investigation in seconds without citing the author who did the work. This capability threatens the economic viability of long-form journalism and historical non-fiction.

By early 2026, the case had moved into the heart of the discovery phase. The court’s refusal to dismiss the output-based claims meant that OpenAI would have to explain exactly how its models memorized specific plot points and character arcs. The “black box” nature of the LLM was no longer a shield. The plaintiffs demanded access to the specific training weights and dataset manifests. This legal pressure forced OpenAI to confront the reality that its “magic” was built on the unauthorized use of human intellectual property.

The Authors Guild action stands as the primary barrier between creative professionals and total displacement. The outcome of this litigation will determine whether copyright law can adapt to the age of AI or be rendered obsolete. If the authors prevail, OpenAI could owe billions in statutory damages. A loss for the authors would signal the end of the professional writer as a viable career path. The unification of these diverse writers shows the severity of the threat. They are not fighting for a larger slice of the pie. They are fighting to keep the kitchen from being stolen.


Digital 'Regurgitation': Evidence of Verbatim Text Reproduction in ChatGPT


The sanitized term is “memorization.” In the sterile corridors of machine learning research, it refers to a model’s tendency to encode specific training data so perfectly that it can be recalled sequence-for-sequence. In the courtroom, however, this phenomenon is known by a far more damaging name: digital regurgitation. This is not a technical quirk; it is the smoking gun that strips away the veneer of “learning” to reveal what critics contend is little more than a high-tech photocopier operating at industrial scale.

For years, OpenAI maintained that its models did not “copy” text but rather “learned concepts” in the same way a human student might study a library. They argued that ChatGPT synthesized information, creating sentences based on statistical probabilities. That defense crumbled visibly in December 2023, when The New York Times filed a lawsuit that included **Exhibit J**, a document that may go down in legal history as the most devastating proof of copyright infringement ever assembled against an AI company. Exhibit J did not contain abstract arguments. It contained one hundred specific examples where GPT-4, when prompted with the first few paragraphs of a *Times* article, proceeded to output the remainder of the text verbatim. In one instance, the model reproduced a Pulitzer Prize-winning investigation into the taxi industry, “The ‘New’ Yellow Cab,” with near-perfect fidelity. It did not summarize; it did not paraphrase. It recited the copyrighted text word-for-word, including specific quotes, data points, and stylistic flourishes unique to the original authors. This was not “learning concepts”; this was unauthorized republication.

The implications of Exhibit J extend far beyond a single newspaper. It demonstrated that the “black box” of the Large Language Model (LLM) is not as opaque as claimed. The data is not dissolved into a nebulous soup of weights and biases; rather, it sits intact, ready to be extracted by any user who knows the right prompt.

Independent researchers have corroborated these findings with rigorous technical audits. A landmark study by **Nicholas Carlini** and researchers from Google DeepMind, the University of Washington, and Cornell University shattered the assumption that training data is private or unrecoverable. By employing a “divergence attack” (prompting the model to repeat a specific word, like “poem,” forever), the researchers caused ChatGPT to glitch and vomit up raw training data. The model stopped generating coherent text and began outputting massive chunks of memorized information: personally identifiable information (PII), snippets of code, and entire passages from copyrighted literature. This phenomenon is particularly acute with “popular” texts that appear frequently in the training dataset. While OpenAI has attempted to patch these leaks with safety filters, the underlying reality remains: the model *knows* the text.

In 2024, researchers demonstrated that **Llama 3.1 70B** (a similar class of model) had memorized entire books, including *Harry Potter and the Sorcerer’s Stone* and George Orwell’s *1984*, almost in their entirety. While these are fiction examples, the method applies equally to non-fiction. The **Books3** dataset, a controversial component of LLM training sets, contains thousands of non-fiction titles (biographies, histories, and technical manuals) that are subject to the same memorization mechanics. For non-fiction authors, the threat of regurgitation manifests differently but no less destructively.
A user may not ask ChatGPT to “write the chapter” of a history book, but they frequently ask for “detailed summaries.” Reports from the **Authors Guild** and independent tests show that ChatGPT can generate chapter-by-chapter breakdowns of non-fiction books so detailed that they serve as a market substitute for the original work. If a user can obtain the core arguments, data, and narrative arc of a new business book or historical biography without purchasing it, the economic damage is identical to piracy, even if the output is not a 100% verbatim copy.

OpenAI’s response to these allegations has been a mixture of technical minimization and legal maneuvering. In public statements, they characterize verbatim regurgitation as a “rare bug” that affects only a tiny fraction of queries. They accuse researchers and the *New York Times* of “prompt engineering,” essentially hacking the model to force it to misbehave. They argue that normal users do not spend their time trying to extract copyrighted text. Yet this defense ignores the reality of “Retrieval Augmented Generation” (RAG) and the way users actually interact with these tools. Users *do* want specific information. When a user asks for a recipe, a coding solution, or a news summary, they are frequently unknowingly requesting copyrighted material that the model provides without attribution or compensation. The “glitch” is not the regurgitation; the glitch is the model’s occasional failure to hide the theft.

Moreover, the “paywall bypass” capability of these models has alarmed the news industry. Investigations by **Press Gazette** and **INMA** in 2024 and 2025 revealed that ChatGPT could reconstruct the substance of paywalled articles from *The Atlantic*, *The Financial Times*, and *The New York Times*. By scraping “public” fragments (social media posts, Reddit discussions, and syndicated snippets), the model could assemble a “Frankenstein” version of the article that broke the paywall. OpenAI insists this is “fair use” of public data; publishers counter that it is a sophisticated form of fencing stolen goods.

The evidence of digital regurgitation neutralizes the “fair use” argument that relies on the “transformative” nature of AI. If a machine outputs the exact text of a copyrighted work, it has not transformed anything. It has transported it from a protected server to a public chat window. As the legal battles intensify, Exhibit J and the Carlini studies stand as the twin pillars of the prosecution: proof that beneath the hype of “artificial intelligence” lies a vast, unauthorized archive of human labor, waiting to be recalled.
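To ground the term “memorization” in something testable, here is a minimal sketch of the prefix-continuation probe these audits describe. The `generate` parameter stands in for any LLM completion call; it is our assumption, not a specific vendor API:

```python
from difflib import SequenceMatcher
from typing import Callable

def verbatim_overlap(original: str, prefix_chars: int,
                     generate: Callable[[str], str]) -> float:
    """Prompt with the opening of a text, then measure the longest run of
    the model's continuation that matches the held-out remainder verbatim.
    Values near 1.0 indicate Exhibit J-style regurgitation."""
    prefix, held_out = original[:prefix_chars], original[prefix_chars:]
    completion = generate(prefix)
    matcher = SequenceMatcher(None, held_out, completion)
    longest = matcher.find_longest_match(0, len(held_out), 0, len(completion))
    return longest.size / max(len(held_out), 1)

# Toy demonstration with a "model" that has perfectly memorized its input:
article = "In 2019, the series documented how lenders inflated medallion prices."
memorizing_model = lambda prompt: article[len(prompt):]   # perfect recall
print(verbatim_overlap(article, 20, memorizing_model))    # -> 1.0
```

A model that had genuinely abstracted away the expression would score near zero on held-out text; the litigation evidence shows scores far from zero on specific articles.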


Bypassing Paywalls: The Unauthorized Ingestion of Premium News Archives


The economic survival of modern journalism relies on a simple contract. Readers pay for access to high-quality reporting, and publishers use those funds to sustain newsrooms. OpenAI shattered this model. The company did not merely scrape the open web. It systematically penetrated digital barriers designed to protect intellectual property. The training datasets for GPT-3 and GPT-4 contained millions of articles from subscription-based publications. These archives were ingested without permission. They were processed without payment. The resulting models could reproduce premium content verbatim. This theft was not an accident of web crawling. It was a foundational feature of the data acquisition strategy.

The primary vehicle for this unauthorized access was Common Crawl. This non-profit organization maintains a massive repository of web data. OpenAI used this repository as the bedrock for its training sets. Common Crawl’s bots frequently ignore the technical nuances of paywalls. They capture the text of an article before the subscription overlay triggers. They scrape cached versions of pages that are meant to be restricted. OpenAI did not filter this stolen contraband. They fed it directly into their neural networks. The result was a machine that had read the Wall Street Journal and the Financial Times without ever buying a subscription. The company treated the world’s most expensive journalism as free raw material.
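For readers who want to verify the mechanism, Common Crawl exposes a public CDX index that shows exactly which URL captures exist. The sketch below is a hedged illustration: the crawl ID is one example, and the exact query parameters should be checked against the index documentation rather than taken as definitive:

```python
import json
import urllib.parse
import urllib.request

# Query Common Crawl's public CDX index for captures of a URL prefix.
# "CC-MAIN-2023-50" is an example crawl ID; others follow the same pattern.
def cc_captures(url_pattern: str, index: str = "CC-MAIN-2023-50") -> list[dict]:
    api = (f"https://index.commoncrawl.org/{index}-index"
           f"?url={urllib.parse.quote(url_pattern, safe='')}&output=json")
    with urllib.request.urlopen(api) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

# Each record points at a WARC offset holding the page text exactly as the
# crawler saw it, which is frequently before any paywall overlay fired.
for record in cc_captures("example.com/*")[:3]:
    print(record.get("timestamp"), record.get("url"))
```

The significance for this section is that the archived WARC text, not the paywalled live page, is what flows into training corpora built on Common Crawl.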

The New York Times provided the most damning evidence of this practice in its December 2023 lawsuit. The complaint detailed how GPT-4 could recite large portions of the paper’s content. The model did not just summarize facts. It regurgitated entire paragraphs of Pulitzer Prize-winning investigative work. The Times cited the 2012 multimedia feature “Snow Fall” as a prime example. The article sits behind a strict paywall. Yet the model could reproduce its opening passages word-for-word. This memorization proved that the text resided deep within the model’s parameters. The machine had not just learned from the article. It had cloned it.

The legal battle expanded in 2024 with a lawsuit from Alden Global Capital. The investment firm owns eight major newspapers including the Chicago Tribune and the New York Daily News. Their complaint alleged that OpenAI and Microsoft purloined millions of copyrighted articles. The suit argued that the tech giants siphoned off the revenue of local news organizations. They did this while simultaneously degrading the brand of the newspapers by attributing hallucinations to them. The Alden lawsuit highlighted a specific technical reality. The models frequently displayed full text from articles that human readers could not access without paying. This capability turned ChatGPT into a bootleg reading service. It allowed users to bypass the subscription model entirely.

A more insidious aspect of this theft involves the removal of Copyright Management Information or CMI. This is the digital fingerprint of a piece of writing. It includes the author’s name and the publication title. It also includes the copyright notice. Raw Story and The Intercept filed lawsuits in February 2024 focusing on this violation. They alleged that OpenAI stripped this metadata during the training process. The removal served a specific purpose. It concealed the origin of the text. It made the output appear as generic knowledge rather than the product of specific labor. The Digital Millennium Copyright Act strictly prohibits the removal of CMI. OpenAI’s defense relied on the claim that the removal was not intentional. The plaintiffs argued that it was a necessary step to sanitize the stolen goods.

OpenAI attempted to frame these violations as technical glitches. The company temporarily disabled the “Browse with Bing” feature in July 2023 after users discovered it could bypass paywalls. This feature allowed the chatbot to search the live web. Users quickly realized they could ask the bot to print the text of a locked article. The bot would comply. OpenAI called this an “unwanted” behavior. This excuse ignored the deeper reality. The live browsing tool was accessing the same unauthorized pathways that the training crawlers had used for years. The “glitch” was not the access itself. The glitch was that the public could see it happening in real time.

The subsequent behavior of OpenAI confirms the value of the stolen data. The company began signing licensing deals with major publishers in mid-2024. They struck an agreement with News Corp to access content from the Wall Street Journal. They made similar deals with Axel Springer and the Financial Times. These agreements were worth millions of dollars. They served as a tacit admission. If the data was fair use and free for the taking then there would be no need to pay for it. The checkbook opened only after the lawsuits began. The payments were not for future access alone. They were “retroactive” in nature. They were hush money designed to legalize the theft that had already occurred.

Evidence surfaced in 2025 that the practice continued even as the lawsuits proceeded. A study by the AI Disclosures Project revealed that the model GPT-4o showed strong recognition of non-public books from O’Reilly Media. These technical manuals and guides are sold for profit. They are not free blog posts. The study used “membership inference attacks” to prove the model had seen the text. The results showed that the model knew the contents of paywalled books better than it knew public domain texts. This suggested that the company prioritized high-value proprietary data. They sought out the most expensive information because it was the most reliable. They ingested it regardless of the copyright status.
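For readers unfamiliar with the technique, here is a minimal sketch of a membership-inference probe in the recognition-test style the study describes. The helper names and scoring are our illustrative assumptions, not the study’s actual code:

```python
import random
from typing import Callable, Sequence

def recognition_rate(items: Sequence[tuple[str, list[str]]],
                     pick: Callable[[list[str]], int]) -> float:
    """For each (verbatim passage, paraphrases) pair, shuffle the verbatim
    passage in among its paraphrases and ask the model to pick the one that
    appeared in the book. Accuracy well above chance on paywalled books,
    relative to public-domain controls, is evidence of training exposure."""
    hits = 0
    for verbatim, paraphrases in items:
        options = paraphrases + [verbatim]
        random.shuffle(options)
        hits += options[pick(options)] == verbatim
    return hits / len(items)

# A model guessing at random scores ~1/len(options); a model that has
# memorized the book scores far higher, which is the study's signal.
random_guesser = lambda options: random.randrange(len(options))
```

The comparison against public-domain controls is the key design choice: it separates genuine memorization from the model simply being good at spotting fluent prose.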

The defense strategy employed by OpenAI rests on the concept of “fair use.” They argue that training a model is a transformative act. They claim that the machine reads the text to learn the patterns of language. It does not read to consume the information. This argument collapses when the model outputs the text verbatim. A machine that memorizes a paywalled article and serves it to a user is not transforming anything. It is redistributing stolen property. The courts have historically protected the “hot news” doctrine. This legal principle prevents competitors from free-riding on the time-sensitive reporting of others. OpenAI built a business model that is the free rider. It capitalized on the investigative budgets of the New York Times and the Chicago Tribune. It sold the resulting intelligence for twenty dollars a month.

The impact on the news industry is quantifiable. The New York Times reported a direct correlation between the rise of LLMs and the decline in referral traffic. Users no longer needed to click through to the source. The chatbot provided the answer. The chatbot provided the analysis. The chatbot frequently provided the article itself. This severed the link between the reader and the publisher. It destroyed the advertising funnel. It devalued the subscription proposition. The unauthorized ingestion of news archives was not just a copyright violation. It was a market intervention. It transferred the value of the fourth estate to the balance sheet of a Silicon Valley startup. The journalism was expensive to produce. The theft was free.

The 'Transformative' Myth: Debunking the Core of AI Copyright Defense

The ‘Learning’ Fallacy: Anthropomorphism as Legal Shield

OpenAI’s primary defense against copyright liability rests on a seductive but legally hollow metaphor: that a Large Language Model (LLM) is akin to a student reading a book in a library. In this narrative, the AI “learns” concepts, facts, and styles just as a human would, and therefore its ingestion of copyrighted material constitutes “fair use.” This anthropomorphic framing is a calculated distraction. A human student does not ingest 500,000 books in a week, compress them into a probabilistic matrix, and then sell access to a service that can reproduce the specific expression of those books on command. The “learning” argument attempts to mask the mechanical reality: LLMs are industrial-scale copying engines that rely on the unauthorized reproduction of expression to function.

The technical reality of an LLM contradicts the “transformative” defense. When OpenAI trains a model, it does not extract abstract “ideas” (which are not copyrightable). It tokenizes and stores the statistical relationships between specific words and phrases found in the training data. As demonstrated in the New York Times litigation, the model retains enough fidelity to the original text to reproduce it verbatim when prompted. This is not “learning”; it is compression and retrieval. The US Copyright Office, in its 2025 report on AI training, explicitly rejected the “inherently transformative” argument, noting that when a model is trained to produce content that appeals to the same audience as the original work, the use is “at best, modestly transformative.”
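The “compression and retrieval” point can be shown with a deliberately tiny stand-in for transformer training: even a trivial statistical model, once the context is specific enough, replays its training text verbatim rather than producing anything new. This toy is ours, simplified for illustration:

```python
from collections import defaultdict

# "Train" a toy bigram -> next-word model on a single document.
words = ("the model retains enough fidelity to the original "
         "text to reproduce it verbatim when prompted").split()
model = defaultdict(list)
for a, b, c in zip(words, words[1:], words[2:]):
    model[(a, b)].append(c)   # statistical relationships between words

# "Prompt" with the first two words and greedily decode.
out = words[:2]
while tuple(out[-2:]) in model:
    out.append(model[tuple(out[-2:])][0])
print(" ".join(out))  # reproduces the training sentence word-for-word
```

Real LLMs are vastly larger and usually generalize, but the same mechanism explains why frequently repeated or distinctive passages come back verbatim.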

The Warhol Effect: A Supreme Court Reality Check

For years, Silicon Valley relied on a broad interpretation of “transformative use,” assuming that any technological processing of data qualified for fair use protection. That assumption collapsed with the Supreme Court’s 2023 ruling in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith. The Court held that even if a new work adds new expression or meaning, it is not “fair use” if it shares the same commercial purpose as the original and competes in the same market. This ruling struck at the heart of the AI defense strategy.

OpenAI argues that its models serve a different purpose than the books and articles they ingest: that they are “tools for generating new content” rather than “archives.” Yet for non-fiction and news, this distinction evaporates. The purpose of a news article is to inform the reader. The purpose of a ChatGPT summary of that article is also to inform the reader. If a user asks ChatGPT, “What are the key findings of the latest NYT investigation?”, the AI provides the information without the user ever visiting the Times website. Under the Warhol standard, this is a clear market substitute. The AI is not transforming the purpose; it is hijacking the audience. The commercial nature of OpenAI’s subscription model further weakens its defense, as it directly monetizes the value created by the original journalists and authors.

The Google Books False Equivalence

OpenAI frequently cites Authors Guild v. Google (2015), the “Google Books” case, as its legal shield. In that case, the Second Circuit ruled that Google’s scanning of millions of books was fair use because it created a searchable index that displayed only “snippets” of text. The court found that this “snippet view” did not substitute for the original books; in fact, it likely drove sales by helping users discover them. OpenAI attempts to position ChatGPT as the spiritual successor to Google Books, but the functional difference is clear.

Google Books acted as a pointer; ChatGPT acts as a replacement. A search engine sends traffic to the source; an LLM answers the query within its own interface, keeping the user inside the “walled garden.” When an LLM summarizes a non-fiction book chapter by chapter, it provides the “value” of the book, the knowledge, without the purchase. This “expressive substitution” destroys the incentive to buy the original work. In the Bartz v. Anthropic ruling (June 2025), the court recognized this distinction, finding that while training on legally purchased books might be fair use, the use of pirated datasets (like Books3) to create a competing product constituted infringement. The “Google Books” defense fails because LLMs do not drive discovery; they automate consumption.

Market Usurpation in the Information Economy

The “transformative” defense collapses completely when applied to the market for non-fiction and news archives. Unlike fiction, where the “experience” of reading the prose is the primary value, non-fiction is frequently consumed for its factual content. By extracting and synthesizing these facts, LLMs strip-mine the value of the work. A biography that took five years to research can be condensed into a five-minute read by an LLM. While facts themselves are not copyrightable, the selection and arrangement of those facts, the narrative structure, is protected. LLMs appropriate this structure to deliver a “detailed guide” that renders the original obsolete.

The economic damage is measurable. Licensing negotiations reveal the true value of this data. If training were truly “fair use,” OpenAI would have no legal reason to sign multi-million dollar deals with publishers like Axel Springer or the Associated Press. These agreements function as an admission that the data has value and that its unauthorized use carries legal risk. OpenAI pays the largest players to avoid litigation while continuing to scrape the work of independent authors and smaller publishers who lack the resources to sue. This two-tiered approach exposes the “transformative” argument for what it is: a legal bluff designed to delay regulation while the models achieve market dominance.

Comparison of Fair Use Defenses: Google Books vs. Generative AI

| Factor | Google Books (2015 Ruling) | Generative AI (Current Reality) |
| --- | --- | --- |
| Purpose | Search index / pointer to source | Content generation / substitute for source |
| Amount Used | 100% scanned, snippets displayed | 100% scanned, variable output (summaries to verbatim) |
| Market Effect | No significant substitution; aids discovery | Direct substitution; reduces traffic/sales |
| Legal Status | Ruled fair use (transformative) | Contested (Warhol precedent narrows “transformative”) |

Shadow Libraries and Z-Library: The Illicit Origins of 'Clean' Datasets

The ‘Books2’ Enigma: A Black Box of Stolen Knowledge

OpenAI openly admits to training its models on a dataset it calls “Books1,” widely identified as Project Gutenberg, a repository of public domain literature. Yet, the company maintains a rigid silence regarding a second, much larger corpus simply labeled “Books2.” This dataset, comprising an estimated 294,000 titles, represents the dark matter of the GPT training process. While OpenAI refuses to disclose its contents, forensic analysis and legal discovery have pierced the veil of secrecy. Evidence suggests “Books2” is not a licensed collection but a sanitized alias for the world’s most notorious shadow libraries: Library Genesis (LibGen) and Z-Library. The mathematical impossibility of “Books2” being a legitimate dataset betrays its origins. No commercial entity offers a digital license for 294,000 diverse, high-quality copyrighted books in a single batch for AI training. The only repositories matching this specific volume and breadth are illicit. Shadow libraries operate as massive, decentralized archives that host millions of pirated EPUBs, PDFs, and academic papers. They exist outside the law, ignoring copyright notices and bypassing paywalls to provide free access to the sum of human knowledge. For a company seeking to ingest the world’s information, these sites offered an irresistible, cost-free resource.
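The arithmetic behind this impossibility claim is easy to reproduce. A minimal sketch in Python, assuming a 140,000-word average for a full-length book and the common rule of thumb of roughly 1.33 tokens per word; both constants are illustrative assumptions, not OpenAI disclosures:

```python
# Back-of-the-envelope check of the "Books2" scale claim.
# All constants are assumptions for illustration, not OpenAI figures.
BOOKS2_TOKENS = 55_000_000_000  # token count reported for "Books2" in the GPT-3 paper
WORDS_PER_BOOK = 140_000        # assumed average length of a full-length trade book
TOKENS_PER_WORD = 1.33          # rough rule of thumb for English text

tokens_per_book = WORDS_PER_BOOK * TOKENS_PER_WORD  # ~186,200 tokens per book
implied_books = BOOKS2_TOKENS / tokens_per_book     # ~295,000 titles

print(f"Implied corpus size: ~{implied_books:,.0f} full-length books")
```

Under these assumptions the corpus resolves to roughly 295,000 titles, which is how the figure cited throughout this review was derived.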

From ‘LibGen1’ to ‘Books2’: The Internal Laundering

Legal filings from the Authors Guild and other plaintiffs have unearthed internal practices that contradict OpenAI’s public stance on safety and compliance. Court documents allege that OpenAI employees downloaded massive quantities of data from Library Genesis in 2018. These internal corpora were initially tagged with explicit descriptors: “LibGen1” and “LibGen2.” Before the release of GPT-3, these filenames were reportedly altered to the innocuous “Books1” and “Books2.” This renaming served a dual purpose: it obscured the illicit provenance of the data and presented a veneer of curation to the public and investors. The timeline of these acquisitions is damning. The FBI seized Z-Library’s primary domains in November 2022, charging its operators with criminal copyright infringement. By then, however, the damage was irreversible. OpenAI had likely already scraped, processed, and absorbed the library’s contents into the neural weights of GPT-3 and GPT-4. The seizure removed the public interface of the library, but the data survives, immortalized within the models. Every time ChatGPT summarizes a copyrighted non-fiction bestseller or explains a complex concept from a paywalled textbook, it draws upon this ghostly archive of stolen material.

The Non-Fiction Imperative: Why Fiction Wasn’t Enough

The theft of shadow libraries was not about volume; it was a strategic necessity for model performance. While public domain fiction provides syntax and narrative structure, it lacks the dense, factual reasoning required for a “smart” AI. To build a model capable of passing the bar exam, diagnosing medical conditions, or writing code, OpenAI needed non-fiction. They needed textbooks, monographs, technical manuals, and academic journals. Z-Library and LibGen specialize in exactly this type of content. Unlike Project Gutenberg, which stops at 1929, shadow libraries house the modern scientific and intellectual output of the last century. They contain the standard university curriculum, books on quantum mechanics, econometrics, molecular biology, and computer science. By ingesting this specific corpus, OpenAI did not just teach its model how to speak; it taught the model how to think using the proprietary research and pedagogical structures of the world’s leading experts. The “reasoning” capabilities of GPT-4 are, in large part, a result of processing millions of unauthorized textbooks that define the logic of their respective fields.

Spoliation and the Deletion of Evidence

As scrutiny intensified in 2022, OpenAI took steps that plaintiffs describe as spoliation of evidence. The company reportedly deleted the original “Books1” and “Books2” datasets from its servers, claiming they were no longer in use. This deletion conveniently occurred just as class-action lawsuits began to materialize. By destroying the source files, OpenAI made it significantly harder for forensic data scientists to prove a direct one-to-one match between the training data and the copyrighted works. However, the model itself remains a witness. Researchers have demonstrated that LLMs can memorize and regurgitate long passages of text from books present in shadow libraries but absent from the public web. When prompted with specific, unique strings from a copyrighted book found on Z-Library, the model frequently completes the passage verbatim. This “eidetic memory” serves as a digital fingerprint, linking the clean output of the chatbot directly to the dirty data of the piracy hub.
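This probing technique can be sketched in a few lines. The `complete` callable below is a hypothetical stand-in for whatever model API is under test, and the overlap metric is a simple longest-common-substring share, not any plaintiff’s actual forensic tooling:

```python
from difflib import SequenceMatcher
from typing import Callable

def verbatim_overlap(complete: Callable[[str], str],
                     prefix: str, true_continuation: str) -> float:
    """Prompt the model with a unique passage prefix and measure how much of
    the true continuation it reproduces verbatim (1.0 = fully memorized)."""
    generated = complete(prefix)  # hypothetical call to the model under test
    matcher = SequenceMatcher(None, generated, true_continuation)
    match = matcher.find_longest_match(0, len(generated),
                                       0, len(true_continuation))
    return match.size / max(len(true_continuation), 1)

# A score near 1.0 on text that exists only in a shadow-library copy is the
# "digital fingerprint" linking the model's output to the pirated source.
```

A high score on a passage absent from the open web is what distinguishes memorization of the book itself from incidental exposure to quotations.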

The Industry Standard of Theft

OpenAI does not stand alone in this practice, though it remains the primary target of litigation. The use of shadow libraries appears to be an open secret within the AI development sector. Internal communications from Meta, revealed in separate litigation, show executives discussing the use of LibGen for training their LLaMA models. In one exchange, a Meta engineer noted that using such data was legally risky but necessary to compete, with Mark Zuckerberg reportedly giving his approval. This industry-wide reliance on piracy suggests that the performance of modern AI is fundamentally tethered to the existence of illegal archives. The utilization of Z-Library and LibGen represents a massive transfer of value. The academic publishing industry and the trade book market operate on a model of scarcity and paid access. Shadow libraries subvert this by demonetizing the content. OpenAI then recapitalizes this demonetized content, selling access to the intelligence derived from it via monthly subscriptions. The authors and publishers, whose works provided the cognitive architecture for the model, receive nothing. The “clean” interface of ChatGPT launders the reputation of the data, presenting the output of a pirate library as the product of advanced engineering.

Comparison of Legitimate vs. Shadow Data Sources
| Feature | Project Gutenberg (Books1) | Shadow Libraries (Books2/LibGen) |
| --- | --- | --- |
| Copyright Status | Public Domain (Pre-1929) | Full Copyright (Modern & Contemporary) |
| Content Type | Classic Literature, Fiction | Textbooks, Academic Papers, Non-Fiction, Bestsellers |
| Value to AI | Language Syntax, Narrative Style | Complex Reasoning, Factual Knowledge, Technical Logic |
| Acquisition Cost | Free (Legal) | Free (Illegal) |
| Transparency | Disclosed by OpenAI | Concealed, Renamed, Deleted |

This reliance on shadow libraries exposes the fragility of the ethical claims made by AI companies. The argument that these models are “learning like a human” collapses when the method of learning involves the automated ingestion of millions of stolen files—a feat no human could perform. The “Books2” dataset remains the smoking gun of the AI copyright wars, a testament to the fact that the smartest machines on earth were educated in a library built on theft.

Market Usurpation: How LLMs Threaten the Livelihood of Non-Fiction Authors

The economic threat posed by Large Language Models (LLMs) to non-fiction authors and journalists is not theoretical; it is a documented, quantifiable displacement of human labor. By 2026, the “substitution effect”—a legal concept central to copyright infringement cases—has transitioned from an abstract fear to a market reality. OpenAI’s models do not “learn” from non-fiction texts; they actively compete with them, offering users free, synthesized derivatives that bypass the need to purchase books or visit news websites.

The Displacement of Freelance Labor

The initial tremors of this market usurpation were felt in the freelance sector. Data from online labor markets between 2023 and 2025 reveals a sharp contraction in demand for human writers. A study analyzing transaction data from platforms like Upwork showed a 30% to 33% decline in the number of writing jobs and a 5.2% drop in monthly earnings for freelancers immediately following the release of ChatGPT. This was not a temporary fluctuation but a structural shift. Surveys conducted by the Society of Authors (SoA) and the Authors Guild paint a grim picture. By 2025, 86% of surveyed authors reported reduced earnings attributable to generative AI. The impact was particularly severe for translators and illustrators, with 36% of translators and 26% of illustrators losing work directly to AI automation. The “stabilization” of AI adoption in corporate sectors—where LLMs have permanently replaced human workers for drafting, summarization, and basic reporting—has erased the entry-level tier of the writing profession.

The “Zero-Click” News Crisis

For the journalism industry, OpenAI’s integration into search and information retrieval systems has accelerated the “zero-click” phenomenon. By May 2025, nearly 69% of Google searches ended without a click to a publisher’s website, a trend exacerbated by AI-generated “overviews” and chatbots that scrape reporting to provide self-contained answers. The *New York Times v. OpenAI* lawsuit crystallized this existential threat. The *Times* argued that ChatGPT serves not as a research tool but as a direct market substitute. When a user prompts an LLM for a summary of a paywalled investigation, the model delivers the core findings, frequently verbatim, without generating a visit to the original source. This bypass deprives publishers of subscription revenue, advertising impressions, and licensing fees. Industry projections from 2025 indicate that news publishers face a 43% decline in search referral traffic by 2029, a loss that the meager 0.13% to 1% referral rate from AI chatbots fails to offset.

The “Sham” Book Industry

In the book market, non-fiction authors face a deluge of parasitic AI-generated content. Amazon’s Kindle store has been flooded with “summary” books, “workbooks,” and unauthorized biographies that piggyback on the release of major non-fiction titles. These “sham” books, generated in minutes by LLMs, appear alongside legitimate works, confusing consumers and siphoning sales. Notable instances include the proliferation of AI-generated biographies of public figures like Rory Cellan-Jones and Kara Swisher. These unauthorized texts, frequently riddled with hallucinations and factual errors, are published under the names of non-existent authors. In one egregious pattern, “summary bots” automatically generate condensed versions of bestselling non-fiction books within days of their release, selling them for a fraction of the price.
This practice creates a “market dilution” effect, where the value of the original research is eroded by cheap, machine-generated derivatives that pay no royalties to the primary creator.

The “Opt-Out” Mirage and the Media Manager Failure

OpenAI’s defense has frequently relied on the existence of “opt-out” mechanisms, such as the `GPTBot` web crawler, which publishers can block via `robots.txt`. However, critics and legal experts argue this places an unfair burden on creators to police a trillion-dollar company’s infrastructure. Moreover, blocking a crawler does nothing to remove content that has already been ingested into existing models like GPT-4. To quell this criticism, OpenAI announced the development of a “Media Manager” tool in May 2024, promising creators granular control over how their works were used. By early 2025, this tool remained “missing in action.” Reports from internal sources suggested the project was never a priority, described by former employees as a “public relations strategy” rather than a substantive technical solution. The failure to deliver this promised safeguard left authors with no means to protect their intellectual property, reinforcing the Authors Guild’s characterization of the situation as “identity theft on a grand scale.” The cumulative effect of these practices is a transfer of wealth from the creators of knowledge to the owners of the models that exploit it. By treating non-fiction literature and journalism as raw “training data” rather than licensed intellectual property, OpenAI has engineered a system where the machine does not just read the book—it sells it.

The Opt-Out Illusion: Criticisms of OpenAI's Retroactive Data Policies

The introduction of “opt-out” mechanisms by OpenAI in late 2023 marked a strategic pivot in the company’s handling of copyright disputes. Faced with mounting lawsuits and public outcry, the organization unveiled tools ostensibly designed to give creators control over their work. These measures, including the `GPTBot` web crawler identification and various submission forms, were presented as a concession to rights holders. Investigative analysis reveals these policies function less as a remedy for data theft and more as a legal fortification, shifting the burden of protection onto the victims while leaving the core problem of existing unauthorized training data untouched.

The GPTBot Deception

In August 2023, OpenAI announced `GPTBot`, a user agent token that webmasters could block via `robots.txt` files to prevent their sites from being scraped for future model training. This development was heralded by company representatives as a step toward transparency and user choice. Technical scrutiny exposes the limitations of this tool. The directive only instructs the crawler to bypass a site in *future* sweeps. It does nothing to remove data that OpenAI had already harvested during the preceding years of unrestricted scraping. The Common Crawl and other datasets used to train GPT-3 and GPT-4 had already ingested billions of pages before `GPTBot` existed. By the time the opt-out became available, the “learning” was complete. The text of copyrighted books, news archives, and academic papers had already been converted into billions of parameters within the model’s neural network. Blocking `GPTBot` in late 2023 is akin to locking the bank vault after the robbery has occurred; it prevents a second theft but does not recover the stolen assets.

The Bureaucratic Wall

For authors and publishers seeking to remove specific works from training datasets, OpenAI introduced a submission process frequently described by critics as a “bureaucratic wall.” Unlike the automated scraping that ingested their work, the removal process is manual, granular, and labor-intensive. Rights holders must identify specific URLs or provide evidence of ownership for each individual work they wish to exclude. For a prolific author with dozens of titles, or a news organization with millions of archived articles, this requirement creates an impossible workload. The asymmetry is clear: OpenAI used automated bots to ingest content at a rate of millions of documents per hour, yet requires human creators to submit opt-out requests one by one. This friction appears intentional, designed to minimize the number of successful removals while allowing the company to claim it offers a compliance pathway.

The Impossibility of “Unlearning”

A fundamental technical reality renders the entire concept of retroactive opt-outs illusory: Large Language Models cannot easily “forget” specific data. Unlike a traditional database where a record can be deleted with a single command, an LLM stores information as probabilistic weights distributed across its entire network. A specific book is not stored as a file but as a complex set of associations between words and concepts. Removing a specific author’s writing style or factual assertions from a trained model is a scientific challenge that remains largely unsolved. “Machine unlearning” is an experimental field with no reliable, scalable solutions for models the size of GPT-4.
To truly honor an opt-out request for data already trained, OpenAI would need to retrain the model from scratch—a process costing tens of millions of dollars and months of computational time. Consequently, when OpenAI grants a removal request, they only filter the *output* to prevent the model from reciting the text verbatim, or they remove the data from *future* training sets. The model retains the knowledge it gained from the original unauthorized ingestion.

Legal Defense Strategy

Legal experts suggest the opt-out framework is primarily a defense strategy for the courtroom. By offering an opt-out, OpenAI attempts to reframe the narrative from “systematic theft” to “implied consent.” The argument posits that if a creator did not opt out, they implicitly agreed to the use of their data. This flips the standard of copyright law, which requires affirmative permission (opt-in) before using protected work. This strategy also aims to mitigate damages in litigation. If OpenAI can show they provided a mechanism for rights holders to object, they can argue that any continued infringement was not “willful,” potentially reducing financial penalties. The existence of the tool serves the defendant, not the plaintiff.

The “Do Not Train” Standard

The failure of these retroactive policies has led to calls for a universal “Do Not Train” standard, similar to “Do Not Track” in web advertising. Yet, without legislative enforcement, such standards remain voluntary. OpenAI’s adherence to `robots.txt` exclusions is a policy choice, not a legal requirement, and the company reserves the right to change this policy or interpret “fair use” in ways that override these signals. The opt-out mechanisms provided by OpenAI offer the appearance of agency without the substance of control. They demand action from the aggrieved party to stop an activity that was never authorized in the first place. For the millions of pages of non-fiction and journalism already in the latent space of GPT models, these tools offer no recourse. The data is not just in the machine; it *is* the machine.
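For concreteness, the prospective-only opt-out described above amounts to a two-line directive. The `GPTBot` user agent and the `robots.txt` convention are documented by OpenAI; the check below uses only Python’s standard library and an example domain:

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt directive a publisher deploys to refuse GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Future crawls are refused, but nothing here reaches back into the weights
# of models already trained on previously scraped copies of the same pages.
print(parser.can_fetch("GPTBot", "https://example.com/archive/story.html"))  # False
```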
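What “filtering the output” plausibly looks like can be illustrated with a toy n-gram screen. Actual deployments are undisclosed, so the following is an assumption-laden sketch, not OpenAI’s code:

```python
def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-word windows in a text, lowercased for matching."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def suppress_recitation(generation: str, protected_texts: list[str]) -> bool:
    """True if the generation repeats an 8-word run from any protected work.
    Such a filter hides verbatim output at the interface; the model weights
    that encode the text remain untouched, which is the section's point."""
    gen_grams = word_ngrams(generation)
    return any(gen_grams & word_ngrams(src) for src in protected_texts)
```

The design choice matters: an interface filter is cheap and reversible, while genuine unlearning would require retraining, which is exactly why the opt-out is cosmetic.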

Internal Comms and Deleted Datasets: Evidence of Willful Infringement

The Pivot to Secrecy: Concealing the Source

The trajectory of OpenAI’s transparency offers a timeline of incriminating silence. In its early years, the organization operated with a mandate of openness, publishing detailed datasheets for models like GPT-1 and GPT-2. These documents listed training sources with academic precision. Yet, as the models grew in power and commercial potential, this transparency evaporated. The release of GPT-3 marked a definitive shift. OpenAI stopped disclosing the specific contents of its training data, offering only vague descriptors like “internet-based books corpora.” This sudden opacity was not a competitive strategy; it was a legal necessity. By 2020, the volume of data required to improve model performance had outstripped the available public domain. To continue scaling, OpenAI had to ingest copyrighted material. Admitting this publicly would have invited immediate litigation. The shift to “closed” source was less about protecting trade secrets and more about obscuring the provenance of stolen intellectual property.

The ‘Books2’ Smoking Gun

The most damning evidence of unauthorized use lies in the statistical anomalies of the dataset known as “Books2.” In the few disclosures OpenAI made regarding GPT-3, they listed “Books1” (12 billion tokens) and “Books2” (55 billion tokens) as primary sources. “Books1” aligns in size with Project Gutenberg, a legal repository of public domain works. “Books2,” however, presents a mathematical impossibility for a legal dataset. There is no commercially available, licensed corpus of high-quality fiction and non-fiction that matches this size. The only collections of text that fit these parameters are “shadow libraries”: illicit repositories like Library Genesis (LibGen), Z-Library, and Bibliotik, which host millions of pirated e-books.

Independent researchers and plaintiffs in the Tremblay v. OpenAI class action have corroborated this suspicion. The token count of Books2 mirrors the size of the Bibliotik collection almost exactly. By ingesting this data, OpenAI did not “scrape the web”; they likely downloaded a curated archive of stolen property. This distinction is critical. Scraping the open web allows for a plausible deniability defense regarding “fair use.” Downloading a torrent of pirated books from a shadow library demonstrates active, willful infringement. It suggests that OpenAI engineers sought out specific, high-quality copyrighted literature because the “clean” internet did not provide enough depth for their models to master complex reasoning.
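The same token arithmetic used earlier corroborates both identifications at once. Under the per-book token estimate carried over from the “Books2” calculation (an illustrative assumption, not a disclosed figure), the two corpora resolve to collection sizes that match their suspected sources:

```python
# Cross-checking both GPT-3 book corpora against known collections.
# TOKENS_PER_BOOK is the illustrative estimate used earlier (~186k tokens).
TOKENS_PER_BOOK = 186_000

books1_titles = 12_000_000_000 / TOKENS_PER_BOOK  # ~64,500 titles: close to
                                                  # Project Gutenberg's holdings
                                                  # (~60,000 ebooks circa 2020)
books2_titles = 55_000_000_000 / TOKENS_PER_BOOK  # ~295,700 titles: the scale
                                                  # of curated pirate collections

print(f"Books1 implies ~{books1_titles:,.0f} titles; "
      f"Books2 implies ~{books2_titles:,.0f} titles")
```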

The ‘Accidental’ Deletion of Evidence

Allegations of willful misconduct intensified during the discovery phase of The New York Times v. OpenAI. In late 2024, a serious dispute emerged when OpenAI engineers “accidentally” erased data from a virtual machine provided to the Times’ legal team. This machine contained the results of weeks of forensic analysis, where experts had traced specific copyrighted articles from the Times into OpenAI’s training set. OpenAI attributed the loss to a “system misconfiguration,” but the timing raised immediate suspicions. The deletion forced the plaintiffs to restart their investigation, delaying the legal process and increasing costs.

This was not an isolated incident of data destruction. Court filings revealed that OpenAI had previously deleted the original “Books1” and “Books2” datasets from their internal servers in 2022. The company claimed this was due to “non-use,” a justification that Judge Ona Wang of the Southern District of New York found unconvincing. In a November 2025 ruling, Judge Wang ordered OpenAI to produce internal communications related to this deletion. The court recognized that destroying the only direct evidence of what the models were trained on, right before a wave of copyright lawsuits, could be interpreted as spoliation of evidence. If OpenAI believed their use of these books was legal, retaining the datasets would have been their best defense. Deleting them suggests a consciousness of guilt.

Internal Slack Messages: The Liability Discussion

Discovery proceedings have begun to unearth internal communications that contradict OpenAI’s public stance. Plaintiffs in the Authors Guild case have gained access to internal Slack messages where employees discussed the “Books1” and “Books2” datasets. These messages reportedly show engineers referring to the erasure of these datasets with an awareness of their problematic nature. The plaintiffs argue that these communications prove OpenAI executives knew the data was tainted. Instead of licensing the material or removing it, they chose to purge the source files while retaining the trained model weights, laundering the stolen data into a commercial product.

The existence of these messages undermines the “innocent infringer” defense. Copyright law distinguishes between accidental infringement and willful infringement, with the latter carrying significantly higher statutory damages. If internal emails or chats confirm that employees identified the datasets as “pirated” or “illegal” and proceeded to use them anyway, OpenAI faces a liability catastrophe. The decision to hide the “Books3” dataset (a known component of the open-source “The Pile” dataset, which OpenAI likely used or replicated) further implicates them. While open-source competitors like EleutherAI were transparent about using Bibliotik to build “Books3,” OpenAI kept their equivalent source hidden, likely to avoid the very lawsuits they face.

The ‘Fair Use’ Charade

OpenAI’s legal defense relies entirely on the doctrine of “fair use,” arguing that their models transform the original works into something new. Yet, their internal actions tell a different story. A company confident in its fair use defense does not stop publishing datasheets. It does not delete the raw training data when litigation looms. It does not obfuscate the sources of its most valuable assets. The pattern of secrecy, deletion, and obfuscation suggests that OpenAI’s leadership understood that their “fair use” argument was a legal gamble, not a settled fact. They built their empire on a foundation of mass copyright infringement, betting that they could become too big to fail before the legal system caught up. The internal communications and deleted datasets are not just procedural footnotes; they are the smoking gun of a calculated industrial theft.

Licensing as Admission: Why Deals with Axel Springer Undermine Fair Use

The legal defense of “fair use” relies on a delicate balance, one that OpenAI shattered the moment it began writing checks. For years, the company maintained that ingesting copyrighted works to train Large Language Models (LLMs) was a transformative act requiring no permission and no payment. This argument crumbled on December 13, 2023, when OpenAI announced a “global partnership” with German publishing giant Axel Springer. By agreeing to pay tens of millions of euros for access to content from *Politico*, *Business Insider*, *Bild*, and *Die Welt*, OpenAI did more than secure a data pipeline; it created a market. In the eyes of intellectual property law, the existence of a market for licensing training data is a fatal blow to the claim that no such market exists to be harmed.

The “Market Harm” Trap

Under United States copyright law, the fourth factor of the fair use test examines “the effect of the use upon the potential market for or value of the copyrighted work.” Courts have historically ruled that if a use usurps a market that the copyright holder could reasonably exploit, it is not fair use. OpenAI’s strategy of selective payment hands plaintiffs the evidence they need to prove this usurpation. When OpenAI pays Axel Springer or the Associated Press (AP) for the right to train on their archives, they validate the premise that news archives are a tradable asset in the AI economy. The *New York Times* capitalized on this exact contradiction in its December 27, 2023, lawsuit filed just two weeks after the Axel Springer announcement. The complaint explicitly cites these deals as proof that a viable licensing market exists, arguing that OpenAI’s unauthorized use of *Times* content deprives the publisher of licensing revenue that other tech companies are willing to pay.

Select OpenAI Licensing Agreements & Estimated Values
| Publisher / Entity | Date Announced | Estimated Value / Terms | Scope of Access |
| --- | --- | --- | --- |
| Associated Press (AP) | July 13, 2023 | Undisclosed (IP exchange) | Text archive licensing for training; access to OpenAI tech. |
| Axel Springer | Dec 13, 2023 | Tens of millions (EUR) | Training data + RAG (summaries) for Politico, Business Insider, etc. |
| Financial Times | Apr 29, 2024 | Undisclosed (Multi-million) | Training on archived content; attribution in ChatGPT. |
| Dotdash Meredith | May 7, 2024 | >$16 Million | Content from People, Better Homes & Gardens for training/ad targeting. |
| News Corp | May 22, 2024 | >$250 Million (5 Years) | Access to WSJ, The Times, New York Post, MarketWatch. |
| Time Magazine | June 27, 2024 | Multi-year deal | Access to 101-year editorial archive for training. |

Buying Silence, Fragmenting Opposition

The timing of these agreements suggests a strategy of “divide and conquer” rather than a genuine respect for intellectual property. By cutting lucrative deals with the largest conglomerates (News Corp’s deal is valued at over $250 million over five years), OpenAI splits the publishing industry into two camps: the paid “partners” and the unpaid “litigants.” This fragmentation serves a dual purpose. First, it secures a steady stream of high-quality, real-time journalism to ground ChatGPT’s increasingly erratic outputs. Second, it isolates holdouts like the *New York Times* and the Authors Guild, painting them as Luddites standing in the way of progress, rather than victims of theft. Yet, this approach backfires legally. Every dollar paid to Rupert Murdoch’s News Corp is a dollar that establishes the “going rate” for the theft of similar content from independent authors and smaller outlets who received nothing. Legal scholars note that “transformative use,” the defense that the AI creates something fundamentally new, is harder to sustain when the raw material is being bought and sold for that specific purpose. If the use were truly fair, OpenAI would not need to pay anyone. The checkbook reveals the truth: the data is not just raw material for a transformative machine; it is a product being consumed.

The “Clean” vs. “Dirty” Data Paradox

A serious contradiction sits at the heart of OpenAI’s operations. The company pays News Corp for “clean” access to the *Wall Street Journal*, yet its models remain trained on the “dirty” data of the Books3 dataset, Common Crawl, and the pirated libraries discussed in previous sections. The licensing deals are prospective; they do not scrub the illicitly obtained data already baked into GPT-4’s neural weights. This creates a liability paradox. OpenAI admits that high-quality non-fiction requires compensation, yet refuses to apply that logic retroactively to the millions of books and articles it ingested to build its empire. The “opt-out” mechanisms offered to authors are similarly hollow, as they only prevent *future* scraping, leaving the existing infringing models untouched. The Axel Springer and News Corp deals are not partnerships; they are admissions of guilt priced into the cost of doing business. They demonstrate that the “fair use” defense is a temporary shield, discarded the moment a copyright holder is large enough to pose a serious threat. For the non-fiction author whose work was stolen to build the model behind these deals, the message is clear: fair use applies only to those who cannot afford to fight back.

Academic Exploitation: The Uncompensated Use of Scholarly Journals and Textbooks

The intellectual foundation of the post-2020 artificial intelligence boom rests not on code, but on a vast, unauthorized appropriation of human knowledge. While news outlets focus on the plagiarism of fiction, a far more systematic extraction has targeted the academic sector. OpenAI’s Large Language Models (LLMs) have ingested millions of copyrighted textbooks, peer-reviewed journal articles, and monographs. This process, frequently described by critics as data laundering, converts the proprietary output of the global scientific community into a commercial product, frequently without a single cent reaching the researchers or educators who created it.

For years, the specific composition of OpenAI’s training datasets, known only as “Books1” and “Books2,” remained a closely guarded corporate secret. Yet, forensic analysis and class-action lawsuits have pierced this veil. The sheer volume of data required to train GPT-3 and GPT-4—hundreds of billions of tokens—mathematically necessitates the inclusion of “shadow libraries.” These illicit repositories, such as Library Genesis (LibGen), Z-Library, and Sci-Hub, host pirated copies of nearly every academic text and journal article in existence. In *Authors Guild v. OpenAI*, plaintiffs allege that the company’s models correlate so strongly with the contents of these shadow libraries that unauthorized ingestion is the only logical explanation.

The “Books3” dataset, a component of the open-source “Pile” dataset created by EleutherAI to replicate OpenAI’s methods, provides a grim proxy for what lies inside GPT-4. Books3 contains nearly 200,000 books derived from a torrent of the Bibliotik tracker, a notorious piracy hub. This dataset includes standard university textbooks on subjects ranging from quantum mechanics to macroeconomics. When a student prompts ChatGPT to “explain the concept of elasticity as defined in Mankiw’s *Principles of Economics*,” and the model returns a breakdown mirroring the text’s unique structure and examples, it functions not as a search engine but as a replacement for the textbook itself.

This unauthorized ingestion creates a direct market substitute. Academic publishers and authors rely on the sale of textbooks and access to journals for revenue. An LLM that can synthesize the specific pedagogical methods of a copyrighted textbook removes the incentive for a student to purchase the original work. The model does not “read” the book; it metabolizes the author’s labor—the structuring of complex ideas, the creation of problem sets, the curation of case studies—and regurgitates it as a service.

In 2025, the “AI Disclosures Project” released a study providing empirical evidence of this theft. Researchers tested GPT-4o against a dataset of paywalled technical books from O’Reilly Media. The model demonstrated “strong recognition” of the non-public content, completing passages and solving problems that exist only within those paid resources. This capability confirms that the model was trained on data that could not have been accessed legally through public web scraping. The inclusion of such material suggests a deliberate strategy to bypass copyright controls to acquire high-value technical knowledge.

While OpenAI fights legal battles over shadow libraries, a secondary form of exploitation has emerged from within the publishing industry itself. In 2024, academic conglomerate Taylor & Francis sparked outrage when it signed a deal worth over $10 million to license its authors’ work to Microsoft, OpenAI’s primary backer.
This agreement allowed the tech giant to train its AI models on thousands of journals and textbooks. Crucially, the authors of these works were neither consulted nor offered an opt-out mechanism. This deal, and similar arrangements by Wiley (which projected $44 million in AI licensing revenue), represents a betrayal of academic trust. Scholars publish to advance human knowledge, signing over copyright to publishers under the assumption that the publisher will protect the work and manage its distribution. Instead, these corporations have sold the cumulative life’s work of their authors to train systems that may eventually render those same authors obsolete. The Society of Authors and various academic unions have condemned these deals, noting that while publishers reap millions, the researchers who performed the actual labor receive nothing.

The economic consequences for the academic ecosystem are severe. If an AI can generate a literature review, summarize the latest findings in oncology, or solve complex engineering problems by accessing a training set of pirated journals, the value of the original publication collapses. The “Fair Use” defense employed by OpenAI asserts that their use is transformative. Yet, in the context of textbooks and reference materials, the use is frequently derivative and competitive. A chemistry student using ChatGPT to solve reaction mechanisms from a specific textbook is not using the tool for a “transformative” purpose; they are using it to avoid reading the book.

Moreover, the integrity of the scientific record faces a serious threat. LLMs are prone to hallucination, fabricating citations and misinterpreting data. When a model trained on a mix of high-quality journals and unverified internet detritus answers a query, it flattens the hierarchy of credibility. A peer-reviewed study from *Nature* becomes statistically equivalent to a blog post, both reduced to mere tokens in a probability distribution. This degradation of sourcing undermines the very purpose of academic rigor.

The exploitation extends to the “publish or perish” paradigm. Professors and researchers spend decades building a body of work. OpenAI has enclosed this intellectual commons, privatizing the output of public and university-funded research. The company charges users $20 a month to access a model built on the unpaid labor of the global academic community. It is a wealth transfer of staggering proportions, moving value from the public education and research sectors directly into the coffers of a private Silicon Valley entity.

As of 2026, the legal landscape remains in flux, but the ethical verdict is clear. The unauthorized use of scholarly literature constitutes a massive, uncompensated extraction of value. Whether through the illicit scraping of shadow libraries or the opaque, backdoor deals struck by publishers, the academic community has been strip-mined to fuel the AI revolution. The “intelligence” of these models is not an emergent property of code; it is the stolen wisdom of human scholars, repackaged and sold back to them.
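A recognition test of the kind run against paywalled technical books can be sketched as a cloze probe. The `guess` callable is a hypothetical stand-in for the model under test, and the protocol here is illustrative; the AI Disclosures Project’s exact methodology may differ:

```python
import random
from typing import Callable

def cloze_accuracy(guess: Callable[[str], str], passage: str,
                   n_masks: int = 10, seed: int = 0) -> float:
    """Mask random words in a non-public passage and ask the model to restore
    them. A model never trained on the text should score near chance; high
    accuracy suggests the passage was in the training data."""
    words = passage.split()
    rng = random.Random(seed)
    targets = rng.sample(range(len(words)), k=min(n_masks, len(words)))
    correct = 0
    for i in targets:
        masked = words.copy()
        masked[i] = "[MASK]"
        prompt = "Fill in [MASK] with the original word: " + " ".join(masked)
        if guess(prompt).strip().lower() == words[i].lower():
            correct += 1
    return correct / max(len(targets), 1)
```

Run across many passages from a paywalled corpus, scores well above a control baseline are the kind of “strong recognition” the study reported.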

The Substitution Effect: AI-Generated Summaries as Direct Market Competitors

The economic engine of the publishing industry relies on a simple, established contract: content creators provide information, and platforms provide traffic. For two decades, search engines adhered to this reciprocal arrangement. Google indexed the web, but to extract value, the user had to click a blue link, visiting the source and triggering an ad impression or a subscription opportunity. OpenAI shattered this contract. By 2026, the “Substitution Effect” has transitioned from a theoretical legal argument to a quantifiable market reality. Large Language Models (LLMs) do not merely organize information; they replace the need to consult the original source.

The Mechanics of Market Usurpation

The core of the conflict lies in the architectural difference between a search engine and an “answer engine.” A traditional search engine acts as a signpost; an LLM acts as a librarian who reads the book for you and recites the relevant passages. When a user prompts ChatGPT for a summary of a non-fiction bestseller or a breakdown of a complex news event, the model generates a detailed response that satisfies the user’s curiosity instantly. The user has no reason to click through to the publisher’s site. This phenomenon is what Gartner analysts described as the rise of “substitute answer engines,” predicting a 25% drop in traditional search volume by 2026, a forecast that has proven devastatingly accurate. Data from Chartbeat and the Reuters Institute confirms the scale of this displacement. Between November 2024 and November 2025, organic search traffic to news sites plummeted by 33%. This is not a fluctuation; it is a structural collapse. The “zero-click” future, once a fear, is now the baseline. For publishers, this means the content they spend millions to produce is being ingested, processed, and served to users by a third party that captures 100% of the engagement while remitting zero percent of the revenue.

The Wirecutter Evidence: A Smoking Gun

The *New York Times v. OpenAI* complaint provided the most tangible evidence of this parasitic dynamic. The *Times* highlighted its product review site, Wirecutter, which generates revenue through affiliate links. When a user reads a review on Wirecutter and clicks a link to buy a blender or a set of headphones, the *Times* earns a commission. The legal filing demonstrated that ChatGPT could reproduce Wirecutter’s recommendations verbatim. Yet, in doing so, the AI stripped the affiliate links. The user received the value of the *Times’* rigorous testing and editorial judgment, while the *Times* received nothing. In some instances, the model even “hallucinated” recommendations, attributing endorsements to Wirecutter for products the editorial team had explicitly rejected, damaging the brand’s reputation while simultaneously starving it of revenue. This is not “fair use” transformation; it is direct market competition using the victim’s own inventory.

Factor Four and the Death of Fair Use

OpenAI’s legal defense rests heavily on the doctrine of Fair Use, specifically the claim that their use of copyrighted data is “transformative.” Yet, the Copyright Act of 1976 mandates a four-factor test to determine fair use. The fourth factor is widely considered the most significant: “the effect of the use upon the potential market for or value of the copyrighted work.” If a secondary work serves as a market substitute for the original, fair use fails. The Authors Guild and non-fiction writers argue that LLMs are the market substitute. Why purchase a business strategy book when an LLM can synthesize its core arguments, chapter by chapter, into a five-minute read? The “Books3” dataset and other shadow libraries allowed models to ingest tens of thousands of non-fiction titles. Users can treat ChatGPT as an on-demand summarization service, bypassing the bookstore. The AI does not critique the book; it metabolizes it.

The “Browse” Loophole and Paywall Evasion

The integration of real-time browsing capabilities (RAG, Retrieval-Augmented Generation) further exacerbates the substitution problem. Early iterations of “Browse with Bing” allowed users to bypass paywalls simply by asking the AI to print the text of a locked article. While OpenAI patched the most egregious exploits, the fundamental function remains: the AI visits the page, reads the content, and synthesizes the information. For a subscription-based outlet, this is fatal. If a subscriber can cancel their $20/month newspaper subscription because their $20/month AI assistant can brief them on the morning’s top stories using data scraped from that very newspaper, the market for the original work evaporates. The *Times* complaint cited specific examples where the model reproduced significant portions of Pulitzer Prize-winning articles, such as “Snow Fall,” rendering the original publication redundant.
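Architecturally, the flow at issue is disarmingly simple, which is precisely the problem. The sketch below uses hypothetical `fetch` and `llm` callables; it is a schematic of retrieval-augmented generation in general, not OpenAI’s actual pipeline:

```python
from typing import Callable

def browse_and_answer(query: str, url: str,
                      fetch: Callable[[str], str],
                      llm: Callable[[str], str]) -> str:
    """Retrieval-augmented generation in outline: an agent visits the page,
    the model synthesizes an answer, and the reader never clicks through."""
    page_text = fetch(url)  # the publisher serves the page to the agent...
    prompt = (f"Using only the source below, answer the question.\n"
              f"Question: {query}\n\nSource:\n{page_text}\n\nAnswer:")
    return llm(prompt)      # ...but the user stays inside the chat interface
```

Every step here is standard engineering; the economic effect comes from the last line, where the answer, rather than a referral, is what the user receives.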

The Vampire Economy

The “Custom GPT” store launched by OpenAI introduced another vector of unauthorized substitution. Users began creating specialized bots trained on specific libraries of copyrighted texts: “The Harry Potter Bot,” “The Warren Buffett Investment Bot,” or “The Python Coding Interview Bot.” These user-generated agents were frequently fed pirated EPUBs or PDFs. OpenAI provided the infrastructure for this mass infringement, profiting from the subscription fees while the authors of the underlying material watched their royalty checks dwindle. This creates a “vampire economy.” The AI company sucks the lifeblood (data) from the creative industries to fuel its own growth, leaving the host bodies (publishers and authors) anemic. Unlike the search era, where Google needed a healthy web to index, LLMs theoretically benefit if the open web dies, provided they have already archived its contents. They are not symbiotic; they are extractive.

The 2026 Reality

As of February 2026, the consequences are measurable. Niche publishers, particularly in the “how-to” and informational sectors, have seen traffic from search evaporate. The “10 Best” lists, the tutorial sites, and the explainer journalism sector are being decimated by AI summaries. The user intent, “I need to know how to fix my sink”, is satisfied by the chat interface. The website that actually hired the plumber to write the guide receives no visit, no ad impression, and no affiliate sale. OpenAI’s strategy relies on the assumption that they can outrun the legal consequences until they become too big to fail. They are betting that the “transformative” argument will hold, or that they can settle individual lawsuits with licensing deals that amount to hush money, pennies on the dollar compared to the value of the content they have appropriated. Yet the substitution effect proves that this is not a victimless technological evolution. It is a wealth transfer from the creators of knowledge to the owners of the machines that process it.

Conclusion of the Review

This investigation has examined the systematic, unauthorized utilization of copyrighted works by OpenAI. From the ingestion of the “Books3” shadow library to the scraping of premium news archives, the evidence points to a deliberate strategy of “ask forgiveness, not permission.” The company built a trillion-dollar valuation on the backs of authors, journalists, and academics who never consented to their work being used to train their replacements. The legal battles currently moving through the courts—*The New York Times v. OpenAI*, *The Authors Guild v. OpenAI*—will define the future of human intellectual property. If the courts rule that training an AI to replace a writer is “fair use,” the economic foundation of the creative class will collapse. If they rule against OpenAI, the AI industry faces a reckoning that could upend its current business model. Until then, the theft continues, one token at a time.

