
Unauthorized utilization of copyrighted non-fiction literature and news archives for Large Language Model (LLM) training
If OpenAI cannot prove they purchased legal copies of the thousands of books in their training data, their fair use defense collapses.
Why it matters:
- OpenAI's transition to a closed-source entity left the "Books2" dataset shrouded in secrecy, sparking concerns about its origin.
- Independent researcher Shawn Presser's creation of "Books3" shed light on the potential use of shadow libraries in AI training data, revealing copyrighted works within.
The 'Books3' Revelation: Uncovering the Shadow Library in Training Data

The New York Times v. OpenAI: Allegations of Mass Copyright Infringement

Systematic Theft or Fair Use? Analyzing OpenAI's Legal Defense Strategy
The ‘Fair Use’ Defense: Redefining Copyright for the Age of Algorithms
OpenAI has constructed a legal defense strategy that relies almost entirely on a radical expansion of the “fair use” doctrine. The company asserts that training a Large Language Model is functionally identical to a human student reading a textbook in a library. In their view, the model does not “copy” the expressive content of a book or a news article. Instead, it analyzes statistical relationships between words to learn the underlying patterns of language. This argument seeks to categorize the ingestion of billions of copyrighted works not as reproduction but as “intermediate copying” for a “transformative” purpose. Legal teams for the AI giant frequently cite the precedent set in Authors Guild v. Google (2015). That ruling allowed Google Books to scan millions of volumes to create a searchable database. OpenAI contends that if scanning books to create a search index is fair use, then scanning books to teach a machine how to write must also be fair use.
This defense hinges on the concept of “non-expressive use.” OpenAI argues that their models do not retain the artistic expression of the original authors. They claim the software extracts factual data and stylistic abstractions. When a model processes a copyrighted history book, it is not “memorizing” the text to reprint it. It is learning how historians structure sentences and how dates correlate with events. This distinction is the bedrock of their motion to dismiss in cases like The New York Times v. OpenAI. The company posits that copyright law protects the specific arrangement of words, not the facts or the functional patterns of language contained within them. By framing the training process as a statistical analysis rather than a literary reproduction, OpenAI attempts to bypass the need for licensing altogether.
The ‘Impossible’ Admission: A Confession to the House of Lords
The confidence of this fair use defense was severely tested by OpenAI’s own admissions to the UK Parliament. In a written submission to the House of Lords Communications and Digital Committee in January 2024, the company stated explicitly that it would be “impossible” to train leading AI models without using copyrighted materials. This declaration stripped away any pretense that the company could rely solely on public domain works or licensed data. OpenAI argued that because copyright covers virtually every form of modern human expression, from blog posts to government reports, an AI trained only on out-of-copyright books would be archaic and dysfunctional. This submission was intended to lobby for a broad copyright exception in the UK. Yet it served as a damning confirmation for plaintiffs in the United States. It was an admission that the commercial viability of their product depended entirely on the unauthorized use of protected intellectual property.
Critics and legal scholars seized on this statement as evidence of “unjust enrichment.” The “impossible” defense argues that because the theft is necessary for the product to exist, the theft must be legal. This logic inverts the traditional principles of market competition. In any other industry, if a business model requires a resource that is too expensive to acquire legally, the business model is considered unviable. OpenAI argued the opposite. They claimed that the “societal benefit” of their technology justified the mass appropriation of private property. This utilitarian argument attempts to shift the legal focus from the rights of the creator to the potential utility of the machine. It suggests that the progress of artificial intelligence is a public good that supersedes the “monopoly” rights of individual authors.
The Piracy Pivot: Distinguishing ‘Legal’ Access from ‘Shadow’ Libraries
The legal ground shifted significantly following judicial rulings in 2024 and 2025. Courts began to distinguish between the act of training and the source of the data. In the consolidated cases of Tremblay v. OpenAI and Silverman v. OpenAI the defense successfully argued that the mere existence of a model does not prove it is a “derivative work” of the books it read. Judge Araceli Martínez-Olguín dismissed several claims including vicarious infringement and negligence. She ruled that plaintiffs had to prove that specific outputs were substantially similar to their books. This was a tactical victory for OpenAI. It forced authors to find “smoking gun” examples where ChatGPT regurgitated their text verbatim. Such examples are rare due to the probabilistic nature of the model.
Yet a more dangerous precedent emerged from the parallel Anthropic litigation in June 2025. Judge William Alsup ruled that while using purchased books for training might be “exceedingly transformative” and thus fair use, the use of pirated datasets constitutes a different category of violation. This ruling struck at the heart of OpenAI’s reliance on the “Books3” corpus. If OpenAI cannot prove they purchased legal copies of the thousands of books in their training data, their fair use defense collapses. The “transformative” nature of the processing does not cure the initial act of acquiring stolen goods. OpenAI has since attempted to distance itself from the “Books3” dataset. They emphasize their partnerships with publishers. Yet the presence of pirated libraries in their earlier training runs remains a toxic liability that no amount of current licensing can retroactively sanitize.
The DMCA Technicality: Evading the ‘Removal of Rights’ Charge
A significant component of the authors’ lawsuits involved the Digital Millennium Copyright Act (DMCA). Plaintiffs alleged that OpenAI violated Section 1202 by removing “Copyright Management Information” (CMI) such as titles, author names, and ISBNs during the training process. The argument was that by stripping this identifying data OpenAI facilitated copyright infringement and concealed the origin of the text. This claim posed a serious threat because the DMCA allows for statutory damages of up to $25,000 per violation. With billions of documents involved, the potential liability was astronomical.
OpenAI’s legal team dismantled this argument by focusing on the requirement of “intent.” They argued that the training process scrapes text automatically and that any removal of CMI was an incidental side effect of data cleaning rather than a malicious attempt to conceal infringement. The courts largely agreed with this interpretation in the early phases of the Tremblay litigation. The judge ruled that the plaintiffs failed to show that OpenAI knowingly removed the CMI to induce infringement. This technical victory allowed OpenAI to avoid the catastrophic damages associated with the DMCA. It narrowed the scope of the battle to the core copyright question. The company successfully framed the removal of author names not as a cover-up but as a necessary technical step in preparing data for tokenization. This defense relies on the complexity of the “black box” to obscure the intent behind the data processing.
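The mechanics of "incidental" CMI removal are easy to illustrate. The sketch below is a hypothetical boilerplate filter of the generic kind used to prepare scraped pages for tokenization; the patterns and page content are invented for illustration and do not represent OpenAI's actual pipeline. The point is that rules written to drop navigation junk and footers also drop bylines and copyright notices, which is precisely the metadata the DMCA claims concern.

```python
import re

# Hypothetical boilerplate patterns of the sort used to clean scraped text
# before tokenization. Illustrative only -- not any company's real pipeline.
# Note the side effect: the same rules that remove page furniture also
# remove Copyright Management Information (bylines, copyright notices).
BOILERPLATE = [
    re.compile(r"^by [A-Z][\w. ]+$", re.IGNORECASE),        # bylines
    re.compile(r"^copyright\b.*$", re.IGNORECASE),          # copyright notices
    re.compile(r"^all rights reserved\.?$", re.IGNORECASE),
]

def clean(lines):
    """Keep only lines that match no boilerplate pattern."""
    return [ln for ln in lines if not any(p.match(ln.strip()) for p in BOILERPLATE)]

# Invented example page: the body text survives, the CMI does not.
page = [
    "The Hidden Cost of Taxi Medallions",
    "By Jane Doe",
    "Copyright 2019 Example News Co.",
    "Lenders extended loans that drivers could never repay.",
]
print(clean(page))
```

Whether this stripping is "incidental" or purposeful is exactly the intent question the courts weighed: the filter has no notion of concealment, yet its output carries no trace of authorship.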
The Licensing Paradox: Paying for What You Claim is Free
The most glaring contradiction in OpenAI’s defense strategy is its aggressive pursuit of licensing deals. While asserting in court that they have a fair use right to train on all publicly available text, the company has simultaneously signed multi-million dollar agreements with entities like The Associated Press, Axel Springer, and News Corp. If training is fair use, then these payments are unnecessary. OpenAI characterizes these deals as “partnerships” for real-time access and attribution rather than as copyright licenses for training data. They argue they are paying for the “freshness” of the news feed and the right to display snippets in search results. This distinction allows them to maintain their fair use stance in court while buying peace with the most powerful media conglomerates.
Legal analysts view this as a “risk reduction” strategy. By paying off the largest potential litigants, OpenAI isolates the individual authors and smaller publishers who lack the resources to sustain a protracted legal war. The deals also serve as a hedge against a potential loss in the NYT case. If the courts eventually rule that training requires a license, OpenAI can claim they are already a “responsible actor” that compensates rights holders. This dual-track strategy creates a two-tiered system. Large corporations get paid while individual writers are told their work is “fair use” fodder for the machine. The company uses its vast capital to create a private licensing regime that undermines the very legal principle they defend in the courtroom. They pay when they must and take when they can.
Burden Shifting: The ‘Opt-Out’ Defense
OpenAI has further fortified its position by introducing “opt-out” mechanisms like the GPTBot user agent. They argue that because they offer a way for webmasters to block their crawler, any site that does not block them has implicitly consented to be scraped. This argument attempts to shift the burden of copyright enforcement from the data user to the rights owner. It ignores the fact that the vast majority of the training data was collected years before these opt-out tools existed. The “Books3” dataset and the Common Crawl archives were ingested long before any author had the option to say no. OpenAI treats this retroactive consent as a valid legal shield. They assert that their current “good faith” efforts to respect robots.txt should mitigate any liability for past actions. This defense relies on the sheer inertia of the internet. It assumes that silence equals permission and that the default state of all digital content is to be available for AI training unless explicitly marked otherwise.
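The opt-out mechanism itself is a two-line robots.txt rule, which is part of why OpenAI presents it as a sufficient remedy. A minimal sketch using Python's standard robots.txt parser shows the rule in action; the URL is a placeholder, and "GPTBot" is OpenAI's documented crawler token.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt rule publishers began deploying in 2023 to
# refuse OpenAI's crawler. "GPTBot" is OpenAI's documented user-agent token.
rules = """User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot is refused everywhere; an agent with no matching record
# falls through to the parser's default of "allowed".
print(parser.can_fetch("GPTBot", "https://example.com/archive/story.html"))
print(parser.can_fetch("SomeOtherBot", "https://example.com/archive/story.html"))
```

The asymmetry the text describes is visible in the default: any site that never publishes the rule is treated as consenting, and nothing in the protocol reaches data collected before the token existed.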

The Authors Guild Class Action: Fiction and Non-Fiction Writers Unite

Digital 'Regurgitation': Evidence of Verbatim Text Reproduction in ChatGPT
The sanitized term is “memorization.” In the sterile corridors of machine learning research, it refers to a model’s tendency to encode specific training data so perfectly that it can be recalled sequence-for-sequence. In the courtroom, however, this phenomenon is known by a far more damaging name: digital regurgitation. This is not a technical quirk; it is the smoking gun that strips away the veneer of “learning” to reveal what critics contend is little more than a high-tech photocopier operating at industrial scale. For years, OpenAI maintained that its models did not “copy” text but rather “learned concepts” in the same way a human student might study a library. They argued that ChatGPT synthesized information, creating sentences based on statistical probabilities. That defense crumbled visibly in December 2023, when The New York Times filed a lawsuit that included **Exhibit J**, a document that may go down in legal history as the most devastating proof of copyright infringement ever assembled against an AI company. Exhibit J did not contain abstract arguments. It contained one hundred specific examples where GPT-4, when prompted with the first few paragraphs of a *Times* article, proceeded to output the remainder of the text verbatim. In one instance, the model reproduced a Pulitzer Prize-winning investigation into the taxi industry, “The ‘New’ Yellow Cab,” with near-perfect fidelity. It did not summarize; it did not paraphrase. It recited the copyrighted text word-for-word, including specific quotes, data points, and stylistic flourishes unique to the original authors. This was not “learning concepts”; this was unauthorized republication. The implications of Exhibit J extend far beyond a single newspaper. It demonstrated that the “black box” of the Large Language Model (LLM) is not as opaque as claimed. The data is not dissolved into a nebulous soup of weights and biases; it sits intact, ready to be extracted by any user who knows the right prompt.
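The side-by-side comparisons in Exhibit J reduce to a simple measurement: the longest character-for-character run shared by the model's output and the source article. A minimal sketch of that audit follows; the "article" and "generated" strings are invented placeholders, not actual Times text.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source: str, output: str) -> str:
    """Return the longest character-for-character span shared by the
    source text and the model output -- the basic measurement behind
    memorization audits like Exhibit J's side-by-side comparisons."""
    m = SequenceMatcher(None, source, output, autojunk=False)
    match = m.find_longest_match(0, len(source), 0, len(output))
    return source[match.a : match.a + match.size]

# Invented placeholder strings for illustration only.
article = "The investigation found that lease rules trapped drivers in debt."
generated = "As one report put it, lease rules trapped drivers in debt for years."

run = longest_verbatim_run(article, generated)
print(len(run), repr(run))
```

A paraphrase yields only short shared runs; a regurgitated passage yields runs spanning whole sentences, which is what made Exhibit J so difficult to explain away as coincidence.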
Independent researchers have corroborated these findings with rigorous technical audits. A landmark study by **Nicholas Carlini** and researchers from Google DeepMind, the University of Washington, and Cornell University shattered the assumption that training data is private or unrecoverable. By employing a “divergence attack”—prompting the model to repeat a specific word (like “poem”) forever—the researchers caused ChatGPT to glitch and vomit up raw training data. The model stopped generating coherent text and began outputting massive chunks of memorized information: personally identifiable information (PII), snippets of code, and entire passages from copyrighted literature. This phenomenon is particularly acute with “popular” texts that appear frequently in the training dataset. While OpenAI has attempted to patch these leaks with safety filters, the underlying reality remains: the model *knows* the text. In 2024, researchers demonstrated that **Llama 3.1 70B** (a similar class of model) had memorized entire books, including *Harry Potter and the Sorcerer’s Stone* and George Orwell’s *1984*, almost in their entirety. While these are fiction examples, the method applies equally to non-fiction. The **Books3** dataset, a controversial component of LLM training sets, contains thousands of non-fiction titles—biographies, histories, and technical manuals—that are subject to the same memorization mechanics. For non-fiction authors, the threat of regurgitation manifests differently but no less destructively. A user may not ask ChatGPT to “write the chapter” of a history book, but they frequently ask for “detailed summaries.” Reports from the **Authors Guild** and independent tests show that ChatGPT can generate chapter-by-chapter breakdowns of non-fiction books so detailed that they serve as a market substitute for the original work.
If a user can obtain the core arguments, data, and narrative arc of a new business book or historical biography without purchasing it, the economic damage is identical to piracy, even if the output is not a 100% verbatim copy. OpenAI’s response to these allegations has been a mixture of technical minimization and legal maneuvering. In public statements, they characterize verbatim regurgitation as a “rare bug” that affects only a tiny fraction of queries. They accuse researchers and the *New York Times* of “prompt engineering”—essentially hacking the model to force it to misbehave. They argue that normal users do not spend their time trying to extract copyrighted text. Yet this defense ignores the reality of “Retrieval Augmented Generation” (RAG) and the way users actually interact with these tools. Users *do* want specific information. When a user asks for a recipe, a coding solution, or a news summary, they are frequently unknowingly requesting copyrighted material that the model provides without attribution or compensation. The “glitch” is not the regurgitation; the glitch is the model’s occasional failure to hide the theft. Moreover, the “paywall bypass” capability of these models has alarmed the news industry. Investigations by **Press Gazette** and **INMA** in 2024 and 2025 revealed that ChatGPT could reconstruct the substance of paywalled articles from *The Atlantic*, *The Financial Times*, and *The New York Times*. By scraping “public” fragments—social media posts, Reddit discussions, and syndicated snippets—the model could assemble a “Frankenstein” version of the article that broke the paywall. OpenAI insists this is “fair use” of public data; publishers counter that it is a sophisticated form of fencing stolen goods. The evidence of digital regurgitation neutralizes the “fair use” argument that relies on the “transformative” nature of AI. If a machine outputs the exact text of a copyrighted work, it has not transformed anything.
It has transported it from a protected server to a public chat window. As the legal battles intensify, Exhibit J and the Carlini studies stand as the twin pillars of the prosecution: proof that beneath the hype of “artificial intelligence” lies a vast, unauthorized archive of human labor, waiting to be recalled.

Bypassing Paywalls: The Unauthorized Ingestion of Premium News Archives
The economic survival of modern journalism relies on a simple contract. Readers pay for access to high-quality reporting and publishers use those funds to sustain newsrooms. OpenAI shattered this model. The company did not merely scrape the open web. It systematically penetrated digital barriers designed to protect intellectual property. The training datasets for GPT-3 and GPT-4 contained millions of articles from subscription-based publications. These archives were ingested without permission. They were processed without payment. The resulting models could reproduce premium content verbatim. This theft was not an accident of web crawling. It was a foundational feature of the data acquisition strategy.
The primary vehicle for this unauthorized access was Common Crawl. This non-profit organization maintains a massive repository of web data. OpenAI used this repository as the bedrock for its training sets. Common Crawl’s bots frequently ignore the technical nuances of paywalls. They capture the text of an article before the subscription overlay triggers. They scrape cached versions of pages that are meant to be restricted. OpenAI did not filter this stolen contraband. They fed it directly into their neural networks. The result was a machine that had read the Wall Street Journal and the Financial Times without ever buying a subscription. The company treated the world’s most expensive journalism as free raw material.
The New York Times provided the most damning evidence of this practice in its December 2023 lawsuit. The complaint detailed how GPT-4 could recite large portions of the paper’s content. The model did not just summarize facts. It regurgitated entire paragraphs of Pulitzer Prize-winning investigative work. The Times cited the 2012 multimedia feature “Snow Fall” as a prime example. The article sits behind a strict paywall. Yet the model could reproduce its opening passages word-for-word. This memorization proved that the text resided deep within the model’s parameters. The machine had not just learned from the article. It had cloned it.
The legal battle expanded in 2024 with a lawsuit from Alden Global Capital. The investment firm owns eight major newspapers including the Chicago Tribune and the New York Daily News. Their complaint alleged that OpenAI and Microsoft purloined millions of copyrighted articles. The suit argued that the tech giants siphoned off the revenue of local news organizations. They did this while simultaneously degrading the brand of the newspapers by attributing hallucinations to them. The Alden lawsuit highlighted a specific technical reality. The models frequently displayed full text from articles that human readers could not access without paying. This capability turned ChatGPT into a bootleg reading service. It allowed users to bypass the subscription model entirely.
A more insidious aspect of this theft involves the removal of Copyright Management Information or CMI. This is the digital fingerprint of a piece of writing. It includes the author’s name and the publication title. It also includes the copyright notice. Raw Story and The Intercept filed lawsuits in February 2024 focusing on this violation. They alleged that OpenAI stripped this metadata during the training process. The removal served a specific purpose. It concealed the origin of the text. It made the output appear as generic knowledge rather than the product of specific labor. The Digital Millennium Copyright Act strictly prohibits the removal of CMI. OpenAI’s defense relied on the claim that the removal was not intentional. The plaintiffs argued that it was a necessary step to sanitize the stolen goods.
OpenAI attempted to frame these violations as technical glitches. The company temporarily disabled the “Browse with Bing” feature in July 2023 after users discovered it could bypass paywalls. This feature allowed the chatbot to search the live web. Users quickly realized they could ask the bot to print the text of a locked article. The bot would comply. OpenAI called this an “unwanted” behavior. This excuse ignored the deeper reality. The live browsing tool was accessing the same unauthorized pathways that the training crawlers had used for years. The “glitch” was not the access itself. The glitch was that the public could see it happening in real time.
The subsequent behavior of OpenAI confirms the value of the stolen data. The company began signing licensing deals with major publishers in mid-2024. They struck an agreement with News Corp to access content from the Wall Street Journal. They made similar deals with Axel Springer and the Financial Times. These agreements were worth millions of dollars. They served as a tacit admission. If the data was fair use and free for the taking then there would be no need to pay for it. The checkbook opened only after the lawsuits began. The payments were not for future access alone. They were “retroactive” in nature. They were hush money designed to legalize the theft that had already occurred.
Evidence surfaced in 2025 that the practice continued even as the lawsuits proceeded. A study by the AI Disclosures Project revealed that the model GPT-4o showed strong recognition of non-public books from O’Reilly Media. These technical manuals and guides are sold for profit. They are not free blog posts. The study used “membership inference attacks” to prove the model had seen the text. The results showed that the model knew the contents of paywalled books better than it knew public domain texts. This suggested that the company prioritized high-value proprietary data. They sought out the most expensive information because it was the most reliable. They ingested it regardless of the copyright status.
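Membership inference works by checking whether a model predicts text it was trained on markedly better than text it has never seen. The toy below substitutes an n-gram lookup table for a real language model, but the signal being thresholded is the same in spirit; all strings are invented, and this is not the AI Disclosures Project's actual methodology.

```python
def ngram_table(tokens, n=3):
    """Map each (n-1)-token context to its observed next token."""
    return {tuple(tokens[i:i + n - 1]): tokens[i + n - 1]
            for i in range(len(tokens) - n + 1)}

def completion_rate(table, tokens, n=3):
    """Fraction of contexts whose next token the table predicts exactly --
    a crude stand-in for the likelihood signal a membership inference
    attack thresholds on: 'seen' text scores high, 'unseen' text low."""
    hits = total = 0
    for i in range(len(tokens) - n + 1):
        total += 1
        if table.get(tuple(tokens[i:i + n - 1])) == tokens[i + n - 1]:
            hits += 1
    return hits / total if total else 0.0

# Invented sentences standing in for training data and a held-out text.
training = "the model knew the contents of paywalled books better than public texts".split()
unseen = "an entirely different sentence about something else whatsoever here now".split()

table = ngram_table(training)
print(completion_rate(table, training), completion_rate(table, unseen))
```

A real attack compares token-level log-likelihoods from the model itself, but the verdict is reached the same way: if paywalled O'Reilly text scores like "seen" data, the model was trained on it.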
The defense strategy employed by OpenAI rests on the concept of “fair use.” They argue that training a model is a transformative act. They claim that the machine reads the text to learn the patterns of language. It does not read to consume the information. This argument collapses when the model outputs the text verbatim. A machine that memorizes a paywalled article and serves it to a user is not transforming anything. It is redistributing stolen property. The courts have historically protected the “hot news” doctrine. This legal principle prevents competitors from free-riding on the time-sensitive reporting of others. OpenAI built a business model that is the ultimate free rider. It capitalized on the investigative budgets of the New York Times and the Chicago Tribune. It sold the resulting intelligence for twenty dollars a month.
The impact on the news industry is quantifiable. The New York Times reported a direct correlation between the rise of LLMs and the decline in referral traffic. Users no longer needed to click through to the source. The chatbot provided the answer. The chatbot provided the analysis. The chatbot frequently provided the article itself. This severed the link between the reader and the publisher. It destroyed the advertising funnel. It devalued the subscription proposition. The unauthorized ingestion of news archives was not just a copyright violation. It was a market intervention. It transferred the value of the fourth estate to the balance sheet of a Silicon Valley startup. The journalism was expensive to produce. The theft was free.
The 'Transformative' Myth: Debunking the Core of AI Copyright Defense
The ‘Learning’ Fallacy: Anthropomorphism as Legal Shield
OpenAI’s primary defense against copyright liability rests on a seductive but legally hollow metaphor: that a Large Language Model (LLM) is akin to a student reading a book in a library. In this narrative, the AI “learns” concepts, facts, and styles just as a human would, and therefore its ingestion of copyrighted material constitutes “fair use.” This anthropomorphic framing is a calculated distraction. A human student does not ingest 500,000 books in a week, compress them into a probabilistic matrix, and then sell access to a service that can reproduce the specific expression of those books on command. The “learning” argument attempts to mask the mechanical reality: LLMs are industrial-scale copying engines that rely on the unauthorized reproduction of expression to function.
The technical reality of an LLM contradicts the “transformative” defense. When OpenAI trains a model, it does not extract abstract “ideas” (which are not copyrightable). It tokenizes and stores the statistical relationships between specific words and phrases found in the training data. As demonstrated in the New York Times litigation, the model retains enough fidelity to the original text to reproduce it verbatim when prompted. This is not “learning”; it is compression and retrieval. The US Copyright Office, in its 2025 report on AI training, explicitly rejected the “inherently transformative” argument, noting that when a model is trained to produce content that appeals to the same audience as the original work, the use is “at best, modestly transformative.”
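The "compression and retrieval" point can be made concrete with a deliberately tiny model. A bigram table trained on a single sentence has exactly one continuation per context, so greedy "generation" reconstructs the training text verbatim. Full-scale LLMs are vastly more diluted, but the same mechanism drives verbatim recall of passages seen many times; the sentence and code here are illustrative, not any model's real training data.

```python
from collections import defaultdict

def train_bigram(tokens):
    """Build a next-token table. With a single source text and no
    smoothing, each context maps to one continuation, so 'generation'
    is really retrieval of the training sequence."""
    table = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        table[a].append(b)
    return table

def generate(table, start, max_tokens):
    out = [start]
    for _ in range(max_tokens):
        continuations = table.get(out[-1])
        if not continuations:
            break
        out.append(continuations[0])  # greedy: first observed continuation
    return out

# Toy "training corpus": one sentence, so every context is unambiguous.
text = "copyright law protects the specific arrangement of words".split()
model = train_bigram(text)
print(" ".join(generate(model, "copyright", 20)))
```

The toy regurgitates its source exactly because nothing competes with the memorized continuation; in a large model, the same collapse happens for text duplicated often enough in the training set, which is why popular articles and books leak verbatim.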
The Warhol Effect: A Supreme Court Reality Check
For years, Silicon Valley relied on a broad interpretation of “transformative use,” assuming that any technological processing of data qualified for fair use protection. That assumption collapsed with the Supreme Court’s 2023 ruling in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith. The Court held that even if a new work adds new expression or meaning, it is not “fair use” if it shares the same commercial purpose as the original and competes in the same market. This ruling struck at the heart of the AI defense strategy.
OpenAI argues that its models serve a different purpose than the books and articles they ingest, that they are “tools for generating new content” rather than “archives.” Yet, for non-fiction and news, this distinction evaporates. The purpose of a news article is to inform the reader. The purpose of a ChatGPT summary of that article is also to inform the reader. If a user asks ChatGPT, “What are the key findings of the latest NYT investigation?”, the AI provides the information without the user ever visiting the Times website. Under the Warhol standard, this is a clear market substitute. The AI is not transforming the purpose; it is hijacking the audience. The commercial nature of OpenAI’s subscription model further weakens its defense, as it directly monetizes the value created by the original journalists and authors.
The Google Books False Equivalence
OpenAI frequently cites Authors Guild v. Google (2015), the “Google Books” case, as its legal shield. In that case, the Second Circuit ruled that Google’s scanning of millions of books was fair use because it created a searchable index that displayed only “snippets” of text. The court found that this “snippet view” did not substitute for the original books; in fact, it likely drove sales by helping users discover them. OpenAI attempts to position ChatGPT as the spiritual successor to Google Books, but the functional difference is stark.
Google Books acted as a pointer; ChatGPT acts as a replacement. A search engine sends traffic to the source; an LLM answers the query within its own interface, keeping the user inside the “walled garden.” When an LLM summarizes a non-fiction book chapter by chapter, it provides the “value” of the book, the knowledge, without the purchase. This “expressive substitution” destroys the incentive to buy the original work. In the Bartz v. Anthropic ruling (June 2025), the court recognized this distinction, finding that while training on legally purchased books might be fair use, the use of pirated datasets (like Books3) to create a competing product constituted infringement. The “Google Books” defense fails because LLMs do not drive discovery; they automate consumption.
Market Usurpation in the Information Economy
The “transformative” defense collapses completely when applied to the market for non-fiction and news archives. Unlike fiction, where the “experience” of reading the prose is the primary value, non-fiction is frequently consumed for its factual content. By extracting and synthesizing these facts, LLMs strip-mine the value of the work. A biography that took five years to research can be condensed into a five-minute read by an LLM. While facts themselves are not copyrightable, the selection and arrangement of those facts, the narrative structure, is protected. LLMs appropriate this structure to deliver a “detailed guide” that renders the original obsolete.
The economic damage is measurable. Licensing negotiations reveal the true value of this data. If training were truly “fair use,” OpenAI would have no legal reason to sign multi-million dollar deals with publishers like Axel Springer or the Associated Press. These agreements function as an admission that the data has value and that its unauthorized use carries legal risk. OpenAI pays the powerful to avoid litigation while continuing to scrape the work of independent authors and smaller publishers who lack the resources to sue. This two-tiered approach exposes the “transformative” argument for what it is: a legal bluff designed to delay regulation while the models achieve market dominance.
| Factor | Google Books (2015 Ruling) | Generative AI (Current Reality) |
|---|---|---|
| Purpose | Search index / Pointer to source | Content generation / Substitute for source |
| Amount Used | 100% scanned, snippets displayed | 100% scanned, variable output (summaries to verbatim) |
| Market Effect | No significant substitution; aids discovery | Direct substitution; reduces traffic/sales |
| Legal Status | Ruled Fair Use (transformative) | Contested (Warhol precedent narrows “transformative”) |
Shadow Libraries and Z-Library: The Illicit Origins of 'Clean' Datasets
The ‘Books2’ Enigma: A Black Box of Stolen Knowledge
OpenAI openly admits to training its models on a dataset it calls “Books1,” widely identified as Project Gutenberg, a repository of public domain literature. Yet, the company maintains a rigid silence regarding a second, much larger corpus simply labeled “Books2.” This dataset, comprising an estimated 294,000 titles, represents the dark matter of the GPT training process. While OpenAI refuses to disclose its contents, forensic analysis and legal discovery have pierced the veil of secrecy. Evidence suggests “Books2” is not a licensed collection but a sanitized alias for the world’s most notorious shadow libraries: Library Genesis (LibGen) and Z-Library. The mathematical impossibility of “Books2” being a legitimate dataset betrays its origins. No commercial entity offers a digital license for 294,000 diverse, high-quality copyrighted books in a single batch for AI training. The only repositories matching this specific volume and breadth are illicit. Shadow libraries operate as massive, decentralized archives that host millions of pirated epubs, pdfs, and academic papers. They exist outside the law, ignoring copyright notices and bypassing paywalls to provide free access to the sum of human knowledge. For a company seeking to ingest the world’s information, these sites offered an irresistible, cost-free resource.
From ‘LibGen1’ to ‘Books2’: The Internal Laundering
Legal filings from the Authors Guild and other plaintiffs have unearthed internal practices that contradict OpenAI’s public stance on safety and compliance. Court documents allege that OpenAI employees downloaded massive quantities of data from Library Genesis in 2018. These internal corpora were initially tagged with explicit descriptors: “LibGen1” and “LibGen2.” Before the release of GPT-3, these filenames were reportedly altered to the innocuous “Books1” and “Books2.” This renaming served a dual purpose: it obscured the illicit provenance of the data and presented a veneer of curation to the public and investors. The timeline of these acquisitions is damning. The FBI seized Z-Library’s primary domains in November 2022, charging its operators with criminal copyright infringement. By then, however, the damage was irreversible. OpenAI had likely already scraped, processed, and absorbed the library’s contents into the neural weights of GPT-3 and GPT-4. The seizure removed the public interface of the library, but the data survives, immortalized within the models. Every time ChatGPT summarizes a copyrighted non-fiction bestseller or explains a complex concept from a paywalled textbook, it draws upon this ghostly archive of stolen material.
The Non-Fiction Imperative: Why Fiction Wasn’t Enough
The theft of shadow libraries was not about volume; it was a strategic necessity for model performance. While public domain fiction provides syntax and narrative structure, it lacks the dense, factual reasoning required for a “smart” AI. To build a model capable of passing the bar exam, diagnosing medical conditions, or writing code, OpenAI needed non-fiction. They needed textbooks, monographs, technical manuals, and academic journals. Z-Library and LibGen specialize in exactly this type of content. Unlike Project Gutenberg, which stops at 1929, shadow libraries house the modern scientific and intellectual output of the last century. They contain the standard university curriculum: books on quantum mechanics, econometrics, molecular biology, and computer science. By ingesting this specific corpus, OpenAI did not just teach its model how to speak; it taught the model how to think using the proprietary research and pedagogical structures of the world’s leading experts. The “reasoning” capabilities of GPT-4 are, in large part, a result of processing millions of unauthorized textbooks that define the logic of their respective fields.
Spoliation and the Deletion of Evidence
As scrutiny intensified in 2022, OpenAI took steps that plaintiffs describe as spoliation of evidence. The company reportedly deleted the original “Books1” and “Books2” datasets from its servers, claiming they were no longer in use. This deletion conveniently occurred just as class-action lawsuits began to materialize. By destroying the source files, OpenAI made it significantly harder for forensic data scientists to prove a direct one-to-one match between the training data and the copyrighted works. Yet the model itself remains a witness. Researchers have demonstrated that LLMs can memorize and regurgitate long passages of text from books present in shadow libraries but absent from the public web. When prompted with specific, unique strings from a copyrighted book found on Z-Library, the model frequently completes the passage verbatim. This “eidetic memory” serves as a digital fingerprint, linking the clean output of the chatbot directly to the dirty data of the piracy hub.
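The probing technique described above can be sketched in a few lines. This is a minimal illustration only, not any plaintiff’s actual methodology: `query_model` is a hypothetical stand-in for a real LLM API call, and the metric is a simple word-level prefix match rather than the token-level measures used in published studies.

```python
def verbatim_overlap(continuation: str, reference: str) -> float:
    """Fraction of generated words that match the true continuation,
    counted from the start until the first divergence."""
    gen, ref = continuation.split(), reference.split()
    matched = 0
    for g, r in zip(gen, ref):
        if g != r:
            break
        matched += 1
    return matched / len(gen) if gen else 0.0

def memorization_probe(query_model, book_text: str,
                       prefix_words: int = 50, probe_words: int = 50) -> float:
    """Feed the model a distinctive prefix from a book and measure how
    much of the real continuation it reproduces verbatim."""
    words = book_text.split()
    prefix = " ".join(words[:prefix_words])
    truth = " ".join(words[prefix_words:prefix_words + probe_words])
    generated = query_model(prefix)  # hypothetical LLM call
    return verbatim_overlap(generated, truth)
```

A score near 1.0 on passages that do not appear on the open web is the “digital fingerprint” described here; real audits repeat the probe across many passages per book to rule out chance completion.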
The Industry Standard of Theft
OpenAI does not stand alone in this practice, though it remains the primary target of litigation. The use of shadow libraries appears to be an open secret within the AI development sector. Internal communications from Meta, revealed in separate litigation, show executives discussing the use of LibGen for training their LLaMA models. In one exchange, a Meta engineer noted that using such data was legally risky but necessary to compete, with Mark Zuckerberg reportedly giving his approval. This industry-wide reliance on piracy suggests that the performance of modern AI is fundamentally tethered to the existence of illegal archives. The utilization of Z-Library and LibGen represents a massive transfer of value. The academic publishing industry and the trade book market operate on a model of scarcity and paid access. Shadow libraries subvert this by demonetizing the content. OpenAI then recapitalizes this demonetized content, selling access to the intelligence derived from it via monthly subscriptions. The authors and publishers, whose works provided the cognitive architecture for the model, receive nothing. The “clean” interface of ChatGPT launders the reputation of the data, presenting the output of a pirate library as the product of advanced engineering.
| Feature | Project Gutenberg (Books1) | Shadow Libraries (Books2/LibGen) |
|---|---|---|
| Copyright Status | Public Domain (Pre-1929) | Full Copyright (Modern & Contemporary) |
| Content Type | Classic Literature, Fiction | Textbooks, Academic Papers, Non-Fiction, Bestsellers |
| Value to AI | Language Syntax, Narrative Style | Complex Reasoning, Factual Knowledge, Technical Logic |
| Acquisition Cost | Free (Legal) | Free (Illegal) |
| Transparency | Disclosed by OpenAI | Concealed, Renamed, Deleted |
This reliance on shadow libraries exposes the fragility of the ethical claims made by AI companies. The argument that these models are “learning like a human” collapses when the method of learning involves the automated ingestion of millions of stolen files—a feat no human could perform. The “Books2” dataset remains the smoking gun of the AI copyright wars, a testament to the fact that the smartest machines on earth were educated in a library built on theft.
Market Usurpation: How LLMs Threaten the Livelihood of Non-Fiction Authors
The economic threat posed by Large Language Models (LLMs) to non-fiction authors and journalists is not theoretical; it is a documented, quantifiable displacement of human labor. By 2026, the “substitution effect”—a legal concept central to copyright infringement cases—has transitioned from an abstract fear to a market reality. OpenAI’s models do not merely “learn” from non-fiction texts; they actively compete with them, offering users free, synthesized derivatives that bypass the need to purchase books or visit news websites.
The Displacement of Freelance Labor
The initial tremors of this market usurpation were felt in the freelance sector. Data from online labor markets between 2023 and 2025 reveals a sharp contraction in demand for human writers. A study analyzing transaction data from platforms like Upwork showed a 30% to 33% decline in the number of writing jobs and a 5.2% drop in monthly earnings for freelancers immediately following the release of ChatGPT. This was not a temporary fluctuation but a structural shift. Surveys conducted by the Society of Authors (SoA) and the Authors Guild paint a grim picture. By 2025, 86% of surveyed authors reported reduced earnings attributable to generative AI. The impact was particularly severe for translators and illustrators, with 36% of translators and 26% of illustrators losing work directly to AI automation. The “stabilization” of AI adoption in corporate sectors—where LLMs have permanently replaced human workers for drafting, summarization, and basic reporting—has erased the entry-level tier of the writing profession.
The “Zero-Click” News Crisis
For the journalism industry, OpenAI’s integration into search and information retrieval systems has accelerated the “zero-click” phenomenon. By May 2025, nearly 69% of Google searches ended without a click to a publisher’s website, a trend exacerbated by AI-generated “overviews” and chatbots that scrape reporting to provide self-contained answers.
The *New York Times v. OpenAI* lawsuit crystallized this existential threat. The *Times* argued that ChatGPT serves not as a research tool but as a direct market substitute. When a user prompts an LLM for a summary of a paywalled investigation, the model delivers the core findings, often verbatim, without generating a visit to the original source. This bypass deprives publishers of subscription revenue, advertising impressions, and licensing fees. Industry projections from 2025 indicate that news publishers face a 43% decline in search referral traffic by 2029, a loss that the meager 0.13% to 1% referral rate from AI chatbots fails to offset.
The “Sham” Book Industry
In the book market, non-fiction authors face a deluge of parasitic AI-generated content. Amazon’s Kindle store has been flooded with “summary” books, “workbooks,” and unauthorized biographies that piggyback on the release of major non-fiction titles. These “sham” books, generated in minutes by LLMs, appear alongside legitimate works, confusing consumers and siphoning sales. Notable instances include the proliferation of AI-generated biographies for public figures like Rory Cellan-Jones and Kara Swisher. These unauthorized texts, frequently riddled with hallucinations and factual errors, are published under the names of non-existent authors. In one egregious pattern, “summary bots” automatically generate condensed versions of bestselling non-fiction books within days of their release, selling them for a fraction of the price. This practice creates a “market dilution” effect, where the value of the original research is eroded by cheap, machine-generated derivatives that pay no royalties to the primary creator.
The “Opt-Out” Mirage and the Media Manager Failure
OpenAI’s defense has frequently relied on the existence of “opt-out” mechanisms, such as the `GPTBot` web crawler, which publishers can block via `robots.txt`.
However, critics and legal experts argue this places an unfair burden on creators to police a trillion-dollar company’s infrastructure. Moreover, blocking a crawler does nothing to remove content that has already been ingested into existing models like GPT-4. To quell this criticism, OpenAI announced the development of a “Media Manager” tool in May 2024, promising creators granular control over how their works were used. By early 2025, this tool remained “missing in action.” Reports from internal sources suggested the project was never a priority, described by former employees as a “public relations strategy” rather than a substantive technical solution. The failure to deliver this promised safeguard left authors with no means to protect their intellectual property, reinforcing the Authors Guild’s characterization of the situation as “identity theft on a grand scale.” The cumulative effect of these practices is a transfer of wealth from the creators of knowledge to the owners of the models that exploit it. By treating non-fiction literature and journalism as raw “training data” rather than licensed intellectual property, OpenAI has engineered a system where the machine does not just read the book—it sells it.
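For reference, the opt-out OpenAI points publishers to is a `robots.txt` directive. `GPTBot` is the crawler’s documented user agent string; a blanket block looks like the stanza below, though, as noted, it governs only future crawls by compliant bots and does nothing about content already baked into trained models:

```
User-agent: GPTBot
Disallow: /
```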
The Opt-Out Illusion: Criticisms of OpenAI's Retroactive Data Policies
Internal Comms and Deleted Datasets: Evidence of Willful Infringement
The Pivot to Secrecy: Concealing the Source
The trajectory of OpenAI’s transparency offers a timeline of incriminating silence. In its early years, the organization operated with a mandate of openness, publishing detailed datasheets for models like GPT-1 and GPT-2. These documents listed training sources with academic precision. Yet, as the models grew in power and commercial value, this transparency evaporated. The release of GPT-3 marked a definitive shift. OpenAI stopped disclosing the specific contents of its training data, offering only vague descriptors like “internet-based books corpora.” This sudden opacity was not a competitive strategy; it was a legal necessity. By 2020, the volume of data required to improve model performance had outstripped the available public domain. To continue scaling, OpenAI had to ingest copyrighted material. Admitting this publicly would have invited immediate litigation. The shift to “closed” source was less about protecting trade secrets and more about obscuring the provenance of stolen intellectual property.
The ‘Books2’ Smoking Gun
The most damning evidence of unauthorized use lies in the statistical anomalies of the dataset known as “Books2.” In the few disclosures OpenAI made regarding GPT-3, they listed “Books1” (12 billion tokens) and “Books2” (55 billion tokens) as primary sources. “Books1” aligns in size with Project Gutenberg, a legal repository of public domain works. “Books2,” however, presents a mathematical impossibility for a legal dataset. There is no commercially available, licensed corpus of high-quality fiction and non-fiction that matches this size. The only collections of text that fit these parameters are “shadow libraries”: illicit repositories like Library Genesis (LibGen), Z-Library, and Bibliotik, which host millions of pirated e-books.
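The scale argument can be checked with back-of-envelope arithmetic. The sketch below uses the 55-billion-token figure and the 294,000-title estimate cited earlier; the words-per-token ratio is a rough approximation for English prose, not an OpenAI-disclosed figure.

```python
# Back-of-envelope check on the disclosed GPT-3 dataset sizes (illustrative only).
BOOKS2_TOKENS = 55_000_000_000  # 55 billion tokens, per OpenAI's GPT-3 disclosures
BOOKS2_TITLES = 294_000         # title count estimated in litigation
WORDS_PER_TOKEN = 0.75          # rough approximation for English prose

tokens_per_title = BOOKS2_TOKENS / BOOKS2_TITLES
words_per_title = tokens_per_title * WORDS_PER_TOKEN

print(f"~{tokens_per_title:,.0f} tokens per title")  # ~187,075
print(f"~{words_per_title:,.0f} words per title")    # ~140,306
```

Roughly 140,000 words per title is the length of a substantial trade or academic book, consistent with the claim that only full-book repositories, not web snippets, could supply such a corpus.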
Independent researchers and plaintiffs in the Tremblay v. OpenAI class action have corroborated this suspicion. The token count of Books2 mirrors the size of the Bibliotik collection almost exactly. By ingesting this data, OpenAI did not merely “scrape the web”; they likely downloaded a curated archive of stolen property. This distinction is crucial. Scraping the open web allows for a plausible deniability defense regarding “fair use.” Downloading a torrent of pirated books from a shadow library demonstrates active, willful infringement. It suggests that OpenAI engineers sought out specific, high-quality copyrighted literature because the “clean” internet did not provide enough depth for their models to master complex reasoning.
The ‘Accidental’ Deletion of Evidence
Allegations of willful misconduct intensified during the discovery phase of The New York Times v. OpenAI. In late 2024, a serious dispute emerged when OpenAI engineers “accidentally” erased data from a virtual machine provided to the Times’ legal team. This machine contained the results of weeks of forensic analysis, where experts had traced specific copyrighted articles from the Times into OpenAI’s training set. OpenAI attributed the loss to a “system misconfiguration,” but the timing raised immediate suspicions. The deletion forced the plaintiffs to restart their investigation, delaying the legal process and increasing costs.
This was not an isolated incident of data destruction. Court filings revealed that OpenAI had previously deleted the original “Books1” and “Books2” datasets from their internal servers in 2022. The company claimed this was due to “non-use,” a justification that Judge Ona Wang of the Southern District of New York found unconvincing. In a November 2025 ruling, Judge Wang ordered OpenAI to produce internal communications related to this deletion. The court recognized that destroying the only direct evidence of what the models were trained on, right before a wave of copyright lawsuits, could be interpreted as spoliation of evidence. If OpenAI believed their use of these books was legal, retaining the datasets would have been their best defense. Deleting them suggests a consciousness of guilt.
Internal Slack Messages: The Liability Discussion
Discovery proceedings have begun to unearth internal communications that contradict OpenAI’s public stance. Plaintiffs in the Authors Guild case have gained access to internal Slack messages where employees discussed the “Books1” and “Books2” datasets. These messages reportedly show engineers referring to the erasure of these datasets with an awareness of their problematic nature. The plaintiffs argue that these communications prove OpenAI executives knew the data was tainted. Instead of licensing the material or removing it, they chose to purge the source files while retaining the trained model weights, laundering the stolen data into a commercial product.
The existence of these messages undermines the “innocent infringer” defense. Copyright law distinguishes between accidental infringement and willful infringement, with the latter carrying significantly higher statutory damages. If internal emails or chats confirm that employees identified the datasets as “pirated” or “illegal” and proceeded to use them anyway, OpenAI faces a liability catastrophe. The decision to hide the “Books3” dataset (a known component of the open-source “The Pile” dataset, which OpenAI likely used or replicated) further implicates them. While open-source competitors like EleutherAI were transparent about using Bibliotik to build “Books3,” OpenAI kept their equivalent source hidden, likely to avoid the very lawsuits they face.
The ‘Fair Use’ Charade
OpenAI’s legal defense relies entirely on the doctrine of “fair use,” arguing that their models transform the original works into something new. Yet, their internal actions tell a different story. A company confident in its fair use defense does not stop publishing datasheets. It does not delete the raw training data when litigation looms. It does not obfuscate the sources of its most valuable assets. The pattern of secrecy, deletion, and obfuscation suggests that OpenAI’s leadership understood that their “fair use” argument was a legal gamble, not a settled fact. They built their empire on a foundation of mass copyright infringement, betting that they could become too big to fail before the legal system caught up. The internal communications and deleted datasets are not just procedural footnotes; they are the smoking gun of a calculated industrial theft.
Licensing as Admission: Why Deals with Axel Springer Undermine Fair Use
The legal defense of “fair use” relies on a delicate balance, one that OpenAI shattered the moment it began writing checks. For years, the company maintained that ingesting copyrighted works to train Large Language Models (LLMs) was a transformative act requiring no permission and no payment. This argument crumbled on December 13, 2023, when OpenAI announced a “global partnership” with German publishing giant Axel Springer. By agreeing to pay tens of millions of euros for access to content from *Politico*, *Business Insider*, *Bild*, and *Die Welt*, OpenAI did more than secure a data pipeline; it created a market. In the eyes of intellectual property law, the existence of a market for licensing training data is a fatal blow to the claim that no such market exists to be harmed.
The “Market Harm” Trap
Under United States copyright law, the fourth factor of the fair use test examines “the effect of the use upon the potential market for or value of the copyrighted work.” Courts have historically ruled that if a use usurps a market that the copyright holder could reasonably exploit, it is not fair use. OpenAI’s strategy of selective payment hands plaintiffs the evidence they need to prove this usurpation. When OpenAI pays Axel Springer or the Associated Press (AP) for the right to train on their archives, they validate the premise that news archives are a tradable asset in the AI economy. The *New York Times* capitalized on this exact contradiction in its December 27, 2023, lawsuit, filed just two weeks after the Axel Springer announcement. The complaint explicitly cites these deals as proof that a viable licensing market exists, arguing that OpenAI’s unauthorized use of *Times* content deprives the publisher of licensing revenue that other tech companies are willing to pay.
| Publisher / Entity | Date Announced | Estimated Value / Terms | Scope of Access |
|---|---|---|---|
| Associated Press (AP) | July 13, 2023 | Undisclosed (IP exchange) | Text archive licensing for training; access to OpenAI tech. |
| Axel Springer | Dec 13, 2023 | Tens of millions (EUR) | Training data + RAG (summaries) for Politico, Business Insider, etc. |
| Financial Times | Apr 29, 2024 | Undisclosed (Multi-million) | Training on archived content; attribution in ChatGPT. |
| Dotdash Meredith | May 7, 2024 | >$16 Million | Content from People, Better Homes & Gardens for training/ad targeting. |
| News Corp | May 22, 2024 | >$250 Million (5 Years) | Access to WSJ, The Times, New York Post, MarketWatch. |
| Time Magazine | June 27, 2024 | Multi-year deal | Access to 101-year editorial archive for training. |
Buying Silence, Fragmenting Opposition
The timing of these agreements suggests a strategy of “divide and conquer” rather than a genuine respect for intellectual property. By cutting lucrative deals with the largest conglomerates (News Corp’s deal is valued at over $250 million over five years), OpenAI splits the publishing industry into two camps: the paid “partners” and the unpaid “litigants.” This fragmentation serves a dual purpose. First, it secures a steady stream of high-quality, real-time journalism to ground ChatGPT’s increasingly erratic outputs. Second, it isolates holdouts like the *New York Times* and the Authors Guild, painting them as Luddites standing in the way of progress, rather than victims of theft. Yet this approach backfires legally. Every dollar paid to Rupert Murdoch’s News Corp is a dollar that establishes the “going rate” for the theft of similar content from independent authors and smaller outlets who received nothing. Legal scholars note that “transformative use,” the defense that the AI creates something fundamentally new, is harder to sustain when the raw material is being bought and sold for that specific purpose. If the use were truly fair, OpenAI would not need to pay anyone. The checkbook reveals the truth: the data is not just raw material for a transformative machine; it is a product being consumed.
The “Clean” vs. “Dirty” Data Paradox
A glaring contradiction sits at the heart of OpenAI’s operations. The company pays News Corp for “clean” access to the *Wall Street Journal*, yet its models remain trained on the “dirty” data of the Books3 dataset, Common Crawl, and the pirated libraries discussed in previous sections. The licensing deals are prospective; they do not scrub the illicitly obtained data already baked into GPT-4’s neural weights. This creates a liability paradox. OpenAI admits that high-quality non-fiction requires compensation going forward, yet refuses to apply that logic retroactively to the millions of books and articles it ingested to build its empire. The “opt-out” mechanisms offered to authors are similarly hollow, as they only prevent *future* scraping, leaving the existing infringing models untouched. The Axel Springer and News Corp deals are not partnerships; they are admissions of guilt priced into the cost of doing business. They demonstrate that the “fair use” defense is a temporary shield, discarded the moment a copyright holder is large enough to pose a serious threat. For the non-fiction author whose work was stolen to build the model that underpins these deals, the message is clear: fair use applies only to those who cannot afford to fight back.
Academic Exploitation: The Uncompensated Use of Scholarly Journals and Textbooks
The intellectual foundation of the post-2020 artificial intelligence boom rests not on code, but on a vast, unauthorized appropriation of human knowledge. While news outlets focus on the plagiarism of fiction, a far more systematic extraction has targeted the academic sector. OpenAI’s Large Language Models (LLMs) have ingested millions of copyrighted textbooks, peer-reviewed journal articles, and monographs. This process, frequently described by critics as data laundering, converts the proprietary output of the global scientific community into a commercial product, often without a single cent reaching the researchers or educators who created it. For years, the specific composition of OpenAI’s training datasets, known opaquely as “Books1” and “Books2,” remained a closely guarded corporate secret. Yet forensic analysis and class-action lawsuits have pierced this veil. The sheer volume of data required to train GPT-3 and GPT-4—hundreds of billions of tokens—mathematically necessitates the inclusion of “shadow libraries.” These illicit repositories, such as Library Genesis (LibGen), Z-Library, and Sci-Hub, host pirated copies of nearly every academic text and journal article in existence. In *Authors Guild v. OpenAI*, plaintiffs allege that the company’s models correlate so strongly with the contents of these shadow libraries that unauthorized ingestion is the only logical explanation. The “Books3” dataset, a component of the open-source “Pile” dataset created by EleutherAI to replicate OpenAI’s methods, provides a grim proxy for what lies inside GPT-4. Books3 contains nearly 200,000 books derived from a torrent of the Bibliotik tracker, a notorious piracy hub. This dataset includes standard university textbooks on subjects ranging from quantum mechanics to macroeconomics.
When a student prompts ChatGPT to “explain the concept of elasticity as defined in Mankiw’s *Principles of Economics*,” and the model returns a breakdown mirroring the text’s unique structure and examples, it functions not as a search engine, but as a replacement for the textbook itself. This unauthorized ingestion creates a direct market substitute. Academic publishers and authors rely on the sale of textbooks and access to journals for revenue. An LLM that can synthesize the specific pedagogical methods of a copyrighted textbook removes the incentive for a student to purchase the original work. The model does not “read” the book; it metabolizes the author’s labor—the structuring of complex ideas, the creation of problem sets, the curation of case studies—and regurgitates it as a service. In 2025, the “AI Disclosures Project” released a study providing empirical evidence of this theft. Researchers tested GPT-4o against a dataset of paywalled technical books from O’Reilly Media. The model demonstrated “strong recognition” of the non-public content, completing passages and solving problems that exist only within those paid resources. This capability confirms that the model was trained on data that could not have been accessed legally through public web scraping. The inclusion of such material suggests a deliberate strategy to bypass copyright controls to acquire high-value technical knowledge. While OpenAI fights legal battles over shadow libraries, a secondary form of exploitation has emerged from within the publishing industry itself. In 2024, academic conglomerate Taylor & Francis sparked outrage when it signed a deal worth over $10 million to license its authors’ work to Microsoft, OpenAI’s primary backer. This agreement allowed the tech giant to train its AI models on thousands of journals and textbooks. Crucially, the authors of these works were neither consulted nor offered an opt-out mechanism.
This deal, and similar arrangements by Wiley (which projected $44 million in AI licensing revenue), represents a betrayal of academic trust. Scholars publish to advance human knowledge, signing over copyright to publishers under the assumption that the publisher will protect the work and manage its distribution. Instead, these corporations have sold the cumulative life’s work of their authors to train systems that may eventually render those same authors obsolete. The Society of Authors and various academic unions have condemned these deals, noting that while publishers reap millions, the researchers who performed the actual labor receive nothing. The economic consequences for the academic ecosystem are severe. If an AI can generate a literature review, summarize the latest findings in oncology, or solve complex engineering problems by accessing a training set of pirated journals, the value of the original publication collapses. The “Fair Use” defense employed by OpenAI asserts that their use is transformative. Yet, in the context of textbooks and reference materials, the use is frequently derivative and competitive. A chemistry student using ChatGPT to solve reaction mechanisms from a specific textbook is not using the tool for a “transformative” purpose; they are using it to avoid reading the book. Moreover, the integrity of the scientific record faces a serious threat. LLMs are prone to hallucination, fabricating citations and misinterpreting data. When a model trained on a mix of high-quality journals and unverified internet detritus answers a query, it flattens the hierarchy of credibility. A peer-reviewed study from *Nature* becomes statistically equivalent to a blog post, both reduced to mere tokens in a probability distribution. This degradation of sourcing undermines the very purpose of academic rigor. The exploitation extends to the “publish or perish” paradigm. Professors and researchers spend decades building a body of work.
OpenAI has enclosed this intellectual commons, privatizing the output of public and university-funded research. The company charges users $20 a month to access a model built on the unpaid labor of the global academic community. It is a wealth transfer of staggering proportions, moving value from the public education and research sectors directly into the coffers of a private Silicon Valley entity. As of 2026, the legal question remains in flux, but the ethical verdict is clear. The unauthorized use of scholarly literature constitutes a massive, uncompensated extraction of value. Whether through the illicit scraping of shadow libraries or the opaque, backdoor deals struck by publishers, the academic community has been strip-mined to fuel the AI revolution. The “intelligence” of these models is not an emergent property of code; it is the stolen wisdom of human scholars, repackaged and sold back to them.
The Substitution Effect: AI-Generated Summaries as Direct Market Competitors
The Mechanics of Market Usurpation
The core of the conflict lies in the architectural difference between a search engine and an “answer engine.” A traditional search engine acts as a signpost; an LLM acts as a librarian who reads the book for you and recites the relevant passages. When a user prompts ChatGPT for a summary of a non-fiction bestseller or a breakdown of a complex news event, the model generates a detailed response that satisfies the user’s curiosity instantly. The user has no reason to click through to the publisher’s site. This phenomenon is what Gartner analysts described as the rise of “substitute answer engines,” predicting a 25% drop in traditional search volume by 2026, a forecast that has proven devastatingly accurate. Data from Chartbeat and the Reuters Institute confirms the scale of this displacement. Between November 2024 and November 2025, organic search traffic to news sites plummeted by 33%. This is not a fluctuation; it is a structural collapse. The “zero-click” future, once a distant fear, is now the baseline. For publishers, this means the content they spend millions to produce is being ingested, processed, and served to users by a third party that captures 100% of the engagement while remitting zero percent of the revenue.
The Wirecutter Evidence: A Smoking Gun
The *New York Times v. OpenAI* complaint provided the most tangible evidence of this parasitic dynamic. The *Times* highlighted its product review site, Wirecutter, which generates revenue through affiliate links. When a user reads a review on Wirecutter and clicks a link to buy a blender or a set of headphones, the *Times* earns a commission. The legal filing demonstrated that ChatGPT could reproduce Wirecutter’s recommendations verbatim. Yet, in doing so, the AI stripped the affiliate links. The user received the value of the *Times’* rigorous testing and editorial judgment, while the *Times* received nothing. In some instances, the model even “hallucinated” recommendations, attributing endorsements to Wirecutter for products the editorial team had explicitly rejected, damaging the brand’s reputation while simultaneously starving it of revenue. This is not “fair use” transformation; it is direct market competition using the victim’s own inventory.
Factor Four and the Death of Fair Use
OpenAI’s legal defense rests heavily on the doctrine of Fair Use, specifically the claim that their use of copyrighted data is “transformative.” Yet the Copyright Act of 1976 mandates a four-factor test to determine fair use. The fourth factor is widely considered the most significant: “the effect of the use upon the potential market for or value of the copyrighted work.” If a secondary work serves as a market substitute for the original, fair use fails. The Authors Guild and non-fiction writers contend that LLMs are precisely that market substitute. Why purchase a business strategy book when an LLM can synthesize its core arguments, chapter by chapter, into a five-minute read? The “Books3” dataset and other shadow libraries allowed models to ingest tens of thousands of non-fiction titles. Users can treat ChatGPT as an on-demand summarization service, bypassing the bookstore. The AI does not critique the book; it metabolizes it.
The “Browse” Loophole and Paywall Evasion
The integration of real-time browsing capabilities (Retrieval-Augmented Generation, or RAG) further exacerbates the substitution problem. Early iterations of “Browse with Bing” allowed users to bypass paywalls simply by asking the AI to print the text of a locked article. While OpenAI patched the most egregious exploits, the fundamental function remains: the AI visits the page, reads the content, and synthesizes the information. For a subscription-based outlet, this is fatal. If a subscriber can cancel their $20/month newspaper subscription because their $20/month AI assistant can brief them on the morning’s top stories using data scraped from that very newspaper, the market for the original work evaporates. The *Times* complaint cited specific examples where the model reproduced significant portions of Pulitzer Prize-winning articles, such as “Snow Fall,” rendering the original publication redundant.
The Vampire Economy
The “Custom GPT” store launched by OpenAI introduced another vector of unauthorized substitution. Users began creating specialized bots trained on specific libraries of copyrighted texts: “The Harry Potter Bot,” “The Warren Buffett Investment Bot,” or “The Python Coding Interview Bot.” These user-generated agents were frequently fed pirated EPUBs or PDFs. OpenAI provided the infrastructure for this mass infringement, profiting from the subscription fees while the authors of the underlying material watched their royalty checks dwindle. This creates a “vampire economy.” The AI company sucks the lifeblood (data) from the creative industries to fuel its own growth, leaving the host bodies (publishers and authors) anemic. Unlike the search era, where Google needed a healthy web to index, LLMs theoretically benefit if the open web dies, provided they have already archived its contents. They are not symbiotic; they are extractive.
The 2026 Reality
As of February 2026, the consequences are measurable. Niche publishers, particularly in the “how-to” and informational sectors, have seen traffic from search evaporate. The “10 Best” lists, the tutorial sites, and the explainer journalism sector are being decimated by AI summaries. The user intent (“I need to know how to fix my sink”) is satisfied by the chat interface. The website that actually hired the plumber to write the guide receives no visit, no ad impression, and no affiliate sale. OpenAI’s strategy relies on the assumption that they can outrun the legal consequences until they become too big to fail. They are betting that the “transformative use” argument will hold, or that they can settle individual lawsuits with licensing deals that amount to hush money: pennies on the dollar compared to the value of the content they have appropriated. Yet the substitution effect proves that this is not a victimless technological evolution. It is a wealth transfer from the creators of knowledge to the owners of the machines that process it.
Conclusion of the Review
This investigation has examined the systematic, unauthorized utilization of copyrighted works by OpenAI. From the ingestion of the “Books3” shadow library to the scraping of premium news archives, the evidence points to a deliberate strategy of “ask forgiveness, not permission.” The company built a trillion-dollar valuation on the backs of authors, journalists, and academics who never consented to their work being used to train their replacements. The legal battles currently moving through the courts, *The New York Times v. OpenAI* and *The Authors Guild v. OpenAI*, will define the future of human intellectual property. If the courts rule that training an AI to replace a writer is “fair use,” the economic foundation of the creative class collapses. If they rule against OpenAI, the AI industry faces a reckoning that could dismantle its current business model. Until then, the theft continues, one token at a time.
