NYT v. OpenAI: The Times’s About-Face | Harvard Law Review

The New York Times has sued OpenAI and Microsoft for the unpermitted use of Times articles to train GPT large language models. The case could have a significant impact on the relationship between generative AI and copyright law, particularly with respect to fair use, and could ultimately determine whether and how AI models are built.

But the complaint is noteworthy for another reason. The Times has been here before — not as plaintiff but as defendant. Less than three decades ago, in New York Times Co. v. Tasini, the publisher fought a group of its own freelance authors, pursuing a strategy contrary to the one it deploys today. Indeed, comparing the Times’s complaint against OpenAI and Microsoft to the defense it mounted in Tasini reveals a stark about-face. In Tasini, the Times explicitly prioritized its own financial interests over those of its authors; now, the publisher relies on a theory of romantic authorship — one that champions the “creative and deeply human” work of journalists — to justify its claims. The Times also argued — though the Tasini Court disagreed — that a win for the freelancers would irrevocably thwart technological development. Today, however, the Times is entirely unsympathetic to the analogous threat its own suit poses to AI companies. The publisher’s change of heart illustrates the persistent challenge of balancing technological development against private intellectual property interests, but it also suggests that, as a plaintiff, the Times ought to be more flexible in its demands for relief in this case.

The facts underlying the Times’s current complaint center on the use of copyrighted works in the development of generative AI tools, namely OpenAI’s ChatGPT and Microsoft’s Bing Chat (or “Copilot”), both of which are built on top of OpenAI’s GPT models. These tools are large language models (LLMs), which are built by “training” on massive corpora of text. The models incorporate information from these datasets and “learn” the patterns of words within a given context. Then, when queried, the LLM predicts the most likely combination of words, generating a natural-language response to the user’s prompt. The latest GPT models are trained on trillions of words — a dataset so large that it would be “the equivalent of a Microsoft Word document that is over 3.7 billion pages long.” At the root of the Times’s complaint is the allegation that this dataset contains a “mass of Times copyrighted content.”
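The prediction mechanic described above can be illustrated with a toy sketch. This is emphatically not how GPT models are built — real LLMs use neural networks over billions of parameters — but a simple word-pair counter shows the basic idea of “learning” which words tend to follow which, then emitting the most likely continuation. The corpus string and function names here are invented for illustration.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict(counts, word):
    """Return the most likely next word after `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# A toy "training set" (a real model trains on trillions of words).
model = train("the times sued openai and the times won the case")
print(predict(model, "the"))  # → "times" ("times" follows "the" twice in the corpus)
```

Even this toy model hints at the memorization problem the complaint describes: a phrase that recurs often enough in the training data becomes the model’s most likely output, so the model can reproduce its training text nearly verbatim.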

The Times’s core allegation is that OpenAI is infringing on copyrights through the unlicensed and unauthorized use and reproduction of Times works during the training of its models. But the problem is amplified in two ways. First, LLMs sometimes “memorize” parts of the works included in training data. When this happens, the models can occasionally generate near-verbatim reproductions of the works. Second, and relatedly, LLMs produce “synthetic” search results that, when prompted, can reproduce “significantly more expressive content from [an] original article than what would traditionally be displayed” by an online search, effectively allowing readers to circumvent the Times’s paywall.

These problems, according to the Times, present a significant threat to high-quality journalism. If readers can easily and at no cost generate summaries — or near-verbatim reproductions — of Times works using GPT models, it will “obviate the need” to purchase access through the Times itself, hamstringing the publisher’s ability to continue funding its journalism. If this happens, “the cost to society will be enormous.” In fact, the Times begins its complaint by emphasizing that “[i]ndependent journalism is vital to our democracy” and is “increasingly rare and valuable.” By contrast, OpenAI is described as nothing better than “a multi-billion-dollar for-profit business built in large part on the unlicensed exploitation of copyrighted works.”

In constructing its complaint around themes of the public good and in painting the dispute as a sort of battle between “good and evil,” the Times is drawing, in part, on a particular narrative of copyright that prioritizes “originality, creativity, and individuality” — what scholars refer to as “romantic authorship.” By highlighting the talent, expertise, and effort of its journalists throughout the complaint, the Times can be understood as positioning itself as a sort of gladiator for the romantic authors it employs.

The problem? The publisher hasn’t always displayed such veneration for romantic authorship. In 1997, the Times found itself on the opposite side of a copyright suit when a group of freelance writers sued the publisher for including their articles in new digital archives. Working with computer database companies — but without the permission of the freelancers — the Times uploaded all the articles published in its periodicals into three databases. At issue was whether these databases constituted a “revision” of the original collective works in which the freelancers’ articles had first been published. Ultimately, the Supreme Court found that, because the databases presented the articles individually rather than maintaining the collective works in their entirety, the databases could not constitute “revisions” of those collective works. The Times was therefore found to have infringed the freelancers’ copyrights.

But more than the actual outcome in Tasini, it is the Times’s approach to that case that makes its current reliance on romantic authorship seem hypocritical. In Tasini, the Times seemed not to value the “creative and deeply human” work of the authors that it relies on today as a justification for protecting its own copyrights. Instead, the publisher’s focus was on the issue of damages and the “irreparable harm” it could face from the lower court’s decision to enforce the freelancers’ copyrights.

In Tasini, the Times was adamant that the judgment would require the deletion of electronically stored articles “on a massive scale” and that “[t]he impact on electronic archives as a useful tool for research would be inestimable.” Dissenting from the Court’s decision, Justice Stevens likewise underscored that forcing archives to “purge” content would undermine the benefits those databases offered — namely efficiency, accuracy, and comprehensiveness. (The majority dismissed those worries, assuming the parties could reach a post-judgment agreement — which they eventually did, leading to the development of entire rights-and-clearance departments and ultimately undercutting authors’ bargaining positions.)

But the Times no longer seems so concerned with preserving the usefulness of new technological tools. In fact, the Times — ironically — now seeks that same sort of total, destructive relief. In its complaint against OpenAI, the Times asks not only for monetary damages and a permanent injunction against further infringement but also for “destruction . . . of all GPT or other LLM models and training sets that incorporate Times Works.”

It will be interesting to see whether the Times is receptive to arguments about the consequences to OpenAI of such a drastic remedy. In Tasini, the Times was reluctant to reach a compromise with the plaintiffs and rejected alternative compensation schemes that would have provided carve-outs for electronic use. Indeed, at oral argument, Justice Ginsburg suggested that contracting separately for print and online publication might further a goal of the 1976 Copyright Act and “give . . . the author more muscle vis-a-vis the publisher.” But the Times insisted that such a regime would have a “very serious impact” on existing writings in electronic databases, which would need to be “defensive[ly] delet[ed].” When the freelancers argued that business could continue as usual if publishers simply negotiated adaptive compensation schemes, the Times maintained that “[t]he retroactive nature of the problem, as well as the sheer logistical difficulties inherent in the process” made such a solution a “practical impossibility.” In other words, authors’ rights were to take second place to the “[e]ntire industries [that had] been built . . . on the expectation” that those authors’ works were fair game for secondary use.

The now-multi-billion-dollar generative AI industry has likewise been built — literally — on the expectation of access to digital datasets. In Tasini, the Times conveyed a preference: it valued the reliance interests of industry over the private protections of copyright. If it’s not willing to accommodate OpenAI’s interests now that the tables have turned and instead moves forward with its dramatic prayer for relief, it will be difficult to excuse the publisher’s shift. The Times’s one-eighty may be chalked up (unsatisfyingly) to litigation strategy, but the story the Times is telling in its complaint will nevertheless be a less convincing one. The Times’s purported commitment to the “deeply human” journalistic endeavor will feel less like a commitment to the humans doing that journalism than to the commercial interests that drive the media giant. Its invocation of authorship will feel exploitative, not romantic at all.