OpenAI Faces Existential Threat in New York Times Copyright Suit

The New York Times Co.'s lawsuit claiming ChatGPT has produced near-verbatim text of published articles threatens to upend the foundation of the booming AI industry, as the creators of the chatbot fight their most consequential copyright battle to date.

The Times’ nearly 70-page complaint, filed in Manhattan federal court, argued that OpenAI Inc. and partner Microsoft Corp. scraped millions of the newspaper’s articles without a license. It alleged that the chatbot produced text virtually identical to published work when given the URL of the original article and a snippet from the beginning of the story.

If the lawsuit is successful, OpenAI could have to pay billions of dollars in damages, and the Times is asking the court to order the destruction of any GPT or other models and training sets that incorporate the Times’ articles. The newspaper’s evidence of “memorization” puts OpenAI in especially dangerous legal territory, attorneys specializing in copyright and technology issues said in interviews.

The Times isn’t arguing ChatGPT is learning facts from the articles or emulating the writing style, said Kristelia García, a copyright law professor at Georgetown University.

“They’re saying it’s spitting out exactly what was put into it,” she said.

The lawsuit also has other potential strengths compared with the handful of ongoing copyright class actions brought by novelists, visual artists, and open-source coders.

The complaint filed Dec. 27 details the Times’ long history of licensing its content to platforms, including those developed by Google, Meta Platforms Inc., and Apple Inc., as well as its recent attempts to strike a similar deal with OpenAI. The suit comes only weeks after German publishing giant Axel Springer SE, which owns Politico and Business Insider, reached a licensing agreement with the AI company.

The case is the first AI copyright suit from the news industry, which has expressed growing concerns that the technology could prove an existential threat to its business model without the proper guardrails.

Copyright experts say the existence of a licensing market for news content could harm OpenAI’s arguments that its copying falls under the “fair use” doctrine that allows copying under certain circumstances.

The case was also brought by a single company, which could help streamline the litigation. Class actions, on the other hand, can get bogged down in disputes over how to accurately define and represent a large class of individual plaintiffs.

“Bad facts make bad law, but I think this will be a case of good facts making good law,” said technology attorney and startup investor Cecilia Ziniti, who previously worked for Amazon.com Inc.

“They crank out words and words and words every day and get 50 to 100 million visits to their website,” she said, referring to the Times. “In terms of a plaintiff that is a high-value copyright holder, this is a good plaintiff.”

OpenAI said in a Dec. 27 statement to Bloomberg News that it respects the rights of content creators and is “committed to working with them to ensure they benefit from AI technology and new revenue models.”

‘Memorization’ Evidence

In an exhibit attached to the complaint, the newspaper presents over a hundred pages of side-by-side comparisons showing the text of Times articles and nearly identical output from OpenAI’s most advanced model, GPT-4. The articles include a Pulitzer Prize-winning investigation of New York’s taxi industry and an infamous negative review of Guy Fieri’s restaurant.

That trove is the most convincing evidence in a lawsuit so far showing an AI model can memorize training data, said Matthew Sag, a law professor at Emory University who studies AI.

“The NYT complaint is categorically different to almost all of the other suits,” he said.

A typical ChatGPT user might not take all the same steps to prompt the chatbot, said Peter Henderson, an incoming professor at Princeton University who studies the intersection of AI and the law. He is a co-author of a research paper showing that with advanced prompting techniques, GPT-4 can produce multiple chapters of a book from the Harry Potter series.

But even if a user needs to use tricks to goad a chatbot into producing copyrighted content, that won’t necessarily protect OpenAI from liability.

“It’s a good way of showing that what the model has learned is not at the abstract level, but that the model has actually specifically memorized a lot of text,” Sag said.

In a copyright lawsuit filed this fall, Universal Music Group and a dozen other music publishers were able to show that Anthropic PBC’s AI chatbot Claude could reproduce copyrighted song lyrics. The AI company hasn’t yet responded in court to the substance of the copyright claims.

OpenAI does appear to have implemented some copyright filters—ChatGPT will refuse to output entire copyrighted works when asked directly. The most recent version of its AI image generator DALL-E also won’t produce images in the style of a living artist and allows artists to opt out of having their work used to train the bot.

“But it is a hard problem to get a filter at the scale of all the input content for every query,” Henderson said.

Fair Use Battle

The case will ultimately hinge on copyright law’s fair use doctrine, attorneys said. The legal test asks courts to evaluate four factors to determine whether an unauthorized copy is legal: the purpose and character of the use, including its “transformativeness”; the nature of the copyrighted work; the amount of the work copied; and the market harm from the copying.

The complaint has evidence that could favor the Times on the fourth factor, which examines market harm.

By making copied Times articles available for free, the newspaper said, OpenAI and Microsoft’s copying threatens its subscription revenue. In November, the Times reported quarterly subscription revenue of $418.6 million.

Moreover, the Times argued, the existing licensing market for its articles means the paper is losing out on potentially millions of dollars a year from OpenAI’s copying. A license for a single article posted to a commercial website for a year can cost thousands of dollars, according to the complaint.

The companies have been in licensing negotiations since April, but the Times said they weren’t able to reach an agreement. OpenAI’s statement to Bloomberg News said the negotiations had “been productive and moving forward constructively, so we are surprised and disappointed” by the lawsuit.

Emory University’s Sag said the AI company may be able to counter claims about market harm by arguing that copying the articles was ultimately for a “transformative” purpose, to create a new and useful AI model.

OpenAI will also likely argue that the second fair use factor, the nature of the work being copied, favors the company because the copied material is news content. Facts can’t receive copyright protection, and that factor recognizes that articles based on real-world events tend to have thinner copyright protection than more creative works like fictional stories.

The examples of verbatim copying, however, could be damning, attorneys said. They show wholesale copying of “actual expression, the words strung together in sentences and paragraphs,” said Ed Klaris, an IP attorney and former in-house counsel for ABC and Conde Nast.

In a case this consequential, the result could ultimately be a matter of public policy and optics. The AI genie is out of the bottle, and OpenAI and Microsoft could use that to their advantage in the case, said Ziniti, the former Amazon attorney.

“The reality is, when a product is really good, and customers really want it, including judges, they’ll figure out a way to rule for you,” she said.

Susman Godfrey LLP and Rothwell Figg Ernst & Manbeck PC represent the New York Times.

The case is The New York Times Co. v. Microsoft Corp., S.D.N.Y., No. 1:23-cv-11195, 12/27/23.