Judge: Just Because AI Trains On Your Publication, Doesn’t Mean It Infringes On Your Copyright
from the that's-not-how-any-of-this-works dept
I get that a lot of people don’t like the big AI companies and how they scrape the web. But these copyright lawsuits being filed against them are absolute garbage. And you want that to be the case, because if it goes the other way, it will do real damage to the open web by further entrenching the largest companies. If you don’t like the AI companies, find another path, because copyright is not the answer.
So far, we’ve seen that these cases aren’t doing all that well, though many are still ongoing.
Last week, a judge tossed out one of the early ones against OpenAI, brought by Raw Story and Alternet.
Part of the problem is that these lawsuits assume, incorrectly, that these AI services really are, as some people falsely call them, “plagiarism machines.” The assumption is that they’re just copying everything and then handing out snippets of it.
But that’s not how it works. It is much more akin to reading all these works and then being able to make suggestions based on an understanding of how similar things kinda look, though from memory, not from having access to the originals.
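To make that distinction concrete, here’s a deliberately tiny sketch in Python. It’s a toy bigram model, nothing like a real LLM in scale or architecture (and the training text here is invented purely for illustration), but it shows the relevant property: once training is done, the original text is discarded, and generation works from learned statistics rather than by retrieving stored copies.

```python
import random
from collections import defaultdict

# Toy stand-in for "training": record which word tends to follow which.
# A real LLM learns billions of weights instead, but the key property
# is the same -- the model keeps statistics, not the documents.
training_text = (
    "the court found the claims unconvincing and the court dismissed "
    "the complaint because the plaintiffs alleged no cognizable injury"
)

followers = defaultdict(list)
words = training_text.split()
for prev, nxt in zip(words, words[1:]):
    followers[prev].append(nxt)

del training_text, words  # the "originals" are gone; only stats remain


def generate(start, length=10):
    out = [start]
    for _ in range(length):
        options = followers.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))  # sample, don't retrieve
    return " ".join(out)


print(generate("the"))  # e.g. "the court dismissed the complaint because ..."
```

None of this is how GPT actually works under the hood, of course, but it’s the right mental model for why “trained on” doesn’t automatically mean “copied and handed out.”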
Some of this case focused on whether or not OpenAI removed copyright management information (CMI) from the works it was trained on. This always felt like an extreme long shot, and the court finds Raw Story’s arguments wholly unconvincing, in part because Raw Story doesn’t point to any work that OpenAI distributed without its copyright management info.
For one thing, Plaintiffs are wrong that Section 1202 “grant[s] the copyright owner the sole prerogative to decide how future iterations of the work may differ from the version the owner published.” Other provisions of the Copyright Act afford such protections, see 17 U.S.C. § 106, but not Section 1202. Section 1202 protects copyright owners from specified interferences with the integrity of a work’s CMI. In other words, Defendants may, absent permission, reproduce or even create derivatives of Plaintiffs’ works – without incurring liability under Section 1202 – as long as Defendants keep Plaintiffs’ CMI intact. Indeed, the legislative history of the DMCA indicates that the Act’s purpose was not to guard against property-based injury. Rather, it was to “ensure the integrity of the electronic marketplace by preventing fraud and misinformation,” and to bring the United States into compliance with its obligations to do so under the World Intellectual Property Organization (WIPO) Copyright Treaty, art. 12(1) (“Obligations concerning Rights Management Information”) and WIPO Performances and Phonograms Treaty….

Moreover, I am not convinced that the mere removal of identifying information from a copyrighted work – absent dissemination – has any historical or common-law analogue.
Then there’s the bigger point, which is that the judge, Colleen McMahon, has a better understanding of how ChatGPT works than the plaintiffs and notes that just because ChatGPT was trained on pretty much the entire internet, that doesn’t mean it’s going to infringe on Raw Story’s copyright:
Plaintiffs allege that ChatGPT has been trained on “a scrape of most of the internet,” Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.
Finally, the judge basically says, “Look, I get it, you’re upset that ChatGPT read your stuff, but you don’t have an actual legal claim here.”
Let us be clear about what is really at stake here. The alleged injury for which Plaintiffs truly seek redress is not the exclusion of CMI from Defendants’ training sets, but rather Defendants’ use of Plaintiffs’ articles to develop ChatGPT without compensation to Plaintiffs. See Compl. ¶ 57 (“The OpenAI Defendants have acknowledged that use of copyright-protected works to train ChatGPT requires a license to that content, and in some instances, have entered licensing agreements with large copyright owners … They are also in licensing talks with other copyright owners in the news industry, but have offered no compensation to Plaintiffs.”). Whether or not that type of injury satisfies the injury-in-fact requirement, it is not the type of harm that has been “elevated” by Section 1202(b)(i) of the DMCA. See Spokeo, 578 U.S. at 341 (Congress may “elevate to the status of legally cognizable injuries, de facto injuries that were previously inadequate in law.”). Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.
While the judge dismisses the case without prejudice, meaning they can try again, it would appear that she is skeptical they could do so with any reasonable chance of success:
In the event of dismissal Plaintiffs seek leave to file an amended complaint. I cannot ascertain whether amendment would be futile without seeing a proposed amended pleading. I am skeptical about Plaintiffs’ ability to allege a cognizable injury but, at least as to injunctive relief, I am prepared to consider an amended pleading.
I totally get why publishers are annoyed and why they keep suing. But copyright is the wrong tool for the job. Hopefully, more courts will make this clear and we can get past all of these lawsuits.
Filed Under: ai, cmi, copyright, dmca, generative ai, reading
Companies: alternet, openai, raw story
OpenAI’s Motion To Dismiss Highlights Just How Weak NYT’s Copyright Case Truly Is
from the not-living-up-to-the-times'-own-journalistic-standards dept
A few weeks ago, Prof. James Grimmelmann and (former Techdirt) journalist Tim Lee wrote a piece for Ars Technica, stating why the NY Times might win its copyright lawsuit against OpenAI. It’s no secret that I’m skeptical of the underpinnings of the lawsuit and think the NY Times is being silly in filing it, but I don’t think there’s any question that the NY Times could win. Copyright law (as both Grimmelmann and Lee well know) ‘tis a silly place, where judges will justify just about anything if they feel one party has been “wronged” no matter what the law might say. The Supreme Court’s ruling in the Aereo case should always be a reminder of that. Sometimes copyright cases are decided on vibes and not the law.
The main crux of the argument for why the NY Times could win is that the NYT showed how it got OpenAI to regurgitate very similar versions of its stories, as lots of people commented on regarding the lawsuit. However, as we noted in our analysis, it only did so by effectively limiting the potential output to such a narrow range of possibilities that a very near copy was about the only possible answer. The system is trained on lots and lots of input data, but if you systematically use your prompt to say, in effect, “give me exactly this, and exclude every other possibility,” eventually an LLM may return something close to what you asked for.
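To illustrate that “narrowing” effect, here’s a minimal sketch in Python (again a toy n-gram model, not GPT, with invented training text): if a model has effectively memorized a distinctive passage, feeding it a long, unique prefix from that passage and always taking the most likely next word leaves near-verbatim regurgitation as essentially the only available path.

```python
from collections import defaultdict

# Toy trigram model "trained" on a single distinctive passage.
article = (
    "in a stunning reversal the committee voted to reject the proposal "
    "after months of heated debate over its projected costs"
)

counts = defaultdict(lambda: defaultdict(int))
words = article.split()
for a, b, c in zip(words, words[1:], words[2:]):
    counts[(a, b)][c] += 1  # continuations seen after each 2-word context


def continue_greedily(prompt, max_words=25):
    out = prompt.split()
    for _ in range(max_words):
        options = counts.get(tuple(out[-2:]))
        if not options:
            break
        # Always take the single most likely continuation: the prompt has
        # constrained the model so hard that copying is the only path left.
        out.append(max(options, key=options.get))
    return " ".join(out)


# A unique prefix from the "article" all but forces verbatim output:
print(continue_greedily("in a stunning reversal"))
```

The point isn’t that this is literally what the NYT’s testers did, only that the harder a prompt pins down the context, the more the “most likely” output collapses onto the memorized source.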
This is why it seems that, if there is any infringement (or other legal violation), the liability should fall almost entirely on the prompter. They’re the ones using the tool in a manner designed to produce potentially violative works. We don’t blame the car company when a driver drives recklessly and causes damage. We blame the driver.
Either way, we now have OpenAI’s motion to dismiss in the case. While I’ve seen lots of people saying that OpenAI is claiming the NY Times “hacked” its system and finding such an allegation laughable, the reality is (as usual) more nuanced and important to understand. The NY Times definitely had to do a bunch of gaming to get the outputs it wanted for the lawsuit, which undercuts the central claim that OpenAI’s tools magically undermine the value of a NY Times subscription.
As OpenAI points out, the claims in the NY Times’ complaint would not live up to the Times’ well-known journalistic standards, given just how misleading the complaint was:
The allegations in the Times’s Complaint do not meet its famously rigorous journalistic standards. The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.
This is where the “hacked” headlines come from. And, frankly, claiming it’s a “hack” is a bit silly for OpenAI. The other points it’s raising are much more important. A key part of the Times’ lawsuit is claiming that because of their prompt engineering, they could reproduce similar (though not exact) language to articles, which would allow users to bypass a NY Times paywall (and subscription) to just have OpenAI generate the news for them.
But, as OpenAI notes, this makes no sense for a variety of reasons, including the sheer difficulty of getting the tool to consistently return anything remotely like that. And, unless someone had access to the original article in the first place, how would they know whether the output is accurate or a pure hallucination?

And that doesn’t even get into the fact that OpenAI generally isn’t doing real-time indexing in a way that would even allow users to access news in any sort of timely manner.
OpenAI makes the obvious fair use argument, rightly highlighting how much of its business (and the wider AI space) has been built on the belief that reading/scanning publicly available content is obviously fair use, and that changing that would massively upend a whole industry. It even makes a nod to the point I raised in my initial article about the lawsuit: the NY Times itself regularly relies on the kind of fair use it now claims doesn’t exist.
Indeed, it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use—a doctrine as important to the Times itself as it is to the American technology industry. Since Congress codified that doctrine in 1976, see H.R. Rep. No. 94-1476, at 65–66 (1976) (courts should “adapt” defense to “rapid technological change”), courts have used it to protect useful innovations like home video recording, internet search, book search tools, reuse of software APIs, and many others.
These precedents reflect the foundational principle that copyright law exists to control the dissemination of works in the marketplace—not to grant authors “absolute control” over all uses of their works. Google Books, 804 F.3d at 212. Copyright is not a veto right over transformative technologies that leverage existing works internally—i.e., without disseminating them—to new and useful ends, thereby furthering copyright’s basic purpose without undercutting authors’ ability to sell their works in the marketplace. See supra note 23. And it is the “basic purpose” of fair use to “keep [the] copyright monopoly within [these] lawful bounds.” Oracle, 141 S. Ct. at 1198. OpenAI and scores of other developers invested billions of dollars, and the efforts of some of the world’s most capable minds, based on these clear and longstanding principles.
It makes that point even more strongly a bit later:
To support its narrative, the Times claims OpenAI’s tools can “closely summarize[]” the facts it reports in its pages and “mimic[] its expressive style.” Compl. ¶ 4. But the law does not prohibit reusing facts or styles. If it did, the Times would owe countless billions to other journalists who “invest[] [] enormous amount[s] of time, money, expertise, and talent” in reporting stories, Compl. ¶ 32, only to have the Times summarize them in its pages
The motion also highlights the kinds of games the Times had to play just to get the output it used for the complaint in the now infamous Exhibit J, such as including things in the prompt like “in the style of a NY Times journalist.” Again, this kind of prompt engineering is basically using the system to systematically narrow the potential output, in an effort to craft something the user can then claim is infringing. GPT doesn’t just randomly spit these things out.
OpenAI highlights how many of the claimed “infringements” fall outside the three-year statute of limitations. The contributory infringement claims are equally ridiculous, because to make them stick you have to show that the defendant knew of users making use of the platform to infringe and somehow encouraged that behavior.
Here, the only allegation supporting the Times’s contributory claim states that OpenAI “had reason to know of the direct infringement by end-users” because of its role in “developing, testing, and troubleshooting” its products. Compl. ¶ 180. But “generalized knowledge” of “the possibility of infringement” is not enough. Luvdarts, 710 F.3d at 1072. The Complaint does not allege OpenAI “investigated or would have had reason to investigate” the use of its platform to create copies of Times articles. Popcornflix.com, 2023 WL 571522, at *6. Nor does it suggest that OpenAI had any reason to suspect this was happening. Indeed, OpenAI’s terms expressly prohibit such uses of its services. Supra note 8. And even if OpenAI had investigated, nothing in the Complaint explains how it might evaluate whether these outputs were acts of copyright infringement or whether their creation was authorized by the copyright holder (as they were here).
The complaint had also made a bunch of DMCA 1202 claims. That’s the part of the law that dings infringers for removing copyright management info (CMI). This (kinda silly) part of the law is basically designed as a tool to go after commercial infringers who would strip or hide a copyright notice from a work in order to resell it (e.g., on a DVD sold on a street corner or something). But clearly that’s not what’s happening here. Here, the Times didn’t even say what CMI was removed.
Count V should be dismissed at the outset for failure to specify the CMI at issue. The Complaint’s relevant paragraph fails to state what CMI is included in what work, and simply repeats the statutory text. Compl. ¶ 182 (alleging “one or more forms of [CMI]” and parroting language of Section 1202(c)). The only firm allegation states that the Times placed “copyright notices” and “terms of service” links on “every page of its websites.” Compl. ¶ 125. But, at least for some articles, it did not. And when it did, the information was not “conveyed in connection with” the works, 17 U.S.C. § 1202(c) (defining CMI), but hidden in small text at the bottom of the page. Judge Orrick of the Northern District of California rejected similar allegations as deficient in another recent AI case. Andersen v. Stability AI Ltd., No. 23-cv-00201, 2023 WL 7132064, at *11 (N.D. Cal. Oct. 30, 2023) (must plead “exact type of CMI included in [each] work”).
Another key point: OpenAI argues that the close (but usually not exact) excerpts of NY Times articles that showed up in GPT output can’t support a claim of CMI removal. If that were the law, it would expose tons of other organizations (including the NY Times itself) that quote or excerpt works without including the CMI:
Regardless, this “output” theory fails because the outputs alleged in the Complaint are not wholesale copies of entire Times articles. They are, at best, reproductions of excerpts of those articles, some of which are little more than collections of scattered sentences. Supra 12. If the absence of CMI from such excerpts constituted a “removal” of that CMI, then DMCA liability would attach to any journalist who used a block quote in a book review without also including extensive information about the book’s publisher, terms and conditions, and original copyright notice. See supra note 22 (example of the Times including 200-word block quote in book review).
And then there’s this tidbit:
Even setting that aside, the Times’s output-based CMI claim fails for the independent reason that there was no CMI to remove from the relevant text. The Exhibit J outputs, for example, feature text from the middle of articles. Ex. J. at 2–126. As shown in the exhibit, the “Actual text from NYTimes” contains no information that could qualify as CMI. See, e.g., id. at 3; 17 U.S.C. § 1202(c) (defining CMI). So too for the ChatGPT outputs featured in the Complaint, which request the “first [and subsequent] paragraph[s]” from Times articles. See, e.g., Compl. ¶¶ 104, 106, 118, 121. None of those “paragraphs” contains any CMI that OpenAI could have “removed.”
There’s some more in there, but I find it a very strong motion. That doesn’t mean that the case will get dismissed outright (remember, copyright land ‘tis a silly place), but it sure lays out pretty clearly how silly the examples in the Times lawsuit are and how weak their claims are as soon as you hold them up to the light.
Yes, in some rare circumstances, you can get the model to reproduce content that is kinda similar (but not exact) to copyright-covered material if you tweak the prompts and effectively push the model to its extremes. But, as noted, if that happens, it still feels like any liability should be on the prompter, not the tool. And the NY Times can’t infringe on its own copyright.
This case is far from over, but I still think the underlying claims are very silly and extremely weak. Hopefully the court agrees.
Filed Under: cmi, copyright, dmca 1202, fair use, prompt engineering, statute of limitations
Companies: ny times, openai