OpenAI’s Motion To Dismiss Highlights Just How Weak NYT’s Copyright Case Truly Is
from the not-living-up-to-the-times'-own-journalistic-standards dept
A few weeks ago, Prof. James Grimmelmann and (former Techdirt) journalist Tim Lee wrote a piece for Ars Technica arguing that the NY Times might win its copyright lawsuit against OpenAI. It’s no secret that I’m skeptical of the underpinnings of the lawsuit and think the NY Times is being silly in filing it, but there’s no question that the NY Times could win. Copyright law (as both Grimmelmann and Lee well know) ’tis a silly place, where judges will justify just about anything if they feel one party has been “wronged,” no matter what the law might say. The Supreme Court’s ruling in the Aereo case should always be a reminder of that. Sometimes copyright cases are decided on vibes, not the law.
The crux of the argument for why the NY Times could win is that the Times showed how it got OpenAI to regurgitate very similar versions of its stories, something lots of people commented on when the lawsuit was filed. However, as we noted in our analysis, it only did so by effectively limiting the potential output to such a narrow range of possibilities that a very near copy was about the only possible answer. The system is trained on lots and lots of input training data, but if you systematically use your prompt to say, in effect, “give me exactly this, and exclude every other possibility,” eventually an LLM may return something pretty close to what you asked for.
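To make that dynamic concrete, here’s a minimal sketch of what such a constrained prompt looks like in code. To be clear, this is my own illustration, not the Times’ actual methodology: the prompt wording, seed text, and model name are all hypothetical, and it assumes the official openai Python client (v1+) with an API key in the environment.

```python
# A hypothetical sketch of a heavily constrained "regurgitation" prompt.
# The prompt text and model name are illustrative assumptions, not
# details taken from the lawsuit or the motion to dismiss.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Seeding the prompt with the article's own opening (which the user
# already has) collapses the space of plausible completions.
seed_text = (
    "The first few sentences of the target article, pasted in "
    "verbatim by the user from a copy they already possess..."
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; the Times' exact setup isn't public
    messages=[
        {
            "role": "user",
            "content": (
                "Here is the start of an article:\n\n"
                f"{seed_text}\n\n"
                "Continue it word for word, exactly as originally written."
            ),
        },
    ],
    temperature=0,  # strip out sampling randomness entirely
)
print(response.choices[0].message.content)
```

With the temperature at zero and the article’s own opening supplied as a seed, the model’s plausible continuations narrow sharply toward the original text: the “exclude every other possibility” effect described above.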
This is why it seems that, if there is any infringement (or other legal violation), the liability should fall almost entirely on the prompter. They’re the ones using the tool in such a manner as to produce potentially violative works. We don’t blame the car company when a driver drives a car recklessly and causes damage. We blame the driver.
Either way, we now have OpenAI’s motion to dismiss in the case. While I’ve seen lots of people saying that OpenAI is claiming the NY Times “hacked” its system, and finding such an allegation laughable, the reality is (as usual) more nuanced and important to understand. The NY Times definitely had to do a bunch of gaming to get the outputs it wanted for the lawsuit, which weakens the critical claim that OpenAI’s tools magically undermine the value of a NY Times subscription.
As OpenAI points out, the claims in the NY Times’ complaint would not live up to the Times’ well-known journalistic standards, given just how misleading the complaint was:
The allegations in the Times’s Complaint do not meet its famously rigorous journalistic standards. The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.
This is where the “hacked” headlines come from. And, frankly, it’s a bit silly for OpenAI to call it a “hack.” The other points it raises are much more important. A key part of the Times’ lawsuit is the claim that, through prompt engineering, it could reproduce language similar (though not identical) to its articles, which would allow users to bypass the NY Times paywall (and subscription) and just have OpenAI generate the news for them.
But, as OpenAI notes, this makes no sense for a variety of reasons, including the sheer difficulty of consistently getting the tool to return anything remotely like that. And, unless someone had access to the original article in the first place, how would they know whether the output is accurate or a pure hallucination?
And that doesn’t even get into the fact that OpenAI generally isn’t doing real-time indexing in a way that would even allow users to access news in any sort of timely fashion.
OpenAI makes the obvious fair use argument, rightly highlighting how much of its business (and the wider AI space) has been built on the belief that reading/scanning publicly available content is obviously fair use, and that changing that would massively upend a whole industry. It even nods to the point I raised in my initial article about the lawsuit: the NY Times itself regularly relies on the kind of fair use it now claims doesn’t exist.
Indeed, it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use—a doctrine as important to the Times itself as it is to the American technology industry. Since Congress codified that doctrine in 1976, see H.R. Rep. No. 94-1476, at 65–66 (1976) (courts should “adapt” defense to “rapid technological change”), courts have used it to protect useful innovations like home video recording, internet search, book search tools, reuse of software APIs, and many others.
These precedents reflect the foundational principle that copyright law exists to control the dissemination of works in the marketplace—not to grant authors “absolute control” over all uses of their works. Google Books, 804 F.3d at 212. Copyright is not a veto right over transformative technologies that leverage existing works internally—i.e., without disseminating them—to new and useful ends, thereby furthering copyright’s basic purpose without undercutting authors’ ability to sell their works in the marketplace. See supra note 23. And it is the “basic purpose” of fair use to “keep [the] copyright monopoly within [these] lawful bounds.” Oracle, 141 S. Ct. at 1198. OpenAI and scores of other developers invested billions of dollars, and the efforts of some of the world’s most capable minds, based on these clear and longstanding principles.
It makes that point even more strongly a bit later:
To support its narrative, the Times claims OpenAI’s tools can “closely summarize[]” the facts it reports in its pages and “mimic[] its expressive style.” Compl. ¶ 4. But the law does not prohibit reusing facts or styles. If it did, the Times would owe countless billions to other journalists who “invest[] [] enormous amount[s] of time, money, expertise, and talent” in reporting stories, Compl. ¶ 32, only to have the Times summarize them in its pages.
The motion also highlights the kinds of games the Times had to play just to get the output it used for the complaint in the now-infamous Exhibit J, potentially including prompt instructions like “in the style of a NY Times journalist.” Again, this kind of prompt engineering uses the system to systematically limit the potential output, in an effort to craft something the user can then claim is infringing. GPT doesn’t just randomly spit these things out.
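The motion’s claim of “tens of thousands of attempts” suggests a workflow along the lines of the sketch below: my own reconstruction, not anything from Exhibit J, with hypothetical prompt wording, model name, and similarity threshold, again assuming the openai Python client.

```python
# A hypothetical reconstruction of the adversarial workflow the motion
# describes: try many style- and seed-constrained prompts, keep only
# outputs that substantially overlap the target article. The prompts,
# model, and 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()

# The tester must already have the article in order to measure overlap
# (and to seed the prompts), which undercuts the "paywall bypass" theory.
target_article = "Full text of the paywalled article the tester already has..."

prompt_variants = [
    "Write about this topic in the style of a NY Times journalist: ...",
    'Reproduce the NY Times article that begins: "..." word for word.',
    # ...in practice, thousands more hand-tuned variations
]

hits = []
for prompt in prompt_variants:
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    # Keep only near-copies; the overwhelming majority of attempts fail.
    if SequenceMatcher(None, out, target_article).ratio() > 0.8:
        hits.append((prompt, out))

print(f"{len(hits)} near-verbatim outputs out of {len(prompt_variants)} attempts")
```

Note that the filter itself requires the original article as input, which is the same circularity OpenAI flags: only someone who already has the text can tell a faithful reproduction from a hallucination.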
OpenAI highlights how many of the claimed “infringements” fall outside the three-year statute of limitations. As for the contributory infringement claims, they are equally ridiculous, because to make one you have to show that the defendant knew of users making use of the platform to infringe and somehow encouraged that behavior.
Here, the only allegation supporting the Times’s contributory claim states that OpenAI “had reason to know of the direct infringement by end-users” because of its role in “developing, testing, and troubleshooting” its products. Compl. ¶ 180. But “generalized knowledge” of “the possibility of infringement” is not enough. Luvdarts, 710 F.3d at 1072. The Complaint does not allege OpenAI “investigated or would have had reason to investigate” the use of its platform to create copies of Times articles. Popcornflix.com, 2023 WL 571522, at *6. Nor does it suggest that OpenAI had any reason to suspect this was happening. Indeed, OpenAI’s terms expressly prohibit such uses of its services. Supra note 8. And even if OpenAI had investigated, nothing in the Complaint explains how it might evaluate whether these outputs were acts of copyright infringement or whether their creation was authorized by the copyright holder (as they were here).
The complaint had also made a bunch of DMCA 1202 claims. That’s the part of the law that dings infringers for removing copyright management information (CMI). This (kinda silly) part of the law is basically designed as a tool to go after commercial infringers who would strip or hide a copyright notice from a work in order to resell it (e.g., on a DVD sold on a street corner). But that’s clearly not what’s happening here: the Times didn’t even say what CMI was removed.
Count V should be dismissed at the outset for failure to specify the CMI at issue. The Complaint’s relevant paragraph fails to state what CMI is included in what work, and simply repeats the statutory text. Compl. ¶ 182 (alleging “one or more forms of [CMI]” and parroting language of Section 1202(c)). The only firm allegation states that the Times placed “copyright notices” and “terms of service” links on “every page of its websites.” Compl. ¶ 125. But, at least for some articles, it did not. And when it did, the information was not “conveyed in connection with” the works, 17 U.S.C. § 1202(c) (defining CMI), but hidden in small text at the bottom of the page. Judge Orrick of the Northern District of California rejected similar allegations as deficient in another recent AI case. Andersen v. Stability AI Ltd., No. 23-cv-00201, 2023 WL 7132064, at *11 (N.D. Cal. Oct. 30, 2023) (must plead “exact type of CMI included in [each] work”).
Another key point: OpenAI argues that the parts of NY Times articles that showed up as close (but usually not exact) excerpts in GPT output can’t be dinged for CMI removal. If that were the law, it would expose tons of other organizations (including the NY Times itself) that quote or excerpt works without including the CMI:
Regardless, this “output” theory fails because the outputs alleged in the Complaint are not wholesale copies of entire Times articles. They are, at best, reproductions of excerpts of those articles, some of which are little more than collections of scattered sentences. Supra 12. If the absence of CMI from such excerpts constituted a “removal” of that CMI, then DMCA liability would attach to any journalist who used a block quote in a book review without also including extensive information about the book’s publisher, terms and conditions, and original copyright notice. See supra note 22 (example of the Times including 200-word block quote in book review).
And then there’s this tidbit:
Even setting that aside, the Times’s output-based CMI claim fails for the independent reason that there was no CMI to remove from the relevant text. The Exhibit J outputs, for example, feature text from the middle of articles. Ex. J. at 2–126. As shown in the exhibit, the “Actual text from NYTimes” contains no information that could qualify as CMI. See, e.g., id. at 3; 17 U.S.C. § 1202(c) (defining CMI). So too for the ChatGPT outputs featured in the Complaint, which request the “first [and subsequent] paragraph[s]” from Times articles. See, e.g., Compl. ¶¶ 104, 106, 118, 121. None of those “paragraphs” contains any CMI that OpenAI could have “removed.”
There’s some more in there, but I find it a very strong motion. That doesn’t mean that the case will get dismissed outright (remember, copyright land ‘tis a silly place), but it sure lays out pretty clearly how silly the examples in the Times lawsuit are and how weak their claims are as soon as you hold them up to the light.
Yes, in some rare circumstances, you can reproduce content that is kinda similar (but not exact) to copyrighted material if you tweak the prompts and effectively push the model to its extremes. But… as noted, if that’s the case, any liability should fall on the prompter, not the tool. And the NY Times can’t infringe on its own copyright.
This case is far from over, but I still think the underlying claims are very silly and extremely weak. Hopefully the court agrees.
Filed Under: cmi, copyright, dmca 1202, fair use, prompt engineering, statute of limitations
Companies: ny times, openai