I Don't Think That Number Means What You Think It Means: How to Not Screw Up Piracy Estimates - Disruptive Competition Project

I happened upon an ITIF announcement of a paper to be released tomorrow, which stated that “past research has shown that almost a quarter of global Internet traffic is attributable to copyright-infringing content, while online piracy costs the U.S. economy $22 billion annually.” The announcement pertains to a forthcoming paper on piracy by David Price, which may have been somewhat preempted by an impressively extensive research report on piracy released last week by the United Kingdom’s independent telecom regulator, Ofcom. In any event, it was the claim in the announcement that caught my attention, because it repeats a statistic that has previously been questioned.

The “quarter of Internet traffic” figure came from NBCUniversal-funded research, also by David Price, released in 2011 in support of the controversial Stop Online Piracy Act (SOPA). My co-blogger Rob Pegoraro, then reporting for the Washington Post, took issue both with what the study claimed and with what the MPAA claimed the study claimed. According to the MPAA, the study found virtually no lawful BitTorrent traffic: “Excluding pornography, Envisional project that 99.24% of all material on bittorrent was copyright infringing.” This figure is somewhat difficult to reconcile with last week’s UK Ofcom report, which found that among Internet users 12 and older, just 1.6% were responsible for 79% of infringed content.

The “quarter of all traffic” figure is even harder to reconcile with the fact that “World of Warcraft” game developer Blizzard, the Internet Archive, Linux distributors, NASA, Wikipedia, and artists like Counting Crows use BitTorrent to distribute authorized content, including software patches, public domain works, and open source code. A study that is being interpreted to suggest this doesn’t happen is therefore suspect. Rob’s Washington Post article noted that this error may have arisen from a selection bias inherent in the particular BitTorrent tracker used for the study, leading to unrepresentative results.

More broadly, measuring Internet activity by data volume can skew perceptions, since the size of different file types varies by orders of magnitude. A video file can be thousands of times larger than an audio file, and a mostly text file like an email is, of course, smaller than the average audio file by a similar margin. So if we look only at data volume, the problem of email spam largely disappears. Imagine that: we solved the problem of email spam, just by changing our metric. Of course, this makes no sense. It illustrates that using data volume as a proxy for activity volume skews observations toward data-intensive activity like video transmission, and away from less data-heavy activity like spam and identity theft. Ofcom’s research, incidentally, recognizes and accounts for this error by focusing on the number of allegedly infringing files, instead of data volume.
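To make the metric problem concrete, here is a minimal sketch that computes each category's share of activity two ways: by number of items and by bytes. Every number in it is hypothetical. The sizes are picked only so that each category is a thousand times larger than the previous one, and the counts are invented for illustration; none of them come from Ofcom, Envisional, or any other measurement.

```python
# Hypothetical item sizes in bytes, chosen only so that each category is
# 1,000 times larger than the one before it. Not measured values.
SIZES = {"email": 5_000, "audio": 5_000_000, "video": 5_000_000_000}

# Hypothetical number of items sent in some period. Also invented.
COUNTS = {"email": 1_000_000, "audio": 10_000, "video": 1_000}


def share_by_count(counts):
    """Return each category's fraction of the total number of items."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def share_by_volume(counts, sizes):
    """Return each category's fraction of the total bytes transferred."""
    volumes = {kind: n * sizes[kind] for kind, n in counts.items()}
    total = sum(volumes.values())
    return {kind: v / total for kind, v in volumes.items()}


by_count = share_by_count(COUNTS)
by_volume = share_by_volume(COUNTS, SIZES)

for kind in COUNTS:
    print(f"{kind}: {by_count[kind]:.2%} of items, {by_volume[kind]:.2%} of bytes")
```

With these made-up inputs, email is roughly 99% of the items but only about 0.1% of the bytes, while video is about 0.1% of the items and roughly 99% of the bytes. Nothing about the underlying activity changed between the two columns; only the yardstick did.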

The other item in this release that interested me is the $22 billion figure being characterized as a loss “to the economy” instead of “to industry.” I have taken issue with this error before, including here and here. Because the release gives no citation for this figure, the empirical basis of the claim cannot be determined, but I rather suspect it contains the oft-made error of attributing all industry losses from infringement to the macroeconomy, and overlooking that foreign infringement has a greater negative impact on GDP than domestic infringement does.

Several of the flaws in the 2011 study, and in how it was interpreted by its sponsors, could have been remedied. We will have to see whether tomorrow’s paper addresses the criticisms made of the previous iteration.