tdm – Techdirt

Stories filed under: "tdm"

German Court: LAION’s Generative AI Training Dataset Is Legal Thanks To EU Copyright Exceptions

from the one-good-ruling dept

The copyright world is currently trying to assert its control over the new world of generative AI through a number of lawsuits, several of which have been discussed previously on Walled Culture. We now have our first decision in this area, from the regional court in Hamburg. Andres Guadamuz has provided an excellent detailed analysis of a ruling that is important for the German judges’ discussion of how EU copyright law applies to various aspects of generative AI. The case concerns the freely-available dataset from LAION (Large-scale Artificial Intelligence Open Network), a German non-profit. As the LAION FAQ says: “LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images.” Guadamuz explains:

The case was brought by German photographer Robert Kneschke, who found that some of his photographs had been included in the LAION dataset. He requested the images to be removed, but LAION argued that they had no images, only links to where the images could be found online. Kneschke argued that the process of collecting the dataset had included making copies of the images to extract information, and that this amounted to copyright infringement.

LAION admitted making copies, but said that it was in compliance with the exception for text and data mining (TDM) present in German law, which is a transposition of Article 3 of the 2019 EU Copyright Directive. The German judges agreed:

The court argued that while LAION had been used by commercial organisations, the dataset itself had been released to the public free of charge, and no evidence was presented that any commercial body had control over its operations. Therefore, the dataset is non-commercial and for scientific research. So LAION’s actions are covered by section 60d of the German Copyright Act.

That’s good news for LAION and its dataset, but perhaps more interesting for the general field of generative AI is the court’s discussion of how the EU Copyright Directive and its exceptions apply to AI training. It’s a key question because copyright companies claim that they don’t, and that when such training involves copyright material, permission is needed to use it. Guadamuz summarizes that point of view as follows:

the argument is that the legislators didn’t intend to cover generative AI when they passed the [EU Copyright Directive], so text and data mining does not cover the training of a model, just the making of a copy to extract information from it. The argument is that making a copy to extract information to create a dataset is fine, as the court agreed here, but the making of a copy in order to extract information to make a model is not. I somehow think that this completely misses the way in which a model is trained; a dataset can have copies of a work, or in the case of LAION, links to the copies of the work. A trained model doesn’t contain copies of the works with which it was trained, and regurgitation of works in the training data in an output is another legal issue entirely.

The judgment from the Hamburg court says that while legislators may not have been aware of generative AI model training in 2019, when they drew up the EU Copyright Directive, they certainly are now. The judges use the EU’s 2024 AI Act as evidence of this, citing a paragraph that makes explicit reference to AI models complying with the text and data mining regulation in the earlier Copyright Directive.

As Guadamuz writes in his post, this is an important point, but the legal impact may be limited. The judgment is only the view of a regional German court, so other jurisdictions may produce different results. Moreover, the original plaintiff Robert Kneschke may appeal, and a higher court could overturn the decision. Furthermore, the ruling only concerns the use of text and data mining to create a training dataset, not the actual training itself, although the judges’ thoughts on the latter indicate that it would be legal too. In other words, this local outbreak of good sense in Germany is welcome, but we are still a long way from complete legal clarity on the training of generative AI systems on copyright material.

Follow me @glynmoody on Mastodon and on Bluesky. Originally posted to Walled Culture.

Filed Under: ai, copyright, copyright directive, germany, hamburg, laion, reading, robert kneschke, tdm, text and data mining, training
Companies: laion

It's Time To End The Anti-Circumvention Exemption Circus

from the pleading-for-your-rights dept

Copyright as we know it goes back to the Statute of Anne of 1710. A law that old is clearly going to struggle to cope with the enormous changes in technology that have taken place since then – notably the Internet. But even relatively recent copyright laws were framed in ways that have become unworkable for the digital world we live in.

For example, arguably one of the most important pieces of recent legislation in this area is the Digital Millennium Copyright Act (DMCA) in the US, and its sibling, the EU’s Copyright Directive (EUCD). Both are wide-ranging, affecting many aspects of copyright, and a particularly problematic aspect of both concerns anti-circumvention. The DMCA and EUCD prohibit the bypassing of any “technical protection measure” (TPM) used to protect works under copyright. That typically means the much-hated Digital Rights Management (DRM), which aims to control who can do what with copyright material, and thus often gets in the way of people enjoying material that they have paid for.

The DMCA and EUCD introduced severe penalties for circumventing any such TPM, no matter how weak it is, and no matter how reasonable the need to do so may be. As a tiny recognition of this lack of proportion, Section 1201 of the DMCA, which contains the anti-circumvention ban, also provides a mechanism for giving people permission to circumvent protection:

The Digital Millennium Copyright Act (DMCA), codified in part in Title 17, section 1201, of the United States Code, generally makes it unlawful to circumvent technological measures used to prevent unauthorized access to copyrighted works, including copyrighted books, movies, videos, video games, and computer software. Section 1201, however, also directs the Librarian of Congress, upon the recommendation of the Register of Copyrights following a rulemaking proceeding, to determine whether the prohibition on circumvention is having, or is likely to have, an adverse effect on users’ ability to make noninfringing uses of particular classes of copyrighted works. Upon such a determination, the Librarian may adopt limited temporary exemptions waiving the general prohibition against circumvention for such users for the ensuing three-year period.

The new temporary exemptions have just been announced. One of them is to allow copy protections on ebooks to be circumvented so that the visually impaired can use technology to help them access the texts. Wired reports:

The victory is tainted somewhat by the struggle it represents. Although the exemption protects people who circumvent digital copyright protections for the sake of accessibility – by using third-party programs to lift text and save it in a different file format, for example – that it’s even necessary strikes many as a fundamental injustice.

Another exemption concerns text and data mining (TDM) – using automated techniques to analyze typically large quantities of text and data in order to find patterns and trends there. This kind of work allows existing facts to generate new ones, but often requires circumventing publishers’ TPMs. The Authors Alliance reported on a new exemption to section 1201 of the DMCA that would permit TDM by researchers affiliated with academic institutions. But even here, there were restrictions that made it less useful than it could have been:

In the recommendation, Register Perlmutter also recommended adding a limitation that “circumvention be permitted only on copies of the copyrighted works that were lawfully acquired and that the institution owns or for which it has a non-time-limited license,” and should not be permitted on works the institution had “rented or borrowed.” This limitation has the potential to complicate the usability of the exemption with regards to TDM research on e-books: because e-books are generally licensed rather than owned, whether the exemption will permit TDM research on a certain e-book will depend on the terms of the license for that e-book.

Finally, there is an example of how outdated copyright laws are seriously hampering what is manifestly a legal activity: repairing digital equipment. Because software is involved, and that code is often protected, it has hitherto been illegal in the US to repair many digital devices. The new exemption will now allow circumvention for the purpose of repair, but with a major restriction, as explained by the iFixit site. The new exemption does not allow you to distribute repair tools that circumvent manufacturers’ TPMs:

Without access to shared tools, the exemptions are largely academic. Right now, if you want to use this new exemption to repair your Xbox, you’re going to have to whittle your own optical drive unlocking app or device from scratch. That just doesn’t scale – most gamers are not security engineers.

It should not be necessary to beg every three years for limited and flawed exemptions, like the ones above, to overly-strong copyright laws; instead, people should have a right to carry out these perfectly reasonable activities. The DMCA and EUCD need to be amended, and their logic inverted, so that circumventing a TPM is always permitted, except when it is for an illegal purpose.

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.

Originally published to the Walled Culture blog.

Filed Under: dmca, dmca 1201, drm, exemptions, research, tdm, technical protection measures, triennial review

Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India

from the sci-hub-to-the-rescue-again dept

Carl Malamud is one of Techdirt’s heroes. We’ve been writing about his campaign to liberate US government documents and information for over ten years now. The journal Nature has a report on a new project of his, which is in quite a different field: academic knowledge. The idea will be familiar to readers of this site: to carry out text and data mining (TDM) on millions of academic articles, in order to discover new knowledge. It’s a proven technique with huge potential to produce important discoveries. That raises the obvious question: if large-scale TDM of academic papers is so powerful, why hasn’t it been done before? The answer, as is so often the case, is that copyright gets in the way. Academic publishers use it to control and impede how researchers can help humanity:

[Malamud’s] unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control — and often limit — the speed and scope of such projects, which typically confine themselves to abstracts, not full text.

Malamud’s project gets around the limitations imposed by copyright and publishers thanks to two unique features. First, Malamud “had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub”. Drawing on Sci-Hub’s huge holdings means his project doesn’t need to go begging to publishers in order to obtain full texts to be mined. Secondly, Malamud is basing his project in India:

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.

India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud’s contention is that this allows him to mine academic material in India without the permission of publishers. But he also believes that his TDM project would be legal in the US:

The data mining, he says, is non-consumptive: a technical term meaning that researchers don’t read or display large portions of the works they are analysing. “You cannot punch in a DOI [article identifier] and pull out the article,” he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.
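To make the idea of “non-consumptive” mining concrete, here is a minimal sketch in Python. It is not Malamud’s actual system: the tiny corpus, the document identifiers and the function name are all hypothetical, invented purely for illustration. The point is only that the analysis returns aggregate counts, and never hands back the text of any article.

```python
from collections import Counter
import re

# Hypothetical stand-ins for full-text articles (assumption: a real corpus
# would hold millions of papers, not two short strings).
CORPUS = {
    "doc-1": "Protein A binds protein B. Protein A is linked to disease X.",
    "doc-2": "Disease X is associated with gene G and protein A.",
}


def term_document_counts(corpus):
    """Count how many documents mention each word.

    Only aggregate counts are returned; the text of the documents is
    never exposed to the caller.
    """
    counts = Counter()
    for text in corpus.values():
        # Use a set so each word is counted at most once per document.
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(words)
    return counts


counts = term_document_counts(CORPUS)
print(counts.most_common(5))
```

A researcher using something like this learns which terms appear across the literature, and how often, but cannot “punch in” an identifier and pull out the article.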

The fact that TDM is “non-consumptive” means that the unhelpful attitude of academic publishers is even more unjustified than usual. They lose nothing from the analytical process, which is merely extracting knowledge. But from a sense of entitlement publishers still demand to be paid for unrestricted computer access to texts that have already been licensed by academic institutions anyway. That selfish and obstructive attitude to TDM may be about to backfire spectacularly. The Nature article notes:

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

The thing is, if anyone were by any chance interested in reading the full text, there’s an obvious place to turn to. After all, the mining is carried out using papers held by Sci-Hub, so?

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.

Filed Under: academic papers, carl malamud, copyright, india, journals, research, science, tdm, text and data mining
Companies: sci-hub

Proposed Update To Singapore's Copyright Laws Surprisingly Sensible

from the EU-should-look-and-learn dept

Techdirt writes plenty about copyright in the US and EU, and any changes to the respective legislative landscapes. But it’s important to remember that many other countries around the world are also trying to deal with the tension between copyright’s basic aim to prevent copying, and the Internet’s underlying technology that facilitates it. Recently, we covered the copyright reform process in South Africa, where some surprisingly good things have been happening. Now it seems that Singapore may bring in a number of positive changes to its copyright legislation. One of the reasons for that is the very thorough consultative process that was undertaken, explained here by Singapore’s Ministry of Law:

The proposed changes are made, following an extensive three-year review and two rounds of public consultations conducted from August to November 2016 and May to June 2017 respectively. Three public Town Halls and ten engagement sessions with various stakeholder groups, including consumer, industry and trade associations, businesses, intellectual property practitioners and academics were held. Close to 100 formal submissions and more than 280 online feedback forms were received.

The full 70-page report (pdf) spells out the questions asked during that review, the answers received, and the government’s proposals. The Ministry of Law’s press release lists some of the main changes it wants to make. One of the most welcome is a new exception for text and data mining (TDM) for the purpose of analysis:

Today, people who use automated techniques to analyse text, data and other content to generate insights risk infringing copyright as they typically require large scale copying of works without permission. It is proposed that a new exception be established to allow copying of copyrighted materials for the purpose of data analysis, where the user has lawful access to the materials that are copied. This will promote applications of data analytics and big data across a gamut of industries, unlocking new business opportunities, speeding up processes, and reducing costs for all.

Importantly, Singapore’s proposed new TDM exception applies to everyone — including big businesses. That’s unlike the corresponding Article 3 in the EU’s awful Copyright Directive, currently working its way through the legislative process, which imposes an unnecessary restriction that more or less guarantees the European Union will be a backwater in this fast-growing area. An obvious but wise move by Singapore is the proposal for an enhanced copyright exception for educational purposes:

Non-profit schools and their students will be able to use online resources that are accessible without payment, for instruction purposes. This will be in addition to their existing exceptions which generally cover only copying of a portion of a work. The enhancement will facilitate instruction and make it easier for teachers and students to use online materials in classes. For example, teachers and students will be able to use various audio-visual materials (e.g. videos, pictures) found online for their classroom lessons and project presentations. They will also be able to share those materials, or lessons and project presentations which have included those materials, on student learning portals for other schools to view. Online resources that require payment will not be covered by this exception.

Another suggested exception is for non-profit galleries, libraries, archives, and museums (GLAMs) to make copies for exhibition purposes. Also useful for GLAMs is a new limit on the protection given to unpublished works. This will stand at life plus 70 years for literary and artistic works, just as for published versions. The GLAM exceptions will be protected from contract override, as will the text and data mining exception. That’s important, because it means that copyright owners cannot nullify the new exceptions by insisting organizations sign contracts that waive them. Individual creators receive new rights too:

the report proposes that creators be given a new right to be attributed as the creator of their work, regardless of whether they still own or have sold the copyright. For example, anyone using a work publicly, such as posting it on the internet, will have to acknowledge the creator of the work. This will accord creators due recognition and allow them to build their reputation over time. Currently, they do not need to be attributed as the creator of their work when others use it.

This is essentially a moral right alongside the usual economic ones. As the Wikipedia page on the subject explains, the degree to which moral rights exist for creators of copyright works varies enormously around the world. In France, for example, moral rights are perpetual and inalienable, whereas in the US they are less to the fore. Singapore’s Ministry of Law also proposes that where rights have not been explicitly signed away in a contract, they remain with the creator. Although that will prevent naive creators being tricked out of their rights, it won’t apply to work created by employees: there, it’s employers who will continue to retain rights. As for enforcing copyright, there is the following:

the report proposes that new enforcement measures be made available to copyright owners to deter retailers and service providers from profiting off providing access to content from unauthorised sources, such as through the sale of set-top boxes that enable access to content from unauthorised sources, also commonly known as grey boxes or illicit streaming devices. The measures, which are absent today, will make clear that acts such as the import and sale of such devices are prohibited.

This is clearly aimed at Kodi boxes, which are currently one of the main targets of the entertainment industry. To its credit, the Ministry of Law’s proposal does include important additional requirements for the measures to apply:

the product can be used to access audio-visual content from an unauthorised source and additionally must be:

designed or made primarily for providing access to such content,

advertised as providing access to such content, or

sold as providing access to such content, where the retailer sells a generic device with the understanding that “add-on” services such as the provision of website links, instructions or installation of subscription services will subsequently be provided.

At least that makes a clear distinction between basic Kodi boxes, and those specifically built and sold with a view to providing unauthorized access to materials. That understanding of the difference is of a piece with the rest of the legislation, which is unusually intelligent. Other governments could learn from that, and from the overall thrust of the proposals to move Singapore’s copyright law towards a fair use system similar to that of the US — something that is fiercely resisted elsewhere.

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: copyright, moral rights, singapore, tdm, text and data mining, user rights