
Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard

from the public-domain-impossibility-theorem dept

In response to a number of copyright lawsuits over AI training datasets, we are starting to see efforts to build ‘non-infringing’ collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US, and is therefore inherently ‘non-infringing’, I find these efforts to build ‘safe’ or ‘clean’ (or whatever other word one might use) datasets quite interesting. One reason is that they help illustrate why building such a dataset at scale is such a challenge.

That’s why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate over 37 million “public domain and CC0 images integrated from dozens of libraries and museums.” That’s far fewer images than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.

However, it didn’t take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.

The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

[Image: photograph of a library reading room full of patrons, shot from above]

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.

Clicking through to the Wikimedia Commons page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 17, 2016.

Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but it imposes restrictions of its own.

Which is to say, not CC0.

However, further digging into the Wikipedia page about Pixabay suggests that images uploaded to Pixabay before January 9, 2019 were actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image’s Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image’s Wikimedia page).
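To make the chain concrete: the whole license determination ultimately reduces to a single date comparison. Here is a minimal sketch of that logic; the helper function is hypothetical, and it assumes the upload date in the metadata can be trusted, which is exactly the fragile part:

```python
from datetime import date

# Pixabay switched from CC0 to its own Content License on this date
# (per Section 4 of the Pixabay terms discussed above).
PIXABAY_CC0_CUTOFF = date(2019, 1, 9)

def pixabay_license(upload_date: date) -> str:
    """Return the nominal license for a Pixabay image based on its upload date.

    Hypothetical helper: a real provenance check would also need to verify
    that the upload date itself is accurate.
    """
    if upload_date < PIXABAY_CC0_CUTOFF:
        return "CC0"
    return "Pixabay Content License"

# The image in question was uploaded May 17, 2016, so it is nominally CC0.
print(pixabay_license(date(2016, 5, 17)))  # -> CC0
```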

All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!

At the same time, the accuracy of that status feels a bit fragile. That fragility is fine in the context of Wikipedia, or if you are looking for a handful of openly licensed images. But is it likely to hold up at training-set scale, across tens of millions of images? Maybe? What does it mean to be ‘good enough’ in this case? If trainers do require permission from rightsholders to train, and one relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?

Michael Weinberg is the Executive Director of NYU’s Engelberg Center for Innovation Law and Policy. This post is republished from his blog under its CC BY-SA 4.0 license. Hero Image: Interieur van de Bodleian Library te Oxford

Filed Under: ai, copyright, public domain, training data

An Only Slightly Modest Proposal: If AI Companies Want More Content, They Should Fund Reporters, And Lots Of Them

from the so-stupid-it-could-work? dept

In “A Modest Proposal,” Jonathan Swift satirized politicians who were out of touch and treated the poor as an inconvenience rather than as a sign of human suffering and misery. He took what those politicians saw as two big problems and offered an obviously barbaric solution to both: letting the poor sell their children as food. The proposal was designed to highlight the barbaric framing of the “problem” by the Irish elite.

But sometimes there really are two very real problems (not of a Swiftian nature) that might be combined in a way that solves them both. And thus I present a non-Swiftian modest proposal: AI companies desperate for high-quality content should create funds that pay journalists to produce high-quality content the AI companies can use for training.

Lately, there have been multiple news articles about how desperate the AI companies are for fresh data to feed the voracious and insatiable training machine. The Wall Street Journal noted that “the internet is too small” for AI companies.

Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans.

Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies.

Some executives and researchers say the industry’s need for high-quality text data could outstrip supply within two years, potentially slowing AI’s development.

The problem is not just data, but high-quality data, as that report notes. AI systems need to be trained on well-written, useful content:

Most of the data available online is useless for AI training because it contains flaws such as sentence fragments or doesn’t add to a model’s knowledge. Villalobos estimated that only a sliver of the internet is useful for such training—perhaps just one-tenth of the information gathered by the nonprofit Common Crawl, whose web archive is widely used by AI developers.
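To give a flavor of what “useless for AI training” means in practice, here is a toy sketch of the kind of heuristic quality filtering a data pipeline might apply. The thresholds are invented for illustration; real pipelines filtering sources like Common Crawl use far more elaborate heuristics and model-based classifiers:

```python
def looks_trainable(text: str) -> bool:
    """Toy quality filter: reject snippets that are too short or fragmentary.

    Illustrative only -- the cutoffs below are made up for this example.
    """
    words = text.split()
    if len(words) < 20:  # too short to carry much signal
        return False
    sentences = [s for s in text.split(".") if s.strip()]
    avg_words_per_sentence = len(words) / max(len(sentences), 1)
    return avg_words_per_sentence >= 5  # crude sentence-fragment check

print(looks_trainable("Click here. Buy now. Sale."))  # False
print(looks_trainable(
    "The internet might be too small for the plans of companies racing "
    "to build ever more powerful AI systems, straining the pool of "
    "quality public data available online."))  # True
```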

The NY Times also published a similar-ish story, though it framed the situation in a much more nefarious light, arguing that the AI companies were “cutting corners to harvest data for AI” systems. However, what the Times actually means is that AI companies believe (correctly, in my opinion) that they have a very strong fair use argument for training on whatever data they can find.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

I’ve discussed the copyright arguments repeatedly, including why I think the AI companies are correct that training on copyright-covered works shouldn’t be infringing. I also think the rush to rely on copyright as a solution here is problematic. Doing so would only enrich big tech, since smaller companies and open source systems wouldn’t be able to keep up. Also, requiring all training to be licensed would effectively break the open internet, by creating a new “license to read.” This would be bad.

But, all of this is coming at the same time that journalism is in peril. We’re hearing stories of news orgs laying off tons of journalists, or publications shutting down entirely. There are stories of “news deserts,” and of corruption increasing as news orgs continue to fail.

The proposed solutions to this very real problem have been very, very bad. Link taxes are even more destructive to the open web and don’t actually appear to work very well.

But… that doesn’t mean there isn’t a better solution. If the tech companies need good, well-written content to fill their training systems, and the world needs good, high-quality journalism, why don’t the big AI companies agree to start funding journalists and solve both problems in one move?

This may sound similar to the demands of licensing works, but I’m not talking about past works. Those works are out there. I’m talking about paying for the creation of future works. It’s not about licensing or copyright. It’s about paying for the creation of new, high-quality journalism. And then letting those works exist freely on the internet for everyone.

As mentioned above, Meta considered buying a book publisher. Why not news publishers as well? But outright ownership of news outlets shouldn’t even be the focus, as it could raise other challenges. Instead, the companies could simply set up funds that anyone can apply to. There is a pretty clear set of benefits for all parties.

Journalists who join the programs (and they should be allowed to join multiple programs from multiple companies) agree to publish new, well-written articles on a regular basis, in exchange for some level of financial support. It should be abundantly clear that the AI companies have no say over the type of journalism being done, nor do they have any say in editorial beyond the ability to review the quality of the writing to make sure it’s actually useful in training new systems.

The journalists only need to promise that anything they publish that receives funding from this program is made available to the training systems of the companies doing the funding.

In exchange, beyond just some funding, the AI companies could make a variety of AI tools available to the journalists as well, to help them improve the quality of their writing (I have a story coming up soon about how I’ve been using AI as a supplemental editor, but never to write any content).

This really feels like something that could solve at least some of the problems at both ends of this market. There are some potential limits here, of course. The AI companies need so much new content that it’s unclear whether this would create enough to matter. But it would create something. And it could be lots of somethings. And not only that, those somethings should be pretty damn up to date (which can be useful).

There could be reasonable concerns about conflicts of interest, but as it stands today, most journalism is funded by rich billionaires already. I don’t see how this is any worse. And, as suggested, it could be structured such that the journalists aren’t employees, and it could (should?) have explicit promises about a lack of editorial control or interference.

The AI companies might also claim that it’s too expensive to create a large enough pool, but if they’re so desperate for good, high-quality content, to the point of potentially buying up famous publishers, then, um, it seems clear that they are willing to spend, and it’s worth it to them.

It’s not a perfect solution, but it sure seems like one that solves two big problems in one shot, without fucking up the open web or relying on copyright as a crutch. Instead, it funds the future production of high-quality journalism in a manner that is helpful both for the public at large and the AI companies that could contribute to the funding. It also doesn’t require any big new government law. The companies can just… see the benefit themselves and set up the program.

The public gets a lot more high-quality journalism, and journalists get sustainable revenue sources to continue to do good reporting. It’s not quite a Swiftian modest proposal, in that… it actually could make sense.

Filed Under: a modest proposal, ai, copyright, generative ai, journalism, link taxes, llms, training, training data
Companies: google, meta, openai

EU Parliament Fails To Understand That The Right To Read Is The Right To Train

from the reading-is-fundamental-(to-AI) dept

Walled Culture recently wrote about an unrealistic French legislative proposal that would require the listing of all the authors of material used for training generative AI systems. Unfortunately, the European Parliament has inserted a similarly impossible idea in its text for the upcoming Artificial Intelligence (AI) Act. The DisCo blog explains that MEPs added new copyright requirements to the Commission’s original proposal:

These requirements would oblige AI developers to disclose a summary of all copyrighted material used to train their AI systems. Burdensome and impractical are the right words to describe the proposed rules.

In some cases it would basically come down to providing a summary of half the internet.

Leaving aside the impossibly large volume of material that might need to be summarized, another issue is that it is by no means clear when something is under copyright, making compliance even more infeasible. In any case, as the DisCo post rightly points out, the EU Copyright Directive already provides a legal framework that addresses the issue of training AI systems:

The existing European copyright rules are very simple: developers can copy and analyse vast quantities of data from the internet, as long as the data is publicly available and rights holders do not object to this kind of use. So, rights holders already have the power to decide whether AI developers can use their content or not.
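In practice, that objection is typically expressed in machine-readable form, most commonly via robots.txt. As a rough sketch of how a crawler might check for such an opt-out (GPTBot is a real crawler token, but whether a robots.txt check alone satisfies the Directive’s opt-out requirement is a legal question, not a technical one):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_train_on(page_url: str, crawler_token: str = "GPTBot") -> bool:
    """Check whether a site's robots.txt permits a given AI crawler.

    Rough sketch: a machine-readable signal like this is one common way
    rights holders 'object' under the EU's text-and-data-mining rules.
    """
    parts = urlsplit(page_url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(crawler_token, page_url)

# Example (hypothetical URL; performs a network request when uncommented):
# print(may_train_on("https://example.com/article.html"))
```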

This is a classic case of the copyright industry always wanting more, no matter how much it gets. When the EU Copyright Directive was under discussion, many argued that an EU-wide copyright exception for text and data mining (TDM) and for AI in the form of machine learning would be hugely beneficial for the economy and society. But, as usual, the copyright world insisted on its right to double dip: to be paid again when copyrighted materials were used for mining or machine learning, even if a license had already been obtained to access the material.

As I wrote in a column five years ago, that’s ridiculous, because the right to read is the right to mine. Updated for our AI world, that can be rephrased as “the right to read is the right to train”. By failing to recognize that, the European Parliament has sabotaged its own AI Act. Its amendment to the text will make it far harder for AI companies to thrive in the EU, which will inevitably encourage them to set up shop elsewhere.

If the final text of the AI Act still has this requirement to provide a summary of all copyright material that is used for training, I predict that the EU will become a backwater for AI. That would be a huge loss for the region, because generative AI is widely expected to be one of the most dynamic and important new tech sectors. If that happens, backward-looking copyright dogma will once again have throttled a promising digital future, just as it has done so often in the recent past.

Follow me @glynmoody on Mastodon. Originally posted to WalledCulture.

Filed Under: ai act, copyright, eu, training data