How Licensing Models Can Be Used for AI Training Data in Generative AI and Beyond
Licensing is likely to become more common between generative AI developers and rights-holding content companies. That holds even in the unlikely event that AI companies sweep the numerous copyright battles now making their way through U.S. courts.
First, it’s not guaranteed that all applications of gen AI systems will be deemed legal under the U.S. Copyright Act.
In the many U.S. AI litigation cases that claim copyright infringement, generative AI companies have argued that training their models with copyrighted works qualifies under the Copyright Act’s fair use exception. That argument relies heavily on a 2015 court finding that Google’s digitization of millions of books was fair use.
Yet fair use determinations are fact-specific and require a court to apply a four-factor test in which the use’s purpose and market impact are crucial. While Google’s digitizing of books for the purpose of displaying brief quotes in response to a search query satisfied the Second Circuit’s application of the fair use test, not every AI model’s use case will necessarily receive a similar court-sanctioned stamp of fair-use worthiness.
It’s clear the courts will take time to settle these cases, and that might not be fast enough for gen AI to gain enterprise confidence and adoption.
Where U.S. courts will ultimately land in the ongoing generative AI cases remains uncertain. More predictable is the drawn-out time it will take the courts to stick that landing: The Google Books fair use lawsuit took over 10 years. Further, these decisions would naturally make law only for the U.S. and won’t resolve any of the cross-border questions swirling around gen AI.
Customers of generative AI software want assurances now that they can leverage AI models without legal liability. Authorized training datasets with documented and traceable provenance stand out as the quickest path to providing that assurance.
Accordingly, developers’ vested interest in fostering customer trust will play a pivotal role in the coming shift toward licensing third-party training data and paying the corresponding license fees.
Potential Gen AI License Structures
Generative AI licensing will likely settle on a few commonly used models. So far, direct licensing is already operational in two forms: one-to-one negotiation and commercial aggregation. Direct license models will certainly remain and coexist with any additional model placed in service.
Meanwhile, collective, compulsory and judicial settlement license models are not yet present in the U.S. generative AI market. But one or more of these could materialize as an outcome of court proceedings and regulatory agency review.
The ability to leverage output without legal liability remains an ongoing concern for generative AI customers. A collective or compulsory model would likely cover only training data. By contrast, both forms of direct licensing (and, to a certain extent, a judicial settlement) would more readily offer AI companies an opportunity to negotiate with rights holders and secure rights holder-approved guidelines for AI model users who want to leverage generated output.
Depending on where one sits, potential goals for a gen AI license model for training data include one or more of the following:
Maximizing the pool of quality training data: A compulsory license model, which entails license terms and rates set by statute, has the best chance of maximizing the available pool of training material for AI developers. Since a compulsory license permits the use of copyrighted material without explicit owner consent and with no ability for owners to opt out, it’s also the model with the least support among rights holders. Any compulsory license would probably deal with a specific content sector rather than encompass all content types.
An opt-out license would yield a larger pool than one that’s opt-in; however, both pose an educational challenge in making all eligible rights holders aware of the license and prompting them to take requisite steps for participation and compensation.
Ensuring access to training data for both small and large AI developers: This outcome is influenced heavily by license fees and, to some degree, by the transparency of license terms. An industry characterized by license agreements with high fees and undisclosed terms would make securing training data deals more challenging for AI developers with limited cash and limited market intelligence to inform their negotiations.
One-to-one direct licensing lacks transparency and can act as a barrier to entry by limiting buying opportunities for less economically advantaged developers. In contrast, collective and compulsory license models provide more balance in pricing, transparency and equitable treatment for various classes of both AI companies and rights holders.
Direct licensing through commercial aggregators falls in the middle of the spectrum. Compared with one-to-one direct licensing, aggregation potentially democratizes purchasing ability by lowering license fees to a price that ideally remains fair to rights holders, while also offering more transparency and streamlined administrative requirements.
Ensuring equitable license fee compensation to both small and large rights holders: Direct license models offer the best opportunity for large rights holders to maximize license revenue. The downside to direct licensing via one-to-one negotiation and, to a lesser extent, via commercial aggregator is the heightened administrative burden on AI companies to negotiate numerous agreements. Because AI companies want to minimize that burden, direct license models might hinder full participation by smaller rights holders.
Judicial, collective and compulsory models might distribute license revenue and opportunities more equally; however, small rights holders often view their collective license compensation as insufficient, while rights holders across the spectrum often regard the government-established rates of compulsory licenses as too low.
Minimizing the burden and cost of license administration: A one-to-one direct licensing model will miss content from rights holders with smaller catalogs due to the heightened administrative burden. While direct licensing typically has the highest administrative burden per license contract, each of the other license models does come with registration, record-keeping and/or reporting responsibilities, as well as periodic court and/or regulatory proceedings to review rates and other terms.
While licensing standards take shape, in whatever form that may be, stakeholders can take the following steps:
Generative AI model developers and their investors: Companies offering commercial AI products and seeking market share should at least explore limiting their operations to models trained on authorized content with documented and traceable provenance.
Categories of authorized training data include material that is owned (the stakeholder’s own original or commissioned content), licensed, public domain (notably not the same thing as publicly available) or open source (used consistently with its terms).
Also, while forging dataset licensing relationships with rights holders, AI companies can simultaneously attempt to expand the license scope to cover output rights for the AI model’s end users. As an example, Stability AI reports training its text-to-music generator Stable Audio on 800,000 tracks obtained through its partnership with music library AudioSparx. As part of that partnership deal, Stability AI secured rights for Stable Audio customers to use generated output in certain commercial contexts.
An authorized dataset is a competitive advantage when pursuing a customer base increasingly concerned about generative AI risk. Providing customers with clear and guaranteed guidance on legal uses of an AI model’s output substantially strengthens that advantage. The caveat is that these initial licensing arrangements should build in enough flexibility to react appropriately to evolving legal, regulatory and market conditions.
Companies seeking to leverage AI-generated output: Distributing AI-generated output requires the same vetting and rights clearance review one should conduct for any third-party image, music, footage, quote or other asset to be incorporated into a production or creative work. (See VIP+’s November 2023 special report “Rights Clearance for Film & TV.”)
If relying on a third party for generative AI capability, the providers using authorized datasets are the ones most likely to be able to supply the information required for a reliable clearance review. A certification program offered by Fairly Trained, a nonprofit founded by former Stability AI executive Ed Newton-Rex, has recently emerged to help the market identify gen AI companies that rely on authorized training material and maintain adequate records. Similar certification mechanisms are likely to follow.
Rights holders seeking additional revenue by making their content available for gen AI use: Most publicized training data license deals have involved large rights holders such as the Associated Press, Getty Images and AudioSparx. The leading AI companies seek millions or even billions of data points and find purchasing small catalogs from numerous rights holders too administratively burdensome.
As a result, prominent licensing deals may continue to favor rights holders with large catalogs, leaving moderate and small catalog owners with less attractive licensing options. Small and midsize rights holders might investigate opportunities for licensing to specialized generative AI applications, where their unique content library, albeit small, can represent greater value as part of the whole and command a premium license fee.