12/21/2023 11:39:27 AM | 9 minute read

Training AI models: Content, copyright and the EU and UK TDM exceptions

Get in touch

Ally Clark

Associate

Duncan Calow

Partner

DLA Piper’s media, sport and entertainment team hosted a webinar on ‘The Impact of AI on the Media & Entertainment Industry’ on 14 December 2023. The team was joined by Lord Clement-Jones, former Chair of the House of Lords Artificial Intelligence Select Committee and currently Co-Chair of the All-Party Parliamentary Group on Artificial Intelligence, who provided his thoughts on the emergence of governance and regulation of AI and its likely impact on the media and entertainment industry, including the EU’s draft Artificial Intelligence Act (EU AI Act) and the text and data mining exceptions.

On 8 December 2023 the representatives of the European Parliament, EU Member States and European Commission reached a provisional agreement on the EU AI Act.

The scope of the EU AI Act, which establishes layered obligations in respect of AI based on its potential risks and level of impact, has expanded significantly since the beginning of its legislative journey. Only in June 2023 (14 months after the bill was first introduced) in the wake of this year’s generative AI boom did EU legislators consider it necessary to seek (for the first time) to include specific obligations in relation to foundation models used in AI systems intended to generate content. Since the summer, this has been one of the main areas of debate between the EU legislators.

While an updated version of the EU AI Act has not yet been published (expected early 2024), according to the ‘Compromise proposal on general purpose AI models/general purpose AI systems’ published by Politico on 8 December 2023 (available here: https://www.openfuture.eu/wp-content/uploads/2023/12/231206GPAI_Compromise_proposalv4.pdf) (Compromise Proposal), providers of foundation models will have to:

draw up and maintain technical documentation of the model including its training and testing processes and the results of its evaluation;
draw up, maintain and make available information and documentation to providers of AI systems who intend to integrate the foundation model in their AI System;
put in place a policy to respect EU copyright law, in particular to identify and respect, including through state-of-the-art technologies where applicable, the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790 (DSM Directive); and
draw up and make publicly available a sufficiently detailed summary about the content used for training of the foundation model, according to a template provided by the AI Office.

The recitals to the copyright-relevant provisions of the Compromise Proposal (Recitals) also clarify the territorial application of the EU’s TDM rules (as restated in the EU AI Act), stating that: ‘Any provider placing a [foundation model] on the EU market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these foundation models take place. This is necessary to ensure a level playing field among providers of general-purpose AI models where no provider should be able to gain a competitive advantage in the EU market by applying lower copyright standards than those provided in the Union’.

The Recitals also add colour to the provisions by confirming that any use of copyright protected content requires the authorisation of the rights holders concerned unless a relevant copyright exception applies. Specifically, in relation to the DSM Directive, the Recitals state that rights holders may choose to reserve their rights over their copyright works or other subject matter to prevent text and data mining (TDM), unless this is done for the purposes of scientific research. Where the rights have been expressly reserved by way of an “opt out” in an appropriate manner, providers of foundation models will need to obtain an authorisation from rights holders if they want to carry out TDM over such works. This, of course, expressly restates the position under the 2019 DSM Directive which introduced two exceptions to rights holders’ copyright and database rights which allow reproductions and extractions of lawfully accessible copyright works and database rights:

for the purpose of scientific research, where the TDM is carried out by research organisations and cultural heritage institutions (Article 3, DSM Directive); and
for the purpose of TDM generally (Article 4, DSM Directive). This includes TDM for a purpose other than scientific research, including for commercial purposes; hence its importance in the context of using content to train foundation models. The exception under Article 4, DSM Directive applies to all content unless the rights holder has proactively taken steps to “opt out” of the exception by expressly reserving their rights “in an appropriate manner such as by machine readable means.” In other words, under the DSM Directive (and now also confirmed in the context of generative AI in the EU AI Act) in the absence of an opt out … content (including that which is protected by copyright or database rights) which is freely available on the internet can be freely reproduced and extracted, including for commercial TDM purposes such as training a foundation model.

This obviously presents a tremendous risk if rights holders are not aware of the Article 4 TDM exception and fail to expressly “opt out”, especially given that, once the data is extracted from a website, the duration of the mining right is unlimited. An effective opt out is therefore crucial if rights holders wish to reserve their copyright and database rights.

Other countries such as Japan and Singapore also have broad exceptions and, depending on the facts, TDM may also be fair use under US law.

In July 2023, the Authors Coalition (a coalition of US creative workers and other writers, illustrators, photographers, graphic artists, digital media workers, journalists, novelists, playwrights, composers, and songwriters) issued a joint statement urging EU legislators to promptly cure what they considered to be violations of the Berne Convention. They also urged the US government to use all available means to bring the EU into compliance with Berne:

“Much of the copying of our works for generative AI, including “scraping” of Web pages and compilation of “datasets” for use in generative AI, has been carried out from, and/or by entities in, the European Union, claiming to rely on the exceptions to copyright for “text and data mining” (TDM) in Articles 3 and 4 of the Directive on Copyright in the Digital Single Market (“DSM Directive”) enacted by the European Union in 2019.

But allowing these exceptions to be applied to copying for ingestion and reuse by generative AI systems constitutes a significant violation of the obligations of EU member states as parties to the Berne Convention….

We urge the European Union to promptly cure this violation of the Berne Convention and provide effective redress for the violations which have already occurred….”

This, of course, was an argument raised by other rights holders when the TDM exemptions in the DSM Directive were first introduced. They also argue that the European Commission and the EUIPO have failed to provide proper guidance on how to opt out in an appropriate manner, which means that opting out effectively is not possible.

In the absence of guidance from EU legislators and the EUIPO, rights holders and AI developers have made a number of proposals for how to opt out under the DSM Directive. A World Wide Web Consortium (W3C) ‘Text and Data Mining Reservation Rights Community Group’, formed by a federation of publishers, has proposed sample disclaimers (not yet adopted) which can be used to declare the reservation of rights under Article 4 DSM Directive on each web resource which a rights holder controls. Some of the major AI developers have also proposed opt-out methods for their own web crawlers.
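By way of illustration only, the sketch below shows how a crawler operator might check two machine-readable opt-out signals before mining a page: a robots.txt rule addressed to its crawler, and a reservation flag sent as an HTTP response header. It is a minimal sketch under stated assumptions and not a statement of what Article 4(3) requires. The crawler name ExampleAIBot is hypothetical, and the header name tdm-reservation (with the value 1 meaning rights are reserved) reflects our understanding of the W3C Community Group's draft proposal, which has not been formally adopted.

```python
from typing import Mapping
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def tdm_reserved(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers carry a TDM reservation flag.

    Assumption: the flag is a `tdm-reservation` header whose value `1`
    means the rights holder has reserved TDM rights.
    """
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("tdm-reservation") == "1"


def may_mine(robots_txt: str, user_agent: str, url: str,
             headers: Mapping[str, str]) -> bool:
    """Treat content as minable only if neither opt-out signal is present."""
    return crawler_allowed(robots_txt, user_agent, url) and not tdm_reserved(headers)


# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/article"
    print(may_mine(SAMPLE_ROBOTS, "ExampleAIBot", page, {}))
    print(may_mine(SAMPLE_ROBOTS, "OtherBot", page, {"TDM-Reservation": "1"}))
    print(may_mine(SAMPLE_ROBOTS, "OtherBot", page, {}))
```

In this sketch either signal on its own is treated as an opt out. Whether any particular signal amounts to a reservation made “in an appropriate manner” under Article 4 DSM Directive is a legal question which the code does not answer.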

UK TDM Exception

The implementation deadline for the DSM Directive fell after the expiry of the UK/EU Brexit transition period, so the UK was not required to implement the DSM Directive and did not do so.

However, section 29A of the Copyright, Designs and Patents Act 1988 (CDPA) already provides for a copyright exception which allows TDM for non-commercial purposes provided that: (a) the user has lawful access to the copyright work (e.g. by virtue of a licence or permission in the terms and conditions); (b) there is sufficient acknowledgment; and (c) the copy is otherwise dealt with in accordance with the rules contained in this provision (for example, the copy of the copyright work is not transferred to another person).

In October 2014, the UK Intellectual Property Office (IPO) published guidance for publishers around the time that the exception was first introduced. Notably the guidance confirmed that:

“If a researcher has the right to read a copyright document under the terms of the licensing agreement with the content provider, they must be permitted to copy the work for the purpose of non-commercial text and data mining. Contract terms which have the effect of preventing this will be unenforceable”
“Publishers and content providers are able to apply reasonable measures to maintain their network security or stability so long as these measures do not prevent or unreasonably restrict a researcher’s ability to make the copies they need to make for their text and data mining. Contract terms that stop researchers making copies of works to which they have lawful access in order to carry out a text and data mining analysis will be unenforceable.”
“There are no restrictions on how or where outputs of text and data mining can be published, including journals published for profit by academic publishers and under licences that permit commercial research […]. Other commercialisation of the research outputs is not restricted either. But it is important to be scrupulous in assessing whether the original purpose of carrying out the text and data mining analysis is solely non-commercial; if it isn’t, then researchers are very likely to be infringing copyright.”

Publishers developing TDM licences over the last ten years have relied on the scope of the above to frame their contractual approach.

In June 2022, the UK IPO published a response to its consultation which was intended to seek ‘evidence and views on a range of options on how AI should be dealt with in the patent and copyright systems’. In the response, the UK IPO indicated that it planned to broaden the CDPA TDM exception to allow TDM for any purpose (including a commercial one) and without granting rights holders the ability to opt out. In February 2023, the UK Government u-turned on its plans after the proposals drew widespread criticism from rights holder groups, particularly within the creative industries, over loss of revenues under TDM licences. Dan Conway, the CEO of the Publishers Association, described the now scrapped plans as a “sledgehammer to crack a nut” while arguing in favour of improving the licensing environment. As to the broadening of the TDM exemption, he said that it: “would allow any … businesses of any size, located anywhere in the world, to access all my members’ data for free for the purposes of text and data mining.”

In March 2023, following recommendations made by Sir Patrick Vallance (the UK Government’s former chief scientific advisor) in his “Pro-Innovation Regulation of Technologies Review”, the government nevertheless accepted that there remained a need to clarify how AI providers and users can utilise copyright works and data in order to promote AI. As a result, the UK IPO is now working with users and rights holders to assist in developing a Code of Practice on copyright and AI, the stated aim of which is to ‘make licences for data mining more available. It will help to overcome barriers that AI firms and users currently face, and ensure there are protections for rights holders. This ensures that the UK copyright framework promotes and rewards investment in creativity. It also supports the ambition for the UK to be a world leader in research and AI innovation’. It is currently envisaged that the Code of Practice will be entered into on a voluntary basis. In the Government’s ‘Response to Sir Patrick Vallance’s Pro-Innovation Regulation of Technologies Review’, published in March 2023, it said that ‘An AI firm which commits to the code of practice can expect to be able to have a reasonable licence offered by a rights holder in return’, indicating that those firms who sign up to the Code of Practice may benefit from more favourable licence conditions.

In November 2023, the IP Federation signed a public statement urging the UK IPO to introduce a Code of Practice which is supportive of AI innovation. In the open statement, the IP Federation said (amongst other things) that: “In order to achieve the necessary scale, AI developers need to be able to use the data they have lawful access to, such as data that is made freely available to view on the open web or to which they already have access to by agreement.” And also that: “any restriction on the use of such data or disproportionate legal requirements will negatively impact on the development of AI, not only inhibiting the development of large-scale AI in the UK but exacerbating further pre-existing issues caused by unequal access to data”. Signatories to the statement included a number of alliances from the education and research sector, the software industry and organisations such as Creative Commons.

So, what can we expect next year?

The ratification of the EU AI Act in early 2024.
Further EU guidance on the interaction between the EU AI Act and TDM.
Potentially greater activity by rights holders seeking reservation of rights and the evolution of opt-outs to comply with Article 4 of the DSM Directive.
Action towards opt-out standardisation, to allow increased certainty for all sides on when content is available for TDM.
Wider availability of licensing options including from CMOs.
The publication of the UK Code of Practice on Copyright and AI (expected in early 2024).