RIGHTBRAIN BLOG

The Road to Hell is Paved with Yellow Bricks

Off the back of recent comments from Sir Elton John, Pete Tiarks explores the overlooked nuance in proposed legislation—specifically, a transparency measure from the House of Lords.

Image generated in Google Gemini from the prompt "nerdy Elton John Glasses"

This post is the first in a series on the use and impact of LLMs on the generation of creative content. This is, potentially, a really interesting and important debate. Sadly, most of the takes I see on it are profoundly unhelpful, going for an easy framing of a fight between the creative industry on one side and the tech industry on the other. This isn’t wrong, exactly: there are obviously real and important issues there. But that framing generates a lot more heat than light, as they say.

So, I felt a certain sense of dread on Sunday when I saw that this had once again broken out of my nerdy legal-updates feed and into my general news feed, this time thanks to an intervention by Sir Elton John on Sunday with Laura Kuenssberg:

Sir Elton John described the government as "absolute losers" and said he feels "incredibly betrayed" over plans to exempt technology firms from copyright laws.

Speaking exclusively to Sunday with Laura Kuenssberg, he said if ministers go ahead with plans to allow AI firms to use artists' content without paying, they would be "committing theft, thievery on a high scale".

This week the government rejected proposals from the House of Lords to force AI companies to disclose what material they were using to develop their programmes.

A government spokesperson said that "no changes" to copyright laws would be "considered unless we are completely satisfied they work for creators".

Generative AI programmes mine, or learn, from vast amounts of data like text, images, or music online to generate new content which feels like it has been made by a human.

Sir Elton said the "danger" is that, for young artists, "they haven't got the resources ... to fight big tech [firms]".

"It's criminal, in that I feel incredibly betrayed," he added. 

Sir Elton was in a state of pretty high dudgeon, and Kuenssberg is far more interested in the Punch and Judy show than the policy detail, so the interview focused entirely on the “can you train LLMs on copyright work?” angle. But if you look at the text of the write-up, you can just about pick out that the House of Lords’ proposal itself was a little more subtle and interesting than that.

Now to be fair, this is not entirely the BBC’s fault. The Lords themselves largely framed this as a measure to protect the creative industries.[1] But look at it again: “proposals from the House of Lords to force AI companies to disclose what material they were using to develop their programmes.” If you thought, “that’s not really about whether it’s legal to train LLMs on creative material”, then give yourself a prize. The key text from the proposed amendment is as follows:

The regulations must require specified business data to be published by [LLM providers] so as to provide copyright owners with information regarding the text and data used in the pre-training, training, fine-tuning and retrieval-augmented generation in the AI model, or any other data input to the AI model.

This is a transparency measure, not a copyright measure. Now, this does have relevance for the creative industries: if you’re a copyright holder, knowing whether a particular LLM provider has used your work is a necessary first step if you want to make the provider give you some money for it. But it has relevance for quite a lot more than that.

I said “if you’re a copyright holder” before, but I might as well not have bothered, because I can guarantee with 99.99999% certainty that you are. You’ve got copyright in your notebook doodles, your homework projects from school, your emails, etc. etc. Obviously you’re not making the same sort of bank as Sir Elton from them, but both you and he would have been entitled to receive information about all the copyright material (i.e. more or less everything) that was in OpenAI’s training dataset. 

Why would you want to know about that? Maybe you aspire to be the next Elton John and want to be paid for the privilege of having your work included in the training set. But maybe you’re just interested in LLMs and want to know how they’re trained. Maybe you want to deploy an LLM for some high-stakes use case and are wondering whether there’s anything in the training data that might make it unsuitable. Perhaps you’re hoping to build an LLM and would like to see how the big boys are doing it.

This is a really important distinction. Whilst I sympathise with a lot of Sir Elton’s concerns, my thoughts on the copyright debate are diametrically opposed to his. I think the best protection against the predations of big tech is more competition. If anyone building an LLM has to license training data, then only the rich will be able to build LLMs. Unlike Sir Elton, I’d want to allow very liberal use of copyright content for the training of LLMs by everyone - not just the titans with billions on their balance sheets. But, like Sir Elton, I would very much like to take a peek at the training data those titans are using.

Indeed, I struggle to think of many tech businesses that would actually have been disadvantaged by the House of Lords proposal. It’s a pretty good rule of thumb that most consumers of a product would like there to be more information about it. I’m currently working at a UK tech business - Rightbrain AI - where we help our customers make educated decisions about LLMs and deploy them rapidly. I can easily imagine our customers treating an LLM’s training data as one of the criteria for choosing it over another. That’s a service we’d love to be able to provide, and the proposed amendment would have let us do it.

I have no idea why the government rejected that amendment. Their official reason is that it would cost public money to implement, which is largely just a way of ending the conversation. The optimistic view is that if there’s going to be transparency, they want it done properly. But one explanation may be that they bought into the same lazy “creative industries vs. AI” narrative that the BBC did. If so, we’re all worse off for it.

[1] Although shout out to Baroness Freeman of Steventon, who at least made the point that transparency was good for more than just copyright holders.
