OpenAI, Meta Fair Use Bids Risky in Copyright Drama

Published Mar 12, 2025, 12:20 PM

There are over three dozen US lawsuits between rights holders and developers of large language models, with the main point of contention being whether it was permissible “fair use” to train LLMs with copyrighted content. Newspaper organizations, authors, music publishers and visual artists have filed suits against OpenAI, Microsoft, Meta, Google and Anthropic, among other leading developers. Damages could run into the hundreds of billions of dollars, with the potential for milestone decisions this year. In this episode of Votes and Verdicts, Frankfurt Kurnit Klein & Selz partner Jeremy S. Goldman joins Bloomberg Intelligence analyst Tamlin Bason to discuss AI copyright issues and analyze how a recent rejection of fair use in an LLM dispute may influence pending cases.

Hello, and welcome to the Votes and Verdicts podcast, hosted by the litigation and policy team at Bloomberg Intelligence, an investment research platform of Bloomberg LP. Bloomberg Intelligence has five hundred analysts and strategists working across the globe and focused on all major markets. Our coverage includes over two thousand equities and credits, and we have outlooks on more than ninety industries and one hundred market indices, currencies and commodities. Now, this podcast series examines the intersection of business, policy and law. I'm Tamlin Bason. I'm an analyst with Bloomberg Intelligence covering intellectual property litigation impacting the tech sector. Now, often that means a focus on patent litigation, but right now we're in a period of a rising tide of copyright litigation that could have profound impacts on the singular technology issue of the past few years, that being artificial intelligence, more specifically generative AI, which is made capable by large language models that are trained on massive data sets. Now, the sticking point is that much of that training data was copyrighted, and we've seen dozens of lawsuits pop up in the US, brought by rights holders to some of those data sets against the developers of those large language models. Today we're going to explore some of those lawsuits, and we're going to spend quite a bit of time talking about something called fair use. And we're going to be doing all of this with the very capable help of Jeremy Goldman, partner and co-chair of the Emerging Technology Group at the law firm Frankfurt Kurnit Klein & Selz. Jeremy, thanks so much for joining us today. Now let's start with fair use. This seems to be the primary defense that's going to be put forward by developers of large language models. Now, fair use seems like a bit of a get-out-of-jail card, in that it doesn't mean that the rights holder's copyrights weren't infringed, but it does seem to mean that the infringement was justified by the end use. And Jeremy, what does someone accused of copyright infringement have to show in order to get that fair use protection?

Yeah, well, thanks, Tamlin, for having me, and happy to talk about these cases and fair use. You know, first, fair use is actually codified in the United States Copyright Act, in a section called 17 U.S.C. § 107. And that section says that even though you might exercise one of the exclusive rights of copyright, even though you might make a copy, even though you might make a distribution of somebody's copyright-protected work, section 107 says that even if you exercise one of those exclusive rights that make up the bundle of sticks that is a copyright, it's not an infringement, it's actually not an infringement, if it's a fair use. And then it says, well, what is a fair use? And it says, well, it's a case-specific question. Congress tells courts to look at four factors to decide in each instance, and they're non-exclusive, but there are four factors that they say courts should look at in deciding whether it's fair use, and they're all encoded in the statute. We can go through the four of them, but that's basically it: in all the cases involving fair use, courts will examine each of the four factors, weigh them, and then decide whether on balance those factors tilt in favor of it being a fair use or not being a fair use.

Yeah, and I think it's important that this is codified, but at the same time it very much is a doctrine that has evolved through case law, I think it's fair to say. I mean, there have been a number of high-profile fair use cases, both in the circuit courts and of course at the Supreme Court. But probably relevant to what we're talking about today, to the development of large language models, there are a few that stand out, and I think one of those was the Google Books litigation. Can you maybe tell us what that litigation was about and why there might be some parallels to these large language models?

Sure, and you know, I'm happy to talk about the Google Books litigation. I'll also mention that I was a litigator who worked on those cases, and in those cases I represented the Authors Guild in lawsuits that were brought against both Google and against a consortium of libraries that worked with Google in connection with the Google Books program. My views today are just me talking as a lawyer, opining on the decisions that came out in those cases, and they certainly don't represent the views of the Authors Guild, or the views of any parties in those cases, or even the views of my law firm. But I am very familiar with those cases and those decisions. So with that disclaimer, let me just talk about those cases. Google made it its mission to make the entire world searchable, right? That was the goal of Google. They started with the World Wide Web, of course: you put a query into Google and get results back from the Web, and to do that they had to copy essentially the entirety of the World Wide Web. Well, Google decided that maybe that's not enough, and we also want to make all of the world's knowledge, including all of the books of the world, searchable. So in order to do that, Google partnered with major university libraries around the country, and they entered into agreements to go to the libraries and take all of their books, millions upon millions of books, many of which were still protected by copyright. And they set up scanning stations and they scanned and reproduced and made copies of millions upon millions of books in order to digitize them, and then ran them through an optical character recognition, or OCR, process, ultimately for the purpose of creating a search index, so that when you do searches, either on books.google.com or even on the main Google search engine, if your keywords hit on words within the books, you would be able to find those books. And what Google was doing was displaying a snippet of where the keywords were located within a book, so you'd see that, oh, it was in this book on this page or whatever. On the other side, the university libraries, as part of the deal, also got access to this search index, and there was a website called HathiTrust where you could put in keywords, and they didn't show snippets, but they did show that in this book, on this page, you'd be able to find that information. The university libraries also decided that as part of this process, they were going to start making certain works, quote, orphan works, available online to members of the university library. Well, authors around the country, indeed authors around the world, were not happy that Google was digitizing and scanning and making use, potentially for commercial purposes, of millions and millions of books without permission. And books being at sort of the core of what copyright protects, the exclusive rights of authors, they felt that this was a violation of their copyright. And so they filed lawsuits against Google, and they filed lawsuits against the university libraries, alleging that the digitization and scanning of the books was copyright infringement.
And in those cases, Google and the university libraries defended themselves primarily on the ground that, yes, we made copies, but those copies were a fair use. They argued that it was transformative: books were written and published for one purpose, and we were using them for quite a different purpose, which was to make them searchable. Also, they said, to make them available to people who were blind, which was another example of what they were doing. And ultimately the case went up to the Second Circuit Court of Appeals, and the court found that indeed it was a fair use, and held that this digitization for the purpose of making books searchable was a fair use.
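To make the mechanics of the pipeline Jeremy describes concrete, here is a minimal Python sketch of the search-index-plus-snippets pattern: the full text is copied in order to build the index, but a searcher only ever sees keyword locations and short snippets. This is an illustration only, not Google's actual system; the function names, snippet size, and sample book are all invented for the example.

```python
from collections import defaultdict

SNIPPET_CHARS = 60  # users see a short snippet, never the full scanned text

def build_index(books):
    """Map each lowercased word to (title, character offset) postings."""
    index = defaultdict(list)
    for title, text in books.items():
        pos = 0
        for word in text.split():
            start = text.index(word, pos)  # the full text is copied to find offsets...
            index[word.lower()].append((title, start))
            pos = start + len(word)
    return index

def search(index, books, keyword):
    """...but search results expose only locations and short snippets."""
    results = []
    for title, start in index.get(keyword.lower(), []):
        lo = max(0, start - SNIPPET_CHARS // 2)
        results.append((title, "..." + books[title][lo:lo + SNIPPET_CHARS] + "..."))
    return results

books = {"Example Book": "The quick brown fox jumps over the lazy dog."}
print(search(build_index(books), books, "fox"))
```

The design point the courts focused on lives in that split: the copying happens in `build_index`, while the public-facing `search` supplants very little of the original work.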

Yeah, and you sort of alluded to it there. Ultimately it did go up to the Second Circuit, but this was by no means a swift resolution. I recall it going on for years and years. There were potential settlements that I believe were ultimately rejected by the court. But I think, to some extent, what it showed is that these can be very difficult issues to resolve in court. Fair use is kind of a squishy concept; there are very few bright-line rules.

Totally. And I'd also say, at a high level, that case has analogies to the current cases that authors and creators are filing against the AI companies, not just on the legal points, but on a feeling level, on an icky level. There's a very similar sentiment. The Copyright Act itself, and indeed the Constitution, talks about giving authors exclusive rights over their works for limited periods in order to promote the progress of science and the arts. That's the constitutional mandate of the Copyright Act. And authors felt, here's this technology company coming in and scanning and digitizing and making use of and profiting off of our work, and they're claiming that it's fair use, and that just seems wrong. And I feel like there's the same kind of sentiment now with the AI cases that we'll get into.

Yeah, I think that's absolutely right. And I guess just to drill down a little bit further on what you mentioned right there, transformative use. I believe that's ultimately how Google Books was decided, but that's only one piece of the fair use analysis; it's sort of baked into the first fair use factor, I think. Do you want to really quickly run us through those fair use factors?

Let me run through the fair use factors, and in doing so we'll also talk about transformative use, because indeed that word, transformativeness, doesn't appear in the fair use factors. That's a judicial creation. It's a creation really of a judge, Judge Leval, in a Harvard Law Review article that he wrote about fair use, and then the Supreme Court picked it up in a case involving a 2 Live Crew song. I won't go too far down that road, but just to tick through the factors. Okay, I don't have the statute in front of me, but I think I know it by heart, right? The first factor is the purpose and character of the use. The second factor is the nature of the copyrighted work. Well, we'll go back to the first factor because it's kind of the most important, so I'll go through the four and then come back to the first. So, purpose and character of the use. The second one is the nature of the copyrighted work, which is really, how close is it to the core of copyright protection? You could think of a fictional work versus a nonfictional work. Copyright doesn't protect ideas, so copyright will be more steadfast in protecting a fictional work or a highly creative illustration, for example, versus something like an encyclopedia. The third factor is the amount and substantiality of the use: how much of the work was used in relation to the whole, sort of a quantitative question. And the fourth is the effect of the use on the market, and so that's really like a damages thing: how harmful is this? And courts tend to weigh the first factor, the purpose and character, and the fourth factor very strongly. Now, just going back, like I said, to the first factor. When we talk about the purpose of the use, the way courts have dealt with this over the years, they've articulated the first factor and broken it into two pieces. First, is it a commercial use or a noncommercial use? Until recently, that sub-factor of factor one actually got very little deference. People often think, oh, if it's a nonprofit organization, they're going to get a lot of deference, if it's a school, if it's an educational institution. Indeed, that does provide some weight, but ultimately it's very rarely been dispositive in a court's analysis whether the use is commercial or noncommercial. The most important piece, what courts have paid the most attention to, and what starts getting to the core of modern fair use jurisprudence, is the purpose of the use. And there, in the case involving 2 Live Crew, involving what was held to be a parody of the song Pretty Woman by Roy Orbison, a dirty version by 2 Live Crew, the Supreme Court held that something like a parody is transformative, because it's taking the original work and commenting on the original, shedding new light on the original, adding something new, and the new work has a new purpose from the original. It has transformed the work from what it was before into something new.
And the Supreme Court weighed in on that in the mid-nineties in that 2 Live Crew case, and hadn't really weighed in on it again until just, what, a year or two ago, when they decided a case involving Andy Warhol, and I imagine we'll get to that, where an Andy Warhol work was held not to be fair use. Though I should be careful the way I say that: the particular use of an Andy Warhol work was held not to be fair use. But the transformative use test kind of controls frequently. And what you see in the cases involving fair use is the defendant trying to argue that their use is new and transformative as against the original, and the plaintiff saying, no, we did it for this purpose and you're using it for the same purpose.
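As a rough illustration of how the four statutory factors fit together, here is a small sketch. Fair use is a holistic judicial balancing, not a formula, so the numeric weights below are invented purely to encode the point above that courts lean hardest on factors one and four; every name and number here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FairUseFactors:
    # Each factor scored in [-1, 1]; positive favors the defendant's use.
    purpose_and_character: float  # factor 1: transformative? commercial?
    nature_of_work: float         # factor 2: creative core vs. factual
    amount_used: float            # factor 3: amount and substantiality
    market_effect: float          # factor 4: effect on the market

def lean(f: FairUseFactors) -> str:
    # Hypothetical weights: factors 1 and 4 dominate, per the discussion.
    score = (2.0 * f.purpose_and_character
             + 0.5 * f.nature_of_work
             + 1.0 * f.amount_used
             + 2.0 * f.market_effect)
    return "leans toward fair use" if score > 0 else "leans against fair use"

# E.g., a highly transformative use of a creative work, copied in full,
# with modest market harm:
print(lean(FairUseFactors(0.9, -0.5, -1.0, -0.2)))
```

The example input mirrors Google Books as the courts saw it: whole works copied (factor three against), yet the transformative purpose and absence of market harm carry the balance.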

That's very well spoken, and also you crushed it on those factors without having them in front of you. Good job, law school training coming back. So you drew a lot of parallels between Google Books and the litigation that's going on now with the LLMs. But I'd like to point out a distinction that I've written about, and it's that Google Books was not necessarily Google's core business model. You mentioned, yes, they wanted to digitize everything, but books are really only a small portion of that. I don't think Google was ever expecting to make massive amounts of revenue from Google Books; it was kind of a side project. Obviously it had implications for authors, but for Google it was maybe not core to the business model. I think that's different than what we see now with OpenAI or Anthropic or some of the other LLM developers. The LLM is the business model. And in the US, copyright infringement damages can be massive. For willful infringement, I believe it's up to one hundred and fifty thousand dollars per work, and we're talking about millions and millions of potential works being used. So I think this really does potentially pose an existential threat to those business models if the fair use defense doesn't hold up, and that's why I think it's so important while this field is still kind of in its infancy. But let's turn back to the Warhol case, and also to the way Warhol was applied more recently in probably the first AI case that we've had, that being Ross Intelligence. Now, you wrote about that, and I think in your post you said this case helps delineate the boundaries of acceptable fair use of copyrighted material for AI model training. Can you tell us why what Ross did was outside of those accepted boundaries?
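Before Jeremy's answer, a quick back-of-the-envelope on the damages math Tamlin just described. The one hundred and fifty thousand dollar figure is the statutory maximum per willfully infringed work under 17 U.S.C. § 504(c); the one-million-work count below is a purely hypothetical illustration.

```python
# Statutory maximum per willfully infringed work (17 U.S.C. § 504(c)).
STATUTORY_MAX_WILLFUL = 150_000

# Hypothetical assumption: one million registered works in a training set.
works = 1_000_000

print(f"${STATUTORY_MAX_WILLFUL * works:,}")  # $150,000,000,000
```

Even this modest assumption yields $150 billion at the maximum, which is why the intro's "hundreds of billions of dollars" framing, and Jeremy's "existential" language below, are not hyperbole.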

Well, we don't know what the accepted boundaries are, right? We don't know the extent of the accepted boundaries. What we have now, and just to give a little context for listeners here: you have upwards of forty cases for copyright infringement, or copyright-adjacent types of claims, that have been asserted against AI models for using copyright-protected materials without permission and without payment. The big ones and the most famous ones being, sort of, the New York Times filing a lawsuit against OpenAI, or Sarah Silverman filing similar types of lawsuits. And it's important for your question to understand the differences between the models and why this case is important. Let's just take the New York Times versus OpenAI. That's a case by publishers and authors against a generative AI large language model, a general-purpose one. Underlying it is this GPT model that's being trained on billions and billions of points of training data and has kind of limitless purposes. People can put those models to use in all sorts of industries for all sorts of purposes, one application being ChatGPT, for example. The case by Thomson Reuters against Ross Intelligence was brought way back in twenty twenty, actually, so it really predates a lot of this. That's a case where Thomson Reuters, which provides, among other things, the legal research tool Westlaw, which is my go-to legal research tool of choice, filed a lawsuit against Ross Intelligence, which is a legal-focused, AI-powered legal research tool. And they claimed that Ross Intelligence used Westlaw's headnotes, which are summaries of legal points and holdings, and which Thomson Reuters says are copyright protected, to train the legal AI model, and claimed that this was copyright infringement. Ross Intelligence argued that, no, we don't ever output the Westlaw headnotes, our use is transformative, and it's fair use. And in that case the court held that it was not fair use. That decision came out just February eleventh, very recently, and it is the first meaty, substantive fair use decision on this existential question that you talked about in the beginning, which is: is the use of copyright-protected materials to train artificial intelligence models copyright infringement, or is it fair use? In this particular case, the court said that it was not fair use. Very important to understand, though, are the particular facts of this case. As we said, fair use is a case-by-case analysis. It's very fact-specific, and there are important distinctions between the facts of the Ross Intelligence case and the facts of cases like New York Times versus OpenAI. We can talk about what those differences are and why they might matter. Whether those will be enough to change the outcome in those cases is, as I've called it, the trillion dollar question. I say trillion dollar because of what you said, right? I mean, you're talking about massive potential copyright damages in an industry that is now valued at, probably, you're at Bloomberg, you're in a better position to tell me.
But I imagine it's a trillion dollar industry now, and the outcomes of those rulings could really have an impact on that industry and the amount of damages that could be at stake there.

Yeah, I mean, AI has certainly been the key theme in the investment community for years and years, and the valuations seem to double every few months, so it's a massive amount. But yeah, let's dive into that. You mentioned Ross was not generative AI, and I think the judge was pretty clear to draw that distinction. But will that distinction necessarily change how Warhol gets applied, the way Ross applied it, when we do start dealing with the generative AI defendants?

I don't know. That's one where I don't have a crystal ball, and I'm not even going to give you my personal take; that's too dicey. Here's what I'll tell you. Let's just talk about why it might be a relevant distinction, and not what we call in the legal world a distinction without a difference. Sometimes you have factual differences that don't have any legal bearing, and sometimes they do, so let's talk about why it might matter. Here's an example, and this just goes to the fact that the Ross Intelligence model was not generative AI. What that model allowed you to do was put in an input with a legal research question, asking the AI model, hey, what's the law on X, Y and Z, and it would come back with relevant judicial opinions, which are public domain, and say, here's a judicial opinion that answers the relevant question. The point of the Westlaw headnotes, why Ross used them, was to help train the model to understand the way that people talk, because not everyone talks like a judge in a judicial opinion. People talk more like Westlaw headnotes than they talk like judges in opinions. So they used those Westlaw headnotes to train the AI to talk like a human and not just like a lawyer or a judge. But it never generated new text. Contrast that with something like ChatGPT, which generates original text, and presumably the outputs, it could happen, but that's a different question, the outputs likely don't infringe the underlying materials that were input into the system. Courts in cases involving fair use have looked at things like: is the new use going to create more content that's creative and new and useful, ultimately furthering the constitutional mandate that I mentioned before? The Constitution, Article One, Section Eight, Clause Eight, says that to promote the progress of science and the useful arts, we're going to give authors exclusive rights over their works for limited periods of time. That's what copyright is: giving exclusive rights to authors to promote the progress of science and the arts. One thing that a court might look at is, does generative AI help promote the progress of science and the useful arts by creating all of this output that's useful for society? Another angle on this, which is less about the generative nature and more about the application of a large language model, or of the GPT, for example, that underlies OpenAI's model or an Anthropic equivalent, is what data is used to train those models and make them able to do the extraordinary work that they do. And what the court found relevant in the Ross Intelligence case is that they were using these Westlaw headnotes to train something that ultimately was found to be sort of competitive with Westlaw. And what the court said was, you know, that's material that you could have gotten a license for, right? You could have paid. It's not like you're dealing with that large a body of work; you could have created your own headnotes.
The court said this: Thomson Reuters paid people, authors, which is the way it's supposed to work under copyright, paid them to go and create headnotes, and they get to protect those, and you have to pay for them. And there was really nothing preventing Ross from getting a license from a LexisNexis or a Westlaw, or just hiring authors to create its own headnotes. Now, it's difficult to see that for the LLMs, and an argument that I believe the OpenAIs and Anthropics of the world will try to make in those cases is to say, we couldn't license all of this stuff that is needed to train these large language models. Let's emphasize large; they needed to be really large. The argument, and whether this is going to prevail, again, I'm not going to opine, but I think the argument will be something like: we need all of the language of the world, we need all of the culture of the world, we need everything that's out there, and we couldn't possibly, under any circumstances, license everything that we need in order to teach these models to speak language, the second L of LLM, and to understand culture and understand who we are. And so there was a need, and courts look to see, under the fair use factors, was there really a need for what you were taking here without a license or without permission? Again, it's a novel type of argument, and there's a good rejoinder that I could see the authors' side make, which is: well, just because you can't license it all maybe means you don't get it; it doesn't mean you get to just take it, right? But I think the argument will be, we need everything, and there's no possibility of ever having a licensing model that gets us everything in the world.
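A minimal sketch of the non-generative pattern Jeremy describes: plain-language headnotes act as a bridge between how people phrase questions and how opinions are written, and the tool only retrieves existing public-domain opinions, never generating new text. The data, the names, and the TF-IDF matching below are illustrative assumptions, not Ross's actual architecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (headnote, opinion citation) pairs -- hypothetical stand-in training data.
pairs = [
    ("A contract requires offer, acceptance, and consideration.", "Smith v. Jones"),
    ("Evidence seized without a warrant is generally inadmissible.", "State v. Doe"),
]

headnotes = [h for h, _ in pairs]
vectorizer = TfidfVectorizer().fit(headnotes)
headnote_vecs = vectorizer.transform(headnotes)

def find_opinion(query: str) -> str:
    """Return the citation whose headnote best matches a plain-language query.

    Nothing new is generated: the headnotes only mediate retrieval of
    existing public-domain opinions.
    """
    sims = cosine_similarity(vectorizer.transform([query]), headnote_vecs)
    return pairs[sims.argmax()][1]

print(find_opinion("what do you need for a valid contract?"))  # Smith v. Jones
```

The legally salient feature is visible in the code: the copyrighted headnotes live entirely on the training/matching side, while the output is always pre-existing public-domain material that competes with Westlaw's own product.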

Yeah, that's a really interesting point, and I think it again goes to Ross Intelligence, where they were in some ways vertically focused; they were sort of designed to compete with Thomson Reuters in this area of headnotes. And to the extent that it might have some read-through to the generative AI companies, I also see in some of the complaints the potential for that to be similar. I think some of the visual artists are suing AI models that also output images, images that could be licensed. Also, I think there's a new one by Dow Jones that raises in the complaint at least the argument that you're actually doing this so that you can license news data. So that's where we might see potential read-through. But I think also, as you mentioned, anybody who says they know how this is going to go, there's no possible way. One, you have upwards of forty cases, and they could go very different ways. And two, the question is, who is going to decide fair use? In Ross Intelligence, it was decided at the summary judgment stage, but at first the judge didn't want to decide it at that stage. So again, we're talking about very hard concepts; judges struggle with them, and juries are certainly going to struggle with them if they are handed at least some of these fair use factors. Do you think we're going to see potentially diverging precedent here, some judges deciding, some juries deciding, some circuits taking different stances on this?

I think that is more likely than not. I also think that, in general with jurisprudence, when you have multiple cases that deal with similar but not identical issues, what typically happens is judges try not to reject the reasoning of their brethren and sisters in the other courts. Rather, what they try to do is distinguish their cases on the facts, and there's a lot of ground for doing so in these cases. So if, for example, a judge in the New York Times case against OpenAI, or any of the other cases, even if they don't adopt the exact reasoning of Judge Bibas, who issued this decision in Ross Intelligence, or if it goes up to the Third Circuit Court of Appeals, even if they don't agree with the reasoning, what a judge will try to do first is say, our facts are different and so we come out differently. It's more likely that courts will do that than to disagree head-on with the underlying reasoning. That said, right now we just have a district-level case. I'd be surprised if it doesn't end up on appeal, and there's every reason to believe that a circuit court could disagree with Judge Bibas. Judge Bibas is a circuit judge who's sitting by designation in the district court. And like you said, just a few months ago he came out saying that there was good ground for fair use and that he was going to allow Ross Intelligence to make its fair use case to the jury. And then something changed, and he decided that it wasn't fair use as a matter of law. So these are really close calls. I think it's really likely that this goes up to a circuit. And I think it's very likely you see it in the Ninth Circuit, for example, which tends to be fairly liberal around fair use, and the Second Circuit, where the judges tend to be fairly liberal around technology and fair use. It was the Second Circuit, out of New York, that held that Google Books was fair use, and it's the Ninth Circuit that covers San Francisco and all of the tech companies. That doesn't mean they're going to hold that this is fair use. It does mean that you could end up with circuit splits, and no one would be surprised if this ends up in the Supreme Court. And whether the Supreme Court has the copyright expertise to handle a case like this is a real question, and something that's creating a lot of uncertainty around what's going to happen.

Yeah, absolutely. And with Warhol, I don't know if people were necessarily stunned by the outcome, but I believe it was seven to two, so a fairly sizable majority. And we should point out that IP in general, and copyright in particular, doesn't necessarily have the partisan bent that a lot of other issues do. But still, seven to two was a fairly strong opinion, and again, against fair use in that case. So it's definitely an evolving landscape. One thing that we haven't talked about yet is the input versus output distinction in this debate. You can have infringement based on the input, that is, when the LLM digests, scans, whatever it does to get that information into the system to train on that data, that can potentially be an infringement of an author's rights. Also, potentially, the AI output could infringe. There's also an entirely separate matter, and that's whether an LLM can actually produce copyrightable content, but we're going to leave that aside for now; that's an entirely separate conversation, and we don't necessarily have the litigation to dive into there. But on input versus output, what are your views on that and how it might play out?

Yeah, I mean, it can play out in a few different ways. So one, let's just talk about how the output can play into the questions that really involve the input. The easiest way to think about that is in the cases that have been filed against the LLM developers like OpenAI. There is an effort by authors in some of those cases, including the New York Times, to argue that our articles are being reproduced almost verbatim when users ask certain queries. So a user puts in an input, tell me what happened in this article, and according to the New York Times complaint, ChatGPT will, with certain prompting, output material that the New York Times argues is substantially similar to the articles. Now, if that's true, then it becomes a much easier case for the New York Times, right? Then it's like, well, what's the difference between that and just a pirate website that copies New York Times articles and makes them available for free? If you're able to just go into this thing and reproduce articles in full, there's little doubt that that would be an infringement. OpenAI is really contesting that that's the way the tool is supposed to be used, and arguing that the only way you were able to get ChatGPT to display those outputs was by basically taking the model and beating it with a stick until it came out with the articles that were verbatim copies. It's much easier under copyright to argue that the technology you've created is supplanting the original if the outputs are substantially similar to the inputs, right? That's pretty intuitive. And so the authors are frequently trying to focus on the outputs, and they've tried different theories. Another theory that was tried and rejected came in one of the cases involving images. The argument was: because all of the images that you output from your model are trained on our copyright-protected images, everything that is output from your model is a derivative of our inputs. Even if the output is not substantially similar to any particular input, all of the outputs are derivative works of the originals. A novel argument, but it was already rejected on a motion to dismiss by the court, because the law of copyright is that in order for something to be a derivative work, it has to be substantially similar in some way to the original. And so the court said, no, that's not the way derivative works work. So in many of the cases, the plaintiffs and the authors are having to focus primarily on the inputs and to say that, even if the outputs are not substantially similar, there's still infringement. Let me just say one other category of why the outputs matter, which is a connected question, and one where the defendants, the model developers, are happier to argue because it's better ground for them, which is this. Listen, if you're the user and you're controlling this platform, and you create an output that, because of your prompting and because of your directions, turns out to be infringing, and then, for example, you make some commercial use of that image, right, you create an image, you put it into an advertisement, and you start selling shoes with this infringing image, the model developers would like to say that's kind of on you, that you were the one who engaged in that, and that the model had no volitional conduct.
And part of copyright is, it's not willfulness, it's not a malevolent thing, but you at least have to have some volitional conduct. Like, if you sneeze and cause an accident, often you can get off the hook, because it's an involuntary movement, right? They want to argue that if somebody is using the tool in a way it's not supposed to be used to create some kind of infringing content, then who should be responsible, the user of the platform or the model? And that's another question that hasn't been answered yet by courts.
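To illustrate the output-side regurgitation question discussed above, here is a crude sketch of the kind of near-verbatim check a plaintiff might run: compare a model's output against the original text and flag high overlap. difflib's similarity ratio is a simple stand-in; actual substantial-similarity analysis is a legal judgment, and the article text here is invented.

```python
from difflib import SequenceMatcher

def verbatim_overlap(original: str, output: str) -> float:
    """Return a 0..1 similarity score between two texts."""
    return SequenceMatcher(None, original, output).ratio()

# Hypothetical article and model output for illustration.
article = "The council voted 5-2 on Tuesday to approve the new zoning plan."
model_output = "The council voted 5-2 on Tuesday to approve the zoning plan."

score = verbatim_overlap(article, model_output)
print(f"overlap: {score:.2f}")  # a high score suggests near-verbatim copying
```

This also frames the parties' dispute: plaintiffs would run such checks on outputs produced by ordinary prompts, while the defense argues high scores only appear under adversarial prompting, the "beating it with a stick" scenario.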

So it sounds like potentially there's a contributory infringement issue there.

That's right. Well, that's what the plaintiffs would argue.

Yeah, so that could get even more tricky to wade through. So when you're talking about how the New York Times, I think that's right, in their complaint they're saying these are regurgitations of our articles. And some of the authors' complaints, potentially the Authors Guild's, or maybe the original Silverman complaint, also had that. And it strikes me as interesting because we may also have divergence there, because a news article that is largely a regurgitation of a historical event might have less copyright protection under the second fair use factor than, potentially, a novel. So there are so many different directions this could evolve in; it's going to be really interesting to see how this plays out, probably over the next two years or so. Just really quickly, I want to touch on licensing. You mentioned before that a potential argument will be that we can't license the world, but these companies certainly are licensing a lot. We've seen so many large licensing agreements. One of the largest I can remember is Google with Reddit, for I think something like sixty million dollars a year. The New York Times, of course, has sued, and a lot of other newspaper publishers have sued, but a lot of them have also just agreed to licensing agreements, whether with OpenAI, whether with Google, whether with some of these other LLM developers. So how does that potentially cut against an LLM on fair use, if they can go out and license a lot of this content? Or does it maybe not impact it? Or, again, who knows?

Yeah. So, just going back to the Google Books case that we talked about in the beginning, one of the challenges in that case for the authors was proving that there was a licensing market for the uses that were being made by Google Books, and the court picked up on that challenge. On that fourth factor, the harm to the market for the copyrighted work, the court found that there was no market for licensing books for the purpose of creating a search index. No one had ever paid an author, I want to use your book so that I can make it searchable on the Internet, without displaying the book; people won't be able to read the book, you'll just be able to find where the information lives inside of a book. No one had ever paid for that, and so the court had difficulty seeing how that was harming the economic interests of the authors' copyrights. Okay, contrast that with what you just talked about, where you have a fledgling licensing market for using copyright-protected works to help train AIs. And when you have the AI companies that are trying to defend themselves with fair use, are they cutting off their nose to spite their face by entering into these agreements and creating a market that didn't really exist before? That's a novel market that now exists, and now the plaintiffs in those cases have fodder to argue that there's a market harm, that there's a cognizable, developing, and evidenced licensing market. The question is whether that will really undercut the argument I was talking about, that you just raised, which is, we need to digitize the entire world. Because, number one, just because we can license some of the materials from certain players, that's a far cry from the entirety of the world's knowledge, right? We still couldn't license nearly enough data to train all these models to the point where they are. And you have new players in the market like DeepSeek, which had originally said something like, we don't need to use that much training data, but it was kind of an artificial argument, because they were able to do that based on models that had already been developed using these extraordinarily massive amounts of training data. So there's that. The other important argument that I imagine will come from the AI platforms is: you're comparing apples and oranges. When we enter into these license agreements, in many cases what we're paying for is not just the underlying raw data; we're paying for other things that are more valuable to us. For example, you're giving us access to archived materials that are not openly available online. You're giving us access to materials that are behind a paywall. You're also allowing us, in many instances, to display the outputs. Some of these agreements that are out there, and I've worked on many of them, I see many of them, say things like, you can display X number of words from the article or from the original book and display that to users in response to searches, provided that there's a link back, et cetera. Sometimes a link, sometimes not. And those are rights and privileges that are not part of what the AIs are doing.
So they would say it's not really fair to look to those licensing markets and those agreements as evidence that there's a licensing market, because this is a totally different sort of arrangement and that's what we're paying for. Again, I'm not calling balls and strikes here. I'm just saying that these are some of the arguments that I think will be made, and some of the distinctions that parties will attempt to draw, and that could sway a court one way or the other.

Absolutely. And I think it also shows the appetite to train these models on things like news content and things like novels. I think I was reading some executive at one of these companies saying there's no better way to train an LLM than on novels, seeing how people speak. It's just some of the best data they can get to train these models on. And it'll be interesting to see how these cases shape up.

What's encouraging, because look, I'm somebody who, certainly in this context, can see both sides, but I certainly feel an affinity towards authors and artists and creators, right? I think that copyright does exist to support their work and their effort, and they should be paid for their work and their effort. I think there's a lot of doom and gloom, and fear, around what's going to happen with these models, and that's appropriate and understandable; I totally get it. What's also nice to see is that I've seen authors get paid, all of a sudden get a check for, like, twenty-five hundred bucks for a book that hasn't sold in many, many years, because some publishers are entering into deals, and authors are getting a check for a few thousand bucks for a work that's been licensed to train an AI or for some other kind of use. So that to me is also encouraging. From a business standpoint, you really see the development of these new markets, and that to me is exciting and encouraging. Again, I know there's a lot of anger and fear and litigation over the broader issues, but it is nice to see authors getting paid, including in connection with these new markets.

Sure. And overall, when I look at the markets, my feeling has been that they are to some extent underestimating just how much of an overhang this could be for large language models. And I think our discussion today touched on just how tricky this situation is, how fair use can go one way in one court and another way in another: in one court it may certainly be fair use, in another maybe not, a distinction is drawn. I think we're going to see some more summary judgment decisions ripen in twenty twenty-five. I believe one of the Meta cases is being briefed in the earlier part of this year, and I think there's a case with Judge Alsup out in California with Anthropic, where I think they're scheduled to do both summary judgment on fair use and class certification later in the year. So I think it's possible we start to get some clarity on this in twenty twenty-five, maybe moving into twenty twenty-six, but it's going to be an absolutely fascinating area to watch over the next few years. So I think we're going to leave it there. We've covered a lot of ground. Jeremy, I really appreciate you coming on today and all of your invaluable insights. It is really good to hear, from a practitioner's perspective, that I'm not wrong that these are challenging issues. It's going to be very interesting to see how this plays out. Jeremy, thanks so much.

Thanks, Tamlin. Anytime.
