The current state of the theory that GPL propagates to AI models
shujisado.org | 166 points by jonymo 11 hours ago
Great article but I don't really agree with their take on GPL regarding this paragraph:
> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that they are proceeding in a different vector from the direction of code sharing idealized by GPL. If only the theory of GPL propagation to models walks alone, in reality, only data exclusion and closing off to avoid litigation risks will progress, and there is a fear that it will not lead to the expansion of free software culture.
The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct to ensure the software is not stolen from its users. If you just want your code to be shared and used without restrictions, use MIT or some other license.
> What is important is how to realize the “freedom of software,” which is the philosophy of open source
Freedom of software means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article.)
I also don't agree with the argument that since a lot of things are included in the model, the GPL code is only a small part of the whole, and that means it's okay. Well, if I take one GPL function and include it in my project, no matter its size, I would have to license it as GPL. Where is the line? Why would my software which only contains a single function not be fair use?
> The virality is a byproduct to ensure the software is not stolen from its users.
If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.
> Freedom of software means nothing.
Software is information. Does "freedom of information" mean nothing? I think you're narrowing concepts here into something not particularly useful or reflective of reality.
> Users get the freedom to enjoy the software how they like.
The freedom is to modify the code for my own purposes. This is not at all required to plainly "enjoy" the software. I instead "enjoy a particular benefit."
> Why would my software which only contains a single function not be fair use?
Because fair use implies educational, informational, or transformational outputs. Your software is none of those things.
As a user I suffer from not being able to freely use or derive my own work from Microsoft’s
This. People conflate consumer with user. A user in the sense of the GPL is a programmer or technical person for whom the software (including source) is intended.
Not necessarily a “user of an app” but a user of this “suite of source code”.
> The spirit of the GPL is the freedom of the user, not the code being freely shared.
who do you mean by "user"?
the spirit is that the person who actually uses the software also has the freedom to modify it, and that users receiving these modifications have the same rights.
is that what you meant?
and while technically that's the spirit of the GPL, the license is not only about users, but about a _relationship_, that of the user and the software and what the user is allowed to do with the software.
it thus makes sense to talk about "software freedom".
last but not least, about a single GPL function --- many GPL _libraries_ are licensed less restrictively, under the LGPL.
I don't think you understand the GPL.
> "the user is allowed to do with the software"
The GPL does not restrict what the user does with the software.
It can be USED for anything.
But it does restrict how you redistribute it. You have responsibilities if you redistribute it. You must provide the source code, and pass on the same freedoms you received to the users you redistribute it to.
Thinking on though, if the models are trained on any GPL code then one could consider that they contain that GPL code, and are constantly and continually updating and modifying that code, thus everything the model subsequently outputs and distributes should come under the GPL too. It’s far from sufficient that, say, OpenAI have a page on their website to redistribute the code they consume in their models if such code becomes part of the model’s training data that is resident in memory every time it produces new code for users. In the spirit of the GPL all that derivative code seems to also come under the GPL, and has to be made available for free, even if upon every request the generated code is somehow novel or unique to that user.
The GPL arose from Stallman's frustration at not having access to the source code for a printer driver that was causing him grief.
In a world where he could have just said "Please create a PDP-whatever driver for an IBM-whatever printer," there never would have been a GPL. In that sense AI represents the fulfillment of his vision, not a refutation or violation.
I'd be surprised if he saw it that way, of course.
The safeguards will prevent the AI from reproducing the proprietary drivers for the IBM-whatever printer, and it will not provide code that breaks the DRM that exists to prevent third-party drivers from working with the printer. There will, however, be no such safeguards or filters to prevent IBM from writing a proprietary driver for their next printer, using existing GPL drivers as a building block.
Code will only ever go in one direction here.
Then we'd better stop fighting against AI, and start fighting against so-called "safeguards."
I wish you luck. The music industry basically won their fight in forcing safeguards against AI music. The film industry is gaining laws regulating AI film actors. The code-generating AIs are only training on freely accessible code and not proprietary code. There are multiple laws being made against AI porn all over the world (or possibly already on the books).
What we should fight is Rules For Thee but Not for Me.
But that isn't the same code that you were running before. And like, let's not forget GPLv3: "please give me the code for a mobile OS that could run on an iPhone" does not in any way help me modify the code running on MY iPhone.
Sure it does. Just tell the model to change whatever you want changed. You won't need access to the high-level code, any more than you need access to the CPU's microcode now.
We're a few years away from that, but it will happen unless someone powerful blocks it.
Genuine question: if I train my model with copyleft material, how do you prove I did?
Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.
I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.
By reverse inference. We can determine what content a pathway has been trained on. We can find out if it’s been trained on GPL material.
Sometimes, LLMs actually generate copyright headers as well in their output - lol - like in this PR, which was the subject of a recent HN post [1]
https://github.com/ocaml/ocaml/pull/14369/files#diff-062dbbe...
I once had a well-known LLM reproduce pretty much an entire file from a well-known React library verbatim.
I was writing code in an unrelated programming language at the time, and the bizarre inclusion of that particular file in the output was presumably because the name of the library was very similar to a keyword I was using in my existing code, but this experience did not fill me with confidence about the abilities of contemporary AI. ;-)
However, it did clearly demonstrate that LLMs with billions or even trillions of parameters certainly can embed enough information to reproduce some of the material they were trained on verbatim or very close to it.
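A quick mechanical check for this kind of thing is a verbatim-overlap comparison. Here is a minimal sketch in Python, assuming you have the model's completion and the original library file saved locally; the file names and the 200-character cutoff are illustrative assumptions, not any standard tooling:

    # Flag long verbatim overlaps between a saved model completion and a known
    # source file. File names and the 200-character cutoff are illustrative.
    from difflib import SequenceMatcher

    def longest_verbatim_overlap(model_output: str, original: str) -> str:
        """Return the longest contiguous substring shared by both texts."""
        matcher = SequenceMatcher(None, model_output, original, autojunk=False)
        m = matcher.find_longest_match(0, len(model_output), 0, len(original))
        return model_output[m.a : m.a + m.size]

    if __name__ == "__main__":
        with open("model_completion.txt") as f:      # hypothetical: saved LLM output
            completion = f.read()
        with open("original_library_file.js") as f:  # hypothetical: the original file
            original = f.read()

        overlap = longest_verbatim_overlap(completion, original)
        if len(overlap) > 200:  # arbitrary cutoff for "suspiciously verbatim"
            print(f"{len(overlap)}-character verbatim overlap found:")
            print(overlap[:500])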
So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.
There is a stupid presupposition that LLMs are equivalent to human brains which they clearly are not. Stateless token generators are OBVIOUSLY not like human brains even if you somehow contort the definition of intelligence to include them
Even if they are not "like" human brains in some sense, are they "like" brains enough to be counted similarly in a legal environment? Can you articulate the difference as something other than meat parochialism, which strikes me as arbitrary?
All law is arbitrary. Intellectual property law perhaps most of all.
Famously, the output from monkey "artists" was found to be non-copyrightable even though a monkey's brain is much more similar to ours than an LLM.
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
If IP law is arbitrary, we get to choose between IP law that makes LLMs propagate the GPL and law that doesn't. It's a policy switch we can toggle whenever we want. Why would anyone want the propagates-GPL option when this setting would make LLMs much less useful for basically zero economic benefit? That's the legal "policy setting" you choose when you basically want to stall AI progress, and it's not going to stall China's progress.
The question was "if I train my model with copyleft material, how do you prove I did?"
not your brain, but the code you produce if it includes portions of GPL code that you remembered.
> So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.
Your brain is part of you. Some might say it is your very essence. You are human. Humans have inalienable rights that sometimes trump those enshrined by copyright. One such right is the right to remember things you've read. LLMs are not human, and thus don't enjoy such rights.
Moreover, your brain is not distributed to other people. It's more like a storage medium than a distribution. There is a lot less furore about LLMs that are just storage mediums, and where they themselves or their outputs are not distributed. They're obviously not very useful.
So your analogy is poor.
> Genuine question: if I train my model with copyleft material, how do you prove I did?
An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?
In other words, even if your model was trained strictly on copyleft material, but properly prompted outputs a copyrighted work is it copyright infringement and if so by whom?
Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots". Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?
> even if your model was trained strictly on copyleft material
That's not legal use of the material according to most copyleft licenses. Regardless if you end up trying to reproduce it. It's also quite immoral if technically-strictly-speaking-maybe-not-unlawful.
> That's not legal use of the material according to most copyleft licenses.
That probably doesn't matter given the current rulings that training an AI model on otherwise legally acquired material is "fair use", because the copyleft license inherently only has power because of copyright.
I'm sure at some point we'll see litigation over a case where someone attempts to make "not using the material to train AI" a term of the sales contract for something, but my guess would be that if that went anywhere it would be on the back of contract law, not copyright law.
> Genuine question: if I train my model with copyleft material, how do you prove I did?
It may produce it when asked
https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...
> Genuine question: if I train my model with copyleft material, how do you prove I did?
discovery via lawyers
You need low level access to the AI in question, and a lot of compute, but for most AI types, you can infer whether a given data fragment was in the training set.
It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.
Now, would that be enough to put the entire AI under GPL? I doubt it.
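For the curious, one common heuristic of this kind is loss-based membership inference: text the model was trained on (especially text repeated many times across the dataset) tends to receive unusually low per-token loss compared to similar text it has never seen. A rough sketch, assuming white-box access to a Hugging Face causal LM; the model name, file names, and naive single-control comparison are illustrative, and serious membership-inference work uses much stronger statistical baselines:

    # Perplexity-based membership heuristic: training-set text (especially text
    # repeated many times) tends to get unusually low loss. Model name, file
    # names, and the single-control comparison are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"  # stand-in for the model under investigation

    def mean_token_loss(text: str, model, tokenizer) -> float:
        """Average negative log-likelihood the model assigns to `text`."""
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        return out.loss.item()

    if __name__ == "__main__":
        tokenizer = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

        candidate = open("suspected_gpl_snippet.c").read()  # hypothetical: fragment suspected to be in training data
        control = open("freshly_written_snippet.c").read()  # hypothetical: comparable text the model cannot have seen

        loss_candidate = mean_token_loss(candidate, model, tokenizer)
        loss_control = mean_token_loss(control, model, tokenizer)
        # A candidate loss far below the control's is (weak) evidence of memorization.
        print(f"candidate loss={loss_candidate:.3f}  control loss={loss_control:.3f}")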
I've thought about this as well, especially for the case when it's a company owned product that is AGPLed. It's a really tough situation, because the last thing we want is competitors to come in and LLM wash our code to benefit their own product. I think this is a real risk.
On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.
At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee of the users' freedom is more important to me than a theoretical threat. The one exception is anything that is truly at risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.
I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.
It's why I stopped contributing to open source work. It's pretty clear in the age of LLMs that this breach of the license under which it is written will be allowed to continue and that open source code will be turned into commercial products.
Maybe we should require training data to be published, or at least referenced.
There's the other side of this issue. The current position of the U.S. Copyright Office is that AI output is not copyrightable, because the Constitution's copyright clause only protects human authors. This is consistent with the US position that databases and lists are not copyrightable.[1]
Trump is trying to fire the head of the U.S. Copyright Office, but they work for the Library of Congress, not the executive branch, so that didn't work.[2]
[1] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
[2] https://apnews.com/article/trump-supreme-court-copyright-off...
> Should I keep open sourcing my code now that the licence doesn't matter anymore?
your LICENSE matters in similar ways to how it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but that doesn't cover all cases at all times. Do not despair!
> Genuine question: if I train my model with copyleft material, how do you prove I did?
The burden is on you to prove that you didn't.
genuine question: why are you training your model on content whose requirements will explicitly be violated if you do?
https://www.penny-arcade.com/comic/2024/01/19/fypm
Anything you produce will be consumed and regurgitated by the machine. It's a personal question for everyone whether you choose to keep providing grist for their mills.
The article goes deep into these two cases deemed most relevant, but really there is a wide swath of similar cases, all focused on drawing sharper borders than ever around what is essentially the question "exactly when does it become copyright violation?", with plenty of seemingly "obvious" answers that quickly conflict with each other.
I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I'd also not be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of the domain.
Not a lawyer, just excited to see the outcomes :).
Ideally, Congress would just settle this basket of copyright concerns, as they explicitly have the power to do—and have done so repeatedly in the specific context of computers and software.
I've pitched this idea before but my pie in the sky hope is to settle most of this with something like a huge rollback of copyright terms, to something like 10 or 15 years initially. You can get one doubling of that by submitting your work to an official "library of congress" data set which will be used to produce common, clean, and open models that are available to anyone for a nominal fee and prevent any copyright claims against the output of those models. The money from the model fees is used to pay royalties to people with materials in the data set over time, with payouts based on recency and quantity of material, and an absolute cap to discourage flooding the data sets to game the payments.
This solution to me amounts to an "everybody wins" situation, where producers of material are compensated, model trainers and companies can get clean, reliable data sets without having to waste time and energy scraping and digitizing it themselves, and model users can have access to a number of known "safe" models. At the same time, people not interested in "allowing" their works to be used to train AIs and people not interested in only using the public data sets can each choose to not participate in this system, and then individually resolve their copyright disputes as normal.
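To make the mechanism concrete, here is a toy sketch of how such a royalty split could be computed; the weights, recency decay, and per-contributor cap are invented numbers purely for illustration, not part of the proposal:

    # Toy royalty split for a hypothetical "library of congress" training dataset.
    # All weights, the recency half-life, and the per-contributor cap are invented
    # numbers, only there to illustrate the mechanism described above.
    from dataclasses import dataclass

    @dataclass
    class Contribution:
        author: str
        kb_of_material: float           # quantity of submitted material
        years_since_submission: float   # used for the recency weighting

    def royalty_shares(contribs, pool_dollars, cap_fraction=0.05, half_life_years=5.0):
        """Split a fee pool by quantity, decayed by recency, with an absolute cap."""
        def weight(c):
            recency = 0.5 ** (c.years_since_submission / half_life_years)
            return c.kb_of_material * recency

        weights = {c.author: weight(c) for c in contribs}
        total = sum(weights.values()) or 1.0
        cap = cap_fraction * pool_dollars  # cap discourages flooding the dataset
        return {a: min(pool_dollars * w / total, cap) for a, w in weights.items()}

    if __name__ == "__main__":
        contribs = [
            Contribution("alice", kb_of_material=500, years_since_submission=1),
            Contribution("bob", kb_of_material=50_000, years_since_submission=8),
        ]
        print(royalty_shares(contribs, pool_dollars=1_000_000))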
What is ideal about getting more shitty laws written at the behest of massive tech companies? Do you think the DMCA is a good thing?
As opposed to waiting for uncertain court cases (based on the existing shitty laws) to play out for years, ultimately decided by unelected judges?
Democracy is the worst system we’ve tried, except for all the others.
(Also: The GPL can only be enforced because of laws passed by Congress in the late ‘70’s and early ‘80’s. And believe you me, people said all the same kinds of things about those clowns in Congress. Plus ça change…)
Courts applying legal analysis to existing law and precedent is also an operation of democracy in action and lately they've been a lot better at it than legislators. I don't know if you've noticed, but the quality of our legislators has substantially deteriorated since the 80s, when 24-hour news networks became a thing. It got even worse after the Citizens United decision and social media became a thing. "No new laws" is really the safest path these days.
The DMCA isn't intrinsically about copyright. It's a questionable attempt at a safe-harbor provision, with terms that are ripe for abuse. I'm not even of the opinion that copyright for computer software is poorly executed. It's mostly software patents that don't make any sense to me, where a concept that essentially every mathematics undergrad is familiar with gets labels slapped on it & is called a novel technique. It's made worse by the fact that the patent office itself isn't equipped to perform any real review. There is no shortage of impossible devices patented each year in the perpetual-motion category.
I honestly think that the most extreme take that "any output of an LLM falls under all the copyright of all its training data" is not really defensible, especially when contrasted with human learning, and would be curious to hear conflicting opinions.
My view is that copyright in general is a pretty abstract and artificial concept; thus the corresponding regulation needs to justify itself by being useful, i.e. encouraging and rewarding content creation.
/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up at a small handful of the most successful producers-- this effectively externalizes large portions of the cost of "raising" artists.
I view AI overlap under the same lens-- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible) then law/interpretation simply has to be changed.
> I view AI overlap under the same lens-- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible) then law/interpretation simply has to be changed
Not sure about undesirable, I so wish we could just ban all generative AI.
I feel profound sadness at having lost the world we had before generative AI became widespread. I really loved programming, and seeing my trade devalued with vibe coding is just heartbreaking. We will see mass unemployment, deep fakes, more AI-induced psychosis, a devaluing of human art. I hate this new world.
It would be the morally correct thing to ban generative AI, as it only benefits corporations and doesn't improve people's lives but makes them worse.
The training of the big LLMs has been criminal. Whether we talk about GPL-licensed code or the millions of artists who never released their work under a specific license and would never have consented to it being used for training.
I still think states will allow it and legalize the crime because they believe that AI offer competitive advantages and they will fear "falling behind". Plus military use.
Anyone can very easily avoid training on GPL code. Yes, the model might not be as strong as one that is trained that way and released under the terms of the GPL, but to me that sounds like quite a good outcome if the best models are open source/open weight.
Its all about whose outcomes are optimized.
Of course, the law generally favors consideration of the outcomes for the massive corporations donating hundreds of millions of dollars to legislature campaigns.
Would it even actually help to go down that road though? IMO the expected outcome would simply be that AI training stalls for a bit while "unencumbered" training material is being collected/built up and you achieve basically nothing in the end, except creating a big ongoing logistical/administrative hassle to keep lawyers/bureaucrats fed.
I think the redistribution effect (towards training material providers) from such a scenario would be marginal at best, especially long-term, and even that might be over-optimistic.
I also dislike that stance because it seems obviously inconsistent to me-- if humans are allowed to train on copyrighted material without their output being generally affected, why not machines?
Human learning is materially different from LLM training. They're similar in that both involve providing input to a system that can, afterwards, produce output sharing certain statistical regularities with the input, including rote recital in some cases – but the similarities end there.
>Human learning is materially different from LLM training [...] but the similarities end there.
Specifically what "material differences" are there? The only arguments I've heard are around human exceptionalism (eg. "brains are different, because... they just are, ok?"), or giving humans a pass because they're not evil corporations.
Why? I'm pretty sure I can learn the lyrics of a song, and probabilistically output them in response to a prompt.
Is the existence of my brain copyright infringement?
The main difference I see (apart from that I bullshit way less than LLMs), is that I can't learn nearly as much as an LLM and I can't talk to 100k people at once 24/7.
I think the real answer here is that AI is a totally new kind of copying, and it's useful enough that laws are going to have to change to accommodate that. What country is going to shoot itself in the foot so much by essentially banning AI, just so it can feel smug about keeping its 20th century copyright laws?
Maybe that will change when you can just type "generate a feature length Pixar blockbuster hit", but I don't see that happening for quite a long time.
The article repeatedly treats license and contract as though they are the same, even though the sidebar links to a post that discusses the difference.
A lot of it boils down to whether training an LLM is a breach of copyright of the training materials which is not specific to GPL or open source.
And the current norm that the trillion dollar companies have lobbied for is that you can train on copyrighted material all you want so that's the reality we are living in. Everything ever published is all theirs.
>And the current norm that the trillion dollar companies have lobbied for is that you can train on copyrighted material all you want so that's the reality we are living in. Everything ever published is all theirs.
What "lobbied"? Copyright law hasn't materially changed since AI got popular, so I'm not sure where these lobbying efforts are showing up in. If anything the companies that have lobbied hard in the past (eg. media companies) are opposed to the current status quo, which seems to favor AI companies.
I am really surprised that media businesses, which are extremely influential around the world, have not pushed back against this more. I wonder whether they are looking at the cost savings they will get from the technology as a worthwhile trade-off.
They're busy trying to profit from it rushing to enter into licensing agreements with the LLM vendors.
Yeah, the short term win is to enter a licensing agreement so you get some cash for a couple years, meanwhile pray someone else with more money fights the legal battle to try and set a precedent for you
Several media companies have sued OpenAI already. So far, none have been successful.
All theirs, if they properly obtained the copy.
This is a big difference that has already bitten them.
In practice it wouldn't matter a whit if they lobbied for it or not.
Lobbying is for people trying to stop them; externalities are for the little people.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.
Once training is established as fair use, it doesn't really matter if the license is MIT, GPL, or a proprietary one.
fair use only applies in the united states (and Poland, and a very limited set of others)
https://en.wikipedia.org/wiki/Fair_use#/media/File:Fair_use_...
and it is certainly not part of the Berne Convention
in almost every country in the world even timeshifting using your VCR and ripping your own CDs is copyright infringement
Most commonwealth countries have fair dealing, which is similar although slightly different https://en.wikipedia.org/wiki/Fair_dealing
importantly "fair dealing" has no concept of "transformation"
(which is the linch-pin of the sloppers)
Great, so the US and China can duke it out trying to create AGI or whatever, whereas most other countries are stuck in the past because of their copyright laws?
France and most of Europe have fair use (https://fr.wikipedia.org/wiki/Copie_priv%C3%A9e) but also have a mandatory tax on every sold medium that can do storage, to recover the "lost fees" due to fair use
> To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.
Is this legally settled?
That is just the sort of point I am trying to make. That is a copyright law issue, not a contractual one. If the GPL is a contract then you are in breach of contract regardless of fair use or equivalents.
It's not specific to open source but it's most clearly enforceable with open source as there will be many contributors from many jurisdictions with the one unifying factor being they all made their copyright available under the same license terms.
With proprietary or, more importantly, single-owner code, it's far easier for this to end up in a settlement rather than being dragged out into an actual ruling, enforcement action, and establishment of precedent.
That's the key detail. It's not specific to GPL or open source, but if you want to see these orgs held to account and some precedent established, focusing on GPL and FOSS-licensed code is the clearest path to that.
A GPL license is a contract in most other countries. Just not US probably.
That part of the article is about US cases, so it's US law that applies.
> A GPL license is a contract in most other countries. Just not US probably.
Not just the US. It may vary with the version of the GPL too. Wikipedia claims it's a civil law vs common law country difference - not sure the citation shows that though.
We need a new license that forbids all training. That is the only way to stop big corporations from doing this.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.
It depends on the license terms: if the license that allowed you to get it legally required you to agree to those terms, then using it for that purpose would not be legal.
But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.
It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.
That isn't even remotely a sensible analogy. Equating copyright violation with stealing physical property is an extremely failed metaphor.
One of the craziest experiences in this "post AI" world is to see how quickly a lot of people in the "information wants to be free" or "hell yes I would download a car" crowds pivoted to "stop downloading my car, just because its on a public and openly available website doesn't make it free"
Maybe you have some legalistic point that escapes comprehension, but I certainly consider my house to be very much private and the internet public.
I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.
Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?
Fair use was for citing and so on not for ripping off 100% of the content.
Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.
This principle is also explicitly declared in US law:
> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)
https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...
Re-encoding a video file doesn't get rid of the copyright, therefore doing some automatic processing on copyrighted material doesn't remove the copyright either.
The problem is that openai has too much money. But if I did what they are doing I'd get into massive legal troubles.
So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.
Fair use doesn’t need a license, so it doesn’t matter what you put in the license.
Generally speaking licenses give rights (they literally grant license). They can’t take rights away, only the legislature can do that.
Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.
Taken to an extreme:
"Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."
It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.
By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.
Isn’t it the very reason why we need cleanroom software engineering:
https://en.wikipedia.org/wiki/Cleanroom_software_engineering
If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code, and produce something similar is a grey area.
Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.
> And a human artist doesn't need to steal millions of pictures to learn to draw.
They... do? Not just pictures, but also real life data, which is a lot more data than an average modern ML system has. An average artist has probably seen- stolen millions of pictures from their social media feeds over their lifetime.
Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.
Wouldn't it be still legal to train on the data due to fair use?
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
Honest question: why don’t you think it is fair use?
I can see how it pushes the boundary, but I can't lay out the logic that it's not. The code has been published for the public to see. I'm always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
Some projects like Wine forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.
I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later, another, separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English-text loophole was safe enough.
[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
> Who can't contribute to Wine?
> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.
> I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later, another, separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English-text loophole was safe enough.
This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with an AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn't contain the legacy code (because that would pollute the new result). The analogy isn't too apt, though, since there is a difference between having something in your context (which you can control and is very targeted) and the code that the model was trained on (which all AI instances will share unless you use different models, and which anyway isn't supposed to be targeted).
Before LLMs, programmers had a pretty good intuition about what the GPL license allowed. It is of course clear that you cannot release a closed source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, but this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?
> this is pretty much what LLMs are doing
I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?
Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.
lots of people on this board are philosophically opposed to them so it was a reasonable question, especially in light of your description of them
The fair use prong that's problematic is that the fair use can't decimate the value of the original work. It's the difference between me imitating your art style for a personal project and me making 1,000,000 copies of your art so that your art isn't worth much anymore. One is a fair use, the other is exploitative extraction
Just corporations, their shills, and people who think llms are god's gift to humanity disagree with you.
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
It isn't that difficult: a license that restricts how the program is used is a non-free software license.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
Yet the GPL imposes requirements for me and we consider it free software.
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.
Running the program and analyzing the source code are two different things...?
In the context of Free Software, yes. Freedom one is about the right to study a program.
But training an AI on a text is not running it.
And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply. As it stands, the courts have found training an AI model to be a sufficiently transformative action and fair use which means the resulting output of that training is not a "copy" for the terms of copyright law.
My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights
In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.
Not sure why the FSF or any other organization hasn't released a license like this years ago already.
Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that you can remove them, they would be considered "further restrictions" and can be removed (see section 7 of the GNU GPL version 3).
Freedom 0 is not violated. GPL includes restrictions for how you can use the software, yet it's still open source.
You can do whatever you want with the software, BUT you must do a few things. For GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source".
Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.
That is not really correct; the GNU GPL doesn't have any terms whatsoever on how you can use, or modify, the program to do things. You're free to make a GNU GPL program do anything (i.e., use).
I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.
> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
"A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.
I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.
Edit: I know it may not be the best section (the one after regarding non-source forms could be better) but in spirit, it's exactly the same imo as GPL forcing you to keep the GPL license on the work
I think maybe you're mixing up distribution and running a program, at least taking your initial comment into account, "if you train/run/use a model, it must be open source".