AI assistance when contributing to the Linux kernel
github.com | 194 points by hmokiguess 9 hours ago
Basically the rules are that you can use AI, but you take full responsibility for your commits and code must satisfy the license.
That's... refreshingly normal? Surely something most people acting in good faith can get behind.
I agree this is very sane and boring. What is insane is that they have to state this in the first place.
I am not against AI coding in general. But there are too many people "contributing" AI-generated code to open source projects, even when they don't understand what's going on in their own code, just so they can say on their resumes that they contributed to a big open source project once. And when the maintainers call them out, they blame it on the AI coding tools they are using, as if they weren't opening PRs under their own names. I can't blame any open source maintainer for being at least a little sceptical when it comes to AI-generated contributions.
But then, if AI output is not under the GNU General Public License, how can it become so just because a Linux developer adds it to the codebase?
AIs are not human, and therefore their output is not a human-authored contribution, and only human-authored things are covered by copyright. The work might hypothetically infringe on other people's copyright. But such an infringement does not happen until a human decides to create and distribute a work that somehow incorporates that generated code or text.
The solution documented here seems very pragmatic. You as a contributor simply state that you are making the contribution under the GPLv2 and that you are not infringing on other people's work with it. And you document the fact that you used AI, for transparency.
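The mechanism for that first statement already exists: the kernel's Developer Certificate of Origin is asserted through the Signed-off-by: trailer, which git commit -s adds automatically. A minimal sketch in a throwaway repository (name, email, and file contents are placeholder values):

```shell
# Throwaway repo to demonstrate the DCO sign-off trailer (placeholder identity).
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.name "Jane Hacker"
git config user.email "jane@example.com"

echo 'int x;' > new.c
git add new.c
# -s appends a "Signed-off-by: Jane Hacker <jane@example.com>" trailer,
# the contributor's assertion under the Developer Certificate of Origin.
git commit -q -s -m "demo: add new.c"

git log -1 --format=%B   # the message now ends with the Signed-off-by trailer
```

The sign-off is the contributor, not the tool, certifying the right to submit the work under the project's license; any note about AI assistance goes in the commit message body alongside it.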
There is a lot of legal murkiness around how training data is handled, around the output of the models, and even around the models themselves. Is distributing something that in no way resembles a copyrighted work (i.e. a model) actually distributing that work? The legal arguments here will probably take a long time to settle, but it seems the fair use concept offers a way out: you might create potentially infringing work with a model that may or may not be covered by fair use, but that would be your decision.
For small contributions to the Linux kernel it would be hard to argue that a passing resemblance between, say, a for loop in the contribution and some for loop in somebody else's code base is anything other than coincidence or fair use.
That you can't copyright the AI's output (in the US, at least) doesn't imply it doesn't contain copyrighted material. If you generate an image of a Disney character, Disney still owns the copyright to that character.
Yes. And that’s why the rules say that the human submitting the code is responsible for preventing this case.
IANAL; this is only my limited understanding of the matter. With that caveat: it is easy to forget that copyright attaches to output: verbatim or near-exact reproductions and derivatives of a covered work are already covered under copyright.
So if the AI outputs Starry Night, or Starry Night in a different color theme, that would likely be infringement if the work were still under copyright (Starry Night itself is long out of copyright), and the rights holder would have recourse against someone, either the user or the AI provider.
But a starry-night style picture of an aquarium might not be infringing at all.
>For small contributions to the Linux kernel it would be hard to argue that a passing resemblance of say a for loop in the contribution to some for loop in somebody else's code base would be anything else than coincidence or fair use.
I would argue that if it was a verbatim reproduction of a copyrighted piece of software, that would likely be infringing. But if it was similar only in style, with different function names and structure, probably not infringing.
Folks will argue that some things might be too small to do any differently, for example a tiny snippet like Python's print("hello"), or 1+1=2, or a for loop as in your example. In that case it's too lacking in original expression to qualify for copyright protection anyway.
>AIs are not human and therefore their output is a human authored contribution and only human authored things are covered by copyright.
That is a non sequitur. Also, I'm not sure whether copyright applies to humans or to persons (not that I have encountered particularly creative corporations, but Taranaki Maunga has been known for large-scale decorative works).
Copyright applies to legal persons, that's why corporations can have copyright at all.
Didn't a court in the US declare that AI generated content cannot be copyrighted? I think that could be a problem for AI generated code. Fine for projects with an MIT/BSD license I suppose, but GPL relies on copyright.
However, if the code has been slightly changed by a human, it can be copyrighted again. I think.
Thaler v. Perlmutter said that an AI system cannot be listed as the sole author of a work - copyright requires a human author.
US Copyright Office guidance in 2023 said work created with the help of AI can be registered as long as there is "sufficient human creative input". I don't believe that has ever been qualified with respect to code, but my instinct is that the way most people use coding agents (especially for something like kernel development) would qualify.
Interesting. That seems to suggest that one would need to retain the prompts in order to pursue copyright claims if a defendant can cast enough doubt on human authorship.
Though I guess such a suit is unlikely if the defendant could just AI wash the work in the first place.
No, a court did not declare that. The case involved a person trying to register a work with only the AI system listed as author. The court decided that you can't do that: you need to list a human being as author to register a work with the Copyright Office. This stems from existing precedent where someone tried to register a photograph with the monkey that took it listed as author.
I don't believe the idea that humans can or can't claim copyright over AI-authored works has been tested. The Copyright Office says your prompt doesn't count and you need some human-authored element in the final work. We'll have to see.
It's almost a certainty that you can't copyright code that was generated entirely by an AI.
Copyright requires some amount of human originality. You could copyright the prompt, and if you modify the generated code you can claim copyright on your modifications.
The closest applicable case would be the monkey selfie.
https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
It's almost certain that you're wrong. It's like saying I can't copyright a song if my modular synthesizer generated it. Why would you think this?
I’m curious to see if subscription vs free ends up mattering here. If it is a work for hire, generally it doesn’t matter how the work was produced, the end result is mine, because I contracted and instructed (prompted?) someone to do it for me. So will the copyright office decide it cares if I paid for the AI tool explicitly?
> Didn't a court in the US declare that AI generated content cannot be copyrighted?
No, my understanding is that AI generated content can't be copyrighted by the AI. A human can still copyright it, however.
It's obvious that a computer program cannot have copyright because computer programs are not persons in any currently existing jurisdiction.
Whether a person can claim copyright of the output of a computer program is generally understood as depending on whether there was sufficient creative effort from said person, and it doesn't really matter whether the program is Photoshop or ChatGPT.
Just thinking out loud... why can't an algorithm be an artificial person in the legal sense that a corporation is? Why not legally incorporate the AI as a corporation so it can operate in the real world: have accounts, create and hold copyrights...
Corporations are required to have human directors with full operational authority over the corporation's actions. This allows a court to summon them and compel them to do or not do things in the physical world. There's no reason a corporation can't choose to have an AI operate their accounts, but this won't affect the copyright status, and if the directors try to claim they can't override the AI's control of the accounts they'll find themselves in jail for contempt the first time the corporation faces a lawsuit.
Same as if a regular person did the same: they are responsible for it. If you're using AI, check that the code doesn't violate licenses.
In certain court cases a finding of plagiarism can be influenced by whether the person was exposed to the copyrighted work. AI models are exposed to a very large corpus of works.
Copyright infringement and plagiarism are not the same or even very closely related. They're different concepts and not interchangeable. Relative to copyright infringement, cases of plagiarism are rarely a matter for courts to decide or care about at all. Plagiarism is primarily an ethical (and not civil or criminal) matter. Rather than be dealt with by the legal system, it is the subject of codes of ethics within e.g. academia, journalism, etc. which have their own extra-judicial standards and methods of enforcement.
I suspect they were instead referring to patents. For example, when I worked at Google, they told the engineers not to read patents, because then the engineer might invent something infringing; I think it's called willful infringement. No other employer I've worked for has ever raised this as an issue, while many lawyers at Google would warn against it.
As opposed to an irregular person?
LLMs are not persons, not even legal ones (which itself is a massive hack causing massive issues such as using corporate finances for political gain).
A human has moral value a text model does not. A human has limitations in both time and memory available, a model of text does not. I don't see why comparisons to humans have any relevance. Just because a human can do something does not mean machines run by corporations should be able to do it en-masse.
The rules of copyright allow humans to do certain things because:
- Learning enriches the human.
- Once a human consumes information, he can't willingly forget it.
- It is impossible to prove how much a human-created intellectual work is based on others.
With LLMs:
- Training (let's not anthropomorphize: lossily-compressing input data by detecting and extracting patterns) enriches only the corporation which owns it.
- It's perfectly possible to create a model based only on content with specific licenses or only public domain.
- It's possible to trace every single output byte to quantifiable influences from every single input byte. It's just not an interesting line of inquiry for the corporations benefiting from the legal gray area.
How could you do that, though? You can't guarantee that there aren't chunks of copied code that infringe.
But the responsible party is still the human who added the code. Not the tool that helped do so.
The practical concern of Linux developers regarding responsibility is not about being able to ban the author; it's that the author should take ongoing care of his contribution.
That's not going to shield the Linux organization.
A DCO bearing a claim of original authorship (or assertion of other permitted use) isn't going to shield them entirely, but it can mitigate liability and damages.
In a court case the responsible party could very well be the Linux Foundation, because this is a foreseeable consequence of allowing AI contributions. There's no reasonable way for a human to make such a guarantee while using AI-generated code.
It’s not about the mechanism: responsibility is a social construct, it works the way people say that it works. If we all agree that a human can agree to bear the responsibility for AI outputs, and face any consequences resulting from those outputs, then that’s the whole shebang.
Sure we could change the law. It would be a stupid change to allow individuals, organizations, and companies to completely shield themselves from the consequences of risky behaviors (more than we already do) simply by assigning all liability to a fall guy.
What law exactly are you suggesting needs to be changed? How is this any different from what already happens right now, today?