That’s not the only thing that matters. The provenance of the code also matters enormously, specifically whether the person contributing it actually has the right to do so.
If I contributed code to an Open Source project behind my old employer’s back, that would have been bad, because that code was owned by them and not me, even if I wrote it on my own time using my own equipment, because of the contract I signed with them.
If I copied code out of an AGPLv3-licensed codebase and contributed it to a BSD-licensed codebase without telling anyone, that would have been bad, because I did not have the right to change the license on that code to BSD (or change the license on the codebase to which I was contributing to AGPLv3).
If you use an LLM to produce code, you may well be doing the latter since an LLM is actually just regurgitating portions of its inputs. This is not a hypothetical scenario; I’ve personally encountered a case of someone using an LLM attempt to contribute code I recognized from a specific Open Source project under one license to another project under a different license, while claiming they “wrote it themselves.”
Any project that accepts contributions needs to take liability seriously and manage their risk appropriately.
> This is not a hypothetical scenario; I’ve personally encountered a case of someone using an LLM attempt to contribute code I recognized from a specific Open Source project under one license to another project under a different license
You say you "recognized code". Does it mean that you weren't able to find the exact match?
> an LLM is actually just regurgitating portions of its inputs
You seem to be talking about the inputs to the autoregressive pretraining stage. Correct? Then it's not how LLMs work, unless we use a definition of portions as a "few letters blocks."
I found exact matches. I also found inexact matches, where C functions had been turned into C++ member functions and the like. “Recognized” does not somehow imply a lack of precision.
The LLM the person used was trained on a very large corpus of Open Source code, and reproduced that code exactly. Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.
Were those functions trivial? With, say, 1% probability of someone who have not seen them writing them like that?
> Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.
Have you read the articles? As far as I remember they fed large chunks of an article multiple times to an LLM to sometimes get a not-so-long exact match. It can mean that LLMs can infer a style and humans are predictable.
> […] fed large chunks of an article multiple times to an LLM […]
So they had to prompt? An LLM? I got this argument before and still don’t get what it’s trying to say. These models do not output anything unless prompted, that’s not any kind of gotcha.
On the code outputting front there is a lot of relevant evidence beyond the NYC lawsuit [0].
If I slightly modify GPL code, that doesn’t give me the right to relicense.
No, the functions weren’t trivial, and a lot of the surrounding code and structure bore substantial similarities as well. If you saw the two files next to each other, you’d assume it was the result of a copy-paste-adjust process if you didn’t know an LLM was involved.
I can only speculate that the model that generated the code hasn't undergone selective unlearning for verbatim data (SUV) or something similar. As you understand "sometimes generates verbatim code" and "just regurgitates [non-trivial] portions its input" are different statements.
The possibility of SUV clearly shows that a model does more than "just regurgitating."
"LLM produced licensed code and person contributed it" is indistinguishable from "person contributed licensed code". The LLM is irrelevant. Result is the same as if they had copy pasted it.
Unfortunately, a large number of people are being told—and here, you can see many who believe it—that the output of an LLM either carries no copyright or is copyright by the one prompting it. In other words, even right here on Hacker News it’s widely believed that LLMs “launder” copyright.
Not irrelevant. A large number of people who would not copy and paste code from one project to the another will attempt to contribute the copyright-infringing output of an LLM and not think twice.
Have fun with 1000x more Buns that literally no one is using or maintaining. An entire software industry built on top of a burning garbage pile of crappy, dead code.
You think Anthropic wants to be the sole maintainer of thousands of forked OSS projects...? I seriously doubt that would happen, for legal, marketing, and logistical reasons alike.
The Fortune 10 company that I spent decades at and retired from just a couple years ago noticed this issue immediately and issued a blanket ban on the use of these tools for the company’s own code that to my knowledge has not been rescinded. (They also started developing their own coding-specific LLM, training solely on code they owned, around the same time.)
You might consider that there is a very large incentive by the large and public players in this market to promote the idea that this is not true, that they consider themselves large and powerful enough to actually flout the law, and that they plan to use the argument that enforcement will be too damaging to the economy to make their view the “new normal.”
This playbook has been run before, by Uber and Lyft, by AirBnB, by Tesla with “FSD,” and so on. It’s very clearly the approach being taken.