Copyright is not a blacklist but an allowlist of things reserved for the holder. Everything else is fair game. LLM ingestion comes under fair use, so no worries. If someone can get their hands on a work, nothing in law stops it from being ingested for training.
We can debate whether this law is moral. Like the GP, I too agree that public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.
I don't think so. It is nowhere near "limited use": the entirety of the source code is ingested to train the model. In other words, it meets the bar of the "heart of the work" being used. There are other factors as well, such as not harming the owner's ability to profit from the original work.
This hasn't gone to the Supreme Court yet. And that is just the USA; courts in the rest of the world will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the world, with the majority living outside the USA. Jurisdiction matters in these things.
Copyright's ambit has been pretty much defined and run by the US for over a century.
You're holding out for grace from the wrong venue. The right avenue would be lobbying for new laws to regulate LLM training and use, not trying to find shelter in an archaic and increasingly irrelevant bit of legalese.
I don't disagree. However, your assertion that copyright was initially defined by the US (which is not a fact: it was England that came up with it, and it was adopted by the Commonwealth, of which the US was a part until its independence) does not mean the jurisdiction is the US. Even if the US Supreme Court rules one way or the other, it doesn't settle much, as the rest of the world has its own definitions and legalese that need to be scrutinized and modernized.
Alsup absolutely did not vindicate Anthropic as "fair use".
> Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. [0]
It was only fair use because they already had a license to the works at hand.
These are crappy arguments. The author is seeking to re-litigate "piracy of IP is bad" and "AI is bad".
If those are your axioms, then you will find the old world is already in the rear-view mirror, and people who hold them want to pull every other project back to stay with them in that world.
AI is here. Free software succeeded: make as much as you want. This technology is a force multiplier.
You can debate its morality, but most people just want to do their work.
I get (incorrectly) accused of writing undisclosed sponsored content pretty often, so I'm actually hoping that the visible sponsor banner will help people resist that temptation, because they can see that the sponsorship is disclosed, not hidden.
That's actually a cleaner editorial standard than most publications follow. The major risk in tech journalism isn't disclosed sponsorships — it's the undisclosed access journalism where coverage tone shifts to maintain relationships. Visible banners beat invisible influence every time.
Honestly, after his ~23 years of writing online I think he's fairly earned the title of independent researcher. He added those sponsorships three days ago; perhaps wait to sound the alarm until he actually writes about a sponsor.
I can't offer an example of code, but considering researchers were able to cause models to reproduce literary works verbatim, it seems unlikely that a git repository would be materially different.
These arguments absolutely infuriate me. Your code is not that unique. Lots of people write the same snippet every day and have no idea that somebody else just wrote the same thing.
It's such a crock that you can somehow claim you're the only person who can write that snippet and now everyone else owes you something. No. No they don't. Get over it.
Writing a book is different. Lifting pages or chapters is different, because it's much harder for two people to write the exact same thing. Code is code; it follows a formula, and everyone uses that formula.
Assuming that even works from a researcher's perspective, it's working backward from a specific goal. There are zero actual instances (and I've been looking) where verbatim code has been spat out.
It's a convenient criticism of LLMs, but a wrong one. We need to do better.
> There are zero actual instances (and I've been looking) where verbatim code has been spat out.
That’s not true. I’ve seen it happen and remember reports where it was obvious it happened (and trivial to verify) because the LLM reproduced the comments with source information.
Either way, plagiarism doesn’t require one to copy 100% verbatim (otherwise every plagiarist would easily be off the hook). It still counts as plagiarism if you move a space or rename a variable.
You should take your findings to the large media organizations including NYT who've been trying to prove this for years now. Your discovery is probably going to win them their case.
I don't know of code examples, but this tracks for me. Any time I have an agent write something "obvious" but crazy hard -- say, a new compiler for a new language? Golden. I ask it to write a fairly simple stack-invariant version of an old algorithm using a novel representation (topology) and a novel construction (free module)... zip. It's 200 LOC, and after 20+ attempts, I've given up.
It happens often enough that the company I work for has set up a presubmit that checks all AI-generated and AI-assisted code for plagiarism (which they call "recitation"). I know they're checking the code for similarity to anything on GitHub, but they could also be checking against the model's training corpus.
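I don't know how that presubmit is actually implemented, but a common way to detect this kind of overlap is token shingling: normalize the candidate code into tokens, take every k-token window, and measure how many windows also appear in a reference corpus. A minimal sketch, assuming a toy in-memory corpus (the real check, its threshold, and its normalization are unknown to me):

```python
def shingles(code: str, k: int = 8) -> set:
    """Split code into whitespace tokens and return all k-token windows."""
    tokens = code.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def recitation_score(candidate: str, corpus_snippets: list, k: int = 8) -> float:
    """Fraction of the candidate's shingles that appear anywhere in the corpus.

    1.0 means every k-token window of the candidate exists verbatim in the
    corpus; 0.0 means no overlap at the chosen shingle size.
    """
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    corpus = set()
    for snippet in corpus_snippets:
        corpus |= shingles(snippet, k)
    return len(cand & corpus) / len(cand)

# A presubmit might block a change whose score exceeds some threshold,
# e.g. recitation_score(new_code, corpus) > 0.5 -> require human review.
```

Renaming a variable or reflowing whitespace defeats naive shingling, which is why production systems usually normalize identifiers and literals before hashing; this sketch skips that step for brevity.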
Have you worked with a professional architect? Costs add up fast, and you get maybe 1-2 iterations.
I'd love to work and vibecode the house to my full liking, assuming the agent harness takes care of all the nonfunctional things (stable design, zoning, etc.). Same for a car: if I could customize it, I would.
(I definitely don't like the ramifications of it on the economy/jobs, but the above are pure consumer wins, no doubt)
> I'd love to work and vibecode the house to my full liking,
Instead of dead code, it'll leave you with a few extra secret rooms that have no doors or windows :)
The reason you wouldn't want this is cost. The cost of building a house is only marginally affected by designing it with an AI agent; most of it is materials: bricks, etc.
I'm starting to get a new sense of which people LLMs are useful for. I'm sure they're life-changing for those with intelligence below that of a child, so I'm glad for you that you have this tool available now.