But it does need to know personal info to be useful as an agent (calendars, email). The danger is that it’s a hassle to vet every bit of data, and to be useful it needs to know a lot, leading to oversharing, and if you use it long enough you will leak secrets that you didn’t want to leak.
Yes for those in group B I'd suspect many were doing exactly what these cheaters in group A were doing - submitting the unaltered output of an LLM as their review.
The rejection is based on the dishonesty of explicitly committing to standard A and then knowingly violating it, not on LLM use as such. I think that's pretty fair, considering that everyone could have just chosen B if they wanted to.
Sure, I'm just pointing out that the 2% headline figure is very conservative if not misleading as a far greater unknown number in group B will have done exactly the same (which I doubt ICML or those submitting papers actually want). This is probably a first step in clamping down on anyone doing this.
Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.
Given this detection method works so well in the use case of feeding reviewing LLMs instructions, it should also work for the original submitted paper itself, as long as it was passed along with its watermark intact. Even those just using LLMs to summarise could easily be affected if LLMs were instructed to generate very positive summaries.
So the 2% cheaters on policy A, AND 100% of policy B reviewers could fall for this and be subtly guided by the LLMs overly-positive summaries or even complete very positive reviews (based on hidden instructions).
That this sort of adversarial attack works is really quite troubling for those using LLMs to help them understand texts, because it would work even if asked to summarise something.
This definitely happened to a paper that I submitted a couple of years ago. ChatGPT 4 was the frontier. The reviewer gave a positive, if bland, summary with some reasonable suggestions for improvement and some nitpicks. There were no grammar or line-number comments like those from other reviewers. They were all issues that would have been resolved by reading the appendices, but the reviewer hadn't uploaded into ChatGPT. Later on I was able to replicate the output almost exactly myself.
What I found funny was that if you asked ChatGPT to provide a score recommendation, it was also significantly higher than what that reviewer put. They were lazy and gave a middle grade (borderline accept/reject). We were accepted with high scores from the other reviews, but it was a bit annoying that they seemingly didn't even interpret the output from the model.
The learning experience was this: be an honourable academic, but it's in your interest to run your paper through Claude or ChatGPT to see what they're likely to criticise. At the very least it's a free, maybe bad, review. But you will find human reviewers that make those mistakes, or misinterpret your results, so treat the output with the same degree of skepticism.
> Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.
I may or may not know a guy who added several hidden sentences in Finnish to his CV that might have helped him in landing an interview.
Not at all. It's just that reportedly LLMs used to have a blind spot for prompt injection in languages with relatively few speakers and grammar dissimilar to that of English.
> Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.
LLMs have a real problem with not treating context differently from instructions. Because they intermingle the two they will always be vulnerable to this in some form.
Then these papers with these instructions get included in the training corpus for the next frontier models and those models learn to put these kinds of instructions into what they generate and …?
Not necessarily. I was intending it as a thought experiment illustrating why some kind of formal language (whether that mean technical jargon, unambiguous syntax, unambiguous semantics, conlangs, specification languages, or some combination thereof) will eventually arise from natural language - as it has countless times in the past, within mathematics (as referenced in TFA) and elsewhere. Gherkin is kind of nice though.
Considering the many hundreds of technical comments over at the PR (https://github.com/nodejs/node/pull/61478), the 8 reviewers thanked by name in the article, and the stellar reputations of those involved, seems likely.
My mistake 19k lines. At 2 mins per line that’s (19000*2)/60/7=90 7-hour days to review it all, are you sure it was all read? I mean they couldn’t be bothered to write it, so what are the chances they read it all?
For someone’s website or one business maybe the risk is worth it, for a widely used software project that many others build on it is horrifying to see that much plausible code generated by an LLM.
I probably review about 1k LoC worth of PRs / day from my coworkers. It certainly doesn't take me 33 hours (!!) to do so, so I must be one of those rockstar 10x superhero ninja engineers I keep hearing about.
I think that goes back to whether they are programmers vs engineers.
Engineers will focus on professionalism of the end product, even if they used AI to generate most of the product.
And I'm not going by "title", but by mindset. Most of my fellow engineers are not - they are just programmers - as in, they don't care about the non-coding part of the job at all.
Depends - if it is from a human I find I can trust it a lot more. If it is large blobs from LLMs I find it takes more effort. But it was just a guess at an average to give an estimate of the effort required. I’d hope they spent more than 2 mins on some more complex bits.
Are you genuinely confident in a framework project that lands 19kloc generated PRs in one go? I’d worry about hidden security footguns if nothing else and a lot of people use this for their apps. Thankfully I don't use it, but if I did I'd find this really troubling.
It also has security implications - if this is normalised in node.js it would be very easy to slip in deniable exploits into large prs. It is IMO almost impossible to properly review a PR that big for security and correctness.
usually yes, but that's why there are tests, and there's a long road before people start depending on this code (if ever). people will try it, test it, report bugs, etc.
and it's not like super carefully written code is magically perfect. we know that djb can release things that are close to that, but almost nobody is like him at all!
I carefully review far more than 14k LoC a week… I’m sure many here do. Certainly the language you write in will greatly bloat those numbers though, and Node in particular can be fairly boilerplate heavy.
You'd have to manage the contributions, or get your AI bots to manage them or something, but it would be great to have honeypots like this to attract all the low effort LLM slop.
How would you know it’s an enthusiastic and smart expert creating the content you’re consuming, do you have the subject matter expertise to judge that?
The odds are far higher it’s somebody who knows very little about anything but wants to make money from the gullible.
reply