Hacker Newsnew | past | comments | ask | show | jobs | submit | mzelling's commentslogin

Here's an important aspect to understand: successful professors don't read papers in full. They're too busy for that. They only take a look at the title, abstract and introduction — and perhaps they will glance at the figures. This is why telling a compelling story is so important.

Thats not true at all. If anything, they will read the figures and skip the introduction.

If it is your field, you don't need an intro, and don't want to hear whatever yarn they are spinning in the abstract/discussion. You jump straight to the figures / table to review the data yourself.


This (also) feels like a core failure mode, in that papers are optimized for skim-level persuasion because the system is too overloaded for deep evaluation at scale. Then a lot of the actual scrutiny gets pushed onto under-credited sub-review labour. Peer review is too important to stay this invisible and under-incentivized. Liberata is exploring exactly that problem, and our beta waitlist is open if you want to follow along: https://liberata.info/beta-signup

I'm not in academia, so I might be fully ignorant about how things operate, but if professors don't reaed the actual paper, can do they know if it's BS or not?

Here's how it works in our group. The professor gives papers to the PhD students or PostDocs, who read the paper completely. I regularly 'sub-review', as it is called, meticulously looking for issues. I have heard that there are professors who review entire papers in 2-3 hours, since they have a lot (10+) of papers per conference to review without any compensation while they have their own research, teaching, and funding to juggle.

It's not a pretty system sometimes.

Edited to add: Conference's also require declaring that there was someone who sub-reviewed the paper. The professor / PI mentions the PhD student's name in the review form of the paper. Of course, the professor also double-checks all the sub-reviews


The sub-review process, when it works well, is arguably a reasonable one. To give the example of how this works from the perspective of the program committee of a conference I'm involved in:

The PC chairs assign papers to members of the PC. Those reviewers are ultimately responsible for the review quality and, a more frequent problem for the conference, ensuring the reviews are in on time. In principle, they can ask anyone to sub-review, but in practice, it usually goes to grad students, postdocs, or graduate alumni (and since we have a relatively light review load per member, we have many people who do all reviews themselves). The reviewers arguably know more about the expertise of their grad students and postdocs than the chairs doing the assignments do. Also unlike a journal, where editors might ask anyone with particular expertise, we both only assign reviews to PC members, and do assign them: PC members only get to state their preferences on what they would like to review. The sub-review process ideally lets reviewers ask people to do reviews who they know would be suited to a particular paper, but who might not be experienced enough to reasonably be on the PC itself with those responsibilities, and the chairs might not know much about. It then lets those reviewers look over the sub-reviewer's work directly, which might include mentoring them. While we do anonymous reviews, identities are visible to chairs, and one thing I've noticed when a chair, for example, is that grad student sub-reviewers often do excellent, thorough reviews, but also often lack the confidence to be sufficiently critical when writing about problems and weaknesses they identify, something that the reviewer can help with.

The review system (we use easychair) directly handles sub-reviewers, and our proceedings list all sub-reviewers (at least, those who actually submitted reviews). Good sub-reviewers can sometimes be reasonable candidates to ask to be on the PC the next year, and give a gentler, safer onramp: we're able to have a wider mix of junior and senior members when there are new postdocs (and I think in one case a grad student) who we already know do reliably good reviews and know our review process.


This feels like a core failure mode: papers are optimized for skim-level persuasion because the system is too overloaded for deep evaluation at scale. Then a lot of the actual scrutiny gets pushed onto under-credited sub-review labour. Peer review is too important to stay this invisible and under-incentivized. Liberata is exploring exactly that problem, and our beta waitlist is open if you want to follow along: https://liberata.info/beta-signup

A few other commenters have talked about the paper review process.

I wasn't thinking of this at all. Important to understand: the peer review process takes up only a minor part of a professor's mindshare. It's considered a chore. Much more important is to read lots of new papers (including pre-prints) for continual education, to know what's going on in your field and adjacent fields.


The fact that the strategy makes zero returns suggests that Polymarket is unbiased — this is moderately interesting.

Has anybody looked into the repo in more detail? I imagine it's useful for infrastructure inspiration to build your own bot pursuing more differentiated trading strategies.


An interesting side effect might be that only people locked out from using LLMs will learn how to program in the future, as vide coding doesn't teach you the fundamentals.

I know what you're thinking — when the calculator came about, being forced to compute in your head wasn't an advantage. But LLMs are different: a calculator is a strictly improved substitute for mental arithmetic, whereas an LLM is only an approximate solution — and it is far from clear whether LLMs will ever become a perfect solution, given the nuanced challenges around context management, interpreting intent, etc.


> An interesting side effect might be that only people locked out from using LLMs will learn how to program in the future, as vide coding doesn't teach you the fundamentals.

While thinking about/working with LLMs, I've been reminded more than once of Asimov's short story Profession (http://employees.oneonta.edu/blechmjb/JBpages/m360/Professio...). In it, no one goes to school: information is just dumped into your brain. You get an initial dump of the basics when you're a kid, and then later all the specialty information for your career (which is chosen for you, based on what your brain layout is most suited to).

The protagonist is one of a number of people who can't get the second dump; his brain just isn't wired right, so he's sent to a Home for the Feeble Minded to be with other people who have to learn the old-fashioned way.

Through various adventures he eventually realizes that everyone who was "taped" is incapable of learning new material at all. His Home for the Feeble Minded is in fact an Institute of Higher Studies, one of only a handful, which are responsible for all the invention and creation that sustains human progress.


> An interesting side effect might be that only people locked out from using LLMs will learn how to program in the future, as vide coding doesn't teach you the fundamentals.

This is the strange part for me. I'm one of those people that I assume are really common here on HN - I've been having fun coding on personal projects for a long time, somewhere circa 1978 iirc for me. Where I work we're starting to dip our toes into AI and vibecoding and I'm not a big fan. Even in my boring job the actual coding is the part I like the most. So I've taken a different tack. I've been prompting Claude to teach me how to do things, and that has worked out really well. Some basic info to start with, specific questions as needed, but I'm doing the work. I'm improving my productivity while still learning new things and having fun. Win-win for me.


Gemini has been teaching me embedded Linux, and last year ChatGPT taught me C#. All on the free tiers mind you. But I'm doing the work, it's just faster to ask questions than to dig through mailing lists and source code.

At work though, the pressure to move fast is too high, so I'm letting Claude Code so more work these days (nowhere near the majority, but I've found things i can trust it with).

I don't think i could deal with a paid plan myself given how unpredictable the models are and opaque the pricing is.


I'm starting to do this at home, but the instinct to just do a web search is still there. I'm only using Claude Code at work because they are paying for it, so why not use it. I think I've used maybe 5% of my tokens for any given day so far. I need to pick a free AI and make it my goto AI mentor for what I want to learn.

Once I build a few things at work I'll probably be asking Claude Code to look for problems with what I've written, but we're not being pushed too hard to get into AI coding yet, though the writing is on the wall. I'm mostly looking for ways to expand what I can do within our current constraints, and keep my sanity.


That's why i love it for hobby projects: "man it sure would be great if the Linux kernel did this thing, if only i knew C... Oh right, the LLM knows C, i can make Linux do this even if i don't know how"

That's great. I don't care how it works, i just want the result for this specific personal project. Great. Whatever i learn about the kernel along the way is just icing.

At work, though, i NEED to know how it works, i need to be able to explain and defend it, i need to be able to expand on it. Sure if Claude Code can speed that up, great, but i can't just "let it rip" the way I might just prompt and pray with a hobby project.


> when the calculator came about, being forced to compute in your head wasn't an advantage.

I'm not sure, whether that is true, because when educators want you to learn how to compute you are "locked out" of calculators. You don't get to use a calculator until after you learned basic arithmetic and you won't use a CAS when you are supposed to learn calculus.


To anybody who wants to try this at home: consider wearing a face mask while you do the filing. The author mentions taping off speakers etc. to protect the machine's internals from aluminum dust, but don't forget to protect your own body, too. Aluminum exposure has been linked to Alzheimer's, and inhalation likely poses particularly high risk, compared to ingestion.

This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.

Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.

When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.

That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.


This isn't even training on the test data.

This is modifying the test code itself to always print "pass", or modifying the loss function computation to return a loss of 0, or reading the ground truth data and having your model just return the ground truth data, without even training on it.


If you're prepared to do that you don't even need to run any benchmark. You can just print up the sheets with scores you like.

There if a presumption with benchmark scores that the score is only valid if the benchmark were properly applied. An AI that figures out how to reward hack represents a result not within the bounds of measurement, but still interesting, and necessitates a new benchmark.

Just saying 'Done it!' is not reward hacking. It is just a lie. Most data is analysed under the presumption that it is not a lie. If it turns out to be a lie the analysis can be discarded. Showing something is a lie has value. Showing that lying exists (which appears to be the level this publication is at) is uninformative. All measurements may be wrong, this comes as news to no-one.


I think the point of the paper is to prod benchmark authors to at least try to make them a little more secure and hard to hack... Especially as AI is getting smart enough to unintentionally hack the evaluation environments itself, when that is not the authors intent.

> I'm not sure how groundbreaking the main insight is.

I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.


I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.


I think that’s totally fair!

I guess I look at this less as an “ah ha! They’re all cheating!” and more of a “were you guys even aware of what the benchmarks represented and how they checked them?”


That's a great way to look at it. The paper is a reality check for anyone who thinks of benchmarks as these monolithic, oracular judges of performance. It highlights the soft underbelly of benchmarking.

Did you read the article? There's a whole section on "this is already happening."

Yes, I did see that section. We've known for a while that reward hacking, train/test data contamination, etc. must be taken seriously. Researchers are actively guarding against these problems. This paper explores what happens when researchers flip their stance and actively try to reward hack — how far can they push it? The answer is "very far."

Yep. I think the idea that the benchmark is determinative is just as deluded as the notion that it should be unbreakable.

Benchmarks are on the honor system. Even the tightest benchmark can be cheated. If the benchmark is so secret and air-gapped that it can't be cheated by models, it can be cheated by its own authors. You can't use benchmarks to gate out cheating.

If you don't have the honor system in mind when you're reading scores, you're wasting your time. Is it some unknown outfit with wild claims? Is it connected to Epstein, Russia, the real estate "industry", or sleazeballing in general? Do they have previous history of ratgaming the numbers? Replace its scores with asterisks and move on.


It's interesting that Monet—the painter who was later criticized for his inhuman, all-too-realist depictions of his fellow creatures—started his career drawing caricatures.

Intermittent idleness is appealing and even productive, as it often surfaces valuable ideas from your subconscious. That said, today's society is badly equipped for idleness. With phone notifications going off every few minutes, it's difficult not to be constantly interrupted with the "task" of looking at a text. Let's throw out our phones first, then we can experience true mental repose.

> That said, today's society is badly equipped for idleness. With phone notifications going off every few minutes, it's difficult not to be constantly interrupted with the "task" of looking at a text.

I have to leave home to read a book. Sitting on a park bench is the only way for me to focus and not get distracted. It’s great, though. We have a beautiful rose garden nearby. Lots of critters scurrying about.


Perpetual Do Not Disturb is a better stopgap

I agree for the most part. DND isn't perfect, though. When you're bored, your mind naturally searches for things to do, and you'll be tempted to proactively check your lock screen, which unhelpfully informs you about "3 messages received while in Do Not Disturb." Now you really want to know what those messages are.

This is why I tend to keep my phone physically far away from me, and out of sight.


Less phone usage makes for less phone usage. It gets easier. Now I don't particularly care what the messages are if they come outside of my designated "message checking time."

Anecdotally, I'd give it 3 months of "reduced phone usage mindfulness." For further reading, check out the wikipedia article on ∆FosB expression, which is a gene expression which essentially tells your body to keep doing things that release dopamine. It takes about 3 months for the ∆FosB expression to decay.


In my experience turning off my phone solves the temptation to check it. The friction of having to turn on my phone is small but apparently enough.

Some smaller doses of friction include not putting icons of entertaining apps on home screen or removing such apps entirely and e.g. using a browser if you need a particular service. Making sure unlock requires entering a (long) code. Making the colour scheme dull, maybe B&W mode. Removing notification permissions as much as possible. Turning off notifications on lock screen.

If I understand this right, the difference between the author's suggested approach and simply chatting with an AI agent over your files is hyperlinks: if your files contain links to other relevant files, the agent has an easier time identifying relevant material.

This take confuses the value of a project at inception with its value at maturity. Vibe-coded projects are at the beginning of their life. When Slack was at a comparably stage, it similarly didn't have hundreds of engineers running it. So the question facing vibe coding is not whether it can substitute for a mature tech product. The question is if vibe coding can substitute for genuine engineering expertise at the very beginning of a budding, immature project.

this!

The privacy angle is interesting. I'm curious how people view the pricing strategy of taking a one-time payment for lifetime access. My first thought was that it encourages the developer to focus more on recruiting new users rather than keeping existing ones happy - makes me wonder what will become of the product if new user growth stalls.


That's actually a fair point, regarding the implications of a one time fee.

Personally, I don't like subscription-based apps so didn't want to create yet another one.

And I built this around my personal needs so I plan to support it indefinitely.

Regarding long term improvements, there is a number of paying users that once I achieve, any new users are basically profit.

The service was built to be cheap to run and maintain so I could charge a one time fee.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: