
I picked the “Attention is All You Need” example at the top, and wow it is not great!

Didn’t take long to find hallucination/general lack of intelligence:

> For each word, we compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give out?).

What? That’s the worst description of a key-value relationship I’ve ever read, unhelpful for understanding what the equation is doing, and just wrong.

> Attention(Q, K, V) = softmax( Q·Kᵀ / √dk ) · V

> 3 Mask (Optional) Block future positions in decoder

Not present in this equation, also not a great description of masking in an RNN.

> 5 × V Weighted sum of values = output

Nope!

https://nowigetit.us/pages/f4795875-61bf-4c79-9fbe-164b32344...
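
For anyone who wants to check the quoted steps against the equation itself, here is a minimal sketch of Attention(Q, K, V) = softmax( Q·Kᵀ / √dk ) · V in plain Python. The function names and the `causal` flag are my own assumptions for illustration; the quoted equation has no mask term, so the flag only shows where a decoder-style mask would be applied if one were added.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hypothetical decoder-style mask (not in the quoted equation):
            # position i may not attend to positions j > i.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output row combines the value vectors using the softmax weights.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out
```

Usage: `attention(Q, K, V)` takes three lists of equal-length rows and returns one output row per query; passing `causal=True` applies the illustrative mask before the softmax.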




I keep trying these types of things with my own academic papers, asking AIs to summarise them, and they always produce plausible looking nonsense.

LLMs, even the best ones, are still hit or miss wrt quality. Constantly improving, though.

I see more confusion from Opus 4.x about how to weight the different parts of a paper in terms of importance than I see hallucinations of flat-out incorrect stuff. But these things still happen.


Surely, but isn't it a considerable concern? Deflecting constructive feedback is probably not the best encouragement for others on a Show HN.

Hmmm, didn’t realize I was deflecting - just stating facts. But if I came across that way then criticism noted.

If I turned this into a paid app then more attention would be given to quality. There’s only so much an app that leverages LLMs can do, though. With enough trace data and user feedback I could imagine building out Evals from failure modes.

I can think of a few ways to provide a better UX. One is already built-in - there’s a “Recreate” button the original uploader can click if they don’t like the result.

Things could get pretty sophisticated after that, such as letting the user tweak the prompt, allowing for section-by-section re-dos, changing models, or even supporting manual edits.

From a commercial product perspective, it’s interesting to think about the cost/benefit of building around the current limits of LLMs vs building for an experience and betting the models will get better. The question is where to draw the line and where to devote cycles. Something worthy of its own thread.



