Also, I've been hearing a lot of complaints that Chatbot Arena tends to favor:
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
More to the point: at this stage it feels to me that arenas are overly focused on fitting user preferences rather than measuring actual model quality.
In reality I prefer different models for different tasks, and quite often that's because model X is tuned to return more of what I happen to prefer. E.g. Gemini is usually the best for me in non-English, ChatGPT works better for me personally on health questions, ...
Interesting idea, I think I'm on board with this correlation hypothesis. Obviously it's complicated, but it does seem like over-reliance on arbitrary opinions from average people would result in valuing "feeling" over correctness.