I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary looking formulas with sigmas and other Greek letters.
Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry picked- literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top rated model (stepfun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commiserate with the score given of 7/10 for the task! However- the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.
I know, that was indeed a bad judge move. I've manually checked tens of tasks so far, and that one is one of the worst... I would say check a few more, judge has some noise but in general did a good job IMO
As sibling comments point out, parents are already overly held responsible for how they care for their kids. To an absurd amount.
I have had CPS called on me by an overbearing school administrator. Have you had that happen to you? Let me tell you, it's not a fun experience.
Enough of this "blame the parents" mentality! Ironic given that the goal for all these platforms is growth at all costs. Where do you think "growth" comes from, after all? If you make being a parent so goddamn difficult that it's more rational to just not do it, guess what, poof goes your sweet, sweet growth.
So tired of this line of thinking. The parents are put into an impossible situation. Stuck between kids who by definition and by design will test the boundaries that they're given, and tech platforms that are propped up with not just trillions of dollars of valuation, but the societal expectation that you engage with them. Want your kids to compete in sports? Well, they need to have WhatsApp and Instagram to keep track of team events!
Give me a break. Equating controlling social media and devices to "look both ways when crossing the street" is disingenuous at best. There are no companies that make billions of dollars in advertising revenue telling your kids to jaywalk. But Facebook gladly weaponizes their algorithm to drive "engagement" - and, surprise, children with still-forming prefrontal cortices are drawn to content that reinforce their natural self-criticisms and doubts. So now my child, who has to be on Instagram to keep track of sports schedules, is also force fed toxic content because that's what a mechanical algorithm thinks is most "engaging" based on my derived psychological and demographic profile.
You want to talk about CSAM? X proudly proclaims that they have every right to produce deep-fake pornography with the faces of underage children. What action shall I, as an individual parent, take if my 15 year old girl's face is suddenly pasted onto sexually explicit video and widely shared thanks to xAI's actions? Shall I be held responsible for how I "let this happen" to my child?
You seem to imply in your reply that I disagree with you, hence necessitating a polemic style. I would have thought the last few sentences of my comment make it clear where I stand on simplistic appeals to "parental responsibility".
But now you have compromise _at scale_. Before poor plebs like us had to artisinally craft every back door. Now we have a technology to automate that mundane exploitation process! Win!
> the price quickly dropped to just $6,000 when they realized we were serious about going elsewhere, and they would throw in ISO 27001 and a 200 hour penetration test as well.
I'm sorry, but... $6,000 / 200 == $30 / hour? Just assuming the value of the actual certifications is $zero?
$6000 for both SOC 2 and ISO 27001 with Pen tests ? lol. I paid over $8k just for ISO 27001 for our small company and have been quoted a lot more for SOC 2.
That sort of event doesn't fade away quickly and definitely influenced energy policy that persists to this day. Thankfully the tide is turning due to safer designs.
The excessive size of Go binaries is a common complain. I last recall seeing a related discussion on Lobsters [1]. Who knows, maybe the binary could be shrunk a bit? IMHO 12mb binary size is not that big of a deal.
Kinda comparing apples to oranges. AWS was using EBS and not local instance storage. So you’re easily looking at another order of magnitude latency when transmitting data over the network versus a local pcie bus. That’s gonna be a huge factor in what I assume is a heavy random seek load.
I wrote a longer comment already (https://news.ycombinator.com/item?id=47352526) but looking at the hot run performance and making big hand wavy guesses, the performance difference might not be as big as you'd expect.
The court filing provides more information than just giving ammo to harassers so I do not see them as directly equivalent. I also do not agree with the premise that if one person does something bad it would justify someone else in doing so.
Plus - you’re telling me that highlighting an individual and posting their home address on an official government account is not “giving ammo to harassers”?
Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry picked- literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top rated model (stepfun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commiserate with the score given of 7/10 for the task! However- the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.
Ok, closed that tab.
reply