Just knowing someone's name, address, and ID number isn't enough to, say, open a bank account in their name. You'd need a proper ID card or passport for that. Same with most businesses: if you try to pay for something on credit, they won't accept just a few digits and a pinky promise; you'll need to identify yourself properly (with the BankID app, for instance).
> Why is mechanized thinking going to do that? When mechanized labor didn't?
You're right. There is technically a category of work that relies neither on our ability to do physical labor nor on much thinking. It just relies on being a human.
The conclusion is thus obvious: AI is going to push us all into careers as photo models, OF-creators, and social media influencers! /s
The revenue numbers for the major AI companies are public. That's probably the best estimate we have for inference across the whole market, since most of that inference is billed as either API usage or subscriptions, and revenue won't include any in-house usage such as training.
How does it compare for models of any meaningful size?
These 0.6B-4B models are, frankly, amusing curiosities, but they're commonly regarded as too error-prone for any non-demo work.
The reason people are buying Apple Silicon today is that the unified memory allows them to run larger models that are cost prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20b+ range. Thanks.
Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.
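For anyone curious, here's a minimal sketch of that kind of setup via llama-cpp-python, assuming a Metal-enabled build; the model filename and context size are placeholders, not a recommendation:

```python
# Minimal sketch: loading a large 4-bit GGUF model on Apple Silicon
# with llama-cpp-python. Unified memory means a ~35 GB model can be
# fully offloaded on a 96GB machine; the path below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-70b-instruct-q4_k_m.gguf",  # placeholder local file
    n_gpu_layers=-1,  # offload every layer to the GPU/Metal backend
    n_ctx=4096,       # context window; larger contexts use more memory
)

out = llm("Explain unified memory in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```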
Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.

You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, 32B+ is on the roadmap. The architectural advantages MetalRT has should matter even more at that scale, where everything becomes memory-bandwidth-bound.

We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.
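To make "memory-bandwidth-bound" concrete, here's a rough back-of-envelope sketch. The ~400 GB/s figure (M2 Max) and the one-full-weight-read-per-token model are simplifying assumptions, not MetalRT measurements:

```python
# Back-of-envelope decode throughput for a dense model that is purely
# memory-bandwidth-bound: every generated token streams all weights
# through the memory bus at least once.
def max_decode_tps(params_billion: float, bits_per_weight: float,
                   bandwidth_gb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

print(max_decode_tps(70, 4, 400))  # ~11 tok/s ceiling: 70B at 4-bit, ~400 GB/s
print(max_decode_tps(4, 4, 400))   # ~200 tok/s ceiling: 4B on the same machine
```

At the small end, dispatch overhead dominates and better kernels pay off; at the large end, the ceiling is set by bandwidth no matter how good the kernels are.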
What choice do they really have, though? More and more consumers completely forgo owning a regular computer and only use a phone or tablet nowadays. And among the ones who do own a computer, there's still a strong trend toward not paying for software, presumably a behavior taught to them by the overwhelming success of strictly ad-financed apps.
It's easy to forget that we here on HN are several standard deviations from the norm.
Windows 8 was supposed to dig into that mobile device/tablet market, as was Windows Phone. You can argue about why Win8 was a titanic failure that forced a backtrack in 8.1 (and Win10), but it seems like Microsoft didn't really know how to approach the space at the time, and failed to commit to a trend they had correctly identified early enough to capitalize on it.
If we assume people are somewhat rational (big ask, I know) and that the efficient-market hypothesis holds, then we can estimate the value created by AI to be roughly equal to the revenue of these AI companies. That is: a professional who pays 20€/month likely believes the AI product provides them with roughly 20€ each month in productivity gains, or else they wouldn't be paying; similarly, they would pay more for a bigger subscription if they thought there was more low-hanging fruit available to grab.
Of course this doesn't take into account people who just pay to play around and learn, non-professional use cases, or a few other things, but it's a rough ballpark estimate.
Assuming the above, current AI models would only increase productivity at most workplaces by a relatively small amount, perhaps 10-200€ per employee per month. Almost indistinguishable next to salaries and other business expenses.
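As a quick sanity check on "indistinguishable", using the range above plus a hypothetical 4,000€/month fully-loaded employee cost (my assumption, not a sourced figure):

```python
# What fraction of total employee cost would the subscription-implied
# productivity gains represent? Employee cost is an assumed placeholder.
employee_cost = 4000              # €/month, hypothetical fully-loaded cost
for gain in (10, 200):            # €/month range from the estimate above
    print(f"{gain}€ gain = {gain / employee_cost:.1%} of employee cost")
# -> roughly 0.2% to 5.0%: real, but small next to salaries
```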
> A professional who pays 20€/month likely believes that the AI product provides them with roughly 20€ each month in productivity gains, or else [...] they would pay more for a bigger subscription
Unless I'm misunderstanding, shouldn't someone rational want to pay where (value - cost) is highest, as opposed to increasing cost to the point where it equals value (which has diminishing returns)?
A $40 subscription creating $1000 worth of value would be preferred over a $200 subscription creating $1100 of value, for instance, and both preferred over a $1200 subscription creating $1200 of value.
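In code form, the rational pick maximizes surplus, not value (numbers straight from the example above):

```python
# Pick the subscription with the highest surplus (value - cost),
# using the hypothetical numbers from the comment above.
options = {40: 1000, 200: 1100, 1200: 1200}  # cost -> value
best = max(options, key=lambda cost: options[cost] - cost)
print(best, options[best] - best)  # -> 40 960: the cheap plan wins
```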
I was limiting myself to the simpler heuristic where people only pay roughly what they personally think something is worth, not significantly more or less regardless of the options. But of course, as you've pointed out, in real life the options available really do matter, and someone might decline a 200:1200 trade if more lopsided options are available. It does complicate the thought experiment somewhat if you try to take this into account.
> Putting that kind of filter in the way of speech seems ripe for abuse.
On one hand, I agree with you. Any automatic filter can later be expanded to cover more and more things, such as messages from political adversaries. It's a slippery slope, as we all know.
On the other hand, I don't think it applies much in this context. If we're talking about content published by a corporation (say, a newspaper), they already filter the news they gather themselves and have no obligation to publish things they don't want to.
Similarly, if we're talking about user-uploaded content on social media, I don't think platforms have any obligation to publish anything and everything their users decide to upload either, and users don't expect that they do. Users already know that youtube/facebook/tiktok/what-have-you have seemingly arbitrary rules about what content they're willing to host.
Now, if DNS providers or ISPs were to implement this sort of filter on the web at large, that's a different matter, and in that case I agree with you.
> Sidenote: I wonder what's going to happen when the crazy money runs out and Anthropic, OpenAI & co have to start charging for more than it costs them to run the models. Hopefully by then the open source models will have caught up?
How brutal will the enshittification phase of these products be?
Will the 10x cost or whatever be something that future employers have to pay, or will the impact be more visible for all of us? Assuming no AGI scenario here, and that the investments will have to be paid back with further subscription services like today's.
I really hope open source (open weights) keeps up with the development, and that a continuation of Moore's Law (the bastardized performance-per-€ version) makes local models increasingly accessible.
Is it proven that they serve the models at cost? Amodei has said that Anthropic's models make back their training cost - the reason they're so deep in the red is because they're investing substantially more in subsequent runs, and R&D dwarfs inference cost[1]. If the tech plateaus I would expect to see a lot of that R&D spend move into just powering inference.
> I've been surprised how difficult it is for LLMs to simply answer "I don't know."
It's very difficult to train for that. You can of course include a Question+Answer pair in your training data where the answer is "I don't know", but if you already have the question in hand, you might as well include the real answer anyway, or else you're just training your LLM to be less knowledgeable than the alternative. But if the pattern "I don't know" never appears in the training data, it won't show up in the output either. So what should you do?
If you could predict the blind spots ahead of time, you'd plug them up, either with knowledge or with an "I don't know". But nobody can predict the blind spots perfectly, so instead they become the main source of hallucinations.
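One workaround that's been explored in refusal-tuning work is to only put "I don't know" where the base model is already unreliable, so you trade away only answers it would likely have gotten wrong anyway. A rough sketch under that assumption (`sample_answer` is a hypothetical hook into your inference stack, and exact-match comparison is naive; real pipelines normalize answers):

```python
from collections import Counter

def build_targets(qa_pairs, sample_answer, k=8, threshold=0.75):
    """For each (question, gold) pair, sample the base model k times.
    If it converges on the gold answer, keep the gold answer as the
    fine-tuning target; otherwise teach it to abstain. This only plugs
    blind spots you can probe; the unprobed ones remain, which is the
    commenter's point."""
    targets = []
    for question, gold in qa_pairs:
        samples = [sample_answer(question) for _ in range(k)]
        top, count = Counter(samples).most_common(1)[0]
        if top == gold and count / k >= threshold:
            targets.append((question, gold))           # reinforce knowledge
        else:
            targets.append((question, "I don't know"))  # teach abstention
    return targets
```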
I think one thing short-form video does really well is punish creators who pad their videos with unnecessary filler. On TikTok, for example (not necessarily a fan of the app, but it's a good example), no videos start with the empty jabbering you often see on YouTube ("Welcome to my channel...", "Today we will...", "Please like and subscribe...", "This video is sponsored by...", etc.), because if they tried any of that crap, viewers would just swipe the content away. So instead they get straight to the point. That part is really refreshing.