More

ekojs · 2026-04-22T15:41:14 1776872474

As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run this on a 3090/4090/5090. You can probably even go FP8 with 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding, you can get even faster (2-3x I'd say). Pretty amazing what you can get locally.

Aurornis · 2026-04-22T15:45:38 1776872738

> As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless

The 4-bit quants are far from lossless. The effects show up more on longer context problems.

> You can probably even go FP8 with 5090 (though there will be tradeoffs)

You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers.

zargon · 2026-04-22T21:02:22 1776891742

I just loaded up Qwen3.6 27B at Q8_0 quantization in llama.cpp, with 131072 context and Q8 kv cache:

  build/bin/llama-server \
    -m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
    --no-mmap \
    --n-gpu-layers all \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --jinja \
    --no-mmproj \
    --parallel 1 \
    --cache-ram 4096 -ctxcp 2 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking": true}'

Should fit nicely in a single 5090:

  self    model   context   compute
  30968 = 25972 +    4501 +     495

Even bumping up to 16-bit K cache should fit comfortably by dropping down to 64K context, which is still a pretty decent amount. I would try both. I'm not sure how tolerant Qwen3.5 series is of dropping K cache to 8 bits.

ekojs · 2026-04-22T16:04:10 1776873850

> You cannot run these models at 8-bit on a 32GB card because you need space for context

You probably can actually. Not saying that it would be ideal but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible.

I addressed the lossless claim in another reply but I guess it really depends on what the model is used for. For my usecases, it's nearly lossless I'd say.

alex7o · 2026-04-22T18:01:47 1776880907

Turboquant on 4bit helps a lot as well for keeping context in vram, but int4 is definitely not lossless. But it all depends for some people this is sufficient

zozbot234 · 2026-04-22T15:48:24 1776872904

4-bit quantization is almost never lossless especially for agentic work, it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters that's been quantized with more precision.

ekojs · 2026-04-22T15:55:57 1776873357

Yeah, figure the 'nearly lossless' claim is the most controversial thing. But in my defense, ~97% recovery in benchmarks is what I consider 'nearly lossless'. When quantized with calibration data for a specialized domain, the difference in my internal benchmark is pretty much indistinguishable. But for agentic work, 4-bit quants can indeed fall a bit short in long-context usecase, especially if you quantize the attention layers.

storus · 2026-04-23T04:34:04 1776918844

4-bit quantization is not applied to all layers, some are kept 8/16-bit.

binary132 · 2026-04-22T15:42:53 1776872573

That seems awfully speculative without at least some anecdata to back it up.

arcanemachiner · 2026-04-22T15:46:51 1776872811

Sure, go get some.

This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time.

Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it.

ekojs · 2026-04-22T15:46:10 1776872770

Not at all, I actually run ~30B dense models for production and have tested out 5090/3090 for that. There are gotchas of course, but the speed/quality claims should be roughly there.

ekojs · 2025-08-25T10:58:58 1756119538

Seems pretty widespread. We got mistakenly charged for ~$800 over the weekend.

Other Sources:

[0]: https://aistudio.google.com/status

[1]: https://www.reddit.com/r/GeminiAI/comments/1mycmtk/google_cl...

[2]: https://www.reddit.com/r/GeminiAI/comments/1myg04q/gemini_25...

ekojs · 2025-07-21T17:28:06 1753118886

> Btw as an aside, we didn’t announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved

> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!

From https://x.com/demishassabis/status/1947337620226240803

Was OpenAI simply not coordinating with the IMO Board then?

thegeomaster · 2025-07-21T17:34:49 1753119289

Yes, there have been multiple (very big) hints dropped by various people that they had no official cooperation.

osti · 2025-07-21T18:07:07 1753121227

I think this is them not being confident enough before the event, so they don't wanna be shown a worse result than competitors. By being private they can obviously not publish anything if it didn't work out.

esafak · 2025-07-21T19:09:57 1753124997

They shot themselves in the foot by not showing the confidence that Google did.

ml-anon · 2025-07-21T18:20:56 1753122056

As not-so-subtly hinted at by Terry Tao.

Its a great way to do PR but its a garbage way to to science.

osti · 2025-07-21T18:34:51 1753122891

True, but openai definitely isn't trying to do public research on science, they are all about money now.

ml-anon · 2025-07-21T18:38:13 1753123093

Thats not a contentious statement. Its still a pathetic way to behave at a kids competition no less.

pphysch · 2025-07-21T18:06:46 1753121206

This reminds me of when OpenAI made a splash (ages ago now) by beating the world's best Dota 2 teams using a RL model.

...Except they had to substantially bend the rules of the game (limiting the hero pool, completely changing/omitting certain mechanics) to pull this off. So they ended up beating some human Dota pros at a psuedo-Dota custom game, which was still impressive, but a very much watered-down result beneath the marketing hype.

It does seem like Money+Attention outweigh Science+Transparency at OpenAI, and this has always been the case.

NitpickLawyer · 2025-07-21T20:07:39 1753128459

Limiting the hero pool was fair I'd say. If you can prove RL works on one hero, it's fairly certain it would work on other heroes. All of them at once? Maybe run into problems. But anyway you'd need orders of magnitude more compute so I'd say that was fair game.

npinsker · 2025-07-21T21:31:04 1753133464

It's not even close to the same game as Dota. Limiting the hero (and item) pool so drastically locks off many strategies and counters. It's a bit hard to explain if you haven't played, but full Dota has many more tools and much more creativity than the reduced version on display. The behavior does not evidently "scale up", in the same way that the current SotA of AI art and writing won't evidently replace top-level humans.

I'd never say it's impossible, but the job wasn't finished yet.

pphysch · 2025-07-21T22:14:12 1753136052

That's akin to saying it's okay to remove Knights, or castling, or en passant from chess because they have a complicated movement mechanic that the AI can't handle as well.

Hero drafting and strategy is a major aspect of competitive Dota 2.

dmitrygr · 2025-07-21T17:33:30 1753119210

> Was OpenAI simply not coordinating with the IMO Board then?

You are still surprised by sama@'s asinineness? You must be new here.

antonvs · 2025-07-21T17:41:09 1753119669

When your goal is to control as much of the world's money as possible, preferably all of it, then everyone is your enemy, including high school students.

dmitrygr · 2025-07-21T17:49:44 1753120184

How dare those high school students use their brains to compete with ChatGPT and deny the shareholders their value?

MisterPea · 2025-07-21T18:04:59 1753121099

I am still surprised many people trust him. The board's (justified) decision to fire him was so awfully executed that it lead to him having even more slack

ekojs · 2025-07-17T17:05:21 1752771921

Maybe not a popular sentiment here on HN but I cancelled my Kagi subscription (9+ months) just recently. Increasingly, most of my queries/search have been through LLMs and Google search is just fine (and even better for restaurants, places, and the like). I don't think the improved search experience is worth the subscription anymore.

baggachipz · 2025-07-17T18:32:05 1752777125

In Kagi, you can just ad a "?" to your query and get an instant answer, a la LLMs.

VHRanger · 2025-07-17T21:23:43 1752787423

Or !ai to route it to kagi.com/assistant with your default model/agent to respond with kagi search results

ekojs · 2025-06-12T18:53:25 1749754405

https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

> Multiple GCP products are experiencing impact due to Identity and Access Management Service Issue

IAM issue huh. The post-mortem should be interesting at least.

yard2010 · 2025-06-12T21:33:20 1749764000

Ha. With all this soviet style euphemism I rather read the onion instead.

bananapub · 2025-06-12T22:03:54 1749765834

It’s not a euphemism - every outage, including the 99.9% that don’t end up on HN gets a postmortem document written about it, which is almost always a fascinating discussion of the technical, cultural and organisational situation that led to an unexpected bad thing happening.

Even a few years ago senior management knew to stay the fuck out except for asking for more info.

ekojs · 2025-06-12T18:29:46 1749752986

Super duper frustrating having the status page being green. Why can't Google do this properly?

supportengineer · 2025-06-12T18:41:29 1749753689

Those responsible have been sacked.

18172828286177 · 2025-06-12T18:43:36 1749753816

Those responsible for sacking the people who have just been sacked, have been sacked.

ekojs · 2025-06-12T10:50:19 1749725419

I share the sentiment. I think we will only be using Next.js for static sites/prebuilt SPA in the future.

csomar · 2025-06-12T11:09:35 1749726575

Actually Next.js with App router (and with Pages being pushed out) is really bad for SPAs. See this thread: https://github.com/vercel/next.js/discussions/64660

miyuru · 2025-06-12T12:15:18 1749730518

> I think we will only be using Next.js for static sites/prebuilt SPA in the future.

With whats mentioned in the blog post I would not use it even from static builds.

Zealotux · 2025-06-12T10:57:42 1749725862

You probably have better alternatives for that: Astro, React Router 7, TanStack.

ekojs · on April 8, 2025

I think it's most illustrative to see the sample battles (H2H) that LMArena released [1]. The outputs of Meta's model is too verbose and too 'yappy' IMO. And looking at the verdicts, it's no wonder by people are discounting LMArena rankings.

[1]: https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03...

smeeth · on April 8, 2025

In fairness, 4o was like this until very recently. I suspect it comes from training on COT data from larger models.

ed · on April 8, 2025

Yep, it’s clear that many wins are due to Llama 4’s lowered refusal rate which is an effective form of elo hacking.

ekojs · on March 25, 2025

> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!

From https://x.com/OfficialLoganK/status/1904583353954882046

The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.

chipgap98 · on March 25, 2025

Any word on what that pricing is? I can't seem to find it

rvnx · on March 25, 2025

Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.

This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”

chipgap98 · on March 25, 2025

Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"

I assumed that meant there was a paid version with a higher rate limit coming out today

altbdoor · on March 26, 2025

The parent Twitter post mentions:

    Available as experimental and for free right now in Google AI Studio + API, with pricing coming very soon!

And the pricing page [1] still does not show 2.5 yet.

[1]: https://ai.google.dev/gemini-api/docs/pricing

KoolKat23 · on March 25, 2025

I expect this might be pricier. Hoping not unusable level expensive.

xnx · on March 25, 2025

Currently free, but only 50 requests/day.

sagarpatil · on March 26, 2025

Any idea what is RPM for this model?

xnx · on March 26, 2025

https://aistudio.google.com/prompts/new_chat says 2 for free, but also 5, which might be the rpm when they start charging.

ekojs · on March 19, 2025

> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable

It's not actually that expensive and hard. For narrow usecases, you can produce 4-bit quantized fine-tunes that perform as well as the full model. Hosting the 4-bit quantized version can be done on relatively low cost. You can use A40 or RTX 3090 on Runpod for ~$300/month.