disiplus's comments | Hacker News


I have GLM and Kimi. Kimi was in most cases better and was my replacement for Claude when I ran out of tokens. Now I find myself using GLM more than Kimi. It's funny that GLM vs Kimi is like Codex vs Claude, where GLM and Codex are better for backend and Kimi and Claude more for frontend.

As Kimi did a huge amount of Claude distillation, that seems to be somewhat borne out by the data:

https://www.anthropic.com/news/detecting-and-preventing-dist...


Yeah, it seems they did not align it too much, at least for now. Yesterday it helped me bypass the bot detection on a local marketplace that I wanted to scrape some listings from for my personal alerting system. All the others failed, but GLM 5.1 found a set of parameters and tweaks to make my browser-in-a-container go undetected.
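Roughly the kind of thing it came up with, as a sketch (using Playwright here; the flags and the marketplace URL are illustrative, not the exact set it found):

    # Sketch of container-friendly "stealth" launch options (illustrative,
    # not the exact parameters GLM found). Assumes: pip install playwright
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=False,  # headless mode is a common detection signal
            args=[
                "--disable-blink-features=AutomationControlled",  # hides navigator.webdriver
                "--no-sandbox",  # often required inside containers
            ],
        )
        context = browser.new_context(
            # a realistic desktop user agent and viewport
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            locale="de-DE",  # match the marketplace's region
        )
        page = context.new_page()
        page.goto("https://marketplace.example/listings")  # hypothetical URL
        print(page.title())
        browser.close()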

I always jump on the Chinese models when I'm trying to do something that the US ones chastise me for; they're a little more liberal, especially around copyright.

A model doing what the user wants, with high quality, is definitely aligned in my book.

This can never go wrong!

It's too much in the direction of the paperclip maximizer for me. It should only hack sites when explicitly directed to, not by default.

Basically my experience as well. Sometimes it can break past 100k and be OK, but mostly it breaks down.

When it works and it's not slow, it can impress. Yesterday, for instance, it solved something that Kimi K2.5 could not, and Kimi was the best open-source model for me. But it's still slow sometimes. I have z.ai and Kimi subscriptions for when I run out of tokens for Claude (Max) and Codex (Plus).

I have a feeling it's nearing Opus 4.5 level, if they could fix it going crazy after ~100k tokens.


Why don't you start a new session or use the /compact command when context gets to 100k tokens?

From my testing it was ok until 145k tokens, the largest context I had before switching to a new session. I think Z.ai officially said it should be good until 200k tokens.

Using it in OpenCode, the context is compacted automatically when it gets too large.


The post mentions France, Germany, and the Nordic nations. France, Holland, and the Nordic nations helped in the early stages of the US.


It will also cost OpenAI dearly if they don't communicate clearly, because I for one will push internally to switch from OpenAI (we are actually on Azure) to Anthropic. And that goes for my private account as well.


You can deploy Opus and Sonnet on Azure.


This will not cost OpenAI anything.


Thanks for being the voice of cynical inaction.


I have them all. They're not just as good. Whoever tells you that looked only at the benchmarks, not real use. They all fall short at some point.

Kimi K2.5 is the best one, but it's still not at the level of what Anthropic released with Opus 4.5.


We’ll have to give it 3 weeks.


I think in the West we assume everything is blocked. But, for example, if you book an eSIM, when you visit you already get direct access to Western services because they route it through some other server. Hong Kong is totally different: people there basically use WhatsApp and Google Maps, and everything worked when I was there.


But also yes, the parent is right: HF is more or less inaccessible, and ModelScope is frequently cited as the mirror to use (although many Chinese labs seem to treat HF as the mirror, and ModelScope as the "real" origin).
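If you need to pull from the ModelScope side, it's roughly a one-liner (a sketch assuming the modelscope Python package; the repo id is just an example):

    # Download weights from ModelScope when Hugging Face is unreachable.
    # Assumes: pip install modelscope. Repo id is illustrative.
    from modelscope import snapshot_download

    local_dir = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
    print("weights at:", local_dir)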


Yeah, they're the good guys. I suspect the open-source work is mostly advertising, a way to sell consulting and services to enterprises. Otherwise, the work they do doesn't make sense to offer for free.


Haha, for now our primary goal is to expand the market for local AI and educate people on how to do RL and fine-tuning and how to run quants :)
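E.g. the kind of workflow we mean, as a rough sketch with our library (model name and hyperparameters are placeholders, not a tuned recipe):

    # Rough QLoRA fine-tuning sketch with Unsloth (placeholder model and
    # hyperparameters; see the docs for real recipes).
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit base
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,  # LoRA rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_alpha=16,
    )
    # ...then train with e.g. trl's SFTTrainer on your dataset.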


Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).

On another note: I'm a bit paranoid about quantization. People are no longer good at discerning model quality at these levels of "intelligence"; I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

I was recently trying Qwen 3 Coder Next, and there are benchmark numbers in your article, but they seem to be for the official checkpoint, not the quantized ones. It's not even really clear (and chatbots confuse them with benchmarks of the quantized versions, btw).

I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.


Thanks! Yes, we actually did think about that - it can sadly get quite expensive. Perplexity benchmarks over short context lengths with small datasets are doable, but they're not an accurate measure. We're currently investigating the most efficient course of action for evaluating quants - will keep you posted!
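For reference, the cheap-but-rough version looks something like this (a sketch with HF transformers; the repo id and corpus file are placeholders):

    # Quick perplexity spot-check of a quantized checkpoint (placeholder
    # repo id and corpus; as noted above, not an accurate quality measure).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"  # illustrative repo id
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

    text = open("sample_corpus.txt").read()  # small eval text
    ids = tok(text, return_tensors="pt").input_ids[:, :2048].to(model.device)

    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    print("perplexity:", torch.exp(loss).item())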


> How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

Very hard. $$$

The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.


Yes, sadly very expensive :( Maybe a select few quants could happen - we're still figuring out the most economical and efficient way to benchmark!


Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?


Oh, it's more the time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running all quants per model release can be quite painful.

Assume AWS spot pricing of, say, $20/hr for 8 B200 GPUs: that's $20-60 ish per quant, and if the benchmark covers BF16, 8-bit, 6, 5, 4, 3, and 2 bits, that's 7 ish tests, so $140 ish to $420 ish per model. Time-wise, 7 hours to 1 day ish.

We could run them after a model release which might work as well.

This is also on 1 benchmark.
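Back of the envelope, per benchmark:

    # Back-of-the-envelope cost of one benchmark across all quants of one
    # model (assumed numbers from above: $20/hr spot for an 8x B200 node).
    gpu_hourly = 20                 # $/hr for the whole 8-GPU node
    hours = (1, 3)                  # each benchmark run takes 1-3 hours
    quants = ["BF16", "8bit", "6bit", "5bit", "4bit", "3bit", "2bit"]

    low, high = (gpu_hourly * h * len(quants) for h in hours)  # $140, $420
    print(f"cost: ${low}-${high}; time: {hours[0] * len(quants)}-"
          f"{hours[1] * len(quants)} hours on one node")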


This would be amazing


Working on it! :)


I hope that is exactly what is happening. It benefits them, and it benefits us.

