Hacker News | simonw's comments

Yeah, this new post is a continuation of that work.


Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/


Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes this usable! I think one paper flying under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for Flash streaming.


That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8; since output is I/O-bottlenecked, there should be room for higher-accuracy compute.


Check my repo, I added support for GGUF/Unsloth at Q3/Q5/Q8: https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...


To be fair, it's "possible" to run such a setup with llama.cpp with SSD offload. It's just abysmal token-generation (TG) speeds. But it's possible.


That was a very good summary. One detail the post could use: the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).


I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?


Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App


Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App


Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. From my experience I suggest Q5 quantization at most; Q4 works for short responses but gets weird in longer conversations.


>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants, such as Unsloth's, which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others, and smaller models are more sensitive than larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella; it all depends.
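As a toy illustration of why "Q4" isn't one thing, here's a size calculation for a hypothetical mixed-precision recipe (the layer groups, parameter counts, and bit widths below are made up, not taken from any real quant):

```python
# Hypothetical mixed-precision recipe: sensitive layers kept at higher
# precision, the bulk of the FFN weights at 4 bits.
layers = {
    "embeddings": (0.5e9, 8),   # (parameter count, bits per weight)
    "attention":  (3.0e9, 6),
    "ffn":        (20.0e9, 4),
}

# Total size in GB: sum of (params * bits) across layer groups, in bytes.
total_bits = sum(n * bits for n, bits in layers.values())
gb = total_bits / 8 / 1e9
print(gb)  # 12.75 GB
```

A flat 4-bit quant of the same 23.5B parameters would be 11.75 GB, so two files both labeled "Q4" can differ meaningfully in both size and quality.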


I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.


I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.


Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!
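A minimal sketch of the mmap approach the comment describes, using only the Python standard library (the file name and layout are made up for illustration; real frameworks map multi-hundred-GB checkpoints the same way):

```python
import mmap
import struct

# Write a tiny stand-in "weight file" (in practice this is the
# multi-hundred-GB checkpoint already sitting on disk).
with open("weights.bin", "wb") as f:
    f.write(struct.pack("<4f", 0.0, 1.0, 2.0, 3.0))

# mmap the file instead of read()-ing it: pages fault in lazily, and
# under memory pressure the OS can simply drop clean pages and re-read
# them later, rather than writing them out to swap.
with open("weights.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Touching a slice pulls in only the pages that slice covers.
        values = struct.unpack("<4f", m[:16])
        print(values)  # (0.0, 1.0, 2.0, 3.0)
```

The read()-into-RAM path wastes memory twice: once for the framework's copy and once for the swap traffic when the OS evicts it.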

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.


Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop. It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.


I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was flakier (hallucinations, and sometimes it would think about running tool calls and then just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.


One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.


I don't think this is correct, "active parameters" is quite unambiguous in that it means a sum of all active experts plus shared parameters.


Looks like they meant “effective dense size”, which is the square root of (total params × active params), so in this case sqrt(397 × 17) ≈ 82.
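That arithmetic as a quick sketch (the geometric-mean "effective dense size" heuristic is a community rule of thumb for comparing MoE to dense models, not an official figure from any model card):

```python
import math

total_params = 397e9   # total parameters in the MoE model
active_params = 17e9   # parameters active per token

# Rule of thumb: effective dense size = geometric mean of total and active.
effective = math.sqrt(total_params * active_params)
print(round(effective / 1e9))  # ~82 (billion)
```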


But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.


OK, I am by no means an expert on this, and I immediately stand corrected. But as I understand it, to estimate the amount of active memory required, it’s more accurate to go by the ~82B number, right?


The ~82B figure is an attempt to compare performance to an equivalent dense model. The amount of active parameters is given by the ~17B.


Still pretty good, considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom.
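Rough back-of-envelope for that claim (assuming roughly 6.5 effective bits per weight for a Q6-style quant, and ignoring KV cache and runtime overhead, which add a few more GB):

```python
params = 17e9          # 17B active/dense parameters
bits_per_param = 6.5   # approximate effective rate of a Q6-style quant

# Weight file size in GB: params * bits, converted to bytes.
size_gb = params * bits_per_param / 8 / 1e9
print(round(size_gb, 1))  # ~13.8 GB, leaving some headroom on 16 GB
```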


So much this! I've been bugging Astral about addressing the sandboxing challenge for a while, I wonder if that might take more priority now they're at OpenAI?


Explain to me the harm that is caused to users of pip when this particular set of platform information is sent to PyPI.

(In case you were going to say that it associates hardware platform details with IP addresses - which would have been my answer - know that PyPI doesn't record IPs: https://www.theregister.com/2023/05/27/pypi_ip_data_governme... )

Then give me your version of why it's not reasonable for the Python packaging community (who are the recipients of this data, it doesn't go to Astral) to want to collect aggregate numbers against those platform details.


Any telemetry should be done after explicit user consent, period. The harm is that you normalize total surveillance with these little, seemingly innocent steps.


That's a solid answer, thanks.


Here's where that feature was (and is still being) discussed in the uv repo: https://github.com/astral-sh/uv/issues/1495

It's been open for two years but it looks like there's a PR in active development for it right now: https://github.com/astral-sh/uv/pull/18214


If you have hundreds of different Python projects on your machine (as I do) the speed and developer experience improvements of uv make a big difference.

I love being able to cd into any folder and run "uv run pytest" without even having to think about virtual environments or package versions.


Do you run those projects on the host system as your normal user without any isolation?


Yes, which makes me very vulnerable to supply chain attacks.


Yikes! I had a scare once, and since then I only run sandboxed code or scripts I've written with minimal 3rd party deps.

I assume you have other mitigations in place?


Not really. I have good backups and I try to stick with dependencies I trust.

I do a lot of my development work using Claude Code for web which means stuff runs in containers on Anthropic's servers, but I run things on my laptop most days as well.


The telemetry they removed here isn't unique to uv, and it's not being sent back to Astral. Here's the equivalent code in pip itself: https://github.com/pypa/pip/blob/59555f49a0916c6459755d7686a...

It's providing platform information to PyPI to help track which operating systems and platforms are being used by different packages.

The result is useful graphs like these: https://pypistats.org/packages/sqlite-utils and https://pepy.tech/projects/sqlite-utils?timeRange=threeMonth...

The field that guesses if something is running in a CI environment is particularly useful, because it helps package authors tell if their package is genuinely popular or if it's just being installed in CI thousands of times a day by one heavy user who doesn't cache their requirements.
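The CI guess is typically simple environment-variable sniffing; here's a hedged sketch of the idea (the exact variable list pip checks may differ from the names used here):

```python
import os

def looks_like_ci() -> bool:
    """Heuristic: common CI systems set one of these environment
    variables. Absence means 'probably not CI', not proof."""
    ci_markers = ("CI", "BUILD_ID", "BUILD_BUILDID", "PIP_IS_CI")
    return any(os.environ.get(name) is not None for name in ci_markers)

print(looks_like_ci())
```

Because it's a heuristic, the field can only be "probably CI" or "unknown", which is still enough to separate cached-free CI churn from genuine installs in aggregate.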

Honestly, stripping this data and then implying that it was collected by Astral/OpenAI in a creepy way is a bad look for this new fork. They should at least clarify in their documentation what the "telemetry" does so as not to make people think Astral were acting in a negative way.

Personally I think stripping the telemetry damages the Python community's ability to understand the demographics of package consumption while not having any meaningful impact on end-user privacy at all.

Here's the original issue against uv, where the feature was requested by a PyPI volunteer: https://github.com/astral-sh/uv/issues/1958

Update: I filed an issue against fyn suggesting they improve their documentation of this: https://github.com/duriantaco/fyn/issues/1


This is so upsetting. No wonder people spend more time in mobile apps than they do using the mobile web - the default web experience on so many sites is terrible.


I’ve been using the Reddit app some lately after being a longtime old.Reddit.com + blocker person.

Ignoring how [ad] navigation is kinda annoying [ad] the sheer [ad] number of ads [ad] they [ad] insert [ad] is insane.

The only good thing is none of them seem to be animated/video. Which is an incredibly low bar, but most sites can’t even jump that.


I'll probably leave reddit when old.Reddit.com gets the chop


I suspect I will too. I’ve been playing with the app a bit, as it’s easier on my phone for viewing subs that are mostly pictures (e.g. aquariums). But I only do it from time to time.

Apollo was much better, of course.


Same, but it sounds like Lemmy still has some issues, and it'll be hard to replace some of the niche subreddits.


It kind of doesn’t matter. The thing that makes Reddit, to me, is its size. Lemmy will never get there, so it won’t be able to replace it for me.

I love Mastodon, it’s what I use, but it’s not what I lost with Twitter. Some stayed, some went to BlueSky, some Threads, some just gave up. And we’ll never have it again. Assholes destroyed a whole world out of selfishness.


This is the problem. There's no good replacement for Reddit right now, and Digg just died again.


I’m honestly amazed they tried that. It’s been so long, it felt like a play to cash in on the name, but I feel like a huge chunk of people don’t really remember it or weren’t even around for it.


To say nothing of all the personal data the app is hoovering up. Guarantee that every last thing you granted permissions for is something they're monetizing.


I had Claude Code profile the page (using headless Chrome) to see what was going on, here's the resulting report: https://github.com/simonw/research/blob/main/pcgamer-audit/R...


I left that page open in Firefox on macOS (no ad blockers) and after five minutes the network devtools panel showed me it had hit 200MB transferred, 250MB total from over 2,300 requests.

