Hacker News

I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous.

I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true. I asked Claude to design an analysis assessing the fidelity of 1-, 2-, 4-, and 8-bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 source of truth. 1- and 2-bit quantizations were about 90% similar, and 8-bit quantization was lossless at the precision Claude used to display the results. 4-bit was interesting: at 99% similar it was almost lossless, yet half the size of 8-bit. It seemed like the sweet spot.
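For anyone curious what that kind of analysis looks like, here's a minimal sketch. It's an assumption-laden stand-in for what Claude actually ran: random unit vectors replace the downloaded embeddings, and a simple uniform min/max quantizer replaces whatever scheme Claude chose.

```python
# Hypothetical reconstruction of the quantization-fidelity experiment:
# quantize embeddings to k bits, then check how well pairwise cosine
# similarities correlate with the float32 "source of truth".
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm embeddings

def quantize(x, bits):
    """Uniform quantization to 2**bits levels, then dequantize back to float."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

def cosine(a, b):
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )

# Sample random embedding pairs and score them at float32 precision.
i, j = rng.integers(0, 1000, 200), rng.integers(0, 1000, 200)
ref = cosine(emb[i], emb[j])

for bits in (1, 2, 4, 8):
    deq = quantize(emb, bits)
    r = np.corrcoef(ref, cosine(deq[i], deq[j]))[0, 1]
    print(f"{bits}-bit: correlation with float32 similarities = {r:.4f}")
```

With this toy setup the correlation climbs toward 1.0 as the bit width increases, with 4-bit already very close — the same qualitative shape as the result described above.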

This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.




4-bit quantization on newer Nvidia hardware is being supported in training as well these days. I believe the gpt-oss models were trained natively in MXFP4, a 4-bit floating-point format (E2M1: 2 exponent bits, 1 mantissa bit, plus 1 sign bit).

It doesn't seem terribly common yet though. I think it is challenging to keep it stable.

[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...

[2] https://www.opencompute.org/documents/ocp-microscaling-forma...


MXFP4 is a block-based floating-point format. The E2M1 format applies to individual values, but each 32-value block also carries a shared 8-bit exponent that provides scaling information for the whole block.
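To make that concrete, here's a small sketch of how decoding such a block could work. This is my reading of the OCP MX spec linked above, not production code; the function names are made up, and I'm assuming the shared exponent is stored E8M0-style with a bias of 127.

```python
# Decode MXFP4: each value is 4-bit E2M1, and 32 of them share one
# 8-bit exponent that scales the whole block.
import numpy as np

def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        mag = 0.5 * man                       # subnormal: 0 or 0.5
    else:
        mag = 2.0 ** (exp - 1) * (1.0 + 0.5 * man)  # 1, 1.5, 2, 3, 4, 6
    return sign * mag

def decode_mxfp4_block(codes, shared_exp):
    """Scale 32 E2M1 values by the block's shared exponent, 2**(shared_exp - 127)."""
    scale = 2.0 ** (shared_exp - 127)
    return np.array([decode_e2m1(c) for c in codes]) * scale

# The eight positive E2M1 magnitudes:
print([decode_e2m1(c) for c in range(8)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Note how coarse the per-value grid is — only eight magnitudes — which is why the shared block scale matters so much for dynamic range.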

There's also work on ternary models that's quite interesting, because the arithmetic operations are super fast and they're extremely cache efficient. Well worth looking into if that's the sort of thing that interests you.
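A toy illustration of why the arithmetic is so fast: with weights restricted to {-1, 0, +1}, a matrix-vector product needs no multiplications at all, only additions and subtractions. This is a generic sketch of the idea, not any particular ternary model's kernel.

```python
# A matvec with ternary weights reduces to selective add/subtract.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary weight matrix in {-1, 0, 1}
x = rng.standard_normal(8)

# Standard matvec...
y_mul = W @ x

# ...is equivalent to adding where the weight is +1 and subtracting where
# it's -1, skipping zeros entirely (multiply-free):
y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y_mul, y_addsub)
```

In real ternary inference kernels this shows up as bitmask/popcount tricks instead of numpy indexing, but the principle is the same.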

Mind sharing any resources? I've been thinking about trying to understand them better myself.

This is an ongoing course at CMU you can shadow.

https://modernaicourse.org/


That's cool.

I do wonder where the extra acuity from that last 1% shows up in practice. I hate that I have basically no way to tell intuitively, given how much of a black box the system is.


Well, why would Claude know any of this? It's arguably the wrong criterion anyway: if you have your own dataset to benchmark against, you'd create your own calibration for quantization with it. And scientifically, you wouldn't really believe in the whole process of gradient descent if you didn't think tiny differences in these values mattered. So...

I think you might be replying to a different person or misunderstanding what I said, but you're right that, just as I don't have an intuition for where the acuity shows up in the corpus, I don't think Claude does either.


