Hacker News

I've worked on ML model quantization. The open-source 4-bit and 8-bit quantization schemes aren't as good as one can get - there are much fancier techniques that preserve predictive performance while squeezing size.

Some techniques (like quantization-aware training) involve changes to training.
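The usual description of quantization-aware training is that a "fake quantize" step is inserted into the forward pass, so the network learns weights that survive rounding; gradients flow through as if the rounding weren't there (the straight-through estimator). A minimal sketch of the forward-pass part - function names and the uniform-grid choice are my own, not any particular library's:

```python
def fake_quantize(w, bits=8):
    """Round a list of float weights onto a uniform b-bit grid, but return
    floats so the rest of the network (and the backward pass, via the
    straight-through estimator) keeps operating in full precision.
    Toy sketch: real implementations work on tensors and learn the scale."""
    levels = (1 << bits) - 1  # number of gaps in the quantization grid
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # snap each weight to the nearest grid point, then map back to float
    return [round((x - lo) / scale) * scale + lo for x in w]
```

During training you'd apply this in the forward pass only, so by deployment time the weights already sit near grid points and the final rounding loses little accuracy.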



I'm sure there are better methods! But in this case, MKML's numbers just don't look impressive when placed alongside the prominent quantization techniques already in use. According to this chart [0] it's most similar in size to a Q6_K quantization, and if anything has slightly worse perplexity.

If their technique were better, I imagine that the company would acknowledge the existence of the open source techniques and show them in their comparisons, instead of pretending the only other option is the raw fp16 model.

[0] https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...


From what I remember, compression schemes whose bit widths aren't a power of 2 tank inference speed (assuming Q6_K is 6-bit; I haven't actually verified whether ggml's q6_K llama is slow). Meanwhile, the site claims a speed-up.
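To make the alignment point concrete: when the bit width doesn't divide 8, packed values straddle byte boundaries, so every load and store pays for extra shifts and masks. A toy 6-bit packer (purely illustrative - not ggml's actual Q6_K layout, which also carries per-block scales):

```python
def pack6(vals):
    """Pack 6-bit unsigned ints into bytes. Since 6 doesn't divide 8,
    elements cross byte boundaries and need bit-shuffling on every access -
    the per-element overhead a non-power-of-2 format pays at inference."""
    acc = nbits = 0
    out = bytearray()
    for v in vals:
        acc |= (v & 0x3F) << nbits
        nbits += 6
        while nbits >= 8:        # flush whole bytes as they fill up
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                    # leftover partial byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack6(data, n):
    """Recover the first n 6-bit values from a pack6() byte stream."""
    acc = nbits = 0
    vals = []
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 6 and len(vals) < n:
            vals.append(acc & 0x3F)
            acc >>= 6
            nbits -= 6
    return vals
```

Compare with 4-bit or 8-bit formats, where two values share a byte (or one value is a byte) and unpacking is a single fixed shift - that's the usual argument for power-of-2 widths being faster.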

But I do actually agree with you - they should really be benchmarking against popular competitors. In my experience, fancier quantization is a _lot_ of work for fairly little gain (at least for neural nets). I also think that ML techniques such as quantization (or fancy param sweeps, feature pruning, that kind of stuff) tend to either get in-housed (i.e. the model will come quantized from the source) or get open-sourced.

In-housing of ML techniques tends to happen more often if there's a money-making model where the hardware running the model costs money, but running the model brings in money.


What about Unum's quantization methods?

https://github.com/unum-cloud/usearch


Not familiar with Unum. From a quick glance, it seems that they truncate least-significant bits, which is the simplest, but also the fastest, quantization method.
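For what it's worth, "truncating LSBs" on a float usually means zeroing the low mantissa bits of its IEEE-754 representation - cheap because it's a single mask, with no rescaling or codebook. A hypothetical float32 version (I haven't checked what usearch actually does):

```python
import struct

def truncate_lsb(x, keep_bits):
    """Keep only the top `keep_bits` of the 23-bit float32 mantissa,
    zeroing the rest. One mask per value; sign and exponent untouched.
    Illustrative sketch, not Unum's implementation."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    mask = ~((1 << (23 - keep_bits)) - 1) & 0xFFFFFFFF
    (y,) = struct.unpack('<f', struct.pack('<I', bits & mask))
    return y
```

The trade-off is the usual one: it's essentially free at runtime, but it spends its bit budget uniformly instead of adapting to the weight distribution the way k-means or block-scaled schemes do.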



