IO is very bursty in these setups. Once the router results are in, you can start loading experts from SSD, and during that brief moment the SSD is saturated.
Outside of that the SSD is idling.
Table 3 shows, for K=4 experts, an IO of 943 MB/Tok at 3.15 Tok/s, giving an average IO of about 2970 MB/s, far below what the SSD could do.
I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors, parallelizing compute with IO.
Not sure if this works on Mac; I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and there I saw that about 20% of total reads finish while my fused up/gate matmul is already running.
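The overlap idea can be sketched roughly like this. This is a toy illustration only, assuming a hypothetical layout where the "down" tensors are known right after routing; the names, shapes, and the in-memory stand-in for the SSD reads are all made up.

```python
# Sketch: kick off the "down" tensor loads, run the up/gate matmul
# while they are in flight, then join before the down projection.
import threading
import numpy as np

def load_expert(weights_store, expert_id, out):
    # Stand-in for an async O_DIRECT / io_uring read of one down tensor.
    out[expert_id] = weights_store[expert_id].copy()

def forward(x, up_gate, weights_store, expert_ids):
    down = {}
    threads = [threading.Thread(target=load_expert,
                                args=(weights_store, e, down))
               for e in expert_ids]
    for t in threads:
        t.start()                    # start the "down" reads early
    h = np.maximum(x @ up_gate, 0)   # up/gate matmul runs meanwhile
    for t in threads:
        t.join()                     # reads finish during the matmul
    return sum(h @ down[e] for e in expert_ids)

# toy usage with two tiny "experts"
store = {0: np.eye(4), 1: 2 * np.eye(4)}
y = forward(np.ones((1, 8)), np.ones((8, 4)), store, [0, 1])
```

The point is only the ordering: the reads are issued before the matmul starts, so the compute hides part of the IO latency.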
The GitHub page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon; you get heavy contention for the shared hardware resources.
It's not only about you understanding the how, but also about you not understanding the goal.
I often use AI successfully, but in a few cases it went badly. That was when I didn't even know the end goal myself and regularly switched the fundamental assumptions that the LLM was trying to build on.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches, and as soon as something specifically did not work, it immediately switched approaches.
Next time I get to work on this (toy) problem I will let it implement some of them, fully parametrize them, and have a go with them myself. Then there is a concrete goal, and I can play around to see if my specific convergence criterion is even possible.
LLMs massively reduce the cost of "let's just try this". I think trying to migrate your entire repo is usually a fool's errand. Figure out a way to break the load-bearing part of the problem out into a sub-project, solve it there, iterate as much as you like. Claude can give you a test gui in one or two minutes, as often as you like. When you have it reliably working there, make Claude write up a detailed spec and bring that back to the main project.
I've been learning that Claude is surprisingly good at GUI work: not just getting stuff working, but also creating reasonably tasteful and practical designs. Asking Claude in the browser to mock up a GUI and then having Claude Code implement it is a surprisingly powerful workflow.
I'm far from being a web developer or a web designer. But I think I intuitively understand how to put myself in the shoes of the end user when it comes to UX.
I noticed that Claude is awful at understanding what makes good UX, even for something as simple as a one-line input box with a submit button: you should wire it up so a user can press return instead of clicking the button, and think about letting them tab through inputs in a sensible order.
Yeah, since it's not using its own flow, you have to give it a bit of feedback. So it goes with any dev work... I think you underestimate how bad programmer UIs are.
Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.
They are explicitly not assuming anything about the content of the auxiliary space (full hard drive).
So the data might be incompressible and thus compressing it and restoring it afterwards would not work.
Edit:
From the paper:
> One natural approach is to compress the data on the hard disk as much as possible, use the freed-up space for your computation and finally uncompress the data, restoring it to its original setting. But suppose that the data is not compressible. In other words, your scheme has to always work no matter the contents of the hard drive. Can you still make good use of this additional space?
So the trick is to do the computation forwards, but take care to only use reversible operations, store the result outside of the auxiliary "full" memory and then run the computation backwards, reversing all instructions and thus undoing their effect on the auxiliary space.
This is called catalytic because the machine wouldn't be able to do the computation in the amount of clean space it has alone, but can do it by temporarily mutating the auxiliary space and then restoring it.
What I haven't yet figured out is how to do reversible instructions on the auxiliary space. You can mutate a value depending on your input, but how do you then use that value? You can't assume anything about the contents of the auxiliary space, and just overwriting it with a constant (e.g. 0) is not reversible.
Maybe there is some xor like trick, where you can store two values in the same space and you can restore them, as long as you know one of the values.
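The XOR trick does exist in exactly that form: `a ^= x` is its own inverse, so you can scribble on a cell of unknown contents and restore it later, as long as you still know `x`. A toy sketch (this only demonstrates the reversibility, not a full catalytic algorithm):

```python
# Borrow a "full" auxiliary cell with arbitrary, unknown contents,
# mutate it reversibly, and restore it exactly.
import random

aux = random.getrandbits(32)   # arbitrary contents, maybe incompressible
saved = aux                    # kept only so we can verify restoration

x = 0xDEADBEEF                 # a value from our own computation
aux ^= x                       # reversible mutation of the aux cell
recovered = aux ^ x            # the original contents stay recoverable
# ... the computation could use aux as scratch here ...
aux ^= x                       # undo: the aux cell is back to its old value
```

Since XOR with a known value is a bijection, no information about the original contents is ever destroyed, which is what makes the final restoration possible.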
Edit:
After delving into the paper linked in another comment, which is rather mathy (or computer-sciency in the original meaning of the phrase), I'd like to see a simple example of a program that cannot run in its amount of free space and actually needs to utilize the auxiliary space.
> FWIU from "Quantum knowledge cools computers", if the deleted data is still known, deleting bits can effectively thermally cool, bypassing the Landauer limit of electronic computers? Is that reversible or reversibly-knotted or?
> So the trick is to do the computation forwards, but take care to only use reversible operations, store the result outside of the auxiliary "full" memory and then run the computation backwards, reversing all instructions and thus undoing their effect on the auxiliary space.
If the results were stored outside the auxiliary "full" memory, there wouldn't be any need to reverse storing the results. So you probably meant the opposite: store the result inside of the auxiliary "full" memory.
What even is an artificial neuron in an Artificial Neural Network executed on "normal" (non-neuromorphic) hardware? It is a set of weights and an activation function.
And you evaluate all neurons of a layer at the same time by multiplying their weights in a matrix by the incoming activations in a vector. Then you apply the activation function to get the outgoing activations.
Viewing this from a hardware perspective, there are no individual neurons, just matrix multiplications followed by activation functions.
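The layer-at-once evaluation described above looks like this in code. The weights and sizes here are made-up numbers purely for illustration:

```python
# One "layer of neurons" as a single matmul plus an activation:
# 3 neurons, 4 inputs each, no per-neuron loop anywhere.
import numpy as np

W = np.array([[ 1., 0., -1., 2.],   # each row is one neuron's weights
              [ 0., 1.,  1., 0.],
              [-2., 1.,  0., 1.]])
b = np.zeros(3)                     # biases

def layer(x):
    # evaluate all neurons of the layer at the same time,
    # then apply the activation function (ReLU here)
    return np.maximum(W @ x + b, 0.0)

out = layer(np.array([1., 2., 3., 4.]))
```

From the hardware's point of view the "neurons" only exist as rows of `W`; the whole layer is one matrix-vector product.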
I'm going out of my area of expertise here, I just started studying bioinformatics, but biological neurons can't simply hold an activation, because they communicate by depolarizing their membrane. So they have to be spiking by their very nature of being cells.
This depolarization costs a lot of energy, so they are incentivized to do more with fewer activations.
Computer hardware doesn't have a membrane and thus can hold activations; it doesn't need spiking, and these activations cost very little on their own.
So I'm not sure what we stand to gain from more complicated artificial neurons.
On the other hand, artificial neural networks do need a lot of memory bandwidth to load in these weights. So an approach that better integrates storage and execution might help, whether that is memristor tech or something else.
Cerebras uses SRAM integrated into a giant chip, I think. It gives extremely fast inference -- they claim 70x faster than GPU clouds, over 2000 tokens per second of output from a 70B model. But it still uses a ton of energy as far as I know. And the chips are, I assume, expensive to produce.
Memristors might work to get the next 10x or 100x in efficiency from where Cerebras is.
As for more complex neurons, I was thinking that if each unit were of a similar order of magnitude in size but could somehow do more work, then that could be more efficient.
I think it is a little unusual, but not unheard of.
I started as a junior IT consultant in Hamburg, Germany with 3 months' notice, but could only quit every half year. So for quitting at the end of June I had to give notice in March, and the next opportunity after that was September for quitting at the end of December.
But since I'm going back to university now, we made an "Aufhebungsvertrag" (a mutual termination agreement) so that I quit at the end of September. You have no legal right to one, but I have a good relationship with my employer, so they accepted.
"with a 3 month notice, but could only quit every half year"
I know little about German law, but this sounds insane. I also certainly wouldn't call it 3 months notice, I'd call it 3-9 months notice.
The real question is whether the company could only fire you at these two times of the year as well, or whether it only had to give 3 months notice. If the latter, you might find the courts agree that the contract is unfair if you took it to a tribunal.
This looks like German to me, but I haven't ever seen these words used together. I thought it might be some old German and googled it, but searching "bewunderungeifersucht" just yields your comment, and "bewunderungseifersucht", which I deemed more likely, yields nothing.
Also a German speaker, just a clarification for non-German speakers: it appears to be a compound-word neologism (combining two words that have not been combined before, similar to "frenemy" being derived from "friend" + "enemy" in English) that combines "Bewunderung" (admiration) and "Eifersucht" (jealousy). This sort of stuff does work in German, but it looks odd (and is a noun and should be capitalized anyway).
It rolls off the tongue a little better. I don't know if such a shortening would happen, except for fun, for example in a friendly conversation where you make up your own words like "frenemy". I'm an English speaker with some German training but no speaking practice. :)
Edit: Looks like this is used in one place on the Internet and Google actually translates it "admiration" if used in a sentence.
"können mich mit ihrer eifernden Sehn-, Wunder-, Macht-, Bewundersucht"
"I can with their zealous yearning, wonder, power, admiration"
Sucht without the Eifer just means greed or desire so I don’t think it works.
I think generally English words are pretty quirky in their etymology and pronunciation which makes them easier to abbreviate. In a more regular, organized language when you remove part of a word you just get another word.
Minimally corrected Google translation of the (short) article:
Lidl is wasting 500 million euros
Because the introduction of a new data system did not work out, Deutsche Post already had to record a high loss several years ago. The same thing happened to Lidl. After seven years and costs of more than half a billion euros, the planned system is still not running smoothly. Now the discounter has pulled the ripcord.
Lidl has been on an expansion course for years. The discounter from Neckarsulm now has branches in almost every country in Europe and is now also growing in the USA. A new merchandise management system was needed to easily keep track of the increasingly complex business processes and to control branches, purchasing and logistics. Hence the decision in 2011.
System is not good for high-turnover countries
Software from the Walldorf-based software company SAP was to be adapted to the needs of Lidl. So far, however, the new system has only been introduced in some small branches in Austria, Northern Ireland and the USA. It has been shown that the SAP version developed by over one hundred IT specialists is not suitable for high-turnover countries. Now Lidl has stopped the project. In a letter to employees, obtained by the newspaper "Heilbronner Stimme", it is said that the actual "goals" are "not reachable with justifiable effort". So far, according to expert opinion, the project has consumed more than half a billion euros - for expensive IT consultants and SAP licenses, for example. Now Lidl wants to further develop its own inventory management system.