
I'm somewhat at a loss here, beyond understanding the fundamentals. Can someone tell me how the compression impacts performance?


In short: for many inference tasks the bottleneck is memory bandwidth. Suppose you have a machine with 256 GB/s of memory bandwidth, and you want to run inference for a 4B model (a model with 4 billion parameters). If you load the model in BF16 format (16 bits per weight), each forward pass (i.e. each token generated) will require roughly ~8 GB of memory bandwidth. So 256/8 = 32 t/s, and that's the generation speed you will be strictly capped at even if your processing power is measured in exaFLOPS.

But now say you decide to quantize the model and run the quantized version instead. Suppose you made a Q4_K_M version (4 bits per weight, with some weights taking more). Now each forward pass will take roughly 2-3 GB of memory bandwidth (a rough approximation; in reality it will be around 2 GB), and even in the worst case 256/3 ≈ 85.3 t/s, while 256/2 = 128 t/s. Quantization can reduce the quality of the model and lower its performance, but with most modern quantization methods those losses are usually negligible (although, of course, they're still present). So, as you can see, quantization "widens" the memory bottleneck (without removing it fully) while still preserving (not always, though) acceptable quality.
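The back-of-envelope arithmetic above can be sketched in a few lines of Python (the function name and numbers are just the example's, not from any library):

```python
# Back-of-envelope cap on generation speed when each forward pass has to
# stream every weight from memory. Purely illustrative arithmetic.

def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s imposed by memory bandwidth alone."""
    gb_per_pass = params_billion * bits_per_weight / 8  # GB of weights read per token
    return bandwidth_gb_s / gb_per_pass

print(max_tokens_per_second(4, 16, 256))  # BF16: 256/8 = 32.0 t/s
print(max_tokens_per_second(4, 4, 256))   # ~Q4:  256/2 = 128.0 t/s
```

Real runtimes won't hit these numbers exactly (the KV cache and activations also consume bandwidth), but the ratio between the two cases is the point.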

(Sorry for my terrible English, it's not my native language)


The paper is about vector quantization, which affects the KV cache, not the model weights/sizes.


So let’s start with a really simple decoder transformer with a single layer and a single attention head, trained to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and a value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scalar attention score out). Each scalar attention score goes through softmax and becomes the weight you use to compute a weighted average of your values; the new value goes through the MLP, and the MLP output is projected into the logits from which you sample your next token (that’s the general idea at least; I skipped a few steps).
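A minimal NumPy sketch of that scoring-and-averaging step. To hedge: the shapes and names are illustrative, and the 1/sqrt(d) scaling is the standard convention (which the description above skips), not something from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(q, K, V):
    """q: (d,) query for the last token; K, V: (T, d) keys/values for prior tokens."""
    scores = K @ q / np.sqrt(q.shape[0])  # one scalar attention score per prior token
    weights = softmax(scores)             # weights sum to 1
    return weights @ V                    # weighted average of the values

rng = np.random.default_rng(0)
d, T = 8, 5
out = attend(rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
print(out.shape)  # (8,) -- this vector then goes through the MLP and into the logits
```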

The last query in the sequence will be new for every new token you predict, but the set of prior keys and values stays the same, i.e. keys and values are reusable. The key-value cache gets bigger and bigger with each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in VRAM, and you’d like to keep the size down by not storing the raw uncompressed tensors. To make this work well, your compression needs two things: it needs to be fast, so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other: either decompression is too slow and your tokens/s takes a hit, or you lose important precision and the model's output quality suffers. The claim in the paper is that they’ve made progress on both.
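To get a feel for why the cache size matters, here is a rough estimate. The model shape is an assumed 7B-class configuration (32 layers, 32 heads of dimension 128, 16-bit entries), not something taken from the paper:

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough KV-cache footprint: 2 tensors (K and V) per layer, 16-bit entries by default."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(32, 32, 128, 32_768)
print(f"{size / 2**30:.0f} GiB at a 32k context")  # 16 GiB
```

The footprint grows linearly with sequence length, which is why both compression and context caps pay off.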


So limiting max context length also reduces VRAM needs a bit? If the cache is 20% of the total, capping context at 1/10th would mean an 18% total memory reduction.
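That arithmetic as a sketch (assuming the cache grows linearly with context length):

```python
# If the KV cache is 20% of total memory and we cap context at 1/10th,
# the cache shrinks tenfold: 20% -> 2%, saving 18% of total memory.
cache_frac, context_cap = 0.20, 0.10
saved = cache_frac * (1 - context_cap)
print(f"{saved:.0%} of total memory saved")  # 18%
```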


Yup, exactly. In principle it helps with both: it improves inference speed by reducing memory-bandwidth usage, and it reduces the memory footprint of your KV cache.


Reposting it here ... I wrote this more intuitive explanation. I think you might find it helpful too!

https://prabal.ca/posts/google-long-context-cheaper/


I can certainly see the appeal of distributing the context with vc. However, I have always imagined this to be integrated into an existing kanban workflow, similar to a Jira or gh issue board. Perhaps agent specific, perhaps not.

Furthermore, an existing kanban (ticket) workflow will expect you to refine the context into something more ... concentrated, or at least something that we are used to seeing as developers working with tickets, more so than the chat history that seems to be favored.

Have you put any thought into how this would integrate into such a process?


I did - GitHub and Trello (and I expect Jira) have APIs that could be used to hook up an MCP server. I liked the idea of conversing with the agent in the ticket, but I decided against that because I'd have to keep refreshing the issues, and it seemed a bit janky moving in and out of the IDE.

I also considered a full harness that could stream / sync the responses, but as per my comment below, implementing a full harness meant losing a lot of the IDE integration features that come with the hand off to GitHub Copilot.

> I went down the route of implementing a full harness for a while like Vibe Kanban, but the issue was that it was unlikely (without significant effort) to be as good as GitHub Copilot chat, and it meant forfeiting all of the IDE integrations etc (like diff visualisation for the agent's actions etc).

Having worked with a flow similar to this for a while - the markdown files become quite valuable as a history of planning and decisions for features. I didn't want to lose that. I just needed some help with managing the plan files I was maintaining - which the kanban board tooling does. A few command shortcuts via @kanban help too.

Regarding what goes into the files, the agent tends to be quite concise - you don't see the whole train of thought that some of the harnesses surface.


> I'm trying to make Instagram be what my parents said Facebook was. Christ, I'm old.


(I missed adding my question and now I don't know how to edit my previous reply)

What I'm curious about, and something I'm struggling with myself: do you feel that the productivity uplift of offloading work to an LLM limits the gains from managing a second brain like this?


I'm not sure if I am understanding your question,

Is it something like: if we delegate management, we could lose awareness of the knowledge contained in the notes?


If that is the question,

I agree: with LLM support we lose some level of manual work, and maybe that manual work is useful for getting more familiar with the information, although I am not sure it is a critical step. The LLM gives us management support. We can use a command to collect incomplete tasks; the LLM is doing that with extra steps and shows us the collected information, but in the end, we decide which tasks to prioritize.


Thanks for sharing! Easy to follow instructions.

I can see the PARA structure and perhaps a bit of Zettelkasten on the periphery (but that might be my own bias).

I'm using a similar system (here's a link) https://blog.hampusadamsson.com/blog/How%20I%20Manage%20Note...

And also a custom plugin to do parts of the LLM (here's another shameless link) lifting: https://github.com/hampusadamsson/modai


A tiny broker to experience trading with fake money


I had a similar idea once :). How do you get the stock prices? It was difficult for me to find an affordable (free?) source of this information.

