Is this still the case for sliding-window attention / streaming LLMs, where you have a fixed-length attention window rather than attending over an ever-growing context with quadratic scaling? You can even get better performance from purposely downsampling non-meaningful attention-sink tokens.
I cover this a bit in the blog post, but unless you have a really long context (32k+ tokens), your primary computational cost doesn't come from attention but from loading the model weights from VRAM into registers.
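A quick back-of-envelope sketch of that point: during decoding, every generated token has to stream the full set of weights from VRAM, so memory bandwidth sets a floor on per-token latency that dwarfs the KV-cache traffic at modest context lengths. The numbers below (a 7B-parameter model in fp16, ~1 TB/s of memory bandwidth, 32 layers/heads, head dim 128) are illustrative assumptions, not figures from the post:

```python
# Back-of-envelope: why short-context decoding is bandwidth bound.
# All hardware/model numbers here are assumptions for illustration.

params = 7e9                   # assumed 7B-parameter model
bytes_per_param = 2            # fp16
weight_bytes = params * bytes_per_param

bandwidth = 1e12               # assumed ~1 TB/s GPU memory bandwidth

# Every decoded token must read all weights from VRAM once.
weight_ms = weight_bytes / bandwidth * 1e3

# KV-cache traffic per token at a 4k context, assuming 32 layers,
# 32 heads, head_dim 128, fp16 keys + values:
layers, heads, head_dim, ctx = 32, 32, 128, 4096
kv_bytes = 2 * layers * heads * head_dim * ctx * bytes_per_param
kv_ms = kv_bytes / bandwidth * 1e3

print(f"weights per token:  {weight_ms:.1f} ms")   # ~14 ms
print(f"KV cache per token: {kv_ms:.2f} ms")       # ~2 ms
```

With these (assumed) numbers, streaming the weights costs roughly 14 ms per token while the 4k-context KV cache costs about 2 ms, so attention traffic only starts to dominate once the context grows far beyond that.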
I mean, practically speaking, completions from, say, ChatGPT or Claude take seconds to finish :)