For You is based on your likes. If you get an empty feed, you probably haven't liked anything yet. Try liking a couple of posts in the Discover feed and then come back to For You.
To help me debug the algorithm I built a simple web UI that lets you see the feed for any user by plugging in their account ID: https://linklonk.com/bluesky
You can switch perspective to other users and explore how they would experience the feed.
Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.
I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.
You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token no matter if the model is just a fixed size and quite deep, or recurrent.
What you are describing is similar to how https://LinkLonk.com works (my side project): when you "like" a link you get connected to the RSS feeds that posted that link and to other users who also liked it. Then you get content from the feeds and users you are connected to. The more links you have in common with a feed or a user, the more weight their other links have.
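A rough sketch of that weighting idea, not the actual LinkLonk code. It assumes feeds and users can both be treated as "sources" that map to a set of links, that the connection weight is simply the count of links in common, and that a link's score is the sum of the weights of the sources that have it. The names `recommend`, `my_likes` and `sources` are made up for illustration.

```python
from collections import Counter

def recommend(my_likes, sources):
    """Rank links I have not liked yet by the weight of the sources that have them.

    my_likes: set of links I liked.
    sources:  dict mapping a source name (feed or user) to its set of links.
    """
    scores = Counter()
    for name, links in sources.items():
        # Assumption: connection weight = number of links in common with me.
        weight = len(links & my_likes)
        if weight == 0:
            continue  # not connected to this source
        for link in links - my_likes:
            scores[link] += weight
    # Highest score first; ties broken by link name to keep the order stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

my_likes = {"a", "b"}
sources = {
    "feed:x": {"a", "b", "c"},  # 2 links in common with me
    "user:y": {"a", "d"},       # 1 link in common
    "feed:z": {"e"},            # nothing in common, ignored
}
ranking = recommend(my_likes, sources)
```

Here link `c` ends up ranked above `d` because it comes from the source that shares more links with me.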
My understanding is that the attention in all transformer layers is "causal", that is, the output of a transformer layer for token N depends only on tokens from 0 to N.
This means that every attention layer can use previously calculated outputs for the same prompt prefix. So it only needs to calculate from scratch starting from the first unique token in the prompt sequence.
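A toy sketch of the prefix reuse, under heavy assumptions: `causal_layer` is a stand-in for a real layer (it just sums the tokens up to position n), and the cache stores per-position outputs for a single previous prompt. Real implementations cache keys and values per layer instead, but the reuse logic has the same shape. All names here are hypothetical.

```python
def causal_layer(tokens, n):
    # Stand-in for a transformer layer: the output at position n
    # depends only on tokens[0..n], never on later tokens.
    return sum(tokens[: n + 1])

class PrefixCache:
    def __init__(self):
        self.tokens = []    # tokens of the previously processed prompt
        self.outputs = []   # per-position outputs for that prompt
        self.computed = 0   # how many positions were computed from scratch

    def run(self, tokens):
        # Length of the prefix shared with the previous prompt.
        shared = 0
        for old, new in zip(self.tokens, tokens):
            if old != new:
                break
            shared += 1
        # Reuse cached outputs for the shared prefix, compute only the rest.
        outputs = self.outputs[:shared]
        for n in range(shared, len(tokens)):
            outputs.append(causal_layer(tokens, n))
            self.computed += 1
        self.tokens, self.outputs = list(tokens), outputs
        return outputs

cache = PrefixCache()
first = cache.run([1, 2, 3, 4])   # all 4 positions computed
second = cache.run([1, 2, 9])     # shares [1, 2], so only 1 new position
```

Because of causality, the reused outputs for the shared prefix are identical to what a full recomputation would give.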
I think the commenter was thinking about the input embedding layer, where to get an input token embedding the model does a lookup of the embedding by index, which is constant time.
And the blog post author is talking about the output layer where the model has to produce an output prediction for every possible token in the vocabulary. Each output token prediction is a dot-product between the transformer hidden state (D) and the token embedding (D) (whether shared with input or not) for all tokens in the vocabulary (V). That's where the VD comes from.
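To make the difference concrete, here is a small pure-Python sketch with made-up sizes (V = 4, D = 3). The function names are hypothetical, and the multiplication counter is only there to show where the V·D comes from.

```python
def embed(embedding, token_id):
    # Input side: a single index lookup, independent of vocabulary size V.
    return embedding[token_id]

def output_logits(embedding, hidden):
    # Output side: one D-dimensional dot product per vocabulary entry,
    # so V * D multiplications in total.
    mults = 0
    logits = []
    for row in embedding:              # V rows
        s = 0.0
        for w, h in zip(row, hidden):  # D terms per row
            s += w * h
            mults += 1
        logits.append(s)
    return logits, mults

# V = 4 tokens, D = 3 dimensions (assumed toy values, embedding shared
# between input and output for simplicity).
embedding = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
hidden = [2, 3, 4]

vec = embed(embedding, 3)
logits, mults = output_logits(embedding, hidden)
```

In this toy case `mults` is 12, which is V·D = 4·3, while the input lookup touches a single row no matter how large V is.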
It would be great to clarify this in the blog post to make it more accessible but I understand that there is a tradeoff.
How the algorithm works: it finds people who liked the same posts as you, and shows you what else they’ve liked recently.
Launched the feed a little over a year ago and it has become the most liked feed.