Hacker News | lonk11's comments

Building a custom feed for Bluesky which uses collaborative filtering over the likes data: https://foryou.club

How the algorithm works: it finds people who liked the same posts as you, and shows you what else they’ve liked recently.
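
A rough sketch of that idea in Python, not the feed's real code. The `(user, post, timestamp)` record layout, the `since` cutoff for "recently", and scoring a post by how many co-likers liked it are all assumptions made for illustration.

```python
from collections import Counter

def for_you(likes, user, since, top_n=10):
    """Rank posts for `user` from (user, post, timestamp) like records.

    Assumed scoring: a post's score is the number of co-likers (people
    who share at least one liked post with `user`) who liked it at or
    after `since`.
    """
    my_posts = {post for u, post, _ in likes if u == user}
    # People who liked at least one of the same posts.
    co_likers = {u for u, post, _ in likes if post in my_posts and u != user}
    scores = Counter()
    for u, post, ts in likes:
        if u in co_likers and post not in my_posts and ts >= since:
            scores[post] += 1
    # Highest score first; ties broken by post id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [post for post, _ in ranked][:top_n]

likes = [
    ("alice", "p1", 1), ("alice", "p2", 2),
    ("bob", "p1", 1), ("bob", "p3", 5),
    ("carol", "p2", 2), ("carol", "p3", 6), ("carol", "p4", 7),
    ("dave", "p5", 8),  # shares no likes with alice, so is ignored
]
print(for_you(likes, "alice", since=4))  # ['p3', 'p4']
```

A user with no likes has no co-likers, so the function returns an empty list for them.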

Launched the feed a little over a year ago and it has become the most liked feed.


My original title was: "Serving For You from my living room"


For You is based on your likes. If you get an empty feed then you probably haven't liked anything yet. Try liking a couple of posts in the Discover feed and then come back to For You.


Since For You is based on likes, I would suggest liking more of the kind of posts that you want to appear in your For You feed.


"likes by people you follow" is "Popular with Friends": https://bsky.app/profile/bsky.app/feed/with-friends

The For You one uses only your likes: it finds people who liked the same posts as you, and shows you what else they've liked recently.


This is definitely doable and anyone can build such a feed using Bluesky's APIs.

As an example, I built a "For You" feed https://bsky.app/profile/did:plc:3guzzweuqraryl3rdkimjamk/fe... that finds the posts you liked, finds other people who liked the same posts and shows you what else they liked.

To help me debug the algorithm I built a simple web UI that allows you to see the feed for any user by plugging in their account ID: https://linklonk.com/bluesky

You can switch perspective to other users and explore how they would experience the feed.


Running one layer 4 times should only need to fetch the weights of that layer once. Running 4 different layers makes you fetch 4x the parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
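
A back-of-the-envelope version of that arithmetic, assuming a layer's weights stay in fast memory across its repeated passes and counting only weight traffic (no activations or KV cache). The function name and the layer size are hypothetical.

```python
def weight_bytes_fetched(distinct_layers, params_per_layer, bytes_per_param=2):
    """Bytes of weights read from slow memory for one forward pass.

    Assumes each distinct layer's weights are fetched once, no matter
    how many times that layer is applied in a row.
    """
    return distinct_layers * params_per_layer * bytes_per_param

params = 1_000_000_000  # hypothetical 1B-parameter layer
recurrent = weight_bytes_fetched(1, params)  # one layer applied 4 times
stacked = weight_bytes_fetched(4, params)    # four different layers
print(stacked // recurrent)  # 4
```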


Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token no matter if the model is just a fixed size and quite deep, or recurrent.


What you are describing is similar to how https://LinkLonk.com works (my side project) - when you "like" a link you get connected to the RSS feeds that posted that link and other users that also liked it. Then you get content from feeds and users that you are connected to. The more links in common you have with a feed or a user the more weight their other links have.
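
A toy sketch of that weighting, not LinkLonk's actual code. The `sources` mapping, the names in it, and using the raw count of links in common as the weight are assumptions for illustration.

```python
from collections import defaultdict

def rank_links(my_likes, sources):
    """Score links the user has not liked yet.

    `sources` maps a feed or user name to the set of links it posted or
    liked. Assumed weighting: a source's weight is the number of links
    it has in common with `my_likes`.
    """
    scores = defaultdict(int)
    for name, links in sources.items():
        weight = len(links & my_likes)
        if weight == 0:
            continue  # not connected to this source
        for link in links - my_likes:
            scores[link] += weight
    # Highest score first; ties broken by link for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

my_likes = {"a", "b", "c"}
sources = {
    "feed:example-blog": {"a", "b", "x"},  # 2 links in common
    "user:someone": {"c", "x", "y"},       # 1 link in common
    "feed:unrelated": {"z"},               # no overlap, ignored
}
print(rank_links(my_likes, sources))  # [('x', 3), ('y', 1)]
```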


My understanding is that the attention in all transformer layers is "causal": the output of a transformer layer for token N depends only on tokens 0 to N.

This means that every attention layer can use previously calculated outputs for the same prompt prefix. So it only needs to calculate from scratch starting from the first unique token in the prompt sequence.
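
A toy illustration of why that reuse is safe, not a real attention implementation. The "layer" here just averages the token values at positions 0..n, standing in for any causal layer, and the function names are made up.

```python
def causal_outputs(tokens, start=0, prior=None):
    """Toy causal layer: output[n] is the mean of tokens[0..n].

    Positions before `start` are copied from `prior` (outputs cached
    from another prompt with the same prefix) instead of recomputed.
    """
    outputs = list(prior[:start]) if prior else []
    for n in range(start, len(tokens)):
        context = tokens[: n + 1]  # position n never sees later tokens
        outputs.append(sum(context) / len(context))
    return outputs

def shared_prefix_len(a, b):
    """Number of leading tokens the two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

prompt_a = [3, 1, 4, 1, 5]
prompt_b = [3, 1, 4, 2, 6]  # diverges from prompt_a at position 3

cached = causal_outputs(prompt_a)
k = shared_prefix_len(prompt_a, prompt_b)
reused = causal_outputs(prompt_b, start=k, prior=cached)
print(reused == causal_outputs(prompt_b))  # True
```

Only positions from `k` onward are computed for the second prompt, and the result matches a full recomputation.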


I think the commenter was thinking about the input embedding layer, where to get an input token embedding the model does a lookup of the embedding by index, which is constant time.

And the blog post author is talking about the output layer where the model has to produce an output prediction for every possible token in the vocabulary. Each output token prediction is a dot-product between the transformer hidden state (D) and the token embedding (D) (whether shared with input or not) for all tokens in the vocabulary (V). That's where the VD comes from.
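
A small sketch of the two costs, with toy sizes and made-up function names. The lookup touches one row, while the output side does V dot products of length D.

```python
def embed(embeddings, token_id):
    """Input side: one row lookup, independent of vocabulary size."""
    return embeddings[token_id]

def output_logits(hidden, embeddings):
    """Output side: one D-length dot product per vocabulary entry."""
    multiplies = 0
    logits = []
    for row in embeddings:             # V rows
        total = 0.0
        for h, e in zip(hidden, row):  # D terms each
            total += h * e
            multiplies += 1
        logits.append(total)
    return logits, multiplies

V, D = 5, 3  # toy sizes
embeddings = [[float(v + d) for d in range(D)] for v in range(V)]
hidden = [1.0, 0.0, -1.0]
logits, multiplies = output_logits(hidden, embeddings)
print(multiplies == V * D)  # True
```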

It would be great to clarify this in the blog post to make it more accessible but I understand that there is a tradeoff.

