toa697's comments | Hacker News

Love the idea behind this.

Generally any new way of attempting to find signal among the noise of the internet is good, and I think RSS is niche enough to automatically remove a ton of noise from the dataset, and the fact this will mostly target articles and blogs gives it a distinct flavor.

Gonna put this with the Marginalia astrolabe in my "small web search" bookmark folder.


What other stuff do you have in this 'folder'?


Not to disappoint, but currently Marginalia and now this. (It used to just be Marginalia, no idea why it was a folder)

Though now that there's two I'm probably gonna pick up a few more from here that I like: https://seirdy.one/posts/2021/03/10/search-engines-with-own-...


Out of curiosity, what's the crawl speed of both Marginalia crawlers?

I was inspired to try spinning up my own crawler after reading some other posts on Marginalia Search. (It runs very dumbly, just pulling links from an ever-increasing in-memory set.) On a single thread with asynchronous web requests and a massive pool of async workers (10k; RAM is cheap on a personal machine), I've been able to reach around 300-400 requests per second: pulling the page, parsing for <a> tags, and throwing the hrefs onto the stack to search. I find the use of that many bespoke threads really surprising, both because of the increased complexity of threads over async code, and because of my (possibly naive) expectation that web traffic will always out-bottleneck CPU-bound tasks like HTML parsing/lexing/tagging.
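The design described above (one event loop, a pool of async workers, an ever-growing in-memory seen-set) can be sketched roughly like this in Python, assuming asyncio and a pluggable `fetch` coroutine standing in for whatever HTTP client is actually used. The names (`crawl`, `LinkParser`, the `limit` parameter) are illustrative, not from the actual crawler:

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href=...> in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

async def crawl(seeds, fetch, workers=100, limit=1000):
    seen = set(seeds)            # the ever-increasing in-memory set
    queue = asyncio.Queue()
    for url in seeds:
        queue.put_nowait(url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                html = await fetch(url)          # async web request
                parser = LinkParser(url)
                parser.feed(html)
                for link in parser.links:
                    if link not in seen and len(seen) < limit:
                        seen.add(link)
                        queue.put_nowait(link)
            except Exception:
                pass                             # skip pages that fail to fetch/parse
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()           # wait until the frontier is drained
    for task in tasks:
        task.cancel()
    return seen
```

At 10k workers the bottleneck is, as the comment suggests, almost entirely network I/O; the single thread only wakes a worker when its response arrives.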

I'll admit that I've been dragging my feet on implementing any proper parsing of my own, so I don't have any comparison to draw from. (Tried SQLite, but it clogged up my async code too much with blocking ops, and I'm not excited to try a second time yet.)


In practice, maybe 40-50 rps (peaking at 100) for the first design, and 300 rps for the second.

Although I'm serving search engine traffic from the same machine, so I'm trying to leave ample bandwidth for that. If I go too fast the NAT starts dropping packets and refusing connections, and that's not great for crawling or serving.

