
A crawler has two high-level options: parse the page or render the page.

Most of our parser-based crawling is done by Heritrix (crawler.archive.org) and most of our render-based crawling is done by a proxy-based recorder similar to what you theorize (https://github.com/internetarchive/brozzler).
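To make the parse-vs-render distinction concrete, here is a minimal sketch of the parse side using only the Python standard library. It is not how Heritrix or brozzler work internally; the names `LinkExtractor`, `extract_links`, and `SAMPLE` are made up for illustration. It shows why a parser alone misses links that only exist after JavaScript runs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in the static markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse-only link discovery: no scripts are executed."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one static link, one link that only a script would create.
SAMPLE = (
    '<a href="/about">About</a>'
    '<script>var a = document.createElement("a");'
    'a.href = "/hidden"; document.body.appendChild(a);</script>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE, "https://example.org/"))
```

The parser only sees `/about`. Discovering `/hidden` requires actually executing the script, which is the case the render-based approach covers.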



Thanks for sharing. That lets me sleep a bit easier.



