
Let me add: Since the purpose of your bot is to verify links and protect/serve your users, consider removing the links from your site if robots.txt prohibits you from checking them. That's what I would prefer as a webmaster who explicitly set that policy on a site, since I have no control over who posts the links.
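For illustration, a minimal sketch of that robots.txt check, assuming Python's standard-library urllib.robotparser; the bot's user-agent string here is hypothetical:

  from urllib.parse import urlsplit
  from urllib.robotparser import RobotFileParser

  def may_check(url, user_agent="LinkCheckerBot"):
      """Return True if the site's robots.txt allows fetching this URL."""
      parts = urlsplit(url)
      parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
      parser.read()  # fetch and parse the site's robots.txt
      return parser.can_fetch(user_agent, url)

Links for which may_check returns False would be candidates for removal rather than verification, under this policy.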


That doesn't make sense.

The point of blocking a link with robots.txt is to say "Hey, web crawlers, please don't load and index this page". It does not mean "Hey, users, please don't come and load and read this page".

So the script, for all intents and purposes, is just the same as a regular old user clicking the link, reading the page, and keeping a list of which links work and which don't. It's not a crawler; it's an automated user.
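Concretely, such a checker could be a few lines, e.g. this sketch in Python using only the standard library (the example links are hypothetical):

  from urllib.request import Request, urlopen
  from urllib.error import HTTPError, URLError

  def link_is_alive(url):
      """Fetch a link the way a browser would and report whether it loads."""
      req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
      try:
          with urlopen(req, timeout=10) as resp:
              return resp.status < 400  # 2xx/3xx means the page loaded
      except (HTTPError, URLError):
          return False

  posted_links = ["https://example.com/a", "https://example.com/b"]
  dead = [url for url in posted_links if not link_is_alive(url)]

It loads each page exactly once, just as a user following the link would.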

If you are a webmaster who wants to stop people from posting links to your page all around the web and reading it there, make the page return a 403.
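A sketch of that approach, assuming a Python/Flask app (the route is hypothetical); an equivalent rule in the web server's own config would work just as well:

  from flask import Flask, abort

  app = Flask(__name__)

  @app.route("/private-page")
  def private_page():
      # A 403 denies the page to everyone: crawlers and users alike.
      abort(403)

Unlike a robots.txt rule, which is only a request that well-behaved crawlers honor, a 403 is actually enforced by the server.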


So stackoverflow should remove all links to github? Unwise.



