Whilst nice, how is this going to handle the changing nature of the web? It's nice that it detects "lists" and such, but a few changes to the CSS are going to trash that automation, right?
I'm also fairly sure you'll break (either directly, or on a user's behalf) a few EULAs that specifically ban scraping.
This might be true in the USA, but the EU has a thing called database rights[0]. Essentially, any collection of data can, under certain circumstances, be protected under database rights, which prevent other parties from copying (parts of) it. This was originally created to protect such things as phone books and other directories, but when I was a student (I don't remember the context anymore), they specifically warned us that scraping certain websites would violate their database rights, and thus be illegal. So using scrapers in the EU is something you should be very careful with, especially if your business depends on it.
You pedantic piece of... nah, I'm just kidding. Thank you. I actually learned English by watching Clint Eastwood, Charles Bronson and Sylvester Stallone movies, so my grammar might be slightly off from time to time, but Google actually agrees with me when I say: irregardless == regardless.
Ah, so people've been making this mistake for over two hundred years but thanks to people like you, this misuse of language has been all but eradicated?
I tend to remind people who think that this is an error that, although I share their dislike...
- There is a case for the word and it predates us.
- Languages are dynamic and today's "correct" spelling is yesterday's "erroneous" spelling.
I thought until recently that the spelling was "simply incorrect" until I found out there was more to it. It therefore is a reminder to myself as well.