Thursday, November 19, 2009

wget and robots.txt

Well, the webmasters are trying to ward off robots... meanwhile robots are getting smarter and smarter. A natural competition, it seems.

Here's what you do to bypass the "robot police":

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:

wget -r -p -e robots=off http://www.example.com
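
For readability, here is the same command spelled with wget's long options. This is only a sketch: the function name mirror_ignoring_robots is made up for illustration, and the URL is the same placeholder as above. -r is --recursive (follow links), -p is --page-requisites (also fetch the images, CSS and so on that a page needs), and -e is --execute, which runs a command as if it were a line in your .wgetrc.

```shell
# Hypothetical helper (the name is an assumption, not part of wget).
# Wrapped in a function so nothing is downloaded until you call it.
mirror_ignoring_robots() {
    # --recursive          same as -r: follow links and download recursively
    # --page-requisites    same as -p: fetch images, CSS, etc. needed by each page
    # --execute robots=off same as -e robots=off: ignore robots.txt for this run
    wget --recursive --page-requisites --execute robots=off "$1"
}

# Example call, using the placeholder URL from above:
# mirror_ignoring_robots http://www.example.com
```

Since -e just feeds wget a .wgetrc-style command, putting the line robots = off in your ~/.wgetrc should have the same effect for every run, though turning it off per command is probably the safer habit.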


Using wget To Download Entire Websites
courtesy of Jam's Ubuntu Linux Blog.