TL;DR: a server is getting shut off tomorrow and you need to back up its contents? Grab this:

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

Want to know what this does? Read on!

Warning

Remember to follow local regulations regarding web scraping, and don’t download too much at once, or you may accidentally bring down the server.

Also, if you’re downloading websites in order to publish them on the Wayback Machine, don’t. They already have an online archiving tool where you can simply submit the URL you want archived. If that tool rejects your URL, it’s possible the site has been excluded from the Wayback Machine at the request of the site admin. In that case, they probably won’t accept your scraped version either.

Let’s get started!

wget

The tool we’re going to use today is wget. You may have used it before to grab .isos from Ubuntu’s website or source tarballs from project pages. Whatever the case may be, wget is very powerful and can do much more than pull .tar.gzs off the Internet.

So this is probably the most common usage you saw:

wget https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso

And this saves the file Ubuntu-18.04-desktop.iso in your current directory. Not bad, right?
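
As a quick aside, if you’d rather save the file under a different name or into another directory, wget’s -O (--output-document) and -P (--directory-prefix) flags cover that. The file name and directory below are just examples:

wget -O bionic.iso https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso

wget -P ~/Downloads https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso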

But what if you want to go… deeper?

Downloading an entire website

First, we gotta tell wget to download everything. Recursively download everything:

wget --recursive http://example.com
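
One caveat worth flagging: by default wget’s recursion only goes five levels deep, so a deeply nested site may get cut off. If that’s a concern, you can raise or remove the limit with --level (check your version’s man page for the exact default):

wget --recursive --level=inf http://example.com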

All right, not bad, what’s next? Well, what if you’re downloading under a certain URL and you don’t want wget going up to the parent directory? The next parameter solves that:

wget --recursive --no-parent http://example.com
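
To make that concrete: --no-parent only kicks in when your starting URL is below the site root. The /blog/ path here is made up, but the idea is that wget will grab everything under it without wandering up into the rest of example.com:

wget --recursive --no-parent http://example.com/blog/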

Now, what if your download was interrupted? Simple. Continue the download!

wget --recursive --no-parent --continue http://example.com
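
Incidentally, --continue is handy outside recursive downloads too. If that Ubuntu ISO from earlier died halfway through, re-running the download with --continue picks up the partial file instead of starting over:

wget --continue https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso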

What if multiple pages link to the same page? We don’t want to fetch it over and over, so we tell wget to skip files it has already saved:

wget --recursive --no-parent --continue --no-clobber http://example.com

Finally, some websites use robots.txt to tell crawlers what not to touch, but today we want to download everything, so we turn that check off. This makes wget ignore the robots exclusion standard, which would be bad manners for an automated crawler like Google’s Googlebot, but we’re doing a one-off archival download, not continuous crawling.

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

There we go! That’s the final command, the same one at the top of the post.
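
One optional extra, not part of the command above: if you want the downloaded copy to be nicely browsable offline, wget also has --page-requisites (fetch the images, CSS and JS each page needs), --adjust-extension (add .html extensions where appropriate), and --convert-links (rewrite links to point at the local files). A sketch of what that could look like (I’ve left --no-clobber out here, since --convert-links needs to rewrite files that were already saved and some wget versions will complain about the combination):

wget --execute robots=off --recursive --no-parent --continue --page-requisites --adjust-extension --convert-links http://example.com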

Being a good citizen

Now, if you just rampantly download from a website, the server can get overloaded, the site may go down, and everyone will be pissed off. To be a good citizen, throttle your downloads.

Let’s wait a couple seconds before downloading the next page.

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 http://example.com

20 seconds is pretty reasonable between page loads. Let’s restrict wget a bit more. Maybe restrict the bandwidth:

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --limit-rate=50k http://example.com

This caps the bandwidth at 50 KB/s, which is very conservative; --limit-rate also accepts values like 200k or 1m, so bump it up if you’re downloading large files.

Finally, to avoid a perfectly regular request pattern, add some jitter with a random wait (in standard wget this varies the delay between roughly 0.5x and 1.5x of the --wait value):

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k http://example.com
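
One more knob that may be worth knowing about: --quota caps the total amount wget downloads in a recursive run, so a runaway crawl can’t eat your disk or the site’s bandwidth. The 500m value here is just an example:

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k --quota=500m http://example.com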

When websites are terrible

Sometimes the site admins detect scrapers by looking at the User-Agent string and block any non-browser programs from accessing the website. So even when we try to be good citizens with the tips from the section above, it’s just impossible to scrape the site.

Or is it? Fortunately, changing the User Agent string is pretty easy:

wget --user-agent="Mozilla/5.0 Firefox/4.0.1" --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k http://example.com

You can set the string to whatever you want; the (fairly old) Firefox string above is just an example.
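
If you want to double-check what wget is actually sending, one quick way (assuming you don’t mind hitting the third-party httpbin.org service) is to have it echo the User-Agent header back:

wget --user-agent="Mozilla/5.0 Firefox/4.0.1" -qO- https://httpbin.org/user-agent

That prints the header as a small JSON blob, so you can confirm the override took effect.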

So that’s pretty much it on the subject of downloading websites! Hopefully you learned a bit more about wget today and how to use it to download entire websites!