TL;DR: a server is getting shut off tomorrow and you need to back up its contents? Grab this:
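(The URL is a placeholder; point it at the site you want to back up.)

```shell
wget -r -np -c -nc -e robots=off https://example.com/
```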
Want to know what this does? Read on!
Remember to follow local regulations regarding web scraping, and don’t download too much at once, or you might accidentally bring down the server.
Also, if you’re downloading websites in order to publish them on the Wayback Machine, don’t. It already has an online archiving tool where you can simply input the URL you want archived. If that tool rejects your URL, the site may have been blacklisted from the Wayback Machine at the website admin’s request, in which case they probably won’t accept your scraped version either.
Let’s get started!
The tool we’re going to use today is `wget`. You may have used it before to download `.iso` images from Ubuntu’s website, or to grab source tarballs. Whatever the case may be, the tool is very powerful and can do much more than pull `.tar.gz` files off the Internet.
So this is probably the most common usage you saw:
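Something along these lines, with a placeholder URL standing in for the real download link:

```shell
wget https://example.com/Ubuntu-18.04-desktop.iso
```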
And this saves the file `Ubuntu-18.04-desktop.iso` in your current directory. Not bad, right?
But what if you want to go… deeper?
Downloading an entire website
First, we gotta tell `wget` to download everything. Recursively download everything:
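That’s the `-r` (`--recursive`) flag; the URL is a placeholder:

```shell
wget -r https://example.com/
```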
All right, not bad. What’s next? Well, what if you’re downloading under a certain URL and you don’t want `wget` going up to the parent directory? The next parameter solves that:
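That’s `-np` (`--no-parent`); the URL and subdirectory are placeholders:

```shell
wget -r -np https://example.com/subdirectory/
```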
Now, what if your download was interrupted? Simple. Continue the download!
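That’s `-c` (`--continue`), which resumes partially downloaded files instead of restarting them (placeholder URL):

```shell
wget -r -np -c https://example.com/subdirectory/
```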
What if there are multiple links to a single page? We don’t want to download the page multiple times. So we use:
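That’s `-nc` (`--no-clobber`), which skips files that already exist on disk (placeholder URL):

```shell
wget -r -np -c -nc https://example.com/subdirectory/
```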
Finally, some websites use `robots.txt` to stop scrapers. But today we want to download everything, so turn off the robots check. This makes `wget` non-compliant with good scraping etiquette, but in this case we’re not an automated crawler like Google’s Googlebot.
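That’s done with `-e robots=off`, which applies the `robots = off` setting as if it were in your `.wgetrc` (placeholder URL):

```shell
wget -r -np -c -nc -e robots=off https://example.com/subdirectory/
```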
There we go! This is the final command you get, and it is the same one we have at the top of the blog post.
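With a placeholder URL, that’s:

```shell
wget -r -np -c -nc -e robots=off https://example.com/
```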
Being a good citizen
Now, if you just rampantly download from a website, the server will get overloaded and crash, the website will go down, and everyone will be pissed off. To be a good citizen, throttle your downloads.
Let’s wait a while before downloading the next page.
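That’s the `-w` (`--wait`) flag, which takes a number of seconds (placeholder URL):

```shell
wget -r -np -w 20 https://example.com/
```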
20 seconds is pretty reasonable between page loads. Let’s restrict `wget` a bit more. Maybe limit the bandwidth:
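That’s `--limit-rate`, which takes an amount per second; a `k` suffix means kilobytes (placeholder URL):

```shell
wget -r -np -w 20 --limit-rate=50k https://example.com/
```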
This restricts the bandwidth to 50 KB/s, which is very reasonable. If you’re downloading large files, you might want to bump this up.
Finally, to jitter the download, wait randomly:
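That’s `--random-wait`, which varies the pause between requests from 0.5 to 1.5 times the `--wait` value (placeholder URL):

```shell
wget -r -np -w 20 --limit-rate=50k --random-wait https://example.com/
```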
When websites are terrible
Sometimes, network admins detect scrapers by looking at the User-Agent string and block any non-browser programs from accessing the website. So even when we try to be good citizens with the tips from the section above, it’s just impossible to scrape the site.
Or is it? Fortunately, changing the User Agent string is pretty easy:
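That’s the `-U` (`--user-agent`) flag; the Firefox version string below is just an illustration, and the URL is a placeholder:

```shell
wget -r -np --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" https://example.com/
```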
You can change it however you want. Firefox is just used as an example.
So that’s pretty much it on the subject of downloading websites! Hopefully you learned a bit more about `wget` today and how to use it to mirror entire sites!