WARC me up, Scotty

2020-01-17

To say that I’m paranoid about backups is like saying…well, it’s hard to find a point of comparison that’s fulsome enough. I keep everything. I have archives of old macOS installers on a RAID drive at home. Yes, I’m aware you can get them from Apple; the question is how long that’ll last. I have Time Machine backups going back to ~2009.

Why? As another post I’m working on will suggest, it’s about privacy and about control. Privacy in the sense that I don’t feel like it’s possible to trust the platforms that the present age seems to mandate we rely on. And, for the same reasons the good folks at ArchiveTeam have outlined:

Because they don’t give a fuck about your data. Except insofar as they can monetize it.
Because they will delete your wedding photos. This is a reference to Anil Dash’s The Web We Lost, which everyone should watch. Do it. Right now. Then read the rest of this, please.

My interest in preserving my stuff, no matter how worthless it may seem, also dovetails nicely with the web archiving projects I have going at work, e.g., the blog of our longtime provost, Peter Stearns.

Web archiving, as I practice it, is the process of making a snapshot of a website in a format that preserves its accessibility and usability. The folks at ArchiveTeam have done a lot on this front, e.g., their recent project to grab as much of Yahoo! Groups as possible before it was too late. Verizon blocked them, just as they did with Tumblr. Just thought I’d post a little note about this for convenience’s sake.

I use wget to do this because it’s super simple and either baked right in to most Linux distributions or a quick install away on macOS (Homebrew has it). It does have its limitations. These wget commands will produce a WARC file, for which you’ll need a WARC player. You can grab something like Webrecorder or Webrecorder Player to browse the captured site. You can also browse the site files directly: wget will capture the individual files in addition to making the WARC.
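If you want a quick sanity check before firing up a player, you can count the response records in the WARC. This is a minimal sketch, assuming wget’s default compressed output (foo.warc.gz). The name count_responses is made up for illustration and is not anything wget ships.

```shell
#!/bin/sh
# count_responses FILE  (hypothetical helper, not part of wget)
# Prints a rough count of the HTTP responses stored in a gzipped WARC.
count_responses() {
    # -a makes grep treat the decompressed stream as text, even though
    # it contains binary payloads such as images.
    gzip -dc "$1" | grep -a -c '^WARC-Type: response'
}
```

Run it as `count_responses foo.warc.gz`. The number is only approximate, since a captured page could itself contain a line that looks like a WARC header.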

It’ll take however long it takes; obviously, it depends on how big the site is. Something like this site, which is mostly text and built with a static site generator like Hugo, takes less than 30 seconds. Database-driven sites, like those that use WordPress, can be much larger. There’s also the speed of your internet connection and computer to factor into the mix. One recent example: the Washington Metro’s site took about a week and came out to ~36GB. And that was running 24/7 on a 1 Tbps connection on a pretty snappy machine at work.

Using `wget` for web archiving

Generally, you can use:
wget -pkrm --warc-cdx --warc-file=foo -e robots=off https://foo.org

-p collects the prerequisites for the page/site

-k converts the links to make them work locally

-r captures the site recursively
This is especially useful when you have a site with a funky configuration that throws wget off. Not thinking of any Netscape founders/famous devs, JAMIE. Should that happen, pointing wget at foo.org might not work, but pointing it at foo.org/bar/ might.

-m mirrors the site/structure etc.

-e executes a command as if it were in your .wgetrc (here, robots=off tells wget to ignore robots.txt)

--warc-file=foo writes the capture to a WARC file named foo.warc.gz

--warc-cdx creates an index file for the WARC file
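For the record, here is roughly the same command with the short flags spelled out as long options and wrapped in a small shell function. Treat it as a sketch: archive_site and DRY_RUN are names made up for illustration, and since --mirror already turns on recursion, -r doesn’t get a line of its own.

```shell
#!/bin/sh
# archive_site NAME URL  (hypothetical helper, not part of wget)
# Set DRY_RUN=1 to print the command instead of running it.
archive_site() {
    name=$1
    url=$2
    # Long-option spellings of -p, -k, -m and -e from the command above.
    set -- wget \
        --page-requisites \
        --convert-links \
        --mirror \
        --warc-file="$name" \
        --warc-cdx \
        --execute robots=off \
        "$url"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Something like `DRY_RUN=1 archive_site foo https://foo.org` shows what would run without touching the network. Drop the DRY_RUN=1 to do the actual capture.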

You can also add this stuff to your .wgetrc in your home folder.

progress=bar
robots=off
random_wait=on
mirror=on
recursive=on
verbose=on
user_agent=Mozilla
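If you’d rather not change your everyday wget behavior, one option is to keep the archiving settings in a separate file and point wget at it with the WGETRC environment variable. A rough sketch follows. The file name archive.wgetrc is invented for this example, and wait=1 is an addition that isn’t in the list above: as far as I know, random_wait only varies an existing wait value, so it needs one to have any effect.

```shell
# archive.wgetrc is a made-up file name for this sketch.
cat > archive.wgetrc <<'EOF'
robots=off
# Assumption: one second between requests, so random_wait has something to vary.
wait=1
random_wait=on
mirror=on
user_agent=Mozilla
EOF

# To use it for a single run, set WGETRC for that one command:
#   WGETRC=archive.wgetrc wget -pk --warc-cdx --warc-file=foo https://foo.org
```

With mirror=on and robots=off coming from the file, the command line shrinks to the flags that aren’t covered there.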

If you’re using Windows for this, I’ll refrain from giving you the JWZ treatment but I’ve got no fucking clue how to use wget on Windows. Sorry. Maybe switch to something that’s not as fucking creepy as Windows has become.¹

¹ Not sorry even a little tiny bit.
