Building a Personal Memex

2020-01-03

Life in the Devonian Era

I am a data hoarder. I have personal emails going back to 2004. Some papers from undergrad, lots of stuff from my time in Austria, and more stuff besides. This is all to say nothing of my grad school work, thesis, etc. etc. etc. There’s a reason my 8TB NAS only has 1.5TB still free. I also have a lot of current stuff to manage and in a growing number of text and non-text formats.

Until recently, I relied pretty heavily on a 200gb Google Drive for storage of “current” materials. For notes I used Joplin or Markor as appropriate. But my increasing determination to liberate myself from the FAANGS meant that it was time shake things up.¹

So I moved my cloud storage to Tresorit and my email to Fastmail (referral link). But this still left something to be desired.

Enter DevonThink. DevonThink is a personal database system akin to Yojimbo or Symphytum, etc.. Except on steroids and a liter of RedBull.

DevonThink has some powerful tools baked in, e.g., ABBYY FineReader for OCR. It does a really good job indexing and identifying connections between documents and emails, between tags and keywords. It can capture local copies of web pages, RSS feeds, etc. Neat, right?

But there’s something missing: web history.

Send in the [History] Hounds

Web history is a funny thing: we don’t want anyone else to have it (hi Google) but it can be extraordinarily useful for keeping track of things you forgot to bookmark or send to Pinboard or whatever.

History Hound creates a searchable index of your web history across multiple browsers. If you sync this stuff e.g., via Firefox Sync, it gets that stuff too.² It also integrates with NetNewsWire to index your RSS feeds! This is a brilliant idea.

But if you’ve got DevonThink on one hand and HistoryHound on the other, how can you make them talk to one another? Well, a quick email to History Hound’s developer reveals there is a way to get it to dump out your web history index. Hooray! But wait, the next issue is that History Hound’s output is in UTF-16LE and the resulting file doesn’t render well. My first export was after a month of using it and the file was ~50mb of text. Three or so months in and the export was 120mb. Yikes!

So I noticed two things that should’ve been obvious at first, 1. HH exports to HTML and 2. the file encoding is wrong or at least it’s not UTF-8. But there’s a solution.

Bless This Mess

Here’s how to export then cleanup your History Hound index for archiving:

Window>Index Status (or just ⌘3)
Option+click Clear Index
Save the file as foo.html
Running file foo.html should produce something like HTML document text, Little-endian UTF-16 Unicode text, with very long lines
Then iconv -f UTF-16LE -t UTF-8 foo.html >> foo_utf8.html

Et voila! Dump into DevonThink. The resulting file should be ~50% smaller and opening it in DevonThink (or a plain old browser) will give you a delightful and searchable plain-text index you can file away in DevonThink.

The Software Freedom Conservancy, as seen by History Hound

It’s a nice way to make two great pieces of software work together even better.³ And another way to add to my growing Memex.

2020-01-22 Update: UTF-8!

Promise kept.

As noted by @StClairSoft after this was posted a few weeks ago, the new version now uses UTF-8 so you can save a bunch of steps above. Give their stuff a look. It’s good stuff. And indie developers like them need/deserve all the support they can get.

Facebook, Apple, Amazon, Netflix, Google and their ilk. ↩︎
Caution should be taken to use History Hound’s domain filter settings, especially if you sync between a work and personal machine. ↩︎
Other examples include MailMate’s very powerful integration with DT. ↩︎