UNB/ CS/ David Bremner/ blog/ posts/ Using git-annex for email and notmuch metadata

Introducing git-remote-notmuch

Based on an idea and ruby implementation by Felipe Contreras, I have been developing a git remote helper for notmuch. I will soon post an updated version of the patchset to the notmuch mailing list (I wanted to refer to this post in my email). In this blog post I'll outline my experiments with using that tool, along with git-annex to store (and sync) a moderate sized email store along with its notmuch metadata.

WARNING

The rest of this post describes some relatively complex operations using (at best) alpha level software (namely git-remote-notmuch). git-annex is good at not losing your files, but git-remote-notmuch can (and did several times during debugging) wipe out your notmuch database. If you have a backup (e.g. made with notmuch-dump), this is much less annoying, and in particular you can decide to walk away from this whole experiment and restore your database.

Why git-annex?

I currently have about 31GiB of email, spread across more than 830,000 files. I want to maintain the ability to search and read my email offline, so I need to maintain a copy on several workstations and at least one server (which is backed up explicitly). I am somewhat commited to maintaining synchronization of tags to git since that is how the notmuch bug tracker works. Commiting the email files to git seems a bit wasteful: by design notmuch does not modify email files, and even with compression, the extra copy adds a fair amount of overhead (in my case, 17G of git objects, about 57% overhead). It is also notoriously difficult to completely delete files from a git repository. git-annex offers potential mitigation for these two issues, at the cost of a somewhat more complex mental model. The main idea is that instead of committing every version of a file to the git repository, git-annex tracks the filename and metadata, with the file content being stored in a key-value store outside git. Conceptually this is similar to git-lfs. From our current point, the important point is that instead of a second (compressed) copy of the file, we store one copy, along with a symlink and a couple of directory entries.

What to annex

For sufficiently small files, the overhead of a symlink and couple of directory entries is greater than the cost of a compressed second copy. When this happens depends on several variables, and will probably depend on the file content in a particular collection of email. I did a few trials of different settings for annex.largefiles to come to a threshold of largerthan=32k 1. For the curious, my experimental results are below. One potentially surprising aspect is that annexing even a small fraction of the (largest) files yields a big drop in storage overhead.

Threshold fraction annexed overhead
0 100% 30%
8k 29% 13%
16k 12% 9.4%
32k 7% 8.9%
48k 6% 8.9%
100k 3% 9.1%
256k 2% 11%
∞ (git) 0 % 57%

In the end I chose to err on the side of annexing more files (for the flexibility of deletion) rather than potentially faster operations with fewer annexed files at the same level of overhead.

Summarizing the configuration settings for git-annex (some of these are actually defaults, but not in my environment).

$ git config annex.largefiles largerthan=32k
$ git config annex.dotfiles true
$ git config annex.synccontent true

Delivering mail

To get new mail, I do something like

# compute a date based folder under $HOME/Maildir
$ dest = $(folder)
# deliver mail to ${dest} (somehow).
$ notmuch new
$ git -C $HOME/Maildir add ${folder}
$ git -C $HOME/Maildir diff-index --quiet HEAD || git -C $HOME/Maildir commit -m 'mail delivery'

The call to diff-index is just an optimization for the case when nothing was delivered. The default configuration of git-annex will automagically annex any files larger than my threshold. At this point the git-annex repo knows nothing about tags.

There is some git configuration that can speed up the "git add" above, namely

$ git config core.untrackedCache true
$ git config core.fsmonitor true

See git-status(1) under "UNTRACKED FILES AND PERFORMANCE"

Defining notmuch as a git remote

Assuming git-remote-notmuch is somewhere in your path, you can define a remote to connect to the default notmuch database.

$ git remote add database notmuch::
$ git fetch database
$ git merge --allow-unrelated database

The --allow-unrelated should be needed only the first time.

In my case the many small files used to represent the tags (one per message), use a noticeable amount of disk space (in my case about the same amount of space as the xapian database).

Once you start merging from the database to the git repo, you will likely have some conflicts, and most conflict resolution tools leave junk lying around. I added the following .gitignore file to the top level of the repo

*.orig
*~

This prevents our cavalier use of git add from adding these files to our git history (and prevents pushing random junk to the notmuch database.

To push the tags from git to notmuch, you can run

$ git push database master

You might need to run notmuch new first, so that the database knows about all of the messages (currently git-remote-notmuch can't index files, only update metadata).

git annex sync should work with the new remote, but pushing back will be very slow 2. I disable automatic pushing as follows

$ git config remote.database.annex-push false

Unsticking the database remote

If you are debugging git-remote-notmuch, or just unlucky, you may end up in a sitation where git thinks the database is ahead of your git remote. You can delete the database remote (and associated stuff) and re-create it. Although I cannot promise this will never cause problems (because, computers), it will not modify your local copy of the tags in the git repo, nor modify your notmuch database.

$ git remote rm database
$ git update-rf -d notmuch/master
$ rm -r .git/notmuch

Fine tuning notmuch config


  1. I also had to set annex.dotfiles to true, as many of my maildirs follow the qmail style convention of starting with a .
  2. I'm not totally clear on why it so slow, but certainly git-annex tries to push several more branches, and these are ignored by git-remote-annex.