Introducing git-remote-notmuch
Based on an idea and Ruby implementation by Felipe Contreras, I have been developing a git remote helper for notmuch. I will soon post an updated version of the patchset to the notmuch mailing list (I wanted to refer to this post in my email). In this blog post I'll outline my experiments with using that tool, along with git-annex, to store (and sync) a moderately sized email store along with its notmuch metadata.
WARNING
The rest of this post describes some relatively complex operations using (at best) alpha-level software (namely git-remote-notmuch). git-annex is good at not losing your files, but git-remote-notmuch can (and did, several times during debugging) wipe out your notmuch database. If you have a backup (e.g. made with notmuch-dump), this is much less annoying, and in particular you can decide to walk away from this whole experiment and restore your database.
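Before trying any of this, a backup is cheap insurance. A minimal sketch, assuming a default notmuch setup; the dump file path here is a placeholder of my own, not from any particular convention:

```shell
# Save all tags (keyed by message-id) to a compressed dump file.
backup="$HOME/notmuch-tags.dump.gz"
notmuch dump --gzip --output="$backup"

# Later, to roll the tags back (notmuch restore detects the gzip format):
notmuch restore --input="$backup"
```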
Why git-annex?
I currently have about 31GiB of email, spread across more than 830,000 files. I want to maintain the ability to search and read my email offline, so I need to maintain a copy on several workstations and at least one server (which is backed up explicitly). I am somewhat committed to maintaining synchronization of tags to git, since that is how the notmuch bug tracker works. Committing the email files to git seems a bit wasteful: by design notmuch does not modify email files, and even with compression, the extra copy adds a fair amount of overhead (in my case, 17G of git objects, about 57% overhead). It is also notoriously difficult to completely delete files from a git repository. git-annex offers potential mitigation for these two issues, at the cost of a somewhat more complex mental model. The main idea is that instead of committing every version of a file to the git repository, git-annex tracks the filename and metadata, with the file content being stored in a key-value store outside git. Conceptually this is similar to git-lfs. For our purposes, the important point is that instead of a second (compressed) copy of the file, we store one copy, along with a symlink and a couple of directory entries.
What to annex
For sufficiently small files, the overhead of a symlink and a couple of directory entries is greater than the cost of a compressed second copy. Where this crossover happens depends on several variables, and will probably vary with the file content in a particular collection of email. I did a few trials of different settings for annex.largefiles to arrive at a threshold of largerthan=32k [1]. For the curious, my experimental results are below. One potentially surprising aspect is that annexing even a small fraction of the (largest) files yields a big drop in storage overhead.
| Threshold | fraction annexed | overhead |
|---|---|---|
| 0 | 100% | 30% |
| 8k | 29% | 13% |
| 16k | 12% | 9.4% |
| 32k | 7% | 8.9% |
| 48k | 6% | 8.9% |
| 100k | 3% | 9.1% |
| 256k | 2% | 11% |
| ∞ (git) | 0% | 57% |
In the end I chose to err on the side of annexing more files (for the flexibility of deletion) rather than potentially faster operations with fewer annexed files at the same level of overhead.
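For reference, overhead figures like the ones in the table can be approximated by comparing du output for the git object store against the mail itself. A sketch; the two sizes below are made-up stand-ins, not my actual measurements:

```shell
# Sketch: compute storage overhead as (size of .git objects) / (size of mail).
# The sizes are placeholders; in practice they would come from something like
#   du -sk $HOME/Maildir/.git/objects
#   du -sk --exclude=.git $HOME/Maildir
objects_kib=5767168     # placeholder: du -sk output for .git/objects
mail_kib=32505856       # placeholder: du -sk output for the mail files
awk -v o="$objects_kib" -v m="$mail_kib" \
    'BEGIN { printf "overhead: %.1f%%\n", 100 * o / m }'
# prints "overhead: 17.7%" for these placeholder sizes
```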
To summarize, here are the configuration settings for git-annex (some of these are actually defaults, but not in my environment):
$ git config annex.largefiles largerthan=32k
$ git config annex.dotfiles true
$ git config annex.synccontent true
Delivering mail
To get new mail, I do something like
# compute a date-based folder under $HOME/Maildir
$ dest=$(folder)
# deliver mail to ${dest} (somehow).
$ notmuch new
$ git -C $HOME/Maildir add ${dest}
$ git -C $HOME/Maildir diff-index --quiet HEAD || git -C $HOME/Maildir commit -m 'mail delivery'
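As a sketch, the delivery steps can be combined into a single script. Everything here is illustrative and of my own choosing, not from the original setup: a throwaway maildir stands in for $HOME/Maildir, the date-based folder scheme is an assumption, and the "delivered" message is faked with printf:

```shell
#!/bin/sh
set -e
MAILDIR=$(mktemp -d)                     # stand-in for $HOME/Maildir
git -C "$MAILDIR" init -q
git -C "$MAILDIR" config user.email "sketch@example.com"
git -C "$MAILDIR" config user.name "sketch"
git -C "$MAILDIR" commit -q --allow-empty -m 'initial'

dest=$(date +%Y-%m)                      # date-based folder name (assumed scheme)
mkdir -p "$MAILDIR/$dest/cur" "$MAILDIR/$dest/new" "$MAILDIR/$dest/tmp"
printf 'Subject: test\n\nbody\n' > "$MAILDIR/$dest/cur/msg1"   # pretend delivery

# index new mail if notmuch is available (it would need its own configuration)
command -v notmuch >/dev/null 2>&1 && notmuch new || true

git -C "$MAILDIR" add "$dest"
git -C "$MAILDIR" diff-index --quiet HEAD || \
    git -C "$MAILDIR" commit -q -m 'mail delivery'
```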
The call to diff-index is just an optimization for the case when nothing was delivered. The default configuration of git-annex will automagically annex any files larger than my threshold. At this point the git-annex repo knows nothing about tags.
There is some git configuration that can speed up the "git add" above, namely
$ git config core.untrackedCache true
$ git config core.fsmonitor true
See git-status(1) under "UNTRACKED FILES AND PERFORMANCE".
Defining notmuch as a git remote
Assuming git-remote-notmuch is somewhere in your path, you can define a remote to connect to the default notmuch database.
$ git remote add database notmuch::
$ git fetch database
$ git merge --allow-unrelated-histories database

The --allow-unrelated-histories flag should be needed only the first time.
The many small files used to represent the tags (one per message) use a noticeable amount of disk space; in my case, about the same amount of space as the xapian database.
Once you start merging from the database to the git repo, you will likely have some conflicts, and most conflict resolution tools leave junk lying around. I added the following .gitignore file to the top level of the repo:

*.orig
*~

This prevents our cavalier use of git add from adding these files to our git history (and prevents pushing random junk to the notmuch database).
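To see the ignore rules in action, git check-ignore can confirm that conflict-resolution leftovers are excluded. A small sketch in a scratch repo (the file names are illustrative):

```shell
# Sketch: verify that *.orig and *~ leftovers are ignored in a scratch repo.
repo=$(mktemp -d)
git -C "$repo" init -q
printf '*.orig\n*~\n' > "$repo/.gitignore"
touch "$repo/message.orig" "$repo/message~"
git -C "$repo" check-ignore message.orig "message~"   # lists both as ignored
```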
To push the tags from git to notmuch, you can run

$ git push database master

You might need to run notmuch new first, so that the database knows about all of the messages (currently git-remote-notmuch can't index files, only update metadata).
git annex sync should work with the new remote, but pushing back will be very slow [2]. I disable automatic pushing as follows:
$ git config remote.database.annex-push false
Unsticking the database remote
If you are debugging git-remote-notmuch, or just unlucky, you may end up in a situation where git thinks the database is ahead of your git remote. You can delete the database remote (and associated stuff) and re-create it. Although I cannot promise this will never cause problems (because, computers), it will not modify your local copy of the tags in the git repo, nor modify your notmuch database.
$ git remote rm database
$ git update-ref -d notmuch/master
$ rm -r .git/notmuch
Fine tuning notmuch config
In order to avoid dealing with file renames, I have set maildir.synchronize_flags to false:

$ notmuch config set maildir.synchronize_flags false
I have also added the following to new.ignore:

.git;_notmuch_metadata;.gitignore