This feed contains pages with tag "email".
Introducing git-remote-notmuch
Based on an idea and ruby implementation by Felipe Contreras, I have been developing a git remote helper for notmuch. I will soon post an updated version of the patchset to the notmuch mailing list (I wanted to refer to this post in my email). In this blog post I'll outline my experiments with using that tool, along with git-annex to store (and sync) a moderate sized email store along with its notmuch metadata.
WARNING
The rest of this post describes some relatively complex operations
using (at best) alpha level software (namely git-remote-notmuch).
git-annex is good at not losing your files, but git-remote-notmuch
can (and did several times during debugging) wipe out your notmuch
database. If you have a backup (e.g. made with notmuch-dump), this is
much less annoying, and in particular you can decide to walk away from
this whole experiment and restore your database.
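As a minimal sketch of the backup step mentioned above, assuming a notmuch recent enough to support the --gzip option of notmuch dump: the function name backup_notmuch_tags and the default output path are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: save all notmuch tags to a gzip-compressed dump.
# The default file name below is an assumption, not something from this post.
backup_notmuch_tags() {
    out=${1:-$HOME/notmuch-tags-$(date +%Y%m%d).gz}
    # "notmuch dump" writes the tag metadata; it does not copy the mail files.
    notmuch dump --gzip --output="$out"
}

# To walk away from the experiment later, something like the following
# should bring the tags back (notmuch restore detects gzip input):
#   notmuch restore --input="$HOME/notmuch-tags-YYYYMMDD.gz"
```

The dump only covers metadata, so it complements rather than replaces a backup of the mail files themselves.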
Why git-annex?
I currently have about 31GiB of email, spread across more than 830,000
files. I want to maintain the ability to search and read my email
offline, so I need to maintain a copy on several workstations and at
least one server (which is backed up explicitly). I am somewhat
committed to maintaining synchronization of tags to git since that is
how the notmuch bug tracker works. Committing the email files to
git seems a bit wasteful: by design notmuch does not modify email
files, and even with compression, the extra copy adds a fair amount of
overhead (in my case, 17G of git objects, about 57% overhead). It is
also notoriously difficult to completely delete files from a git
repository. git-annex offers potential mitigation for these two
issues, at the cost of a somewhat more complex mental model. The main
idea is that instead of committing every version of a file to the git
repository, git-annex tracks the filename and metadata, with the
file content being stored in a key-value store outside
git. Conceptually this is similar to git-lfs. For our present
purposes, the important point is that instead of a second (compressed)
copy of the file, we store one copy, along with a symlink and a couple
of directory entries.
What to annex
For sufficiently small files, the overhead of a symlink and a couple of
directory entries is greater than the cost of a compressed second
copy. When this happens depends on several variables, and will
probably depend on the file content in a particular collection of
email. I did a few trials of different settings for annex.largefiles
to come to a threshold of largerthan=32k. For the curious, my
experimental results are below. One potentially surprising aspect is
that annexing even a small fraction of the (largest) files yields a
big drop in storage overhead.
Threshold | fraction annexed | overhead |
---|---|---|
0 | 100% | 30% |
8k | 29% | 13% |
16k | 12% | 9.4% |
32k | 7% | 8.9% |
48k | 6% | 8.9% |
100k | 3% | 9.1% |
256k | 2% | 11% |
∞ (git) | 0% | 57% |
In the end I chose to err on the side of annexing more files (for the flexibility of deletion) rather than potentially faster operations with fewer annexed files at the same level of overhead.
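To get a rough feel for the "fraction annexed" column on a different mail store, one could count files above a candidate threshold before touching git-annex at all. The following is only a sketch: annex_fraction is a hypothetical helper, it uses find's notion of "k" (1024-byte units, rounded up), which may not match git-annex's own size parsing exactly, and it assumes no file names contain newlines.

```shell
# Hypothetical helper: print "<large> <total>", where <large> is the number
# of regular files bigger than the given size in KiB and <total> is the
# number of all regular files. Anything under a .git directory is skipped.
annex_fraction() {
    dir=$1
    kib=$2
    total=$(( $(find "$dir" -type f ! -path '*/.git/*' | wc -l) ))
    large=$(( $(find "$dir" -type f ! -path '*/.git/*' -size +"${kib}k" | wc -l) ))
    echo "$large $total"
}

# Example (not run here): annex_fraction "$HOME/Maildir" 32
```

This says nothing about the overhead column, which depends on how well the particular messages compress.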
To summarize, these are the configuration settings for git-annex (some
of these are actually defaults, but not in my environment).
$ git config annex.largefiles largerthan=32k
$ git config annex.dotfiles true
$ git config annex.synccontent true
Delivering mail
To get new mail, I do something like
# compute a date based folder under $HOME/Maildir
$ dest=$(folder)
# deliver mail to ${dest} (somehow).
$ notmuch new
$ git -C $HOME/Maildir add ${dest}
$ git -C $HOME/Maildir diff-index --quiet HEAD ${dest} || git -C $HOME/Maildir commit -m 'mail delivery'
The call to diff-index is just an optimization for the case when
nothing was delivered. The default configuration of git-annex will
automagically annex any files larger than my threshold. At this point
the git-annex repo knows nothing about tags.
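The delivery steps above can be wrapped into small functions, for example to call from a cron job or a delivery hook. This is a sketch under stated assumptions: the per-month layout produced by folder is a guess (the post only says the folder is date based), and commit_delivery is a hypothetical name. The actual delivery and the call to notmuch new are left out.

```shell
# Hypothetical stand-in for the "folder" step: one folder per month,
# named relative to the top of the Maildir (the layout is an assumption).
folder() {
    date +%Y-%m
}

# Stage a delivery folder and commit only if something changed.
# $1 is the top of the Maildir repository, $2 the folder inside it.
commit_delivery() {
    maildir=$1
    dest=$2
    git -C "$maildir" add -- "$dest"
    # diff-index exits non-zero when $dest differs from HEAD (or when
    # there is no HEAD yet), which is when a commit is needed.
    git -C "$maildir" diff-index --quiet HEAD -- "$dest" 2>/dev/null ||
        git -C "$maildir" commit -q -m 'mail delivery'
}

# Example (not run here): commit_delivery "$HOME/Maildir" "$(folder)"
```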
There is some git configuration that can speed up the "git add" above, namely
$ git config core.untrackedCache true
$ git config core.fsmonitor true
See git-status(1) under "UNTRACKED FILES AND PERFORMANCE".
Defining notmuch as a git remote
Assuming git-remote-notmuch is somewhere in your path, you can define
a remote to connect to the default notmuch database.
$ git remote add database notmuch::
$ git fetch database
$ git merge --allow-unrelated-histories database
The --allow-unrelated-histories option should be needed only the first time.
The many small files used to represent the tags (one per message) use a noticeable amount of disk space (in my case about the same amount of space as the xapian database).
Once you start merging from the database to the git repo, you will
likely have some conflicts, and most conflict resolution tools leave
junk lying around. I added the following .gitignore file to the top
level of the repo:
*.orig
*~
This prevents our cavalier use of git add from adding these files to
our git history (and prevents pushing random junk to the notmuch
database).
To push the tags from git to notmuch, you can run
$ git push database master
You might need to run notmuch new first, so that the database knows
about all of the messages (currently git-remote-notmuch can't index
files, only update metadata).
git annex sync should work with the new remote, but pushing back
will be very slow. I disable automatic pushing as follows
$ git config remote.database.annex-push false
Unsticking the database remote
If you are debugging git-remote-notmuch, or just unlucky, you may
end up in a situation where git thinks the database is ahead of your
git remote. You can delete the database remote (and associated stuff)
and re-create it. Although I cannot promise this will never cause
problems (because, computers), it will not modify your local copy of
the tags in the git repo, nor modify your notmuch database.
$ git remote rm database
$ git update-ref -d notmuch/master
$ rm -r .git/notmuch
Fine tuning notmuch config
In order to avoid dealing with file renames, I have the notmuch config item maildir.synchronize_flags set to false.
I have added the following to new.ignore: .git;_notmuch_metadata;.gitignore
1 Background
Apparently motivated by recent phishing attacks against @unb.ca
addresses, UNB's Integrated Technology Services unit (ITS) recently
started adding banners to the body of email messages. Despite
(cough) several requests, they have been unable and/or unwilling to
let people opt out of this. Recently ITS has reduced the size of the
banner; this does not change the substance of what is discussed here.
In this blog post I'll try to document some of the reasons this
reduces the utility of my UNB email account.
2 What do I know about email?
I have been using email since 1985 1. I have administered my own UNIX-like systems since
the mid 1990s. I am a Debian Developer 2.
Debian is a mid-sized organization (there are more Debian Developers
than UNB faculty members) that functions mainly via email (including
discussions and a bug tracker). I maintain a mail user agent
(informally, an email client) called notmuch 3. I administer my own
(non-UNB) email server. I have spent many hours reading RFCs 4.
In summary, my perspective might be different from that of an
enterprise email administrator, but I do know something about the nuts
and bolts of how email works.
3 What's wrong with a helpful message?
3.1 It's a banner ad.
I don't browse the web without an ad-blocker and I don't watch TV with advertising in it. Apparently the main source of advertising in my life is a service provided by my employer. Some readers will probably dispute my description of a warning label inserted by an email provider as "advertising". Note that it is information inserted by a third party to promote their own (well intentioned) agenda, and inserted in an intentionally attention grabbing way. Advertisements from charities are still advertisements. Preventing phishing attacks is important, but so are an almost countless number of priorities of other units of the University. For better or worse those units are not so far able to insert messages into my email. As a thought experiment, imagine inserting a banner into every PDF file stored on UNB servers reminding people of the fiscal year end.
3.2 It makes us look unprofessional.
Because the banner is contained in the body of email messages, it almost inevitably ends up in replies. This lets funding agencies, industrial partners, and potential graduate students know that we consider them as potentially hostile entities. Suggesting that people should edit their replies is not really an acceptable answer, since it suggests that it is acceptable to offload the work of maintaining the previous level of functionality onto each user of the system.
3.3 It doesn't help me
I have an archive of 61270 email messages received since 2003. Of
these 26215 claim to be from a unb.ca address 5. So historically
about 42% of the mail to arrive at my UNB mailbox is internal 6. This means that warnings will occur
in the majority of messages I receive. I think the onus is on the
proposer to show that a warning that occurs in the large majority of
messages will have any useful effect.
3.4 It disrupts my collaboration with open-source projects
Part of my job is to collaborate with various open source projects. A
prominent example is Eclipse OMR 7,
the technological driver for a collaboration with IBM that has brought
millions of dollars of graduate student funding to UNB. Git is now
the dominant version control system for open source projects, and one
popular way of using git is via git-send-email 8.
Adding a banner breaks the delivery of patches by this method. In a previous experiment I did about a month ago, it "only" caused the banner to end up in the git commit message. Those of you familiar with software development will know that this is roughly the equivalent of walking out of the bathroom with toilet paper stuck to your shoe. You'd rather avoid it, but it's not fatal. The current implementation breaks things completely by quoted-printable re-encoding the message. In particular '=' gets transformed to '=3D' like the following
-+ gunichar *decoded=g_utf8_to_ucs4_fast (utf8_str, -1, NULL);
-+ const gunichar *p = decoded;
++ gunichar *decoded=3Dg_utf8_to_ucs4_fast (utf8_str, -1, NULL);
I'm not currently sure if this is a bug in git or some kind of failure in the re-encoding. It would likely require an investment of several hours of time to localize that.
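For readers unfamiliar with the encoding: quoted-printable writes an escaped byte as '=' followed by its two-digit hexadecimal code, and since 3D is the hexadecimal ASCII code of '=' itself, every literal '=' in a patch becomes '=3D'. The sketch below reproduces only that one rule; the function names are hypothetical, and this is not a full quoted-printable codec (it ignores soft line breaks and all other escapes).

```shell
# Print the hexadecimal ASCII code of '=' (a leading quote makes printf
# use the numeric value of the following character).
equals_hex() {
    printf '%X\n' "'="
}

# Hypothetical helpers handling ONLY the '=' rule of quoted-printable.
qp_escape_equals() {
    sed 's/=/=3D/g'
}

qp_unescape_equals() {
    sed 's/=3D/=/g'
}

# Example (not run here):
#   echo 'const gunichar *p = decoded;' | qp_escape_equals
```

Undoing just this substitution by hand might rescue a small patch, but it does not answer where the re-encoding goes wrong.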
3.5 It interferes with the use of cryptography.
Unlike many people, I don't generally read my email on a phone. This
means that I don't rely on the previews that are apparently disrupted
by the presence of a warning banner. On the other hand I do send and
receive OpenPGP signed and encrypted messages. The effects of the
banner on both signed and encrypted messages are similar, so I'll stick
to discussing signed messages here. There are two main ways of signing
a message. The older method, still unfortunately required for some
situations, is called "inline PGP". The signed region is re-encoded,
which causes gpg to issue a warning about a buggy MTA 9, namely
"gpg: quoted printable character in armor - probably a buggy MTA has
been used". This is not exactly confidence
inspiring. The more robust and modern standard is PGP/MIME. Here the
insertion of a banner does not provoke warnings from the cryptography
software, but it does make it much harder to read the message (before
and after screenshots are given below). Perhaps more importantly it
changes the message from one which is entirely signed or encrypted 10,
to one which is partially signed or encrypted. Such messages were
instrumental in the EFAIL exploit 11 and will
probably soon be rejected by modern email clients.

Figure 1: Intended view of PGP/MIME signed message

Figure 2: View with added banner
Footnotes:
On Multics, when I was a high school student
IETF Requests for Comments, which define most of the standards used by email systems.
possibly overcounting some spam as UNB originating email
In case it's not obvious dear reader, communicating with the world outside UNB is part of my job.
Some important projects function exclusively that way. See https://git-send-email.io/ for more information.
Mail Transfer Agent
Created: 2019-05-22 Wed 17:04
The Ottawa Citizen reports on a study showing that
people are less honest, and ruder in email than in person.
The snarkier among us may want to note that the study was restricted to MBA students :-).
For a project involving ikiwiki, I am thinking about the right way (or at least a reasonable way) to encode things into email addresses. At the moment I am leaning towards a hybrid quoted-printable/base64 style approach.
You can have a look at my current implementation of a module Convert::YText to do this conversion.