UNB/ CS/ David Bremner/ tags/ email

This feed contains pages with tag "email".

Introducing git-remote-notmuch

Based on an idea and ruby implementation by Felipe Contreras, I have been developing a git remote helper for notmuch. I will soon post an updated version of the patchset to the notmuch mailing list (I wanted to refer to this post in my email). In this blog post I'll outline my experiments with using that tool, along with git-annex to store (and sync) a moderate sized email store along with its notmuch metadata.

WARNING

The rest of this post describes some relatively complex operations using (at best) alpha level software (namely git-remote-notmuch). git-annex is good at not losing your files, but git-remote-notmuch can (and did several times during debugging) wipe out your notmuch database. If you have a backup (e.g. made with notmuch-dump), this is much less annoying, and in particular you can decide to walk away from this whole experiment and restore your database.

Why git-annex?

I currently have about 31GiB of email, spread across more than 830,000 files. I want to maintain the ability to search and read my email offline, so I need to maintain a copy on several workstations and at least one server (which is backed up explicitly). I am somewhat commited to maintaining synchronization of tags to git since that is how the notmuch bug tracker works. Commiting the email files to git seems a bit wasteful: by design notmuch does not modify email files, and even with compression, the extra copy adds a fair amount of overhead (in my case, 17G of git objects, about 57% overhead). It is also notoriously difficult to completely delete files from a git repository. git-annex offers potential mitigation for these two issues, at the cost of a somewhat more complex mental model. The main idea is that instead of committing every version of a file to the git repository, git-annex tracks the filename and metadata, with the file content being stored in a key-value store outside git. Conceptually this is similar to git-lfs. From our current point, the important point is that instead of a second (compressed) copy of the file, we store one copy, along with a symlink and a couple of directory entries.

What to annex

For sufficiently small files, the overhead of a symlink and couple of directory entries is greater than the cost of a compressed second copy. When this happens depends on several variables, and will probably depend on the file content in a particular collection of email. I did a few trials of different settings for annex.largefiles to come to a threshold of largerthan=32k 1. For the curious, my experimental results are below. One potentially surprising aspect is that annexing even a small fraction of the (largest) files yields a big drop in storage overhead.

Threshold fraction annexed overhead
0 100% 30%
8k 29% 13%
16k 12% 9.4%
32k 7% 8.9%
48k 6% 8.9%
100k 3% 9.1%
256k 2% 11%
∞ (git) 0 % 57%

In the end I chose to err on the side of annexing more files (for the flexibility of deletion) rather than potentially faster operations with fewer annexed files at the same level of overhead.

Summarizing the configuration settings for git-annex (some of these are actually defaults, but not in my environment).

$ git config annex.largefiles largerthan=32k
$ git config annex.dotfiles true
$ git config annex.synccontent true

Delivering mail

To get new mail, I do something like

# compute a date based folder under $HOME/Maildir
$ dest = $(folder)
# deliver mail to ${dest} (somehow).
$ notmuch new
$ git -C $HOME/Maildir add ${folder}
$ git -C $HOME/Maildir diff-index --quiet HEAD ${folder} || git -C $HOME/Maildir commit -m 'mail delivery'

The call to diff-index is just an optimization for the case when nothing was delivered. The default configuration of git-annex will automagically annex any files larger than my threshold. At this point the git-annex repo knows nothing about tags.

There is some git configuration that can speed up the "git add" above, namely

$ git config core.untrackedCache true
$ git config core.fsmonitor true

See git-status(1) under "UNTRACKED FILES AND PERFORMANCE"

Defining notmuch as a git remote

Assuming git-remote-notmuch is somewhere in your path, you can define a remote to connect to the default notmuch database.

$ git remote add database notmuch::
$ git fetch database
$ git merge --allow-unrelated database

The --allow-unrelated should be needed only the first time.

In my case the many small files used to represent the tags (one per message), use a noticeable amount of disk space (in my case about the same amount of space as the xapian database).

Once you start merging from the database to the git repo, you will likely have some conflicts, and most conflict resolution tools leave junk lying around. I added the following .gitignore file to the top level of the repo

*.orig
*~

This prevents our cavalier use of git add from adding these files to our git history (and prevents pushing random junk to the notmuch database.

To push the tags from git to notmuch, you can run

$ git push database master

You might need to run notmuch new first, so that the database knows about all of the messages (currently git-remote-notmuch can't index files, only update metadata).

git annex sync should work with the new remote, but pushing back will be very slow 2. I disable automatic pushing as follows

$ git config remote.database.annex-push false

Unsticking the database remote

If you are debugging git-remote-notmuch, or just unlucky, you may end up in a sitation where git thinks the database is ahead of your git remote. You can delete the database remote (and associated stuff) and re-create it. Although I cannot promise this will never cause problems (because, computers), it will not modify your local copy of the tags in the git repo, nor modify your notmuch database.

$ git remote rm database
$ git update-rf -d notmuch/master
$ rm -r .git/notmuch

Fine tuning notmuch config

  • In order to avoid dealing with file renames, I have

      notmuch config maildir.synchronize_flags false
    
  • I have added the following to new.ignore:

       .git;_notmuch_metadata;.gitignore
    

  1. I also had to set annex.dotfiles to true, as many of my maildirs follow the qmail style convention of starting with a .
  2. I'm not totally clear on why it so slow, but certainly git-annex tries to push several more branches, and these are ignored by git-remote-annex.
Posted Tags: /tags/email

1 Background

Apparently motivated by recent phishing attacks against @unb.ca addresses, UNB's Integrated Technology Services unit (ITS) recently started adding banners to the body of email messages. Despite (cough) several requests, they have been unable and/or unwilling to let people opt out of this. Recently ITS has reduced the size of banner; this does not change the substance of what is discussed here. In this blog post I'll try to document some of the reasons this reduces the utility of my UNB email account.

2 What do I know about email?

I have been using email since 1985 1. I have administered my own UNIX-like systems since the mid 1990s. I am a Debian Developer 2. Debian is a mid-sized organization (there are more Debian Developers than UNB faculty members) that functions mainly via email (including discussions and a bug tracker). I maintain a mail user agent (informally, an email client) called notmuch 3. I administer my own (non-UNB) email server. I have spent many hours reading RFCs 4. In summary, my perspective might be different than an enterprise email adminstrator, but I do know something about the nuts and bolts of how email works.

3 What's wrong with a helpful message?

3.1 It's a banner ad.

I don't browse the web without an ad-blocker and I don't watch TV with advertising in it. Apparently the main source of advertising in my life is a service provided by my employer. Some readers will probably dispute my description of a warning label inserted by an email provider as "advertising". Note that is information inserted by a third party to promote their own (well intentioned) agenda, and inserted in an intentionally attention grabbing way. Advertisements from charities are still advertisements. Preventing phishing attacks is important, but so are an almost countless number of priorities of other units of the University. For better or worse those units are not so far able to insert messages into my email. As a thought experiment, imagine inserting a banner into every PDF file stored on UNB servers reminding people of the fiscal year end.

3.2 It makes us look unprofessional.

Because the banner is contained in the body of email messages, it almost inevitably ends up in replies. This lets funding agencies, industrial partners, and potential graduate students know that we consider them as potentially hostile entities. Suggesting that people should edit their replies is not really an acceptable answer, since it suggests that it is acceptable to download the work of maintaining the previous level of functionality onto each user of the system.

3.3 It doesn't help me

I have an archive of 61270 email messages received since 2003. Of these 26215 claim to be from a unb.ca address 5. So historically about 42% of the mail to arrive at my UNB mailbox is internal 6. This means that warnings will occur in the majority of messages I receive. I think the onus is on the proposer to show that a warning that occurs in the large majority of messages will have any useful effect.

3.4 It disrupts my collaboration with open-source projects

Part of my job is to collaborate with various open source projects. A prominent example is Eclipse OMR 7, the technological driver for a collaboration with IBM that has brought millions of dollars of graduate student funding to UNB. Git is now the dominant version control system for open source projects, and one popular way of using git is via git-send-email 8

Adding a banner breaks the delivery of patches by this method. In the a previous experiment I did about a month ago, it "only" caused the banner to end up in the git commit message. Those of you familiar with software developement will know that this is roughly the equivalent of walking out of the bathroom with toilet paper stuck to your shoe. You'd rather avoid it, but it's not fatal. The current implementation breaks things completely by quoted-printable re-encoding the message. In particular '=' gets transformed to '=3D' like the following

-+    gunichar *decoded=g_utf8_to_ucs4_fast (utf8_str, -1, NULL);
-+    const gunichar *p = decoded;
++    gunichar *decoded=3Dg_utf8_to_ucs4_fast (utf8_str, -1, NULL);

I'm not currently sure if this is a bug in git or some kind of failure in the re-encoding. It would likely require an investment of several hours of time to localize that.

3.5 It interferes with the use of cryptography.

Unlike many people, I don't generally read my email on a phone. This means that I don't rely on the previews that are apparently disrupted by the presence of a warning banner. On the other hand I do send and receive OpenPGP signed and encrypted messages. The effects of the banner on both signed and encrypted messages is similar, so I'll stick to discussing signed messages here. There are two main ways of signing a message. The older method, still unfortunately required for some situations is called "inline PGP". The signed region is re-encoded, which causes gpg to issue a warning about a buggy MTA 9, namely gpg: quoted printable character in armor - probably a buggy MTA has been used. This is not exactly confidence inspiring. The more robust and modern standard is PGP/MIME. Here the insertion of a banner does not provoke warnings from the cryptography software, but it does make it much harder to read the message (before and after screenshots are given below). Perhaps more importantly it changes the message from one which is entirely signed or encrypted 10, to one which is partially signed or encrypted. Such messages were instrumental in the EFAIL exploit 11 and will probably soon be rejected by modern email clients.

 signature-clean.png

Figure 1: Intended view of PGP/MIME signed message

 signature-dirty.png

Figure 2: View with added banner

Footnotes:

1

On Multics, when I was a high school student

4

IETF Requests for Comments, which define most of the standards used by email systems.

5

possibly overcounting some spam as UNB originating email

6

In case it's not obvious dear reader, communicating with the world outside UNB is part of my job.

8

Some important projects function exclusively that way. See https://git-send-email.io/ for more information.

9

Mail Transfer Agent

Author: David Bremner

Created: 2019-05-22 Wed 17:04

Validate

Posted Tags: /tags/email

The Ottawa Citizen reports on a study showing that people are less honest, and ruder in email than in person.
The snarkier among us may want to note that the study was restricted to MBA students :-).

Posted Tags: /tags/email

For a project involving ikiwiki. I am thinking about the right way (or at least a reasonable way) to encode things into email address. At the moment I am leaning towards a hybrid quoted printable/base64 style approach.

You can have a look at my current implementation of a module Convert::YText to do this conversion.

Posted Tags: /tags/email