UNB/ CS/ David Bremner/ blog/ posts/ Converting PDFs to DJVU

Today I was wondering about converting a pdf made from scan of a book into djvu, hopefully to reduce the size, without too much loss of quality. My initial experiments with pdf2djvu were a bit discouraging, so I invested some time building gsdjvu in order to be able to run djvudigital.

Watching the messages from djvudigital I realized that the reason it was achieving so much better compression was that it was using black and white for the foreground layer by default. I also figured out that the default 300dpi looks crappy since my source document is apparently 600dpi.

I then went back an compared djvudigital to pdf2djvu a bit more carefully. My not-very-scientific conclusions:

Perhaps most compellingly, the output from pdf2djvu has sensible metadata and is searchable in evince. Even with the --words option, the output from djvudigital is not. This is possibly related to the error messages like

Can't build /Identity.Unicode /CIDDecoding resource. See gs_ciddc.ps .

It could well be my fault, because building gsdjvu involved guessing at corrections for several errors.

Some of these issues have to do with building software from 2009 (the instructions suggestion building with ghostscript 8.64) in a modern toolchain; others I'm not sure. There was an upload of gsdjvu in February of 2015, somewhat to my surprise. AT&T has more or less crippled the project by licensing it under the CPL, which means binaries are not distributable, hence motivation to fix all the rough edges is minimal.

Version kilobytes per page position in figure
Original PDF 80.9 top
pdf2djvu --dpi=450 92.0 not shown
pdf2djvu --monochrome --dpi=450 27.5 second from top
pdf2djvu --monochrome --dpi=600 --loss-level=50 21.3 second from bottom
djvudigital --dpi=450 29.4 bottom