tags/pdf
David Bremner
by-nc-sa-2.5
Copyright 2020, David Bremner
https://www.cs.unb.ca/~bremner//tags/pdf/
ikiwiki
2015-12-29T23:35:05Z
Converting PDFs to DJVU
https://www.cs.unb.ca/~bremner//blog/posts/pdf-to-djvu/
2015-12-29T23:35:05Z
2015-12-29T16:57:00Z
<p>Today I was wondering about converting a PDF made from a scan of a book
into djvu, hopefully to reduce the size, without too much loss of
quality. My initial experiments with
<a href="http://jwilk.net/software/pdf2djvu">pdf2djvu</a> were a bit
discouraging, so I invested some time building
<a href="http://djvu.sourceforge.net/gsdjvu.html">gsdjvu</a> in order to be able
to run <code>djvudigital</code>.</p>
<p>Watching the messages from <code>djvudigital</code> I realized that the reason it
was achieving so much better compression was that it was using black
and white for the foreground layer by default. I also figured out that
the default 300dpi looks crappy since my source document is apparently
600dpi.</p>
<p>I then went back and compared <code>djvudigital</code> to <code>pdf2djvu</code> a bit more
carefully. My not-very-scientific conclusions:</p>
<ul>
<li>monochrome at higher resolution is better than coloured foreground</li>
<li>higher resolution and (a little) lossy beats lower resolution</li>
<li>at the same resolution, <code>djvudigital</code> gives nicer output, but at the
same bit rate, comparable results are achievable with <code>pdf2djvu</code>.</li>
</ul>
<p>Perhaps most compellingly, the output from <code>pdf2djvu</code> has sensible
metadata and is searchable in evince. Even with the <code>--words</code> option,
the output from <code>djvudigital</code> is not. This is possibly related to
error messages like</p>
<pre><code>Can't build /Identity.Unicode /CIDDecoding resource. See gs_ciddc.ps .
</code></pre>
<p>It could well be my fault, because building <code>gsdjvu</code> involved guessing at corrections for several errors.</p>
<ul>
<li><p>comparing <code>GS_VERSION</code> to 900 doesn't work when <code>GS_VERSION</code> is a five-digit number; <code>GS_REVISION</code> seems to
be what's wanted there.</p></li>
<li><p>deleted an extra declaration of <code>struct timeval</code></p></li>
<li><p>added <code>-lz</code> to the command that builds <code>mkromfs</code></p></li>
</ul>
<p>Some of these issues come from building software from 2009 (the
instructions suggest building with ghostscript 8.64) with a modern
toolchain; others I'm not sure about. There was an upload of <code>gsdjvu</code> in
February of 2015, somewhat to my surprise. AT&T has more or less
crippled the project by licensing it under the CPL, which means
binaries are not distributable, hence motivation to fix all the rough
edges is minimal.</p>
<table>
<thead>
<tr>
<th>Version</th>
<th>Kilobytes per page</th>
<th>Position in figure</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original PDF</td>
<td> 80.9</td>
<td> top</td>
</tr>
<tr>
<td>pdf2djvu --dpi=450</td>
<td> 92.0</td>
<td> not shown</td>
</tr>
<tr>
<td>pdf2djvu --monochrome --dpi=450</td>
<td> 27.5</td>
<td> second from top</td>
</tr>
<tr>
<td>pdf2djvu --monochrome --dpi=600 --loss-level=50</td>
<td> 21.3</td>
<td> second from bottom</td>
</tr>
<tr>
<td>djvudigital --dpi=450</td>
<td> 29.4</td>
<td> bottom</td>
</tr>
</tbody>
</table>
<p><img src="https://www.cs.unb.ca/~bremner//blog/files/djvu-compare.png" alt="djvu-compare.png" /></p>
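<p>For anyone wanting to repeat the comparison, here is a rough sketch of how the per-page sizes could be measured. It is not the procedure used for the table above: the function names and the file names are made up for illustration, and it assumes a kilobyte is 1024 bytes, that <code>pdf2djvu</code> takes its output file via <code>-o</code>, and that <code>djvused -e n</code> prints the page count.</p>

```shell
#!/bin/sh
# Sketch only: helper names (kb_per_page, convert_and_measure) are hypothetical.

# kb_per_page BYTES PAGES
# Print BYTES / 1024 / PAGES with one decimal place (assumes 1 kB = 1024 bytes).
kb_per_page() {
    awk -v bytes="$1" -v pages="$2" \
        'BEGIN { printf "%.1f\n", bytes / 1024 / pages }'
}

# convert_and_measure IN.pdf OUT.djvu [pdf2djvu options...]
# Run pdf2djvu with the given options, then report the options used and
# the resulting size per page.
convert_and_measure() {
    src=$1
    dst=$2
    shift 2
    pdf2djvu "$@" -o "$dst" "$src" || return 1
    bytes=$(wc -c < "$dst")
    pages=$(djvused -e n "$dst")      # "n" prints the number of pages
    printf '%s\t%s kB/page\n' "$*" "$(kb_per_page "$bytes" "$pages")"
}

# Example calls (book.pdf is a placeholder):
# convert_and_measure book.pdf mono450.djvu --monochrome --dpi=450
# convert_and_measure book.pdf lossy600.djvu --monochrome --dpi=600 --loss-level=50
```

<p>Size per page says nothing about legibility, so the output still has to be compared by eye, as in the figure.</p>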
Extracting text from pdf with pdfedit
https://www.cs.unb.ca/~bremner//blog/posts/pdf2text/
2010-11-05T22:31:59Z
2010-10-31T03:49:00Z
<p>It turns out that pdfedit is pretty good at extracting text from pdf
files. Here is a script I wrote to do that in batch mode.</p>
<div class="highlight-sh"><pre class="hl"><span class="hl slc">#!/bin/sh</span>
<span class="hl slc"># Print the text from a pdf document on stdout</span>
<span class="hl slc"># Copyright: (c) 2006-2010 PDFedit team <http://sourceforge.net/projects/pdfedit></span>
<span class="hl slc"># Copyright: (c) 2010, David Bremner <david@tethera.net></span>
<span class="hl slc"># Licensed under version 2 or later of the GNU GPL</span>
<span class="hl kwb">set -e</span>
<span class="hl kwa">if</span> <span class="hl opt">[</span> <span class="hl kwd">$#</span> <span class="hl kwb">-lt</span> <span class="hl num">1</span> <span class="hl opt">];</span> <span class="hl kwa">then</span>
<span class="hl kwb">echo</span> usage<span class="hl opt">:</span> <span class="hl kwd">$0</span> <span class="hl kwc">file</span> <span class="hl opt">[</span>pageSep<span class="hl opt">]</span>
<span class="hl kwb">exit</span> <span class="hl num">1</span>
<span class="hl kwa">fi</span>
<span class="hl opt">/</span>usr<span class="hl opt">/</span>bin<span class="hl opt">/</span>pdfedit <span class="hl kwb">-console -eval</span> <span class="hl sng">'</span>
<span class="hl sng">function onConsoleStart() {</span>
<span class="hl sng"> var inName = takeParameter();</span>
<span class="hl sng"> var pageSep = takeParameter();</span>
<span class="hl sng"> var doc = loadPdf(inName,false);</span>
<span class="hl sng"></span>
<span class="hl sng"> pages=doc.getPageCount();</span>
<span class="hl sng"> for (i=1;i<=pages;i++) {</span>
<span class="hl sng"> pg=doc.getPage(i);</span>
<span class="hl sng"> text=pg.getText(); </span>
<span class="hl sng"> print(text);</span>
<span class="hl sng"> print("</span><span class="hl esc">\n</span><span class="hl sng">");</span>
<span class="hl sng"> print(pageSep);</span>
<span class="hl sng"> }</span>
<span class="hl sng">}</span>
<span class="hl sng">'</span> <span class="hl sng">"</span><span class="hl kwd">$@</span><span class="hl sng">"</span>
</pre></div>
<p>Yeah, I wish <code>#!/usr/bin/pdfedit</code> worked too. Thanks to Aaron M Ucko for pointing out that
<code>-eval</code> could replace the use of a temporary file.</p>
<p>Oh, and pdfedit will be even better when the authors release a new version that fixes <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601910">truncating wide text</a>.</p>
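<p>Since the point of the script is batch mode, here is one way it might be wrapped to process a whole directory. This is only a sketch: <code>PDF2TEXT</code>, <code>./pdf2text</code> and <code>extract_all</code> are names invented here, and it assumes the script above has been saved somewhere executable.</p>

```shell
#!/bin/sh
# Sketch only: PDF2TEXT is a hypothetical location for the script above.
PDF2TEXT=${PDF2TEXT:-./pdf2text}

# extract_all DIR [PAGESEP]
# For each DIR/name.pdf, write the extracted text to DIR/name.txt.
# Any arguments after DIR are passed through to the extractor unchanged.
extract_all() {
    dir=$1
    shift
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # glob matched nothing
        txt=${pdf%.pdf}.txt
        if "$PDF2TEXT" "$pdf" "$@" > "$txt"; then
            echo "wrote $txt"
        else
            echo "failed: $pdf" >&2
            rm -f "$txt"                 # do not leave a partial file behind
        fi
    done
}

# Example: extract_all ~/papers '----'
```

<p>Quoting <code>"$pdf"</code> throughout keeps file names with spaces in one piece.</p>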
filling in forms with pdftk
https://www.cs.unb.ca/~bremner//blog/posts/filling_in_forms_with_pdftk/
2008-07-06T00:02:58Z
2008-01-06T20:09:00Z
<p>So you have a PDF form, and you want to fill it in on Linux. You hate
Acrobat Reader. OK, so all six of you read on.</p>
<p>First install pdftk. If you are using Debian,</p>
<pre><code>apt-get install pdftk
</code></pre>
<p>If you are not using Debian, first install Debian :-).</p>
<p>Now you need a pdf file with form data. We suppose for the sake of
argument that your file is <code>foo.pdf</code>. Try</p>
<pre><code>pdftk foo.pdf dump_data_fields
</code></pre>
<p>Yes, the order of arguments is goofy. You should get some output that
looks like</p>
<pre><code>FieldType: Text
FieldName: M3
FieldFlags: 4194304
FieldJustification: Left
---
FieldType: Text
FieldName: D3
FieldFlags: 4194304
FieldJustification: Left
</code></pre>
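<p>If all you want from that output is the list of names, a small filter will do. A sketch, assuming every name sits on a line starting with <code>FieldName: </code> as in the output above (<code>field_names</code> is just a name picked here):</p>

```shell
#!/bin/sh
# field_names: read dump_data_fields output on stdin and print one
# field name per line. Names containing spaces are kept whole.
field_names() {
    sed -n 's/^FieldName: //p'
}

# Example: pdftk foo.pdf dump_data_fields | field_names
```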
<p>M3 and D3 are your field names.
Now get my <a href="https://www.cs.unb.ca/~bremner/blog/files/fields2pl.pl">script</a>, which can convert this output into something
useful. At this point you may want to reconsider how much you hate
Acrobat. Or investigate Okular. Assuming you are still here, run</p>
<pre><code>pdftk foo.pdf dump_data_fields | perl fields2pl.pl > foo.pl
</code></pre>
<p>This will give you a template that you can fill in. If you have to
fill out the same form many times (e.g. an expense form), save this
template somewhere. Now to fill in your form, you need an <code>FDF</code> file.
One way to make one is to edit the template I made you create above,
and then convert it to <code>FDF</code>. First install the <code>FDF</code> converter.</p>
<pre><code>apt-get install libpdf-fdf-simple-perl
</code></pre>
<p>Now use something like <a href="https://www.cs.unb.ca/~bremner/blog/files/genfdf.pl">genfdf.pl</a> to make an <code>FDF</code> file.</p>
<pre><code>perl genfdf.pl foo.pl > foo.fdf
</code></pre>
<p>You are almost there. To actually fill in the form, you use the
command</p>
<pre><code>pdftk foo.pdf fill_form foo.fdf output filled.pdf
</code></pre>
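<p>The last two steps can be rolled into one shell function if you prefer that to typing them separately. This is a sketch rather than anything tested against real forms: <code>fill_from_template</code> and <code>GENFDF</code> are names made up here, and it assumes <code>genfdf.pl</code> is in the current directory and writes the FDF to stdout as in the command above.</p>

```shell
#!/bin/sh
# Sketch only: GENFDF is a hypothetical variable pointing at genfdf.pl.
GENFDF=${GENFDF:-genfdf.pl}

# fill_from_template FORM.pdf TEMPLATE.pl OUT.pdf
# Convert the edited template to FDF in a temporary file, fill the form,
# and clean up the temporary file whether or not the fill succeeded.
fill_from_template() {
    form=$1
    template=$2
    out=$3
    fdf=$(mktemp) || return 1
    if perl "$GENFDF" "$template" > "$fdf" &&
       pdftk "$form" fill_form "$fdf" output "$out"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$fdf"
    return "$rc"
}

# Example: fill_from_template foo.pdf foo.pl filled.pdf
```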
<p>If you do all this many times, consider making a Makefile. Here is a
fragment:</p>
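<pre><code>.SUFFIXES: .pdf .fdf .csv .gnumeric .pl
# fill the blank form from an FDF file (recipe lines must start with a tab)
.fdf.pdf:
	pdftk Expenses.pdf fill_form $< output $@
# convert an edited template to FDF
.pl.fdf:
	perl genfdf.pl $< > $@
example.pdf: example.fdf
example.fdf: example.pl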
</code></pre>