Thursday, August 7, 2008

A new mysterious strain of PDF

Just to spice up my life and possibly yours too... here's an out-of-the-blue surprise.

At first glance, this looks like a regular PDF file. The file command on a Linux box identifies it as "PDF document, version 1.4". It opens fine in pretty much any PDF reader (Acroread, KPDF, evince, whatever). However, if you decide to print it, it becomes a whole different ballgame.

When it prints, it prints very slowly, and from Acroread 8 it does not print at all. I checked that on both OpenSuSE Linux 10.3 and MS Windows Vista, so I have reason to believe that the problem is most likely OS-agnostic.

It does print using the system print command (lp) on OpenSuSE Linux 10.3. It also prints from Acroread 7 on OpenSuSE Linux, from Acroread 5 on CentOS 5 Linux, and from Preview on MacOS 10. When converted to PostScript via pdftops, it yields a humongous (100+ MB) PostScript file, which is quite impressive given that the source PDF is a 13-page document of less than a megabyte.

If you know what this mystery PDF file is about or have encountered this mutation of PDF yourself - shout, and together we shall prevail!


lornix said...

The PDF was created by Adobe's InDesign CS2. They're actually pretty amazing files internally.

In this particular file (nri2294.pdf), there are:

48 embedded fonts
13 pages
0 embedded images!!

It seems that InDesign CS2 doesn't, in this case at least, embed images in the normal manner. The images seen in the file were likely created using InDesign, thus known to the application and stored/embedded as vectors.

I used the PDF Toolkit (pdftk) to burst the pages into individual PDF files, then converted those files into PostScript. The biggest files (17M, 47M, 76M) correspond to the pages with the biggest images: pages 6, 4 and 2 respectively.
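
For anyone who wants to repeat the burst-and-convert step, here is a rough Python sketch of it. It assumes pdftk and pdftops are installed and on the PATH; the function names, the page_%02d.pdf naming pattern and the work directory are made up for illustration and are not what was actually typed at the time.

```python
import subprocess
from pathlib import Path


def burst_and_convert(pdf_path, work_dir):
    """Split a PDF into one file per page with pdftk, then run pdftops on each page."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    # pdftk's burst operation writes one PDF per page, named by a printf-style pattern.
    # The page_%02d.pdf pattern is an assumption chosen for this sketch.
    subprocess.run(
        ["pdftk", str(pdf_path), "burst", "output", str(work / "page_%02d.pdf")],
        check=True,
    )
    # Convert each single-page PDF to a PostScript file next to it.
    for page in sorted(work.glob("page_*.pdf")):
        subprocess.run(["pdftops", str(page), str(page.with_suffix(".ps"))], check=True)


def largest_files(work_dir, pattern="*.ps", top=3):
    """Return (name, size in bytes) for the biggest matching files, largest first."""
    sizes = [(p.name, p.stat().st_size) for p in Path(work_dir).glob(pattern)]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling burst_and_convert("nri2294.pdf", "pages") and then largest_files("pages") should list the heaviest pages first, which is how the three big pages stand out.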

Examining page 2 (76M) revealed several embedded fonts and over 262 thousand 'groups' of PostScript drawing/positioning commands, each 5 to 8 lines long.
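
One way to get a feel for that volume is to tally which commands dominate a page's PostScript file. The Python sketch below is only a rough heuristic: PostScript is postfix, so it assumes the last token on a line is usually the operator, and it skips comment lines starting with %. The function name and the page_02.ps file name are hypothetical.

```python
from collections import Counter


def trailing_token_counts(lines):
    """Count the last whitespace-separated token on each non-empty, non-comment line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Skip blank lines and PostScript comments (lines whose first token starts with %).
        if tokens and not tokens[0].startswith("%"):
            counts[tokens[-1]] += 1
    return counts


if __name__ == "__main__":
    # page_02.ps is a hypothetical file name for the converted page 2.
    with open("page_02.ps", errors="replace") as fh:
        print(trailing_token_counts(fh).most_common(10))
```

The counts are only approximate, since a line can hold several operators or none, but a few tokens with six-digit counts would match the picture described here.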

This is where the size is coming from...

And the sheer volume of commands may be overwhelming some print pre-processors.

The slowness is due to the fact that almost all 'behind the scenes' print processing on Linux is done in PostScript. If the output device doesn't speak PostScript, the result is then converted to a raster image, which is what gets sent to the printer.


(Grrr, Google/Blogger is being a pain today!)

Boris Epstein said...

Hi Lorni,

Thank you very much... Excellent analysis!

More later.