Posted: Sun Sep 19, 2004 11:00 am Post subject: pdf sizes - observations + a possible bug
I was curious to compare the size of pdf file generated using the Export to PDF function in Ooo on different platforms (or on the same platform in different wrappers, really - I haven't tried it on linux, for example).
I used a 2-sheet spreadsheet that prints as a 4-page landscape pdf file. On my tibook I've got Ooo 1.1.2 running in X11, OroborOSX, NeoOffice/J, and Win98SE (in Virtual PC v6) - all under 10.3.5. The results were interesting:
The last two represent using Print (and then the 'Save as PDF' option in the OSX print dialogue), and NeoJ's File->Export to PDF option, respectively.
The only obvious difference in quality between the resulting pdf files is that the 'NeoJ (print)' version (using OSX-native pdf generation) has nice pale grid lines and is more smoothed - all the rest have black grid lines. Otherwise, they're all pretty much similar to look at.
This experiment raised two questions for me (and these are asked in an intrigued tone of voice, not an accusatory one, by the way!):
1) Why are the X11/OroborOSX files so compact compared to the rest? Is the pdf code really that much more optimised in native unix (if that's what's being used there)?
2) Why is the NeoJ output so bloomin' enormous compared to the others? Is it generating the PDF using java? What's the fundamental difference between it and the others when it comes to exporting PDFs?
I'm asking these questions because I'm really curious about it. If someone who knows the answers to these questions can take the time to explain the issues to me, I'd be grateful - and I'm sure others would be interested too.
Meanwhile, one obvious conclusion is that though I might like best working on my docs in NeoJ, when the time comes to make PDFs to email to my colleagues I need to open the spreadsheet in X11 or OroborOSX Ooo in order to get the size optimisation.
==========
A little experimentation later...
Question 2 was really bothering me - why would a PDF file be so big? So I tried the experiment with a text doc (2 pages of 2-col text) with NeoJ Export-to-PDF, and with X11.
I initially started with just the 2-page text doc, and got files of 500Kb and 50Kb respectively. Then I added a 35Kb gif on the end of the file and tried again - 536Kb and 56Kb respectively. Now it gets interesting - I was wondering if there was some kind of doc size threshold below which the NeoJ PDF would come out a lot smaller - so I started by deleting the gif again - and the file size of the NeoJ PDF stayed the same! Then a penny dropped and I tried knocking all of the Options->Memory settings down as far as they could go, including the number of undo levels. I typed a few words and deleted them again (to clear any big stuff out of undo - you can't set it below 1) and then resaved the doc (still as 2 pages of 2-col text). Then I re-exported to PDF in NeoJ, and the new file was only 184Kb!
So my conclusion here: even though the NeoJ PDF generator still makes bigger files than the other versions do in general, one thing that makes the files REALLY big is if the Ooo file itself has undo data saved in it. This is a bug, I think, and needs looking at. I'd be interested to know whether other people can confirm this behaviour - you'll need an Ooo doc that's already had a number of edits done on it (with the undo levels set high) to be able to tell the difference between one with a lot of undo data and one without. (Actually, I don't know whether the undo data is saved in the doc file itself, or somewhere in Ooo, but that's probably irrelevant, as it's making its way into the bowels of the PDF file somehow in any case.)
Sorry this is such a long post, but the questions I came up with and my conclusions all seemed worth passing on, as well as how I came to them.
Just a bit more info - I tried the same trick with the original spreadsheet, and could only get it down to about 460Kb - it seems that spreadsheets just do make really big PDF files using Neo/J's Export to PDF option.
But the reduced memory/undo thing still made a 40Kb difference to the PDF file...
These are very interesting results. I think there are two different issues that may contribute to larger PDF sizes with Neo/J:
1. Both printing and saving PDF in Neo/J use the native Mac OS X "draw text to PDF" APIs. Like the X11 code, these native APIs embed the font glyphs into the PDF file. Since X11 uses non-native fonts and Neo/J uses native fonts, it is likely that the difference between the X11 and the Neo/J printed PDF is due to differences in font size. By changing the font in your document and reprinting, you will probably see some fluctuation in size.
2. The difference between printed PDF and saved PDF in Neo/J is most likely due to me trying to squeeze the Mac OS X "draw text to PDF" APIs into the OOo code. Mac OS X's APIs make the smallest files when you draw all of your text at once. Unfortunately, the OOo code assumes that you draw text, then later you can create the embedded font glyphs. As a result, the OOo code processes font glyphs word by word, which can cause duplication of the embedded font glyphs. This is the most likely cause of the difference.
My guess is that reducing the number of undo levels causes the OOo code to draw the text in larger chunks which would reduce the number of duplicate embedded font glyphs.
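To make the duplication argument concrete, here is a rough toy model in Python. It is only a sketch: the byte costs are invented numbers, and `embedded_font_bytes` is a hypothetical helper, not anything from the Neo/J or OOo code. It assumes each chunk of text handed to the PDF code gets its own embedded font subset.

```python
# Toy model only: both constants are made-up numbers for illustration,
# not measurements of Neo/J or the Mac OS X PDF APIs.
SUBSET_OVERHEAD = 1000  # hypothetical fixed bytes per embedded font subset
GLYPH_COST = 50         # hypothetical bytes per embedded glyph outline


def embedded_font_bytes(chunks):
    """Estimate font bytes when each chunk gets its own embedded subset."""
    total = 0
    for chunk in chunks:
        glyphs = set(chunk) - {" "}  # distinct non-space characters in this chunk
        if glyphs:
            total += SUBSET_OVERHEAD + GLYPH_COST * len(glyphs)
    return total


text = "the quick brown fox jumps over the lazy dog"
word_by_word = embedded_font_bytes(text.split())  # one subset per word
all_at_once = embedded_font_bytes([text])         # one subset for everything
print(word_by_word, all_at_once)
```

With these assumed costs the word-by-word total is several times the all-at-once total, because common glyphs are embedded again and again and every subset pays the fixed overhead.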
Clearly, the save to PDF code needs to be optimized. Right now it works and appears stable so I will add optimization to my long "to-do" list.
Fascinating - yes, I tried messing around with font choices to replace the Arial I'd used in the spreadsheet, and unexpected things happened: substituting a font I use a lot (URW Palladio: a unicode Palatino clone, stuffed with indic characters for writing in romanised pali, sanskrit etc.) whose font files come in bigger than Arial in the /Library/Fonts folder, produced a PDF file half the size of the same doc using Arial. Weird.
As you say, at least it works at the moment. And there's always the X11 fallback for writing compact PDF files - assuming one has the appropriate fonts on both platforms (URW Palladio is a ttf font, fortunately, and so migrates seamlessly... lucky me).
I have noticed, though, that pagination in Neo/J is a little different from in X11 or Win98 - even using the URW Palladio font (exactly the same font files on both platforms), my spreadsheet breaks page on a different line in Neo/J from in X11. And in calc, there's no option to insert a manual page break - pity. It's too much to hope they'd be exactly the same, innit?
Oh, and by the way, how many posts do I have to make before I ascend from 'sentinal' to 'sentinel'? Or is this some cool reference I'm missing because I didn't see Matrix Revelation?
- yoxi
(Sorry, the proofreaders' gift/curse, making up for a misspelt yuoth...)
Joined: May 31, 2003 Posts: 219 Location: French Alps
Posted: Sun Sep 19, 2004 1:39 pm Post subject: TTrasterize
Your testing confirms my own.
Two cents more:
When I was on the edge testing OOo/X11, I remember there were considerations about the TrueType rasterizer.
On the files generated by OOo/X11, IF the printer you are using is known to be able to rasterize the glyphs, the PDF file is MUCH smaller, since it only includes character codes and not glyphs (the PostScript code to draw the chars). I remember that adding the following code:
Code:
*TTRasterizer: Type42
to my printer PPD did the trick and thinned the PDF a lot.
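If you want to check what a given PPD currently declares, a few lines of Python will do it. This is only a sketch: `tt_rasterizer` is a hypothetical helper name and the sample PPD text is made up for illustration. It assumes the keyword sits at the start of its own line, as in the snippet above.

```python
import re


def tt_rasterizer(ppd_text):
    """Return the value of the *TTRasterizer keyword, or None if it is absent."""
    match = re.search(r"^\*TTRasterizer:\s*(\S+)", ppd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Made-up PPD fragment for illustration only.
sample = '*PPD-Adobe: "4.3"\n*TTRasterizer: Type42\n'
print(tt_rasterizer(sample))
```

To use it on a real file, read the PPD's text and pass it in; where your PPD lives depends on your printing setup.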
Edit: the relevant posts are probably still there, in the OOodocs forum
What is puzzling me is that the OS X API to generate PDF from Java does not use this ability.
Maybe it depends on the data type?
Or is it because this API is designed to optimize effective printing (where the chars must be rasterized anyway) rather than PDF generation?
Have you tried View>Page Break Preview and dragged the blue lines around? Or is that what fluctuates?
Thanks very much - that does exactly what I need it to do, given that the page breaks aren't consistent between platforms - I'd never noticed that menu option before.
Posted: Mon Sep 20, 2004 4:31 am Post subject: Re: pdf sizes - observations + a possible bug
yoxi wrote:
So my conclusion here: even though the NeoJ PDF generator still makes bigger files than the other versions do in general, one thing that makes the files REALLY big is if the Ooo file itself has undo data saved in it. This is a bug, I think, and needs looking at.
I don't know if anyone has confirmed this, but it would be serious. I believe it was with the Hutton Inquiry that the British government decided to switch from Word documents to PDF as their platform for giving out statements and such. The reason was that the Word documents hold a lot of undo / track changes data that made for some pretty embarrassing reading.
Having undo data in the PDF would be bad. From what I gather, the problem is with text section being fragmented in the document because the text is interspersed by undo data. Forking the document and cleaning out this stuff before converting to PDF might do the trick. In any case, is there a way to clean out all undo (and similar) data from documents? If not, I might have to field a feature request, either here or for OOo in general...
Posted: Mon Sep 20, 2004 7:13 am Post subject: Re: pdf sizes - observations + a possible bug
ovvldc wrote:
Having undo data in the PDF would be bad. From what I gather, the problem is with text section being fragmented in the document because the text is interspersed by undo data. Forking the document and cleaning out this stuff before converting to PDF might do the trick. In any case, is there a way to clean out all undo (and similar) data from documents? If not, I might have to field a feature request, either here or for OOo in general...
No undo data is being saved in the PDF document. Instead, the larger files are due to the OOo code passing smaller text strings to be rendered. My code has a known performance weakness: passing it lots of short strings will create lots of small embedded font files. Since each embedded font file carries a fixed overhead (TrueType headers, etc.), the result is a larger file.
What I need to do is to gather up all of the small strings that OOo passes to my code and delay processing the strings until I have a large enough string to make it worth processing.
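As a rough illustration of that "gather and delay" idea, here is a minimal Python sketch. Everything in it is hypothetical: `TextRunBuffer`, `render` and `min_chars` are names invented for the example, and `render` merely stands in for whatever call would embed a font subset and draw the text. It is not the actual Neo/J code.

```python
class TextRunBuffer:
    """Collects short strings and hands them on in larger batches.

    Hypothetical sketch: `render` stands in for the call that embeds a font
    subset and draws text; it is not a real Neo/J or OOo function.
    """

    def __init__(self, render, min_chars=256):
        self.render = render
        self.min_chars = min_chars  # assumed threshold, chosen arbitrarily
        self.pending = []
        self.pending_chars = 0

    def add(self, text):
        """Queue a short string; process the batch once it is large enough."""
        self.pending.append(text)
        self.pending_chars += len(text)
        if self.pending_chars >= self.min_chars:
            self.flush()

    def flush(self):
        """Process whatever is queued, e.g. at the end of a page or document."""
        if self.pending:
            self.render("".join(self.pending))
            self.pending = []
            self.pending_chars = 0


calls = []  # records each batch that would be rendered
buf = TextRunBuffer(calls.append, min_chars=20)
for word in ["pdf ", "export ", "draws ", "text ", "in ", "small ", "pieces"]:
    buf.add(word)
buf.flush()
print(len(calls), calls)
```

Seven small strings end up as two render calls here. A real version would presumably also have to flush whenever the font changes and keep track of where each string is positioned, which this sketch ignores.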
Posted: Mon Sep 20, 2004 1:55 pm Post subject: Re: pdf sizes - observations + a possible bug
pluby wrote:
What I need to do is to gather up all of the small strings that OOo passes to my code and delay processing the strings until I have a large enough string to make it worth processing.
Right, I gathered as much. But is there currently any cleanup facility for OOo documents at all (i.e. throw out all undo and track changes data)? I couldn't find it in 1.1-alpha-10. Some people might find it useful (say, Alastair Campbell).
This is getting to be a problem because of limitations on mail messages and mailbox sizes. Yesterday, I created a PDF on Neo/J of 26 MB. The same file on StarOffice for Linux weighed in at a measly 512 KB.
Posted: Sun Oct 10, 2004 8:32 am Post subject: still here
Well, I tried it out.
SXW document: 132 kB (no pictures)
X11 PDF export: 376 kB initial
X11 PDF export: 372 kB after loading and saving in Acrobat 6.0
NeoOffice/J PDF export: 3.9 MB initial
NeoOffice/J PDF export: 856 kB after loading and saving in Acrobat 6.0
NeoOffice/J print to PDF: failed, got nasty spinning wheel on save dialog
Adobe Acrobat was particularly concerned with duplicate fonts. It took a little while in the X11 version but took minutes on the NeoOffice/J version. Even then, the NeoOffice/J version is over twice the size, but at least I can e-mail it.
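For anyone who wants the ratios spelled out, a quick Python check using the figures listed above. The one assumption is that the initial NeoOffice/J export, reported in MB, is taken as 3900 kB; the variable names are just for this example.

```python
# Sizes in kB as reported above; the NeoOffice/J initial export is
# assumed to be 3900 kB (i.e. 1 MB counted as 1000 kB).
x11_initial, x11_resaved = 376, 372
neo_initial, neo_resaved = 3900, 856

initial_ratio = round(neo_initial / x11_initial, 1)   # Neo/J vs X11 before Acrobat
resaved_ratio = round(neo_resaved / x11_resaved, 1)   # Neo/J vs X11 after Acrobat
acrobat_saving = round(1 - neo_resaved / neo_initial, 2)  # fraction Acrobat removed
print(initial_ratio, resaved_ratio, acrobat_saving)
```

So the Neo/J export starts out roughly ten times the X11 size, and re-saving in Acrobat strips out most of it, leaving it a bit over twice the X11 size.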