Welcome to NeoOffice developer notes and announcements
 
 

This website is an archive and is no longer active
NeoOffice announcements have moved to the NeoOffice News website




  
pdf sizes - observations + a possible bug
 
yoxi
Cipher


Joined: Sep 07, 2004
Posts: 1799
Location: Dawlish, Devon

Posted: Sun Sep 19, 2004 11:00 am    Post subject: pdf sizes - observations + a possible bug

I was curious to compare the sizes of the PDF files generated by the Export to PDF function in OOo on different platforms (or on the same platform in different wrappers, really - I haven't tried it on Linux, for example).

I used a 2-sheet spreadsheet that prints as a 4-page landscape PDF file. On my TiBook I've got OOo 1.1.2 running in X11, OroborOSX, NeoOffice/J, and Win98SE (in Virtual PC v6) - all under Mac OS X 10.3.5. The results were interesting:

X11 - 36 KB
OroborOSX - 36 KB
Win98 - 104 KB
NeoJ (print) - 124 KB
NeoJ (export) - 508 KB

The last two represent using Print (and then the 'Save as PDF' option in the OSX print dialogue), and NeoJ's File->Export to PDF option, respectively.
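
To put the figures above in proportion, here is a small sketch that divides each reported size by the X11 size. The numbers are the ones quoted in the post; the function name `ratios` and the choice of X11 as the baseline are only illustrative.

```python
# Sizes in KB as reported in the post above.
sizes_kb = {
    "X11": 36,
    "OroborOSX": 36,
    "Win98": 104,
    "NeoJ (print)": 124,
    "NeoJ (export)": 508,
}


def ratios(sizes, baseline="X11"):
    """Return each size divided by the baseline size, rounded to one decimal."""
    base = sizes[baseline]
    return {name: round(size / base, 1) for name, size in sizes.items()}


size_ratios = ratios(sizes_kb)
for name, ratio in size_ratios.items():
    print(f"{name:14s} {sizes_kb[name]:4d} KB  {ratio:5.1f}x")
```

On these numbers the NeoJ export comes out roughly 14 times the size of the X11 file, and about 4 times the size of the NeoJ print-to-PDF file.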

The only obvious difference in quality between the resulting pdf files is that the 'NeoJ (print)' version (using OSX-native pdf generation) has nice pale grid lines and is more smoothed - all the rest have black grid lines. Otherwise, they're all pretty much similar to look at.

This experiment raised two questions for me (and these are asked in an intrigued tone of voice, not an accusatory one, by the way!):

1) Why are the X11/OroborOSX files so compact compared to the rest? Is the pdf code really that much more optimised in native unix (if that's what's being used there)?

2) Why is the NeoJ output so bloomin' enormous compared to the others? Is it generating the PDF using java? What's the fundamental difference between it and the others when it comes to exporting PDFs?

I'm asking these questions because I'm really curious about it. If someone who knows the answers to these questions can take the time to explain the issues to me, I'd be grateful - and I'm sure others would be interested too.

Meanwhile, one obvious conclusion is that though I might like best working on my docs in NeoJ, when the time comes to make PDFs to email to my colleagues I need to open the spreadsheet in X11 or OroborOSX Ooo in order to get the size optimisation.

==========

A little experimentation later...

Question 2 was really bothering me - why would a PDF file be so big? So I tried the experiment with a text doc (2 pages of 2-col text) with NeoJ Export-to-PDF, and with X11.

I started with just the 2-page text doc, and got files of 500 KB (NeoJ) and 50 KB (X11). Then I added a 35 KB GIF on the end of the file and tried again - 536 KB and 56 KB respectively. Now it gets interesting - I was wondering if there was some kind of doc size threshold below which the NeoJ PDF would come out a lot smaller - so I started by deleting the GIF again - and the file size of the NeoJ PDF stayed the same!

Then the penny dropped and I tried knocking all of the Options->Memory settings down as far as they could go, including the number of undo levels. I typed a few words and deleted them again (to clear any big stuff out of undo - you can't set it below 1) and then resaved the doc (still as 2 pages of 2-col text). Then I re-exported to PDF in NeoJ, and the new file was only 184 KB!

So my conclusion here: even though the NeoJ PDF generator still makes bigger files than the other versions do in general, one thing that makes the files REALLY big is if the OOo file itself has undo data saved in it. This is a bug, I think, and needs looking at. I'd be interested to know whether other people can confirm this behaviour - you'll need an OOo doc that's already had a number of edits done on it (with the undo levels set high) to be able to tell the difference between one with a lot of undo data and one without. (Actually, I don't know whether the undo data is saved in the doc file itself, or somewhere in OOo, but that's probably irrelevant, as it's making its way into the bowels of the PDF file somehow in any case.)

Sorry this is such a long post, but the questions I came up with and my conclusions all seemed worth passing on, as well as how I came to them.

- yoxi

yoxi
Cipher

Joined: Sep 07, 2004
Posts: 1799
Location: Dawlish, Devon

Posted: Sun Sep 19, 2004 11:32 am

Just a bit more info - I tried the same trick with the original spreadsheet, and could only get it down to about 460 KB - it seems that spreadsheets just do make really big PDF files using Neo/J's Export to PDF option.

But the reduced memory/undo thing still made a 40 KB difference to the PDF file... ;-)

- yoxi

pluby
The Architect

Joined: Jun 16, 2003
Posts: 11949

Posted: Sun Sep 19, 2004 12:08 pm

These are very interesting results. I think there are two different issues that may contribute to larger PDF sizes with Neo/J:

1. Both printing and saving PDF in Neo/J use the native Mac OS X "draw text to PDF" APIs. Like the X11 code, these native APIs embed the font glyphs into the PDF file. Since X11 uses non-native fonts and Neo/J uses native fonts, it is likely that the difference between the X11 and the Neo/J printed PDF is due to differences in the size of the fonts being embedded. By changing the font in your document and reprinting, you will probably see some fluctuation in size.

2. The difference between printed PDF and saved PDF in Neo/J is most likely due to me trying to squeeze the Mac OS X "draw text to PDF" APIs into the OOo code. Mac OS X's APIs make the smallest files when you draw all of your text at once. Unfortunately, the OOo code assumes that you draw text first and create the embedded font glyphs later. As a result, the OOo code processes font glyphs word by word, which can cause duplication of the embedded font glyphs. This is the most likely cause of the difference.

My guess is that reducing the number of undo levels causes the OOo code to draw the text in larger chunks which would reduce the number of duplicate embedded font glyphs.
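
As a rough illustration of the effect described in point 2, the toy model below compares embedding one font subset per word against embedding a single subset for all of the text. The byte counts (`HEADER_BYTES`, `GLYPH_BYTES`) are invented for the sake of the example and are not measurements of Mac OS X or Neo/J; the function name is likewise hypothetical.

```python
# Toy model only: the byte counts below are made-up assumptions, not
# measurements of Mac OS X or NeoOffice/J.
HEADER_BYTES = 2000   # assumed fixed cost per embedded font subset
GLYPH_BYTES = 150     # assumed cost per distinct glyph in a subset


def embedded_bytes(chunks):
    """Total font bytes if every chunk of text embeds its own font subset."""
    total = 0
    for chunk in chunks:
        distinct = set(chunk) - {" "}   # treat spaces as needing no outline
        if distinct:
            total += HEADER_BYTES + GLYPH_BYTES * len(distinct)
    return total


text = "the quick brown fox jumps over the lazy dog " * 20
word_by_word = embedded_bytes(text.split())   # one subset per word
all_at_once = embedded_bytes([text])          # one subset for everything
print(word_by_word, all_at_once)
```

Under these assumptions the word-by-word total is dominated by the repeated per-subset header cost, which is consistent with the guess that drawing text in larger chunks shrinks the file.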

Clearly, the save to PDF code needs to be optimized. Right now it works and appears stable so I will add optimization to my long "to-do" list.

Patrick

yoxi
Cipher

Joined: Sep 07, 2004
Posts: 1799
Location: Dawlish, Devon

Posted: Sun Sep 19, 2004 12:48 pm

Fascinating - yes, I tried messing around with font choices to replace the Arial I'd used in the spreadsheet, and unexpected things happened: substituting a font I use a lot (URW Palladio: a unicode Palatino clone, stuffed with indic characters for writing in romanised pali, sanskrit etc.) whose font files come in bigger than Arial in the /Library/Fonts folder, produced a PDF file half the size of the same doc using Arial. Weird.

As you say, at least it works at the moment. And there's always the X11 fallback for writing compact PDF files - assuming one has the appropriate fonts on both platforms (URW Palladio is a ttf font, fortunately, and so migrates seamlessly... lucky me).

I have noticed, though, that pagination in Neo/J is a little different from in X11 or Win98 - even using the URW Palladio font (exactly the same font files on both platforms), my spreadsheet breaks page on a different line in Neo/J from in X11. And in Calc, there's no option to insert a manual page break - pity. It's too much to hope they'd be exactly the same, innit? :-(

- yoxi

yoxi
Cipher

Joined: Sep 07, 2004
Posts: 1799
Location: Dawlish, Devon

Posted: Sun Sep 19, 2004 1:03 pm

Oh, and by the way, how many posts do I have to make before I ascend from 'sentinal' to 'sentinel'? Or is this some cool reference I'm missing because I didn't see Matrix Revelation? ;-)

- yoxi

(Sorry, the proofreaders' gift/curse, making up for a misspelt yuoth...)

Max_Barel
Oracle

Joined: May 31, 2003
Posts: 219
Location: French Alps

Posted: Sun Sep 19, 2004 1:39 pm    Post subject: TTrasterize

Your testing confirms my own.
Two cents more:
When I was on the edge testing OOo/X11, I remember there was some discussion about the TrueType rasterizer.
In the files generated by OOo/X11, IF the printer you are using is known to be able to rasterize the glyphs, the PDF file is MUCH smaller since it only includes character codes and not glyphs (the PostScript code to draw the chars). I remember that adding the following code:
Code:
*TTRasterizer: Type42
to my printer PPD did the trick and thinned the PDF a lot.
Edit: the relevant posts are probably still there, in the OOodocs forum
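
For anyone wanting to try the PPD change above without hand-editing, here is a minimal sketch that adds the `*TTRasterizer: Type42` line to a PPD if no `*TTRasterizer` entry is present. The function names are made up for illustration, the location of your printer's PPD is not given in this thread and has to be supplied by you, and placing the line straight after the first (header) line is an assumption rather than a requirement.

```python
from pathlib import Path

TT_LINE = "*TTRasterizer: Type42"


def add_tt_rasterizer(ppd_text: str) -> str:
    """Return ppd_text with a *TTRasterizer line, unless one already exists."""
    lines = ppd_text.splitlines()
    if any(line.startswith("*TTRasterizer:") for line in lines):
        return ppd_text                      # leave an existing setting alone
    # Assumption: insert after the first line (the "*PPD-Adobe:" header).
    lines.insert(1 if lines else 0, TT_LINE)
    return "\n".join(lines) + "\n"


def patch_ppd(path: str) -> None:
    """Patch the PPD at `path`, keeping a .bak copy of the original."""
    ppd = Path(path)
    original = ppd.read_text(encoding="latin-1")
    ppd.with_name(ppd.name + ".bak").write_text(original, encoding="latin-1")
    ppd.write_text(add_tt_rasterizer(original), encoding="latin-1")
```

Running `add_tt_rasterizer` a second time on already-patched text returns it unchanged, so the patch is safe to repeat.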

What puzzles me is that the OS X API for generating PDF from Java does not use this ability.
Maybe it depends on the data type?
Or is it because this API is designed to optimize actual printing (where the chars must be rasterized anyway) rather than PDF generation?

sardisson
Town Crier

Joined: Feb 01, 2004
Posts: 4588

Posted: Sun Sep 19, 2004 2:47 pm

This answer really doesn't fit with this thread, but the question was there, so...

yoxi wrote:
And in calc, there's no option to insert a manual page break


Have you tried View>Page Break Preview and dragged the blue lines around? Or is that what fluctuates?

Smokey

yoxi
Cipher

Joined: Sep 07, 2004
Posts: 1799
Location: Dawlish, Devon

Posted: Sun Sep 19, 2004 11:58 pm

sardisson wrote:
Have you tried View>Page Break Preview and dragged the blue lines around? Or is that what fluctuates?

Thanks very much - that does exactly what I need it to do, given that the page breaks aren't consistent between platforms - I'd never noticed that menu option before.

Forums are great, aren't they?

- yoxi

ovvldc
Captain Naiobi

Joined: Sep 13, 2004
Posts: 2352
Location: Zürich, CH

Posted: Mon Sep 20, 2004 4:31 am    Post subject: Re: pdf sizes - observations + a possible bug

yoxi wrote:
So my conclusion here: even though the NeoJ PDF generator still makes bigger files than the other versions do in general, one thing that makes the files REALLY big is if the OOo file itself has undo data saved in it. This is a bug, I think, and needs looking at.


I don't know if anyone has confirmed this, but it would be serious. I believe it was with the Hutton Inquiry that the British government decided to switch from Word documents to PDF as their platform for giving out statements and such. The reason was that the Word documents hold a lot of undo / track changes data that made for some pretty embarrassing reading.

Having undo data in the PDF would be bad. From what I gather, the problem is with text sections being fragmented in the document because the text is interspersed with undo data. Forking the document and cleaning out this stuff before converting to PDF might do the trick. In any case, is there a way to clean out all undo (and similar) data from documents? If not, I might have to file a feature request, either here or for OOo in general...

pluby
The Architect

Joined: Jun 16, 2003
Posts: 11949

Posted: Mon Sep 20, 2004 7:13 am    Post subject: Re: pdf sizes - observations + a possible bug

ovvldc wrote:
Having undo data in the PDF would be bad. From what I gather, the problem is with text sections being fragmented in the document because the text is interspersed with undo data. Forking the document and cleaning out this stuff before converting to PDF might do the trick. In any case, is there a way to clean out all undo (and similar) data from documents? If not, I might have to file a feature request, either here or for OOo in general...


No undo data is being saved in the PDF document. Instead, the larger files are due to the OOo code passing smaller text strings to be rendered. My code has a known inefficiency: passing it lots of short strings will create lots of small embedded font files. Since each embedded font file has a fixed overhead (TrueType headers, etc.), the result is a larger file.

What I need to do is to gather up all of the small strings that OOo passes to my code and delay processing the strings until I have a large enough string to make it worth processing.
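
The "gather up small strings and delay processing" idea could look something like the sketch below. It is only an illustration of the batching pattern, not Neo/J code: `TextBatcher`, its `render` callback and the `min_chars` threshold are all hypothetical names, and a real version would also have to remember where each string is drawn on the page, which this sketch ignores.

```python
class TextBatcher:
    """Collects small strings and hands them on in larger chunks.

    `render` stands in for whatever call embeds glyphs for a string; it is a
    hypothetical callback, not a NeoOffice/J or Mac OS X function.
    """

    def __init__(self, render, min_chars=4096):
        self._render = render
        self._min_chars = min_chars   # assumed threshold; would need tuning
        self._pending = []
        self._pending_chars = 0

    def draw(self, text):
        """Queue a string; render only once enough text has accumulated."""
        self._pending.append(text)
        self._pending_chars += len(text)
        if self._pending_chars >= self._min_chars:
            self.flush()

    def flush(self):
        """Render whatever is queued (call at end of page or document)."""
        if self._pending:
            self._render("".join(self._pending))
            self._pending = []
            self._pending_chars = 0


calls = []
batcher = TextBatcher(calls.append, min_chars=10)
for word in ["pdf", "sizes", "are", "too", "big"]:
    batcher.draw(word)
batcher.flush()
print(calls)   # five draw() calls end up as two render calls
```

With the threshold set to 10 characters, the five words are rendered in two batches instead of five, so the fixed per-font overhead would be paid twice rather than five times.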

Patrick

ovvldc
Captain Naiobi

Joined: Sep 13, 2004
Posts: 2352
Location: Zürich, CH

Posted: Mon Sep 20, 2004 1:55 pm    Post subject: Re: pdf sizes - observations + a possible bug

pluby wrote:
What I need to do is to gather up all of the small strings that OOo passes to my code and delay processing the strings until I have a large enough string to make it worth processing.


Right, I gathered as much. But is there currently any cleanup facility for OOo documents at all (i.e. throw out all undo and track changes data)? I couldn't find it in 1.1-alpha-10. Some people might find it useful (say, Alastair Campbell).

jimlaurent
Captain

Joined: Jun 23, 2003
Posts: 55

Posted: Tue Sep 21, 2004 5:04 am

This is getting to be a problem because of limitations on mail messages and mailbox sizes. Yesterday, I created a PDF on Neo/J of 26 MB. The same file on StarOffice for Linux weighed in at a measly 512 KB.

As a result, I was unable to mail it from my Mac.

ovvldc
Captain Naiobi

Joined: Sep 13, 2004
Posts: 2352
Location: Zürich, CH

Posted: Sun Oct 10, 2004 8:32 am    Post subject: still here

Well, I tried it out.

SXW document: 132 kB (no pictures)
X11 PDF export: 376 kB initial
X11 PDF export: 372 kB after loading and saving in Acrobat 6.0
NeoOffice/J PDF export: 3.9 MB initial
NeoOffice/J PDF export: 856 kB after loading and saving in Acrobat 6.0
NeoOffice/J print to PDF: failed, got nasty spinning wheel on save dialog

Adobe Acrobat was particularly concerned with duplicate fonts. It took a little while on the X11 version but took minutes on the NeoOffice/J version. Even then, the NeoOffice/J version is over twice the size, but at least I can e-mail it ;-)
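
For a rough check of the duplicate-font theory without Acrobat, one could count how many embedded font programs a PDF refers to. The sketch below simply scans the raw bytes for the `/FontFile`, `/FontFile2` and `/FontFile3` keys used in PDF font descriptors. It assumes those dictionaries are stored uncompressed in the file, which may not hold for every PDF, and the function names are invented for this example.

```python
import re

# /FontFile, /FontFile2 and /FontFile3 are the PDF font-descriptor keys that
# point at embedded font programs. The lookahead rejects longer names.
FONT_FILE_KEY = re.compile(rb"/FontFile[23]?(?![0-9A-Za-z])")


def count_embedded_fonts(pdf_bytes: bytes) -> int:
    """Count font-program references found in the raw bytes of a PDF."""
    return len(FONT_FILE_KEY.findall(pdf_bytes))


def count_in_file(path: str) -> int:
    """Same count, reading the PDF from disk."""
    with open(path, "rb") as handle:
        return count_embedded_fonts(handle.read())
```

Comparing `count_in_file` for the X11 export and the Neo/J export of the same document should show whether the Neo/J file really carries many more embedded font subsets.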

All times are GMT - 7 Hours

 

All logos and trademarks in this site are property of their respective owner. The comments are property of their posters, all the rest © Planamesa Inc.
NeoOffice is a registered trademark of Planamesa Inc. and may not be used without permission.