Friday, March 02, 2007

Focus on the logic behind a file-format, not on the container used

Diving into the Office Open XML file-format, and in the same run into all the other MS Office file-formats (if one abstracts away the container, the content is the same), one realizes how many different ways there are to store the same thing. Unless it was designed with interoperability as a principal aim, a file-format's logic nicely reveals how the application that produces it operates. Whether it is a binary file, a compressed structured document composed of several XML streams, or an RTF-like file makes little difference (apart from the difference in human-readability). And it is exactly the difference in the underlying logic that makes it impossible to reach an optimal translation result with a descriptive tool like XSL transformations. Let's enjoy together the main differences between the MS Office, WordPerfect and ODT/HTML file-formats:

MS Office

The main idea behind the MS Office word processing file-formats (.doc, .rtf, .docx and other XML-based attempts) is that the properties of a node (paragraph, section, ...) are "stored" in the "character" that marks the end of the node. That is, the paragraph properties are whatever properties the paragraph break mark finds set, and they belong to the paragraph whose end this break marks. From the point of view of the document flow, with some nuances, the paragraph properties are applied retroactively to the current paragraph. Section properties work in a similar way.
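To make the "properties arrive at the end" idea concrete, here is a minimal Python sketch. The token stream, the `para_mark` token name and the `Paragraph` class are all made up for illustration; they do not correspond to any real MS Office record or to any existing parser API.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    props: dict = field(default_factory=dict)


def read_end_marked(tokens):
    """Read a hypothetical stream of ("text", str) and ("para_mark", dict) tokens.

    The properties of a paragraph are only known once its end mark is reached,
    so the text has to be buffered until then.
    """
    paragraphs = []
    pending = []
    for kind, value in tokens:
        if kind == "text":
            pending.append(value)
        elif kind == "para_mark":
            # The mark closes the paragraph and, retroactively, describes it.
            paragraphs.append(Paragraph("".join(pending), dict(value)))
            pending = []
    # Text after the last mark is ignored in this sketch.
    return paragraphs


tokens = [
    ("text", "Hello "), ("text", "world"), ("para_mark", {"align": "center"}),
    ("text", "Second"), ("para_mark", {"align": "left"}),
]
paragraphs = read_end_marked(tokens)
```

Note that nothing can be said about the alignment of "Hello world" until the third token has been read.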

WordPerfect

The WordPerfect file-formats also reflect well the conceptual model underlying the application: the typewriter. As in the MS file-formats, a WP file can have paragraph or page properties defined anywhere in the document, and the codes are applied as soon as the formatter is able to do so. Nevertheless (with the exception of headers/footers, which in certain WP file-formats appear on the same page as the code that defines them), they are never applied retroactively. The properties cross the boundaries of hierarchically superior elements. For example, if one sets a direct "bold" code somewhere in the document, this code stays in effect until another code cancels it; the number of page, column, ... breaks between the two does not matter. If one changes the paragraph alignment in the middle of a paragraph, for instance, newer versions of WordPerfect will add a "temporary hard return" code before the alignment code to hint that a paragraph break is necessary in order to render the change. Older versions of WordPerfect (still using the same file-format), however, leave it to the formatter/layout engine itself to realize this.
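A small Python sketch of this typewriter-like behaviour, under the assumption of a simplified code stream. The code names (`bold_on`, `bold_off`, `page_break`) are invented for the example and are not real WordPerfect codes.

```python
def read_stateful(codes):
    """Walk a hypothetical WP-like code stream and return (text, is_bold) runs.

    The formatting state lives outside any paragraph or page: a code changes it
    and it stays changed until another code cancels it.
    """
    bold = False
    runs = []
    for code in codes:
        kind = code[0]
        if kind == "bold_on":
            bold = True
        elif kind == "bold_off":
            bold = False
        elif kind == "text":
            runs.append((code[1], bold))
        # Any break (page, column, ...) is deliberately ignored here:
        # it does not reset the state.
    return runs


codes = [
    ("text", "a"), ("bold_on",), ("text", "b"),
    ("page_break",),
    ("text", "c"), ("bold_off",), ("text", "d"),
]
runs = read_stateful(codes)
```

In this run "c" is still bold although a page break sits between it and the code that switched bold on.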

A direct result of the typewriter conceptual model in WordPerfect is the set of special tabs that completely disregard the tab-table and themselves carry the information about their position and alignment (hard tabs, center on margins, flush right, ...). A reader who has not yet got lost in this entry will appreciate how nice it is to translate these beasts into a file-format in which all tab properties are defined in a tab-table that has to be known, at the latest, at the beginning of the paragraph.
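A very rough Python sketch of why this hurts: the whole paragraph has to be scanned before its tab-table can be written. The item layout and the `hard_tab` tuple are assumptions for the example, and the sketch only handles hard tabs with an explicit position; it ignores the really nasty cases (center on margins, flush right, a hard tab clashing with what an ordinary tab would have hit).

```python
def build_tab_table(base_table, paragraph_items):
    """Merge self-describing hard tabs into a paragraph-level tab-table.

    base_table: list of (position, alignment) stops inherited by the paragraph.
    paragraph_items: text chunks (str) mixed with hypothetical
                     ("hard_tab", position, alignment) tuples.
    Returns the stops sorted by position.
    """
    stops = dict(base_table)  # position -> alignment
    for item in paragraph_items:
        if isinstance(item, tuple) and item[0] == "hard_tab":
            _, position, alignment = item
            # A hard tab at an existing position overrides that stop; this is
            # a simplification and may change how ordinary tabs behave.
            stops[position] = alignment
    return sorted(stops.items())


base = [(1.0, "left"), (2.0, "left")]
items = ["Name", ("hard_tab", 1.5, "center"), "Value"]
table = build_tab_table(base, items)
```

The point is only that `table` cannot be produced before the last item of the paragraph has been seen.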

ODT/HTML

These file-formats have the particularity that the information about an element is enclosed in the element's opening tag. The alignment, indentation or spacing of a given paragraph is known at the beginning of the paragraph, or better said, at the moment the paragraph is opened. The same goes for the tab-table in those file-formats where it makes sense (a tab-table makes no sense in the HTML file-format, although some extensions try to implement it in some way).
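For comparison, a minimal sketch using Python's standard `html.parser` module. The `ParagraphProps` class and its `events` list are just names chosen for this example; the point is that the attributes are handed over with the opening tag, before any content.

```python
from html.parser import HTMLParser


class ParagraphProps(HTMLParser):
    """Record, in order, what the parser tells us about <p> elements."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # All properties of the paragraph are available right here.
            self.events.append(("open", dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data))

    def handle_endtag(self, tag):
        if tag == "p":
            self.events.append(("close", None))


parser = ParagraphProps()
parser.feed('<p align="center">Hello</p>')
parser.close()
events = parser.events
```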

Document conversion

Given the fact that a word processing document is read sequentially, the conversion between these different formats can be quite a pain in some posterior place of one's body. For instance, if you want to convert directly from an MS Office file-format to HTML/ODT, you will have to parse the document twice: first to collect the styles of the nodes (which become manifest at the end of the corresponding node) and second to parse the content of the nodes. For a proper conversion of WordPerfect "special" tabs and headers/footers, one has to proceed in a similar way, besides some deep magic to incorporate the position and alignment of the "special" tabs into the tab-table of the given paragraph.
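The two-pass approach can be sketched in Python roughly like this. The token stream (`("text", str)` and `("para_mark", dict)`) is a made-up stand-in for an end-marked format, and the output is a deliberately tiny HTML subset; none of it is taken from a real converter.

```python
from html import escape


def open_tag(props):
    """Build the opening tag from properties that must be known up front."""
    return '<p style="text-align: {}">'.format(escape(props.get("align", "left")))


def convert_two_pass(tokens):
    tokens = list(tokens)  # the input must be replayable for the second pass

    # Pass 1: collect the properties, which sit at the END of each paragraph.
    collected = [value for kind, value in tokens if kind == "para_mark"]

    # Pass 2: emit content, opening each paragraph with its collected properties.
    out = []
    index = 0
    is_open = False
    for kind, value in tokens:
        if index >= len(collected):
            break  # trailing text without a closing mark is dropped in this sketch
        if not is_open:
            out.append(open_tag(collected[index]))
            is_open = True
        if kind == "text":
            out.append(escape(value))
        elif kind == "para_mark":
            out.append("</p>")
            is_open = False
            index += 1
    return "".join(out)


tokens = [
    ("text", "Hello"), ("para_mark", {"align": "center"}),
    ("text", "Bye"), ("para_mark", {"align": "right"}),
]
html_out = convert_two_pass(tokens)
```

The first pass exists for no other reason than that the target format wants to know at the opening tag what the source format only tells at the end.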

New OpenOffice Writer filter API

As I already mentioned, the direct conversion is likely to be a pain in some posterior place, unless...

...an import filter does not need to do the whole conversion and can feed the information into a structure where one can modify the properties of a given paragraph/section/text-run aka text-span during the content parsing. And this is exactly what the new filter API that is being implemented in the writerfilter2 CWS is designed for. Stay tuned for some more exciting adventures concerning the file-formats!

I should add that the libwpd API gives the "node" properties in the openFoo callbacks, and it is more or less trivial to plug this logic into the new Writer API logic, because it is much easier to have some information early and keep it for later than the other way round.
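To illustrate the general idea only, here is a toy Python sketch of a structure whose current paragraph can still be modified while its content is being parsed. `DocumentSink` and its methods are hypothetical names invented for this entry; this is neither the writerfilter API nor the libwpd API.

```python
class DocumentSink:
    """Toy document model: the current paragraph stays editable until it ends."""

    def __init__(self):
        self.paragraphs = []
        self._start_new()

    def _start_new(self):
        self._current = {"text": "", "props": {}}

    def text(self, chunk):
        self._current["text"] += chunk

    def set_paragraph_props(self, props):
        # May be called at any time before end_paragraph().
        self._current["props"].update(props)

    def end_paragraph(self):
        self.paragraphs.append(self._current)
        self._start_new()


sink = DocumentSink()

# End-marked source (MS Office style): the properties show up after the text.
sink.text("Hello")
sink.set_paragraph_props({"align": "center"})
sink.end_paragraph()

# Open-callback source (libwpd style): the properties show up before the text.
sink.set_paragraph_props({"align": "right"})
sink.text("Bye")
sink.end_paragraph()
```

Both orders end up in the same kind of result, and neither needs a second pass over the input.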

Work @ Novell

I owe thanks to the Lord God Almighty for being merciful and gracious with me. He extended his hand over me and my family and accompanied us through the desert of the year 2006. He blessed us abundantly and exceedingly above all we could ever hope for. I am writing this because I want to give all the honour and all the glory to The One Who deserves it in the first place.

I also want to thank the Novell OpenOffice.org Team for having considered me with favour for a position of Software Engineer working on interoperability issues between MS Office and OpenOffice.org. I am excited to see my hobby activity becoming my regular job. I cannot thank you enough, guys, for giving me an opportunity to work full-time on this great FOSS project.

I also want to give my thanks to other people who were ready to help me throughout 2006: Sophie Gauthier, the lead of OOo French native language project, Mathias Bauer, former framework and current sw lead (and friend), Thorsten Behrens, the gsl co-lead (and a great person and friend), Marc "uwog" Maurer and Dom Lachowitz, AbiWord admins and fine hackers (and helpful friends), and to all those people from OpenOffice.org, AbiWord and libwp* projects that were very supportive in action and in kind words: Rob, Sum1 (really SomeOne), Will, Andrew, Ariya, Pavel, Caolan, Heiner, Stefan, Eric and all those whose names I might have forgotten to put down here. Thanks, good people, for your willingness to put your hand in the fire for a TrainedMonkey and to help and encourage him. I really appreciate it!

Last, but in no way least, I want to thank particularly Michael Meeks, Distinguished Engineer and "hacker extraordinaire", for injecting in me the incurable virus of OpenOffice.org. He helped me to enter the community. Thanks, dudie, for guiding and supporting me and for being a friend in spite of my big mouth!

I am looking forward to advancing the cause of the Free Desktop inside a company that believes in and puts into practice what it preaches.