Friday, March 02, 2007

Focus on the logic behind a file-format, not on the container used

Diving into the Office Open XML file-format, and in the same run into all the other MS Office file-formats (if one abstracts away the container, the content is the same), one realizes how many different ways there are to store the same thing. Unless it was designed with interoperability as a principal aim, a file-format's logic nicely reveals the way the application that produces it operates. Whether it is a binary file, a compressed structured document composed of several XML streams, or an RTF-like file does not make much difference (apart from the difference in human-readability). And it is exactly the difference in the underlying logic that makes it impossible to reach an optimal result in translation by using a descriptive tool like XSL transformations. Let's enjoy together the main differences between the MS Office, WordPerfect and ODT/HTML file-formats:

MS Office

The main idea behind the MS Office word processing file-formats (.doc, .rtf, .docx and other XML-based attempts) is that the node (paragraph, section, ...) properties are "stored" in the "character" that marks the end of the node. I.e., the paragraph properties are all the properties that the paragraph break mark finds set, and they belong to the paragraph whose end this paragraph break marks. From the point of view of the document flow, with some nuances, the paragraph properties are applied retroactively to the current paragraph. The same holds for the section properties.
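To make the "properties live in the end mark" idea a bit more tangible, here is a minimal sketch in Python. The token stream, the `Paragraph` class and the `read_end_marked` function are all invented for illustration; none of the real MS Office formats looks like this on disk. The only point is that the reader cannot know a paragraph's properties until it reaches the mark that closes it.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    props: dict = field(default_factory=dict)


def read_end_marked(tokens):
    """Read a hypothetical stream of ("text", str) and ("para_end", dict) tokens.

    The properties arrive only with the paragraph end mark, so the text has
    to be buffered and the properties attached retroactively.
    """
    paragraphs = []
    buffer = []
    for kind, value in tokens:
        if kind == "text":
            buffer.append(value)
        elif kind == "para_end":
            # Only now do we learn how the buffered text should be formatted.
            paragraphs.append(Paragraph("".join(buffer), dict(value)))
            buffer = []
    return paragraphs


tokens = [
    ("text", "Hello "),
    ("text", "world"),
    ("para_end", {"align": "center"}),
    ("text", "Next"),
    ("para_end", {"align": "left"}),
]
paras = read_end_marked(tokens)
```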

WordPerfect

The WordPerfect file-formats also reflect well the underlying conceptual model of the application: the typewriter. As in the MS file-formats, a WP file can have the paragraph or page properties defined anywhere in the document, and the codes are applied as soon as the formatter is able to do it. Nevertheless (with the exception of headers/footers, which in certain WP file-formats appear on the same page as the code that defines them), they are never applied retroactively. The properties cross the boundaries of hierarchically superior elements. I.e., if one sets a direct "bold" code somewhere in the document, this code stays valid and in effect until another code cancels it. The number of page, column, ... breaks between the two does not matter. If one changes the paragraph alignment in the middle of a paragraph, for instance, newer versions of WordPerfect will add a "temporary hard return" code before the alignment code to hint that a paragraph break is necessary in order to render this change. Older versions of WordPerfect (still using the same file-format), however, leave the task of realizing this fact to the formatter/layout engine itself.
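The typewriter model can be sketched as a single piece of running state that no break ever resets. The code names below ("bold_on", "page_break", ...) and the `apply_wp_codes` function are made up for this entry and are not the actual WordPerfect codes; the sketch only shows that a code stays in effect, across any number of breaks, until another code cancels it.

```python
def apply_wp_codes(codes):
    """Walk a hypothetical WordPerfect-like code stream.

    Each code is a tuple whose first element is its kind. Returns a list of
    (text, is_bold) runs.
    """
    bold = False
    runs = []
    for code in codes:
        kind = code[0]
        if kind == "bold_on":
            bold = True
        elif kind == "bold_off":
            bold = False
        elif kind == "text":
            runs.append((code[1], bold))
        # "hard_return", "page_break", "column_break", ... are deliberately
        # ignored here: they do not touch the running state.
    return runs


codes = [
    ("bold_on",),
    ("text", "before the break"),
    ("page_break",),
    ("text", "after the break"),
    ("bold_off",),
    ("text", "plain again"),
]
runs = apply_wp_codes(codes)
```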

A direct result of the typewriter conceptual model in WordPerfect is the set of special tabs that completely disregard the tab-table and themselves carry the information about their position and alignment (hard tabs, center on margins, flush right, ...). A reader who has not yet got lost in this entry will appreciate how nice it is to translate these beasts into a file-format in which all tab properties are defined in a tab-table that has to be known, at the latest, at the beginning of the paragraph.
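A deliberately naive sketch of that translation: buffer the whole paragraph, collect the stops that the special tabs carry, and merge them into the tab-table before the paragraph is opened in the target format. The item shapes and the `merge_special_tabs` name are assumptions of this sketch. It also ignores the hard part, namely that an ordinary stop sitting between the current text position and the special tab's position would catch the generic tab first, so take it as an illustration of the buffering and nothing more.

```python
def merge_special_tabs(tab_table, paragraph_items):
    """Fold self-describing tabs into a paragraph-level tab-table.

    tab_table: list of (position, alignment) pairs.
    paragraph_items: ("text", str) or ("special_tab", position, alignment).
    Returns the merged, sorted tab-table and the content with every special
    tab replaced by a generic ("tab",) item.
    """
    stops = dict(tab_table)  # position -> alignment
    content = []
    for item in paragraph_items:
        if item[0] == "special_tab":
            _, position, alignment = item
            stops[position] = alignment  # the special tab wins on a clash
            content.append(("tab",))
        else:
            content.append(item)
    return sorted(stops.items()), content


table, content = merge_special_tabs(
    [(1.0, "left"), (3.0, "left")],
    [("text", "a"), ("special_tab", 2.5, "right"), ("text", "b")],
)
```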

ODT/HTML

These file-formats have the particularity that the information about an element is enclosed in the element's opening tag. The alignment, indentation or spacing of a given paragraph is known at the beginning of the paragraph, or better said, at the moment the paragraph is opened. The same goes for the tab-table in those file-formats where it makes sense (a tab-table makes no sense in the HTML file-format, although some extensions try to implement it in some way).

Document conversion

Given the fact that a word processing document is read sequentially, the conversion between these different formats can be quite a pain in some posterior place of one's body. For instance, if you want to convert directly from an MS Office file-format to HTML/ODT, you will have to parse the document twice: first to collect the styles of the nodes (which become manifest only at the end of the corresponding node) and second to parse the content of the nodes. For a proper conversion of WordPerfect "special" tabs and headers/footers, one has to proceed in a similar way, on top of some deep magic to incorporate the "special" tabs' position and alignment into the tab-table of the given paragraph.
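The two passes could look roughly like this. The input is an invented stream of ("text", str) and ("para_end", dict) tokens in which, as in the MS Office formats, the properties show up only at the paragraph end; the output is a bit of HTML with the properties in the opening tag. The function names and the mapping of properties to a style attribute are assumptions made for the sketch.

```python
from html import escape


def open_tag(props):
    """Build a <p> opening tag from a dict of CSS-like properties."""
    style = "; ".join(f"{key}: {value}" for key, value in sorted(props.items()))
    return f'<p style="{escape(style)}">' if style else "<p>"


def convert(tokens):
    tokens = list(tokens)
    # Pass 1: collect the properties, one dict per paragraph end mark.
    props_list = [value for kind, value in tokens if kind == "para_end"]

    # Pass 2: emit the content, now that the properties are known up front.
    out = []
    index = 0
    is_open = False
    for kind, value in tokens:
        if not is_open:
            props = props_list[index] if index < len(props_list) else {}
            out.append(open_tag(props))
            is_open = True
        if kind == "text":
            out.append(escape(value))
        elif kind == "para_end":
            out.append("</p>")
            is_open = False
            index += 1
    if is_open:
        # Trailing text without an end mark: close it with no properties.
        out.append("</p>")
    return "".join(out)


html_out = convert([
    ("text", "Hi"),
    ("para_end", {"text-align": "center"}),
    ("text", "a < b"),
    ("para_end", {}),
])
```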

New OpenOffice Writer filter API

As I already mentioned, the direct conversion is likely to be a pain in some posterior place, unless...

...an import filter does not need to do the whole conversion and can feed the information into a structure where one can modify the properties of a given paragraph/section/text-run aka text-span during the content parsing. And this is exactly what the new filter API that is being implemented in the writerfilter2 CWS is designed for. Stay tuned for some more exciting adventures concerning the file-formats!

Just to add that the libwpd API gives the "node" properties in the openFoo callbacks, and it is more or less trivial to plug this logic into the new Writer API logic, because it is much easier to have some information early and keep it for later than the other way around.
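To illustrate why such a structure helps, here is a toy model in Python. It is emphatically not the writerfilter2 API nor the libwpd one; `DocumentModel` and its methods are hypothetical names chosen for this entry. The point is only that once the properties of the current paragraph stay modifiable during content parsing, an importer that learns them late and an importer that learns them early can both feed the same structure in a single pass.

```python
class DocumentModel:
    """Hypothetical target structure with late-modifiable paragraph properties."""

    def __init__(self):
        self.paragraphs = []

    def open_paragraph(self, props=None):
        self.paragraphs.append({"props": dict(props or {}), "text": ""})

    def add_text(self, text):
        self.paragraphs[-1]["text"] += text

    def set_paragraph_props(self, props):
        # Allowed at any time while the paragraph is the current one.
        self.paragraphs[-1]["props"].update(props)


model = DocumentModel()

# An end-marked source: the properties are known only after the text.
model.open_paragraph()
model.add_text("properties arrive late")
model.set_paragraph_props({"align": "center"})

# A source with openFoo-style callbacks: the properties come first.
model.open_paragraph({"align": "right"})
model.add_text("properties arrive early")
```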