Thursday, October 18, 2007

ITO, Sweet ITO

I love this! I misunderstood a bit what the ITO was, and at a certain point I realized that I had accumulated quite a load of it. So I had to spend it somehow. Here is an account of what it was useful for:

Finishing off what was started in Barcelona

Naturally, if you speak to me about fun hacking, the image of libwpd or libwpg will always come to my little brain. As I blogged earlier, the first thing I used some time for was to finish off the work on tabulator conversion for the WP 2.x - 3.5e for Mac and WP 5.x for DOS/Windows file formats. It was started in Barcelona, and the principle of release early, release often dictated that it should not sleep on my disk for a long time. I even got a very positive echo from the overwhelming part of our Mac OSX user community after the 0.8.12 release (Yes, Smokey, you are about 87.49% of our active Mac OSX community :-)). So, if you have not done so yet, throw yourselves on the binaries and sources while they are still available.

Buffered stream implementation

An old proverb says that one should try not to put off useful people. This was also one of the reasons why I deprecated our libgsf-based stream implementation in the latest releases of libwpd. Understand me well: libgsf is a fine piece of software and is doing a lot of good to the world. Nevertheless, since libwpd is used by the KOffice people along with some other office suites, I did not want to keep it depending on platform-specific libraries, all the more so because the functionality that we use from these libraries is really very, very tiny. So we used a hacked version of Ariya's pole (one header and one source file) to provide two additional WPXInputStream implementations that do not depend on anything besides a working STL implementation.

So far, so good, but... As soon as the C++ streams (as I call them) became the default for the internal libwpd tools, we came to the point where the above-mentioned proverb was going to be put to the test again. Suddenly, Sum1's QA run started to take several days instead of the single night it took with the libgsf-based stream implementation. And putting off Sum1 is something that every hacker with some rudiments of a sane mind should avoid at all costs.

By the same token, some other performance issues were reported from other contexts, and the fact that libwpd reads at most 4 bytes in a single read started to show as a real problem. So, some time ago, inspired by Kendy's intent to implement read-ahead in WPXSvStream, I hacked something similar into the sample WPXFileStream. But there was a problem. Very probably a hideous bug in libstdc++ causes, in some specific cases, a backward seek by 2 bytes followed by a read of 1kB to result in the badbit being set. This happened mainly on x86 with gcc 4.x (x86_64 was working well). Even x86 with gcc 3.4.x worked well, as did our distinguished competitor's Visual Studio compilers. Since I did not have much time back then to look into it, I left the idea of buffered reads hanging.
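
To make the read-ahead idea concrete, here is a minimal sketch of a buffering wrapper around a std::istream. It is not the actual WPXFileStream code: the class name ReadAheadReader, its members, the helper drainInFourByteReads and the 1kB default block size are all hypothetical and only illustrate how many tiny reads can be served from one larger read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical read-ahead wrapper, NOT the real WPXFileStream implementation.
class ReadAheadReader
{
public:
    explicit ReadAheadReader(std::istream &in, std::size_t blockSize = 1024)
        : m_in(in), m_buffer(blockSize), m_pos(0), m_avail(0), m_refills(0)
    {
        assert(blockSize > 0);
    }

    // Copies up to numBytes into out; returns the number of bytes delivered.
    std::size_t read(unsigned char *out, std::size_t numBytes)
    {
        std::size_t done = 0;
        while (done < numBytes)
        {
            if (m_pos == m_avail && !refill())
                break; // the underlying stream is exhausted
            const std::size_t left = m_avail - m_pos;
            const std::size_t chunk = (numBytes - done < left) ? numBytes - done : left;
            std::memcpy(out + done, &m_buffer[m_pos], chunk);
            m_pos += chunk;
            done += chunk;
        }
        return done;
    }

    // Number of times the underlying stream delivered a block.
    std::size_t refills() const { return m_refills; }

private:
    // Fetches the next block from the underlying stream in one read.
    bool refill()
    {
        m_in.read(reinterpret_cast<char *>(&m_buffer[0]),
                  static_cast<std::streamsize>(m_buffer.size()));
        m_avail = static_cast<std::size_t>(m_in.gcount());
        m_pos = 0;
        if (m_avail > 0)
            ++m_refills;
        return m_avail > 0;
    }

    std::istream &m_in;
    std::vector<unsigned char> m_buffer;
    std::size_t m_pos, m_avail, m_refills;
};

// Demo helper: drains the reader with 4-byte reads and returns the byte total.
std::size_t drainInFourByteReads(ReadAheadReader &reader)
{
    unsigned char tmp[4];
    std::size_t total = 0, n = 0;
    while ((n = reader.read(tmp, sizeof tmp)) > 0)
        total += n;
    return total;
}
```

With a 1kB block, draining a 2500-byte stream four bytes at a time touches the underlying stream only a handful of times instead of several hundred.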

Now, during my ITO, I gave it a quick second shot, and by sheer luck I found a workaround that does not really penalize performance and makes the buffered stream work well. The solution is simple: before the read, seek to the end of the block that we want to read and then back to the position where we want to start reading. With this incantation, the subsequent read no longer sets any badbit, and all is for the best in the best of all possible worlds. Nonetheless, if a libstdc++ hacker is interested in investigating this bug, I have a historical version of the stream with a sample document that always triggers the above-mentioned behaviour.
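
The seek-forward-then-back incantation could look roughly like the sketch below. The function name readBlock and its signature are hypothetical, and clamping the block to the stream size is an assumption of this sketch rather than something taken from the libwpd sources.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads up to blockSize bytes starting at offset start.
// Returns the number of bytes placed into buffer.
std::size_t readBlock(std::istream &in, std::streamoff start,
                      std::size_t blockSize, std::vector<char> &buffer)
{
    assert(start >= 0);
    assert(blockSize > 0);
    buffer.clear();
    in.clear();

    // Assumption of this sketch: clamp the block to the size of the stream.
    in.seekg(0, std::ios::end);
    const std::streamoff size = static_cast<std::streamoff>(in.tellg());
    if (size < 0 || start >= size)
        return 0;
    const std::streamoff wanted = static_cast<std::streamoff>(blockSize);
    const std::streamoff end = (size - start < wanted) ? size : start + wanted;

    in.seekg(end, std::ios::beg);   // 1. seek to the end of the block...
    in.seekg(start, std::ios::beg); // 2. ...then back to where reading starts
    buffer.resize(static_cast<std::size_t>(end - start));
    in.read(&buffer[0], static_cast<std::streamsize>(end - start)); // 3. real read
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer.size();
}
```

The extra two seeks are cheap compared to the read itself, which is consistent with the workaround not really costing anything in performance.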

Positioned objects in WP6+ and ODF reference implementation woes

The main task I had been prepared to accomplish ever since our dear hackweek, and which was only waiting for a long enough stretch of contiguous free time, was to get the size, the position and the anchoring of positioned objects (images, text boxes, ...) right. The problem was in multiple places. First of all, the way the box information is encoded in WP6+ files is only remotely connected to my idea of fun hacking. The box/frame style is good old binary-encoded information (if bit 15 is set, the following information is available... etc.). The code that inserts the box/frame can override the style, and whether it does so or not is again given by some bits of a number at the beginning of the code. Add to that an acquired mistrust of the Corel-provided documentation, and I was not really keen to spend my weekends on this. Nevertheless, some recent user demands and the vision of a well-used ITO made me bite the bullet. And surprisingly, the main difficulty was not where I expected it to be. This time the documentation was right (at least for the information that I currently use). This time it was the tool I had always trusted the most, wplook.exe from the WordPerfect Office 11 SDK, that got some things wrong and segfaulted on accessing the box/frame information of certain documents. But once I understood where to put my trust, the information I needed was extracted and ready to be processed.
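
For readers who have never met this kind of encoding, here is a small sketch of the "if bit N is set, a field follows" pattern. The layout is invented purely for illustration and is not the real WP6+ box layout: a little-endian 16-bit flags word, where bit 15 announces a 16-bit width and bit 14 announces a 16-bit height. The names BoxOverrides, readU16 and parseBoxOverrides are hypothetical as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical result type; the real WP6+ box data carries far more fields.
struct BoxOverrides
{
    bool hasWidth = false;
    bool hasHeight = false;
    std::uint16_t width = 0;
    std::uint16_t height = 0;
};

// Reads a little-endian 16-bit value at offset and advances offset past it.
std::uint16_t readU16(const std::vector<std::uint8_t> &data, std::size_t &offset)
{
    if (data.size() < 2 || offset > data.size() - 2)
        throw std::out_of_range("truncated box data");
    const std::uint16_t value =
        static_cast<std::uint16_t>(data[offset] | (data[offset + 1] << 8));
    offset += 2;
    return value;
}

// Invented layout: flags word first, then only the fields whose bit is set.
BoxOverrides parseBoxOverrides(const std::vector<std::uint8_t> &data)
{
    std::size_t offset = 0;
    const std::uint16_t flags = readU16(data, offset);

    BoxOverrides result;
    if (flags & 0x8000) // bit 15: a width follows
    {
        result.hasWidth = true;
        result.width = readU16(data, offset);
    }
    if (flags & 0x4000) // bit 14: a height follows
    {
        result.hasHeight = true;
        result.height = readU16(data, offset);
    }
    assert(offset <= data.size());
    return result;
}
```

The point of the pattern is that the position of every field depends on all the flag bits before it, so a single misread bit shifts everything that follows.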

And that is where the real problems started. On the one hand, the WP6+ file format is much more expressive than ODF concerning object position. Things like "shifted one inch left from the center of the page text area" do not map directly to anything in ODF. And given that in WordPerfect the page margins can be set differently for each paragraph, the position is not as easy to compute as one could expect. Nevertheless, the biggest surprise came from the way the reference implementation of ODF handles page-anchored objects that correspond to the second case of text:anchor-type="page" referenced at page 294 of this file as well as at the page 303 of this one. According to the specifications, if one omits the text:anchor-page-number property, one would expect the frame to appear on the same page as the character immediately following the draw:frame closing element. And this is exactly what happens inside WordPerfect, so I was happy like a child. Premature happiness, though. When our dearest reference implementation finds a page-anchored frame without the text:anchor-page-number property, it apparently assumes that this is just forgetfulness on the part of the filter writer and adds a text:anchor-page-number="0" attribute. This invariably places all the page-anchored frames outside the specified range for page numbers (1..N) and thus outside the document (although, fortunately, they still remain in the file).

So, there are two possible workarounds that I managed to find, although neither of them is satisfactory, for different reasons. The first consists in using the hints that the WordPerfect formatter leaves in the file for itself to improve the speed of rendering. One can use the soft page breaks and hard page breaks to count which page we are on at a given moment. It is a possible workaround, but it only does a good job when the documents are reasonably short. With long documents, font substitution and not completely correctly converted spacing between headers, footers and the text area, one will find oneself with frames drawn on completely unrelated pages. So this solution has little of my sympathy. The second workaround is to use text:anchor-type="char", with which the anchoring is also relative to the position of the first character that follows the draw:frame closing element. And it works well as a workaround. My principal caveat is that although our reference implementation of the ODF standard accepts the values "page" and "page-content" for the style:vertical-rel attribute in such cases, according to the tables on the page 660 of this file as well as at the page 672 of this one, they do not conform to the specifications given the anchoring. So, although the files I generate using this workaround do the right thing(TM) for me, they are strictly speaking not correct ODF (although the validator does not detect it), and other implementations might refuse them or render them completely incorrectly.
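
The first workaround, counting page breaks, boils down to something like the following sketch. The Token enumeration and the function pageNumberAt are hypothetical simplifications: a real WP6+ document is a stream of many more kinds of codes, and this sketch assumes the soft page breaks stored in the file still match the layout produced by the consuming application, which is exactly what tends to fail for long documents.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified view of a document's content stream.
enum class Token { Text, SoftPageBreak, HardPageBreak };

// Returns the 1-based page on which the token at anchorIndex falls, by
// counting every soft or hard page break that precedes it.
int pageNumberAt(const std::vector<Token> &tokens, std::size_t anchorIndex)
{
    assert(anchorIndex < tokens.size());
    int page = 1;
    for (std::size_t i = 0; i < anchorIndex; ++i)
    {
        if (tokens[i] == Token::SoftPageBreak || tokens[i] == Token::HardPageBreak)
            ++page;
    }
    return page;
}
```

The resulting number would then be what goes into text:anchor-page-number, with all the drift problems described above.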

Frankly, I did not have time to dig into the code that implements this frame-related stuff, but someone who knows where to look might maybe find it rather trivial to fix. I said, "maybe", right?

The people want screenshots!

I know, I know, and here is one.

I also want to thank my distinguished employer for making this possible by giving us the privilege of ITO (Innovation Time Off). It is really sweet to be able to accomplish things like this on company time. And for those who face customer requests for converted WordPerfect documents with images and text boxes, the std::answer you will receive from me from now on will be to use the CVS HEAD of libwpd, libwpg and writerperfect to produce the wpd2odt tool that knows it all.