Happy third birthday, Patrick!
Thursday, October 25, 2007
Thursday, October 18, 2007
I love this! I misunderstood a bit what the ITO was, and at a certain point I realized that I had accumulated quite a load of it. So I had to spend it somehow. Here is an account of what it was useful for:
Finishing off what was started in Barcelona
Naturally, if you speak to me about fun hacking, the image of libwpd or libwpg will always come to my little brain. As I blogged earlier, the first thing I used some time for was to finish off the work on tabulator conversion for the WP 2.x - 3.5e for Mac and WP 5.x for DOS/Windows file-formats. It was started in Barcelona, and the principle of "release soon, release often" dictated that it should not sleep on my disk for a long time. I even got a very positive echo from the overwhelming part of our Mac OSX user community after the 0.8.12 release (Yes, Smokey, you are about 87.49% of our active Mac OSX community :-)). So, if you have not done so yet, throw yourselves on the binaries and sources while they are still available.
Buffered stream implementation
An old proverb says that one should try not to put off useful people. This was also one of the reasons why, in the latest releases of libwpd, I deprecated our libgsf-based stream implementation. Understand me well: libgsf is a fine piece of software and is doing a lot of good to the world. Nevertheless, since libwpd is used by the Koffice people along with some other office suites, I did not want to keep it depending on platform-specific libraries, all the more so because the functionality that we use from these libraries is really very very very very tiny. So we used a hacked version of Ariya's pole (one header and one source file) to provide two additional WPXInputStream implementations that do not depend on anything besides a working STL implementation.
So far, so good, but... As soon as the C++ streams (as I call them) became the default for the internal libwpd tools, we came to the point where the above-mentioned proverb was going to be put to the test again. Suddenly, Sum1's QA run started to take several days instead of the single night it took with the libgsf-based stream implementation. And putting off Sum1 is something that every hacker with some rudiments of a sane mind should avoid at all costs.
By the same token, some other performance issues were reported from other contexts, and the fact that libwpd reads at most 4 bytes in a single read started to show as a real problem. So, some time ago, inspired by Kendy's intent to implement read-ahead in WPXSvStream, I hacked something similar into the sample WPXFileStream. But there was a problem. Very probably a hideous bug in libstdc++ causes, in some specific cases, a backward seek of 2 bytes followed by a read of 1kB to result in badbit being set. And this happened mainly on x86 with gcc 4.x (x86_64 was working well). Even x86 with gcc 3.4.x worked well, as did our distinguished competitor's Visual Studio compilers. Since I did not have much time back then to look into it, I left the idea of buffered reads hanging.
Now, during my ITO, I gave it a quick second shot and by sheer luck I found a workaround that is not really penalizing in terms of performance and makes the buffered stream work well. The solution is simple: before the read, seek to the end of the block that we want to read and then back to the position where we want to start reading. With this incantation, the subsequent read does not set any badbit anymore and everything is nice in the best of worlds. Nonetheless, if a libstdc++ hacker is interested in investigating this bug, I have a historical version of the stream with a sample document that always triggers the above-mentioned behaviour.
Positioned objects in WP6+ and ODF reference implementation woes
The main task I had been prepared to accomplish ever since our dear hackweek, and for which I was only waiting for a long enough stretch of contiguous free time, was to get the size, the position and the anchoring of the positioned objects (images, text boxes,...) right. The problem was in multiple places. First of all, the way the box information is encoded in WP6+ files is really only remotely connected to my idea of fun hacking. The box/frame style is good old binary-encoded information (if bit 15 is set, the following information is available... etc.). The code that inserts the box/frame can override the style, and whether it does so or not is again given by some bits of a number at the beginning of the code. Add to that an acquired mistrust of the documentation provided by Corel, and I was not really hot to spend my week-ends on this. Nevertheless, some recent user demands and the vision of a well-used ITO made me bite the bullet. And surprisingly, the main difficulty was not where I expected it to be. This time the documentation was right (at least for the information that I currently use). This time the tool I had always trusted the most, wplook.exe from the WordPerfect Office 11 SDK, got some things wrong and segfaulted on accessing the box/frame information of certain documents. But once I understood where to put my trust, the information I needed was extracted and ready to be processed.
And that is where the real problems started. On the one hand, the WP6+ file-format is much more expressive than ODF concerning object position. Things like "shifted one inch left from the center of the page text area" do not map directly to anything in ODF. And given also the fact that in WordPerfect the page margins can be set differently for each paragraph, the position is not as easy to compute as one could expect. Nevertheless, the biggest surprise came from the way the reference implementation of ODF handles page-anchored objects that correspond to the second case of text:anchor-type="page" referenced at page 294 of this file as well as at page 303 of this one. According to the specifications, if one omits the text:anchor-page-number property, one would expect the frame to appear on the same page as the character immediately following the draw:frame closing element. And this is exactly what happens inside WordPerfect, so I was happy like a child. Premature happiness though. When our dearest reference implementation finds a page-anchored frame without a text:anchor-page-number property, it apparently assumes that this is just forgetfulness on the part of the filter writer and adds a text:anchor-page-number="0" attribute. This invariably places all the page-anchored frames outside the specified range for page numbers (1..N) and thus outside the document (although, fortunately, they still remain in the file).
So, there are two possible workarounds that I managed to find, although neither of them is satisfactory, for different reasons. The first consists in using the hints that the WordPerfect formatter leaves in the file for itself to improve the speed of rendering. One can use the soft page breaks and hard page breaks to count which page we are positioned on at a given moment. It is a possible workaround, but it does not do a good job unless the documents are reasonably short. With long documents, font substitution and not completely correctly converted spaces between headers, footers and the text area, one will find oneself with frames drawn on completely unrelated pages. So, this solution has little of my sympathy. The second workaround is to use text:anchor-type="char", with which the anchoring is also relative to the position of the first character that follows the draw:frame closing element. And it works well as a workaround. My principal caveat is that although our reference implementation of the ODF standard accepts in such cases the values "page" and "[...]" for the style:vertical-rel attribute, according to the tables on page 660 of this file as well as on page 672 of this one, they do not conform to the specifications given the anchoring. So, although the files I generate using this workaround do the right thing(TM) for me, they are strictly speaking not correct ODF (although the validator does not detect it) and other implementations might refuse them or render them completely incorrectly.
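To make the two workarounds a bit more tangible, here is a small sketch of what a filter could emit in each case. It is not taken from writerperfect: the PageCounter struct, the Anchor enum and the frameOpenTag helper are hypothetical names invented for this example, and only the draw:frame element and its text:anchor-type, text:anchor-page-number, svg:width and svg:height attributes come from ODF itself.

```cpp
#include <sstream>
#include <string>

// Workaround 1 (hypothetical): track the current page by counting the soft and
// hard page breaks found in the WordPerfect file. As noted above, this drifts
// on long documents.
struct PageCounter
{
    int page = 1;
    void onPageBreak() { ++page; }   // call for every soft or hard page break
};

enum class Anchor { Page, Char };

// Build the opening tag of a draw:frame element (hypothetical helper).
// Page anchoring always writes an explicit page number, so that the consumer
// has no reason to fill in 0; char anchoring (workaround 2) needs no number.
std::string frameOpenTag(Anchor anchor, int currentPage,
                         double widthInch, double heightInch)
{
    std::ostringstream tag;
    tag << "<draw:frame";
    if (anchor == Anchor::Page)
        tag << " text:anchor-type=\"page\""
            << " text:anchor-page-number=\"" << currentPage << "\"";
    else
        tag << " text:anchor-type=\"char\"";
    tag << " svg:width=\"" << widthInch << "in\""
        << " svg:height=\"" << heightInch << "in\">";
    return tag.str();
}
```

A caller would bump the PageCounter whenever it meets a page break and pass its page member to frameOpenTag when it chooses Anchor::Page; with Anchor::Char the page argument is simply ignored.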
Frankly, I did not have time to dig into the code that implements this frame-related stuff, but someone who knows where to look might maybe find it rather trivial to fix. I said, "maybe", right?
The people want screenshots!
I know, I know, and here is one.
I also want to thank my distinguished employer for making this possible by giving us the privilege of ITO (Innovation Time Off). It is really sweet to be able to accomplish things like this on company time. And for those who face customer requests for converted WordPerfect documents with images and text boxes, the std::answer you will receive from me from now on will be to use the CVS HEAD of libwpd, libwpg and writerperfect to produce the wpd2odt tool that knows it all.
Thursday, October 11, 2007
Because we live in very interesting times, some information might get pushed into the background, even though it is very important. Last week-end, an e-mail from Jaroslav Fojtik informed me of his success in finding a way to decrypt password-protected WP 3.x for Mac files. To be precise, it is not about cracking the protection, just about the ability to read password-protected documents using FOSS once you know the password. This masterpiece of reverse engineering (which is completely legal in some parts of the world) adds this file-format to the WP 4.x and WP 5.x for DOS/Windows formats that the FOSS world was already able to decrypt.
The only file-formats that we are still not able to decrypt are WP 1.x for Mac and WP 6+. For the former, work is currently being done; the latter looks like a somewhat harder nut to crack. If any of the people reading this entry has any useful information about the encryption of WP 6+ documents, please read here. We would like to stress that we are not interested in password cracking or other illegal activity. We simply want to make it possible for a user of FOSS to read her own WordPerfect documents that she once protected using a password. So, abstain from communicating information whose publication could be illegal in your jurisdiction. OTOH, genuine help with the decryption is most welcome.
The API for decrypting WP documents will most probably be part of libwpd 0.9.x series.
libwpd 0.8.12, codename "Barcelona no s'acaba a Catalunya", left its mother's womb yesterday. Besides build fixes for the upcoming gcc 4.3, this release features initial (although not really so lame) support of tabulators for WP 2.x - 3.5e for Mac and WP 5.x for DOS/Windows file-formats, tabset conversion in WP 5.x, as well as some font ID to font name mappings for WP 1.x for Mac file-format. The cross-compilation framework was drastically updated and a compilation with Sun Studio 12 for Linux was also tested and this release is reputed to work with it.
As the name indicates, the major part of the code for this release was done during the wife-less nights spent at the OOoCon2007 in Barcelona. Although, as the name also indicates, some code was added even once outside the territory of the Paisos Catalans.
It is free software released under the LGPL and it is free as in beer, so take it, taste it and enjoy it. The rpms for different SuSE and Fedora flavours can be found in my openSuSE build service home repository. The cross-compiled win32 binaries are to be found on the libwpd download page.
Besides the normal authors and contributors, the thanks for this release go to our very own Smokey Ardisson, to Rob Staudinger, Hub Figuiere and, last but not least, to my dearest employer, Novell, Inc. and its Innovation Time Off program.
Thursday, October 04, 2007
John, I completely agree with your point that one can assign copyrights to an entity if one can trust it. Where I don't agree is in your conclusion that Sun Microsystems is such an honest broker. My own experience from the project is that a lot of lip service is paid to improvements of all kinds, but that the more things change, the more they stay the same. And I have a tendency to distrust entities that are full of good intentions for the long term but that, as the long term becomes shorter, show basically no results.
Not that I am expecting anything good to happen. Between helping to build a strong developer community and trying to accumulate other people's code for proprietary licensing, Sun already decided anyway.