Thursday, December 21, 2006

Uberdedicated TrainedMonkey?

For those who do not want to wait for the integration of the CWS wpsimport01, the Uberdedicated TrainedMonkey cloned the standalone writerperfect from the libwpd CVS repository and created a command-line tool that lets you convert your stack of MS Works documents into the OpenOffice.org 1.0 file format right now. The cloning was done mainly for debugging purposes, because com.sun.star.comp.Writer.XMLImporter seems pickier than the normal document-loading process, but the tool is usable for converting documents too. The module is called "wps2sxw" and lives in the libwps Subversion repository.

So run there to grab it while it is hot! And do not forget to subscribe to the project's mailing list and send some encouraging words to Andrew. Believe me, he does excellent work!

Tuesday, December 19, 2006

One word that matters

One word can change many things. And sometimes the words one rarely uses have a kind of magic in them.

Here is the problem we were facing yesterday. This piece of code was refusing to compile:

1    #include <vector>
2    template< class key , class hashImpl , class equalImpl >
3    class OMultiTypeInterfaceContainerHelperVar {
4        typedef ::std::vector< std::pair < key , void* > > InterfaceMap;
5        InterfaceMap *m_pMap;
6        inline void * find() {
7            InterfaceMap::iterator iter;
8            return NULL;
9        }
10   };

The complaint was about an expected ';' before 'iter' on line 7. Looking closely, one could conclude that the code is correct and start filing a bug in the gcc bugzilla :-) And yet, in spite of the fact that it looks very reasonable to anybody who has a basic knowledge of C++, this code is not valid C++ and the compiler is right. For line 7 to compile correctly, InterfaceMap::iterator has to be a type. But the compiler cannot be 100% sure about this at the moment when it is parsing the template class. It will know more at the moment of the instantiation of the template, but not before. There is still a tiny little probability that, for a certain class key, there will be a specialization of std::vector in which InterfaceMap::iterator is not a type. Because the compiler cannot be 100% sure that InterfaceMap::iterator is a type, it assumes that it is not.
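To make that "tiny little probability" concrete, here is a minimal sketch with hypothetical names (Holder and scaled are invented for illustration and are not from the OpenOffice.org code). The same nested name is a type in the general template and a value in one specialization, so the compiler has to settle on one reading before it knows T:

```cpp
// Hypothetical names, for illustration only.
template <class T>
struct Holder {
    typedef T* iterator;            // in the general case, 'iterator' is a type
};

template <>
struct Holder<char> {
    static const int iterator = 3;  // in this specialization, it is a value
};

template <class T>
int scaled(int x) {
    // Parsed as a multiplication: the compiler assumes that the
    // dependent name Holder<T>::iterator is not a type.
    return Holder<T>::iterator * x;
}
```

Here scaled<char>(4) multiplies 3 by 4, while scaled<int>(4) would be rejected at instantiation time, because only then does Holder<int>::iterator turn out to be a type.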

But now what? How can one make the compiler accept the code? Simply by telling it that InterfaceMap::iterator is a type. This is where a word that is not so often used (in fact, your servant never had to use it until now) comes in handy. If you precede the declaration on line 7 with the keyword typename, everything compiles without a hitch. The presence of this keyword removes the compiler's uncertainty and, with its mind settled, it accepts the code.

So, the rule of thumb is to always precede the use of dependent nested types in templates with the magic word typename.
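As a sketch of the rule in action, here is a simplified stand-in for the class above. InterfaceContainer, add and the key parameter of find are assumptions made for the example, not the real OpenOffice.org helper:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the class from the post.
template <class Key>
class InterfaceContainer {
    typedef std::vector<std::pair<Key, void*> > InterfaceMap;
    InterfaceMap m_map;

public:
    void add(const Key& key, void* value) {
        m_map.push_back(std::make_pair(key, value));
    }

    void* find(const Key& key) {
        // InterfaceMap depends on Key, so InterfaceMap::iterator is a
        // dependent name; 'typename' tells the compiler it is a type.
        typename InterfaceMap::iterator iter;
        for (iter = m_map.begin(); iter != m_map.end(); ++iter) {
            if (iter->first == key)
                return iter->second;
        }
        return NULL;
    }
};
```

Removing the typename from the declaration of iter brings back the kind of complaint quoted above.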

Having said this, I am quite happy that the long winter evenings spent with Scott Meyers were not in vain.

Monday, December 18, 2006

Microsoft Works import filter for OpenOffice.org

Some sweating was necessary but the result is worth it. So that OpenOffice.org does not fall too far behind AbiWord, your servant came up today with a Microsoft Works import filter for OpenOffice.org, based on the libwps library that was blogged about last week. I would even gladly provide screenshots, but the CWS wpsimport01, where the new filter lives, is based on milestone m196, which has some serious antialiasing issues on X11 platforms, so the pictures would not look nice at all. Nevertheless, the CWS will be resynced several times while the specifications are being written, so stay tuned for the screenshots.

The integration strategy is the following: first integrate the CWS fs08, which fixes some memory problems and crashes in the WordPerfect import filter. In the same CWS, make the writerperfect module build with as few warnings as humanly possible and refactor it so that other libraries reusing libwpd's public API can be plugged into it much more easily. After that, if a release of libwps happens and if the inclusion of the tarball gets through all the necessary paperwork, integrate the CWS wpsimport01.

As a bottom line, the refactoring in fs08 should make a possible shift to the new filter API that Henning blogged about almost transparent.

So, now is the time to study the new API in writerfilter2 and see whether the libwpd-family libraries can use it directly, or whether some modifications should be proposed.

Wednesday, December 13, 2006

On clean reusable interfaces and the pleasure of seeing things happen

As Dom already wrote, if your documents are mainly in MS Works format (yes, it is what often comes pre-installed on a newly purchased computer along with some version of Windows), you are not excluded from the free desktop any more.

Focus on libwps, an import filter library for the Microsoft Works word processor file format

There is a neat tiny library, libwps, that allows you to read those documents and convert them to other formats. A plug-in for the HEAD version of AbiWord exists, and its integration into OpenOffice.org should not take a lot of time (coding time, that is; the QA and paperwork are another beast).

The library (not yet released, but soon to be ready) deserves some spotlight. Not only does it fill a gap in file-format coverage by FOSS word processing applications, its author Andrew Ziem is an example of a hacker that your servant likes and gets encouraged by. He was one of the great applicants for Google Summer of Code with OpenOffice.org, but his proposal was in the end refused. It was a close call and the refusal was only due to the ridiculous number of 6 slots allocated for 69 eligible applications and 20 mentors standing on the starting blocks. Many others would get discouraged and abandon the idea of contributing to FOSS. Not Andrew! This great hacker started to code for his own pleasure and libwps is the wonderful result of his work. Your servant has had the privilege of being present at the birth of the project and feels lucky.

The number of reinvented wheels should be (almost) zero

For a FOSS project, the beginning is often important. The hacking has to be fun and the results have to come in a tangible manner quite quickly. Otherwise, one is more likely to reallocate one's free time to other activities and the nascent FOSS project experiences a cot-death. On the other hand, getting a library design right can be quite difficult, especially if one does it for the first time.

The design choice of libwps was the right one: take and reuse the interface and design of a tried and true library used by ALL free and open-source word processing applications, libwpd. This made it possible to avoid several iterations of badly designed interfaces and to focus on the MS Works file conversion itself. A nice side effect of this choice is that it was super-simple to integrate a filter based on libwps into AbiWord, and it will not be much more difficult to integrate it into OpenOffice.org or KWord as well. Well, from the design point of view, that is.

Clean and reusable interface design

That this was possible is a deliberate choice of the libwpd maintainers. Thanks go especially to Marc “uwog” Maurer and William Lachance for the original design. The idea would be to extract the reusable interfaces from libwpd itself and provide them as a framework for developing import (and possibly export) filters from (to) binary file formats at least. The libwps experience shows that it is possible. The advantage of the interface is that it does NOT expose STL types to the user, so it is usable for applications that use a different STL implementation, with different signatures, from the one used by the underlying system. This is the case with OpenOffice.org, which uses STLport internally. A library developed using this interface would be usable both as an internal and as a system library for the given application. So, filter developers of all countries, unite!
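What "not exposing STL types" can look like is sketched below. PlainString is a hypothetical class invented for this example and is not libwpd's actual API. The point is only that the public part mentions plain C types and an opaque pointer, while std::string stays hidden inside the library:

```cpp
#include <string>

// --- What would go into the public header: no STL type is visible. ---
// PlainString is a hypothetical name, not a class from libwpd.
class PlainString {
public:
    explicit PlainString(const char* text);
    PlainString(const PlainString& other);
    PlainString& operator=(const PlainString& other);
    ~PlainString();

    const char* cstr() const;   // plain C types only in the signatures
    int len() const;

private:
    struct Impl;                // defined only inside the library
    Impl* m_impl;
};

// --- What would stay in the library's .cpp file. ---
struct PlainString::Impl {
    std::string value;
    explicit Impl(const char* text) : value(text ? text : "") {}
};

PlainString::PlainString(const char* text) : m_impl(new Impl(text)) {}
PlainString::PlainString(const PlainString& other)
    : m_impl(new Impl(*other.m_impl)) {}
PlainString& PlainString::operator=(const PlainString& other) {
    if (this != &other)
        m_impl->value = other.m_impl->value;
    return *this;
}
PlainString::~PlainString() { delete m_impl; }
const char* PlainString::cstr() const { return m_impl->value.c_str(); }
int PlainString::len() const {
    return static_cast<int>(m_impl->value.size());
}
```

An application built against another STL implementation never sees a std::string in such a header, so the layout of its own std::string does not have to match the library's.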

Thursday, December 07, 2006

Benefits of getting lost

I do not know whether you have ever gotten lost only to discover a new, exciting landscape in your town, parts that you had never entered before. And so, in spite of the fact that you wasted some time, you feel lucky.

A document crashing OpenOffice.org

This is exactly what happened to me last week. Everything started with issue 71487. OpenOffice.org was crashing on the document, but libwpd's internal wpd2raw tool was spitting out a document that was well formed. AbiWord was opening the document without complaint, so the quick conclusion was that the culprit would be the writerperfect code. Writerperfect is the base of a nice tool, wpd2sxw, that uses libwpd to read a WordPerfect document and produce an SXW file. The core code of writerperfect is also used in KWord's and OpenOffice.org Writer's import filters.

Nevertheless, writerperfect produced from this document a nice well-formed flat XML in the SXW file format. Since the generated flat XMLs never really validated against the OpenOffice.org 1.0 DTD, the well-formedness test is so far the only tool that we are able to use to track regressions.

Valgrinding writerperfect

Things started to get a little tougher. Gdb was giving a partially corrupted trace and the resulting crash was outside the writerperfect code. So, in a desperate attempt to get the issue fixed, I tried to use valgrind to check whether writerperfect had somewhere a jump or a branch dependent on an uninitialized variable. I ran all the documents of our regression suite through valgrind and realized that writerperfect had a lot, a lot, a lot of memory problems. It took me a whole weekend and more to remove all the memory leaks (partially destroyed objects due to non-virtual destructors in base classes, a container of pointers going out of scope, ...). This work led to the inclusion of a writerperfect valgrind test in our regression test suite and to the removal of all detected memory problems.
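The two kinds of leak mentioned in the parentheses can be illustrated with a small sketch. The classes and the counter below are hypothetical and much simpler than the writerperfect code. They only show the shape of the fix: a virtual destructor in the base class, and an explicit delete of every pointer before the container goes away.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classes and counter, used only to make the effect observable.
static int g_liveBuffers = 0;

class Element {
public:
    // Virtual, so that 'delete' through an Element* also runs the
    // destructor of the derived class.
    virtual ~Element() {}
};

class TextElement : public Element {
public:
    TextElement() : m_buffer(new char[64]) { ++g_liveBuffers; }
    ~TextElement() { delete[] m_buffer; --g_liveBuffers; }

private:
    TextElement(const TextElement&);             // not copyable
    TextElement& operator=(const TextElement&);
    char* m_buffer;
};

// A std::vector<Element*> going out of scope releases only the storage
// for the pointers, so the pointed-to objects are deleted here by hand.
void deleteAll(std::vector<Element*>& elements) {
    for (std::size_t i = 0; i < elements.size(); ++i)
        delete elements[i];
    elements.clear();
}

// Creates two elements, cleans them up, and reports how many buffers
// are still alive afterwards.
int liveBuffersAfterCleanup() {
    std::vector<Element*> elements;
    elements.push_back(new TextElement());
    elements.push_back(new TextElement());
    deleteAll(elements);
    return g_liveBuffers;
}
```

Without the virtual keyword, deleting a TextElement through an Element* is undefined behavior, and in practice the buffer is typically never freed; without deleteAll, the objects are simply lost when the vector goes out of scope.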

Problems with SXW generation on x86_64

After all this cleanup work, I tried to load the document again. And OpenOffice.org crashed again, exactly as before. Moreover, on my home machine, which runs Ubuntu Edgy amd64, the command-line tool wpd2sxw produced clean XML in SXW format on standard output, but the content.xml in the zip file contained some garbage characters. On another machine, running Ubuntu Edgy i386, the garbage was not there.

The amazing permissiveness of the WordPerfect file-format

A bit disgusted, I asked Mathias Bauer whether he could see anything. He took the document and confirmed that the trace (the part I could not see because of the corruption) originated in the WordPerfect importer. So, the last desperate move was to boot the Windows partition and examine the document in WordPerfect itself. I know, that should maybe have been the first step, but given that I have only one Windows partition and that one is on my wife's laptop... A side-by-side examination of the original document opened in WordPerfect and the converted document opened in AbiWord showed that AbiWord did open the document, but imported only half of it. So, the time came for a nice eye-to-eye session with ghex2. A close examination of the document in the hexadecimal viewer showed that the document contains a footnote that itself contains 3 footnotes. WordPerfect reacts to such cases by completely ignoring the nested footnotes, but it leaves the functions in the stream unchanged. I am always puzzled by the way WordPerfect preserves users' errors for future generations. So the solution was simply to instruct libwpd to ignore foot/endnotes if it is already parsing a foot/endnote. A simple fix of a few lines, and the document loads correctly in OpenOffice.org, AbiWord and KWord.
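The idea of the fix can be sketched on a toy token stream. Everything below (Token, makeToken, render) is hypothetical and invented for this illustration. The real libwpd parser is organized differently, and the actual patch is not reproduced here. The sketch only shows the principle: a note that opens while another note is being parsed is skipped together with its content.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model; libwpd's real parser is organized differently.
struct Token {
    enum Kind { Text, NoteOpen, NoteClose };
    Kind kind;
    std::string text;
};

Token makeToken(Token::Kind kind, const std::string& text = "") {
    Token token;
    token.kind = kind;
    token.text = text;
    return token;
}

// Renders the stream, keeping top-level notes and dropping any note
// that starts while another note is already being parsed.
std::string render(const std::vector<Token>& tokens) {
    std::string out;
    int depth = 0;  // 0 = body text, 1 = inside a note, >1 = nested note
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const Token& token = tokens[i];
        switch (token.kind) {
        case Token::NoteOpen:
            if (++depth == 1) out += "[note: ";
            break;
        case Token::NoteClose:
            if (depth == 1) out += "]";
            if (depth > 0) --depth;
            break;
        case Token::Text:
            if (depth <= 1) out += token.text;
            break;
        }
    }
    return out;
}
```

The depth counter is the whole trick: the nested functions are still consumed from the stream, they just do not produce any output.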

gsf_output_printf and long strings on x86_64

The wpd2sxw command-line tool uses the libgsf abstraction for writing the different streams into the SXW zip file. So, after having audited our code thoroughly and not finding anything that could be wrong, my attention turned to libgsf, more precisely to the gsf_output_printf function. I discovered that this function uses, among others, a g_vsnprintf call. Knowing the woes that one can have due to the different implementations of vsnprintf out there, I replaced the gsf_output_printf calls with gsf_output_puts calls and...

... the resulting SXW loaded into OpenOffice.org without any problem. Instead of bugging Jody, who by now should have had enough of me for some weeks (so much did I try to suck knowledge out of his brain in my H.Opeless phase), I spoke to Dom Lachowicz about the problem and, eventually, Morten fixed it in libgsf CVS. It turned out to be the same problem that affected GsfOutputMemory about a year ago.
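The post does not say what the root cause in libgsf was, so the following is only an assumption about the general class of problem, not a description of the actual fix. One well-known vsnprintf woe that tends to show up on x86_64 but not on i386 is passing the same va_list to two formatting calls: the first call may consume it, and the second then reads garbage. A minimal sketch of the safe pattern, with a hypothetical helper formatString that is not part of libgsf:

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper, not part of libgsf. It formats into a buffer
// that is sized by a first "measuring" pass.
std::string formatString(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);

    // First pass: measure on a COPY, because a va_list must not be
    // reused after it has been handed to vsnprintf.
    va_list probe;
    va_copy(probe, args);
    int needed = std::vsnprintf(NULL, 0, fmt, probe);
    va_end(probe);

    std::string result;
    if (needed > 0) {
        // Second pass: format for real with the untouched original list.
        std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
        std::vsnprintf(&buffer[0], buffer.size(), fmt, args);
        result.assign(&buffer[0], static_cast<std::size_t>(needed));
    }
    va_end(args);
    return result;
}
```

Whatever the real cause was, a puts-style call that takes an already formatted string avoids the variadic machinery altogether, which is consistent with the workaround described above.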

Positive externalities of being lost

It is true that the real fix for issue 71487 did not take more than a few lines of code, but the fact of being lost and H.Opeless for some time had nice positive externalities. Thanks to this situation, it came to my mind to start to valgrind the writerperfect code (it was not on my todo list, at that time at least) and the memory problems got solved. A bug in the libgsf code was triggered and fixed.

And so, in spite of the fact that I lost quite a lot of time with this, I feel lucky when I look at the positive externalities.