Tuesday, July 11, 2006

Upcoming libwpd-0.8.6 - more robust than ever

Over the last two to three weeks, your servant worked extensively with AbiWord's QA expert sum1 on improving libwpd stability. Here are the results of that work (a comparison between libwpd-0.8.5 and the upcoming libwpd-0.8.6):

                   libwpd 0.8.5                libwpd 0.8.6

Total tested:      46675                       46675
Unsupported:         122 ( 0.26%)                 15 ( 0.03%)
                   -----                       -----
Actually tested:   46553 (99.74% import rate)  46660 (99.97% import rate)

Parse exceptions:    660                           0
File exceptions:     555                           8
Unknown errors:        5                           0
Crashes:              13                           0
                   -----                       -----
Total failures:     1233 (2.65% failure rate)      8 (0.02% failure rate)
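
For readers who want to check the arithmetic: the figures in the table are consistent with the import rate being taken over all tested files and the failure rate over the files that were actually tested (total minus unsupported). The short sketch below recomputes them under that assumption. The names `summarize` and `RESULTS` are made up for this illustration and are not part of libwpd or of the test scripts.

```python
# Illustrative recomputation of the table above; not part of libwpd.
TOTAL_TESTED = 46675

# Raw counts copied from the table.
RESULTS = {
    "0.8.5": {"unsupported": 122, "parse": 660, "file": 555, "unknown": 5, "crashes": 13},
    "0.8.6": {"unsupported": 15, "parse": 0, "file": 8, "unknown": 0, "crashes": 0},
}


def pct(part, whole):
    """Percentage rounded to two decimals, as printed in the table."""
    return round(100.0 * part / whole, 2)


def summarize(total, counts):
    """Derive the summary rows from the raw counts.

    Assumption: the failure rate is relative to the files actually
    tested, not to the total number of files.
    """
    tested = total - counts["unsupported"]
    failures = sum(n for kind, n in counts.items() if kind != "unsupported")
    return {
        "tested": tested,
        "failures": failures,
        "import_rate": pct(tested, total),
        "failure_rate": pct(failures, tested),
    }


if __name__ == "__main__":
    for version, counts in RESULTS.items():
        print(version, summarize(TOTAL_TESTED, counts))
```

Dividing the failures by the total instead would give 2.64% for 0.8.5, which does not match the table, so the denominator is most likely the "actually tested" count.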

Of the 15 unrecognised documents, some are not WordPerfect documents at all, and some are corrupted to the extent that they contain only 4 or 8 bytes, which is too little even for a very intelligent corruption repair utility. Some are corrupted WP 4.2 documents that we cannot repair for the reasons given below.

The 8 documents that fail to be converted are mainly WP 6 documents with a corrupted prefix area.

The improvements are due to:

  • Handling of impossible situations that are in fact possible in the Corel WordPerfect file format, like tables with an unequal number of cells in each row. We now complete such tables to produce valid output. Before, we would throw a ParseException in such a case.

  • Trying to handle any data that is missing due to corruption (e.g. a prefix packet containing a footnote is missing, but is still referenced by the text body).

  • A new corruption repair mechanism that detects corrupted areas in the document body, skips the inconsistent information and resumes normal parsing after the corrupted area. Empirical tests showed that libwpd with this heuristic is able to correctly open some corrupted documents that make WordPerfect itself crash. A comparison between corrupted documents opened directly by libwpd and the same documents repaired by wplook.exe, a corruption repair utility from the WordPerfect SDK, shows that libwpd's corruption repair heuristic can add some garbage characters at the corrupted place (due to the way it operates), but does not recover less information than wplook.exe. The only document format that we are not repairing for now is the WP 4.2 file format. This format has no header or magic value by which it could be detected. It is simply a plain text document with embedded WordPerfect functions. Type detection for this format is done by a dry-run parse of the document that determines whether it is a well-formed WP 4.2 document. Adding our corruption repair heuristic in this case would mean that virtually any document on earth would be detected as a WP 4.2 document, and this is not really the result we want to achieve.
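
To make the last point more concrete, here is a minimal sketch of the skip-and-resume idea and of why it cannot be combined with detection by dry-run parsing. It uses a made-up toy format, not any real WordPerfect format: bytes below 0xC0 are text, and a byte of 0xC0 or above opens a "function" consisting of the code byte, a length byte n, n payload bytes and the same code byte again as a closing gate. All names (`parse`, `looks_like_toy_format`, `ParseError`) are hypothetical and do not come from libwpd, and the real heuristic is more involved than this.

```python
# Toy illustration only; neither the format nor the names come from libwpd.


class ParseError(Exception):
    """Raised by the strict parser when a function is malformed."""


def _function_end(data, i):
    """Return the index just past a well-formed function at i, else None."""
    if i + 1 >= len(data):
        return None
    close = i + 2 + data[i + 1]  # code byte + length byte + payload
    if close < len(data) and data[close] == data[i]:
        return close + 1
    return None


def parse(data, repair=False):
    """Extract the text of a toy document.

    With repair=False a malformed function raises ParseError.
    With repair=True the offending code byte is skipped and parsing
    resumes at the next byte, so stray payload bytes may leak into
    the output as garbage characters.
    """
    text = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0xC0:  # ordinary text byte
            text.append(chr(byte))
            i += 1
            continue
        end = _function_end(data, i)
        if end is not None:  # well-formed function: skip over it
            i = end
        elif repair:  # corrupted: drop one byte and resynchronise
            i += 1
        else:
            raise ParseError(f"malformed function at offset {i}")
    return "".join(text)


def looks_like_toy_format(data, repair=False):
    """Detect the format by a dry-run parse, as for a headerless format."""
    try:
        parse(data, repair=repair)
        return True
    except ParseError:
        return False
```

With a document whose closing gate is damaged, for example `b"Hi" + bytes([0xC3, 2, 1, 2, 0xC4]) + b"!"`, the strict parse raises `ParseError`, while the repairing parse returns the surrounding text with the three payload bytes left in between as garbage. The price is that `looks_like_toy_format(data, repair=True)` returns True for any input whatsoever, because the repairing parser never fails. For a format that has a header or magic value this does not matter, but for one detected only by dry-run parsing it makes the detection useless.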

Thanks go to sum1, who collected more than 46k real-life WordPerfect documents from the web and ran the test loop over and over again for each of my commits. This shows how important dedicated, non-bureaucratic, responsive QA is for the quality of a FOSS project.

We can now proudly tell users that if they fail to open their corrupted WordPerfect document with WordPerfect itself, it is not necessarily lost. They can try to open it with one of the fine free and open source word processors, AbiWord, KWord or OpenOffice.org Writer, and statistically they stand a good chance of recovering their information.

Subliminal message: What would be the score of the proprietary, binary, stale and unmaintained WordPerfect import filters distributed with StarOffice 8?