Optimizing DictionaryReader
Warning: I'm talking about computer science here.
In recent days, I noticed increased activity in the DictionaryReader project, mostly caused by Yen-Ju working through the remaining bugs. Among others, he fixed bug 8497, which annoyed me for quite some time back when I was working on DictionaryReader. The fix is particularly cool, so I want to share it with you.
Let's first have a look at the application: DictionaryReader lets you look up words in local and online dictionaries. The format used for locally stored dictionaries is a seemingly undocumented one that is also used by the Dict.org server, the de-facto standard for online dictionary services. In this format, each dictionary consists of two files:
- A dictionary file containing the word descriptions
- An index file listing the contained words together with the positions of their descriptions in the dictionary file.
When the dict.org server program starts up, it reads in the index files and stores the description positions in RAM for fast access. DictionaryReader does exactly the same, but with a small difference: whereas the dict.org server is already running when the user requests a word lookup, DictionaryReader may be just starting up at that moment and still needs to read in the index files. And this was terribly slow until Yen-Ju reduced the time to scan an index file by about 30-40%.
The index file consists of lines of the following format:
    word <TAB> start-position-as-Base64 <TAB> length-as-Base64
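As I understand the format, the two numbers use the standard Base64 alphabet (A-Z, a-z, 0-9, +, /), but simply as digits of a base-64 number, most significant digit first and without padding. A minimal decoder sketch in Python (the function name is my own, not from DictionaryReader):

```python
B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_number(s):
    """Decode a dict.org-style Base64 number, most significant digit first."""
    n = 0
    for ch in s:
        n = n * 64 + B64_ALPHABET.index(ch)
    return n

# A hypothetical index line "apple\tB5\tCx" would then mean:
#   start  = b64_number("B5") = 1*64 + 57 = 121
#   length = b64_number("Cx") = 2*64 + 49 = 177
```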
The original version of DictionaryReader's algorithm for reading in the index file read each line part by part, parsed the Base64-coded integers, and created an instance of the specially crafted BigRange class. This class was an ugly hack that was necessary because the Cocoa/GNUstep NSRange datatype is unfortunately a C struct and thus can't be stored in the NSDictionary hashmap, which maps dictionary entries (words, terms) to the positions of their descriptions.
Yen-Ju's clever solution is actually quite obvious in hindsight: don't parse the Base64-coded integers while reading the index, but store the two still-encoded values directly in the hashmap and decode them only when a word is actually looked up. This results in higher memory usage per term, but it frees us from allocating BigRange objects in addition to the strings, which possibly have to be allocated internally anyway. As allocations are expensive, the speed improvement is quite good. A cool hack, and Yen-Ju even got rid of the ugly BigRange class. :-)
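The difference between the two approaches can be sketched in Python (all names here are hypothetical; the real code is Objective-C working with NSDictionary):

```python
B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_number(s):
    """Decode a dict.org-style Base64 number, most significant digit first."""
    n = 0
    for ch in s:
        n = n * 64 + B64_ALPHABET.index(ch)
    return n

class Range:
    """Stand-in for the BigRange wrapper the old code allocated per entry."""
    def __init__(self, start, length):
        self.start = start
        self.length = length

def load_index_eager(lines):
    # Old approach: decode both numbers and allocate a wrapper object
    # for every single index entry, up front.
    index = {}
    for line in lines:
        word, start, length = line.rstrip("\n").split("\t")
        index[word] = Range(b64_number(start), b64_number(length))
    return index

def load_index_lazy(lines):
    # Yen-Ju's approach: keep the raw Base64 substrings. No per-entry
    # wrapper allocation and no decoding at load time.
    index = {}
    for line in lines:
        word, start, length = line.rstrip("\n").split("\t")
        index[word] = (start, length)
    return index

def lookup(index, word):
    # Decoding now happens only for the one word actually looked up.
    start, length = index[word]
    return b64_number(start), b64_number(length)
```

The Python analogy is loose, since a tuple is an allocation too; the actual win in the Objective-C code is skipping one extra object allocation and two Base64 decodes per index entry at startup.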
I think it's amazing. :-) Look at the patch here.