Jardic Pro Implementation Details

Dictionary File Structure

Dictionary File Logical Structure

Jardic Pro dictionary file consists from the set of article records that can be referenced from different indexes. A size of one article record is unlimited. Format of one article record can vary depending on its origination (Jardic, EDICT, DSL, Eijiro and so on).

Indexes are used to store lists of keys. From the user point of view they are lists of words. Index entries point to records. Keys could be complex and they could consist from subkeys like: <reading><word><priority>.

Jardic Pro 4.3 supports following indexes:

  • Index of supplemental records with a key: <record_type>
  • Index of readings with a key: <reading><kanji_word>
  • Index of kanji with a key: <kanji_word><reading>
  • Index of words with a key: <word><weight>
  • Index of all words with a key: <word><weight>
  • Index of all kanji strings with a key: <kanji><weight>
Subkeys can have variable length. Indexes are implemented in a form of multilevel balanced trees. Jardic Pro indexes have a feature that allow fast search of keys using sequential numbers of key values. Jardic Pro uses this feature to display lists of words, to scroll lists of words, and to emulate virtual merged indexes for multiple dictionary files.

Low-level Organization of Dictionary File

Jardic Pro dictionary file consists from pages. Following types of pages are supported:

  • File Header page
  • Page Allocation Table pages
  • Data pages
  • Index pages
Pages are identified by numbers. Relationship between page number and physical placement of a page is defined using Page Allocation Table (PAT). Pages of one file have fixed length. Data pages and index pages could be compressed. Due to page compression positions of pages in the file are not easily calculated. Jardic Pro uses page numbers and PAT to convert virtual data pointers into the pair of values: position of compressed page, and data displacement on the page.

 

Data Management System

Jardic Pro Data Management System (DMS) performs functions that are typical to Index Management Systems.

Page Access

Jardic Pro DMS is implemented as a 2-tier system. Low-level tier supports an access to pages using specified page numbers. High-level tier supports external requests coming from application program.

Low-level tier supports following functions:

  • Calculation of page location
  • Compression and uncompression of pages
  • Page caching
  • Management of free space

DMS External functions

Jardic Pro DMS external functions support dictionary data search and data update. DMS functions could be divided into following groups:

  • Management
  • Search in indexes
  • Reading of data
  • Key insertion
  • Bulk key insertion
  • Update of keys and records
  • Key comparison
Following are the lists of functions for separate groups.

Management: Connect, Disconnect, CreateCursor, DestroyCursor, UseRecord, UseIndex, Compress, SetProgressProc and some other.

Search in indexes: FindEQ, FindGE, FindLE, FindLT, FindGT, FindFirst, FindLast, FindNext, FindPrevious, FindByCounter, FindByPercentage. Results of search functions are: record pointer (article pointer), and a sequence number of a found key value in the index.

Reading of data: GetRecord.

Key insertion: InsertRecord, InsertKey, BeginBulkInsertKey, EndBulkInsertKey, StopBulkInsertKey.

Key comparison: CompareKeys, GetCollator, GetCollatorVersion, CreateCollator0, CreateCollator1. Key comparison functions are exported from DMS to application program, because the application program should use the same string comparison functions as DMS.

Jardic Pro 4.3 does not support update functions. They are not currently implemented because there is no user demand, and due to the complexity of full-text search index update. Future version of Jardic Pro could have implementation of this set of function for some types of dictionaries.

Jardic Pro full text search indexes are stored in dictionary files. Depending on dictionary size such indexes could contain up to some millions of entries. To build large dictionary indexes Jardic Pro uses optimized bulk insert functions. Our bulk insert functions does not use preliminary data sorting (e.g. like in Microsoft SQL Server). That is due to the fact the time of preliminary data sorting could be comparable to the time of pure key insertion into the index. Instead of that, Jardic Pro uses optimization based on on-the-fly calculation of key distribution histogram. The histogram intervals are set to such a value that the volumes of key values hit into one interval are comparable to DMS memory cache size. Jardic Pro bulk insert functions support insertion of about 0.7 million keys per 1 minute for 1.6 GHz CPU. Such insertion includes index filling, index page compression, and writing index pages on HDD.

Sorting

DMS functions that work with indexes intensively use key comparison and key sorting functions. Sorting of keys with text strings is based on the internal implementation of Unicode Collation Algorithm (UCA) and Default Unicode Collation Element Table (DUCET). This implementation supports correct "dictionary" sorting of words with small and capital letters, diacritics and with internal punctuation (apostrophes, dashes and so on).

Comparison of text strings according to UCA is implemented in following steps:

  • Conversion of UTF-16 Unicode characters into array of Unicode Points
  • Creation of arrays of Collation Elements (CE)
  • Variable weights handling
  • Comparison of CE arrays
At the first step (conversion to Unicode Points) the program performs "tailoring": it handles Unicode surrogates, replaces Japanese repetition characters to proper kana and kanji characters, decomposes Hangul into Jamo and so on. The program does not perform tailoring for Western languages because it builds common word lists for all Western languages and "correct" collation for them is contradictory.

3-level Collation Elements (CE's) are built using Default Unicode Collation Element Table (DUCET).

For correct sorting of text string with blanks, dashes, apostrophes and other punctuation characters the program uses CE's with variable weights.

 

Dictionary Shell

Under the term "dictionary shell" we mean the program that interacts with a user. It is the dictionary shell that is accepted by the user as "Jardic Pro". Dictionary shell performs following base functions:

  • Loading of dictionaries and emulation of virtual merged lists of words
  • Displaying of multicolumn lists of words
  • Displaying of dictionary articles
  • Positioning of lists of words according to entered text
  • Finding word translation under the cursor on-the-fly
  • Import and conversion of dictionaries
Jardic Pro can work with multiple opened dictionaries simultaneously. A user can see merged lists for all dictionaries opened at the moment. Merged lists of words are virtual, i.e., they are not built in the memory or elsewhere, but they are emulated algorithmically.

All the following word lists displayed in Jardic Pro are virtual:

  • Lists of words by reading
  • Lists of words by kanji
  • Lists of usual (Western) words
  • Lists of all words for full text search
  • Lists of all kanji substrings for full text search
One of Jardic Pro features is its ability to display lists of words with multiple columns. For example, a list of words by reading contains 3 columns: Reading, Kanji, and Article (first characters). To display such lists the program issues DMS requests to search keys in indexes and to read articles. Due to effective DMS implementation, DMS performs such requests quickly enough so that manual scrolling of lists with millions of words runs without noticeable delay.

In one of its windows Jardic Pro displays dictionary article for the current list item. Jardic Pro can work with imported dictionaries containing articles in different format. To display articles of dictionaries with different format the program uses separate formatting functions (e.g., for EDICT, DSL, Eijiro and so on).

The main function of each electronic dictionary is searching for entered words. This function is implemented in Jardic Pro by performing a search in virtual lists that are common for all opened dictionaries.

Jardic Pro can search for translation of words under the cursor on-the-fly in Microsoft Word, Internet Explorer, HTML Help and some other programs. The access to MS Word objects is implemented using COM Automation, specifically through Running Object Table (ROT). An access to Internet Explorer and HTML objects is implemented using IHTMLDocument2 interface obtained through Microsoft Active Accessibility (MSAA).

When translating words on-the fly Jardic Pro analyzes a type of text under the cursor. If the text contains kanji, the program tries to find translation for words starting from current kanji under the cursor and with some next characters. If a text under the cursor contains kana words, the program converts word endings to "normal" form. Then the program repeats those steps for the next character to the left, then one more character to the left and so on. Search results (found word translations) are accumulated. Jardic Pro displays found word with maximum length, but it can show all intermediate results including translation of separate kanji characters.

If the text under the cursor does not contain kanji the program looks for word boundaries at the left and at the right position of the cursor. Search for word boundaries is implemented using Unicode default boundary search algorithm. Then the program searches for translation of the selected word limited by found boundaries.

Import of Dictionaries

When importing dictionaries Jardic Pro performs following steps:

  • Loads dictionary articles
  • Builds the index of words by reading
  • Builds the index of words by kanji
  • Builds the index of usual (Western) words
  • Builds the index of all words for full text search
  • Builds the index of all kanji substrings for full text search
Depending on the type of imported dictionary some of steps could be skipped. For example, the program does not build lists of kanji when importing Western dictionaries. Jardic Pro builds indexes of all words and all kanji by default, but for some dictionaries (e.g. for Chinese-Chinese dictionaries) the size of all kanji index can be very large (10-100 millions of entries). When some indexes could be huge, the program advices a user to refuse from them.

Jardic pro builds dictionary indexes using DMS bulk insert functions: BeginBulkInsertKey, InsertKey, EndBulkInsertKey. The program extracts words from articles for full text search indexes using Unicode default boundary search algorithm.

Jardic Pro is written in C++. Source code consists from more than 80,000 lines.

Copyright © Vitaly Zagrebelny, 2007-2013 E-mail: jardic@jardic.ru