Issues to watch out when making changes to Lucene's gap encoding scheme
Thang Luong Minh <luong.m.thang <at> gmail.com>
2007-04-01 06:27:49 GMT
Dear Lucene developers,
As part of my school project, I've proposed to implement a gap encoding
scheme, and try to replace it with the current scheme in Lucene which is the
byte-aligned scheme. The main purpose of making changes to Lucene source is
just for learning experience in dealing with open source code, conforming
with the current system, and other design issues.
The gap encoding I'm implementing is the Fixed Binary Coding scheme in which
each gap in a group are coded using the same number of bits. That scheme is
described in the paper here http://crpit.com/confpapers/CRPITV27Anh.pdf. The
advantage of the scheme is extremely fast decoding speed. In order to
efficiently index, we need to decompose a list of gaps into groups so that
the total number of bits used is minimized.
Ex: list of gaps is broken into 4 group (38, 17, 13, 34) (6) (4, 1, 3, 1)
(2, 3, 1)
A list of 4 selectors is 9, 1, 6, 9, each of which corresponds to one group.
By looking at the selector we could know how many bits used to code each gap
in the group associated with that selector, and how many gaps in that group.
Having described abt the new coding scheme, below are changes I have made to
Lucene source code:
+ I touched mostly on classes which involve proximity streams (.prx). In
DocumentWriter.writePostings(), and SegmentMerger.appendPostings(), instead
of using writeVInt for each gap, I go through the whole gap list, decompose
into groups, encode the whole list as described in the scheme, and
writeByte() to the .prx stream
+ Handle decoding gap by gap in SegmentTermPositions
+ By the nature of the scheme, I need to know the selector list in order to
decode the gaps, so I need another sel stream to store selectors. As gaps