Re: Index Size comparison
Jaguar Xiong <xiong.jaguar <at> gmail.com>
2012-05-05 15:16:28 GMT
Here is an example of a term that differs: 'v122-8'. Lucene treats the
whole string as a single term in its index, while Xapian seems to split
the string at '-' and store 'v122' as a term. So my guess is that
splitting at '-' leaves Xapian with fewer terms. In my experiment there
are about 196000 documents, averaging about 1.5k each, for a total of 287M.
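To make the guess about '-' concrete, here is a rough sketch in Python. Both tokenizers are hypothetical stand-ins written for this illustration, not the actual Lucene or Xapian analyzers, and the sample strings are made up. It only shows why splitting at '-' can shrink the vocabulary when the same prefixes and suffixes recur in many combinations.

```python
import re

def whole_token_terms(text):
    # Hypothetical tokenizer: split on whitespace only,
    # so 'v122-8' stays a single term.
    return text.lower().split()

def split_on_punct_terms(text):
    # Hypothetical tokenizer: every run of letters/digits is a term,
    # so 'v122-8' becomes 'v122' and '8'.
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up sample data: two prefixes combined with three suffixes.
docs = ["v122-7", "v122-8", "v122-9", "v123-7", "v123-8", "v123-9"]

vocab_whole = {t for d in docs for t in whole_token_terms(d)}
vocab_split = {t for d in docs for t in split_on_punct_terms(d)}

print(len(vocab_whole), "distinct terms without splitting")
print(len(vocab_split), "distinct terms when splitting at '-'")
```

Without splitting, the number of distinct terms grows with the number of prefix/suffix combinations; with splitting it grows only with the number of distinct pieces. Whether that accounts for the whole gap between the two term counts is still only a guess.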
For reducing the size of the btree, front-coding of the string keys
(storing the common prefix once) seems like a good idea. I'll see what I
can do.
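As a minimal sketch of what "store the common prefix once" could look like for the keys in one block (plain Python, with made-up keys and hypothetical function names; this is not Xapian's on-disk format, just the idea):

```python
import os

def compress_block(keys):
    # Store the prefix shared by every key in the block once,
    # followed by each key's remaining suffix.
    prefix = os.path.commonprefix(keys)
    return prefix, [key[len(prefix):] for key in keys]

def expand_block(prefix, suffixes):
    # Rebuild the full keys from the shared prefix and the suffixes.
    return [prefix + suffix for suffix in suffixes]

# Made-up keys standing in for the sorted keys of one btree block.
block_keys = ["v122", "v122-8", "v122-9", "v123"]

prefix, suffixes = compress_block(block_keys)
raw_size = sum(len(k) for k in block_keys)
packed_size = len(prefix) + sum(len(s) for s in suffixes)

print("prefix:", prefix, "suffixes:", suffixes)
print("key bytes:", raw_size, "->", packed_size)
```

The sizes here ignore any per-key length bytes, so the real saving would depend on how the block layout records the prefix and suffix lengths.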
On 2012/05/02 21:28, Olly Betts wrote:
> On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
>> I did a comparison based on similar steps as in the blog
>> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
>> lucene 89M, xapian 189M (chert backend and compacted).
>> Since I'm more interested in index size, I dug a little further and dumped
>> the full term list. There are about 360000 terms in the lucene index, and
>> about 285000 terms in the xapian index.
> What are the additional terms lucene has indexed?
>> But surprisingly, the termlist.DB of xapian index is already 122M.
> It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
> usually much less - I guess if you are indexing tweets then that's a
> lot of very small documents, and the front coding used in the termlist
> entries works better for larger documents.
> The termlist table stores the list of terms each document contains (and
> if you are storing any document values, also the value slots used in
> each document).
> This information allows Xapian to delete or update a document correctly,
> and also allows query expansion. My understanding is that Lucene
> doesn't store this information, and handles deletion by adding the
> document id to a "deleted" list, which has to be excluded from query
> results; this also means the frequency statistics will tend to be
> increasingly inaccurate as more documents are deleted or modified.
> That's the trade-off in exchange for not having to store the termlist
> Xapian doesn't currently support a "deleted" list, but if you don't
> want to be able to delete or modify documents, you can just delete
> this table from your database ("rm termlist.*") and pretty much
> everything else will continue to work. The other things which rely
> on the termlist table are listed in the ticket for this issue:
> If you delete the termlist, then it looks like Xapian would be ~67M vs
> Lucene's 89M.
>> Is there some idea/plan for reducing the index size? I'd be glad if I could
> Brass should be a little smaller than chert, but it's not going to be
> There are a few ideas we have to reduce the size - if you're wanting to
> help work on this, here are a couple:
> * Posting list encodings could be more compact (probably in exchange for
> being more expensive to update, so supporting several encodings and
> picking the appropriate one via heuristics and/or user hints would
> probably be best):
> * The Btree keys are currently stored in full each time, but within
> almost all blocks, the keys will share a common prefix, so it would
> reduce the space used and allow us to fit more in a block if we just
> stored that prefix once. This would help tables with a lot of small
> entries especially (like the position table).
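On the posting list point, the trade-off between compactness and update cost can be illustrated with a generic gap-plus-variable-byte encoding. This is a textbook scheme sketched in Python under my own assumptions (strictly increasing document ids, hypothetical function names); it is not meant to describe what chert or brass actually store.

```python
def encode_postings(doc_ids):
    # Assumes doc_ids is strictly increasing. Each id is stored as the gap
    # from the previous one, 7 bits per byte, high bit set on all but the
    # last byte of each gap.
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    # Reverse of encode_postings: rebuild each gap, then sum the gaps.
    doc_ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids

# Made-up posting list for one term.
postings = [3, 7, 300, 301, 100000]
packed = encode_postings(postings)
print(len(packed), "bytes for", len(postings), "postings")
```

Because every entry is relative to the one before it, inserting or removing a document id in the middle means re-encoding at least the following gap, which is the kind of update cost mentioned above.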