Olly Betts | 2 May 2012 15:28

Re: Index Size comparison

On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
> I did a comparison based on similar steps as in the blog
> (zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter),
> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
> lucene 89M, xapian 189M (chert backend and compacted).
> Since I'm more interested in index size, I dug a little further to dump
> the full term list. There are about 360000 terms from lucene index, and
> about 285000 terms from xapian index.

What are the additional terms lucene has indexed?

> But surprisingly, the termlist.DB of xapian index is already 122M.

It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
usually much less - I guess if you are indexing tweets then that's a
lot of very small documents, and the front coding used in the termlist
entries works better for larger documents.

The termlist table stores the list of terms each document contains (and
if you are storing any document values, also the value slots used in
each document).
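
The front coding mentioned above can be illustrated with a minimal Python sketch. This shows the general technique only and is not Xapian's actual termlist format; the function names and the one-byte cost model are assumptions made for the example.

```python
import os


def front_code(sorted_terms):
    """Encode each term as (shared_prefix_len, suffix) relative to the previous term."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([prev, term]))
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the original term list from (shared_prefix_len, suffix) pairs."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


def encoded_size(encoded):
    # Toy cost model (an assumption): one byte for the shared length
    # plus the bytes of the suffix.
    return sum(1 + len(suffix.encode("utf-8")) for _, suffix in encoded)


terms = sorted(["search", "searched", "searching", "seat", "xapian"])
coded = front_code(terms)
raw_size = sum(len(t.encode("utf-8")) for t in terms)
coded_size = encoded_size(coded)
```

The saving comes entirely from neighbouring terms that share a prefix. A very short document has few terms, so adjacent entries in its sorted list are less likely to share much, which is consistent with the point that this coding pays off more for larger documents.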

This information allows Xapian to delete or update a document correctly,
and also allows query expansion.  My understanding is that Lucene
doesn't store this information, and handles deletion by adding the
document id to a "deleted" list, which has to be excluded from query
results; this also means the frequency statistics will tend to be
increasingly inaccurate as more documents are deleted or modified.
That's the trade-off in exchange for not having to store the termlist
data.

Marius Tibeica | 2 May 2012 15:32

Re: GSoC xapian node binding

Finished the design of the sync methods:  https://github.com/mtibeica/node-xapian/blob/master/docs.md

I will probably continue with the creation of a test framework and porting the tests from the Perl binding.

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
David Jeske | 2 May 2012 19:14

Re: Index Size comparison

On Wed, May 2, 2012 at 6:28 AM, Olly Betts <olly <at> survex.com> wrote:
> My understanding is that Lucene doesn't store [a list of all terms in
> each document], and handles deletion by adding the document id to a
> "deleted" list, which has to be excluded from query results;

Yes, though these entries get cleaned up during merge/optimize, so there isn't really a cumulative error like you implied. (i.e. whenever you scan over all terms it's easy to remove terms for items in the "deleted" list)
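
The "deleted" list and its cleanup at merge time can be sketched with a toy inverted index in Python. This is not Lucene's or Xapian's implementation; the class and method names are hypothetical and exist only to show how a frequency statistic can be stale between a delete and the next merge.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index that deletes via tombstones (hypothetical API)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.deleted = set()              # tombstoned document ids

    def add(self, doc_id, terms):
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        # No per-document term list is kept, so the postings cannot be
        # updated directly; the id is only recorded as deleted.
        self.deleted.add(doc_id)

    def doc_freq(self, term):
        # Stored statistic: still counts tombstoned documents.
        return len(self.postings.get(term, ()))

    def search(self, term):
        # Deleted documents are filtered out of the results.
        return sorted(self.postings.get(term, set()) - self.deleted)

    def merge(self):
        # Scanning every posting list makes it easy to drop deleted ids.
        for term in list(self.postings):
            self.postings[term] -= self.deleted
            if not self.postings[term]:
                del self.postings[term]
        self.deleted.clear()


idx = ToyIndex()
idx.add(1, ["xapian", "index"])
idx.add(2, ["lucene", "index"])
idx.delete(2)

hits_before = idx.search("index")  # results already exclude document 2
df_before = idx.doc_freq("index")  # statistic still includes document 2
idx.merge()
df_after = idx.doc_freq("index")   # statistic corrected by the merge
```

In this model the statistics are only off for documents deleted since the last merge, so the error is bounded by how often merges run rather than growing without limit.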

Liam | 2 May 2012 22:14

Re: GSoC xapian node binding

List Admin: this list really needs a reply-to header, to prevent accidental off-list replies!

On Wed, May 2, 2012 at 12:32 PM, Marius Tibeica <mtibeica <at> gmail.com> wrote:
On Wed, May 2, 2012 at 9:36 PM, Liam <xapian <at> networkimprov.net> wrote:


On Wed, May 2, 2012 at 6:32 AM, Marius Tibeica <mtibeica <at> gmail.com> wrote:
Finished the design of the sync methods:  https://github.com/mtibeica/node-xapian/blob/master/docs.md
I will probably continue with the creation of a test framework and porting the tests from the Perl binding.

Can you look for other places where we can combine multiple methods into a single one with an object argument, as with Query::Query? For instance Enquire::set_sort_*

Is it possible to set multiple sort types with Enquire? The method names seem to suggest otherwise to me.
We could do a set_sort with an array of objects like { by: 'relevance' }, { by: 'value', sort_key: uint32, reverse: bool }, and throw a "not yet supported" exception if a given succession of these objects is not supported (more than 2 elements, etc).

I think an Enquire parameters object could include collapse-key, docid-order, cutoff, value, and a relevance field which can be:
  0 or undefined - value ? set_sort_by_value : noop
  1 - value ? set_sort_by_relevance_then_value : set_sort_by_relevance
  2 - value ? set_sort_by_value_then_relevance : set_sort_by_relevance
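
The relevance/value table above can be written out as a small Python function, purely to check that every combination is covered. The function name and its keyword arguments are hypothetical; only the returned method names come from the existing Enquire API. One deliberate difference is flagged in the comments: the sketch tests for a missing value explicitly, because a plain truthiness test would treat value slot 0 as "no value".

```python
def choose_sort_method(value=None, relevance=0):
    """Return the Enquire method name implied by the proposed parameters.

    Returns None where the table says "noop". `value` is a value slot
    number or None; `relevance` is 0 (or None), 1 or 2.
    """
    # Assumption: slot 0 is a legitimate value slot, so compare against
    # None instead of relying on truthiness.
    has_value = value is not None

    if relevance in (0, None):
        return "set_sort_by_value" if has_value else None
    if relevance == 1:
        if has_value:
            return "set_sort_by_relevance_then_value"
        return "set_sort_by_relevance"
    if relevance == 2:
        if has_value:
            return "set_sort_by_value_then_relevance"
        return "set_sort_by_relevance"
    raise ValueError("relevance must be 0, 1 or 2, got %r" % (relevance,))
```

For example, `choose_sort_method(value=3, relevance=2)` selects `set_sort_by_value_then_relevance`, while `choose_sort_method()` selects nothing at all.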

Also for testing, we'd benefit from a simple HTTP-fronted Node app to which a user can post documents and submit queries. We could pull an interesting corpus into that, e.g. Wikipedia...
 
Sure, that sounds great, but for the code writing I think that specific unit tests with predictable answers are more useful to me. The HTTP-fronted Node app looks more like a great "getting started" app, which I'll add to my todo list.

Yes, you need unit tests ported from Perl, for sure. The Node app is to test the whole system, evaluate performance, etc

Liam
Marius Tibeica | 3 May 2012 11:16

Re: GSoC xapian node binding

Updated Enquire to have a single set method:
set_parameters_sync( object_parameters)
Set the parameters to be used for queries.

The object parameter can have one or more of the following:
  {
    collapse_key: { key: uint32, max: uint32=1 },
    docid_order: uint32,
    cutoff: { percent: int32, weight: number=0 },
    sort: [ sort_by_info_1, ... ]
  }

The sort_by_info object can be:
  RELEVANCE - sorting by relevance
  { key: string, reverse: bool } - sorting by value (with reverse)
  string_value_key - sorting by value

The valid sort arrays currently are:
  [ RELEVANCE ] - sort_by_relevance
  [ { key: string_value_key, reverse: bool } ] - sort_by_value
  [ string_value_key ] - sort_by_value
  [ { key: string_value_key, reverse: bool }, RELEVANCE ] - sort_by_value_then_relevance
  [ string_value_key, RELEVANCE ] - sort_by_value_then_relevance
  [ RELEVANCE, { key: string_value_key, reverse: bool } ] - sort_by_relevance_then_value
  [ RELEVANCE, string_value_key ] - sort_by_relevance_then_value

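
The list of valid sort arrays amounts to a small lookup on the kinds of the elements. The following Python sketch models that validation step only; it is not code from the binding, and `RELEVANCE`, `resolve_sort` and the helper are hypothetical names chosen for the example.

```python
RELEVANCE = object()  # hypothetical sentinel standing in for the RELEVANCE constant

# Which sequences of element kinds are accepted, and what they map to.
_VALID_SORTS = {
    ("relevance",): "sort_by_relevance",
    ("value",): "sort_by_value",
    ("value", "relevance"): "sort_by_value_then_relevance",
    ("relevance", "value"): "sort_by_relevance_then_value",
}


def _kind(item):
    """Classify one sort_by_info entry as 'relevance', 'value' or 'invalid'."""
    if item is RELEVANCE:
        return "relevance"
    if isinstance(item, str):
        return "value"  # bare string_value_key
    if isinstance(item, dict) and isinstance(item.get("key"), str):
        return "value"  # { key: ..., reverse: ... }
    return "invalid"


def resolve_sort(sort):
    """Return the sort mode for a sort array, or raise if it is unsupported."""
    kinds = tuple(_kind(item) for item in sort)
    if kinds not in _VALID_SORTS:
        raise ValueError("sort combination not supported: %r" % (sort,))
    return _VALID_SORTS[kinds]
```

Keeping the accepted combinations in one table means a new combination only needs one extra entry, and everything else is rejected with the same error.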

Marius Tibeica | 3 May 2012 14:42

Re: GSoC xapian node binding

I was searching for a good testing framework and vows looks really good ( http://vowsjs.org ).

If there are no other suggestions I will use it.
Liam | 3 May 2012 19:18

Re: GSoC xapian node binding

There has been discussion of testing frameworks on the nodejs list from time to time; see what Node folks have to say about vows and others before deciding...



Marius Tibeica | 4 May 2012 13:36

Re: GSoC xapian node binding

Well, the favorites are vows, nodeunit and mocha.

Vows seems to work the best for async method testing, even though it is a bit more complicated. 
I also found a pretty good blog post about it ( http://dev.estisia.com/2012/02/unit-testing-node-js-applications/ ).


James Aylett | 5 May 2012 15:47

List configuration (was re GSoC xapian node binding)

On 2 May 2012, at 21:14, Liam <xapian <at> networkimprov.net> wrote:

> List Admin: this list really needs a reply-to header, to prevent accidental off-list replies!

There are sufficient email clients that will use reply-to without prompting, causing accidental on-list
replies (potentially far more damaging and embarrassing!) that I'm opposed to doing this. That's
ignoring the 'traditional' arguments against reply-to for lists (which aren't terribly strong these days).

(Some email clients have dedicated commands and functionality for mailing lists. Sadly most don't,
leaving us in a least of multiple evils situation.)

J
Jaguar Xiong | 5 May 2012 17:16

Re: Index Size comparison

Here is an example of a term that differs: 'v122-8'. The whole string is
treated as a single term in the lucene index, while xapian seems to split
the string at '-' and store 'v122' as a term. So I would guess that
splitting at '-' leaves xapian with fewer terms. In my experiment there
are about 196000 documents with an average size of about 1.5k, for a
total of 287M.
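
The effect of splitting at '-' on the number of distinct terms can be shown with two toy tokenizers in Python. Neither is the real Lucene analyzer or Xapian term generator; both functions and the sample strings are assumptions made up for the illustration.

```python
import re


def whole_token_terms(text):
    """Split on whitespace only, so 'v122-8' stays a single term."""
    return text.lower().split()


def split_on_punctuation_terms(text):
    """Split on anything that is not a letter or digit, so 'v122-8' becomes two terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up version strings: three prefixes combined with three suffixes.
versions = ["v%d-%d" % (major, minor)
            for major in (121, 122, 123)
            for minor in (7, 8, 9)]

whole_vocab = set()
split_vocab = set()
for version in versions:
    whole_vocab.update(whole_token_terms(version))
    split_vocab.update(split_on_punctuation_terms(version))
```

Here the whole-token vocabulary holds every combination, while the split vocabulary holds only the distinct pieces, so the second set is smaller whenever the pieces are reused across strings.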

For reducing the size of the btree, front-coding of the string keys
(storing the common prefix once) seems a good idea. I'll see what I can do.
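
As a rough illustration of storing the common prefix once per block, here is a minimal Python sketch. It models a block as a plain list of string keys and is not the chert or brass block layout; the function names and the sample keys are hypothetical.

```python
import os


def compress_block(keys):
    """Split a block's keys into one shared prefix plus per-key suffixes."""
    prefix = os.path.commonprefix(keys)
    return prefix, [key[len(prefix):] for key in keys]


def expand_block(prefix, suffixes):
    """Rebuild the full keys from the shared prefix and the suffixes."""
    return [prefix + suffix for suffix in suffixes]


def stored_chars(prefix, suffixes):
    # Toy cost model (an assumption): count characters only and ignore
    # any per-key length or pointer overhead.
    return len(prefix) + sum(len(suffix) for suffix in suffixes)


block = ["tweet:0001", "tweet:0002", "tweet:0017"]
prefix, suffixes = compress_block(block)
raw_chars = sum(len(key) for key in block)
packed_chars = stored_chars(prefix, suffixes)
```

Because keys within a block are sorted, the first and last key bound the shared prefix, so the prefix can be found without comparing every pair of keys.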

Cheers,
Jaguar

On 2012/05/02 21:28, Olly Betts wrote:
> On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
>> I did a comparison based on similar steps as in the blog
>> (zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter),
>> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
>> lucene 89M, xapian 189M (chert backend and compacted).
>> Since I'm more interested in index size, I dug a little further to dump
>> the full term list. There are about 360000 terms from lucene index, and
>> about 285000 terms from xapian index.
> What are the additional terms lucene has indexed?
>
>> But surprisingly, the termlist.DB of xapian index is already 122M.
> It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
> usually much less - I guess if you are indexing tweets then that's a
> lot of very small documents, and the front coding used in the termlist
> entries works better for larger documents.
>
> The termlist table stores the list of terms each document contains (and
> if you are storing any document values, also the value slots used in
> each document).
>
> This information allows Xapian to delete or update a document correctly,
> and also allows query expansion.  My understanding is that Lucene
> doesn't store this information, and handles deletion by adding the
> document id to a "deleted" list, which has to be excluded from query
> results; this also means the frequency statistics will tend to be
> increasingly inaccurate as more documents are deleted or modified.
> That's the trade-off in exchange for not having to store the termlist
> data.
>
> Xapian doesn't currently support a "deleted" list, but if you don't
> want to be able to delete or modify documents, you can just delete
> this table from your database ("rm termlist.*") and pretty much
> everything else will continue to work.  The other things which rely
> on the termlist table are listed in the ticket for this issue:
>
> http://trac.xapian.org/ticket/181
>
> If you delete the termlist, then it looks like Xapian would be ~67M vs
> Lucene's 89M.
>
>> Is there some idea/plan on reducing the index size? I'll be glad if I
>> could help.
> Brass should be a little smaller than chert, but it's not going to be
> dramatic.
>
> There are a few ideas we have to reduce the size - if you're wanting to
> help work on this, here are a couple:
>
> * Posting list encodings could be more compact (probably in exchange for
>    being more expensive to update, so supporting several encodings and
>    picking the appropriate one via heuristics and/or user hints would
>    probably be best):
>
>    http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:Postinglistencodingimprovements
>
> * The Btree keys are currently stored in full each time, but within
>    almost all blocks, the keys will share a common prefix, so it would
>    reduce the space used and allow us to fit more in a block if we just
>    stored that prefix once.  This would help tables with a lot of small
>    entries especially (like the position table).
>
> Cheers,
>      Olly
>
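
For readers unfamiliar with posting list encodings, the quoted idea builds on schemes like the one sketched below in Python: store the gaps between sorted document ids and write each gap as a variable-length integer. This is a generic textbook scheme and not a description of the chert or brass on-disk format; all function names are assumptions.

```python
def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_postings(doc_ids):
    """Delta-encode a strictly increasing list of document ids."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings."""
    doc_ids = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            doc_ids.append(prev)
            value = 0
            shift = 0
    return doc_ids


doc_ids = [3, 7, 8, 300, 100000]
packed = encode_postings(doc_ids)
```

The trade-off mentioned in the quoted message shows up even here: inserting or removing an id in the middle changes the following gap, so a denser encoding tends to make in-place updates more expensive.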
