Richard Boulton | 2 Jan 15:08

Performance tests

Just a note to say that I'm working on a performance test framework for 
Xapian, mainly targeted at analysing query speeds.  I'm hoping to make 
this reasonably easy to set up, so that we can get performance test 
results on lots of architectures.

For sample data, I'm using an XML dump of Wikipedia - I've got a simple 
Python script which converts this into a scriptindex input file (though 
it doesn't yet do anything about understanding the wiki mark-up).  This 
results in a 25GB database (containing 2657375 documents, average length 
492 terms), which should give a good basis for benchmarking searches in 
an IO-bound situation.  I should probably also use some smaller corpora 
to cover the CPU/memory-IO-bound situation.
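
As a rough illustration of the conversion step described above (this is not the actual script, which has not been published yet), a minimal Python sketch might stream the `<page>` elements out of the dump and write one scriptindex record per page. The field names `title` and `text` are assumptions and would need to match whatever index script is used. The sketch assumes the scriptindex input format of `name=value` lines, with continuation lines starting with `=` and a blank line between records. Wiki mark-up is passed through untouched.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def format_field(name, value):
    """Format one field; continuation lines of a multi-line value start with '='."""
    lines = value.splitlines() or [""]
    return "\n".join([f"{name}={lines[0]}"] + ["=" + line for line in lines[1:]])


def dump_to_scriptindex(source, out):
    """Write one record per <page> element in `source`; return the page count.

    `source` is a filename or binary file object, `out` a text file object.
    The field names 'title' and 'text' are hypothetical.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        out.write(format_field("title", title) + "\n")
        # The extra newline leaves a blank line, which ends the record.
        out.write(format_field("text", text) + "\n\n")
        elem.clear()  # drop this page's children once they have been written
        count += 1
    return count
```

It could be called as `dump_to_scriptindex("dump.xml", open("pages.dump", "w", encoding="utf-8"))`, where both file names are placeholders.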

I've filed a bug in the bug tracker to track progress on this (#107 - 
http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=107) - I'm hoping 
to eventually get this running on at least one machine on a regular 
basis, so that we can track how revisions to the code affect 
performance.  In particular, I want this in place before I work on bug #100.

At present, I'm just posting this here to let people know that I have 
some code which parses Wikipedia XML dumps, so they don't waste time 
writing their own - ask me instead.  I'll publish the code when it's all 
tidied up.

-- 
Richard

Oleg V Obolenskiy | 9 Jan 16:19

non-snowball stemmer

Hi!

I am going to use a non-Snowball Russian stemmer with Xapian. There is a 
good one at http://www.aot.ru. I've found that the current implementation 
of Xapian::Stem does not allow this (there is no public interface for 
Xapian::Stem::Internal). Do you accept patches? Are there any 
recommendations for writing patches?

Regards,

Oleg Obolenskiy
highpower <at> mail.ru

Richard Boulton | 9 Jan 16:30

Re: non-snowball stemmer

Oleg V Obolenskiy wrote:
> Hi!
> 
> I am going to use a non-Snowball Russian stemmer with Xapian. There is a 
> good one at http://www.aot.ru. I've found that the current implementation 
> of Xapian::Stem does not allow this (there is no public interface for 
> Xapian::Stem::Internal). Do you accept patches? Are there any 
> recommendations for writing patches?

Yes, contributions are welcome.  See the "HACKING" file in the 
xapian-core sources for details of how to submit patches.

Your patch should be against the current Xapian Subversion HEAD, if 
possible.  Also, note that Xapian's Snowball stemmers are quite out of 
date - in particular, they don't use the UTF-8 encoding.  The stemmers 
are due to be updated for the next release (i.e., the 1.0 release - they 
couldn't be updated for the 0.9.x series, because this would have broken 
existing databases).

For reference, the relevant section of the HACKING file is as follows:

Submitting Patches:
===================

If you have a patch to fix a problem in Xapian, or to add a new feature,
please send it to us for inclusion.  Any major changes should be discussed
on the xapian-devel mailing list first:
<http://www.xapian.org/lists.php>

We find patches in unified diff format easiest to read.  If you're using a
[...]

Richard Boulton | 9 Jan 16:41

Re: non-snowball stemmer

Oleg V Obolenskiy wrote:
> I am going to use a non-Snowball Russian stemmer with Xapian. There is a 
> good one at http://www.aot.ru. I've found that the current implementation 
> of Xapian::Stem does not allow this (there is no public interface for 
> Xapian::Stem::Internal). Do you accept patches? Are there any 
> recommendations for writing patches?

Purely for my own interest, do you have any particular reason for 
wanting to use this Russian stemmer instead of the snowball one?  Do you 
have some evaluations suggesting that it works better, or some other 
reason for liking it?  Can you see any obvious way in which the snowball 
stemmer could be improved?

Also, if the stemmer is available under a GPL compatible license, it 
would be nice to work out instructions on how to download it and use it. 
Perhaps you could add information to the Xapian wiki somewhere? 
Unfortunately, I don't read Russian, so couldn't get much useful 
information from the page you linked to.

-- 
Richard

Olly Betts | 12 Jan 19:32

Re: [Xapian-commits] 7603: trunk/xapian-core/ trunk/xapian-core/backends/flint/ trunk/xapian-core/backends/quartz/

On Tue, Jan 02, 2007 at 03:55:59PM +0000, richard wrote:
> * backends/quartz/btree.cc,backends/flint/flint_io.h: Patches from
>   Charlie Hull to allow 2GB+ index files work when compiled using
>   Visual C++.

I suspect that xapian-compact.cc (and quartzcompact.cc if you can be
bothered) will also need fixing since they use off_t.  You need to make
sure that the stat() function called works for 2GB+ files.  Without this,
compression statistics don't get reported for such files.

Cheers,
    Olly

Olly Betts | 13 Jan 03:31

Re: non-snowball stemmer

On Tue, Jan 09, 2007 at 03:30:48PM +0000, Richard Boulton wrote:
> Your patch should be against the current Xapian Subversion HEAD, if 
> possible.  Also, note that Xapian's Snowball stemmers are quite out of 
> date - in particular, they don't use the UTF-8 encoding.  The stemmers 
> are due to be updated for the next release (i.e., the 1.0 release - they 
> couldn't be updated for the 0.9.x series, because this would have broken 
> existing databases).

I've mostly finished updating Xapian to use UTF-8 versions of the
Snowball stemmers, and have rewritten Xapian::Stem::Internal
completely, so current SVN HEAD isn't a good place to start for
patching this area right now.

The new design actually makes it very easy to hook in stemming
algorithms which aren't based on Snowball.

> Also, if the stemmer is available under a GPL compatible license, it 
> would be nice to work out instructions on how to download it and use it. 

It's LGPL.

> Perhaps you could add information to the Xapian wiki somewhere? 
> Unfortunately, I don't read Russian, so couldn't get much useful 
> information from the page you linked to.

Google translate does a reasonable job on the site:

http://translate.google.com/translate?u=http%3A%2F%2Fwww.aot.ru%2F&langpair=ru%7Cen

Cheers,

Charlie Hull | 29 Jan 18:31

Re: Re: [Xapian-commits] 7603: trunk/xapian-core/ trunk/xapian-core/backends/flint/ trunk/xapian-core/backends/quartz/

Olly Betts wrote:
> On Tue, Jan 02, 2007 at 03:55:59PM +0000, richard wrote:
>> * backends/quartz/btree.cc,backends/flint/flint_io.h: Patches from
>>   Charlie Hull to allow 2GB+ index files work when compiled using
>>   Visual C++.
> 
> I suspect that xapian-compact.cc (and quartzcompact.cc if you can be
> bothered) will also need fixing since they use off_t.  You need to make
> sure that the stat() function called works for 2GB+ files.  Without this,
> compression statistics don't get reported for such files.
> 
> Cheers,
>     Olly
> 

I've produced patches for xapian-compact.cc, quartzcompact.cc and 
quartzcheck.cc (enclosed). Unfortunately I don't have any way of quickly 
testing them here but we should soon have some kind of Xapian test 
framework on Windows, complete with huge files, so I'll do it then.

Charlie

Index: quartzcheck.cc
===================================================================
--- quartzcheck.cc	(revision 7604)
+++ quartzcheck.cc	(working copy)
@@ -56,6 +56,21 @@

 static vector<Xapian::termcount> doclens;
[...]

Grzegorz Bogdanowicz | 30 Jan 10:41

Re: Re: [Xapian-commits] 7603: trunk/xapian-core/ trunk/xapian-core/backends/flint/ trunk/xapian-core/backends/quartz/

hi,

I'm using Xapian on Windows with large files. My index directory is about 65 607 466 693 bytes:

2007-01-30  09:28                17 position_baseA
2007-01-30  10:06                17 position_baseB
2007-01-23  14:18                 0 position_DB
2007-01-30  10:06           360 496 postlist_baseB
2007-01-30  10:06    23 623 852 032 postlist_DB
2007-01-30  10:06            88 432 record_baseB
2007-01-30  10:10     5 825 560 576 record_DB
2007-01-30  10:06           550 103 termlist_baseB
2007-01-30  10:10    36 157 054 976 termlist_DB
2007-01-30  09:28                17 value_baseA
2007-01-30  10:06                17 value_baseB
2007-01-23  14:18                 0 value_DB
              14 file(s)     65 607 466 693 bytes

Everything works fine.
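
For reference, a signed 32-bit file offset can hold at most 2^31 - 1 bytes, which is the 2GB limit discussed in this thread. The following is a minimal Python sketch (not part of the original message; `parse_listing` is a hypothetical helper) that reads the sizes from the listing above and reports which files exceed that limit. It only does the arithmetic and makes no claim about how any particular stat() implementation behaves.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit file offset can hold

# Lines copied from the directory listing above: date, time, size, name.
LISTING = """\
2007-01-30  09:28                17 position_baseA
2007-01-30  10:06                17 position_baseB
2007-01-23  14:18                 0 position_DB
2007-01-30  10:06           360 496 postlist_baseB
2007-01-30  10:06    23 623 852 032 postlist_DB
2007-01-30  10:06            88 432 record_baseB
2007-01-30  10:10     5 825 560 576 record_DB
2007-01-30  10:06           550 103 termlist_baseB
2007-01-30  10:10    36 157 054 976 termlist_DB
2007-01-30  09:28                17 value_baseA
2007-01-30  10:06                17 value_baseB
2007-01-23  14:18                 0 value_DB
"""


def parse_listing(text):
    """Return {file name: size in bytes}; sizes use spaces as digit separators."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # not a "date time size name" line
        _date, _time, *digit_groups, name = parts
        sizes[name] = int("".join(digit_groups))
    return sizes


sizes = parse_listing(LISTING)
over_limit = sorted(name for name, size in sizes.items() if size > INT32_MAX)
print(over_limit)
```

With these figures, postlist_DB, record_DB and termlist_DB are all above the limit, so this database does exercise the 2GB+ code paths.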

Regards,
        Grzegorz

On 29-01-2007 at 18:31 Charlie Hull wrote:
> Olly Betts wrote:
> > On Tue, Jan 02, 2007 at 03:55:59PM +0000, richard wrote:
> >> * backends/quartz/btree.cc,backends/flint/flint_io.h: Patches from
> >>   Charlie Hull to allow 2GB+ index files work when compiled using
> >>   Visual C++.
> > 

Charlie Hull | 30 Jan 10:54

Re: Re: [Xapian-commits] 7603: trunk/xapian-core/ trunk/xapian-core/backends/flint/ trunk/xapian-core/backends/quartz/

Grzegorz Bogdanowicz wrote:
> hi,
> 
> I'm using Xapian on Windows with large files. My index directory is about 65 607 466 693 bytes:
> 
> 2007-01-30  09:28                17 position_baseA
> 2007-01-30  10:06                17 position_baseB
> 2007-01-23  14:18                 0 position_DB
> 2007-01-30  10:06           360 496 postlist_baseB
> 2007-01-30  10:06    23 623 852 032 postlist_DB
> 2007-01-30  10:06            88 432 record_baseB
> 2007-01-30  10:10     5 825 560 576 record_DB
> 2007-01-30  10:06           550 103 termlist_baseB
> 2007-01-30  10:10    36 157 054 976 termlist_DB
> 2007-01-30  09:28                17 value_baseA
> 2007-01-30  10:06                17 value_baseB
> 2007-01-23  14:18                 0 value_DB
>               14 file(s)     65 607 466 693 bytes
> 
> Everything works fine.
> 
> Regards,
>         Grzegorz
> 
Hi Grzegorz,

Are you able to test quartzcheck, quartzcompact or xapian-compact with 
your large datasets? This will depend on which database type you're 
using, but any results would be great.


Grzegorz Bogdanowicz | 31 Jan 10:09

Re: Re: [Xapian-commits] 7603: trunk/xapian-core/ trunk/xapian-core/backends/flint/ trunk/xapian-core/backends/quartz/

hi,

I'm using a quartz database, so I'm able to test quartzcheck and quartzcompact.

Does the test need to be run with any special options?
By the way, approximately how long will the compaction take?

Regards,
        Grzegorz

On 30-01-2007 at 10:54 Charlie Hull wrote:
> Grzegorz Bogdanowicz wrote:
> > hi,
> > 
> > I'm using Xapian on Windows with large files. My index directory is about 65 607 466 693 bytes:
> > 
> > 2007-01-30  09:28                17 position_baseA
> > 2007-01-30  10:06                17 position_baseB
> > 2007-01-23  14:18                 0 position_DB
> > 2007-01-30  10:06           360 496 postlist_baseB
> > 2007-01-30  10:06    23 623 852 032 postlist_DB
> > 2007-01-30  10:06            88 432 record_baseB
> > 2007-01-30  10:10     5 825 560 576 record_DB
> > 2007-01-30  10:06           550 103 termlist_baseB
> > 2007-01-30  10:10    36 157 054 976 termlist_DB
> > 2007-01-30  09:28                17 value_baseA
> > 2007-01-30  10:06                17 value_baseB
> > 2007-01-23  14:18                 0 value_DB
> >               14 file(s)     65 607 466 693 bytes

