1 Aug 11:13
1 Aug 11:48
FLAG_WILDCARD, add_database and performance
Oliver Flimm <flimm <at> ub.uni-koeln.de>
2008-08-01 09:48:57 GMT
Hi,

I recently started to combine several (around 140) separate databases
for a single search request with add_database. I use the xapian perl
bindings. Additionally I use a match decider to implement facets.

Everything works fine unless I use a wildcard in my search, e.g.
'java program*'. The enquire object looks like this:

my $enq = $dbh->enquire($qp->parse_query($querystring,
    Search::Xapian::FLAG_WILDCARD|Search::Xapian::FLAG_LOVEHATE|Search::Xapian::FLAG_BOOLEAN));

Using a wildcard in a sequential search results in search times around
0.00x to 0.x seconds for each database, but the same search request
using a combined database handle takes around 200 seconds...

You can test it on our public test system in the simple search form:

http://kug5.ub.uni-koeln.de/portal/opac?view=kug

Is there a way to improve request times for the combined search using
wildcards?

Regards,

Oliver

-- 
Universitaet zu Koeln :: Universitaets- und Stadtbibliothek
IT-Dienste :: Abteilung Universitaetsgesamtkatalog
Universitaetsstr. 33 :: D-50931 Koeln
4 Aug 03:01
Re: FLAG_WILDCARD, add_database and performance
Olly Betts <olly <at> survex.com>
2008-08-04 01:01:51 GMT
On Fri, Aug 01, 2008 at 11:48:57AM +0200, Oliver Flimm wrote:
> I recently started to combine several (around 140) separate databases
> for a single search request with add_database. I use the xapian perl
> bindings. Additionally I use a match decider to implement facets.

Xapian version? Platform?

> Using a wildcard in a sequential search results in search times around
> 0.00x to 0.x seconds for each database, but the same search request
> using a combined database handle takes around 200 seconds...

A more comparable test would be against the 140 databases merged into
one.

But it sounds like something is O(n*n) in the number of databases - that
shouldn't be necessary that I can see.

If it's easy to test, see if 100 databases takes about 100 seconds, and
70 about 50 seconds.

> Is there a way to improve request times for the combined search using
> wildcards?

Could you profile to find where the time is spent? Some tips are here:

http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly
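
The 100-second and 50-second figures follow from assuming that search time grows with the square of the number of databases, scaled from the reported "around 200 seconds" at "around 140" databases. A minimal Python sketch of that arithmetic; the function name and the default reference point are assumptions for illustration, not anything provided by Xapian:

```python
def predicted_seconds(n_databases, exponent=2.0,
                      ref_databases=140, ref_seconds=200.0):
    """Scale a reference timing by (n / ref_n) ** exponent.

    The defaults (140 databases, 200 seconds) are the approximate
    figures reported in the thread; exponent=2 models O(n*n) behaviour.
    """
    return ref_seconds * (n_databases / ref_databases) ** exponent


if __name__ == "__main__":
    for n in (140, 100, 70):
        print(n, "databases ->", round(predicted_seconds(n)), "seconds")
```

Under quadratic scaling this gives roughly 102 seconds for 100 databases and 50 seconds for 70. If the cost were merely linear (exponent=1.0), 70 databases would still take about 100 seconds, which is why the two measurements make a useful test.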
4 Aug 08:57
Re: FLAG_WILDCARD, add_database and performance
Oliver Flimm <flimm <at> ub.uni-koeln.de>
2008-08-04 06:57:30 GMT
Hi,

On Mon, Aug 04, 2008 at 02:01:51AM +0100, Olly Betts wrote:
> On Fri, Aug 01, 2008 at 11:48:57AM +0200, Oliver Flimm wrote:
> > I recently started to combine several (around 140) separate databases
> > for a single search request with add_database. I use the xapian perl
> > bindings. Additionally I use a match decider to implement facets.
>
> Xapian version? Platform?

I tried both Xapian 1.0.5 and 1.0.7 on a 64-bit Debian Linux system
(stable/etch). The Debian platform is AMD64 (although we're running on
Intel hardware).

[...]
> A more comparable test would be against the 140 databases merged into
> one.
>
> But it sounds like something is O(n*n) in the number of databases - that
> shouldn't be necessary that I can see.
>
> If it's easy to test, see if 100 databases takes about 100 seconds, and
> 70 about 50 seconds.
[...]
> Could you profile to find where the time is spent? Some tips are here:
>
> http://trac.xapian.org/wiki/ProfilingXapian
[..]

Thanks for the hints. I'll try them.
4 Aug 09:45
Re: FLAG_WILDCARD, add_database and performance
Oliver Flimm <flimm <at> ub.uni-koeln.de>
2008-08-04 07:45:47 GMT
Hi,

On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > Could you profile to find where the time is spent? Some tips are here:
> >
> > http://trac.xapian.org/wiki/ProfilingXapian

It looks like some routines in libc get called a lot when using a
wildcard search.

Here are the first lines of output for the search request 'java', which
took around 3.5 seconds:

CPU: Core 2, speed 2666.68 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
62107    75.5606  libperl.so.5.8.8         libperl.so.5.8.8         (no symbols)
  62107    100.000  libperl.so.5.8.8         libperl.so.5.8.8         (no symbols) [self]
-------------------------------------------------------------------------------
11012    13.3974  no-vmlinux               no-vmlinux               (no symbols)
  11012    100.000  no-vmlinux               no-vmlinux               (no symbols) [self]
-------------------------------------------------------------------------------
5883      7.1574  libc-2.3.6.so            libc-2.3.6.so
4 Aug 10:16
Can xapian do geo search/calculations?
Joe Noon <joenoon <at> gmail.com>
2008-08-04 08:16:04 GMT
I'm currently using sphinx, which has a nice geo searching interface:

If given a query and an origin (latitude, longitude, both in radians),
and told which fields correspond to latitude/longitude in the index, I
can get back results ordered by distance from that point (and get the
actual distance as well).

I would be interested in exploring xapian further, but I'm not sure if
I can get this functionality or not. I've been told it may be possible
with a value range processor? Is this something anyone else has
already explored? Or know if it would even be possible?

Thanks,
Joe
4 Aug 10:32
Re: FLAG_WILDCARD, add_database and performance
Olly Betts <olly <at> survex.com>
2008-08-04 08:32:37 GMT
On Mon, Aug 04, 2008 at 09:45:47AM +0200, Oliver Flimm wrote:
> On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > > Could you profile to find where the time is spent? Some tips are here:
> > >
> > > http://trac.xapian.org/wiki/ProfilingXapian
>
> It looks like some routines in libc get called a lot when using a
> wildcard search.

Hmm, yes. Can you install the package libc6-dbg and repeat? That
should then give us actual function names in libc.

> 141 databases - 206 seconds
> 104 databases - 86 seconds
> 69 databases - 16 seconds
> 41 databases - 4 seconds
> 8 databases - 0.29 seconds

Yeah, that's looking like there's something O(n*n) or worse.

Cheers,
    Olly
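
One way to check the "O(n*n) or worse" reading of the five measurements quoted above is to fit a straight line to log(seconds) against log(databases); the slope estimates the exponent. The following is a rough Python sketch, not part of the original exchange. The function name is hypothetical, and five data points only support a coarse estimate:

```python
import math

# (number of databases, seconds) as quoted in the message above
timings = [(8, 0.29), (41, 4.0), (69, 16.0), (104, 86.0), (141, 206.0)]


def fit_exponent(points):
    """Least-squares slope of log(seconds) versus log(databases).

    If time ~ c * n**k, then log(time) = log(c) + k * log(n),
    so the fitted slope is an estimate of k.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    print(f"estimated exponent: {fit_exponent(timings):.2f}")
```

For these timings the fitted exponent comes out somewhat above 2, which is consistent with quadratic or slightly worse growth in the number of databases.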
4 Aug 10:45
Re: Can xapian do geo search/calculations?
Olly Betts <olly <at> survex.com>
2008-08-04 08:45:49 GMT
On Mon, Aug 04, 2008 at 01:16:04AM -0700, Joe Noon wrote:
> I would be interested in exploring xapian further, but I'm not sure if
> I can get this functionality or not. I've been told it may be possible
> with a value range processor? Is this something anyone else has
> already explored? Or know if it would even be possible?

You could filter with two value ranges to get results restricted to a
rectangle in the coordinate system (which is in general not quite a
rectangle on the ground).

To actually get a circle, you'd want to use a MatchDecider subclass
to select points based on coordinates stored in one or two document
values. It would probably be useful to ship a standard solution, so if
you implement something please consider sending a patch, even if it
needs a bit of work for general use.

There's also some code Richard has been working on in the geospatial
branch in SVN, but he's better placed to talk about that:

http://trac.xapian.org/browser/branches/geospatial

Cheers,
    Olly
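
The two-step idea described above (a coordinate rectangle as a cheap filter, then an exact circle test) can be sketched in plain Python without any Xapian calls. Here the rectangle test stands in for the two value ranges and the distance test stands in for what a MatchDecider subclass would do. All names are hypothetical, the Earth is treated as a sphere of radius 6371 km, and the sketch assumes the search circle does not touch a pole or cross the 180-degree meridian:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; all angles in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Latitude/longitude rectangle (radians) enclosing the search circle.

    Not valid near the poles (math.asin raises ValueError there) and it
    ignores wrap-around at the 180-degree meridian.
    """
    r = radius_km / EARTH_RADIUS_KM            # angular radius
    dlon = math.asin(math.sin(r) / math.cos(lat))
    return lat - r, lat + r, lon - dlon, lon + dlon


def search_within(points, lat, lon, radius_km):
    """Return (distance_km, name) pairs inside the circle, nearest first.

    points is an iterable of (name, lat, lon) tuples in radians.
    """
    lat_min, lat_max, lon_min, lon_max = bounding_box(lat, lon, radius_km)
    hits = []
    for name, plat, plon in points:
        # Step 1: cheap rectangle filter (the role of the value ranges).
        if not (lat_min <= plat <= lat_max and lon_min <= plon <= lon_max):
            continue
        # Step 2: exact circle test (the role of the match decider).
        distance = haversine_km(lat, lon, plat, plon)
        if distance <= radius_km:
            hits.append((distance, name))
    return sorted(hits)


if __name__ == "__main__":
    rad = math.radians
    points = [
        ("Cologne", rad(50.9375), rad(6.9603)),
        ("Bonn", rad(50.7374), rad(7.0982)),
        ("Berlin", rad(52.5200), rad(13.4050)),
    ]
    for distance, name in search_within(points, rad(50.9375), rad(6.9603), 50.0):
        print(f"{name}: {distance:.1f} km")
```

The rectangle alone would also admit points near its corners that lie outside the circle, which is why the exact distance test is still needed as the second step.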
4 Aug 10:50
Re: Can xapian do geo search/calculations?
Richard Boulton <richard <at> lemurconsulting.com>
2008-08-04 08:50:20 GMT
Olly Betts wrote:
> On Mon, Aug 04, 2008 at 01:16:04AM -0700, Joe Noon wrote:
>> I would be interested in exploring xapian further, but I'm not sure if
>> I can get this functionality or not. I've been told it may be possible
>> with a value range processor? Is this something anyone else has
>> already explored? Or know if it would even be possible?
>
> You could filter with two value ranges to get results restricted to a
> rectangle in the coordinate system (which is in general not quite a
> rectangle on the ground).
>
> To actually get a circle, you'd want to use a MatchDecider subclass
> to select points based on coordinates stored in one or two document
> values. It would probably be useful to ship a standard solution, so if
> you implement something please consider sending a patch, even if it
> needs a bit of work for general use.
>
> There's also some code Richard has been working on in the geospatial
> branch in SVN, but he's better placed to talk about that:
>
> http://trac.xapian.org/browser/branches/geospatial

IIRC, the work I've done roughly does what Olly describes with a
MatchDecider. It also uses what's called a Hierarchical Triangular Mesh
algorithm to assign groups of points to sets of triangles of different
sizes on the world's surface - this allows a fast "pre-search" to be used
to calculate the approximate set of results, and the MatchDecider to be
used to make that pre-search into an exact search.

However, it's not finished and not yet useable. It got put on hold a
4 Aug 14:50
Re: FLAG_WILDCARD, add_database and performance
Oliver Flimm <flimm <at> ub.uni-koeln.de>
2008-08-04 12:50:33 GMT
Hi,

On Mon, Aug 04, 2008 at 09:32:37AM +0100, Olly Betts wrote:
> On Mon, Aug 04, 2008 at 09:45:47AM +0200, Oliver Flimm wrote:
> > On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > > > Could you profile to find where the time is spent? Some tips are here:
> > > >
> > > > http://trac.xapian.org/wiki/ProfilingXapian
> >
> > It looks like some routines in libc get called a lot when using a
> > wildcard search.
>
> Hmm, yes. Can you install the package libc6-dbg and repeat? That
> should then give us actual function names in libc.

Here are the results:

CPU: Core 2, speed 2666.68 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
  18       100.000  libc-2.3.6.so            libc-2.3.6.so            __rpc_thread_destroy
1915840  80.4207  libc-2.3.6.so            libc-2.3.6.so            _int_malloc
  1915840  99.9993  libc-2.3.6.so            libc-2.3.6.so            _int_malloc [self]
  6        3.1e-04  libstdc++.so.6.0.8       libc-2.3.6.so