Ronan | 1 Aug 11:13

Re: can a corrupt database be fixed?

> Do you know how it became corrupted BTW?

No, I'm afraid not.

Ronan.
Oliver Flimm | 1 Aug 11:48

FLAG_WILDCARD, add_database and performance

Hi,

I recently started to combine several (around 140) separate databases
for a single search request with add_database. I use the xapian perl
bindings. Additionally I use a match decider to implement facets.

Everything works fine unless I use a wildcard in my search,
e.g. 'java program*'. The enquire object looks like this:

my $enq = $dbh->enquire($qp->parse_query($querystring,Search::Xapian::FLAG_WILDCARD|Search::Xapian::FLAG_LOVEHATE|Search::Xapian::FLAG_BOOLEAN));

Using a wildcard in a sequential search results in search times around
0.00x to 0.x seconds for each database, but the same search request
using a combined database handle takes around 200 seconds...

You can test it on our public test system in the simple search form:

http://kug5.ub.uni-koeln.de/portal/opac?view=kug

Is there a way to improve request times for the combined search using
wildcards?

Regards,

Oliver

-- 
Universitaet zu Koeln :: Universitaets- und Stadtbibliothek
IT-Dienste :: Abteilung Universitaetsgesamtkatalog
Universitaetsstr. 33 :: D-50931 Koeln

Olly Betts | 4 Aug 03:01

Re: FLAG_WILDCARD, add_database and performance

On Fri, Aug 01, 2008 at 11:48:57AM +0200, Oliver Flimm wrote:
> I recently started to combine several (around 140) separate databases
> for a single search request with add_database. I use the xapian perl
> bindings. Additionally I use a match decider to implement facets.

Xapian version?  Platform?

> Using a wildcard in a sequential search results in search times around
> 0.00x to 0.x seconds for each database, but the same search request
> using a combined database handle takes around 200 seconds...

A more comparable test would be against the 140 databases merged into
one.

But it sounds like something is O(n*n) in the number of databases - that
shouldn't be necessary that I can see.

If it's easy to test, see if 100 databases takes about 100 seconds, and
70 about 50 seconds.
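
Those two figures follow from assuming the search time grows with the square of the number of databases and scaling from the reported 200 seconds at about 140 databases. A minimal sketch of that arithmetic (the function name `predicted_seconds` and its defaults are only illustrative):

```python
def predicted_seconds(n_dbs, ref_dbs=140, ref_seconds=200.0, exponent=2):
    """Scale a reference timing, assuming time grows as n_dbs ** exponent."""
    return ref_seconds * (n_dbs / ref_dbs) ** exponent


# Predictions for the two database counts suggested in the message.
for n in (100, 70):
    print(n, round(predicted_seconds(n), 1))
```

Under the quadratic assumption this gives roughly 102 seconds for 100 databases and 50 seconds for 70. Measured times well below these would point away from O(n*n) behaviour.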

> Is there a way to improve request times for the combined search using
> wildcards?

Could you profile to find where the time is spent?  Some tips are here:

http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly

Oliver Flimm | 4 Aug 08:57

Re: FLAG_WILDCARD, add_database and performance

Hi,

On Mon, Aug 04, 2008 at 02:01:51AM +0100, Olly Betts wrote:
> On Fri, Aug 01, 2008 at 11:48:57AM +0200, Oliver Flimm wrote:
> > I recently started to combine several (around 140) separate databases
> > for a single search request with add_database. I use the xapian perl
> > bindings. Additionally I use a match decider to implement facets.
> 
> Xapian version?  Platform?

I tried both Xapian 1.0.5 and 1.0.7 on a 64-bit Debian Linux system
(stable/etch). The Debian platform is AMD64 (although we're running on
Intel hardware).

[...]
> A more comparable test would be against the 140 databases merged into
> one.
> 
> But it sounds like something is O(n*n) in the number of databases - that
> shouldn't be necessary that I can see.
> 
> If it's easy to test, see if 100 databases takes about 100 seconds, and
> 70 about 50 seconds.
[...]
> Could you profile to find where the time is spent?  Some tips are here:
> 
> http://trac.xapian.org/wiki/ProfilingXapian
[...]

Thanks for the hints. I'll try them.

Oliver Flimm | 4 Aug 09:45

Re: FLAG_WILDCARD, add_database and performance

Hi,

On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > Could you profile to find where the time is spent?  Some tips are here:
> > 
> > http://trac.xapian.org/wiki/ProfilingXapian

It looks like some routines in libc get called a lot when using a
wildcard search.

Here are the first lines of output for the search request 'java' which
took around 3.5 seconds:

CPU: Core 2, speed 2666.68 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
62107    75.5606  libperl.so.5.8.8         libperl.so.5.8.8         (no symbols)
  62107    100.000  libperl.so.5.8.8         libperl.so.5.8.8         (no symbols) [self]
-------------------------------------------------------------------------------
11012    13.3974  no-vmlinux               no-vmlinux               (no symbols)
  11012    100.000  no-vmlinux               no-vmlinux               (no symbols) [self]
-------------------------------------------------------------------------------
5883      7.1574  libc-2.3.6.so            libc-2.3.6.so

Joe Noon | 4 Aug 10:16

Can xapian do geo search/calculations?

I'm currently using sphinx, which has a nice geo searching interface:

If given a query and an origin (latitude, longitude, both in radians),
and told which fields correspond to latitude/longitude in the index, I
can get back results ordered by distance from that point (and get the
actual distance as well).

I would be interested in exploring xapian further, but I'm not sure if
I can get this functionality or not.  I've been told it may be possible
with a value range processor? Is this something anyone else has
already explored? Or know if it would even be possible?

Thanks,

Joe
Olly Betts | 4 Aug 10:32

Re: FLAG_WILDCARD, add_database and performance

On Mon, Aug 04, 2008 at 09:45:47AM +0200, Oliver Flimm wrote:
> On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > > Could you profile to find where the time is spent?  Some tips are here:
> > > 
> > > http://trac.xapian.org/wiki/ProfilingXapian
> 
> It looks like some routines in libc get called a lot when using a
> wildcard search.

Hmm, yes.  Can you install the package libc6-dbg and repeat?  That
should then give us actual function names in libc.

> 141 databases - 206 seconds
> 104 databases - 86 seconds
> 69 databases - 16 seconds
> 41 databases - 4 seconds
> 8 databases - 0.29 seconds

Yeah, that's looking like there's something O(n*n) or worse.
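
One way to check that reading is to fit a straight line to the quoted timings on a log-log scale; the slope estimates the exponent k in time ~ n^k. This is only a sketch over the five figures quoted above, with a hand-rolled least-squares fit and an illustrative function name:

```python
import math

# (number of databases, seconds), taken from the timings quoted above
timings = [(141, 206.0), (104, 86.0), (69, 16.0), (41, 4.0), (8, 0.29)]


def loglog_slope(points):
    """Least-squares slope of log(seconds) against log(databases)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = loglog_slope(timings)
print(f"estimated exponent: {slope:.2f}")
```

The fitted exponent comes out somewhat above 2, which is consistent with "O(n*n) or worse". With only five points it is a rough indication rather than a measurement.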

Cheers,
    Olly
Olly Betts | 4 Aug 10:45

Re: Can xapian do geo search/calculations?

On Mon, Aug 04, 2008 at 01:16:04AM -0700, Joe Noon wrote:
> I would be interested in exploring xapian further, but I'm not sure if
> I can get this functionality or not.  I've been told it may be possible
> with a value range processor? Is this something anyone else has
> already explored? Or know if it would even be possible?

You could filter with two value ranges to get results restricted to a
rectangle in the coordinate system (which is in general not quite a
rectangle on the ground).
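
As a rough illustration of where the two ranges could come from, the sketch below computes a latitude range and a longitude range that enclose a circle of a given radius. It assumes a spherical Earth of radius 6371 km, coordinates in degrees, and a search area that touches neither a pole nor the antimeridian. The function name is hypothetical, and how the coordinates are encoded into document values for range filtering is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption


def bounding_box(lat_deg, lon_deg, radius_km):
    """Return (lat_min, lat_max, lon_min, lon_max) in degrees enclosing the circle.

    Assumes the box does not touch a pole or cross the antimeridian.
    """
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    # Longitude half-width: grows with latitude because meridians converge.
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat_deg))))
    return lat_deg - dlat, lat_deg + dlat, lon_deg - dlon, lon_deg + dlon


# Arbitrary example point and radius, chosen only for illustration.
print(bounding_box(50.93, 6.93, 10.0))
```

The box is a superset of the circle, so a range filter built this way still lets through points in the corners that are further away than the radius.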

To actually get a circle, you'd want to use a MatchDecider subclass
to select points based on coordinates stored in one or two document
values.  It would probably be useful to ship a standard solution, so if
you implement something please consider sending a patch, even if it
needs a bit of work for general use.
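
The distance test such a decider would need can be sketched independently of Xapian. The code below is plain Python, not a MatchDecider subclass: it assumes a spherical Earth, coordinates in degrees, and hypothetical function names. In Xapian the decider is handed the document, so the coordinates would first have to be read back from the document values, which is left out here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    # Clamp against rounding pushing the argument just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def make_circle_decider(center_lat, center_lon, radius_km):
    """Return a predicate accepting (lat, lon) pairs inside the circle."""
    def decide(lat, lon):
        return haversine_km(center_lat, center_lon, lat, lon) <= radius_km
    return decide


decide = make_circle_decider(0.0, 0.0, 200.0)
print(decide(0.0, 1.0), decide(0.0, 2.0))
```

One degree of longitude at the equator is about 111 km on this model, so the first point falls inside the 200 km circle and the second outside.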

There's also some code Richard has been working on in the geospatial
branch in SVN, but he's better placed to talk about that:

http://trac.xapian.org/browser/branches/geospatial

Cheers,
    Olly
Richard Boulton | 4 Aug 10:50

Re: Can xapian do geo search/calculations?

Olly Betts wrote:
> On Mon, Aug 04, 2008 at 01:16:04AM -0700, Joe Noon wrote:
>> I would be interested in exploring xapian further, but I'm not sure if
>> I can get this functionality or not.  I've been told it may be possible
>> with a value range processor? Is this something anyone else has
>> already explored? Or know if it would even be possible?
> 
> You could filter with two value ranges to get results restricted to a
> rectangle in the coordinate system (which is in general not quite a
> rectangle on the ground).
> 
> To actually get a circle, you'd want to use a MatchDecider subclass
> to select points based on coordinates stored in one or two document
> values.  It would probably be useful to ship a standard solution, so if
> you implement something please consider sending a patch, even if it
> needs a bit of work for general use.
> 
> There's also some code Richard has been working on in the geospatial
> branch in SVN, but he's better placed to talk about that:
> 
> http://trac.xapian.org/browser/branches/geospatial

IIRC, the work I've done roughly does what Olly describes with a 
MatchDecider.  It also uses what's called a Hierarchical Triangular Mesh
algorithm to assign groups of points to sets of triangles of different
sizes on the world's surface - this allows a fast "pre-search" to be used
to calculate the approximate set of results, and the MatchDecider to be 
used to make that pre-search into an exact search.
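
The two-stage pattern can be sketched without the mesh itself. The example below is not the Hierarchical Triangular Mesh algorithm and not the code in the geospatial branch: it swaps in a plain latitude/longitude grid as the coarse index, then applies an exact distance check to the candidates. All names, the 1-degree cell size and the spherical-Earth radius are assumptions, and areas near the poles or the antimeridian are not handled.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # spherical-Earth assumption
CELL_DEG = 1.0            # grid cell size in degrees (arbitrary choice)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def cell_of(lat, lon):
    """Grid cell containing the point."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(points):
    """Map each grid cell to the (name, lat, lon) points inside it."""
    index = defaultdict(list)
    for name, lat, lon in points:
        index[cell_of(lat, lon)].append((name, lat, lon))
    return index


def search(index, lat, lon, radius_km):
    """Coarse pass over candidate cells, then an exact distance check."""
    ang = radius_km / EARTH_RADIUS_KM
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    lo_i, lo_j = cell_of(lat - dlat, lon - dlon)
    hi_i, hi_j = cell_of(lat + dlat, lon + dlon)
    hits = []
    for i in range(lo_i, hi_i + 1):          # pre-search: cells overlapping the box
        for j in range(lo_j, hi_j + 1):
            for name, plat, plon in index.get((i, j), ()):
                d = haversine_km(lat, lon, plat, plon)
                if d <= radius_km:           # exact check on each candidate
                    hits.append((d, name))
    return sorted(hits)                      # nearest first, distance included
```

Only points in cells overlapping the query's bounding box reach the exact check, which is the role the pre-search plays in the description above. Returning the distance alongside each hit also gives results ordered by distance from the origin.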

However, it's not finished and not yet useable.  It got put on hold a 

Oliver Flimm | 4 Aug 14:50

Re: FLAG_WILDCARD, add_database and performance

Hi,

On Mon, Aug 04, 2008 at 09:32:37AM +0100, Olly Betts wrote:
> On Mon, Aug 04, 2008 at 09:45:47AM +0200, Oliver Flimm wrote:
> > On Mon, Aug 04, 2008 at 08:57:30AM +0200, Oliver Flimm wrote:
> > > > Could you profile to find where the time is spent?  Some tips are here:
> > > > 
> > > > http://trac.xapian.org/wiki/ProfilingXapian
> > 
> > It looks like some routines in libc get called a lot when using a
> > wildcard search.
> 
> Hmm, yes.  Can you install the package libc6-dbg and repeat?  That
> should then give us actual function names in libc.

Here are the results:

CPU: Core 2, speed 2666.68 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
  18       100.000  libc-2.3.6.so            libc-2.3.6.so            __rpc_thread_destroy
1915840  80.4207  libc-2.3.6.so            libc-2.3.6.so            _int_malloc
  1915840  99.9993  libc-2.3.6.so            libc-2.3.6.so            _int_malloc [self]
  6        3.1e-04  libstdc++.so.6.0.8       libc-2.3.6.so

