Peter A. Friend | 3 Nov 19:19
Picon

term duplication among index tables

Greetings all,

I have been looking into using Xapian to provide search for email. I  
have to be very careful about indexing overhead, and I notice that  
several of the tables use term strings as part of the key or data. I  
was wondering if there is a performance or complexity reason for not  
having a separate table mapping term strings to unique numbers, which  
could then be used in the other tables. Is this something that has  
been considered previously and discarded as unworkable, or do you  
think it may be worth pursuing?

Cheers,

Peter
richard | 3 Nov 20:47

Re: term duplication among index tables

On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
> I have been looking into using Xapian to provide search for email. I  
> have to be very careful about indexing overhead, and I notice that  
> several of the tables use term strings as part of the key or data. I  
> was wondering if there is a performance or complexity reason for not  
> having a separate table mapping term strings to unique numbers, which  
> could then be used in the other tables. Is this something that has  
> been considered previously and discarded as unworkable, or do you  
> think it may be worth pursuing?

It's a perfectly good question, and one which has been discussed several
times in the past (though maybe not on this list).  Indeed, IIRC, one of
the early backend implementations used a central lexicon of terms, and
referenced this lexicon using term IDs in the posting list, position list
and termlist tables.  This idea was discarded at a later date for
performance reasons.

The problem is that while getting the database size as small as possible is
useful, our primary goal with Xapian (at least, with the current backends)
is to make searches as fast as possible; and the speed is very much
dependent on the number of disk reads.  Use of a lexicon risks forcing
searches to perform an extra disk read for each term in the query, to look
up the term in the lexicon.

It is possible that the idea could still be useful for some tables, since
the lexicon should be relatively small, and could therefore get largely
cached in memory, but this would, of course, reduce the amount of space
available for caching other parts of the database.

I would be surprised if it could be used effectively for storing termlists,
(Continue reading)

Peter Friend | 3 Nov 21:35
Picon

Re: term duplication among index tables


On Nov 3, 2006, at 11:47 AM, richard <at> lemurconsulting.com wrote:

> On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
>> I was wondering if there is a performance or complexity reason for  
>> not
>> having a separate table mapping term strings to unique numbers, which
>> could then be used in the other tables. Is this something that has
>> been considered previously and discarded as unworkable, or do you
>> think it may be worth pursuing?
>
> The problem is that while getting the database size as small as  
> possible is
> useful, our primary goal with Xapian (at least, with the current  
> backends)
> is to make searches as fast as possible; and the speed is very much
> dependent on the number of disk reads.  Use of a lexicon risks forcing
> searches to perform an extra disk read for each term in the query,  
> to look
> up the term in the lexicon.

Trading space for more performance is certainly a reasonable  
tradeoff. Since the backends are basically B+ trees, I figured that  
space saved by using a term ID might allow more of the index pages  
for the other tables to be cached in memory (and possibly reduce disk  
hits), but how this compares with the existing compression schemes  
seems like it would be very sensitive to the types of documents being  
indexed.

If I manage the time to attempt such an overhaul, I'll share what I  
(Continue reading)

Olly Betts | 3 Nov 23:03
Favicon
Gravatar

Re: term duplication among index tables

On Fri, Nov 03, 2006 at 07:47:30PM +0000, richard <at> lemurconsulting.com wrote:
> On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
> > I was wondering if there is a performance or complexity reason for not  
> > having a separate table mapping term strings to unique numbers, which  
> > could then be used in the other tables. Is this something that has  
> > been considered previously and discarded as unworkable, or do you  
> > think it may be worth pursuing?
> 
> It's a perfectly good question, and one which has been discussed several
> times in the past (though maybe not on this list).

Actually, I'm not sure it's been discussed since the project became
Xapian, but we certainly talked about it at BrightStation.

> Indeed, IIRC, one of the early backend implementations used a central
> lexicon of terms, and referenced this lexicon using term IDs in the
> posting list, position list and termlist tables.

Yes, an early version of Quartz had a lexicon table.  I think the long
defunct "sleepycat" backend did too.

> This idea was discarded at a later date for performance reasons.

My recollection is that we originally gave up on this approach because it
greatly complicated handling multiple databases and query expansion.
But it turned out there were also performance arguments too.

> The problem is that while getting the database size as small as possible is
> useful, our primary goal with Xapian (at least, with the current backends)
> is to make searches as fast as possible; and the speed is very much
(Continue reading)

Olly Betts | 3 Nov 23:14
Favicon
Gravatar

Re: term duplication among index tables

On Fri, Nov 03, 2006 at 12:35:29PM -0800, Peter Friend wrote:
> Since the backends are basically B+ trees, I figured that
> space saved by using a term ID might allow more of the index pages
> for the other tables to be cached in memory (and possibly reduce disk
> hits)

This is definitely an important consideration.  Getting more branching
in the first few levels of the Btree will reduce the number of levels of
branching, which reduces the number of disk blocks required to get to an
entry.

> If I manage the time to attempt such an overhaul, I'll share what I
> find.

It would be interesting to hear the outcome of any experiments.

Cheers,
    Olly
Neal Becker | 4 Nov 17:07
Picon

Fedora FC6 rpm, howto disable -rpath?

I'd like to contribute xapian to fedora extras.  It builds OK, but breaks
one requirement for inclusion.  I need to disable rpath.  Unfortunately,
I'm not very familiar with the autoconf, automake,... setup, and I can't
seem to find where to disable this.  I tried autoreconf, but that didn't
fix it.  Can anyone help me?
Olly Betts | 4 Nov 17:19
Favicon
Gravatar

Re: Fedora FC6 rpm, howto disable -rpath?

On Sat, Nov 04, 2006 at 11:07:58AM -0500, Neal Becker wrote:
> I'd like to contribute xapian to fedora extras.  It builds OK, but breaks
> one requirement for inclusion.  I need to disable rpath.  Unfortunately,
> I'm not very familiar with the autoconf, automake,... setup, and I can't
> seem to find where to disable this.  I tried autoreconf, but that didn't
> fix it.  Can anyone help me?

Hmm, I thought libtool only set rpath if the prefix isn't /usr.  I
recall I checked the debian packages and they didn't have rpath set
(it's against debian policy too).

Are you using the xapian.spec we ship or your own version?

Cheers,
    Olly
Neal Becker | 4 Nov 17:21
Picon

Re: Fedora FC6 rpm, howto disable -rpath?

On Saturday 04 November 2006 11:19 am, Olly Betts wrote:
> On Sat, Nov 04, 2006 at 11:07:58AM -0500, Neal Becker wrote:
> > I'd like to contribute xapian to fedora extras.  It builds OK, but breaks
> > one requirement for inclusion.  I need to disable rpath.  Unfortunately,
> > I'm not very familiar with the autoconf, automake,... setup, and I can't
> > seem to find where to disable this.  I tried autoreconf, but that didn't
> > fix it.  Can anyone help me?
>
> Hmm, I thought libtool only set rpath if the prefix isn't /usr.  I
> recall I checked the debian packages and they didn't have rpath set
> (it's against debian policy too).
>
> Are you using the xapian.spec we ship or your own version?
>
> Cheers,
>     Olly

The one shipped, but this is x86_64, and libdir=/usr/lib64.
Olly Betts | 4 Nov 17:31
Favicon
Gravatar

Re: Fedora FC6 rpm, howto disable -rpath?

On Sat, Nov 04, 2006 at 11:21:41AM -0500, Neal Becker wrote:
> The one shipped, but this is x86_64, and libdir=/usr/lib64.

Ah, that's probably fooling libtool then I guess.  I'll investigate and
try to get that fixed.  But for now, you could try not specifying
--libdir to configure, and then moving the installed libraries from
/usr/lib to /usr/lib64 after "make install".

Cheers,
    Olly
Neal Becker | 4 Nov 17:33
Picon

Re: Fedora FC6 rpm, howto disable -rpath?

On Saturday 04 November 2006 11:31 am, Olly Betts wrote:
> On Sat, Nov 04, 2006 at 11:21:41AM -0500, Neal Becker wrote:
> > The one shipped, but this is x86_64, and libdir=/usr/lib64.
>
> Ah, that's probably fooling libtool then I guess.  I'll investigate and
> try to get that fixed.  But for now, you could try not specifying
> --libdir to configure, and then moving the installed libraries from
> /usr/lib to /usr/lib64 after "make install".
>
> Cheers,
>     Olly

Thanks.  It seems to be fairly common to allow a --disable-rpath switch in 
configure for this purpose.  I suggest this approach.

Gmane