Re: Interested in IR, Getting started with Xapian
Olly Betts <olly <at> survex.com>
2012-03-06 23:44:16 GMT
On Mon, Mar 05, 2012 at 11:39:31PM +0530, Akshay M S wrote:
> And in the project description, I couldn't understand this - *"Additionally,
> for faster searching, an upper bound on each component is needed (each
> database stores a number of summary statistics to help with this - if
> additional statistics would be useful, you could add them as part of the
> project)."*
> I'm thinking components refer to the per-document term weights and the
> DF/IDF weights? Any elaboration would be really helpful.
Xapian assumes that a document's weight is the sum of a contribution
from each term from the query which indexes it, plus an optional
per-document contribution which is independent of query. You can
also add in other contributions (e.g. from link analysis, click-through
data, how much an advertiser has paid you to boost their pages (yuck),
etc) using a PostingSource.
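The weight model described above can be sketched in a few lines of Python. This is a toy illustration, not Xapian's API: the function name and signature here are invented for the example.

```python
def document_weight(doc_terms, query_terms, term_weight, extra_weight=0.0):
    """Toy model of Xapian's document weighting.

    doc_terms: set of terms which index the document.
    query_terms: terms in the query.
    term_weight: dict mapping term -> that term's contribution for this doc.
    extra_weight: optional query-independent per-document contribution
                  (e.g. from link analysis or a PostingSource).
    """
    # Sum the contribution from each query term which indexes the document.
    total = sum(term_weight[t] for t in query_terms if t in doc_terms)
    return total + extra_weight

# Document indexed by "moon" but not "cheese": only "moon" contributes.
weights = {"moon": 1.2, "cheese": 2.5}
print(document_weight({"moon", "green"}, ["moon", "cheese"], weights, 0.3))
# 1.2 + 0.3 = 1.5
```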
For each of these contributions, we want an upper bound on what the
contribution can be for a particular term (or PostingSource) in a given
database.
A concrete example is probably the best way to show how this can be
useful, so consider a query: moon OR cheese
Say we want to get the best 10 matches (i.e. highest weights). We
consider documents in ascending order of the numeric Xapian document
id. Once we have found 10, we can ignore anything which has a weight
less than the lowest weight of those 10, and as we gradually improve
our candidate set of 10, that minimum weight rises.
If we have bounds on the weights which "moon" and "cheese" can return,
then at some point we can spot that both terms will need to match to
achieve that minimum weight - for example, if "moon" can return at most
2.0 and "cheese" 3.0, and the lowest weight in our current best 10 is
4.0. Once this happens, we can convert that OR to AND, and that allows
us to skip through the document ids more quickly.
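Here is a small simulation of that pruning idea (a sketch of the principle only, not Xapian's matcher code; the helper names are invented for the example). It walks documents in ascending docid order, keeps the current best k in a min-heap, and once no single term's upper bound can reach the current minimum weight, treats the OR as an AND:

```python
import heapq

def search_or(postings, term_max, k=10):
    """postings: dict term -> {docid: weight}.
    term_max: dict term -> upper bound on that term's contribution.
    Returns the top-k (weight, docid) pairs, best first."""
    docids = sorted(set().union(*postings.values()))
    heap = []            # min-heap holding the current best k candidates
    require_all = False  # becomes True once OR is converted to AND
    for did in docids:
        matched = [t for t in postings if did in postings[t]]
        if require_all and len(matched) < len(postings):
            continue  # post-conversion: skip docs not matching every term
        w = sum(postings[t][did] for t in matched)
        if len(heap) < k:
            heapq.heappush(heap, (w, did))
        elif w > heap[0][0]:
            heapq.heapreplace(heap, (w, did))
        if len(heap) == k:
            threshold = heap[0][0]  # lowest weight in the current best k
            # If no single term's bound can reach the threshold on its
            # own, a candidate must match both terms: OR becomes AND.
            if all(m < threshold for m in term_max.values()):
                require_all = True
    return sorted(heap, reverse=True)
```

With the example above (bounds 2.0 for "moon" and 3.0 for "cheese", threshold 4.0), `require_all` flips to True and single-term documents are skipped without being scored.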
We have a number of optimisations which make use of these bounds - in
some cases we can even spot that there's no chance of finding another
candidate and stop early.
The tighter the bound you can calculate for a weighting scheme, the
more effective these optimisations can be.
If implementing a new backend, you could just return the maximum value
of the type and Xapian will work, but it won't be as fast. One of the
improvements the chert backend has over the flint backend is that it
keeps track of bounds on document length and within-document-frequency
which allow tighter bounds to be calculated for BM25.
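To see why those statistics help, here is a minimal sketch of a BM25-style per-term weight (a simplified textbook form, not Xapian's exact formulation). The weight grows with within-document frequency (wdf) and shrinks with document length, so plugging in the largest wdf and the smallest document length the backend has recorded gives a valid upper bound:

```python
def bm25_term_weight(wdf, doc_len, avg_len, idf, k1=1.2, b=0.75):
    """Simplified BM25 contribution of one term in one document."""
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return idf * ((k1 + 1) * wdf) / (norm + wdf)

def bm25_upper_bound(wdf_max, doc_len_min, avg_len, idf, k1=1.2, b=0.75):
    # Plug in the most favourable tracked statistics: the weight is
    # increasing in wdf and decreasing in document length, so the
    # highest wdf and the shortest document maximise it.
    return bm25_term_weight(wdf_max, doc_len_min, avg_len, idf, k1, b)
```

A backend that doesn't track these statistics can only fall back on something like the maximum value of the type, which is correct but leaves the matcher's pruning with almost nothing to work with.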
If you implement a new weighting scheme, you might find that keeping
track of one or more additional statistics would allow a tighter
bound to be calculated, so modifying the backend to track these could
be worthwhile (and older backends could just return a really high or
low value, depending on whether it's an upper or lower bound).
> Can someone please point me to patches/bugs that are related to this
> project so I can understand the existing code better, especially related to
> Xapian::Weight class or anything else that can get me started with Xapian
I don't think there are any patches or bugs related to this one.
This page talks a bit more about the optimisations, but isn't currently
completely up to date:
And you can see an implementation of coordinate weighting (probably the
simplest weighting scheme after BoolWeight) here: