Olly Betts | 8 May 2003 19:59
Favicon
Gravatar

OmRSet copying semantics

Hi,

Xapian API classes are designed to act as handles - copying or assigning
an OmEnquire object gives you another reference to the same underlying
entity.  They're like pointers which you don't have to worry about
free-ing - reference counting takes care of that.

I've been reworking the classes into a Xapian namespace and noticed that
OmRSet gets duplicated when assigned or copied, which is inconsistent.
This example shows what I mean:

    OmRSet rset;
    rset.add_document(4);
    OmRSet copy(rset);
    OmRSet assign = rset;
    assign.add_document(5);
    copy.add_document(6);
    copy.add_document(7);
    cout << rset.size() << " " << assign.size() << " " << copy.size() << endl;

At present this prints "1 2 3" - I'd expect "4 4 4".

My plan is to correct this to bring OmRSet in line with other API
classes.  Omega doesn't rely on either behaviour, and neither do any of
the examples.  But it's possible that others have written code which
relies on getting a fresh copy - if so be warned that your code will need
updating.

Cheers,
    Olly
(Continue reading)

Arjen van der Meijden | 12 May 2003 11:02
Picon
Picon

Finding the 'place' within a document?

Hi,

There is a discussion going on in our forum about the ability to find
specific messages (the ability of retrieve their messageid). The idea
behind this is that a very long and diverse discussion (document) is not
such a usefull result for your problem, since you'll have to skim
through another 1000 (which isn't even a hard limit, but usualy a sort
of soft limit to start a 'long discussion topic part i++'.

Our documents are build up like this:
<topic>
[author, date, topicid, other metadata]
[title]
<message1>
[author, messageid]
[message content]
<...>
</topic>

We feed this to the scriptindex-program like so, but that can easily be
adjusted to any other form of course:
Topicid=$topicid
Author=$author
Date=$date
Title=$title
Message=$message1content
=$message2content
=...

I was wondering whether it is possible to allow omega to retrieve the
(Continue reading)

James Aylett | 12 May 2003 11:28

Re: Finding the 'place' within a document?

On Mon, May 12, 2003 at 11:02:47AM +0200, Arjen van der Meijden wrote:

> There is a discussion going on in our forum about the ability to
> find specific messages (the ability of retrieve their messageid).  I
> was wondering whether it is possible to allow omega to retrieve the
> messageid of the messages that contained a certain keyword, without
> having to base the indexation on the messages themselves (indexing
> all the messages independent of each other, instead of the
> discussions as a whole thread)?

I can't think of a good way of doing it - Xapian operates in terms of
Xapian::Document, so you need to map that to whatever the smallest
result you want is. In this case, that'll be the message. What sort of
volume of topics are you talking about?

> The reverse of this question is: I was wondering whether, when indexing
> on the messages independently, if they can be somehow joined together
> and thereby allowing you to find a topic that contains keyword1,
> keyword2 and keyword3 in message1, keyword4 in message2 etc... ?

Hmm. Well ... there's no straightforward way of doing this (beyond
indexing the topics as well as the messages). I haven't thought this
through fully, but could you possibly do something like adding a value
to each message giving the topic id, and then sorting on that? Then
you build the actual list of matching topics by going over the mset
checking the existence of required terms in the individual documents,
and filtering out any topics that don't contain them.

That's fairly unpleasant, though, and I haven't no idea what the speed
would be like (horrible, I suspect).
(Continue reading)

Arjen van der Meijden | 12 May 2003 22:45
Picon
Picon

RE: Finding the 'place' within a document?

> What sort of volume of topics are you 
> talking about?
We've about 700K topics, containing around 8M messages in them.
So the increase of documents would be more than 10 times the current
amount.

It results in a termlist of 2.2G, postlist of 2G, positionlist of 4.6G,
recordslist of 388M and a valueslist of 67M, adding up to a total of
9.3G.

> That's fairly unpleasant, though, and I haven't no idea what 
> the speed would be like (horrible, I suspect).
Which is 'acceptable' (it really is acceptable, don't be alarmed :) ) at
the moment.

> Pass. I'd have thought it'll be measurably larger, mostly 
> since the postlist will be much larger (assuming there's lots 
> of overlap in the termlists of the individual messages within 
> a thread, which seems likely). The positionlist won't need 
> much additional storage, although I can't remember how that's 
> stored these days (and don't have a practical database to 
> play with right now).

I see you don't name any option I overlooked. And I estimated the same
about the growing files.
Anyway, as a very simple sidestep I enhanced our nice 'keyword
highlighter'. It is like the groups.google searchresult-highlighter and
simply highlights the words in each message. That is done, one-by-one,
while retrieving the messages from the database, so it was fairly easy
to check whether something changed at all (i.e. whether the message
(Continue reading)

Olly Betts | 12 May 2003 23:14
Favicon
Gravatar

Re: Finding the 'place' within a document?

On Mon, May 12, 2003 at 11:02:47AM +0200, Arjen van der Meijden wrote:
> The reverse of this question is: I was wondering whether, when indexing
> on the messages independently, if they can be somehow joined together
> and thereby allowing you to find a topic that contains keyword1,
> keyword2 and keyword3 in message1, keyword4 in message2 etc... ?

Back when we were Open Muscat, we talked about a "structured docid"
so that a term would be in a section of a chapter of a document or
what have you, instead of just being in a document.  It's essentially
what you ideally want I think, but it was an idea for the long term, and
we've still not implemented it...

> And with that last one comes another difficult one, would the
> database-size explode (compared to the current one, which includes
> positional information) if I'd index the content on a message-basis?

The only way to be sure is to try indexing a substantial subset of the
database both ways and see (I'd be interested to hear how it compares).
I'd imagine it'll be larger, but I doubt it would "explode".

Cheers,
    Olly

-------------------------------------------------------
Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara
The only event dedicated to issues related to Linux enterprise solutions
www.enterpriselinuxforum.com
Daniel Hanks | 20 May 2003 20:50

Xapian/Omega limits

Hi folks, 

I just recently stumbled across Xapian and am wondering about any limits as to size of databases. I would
like to be able to use it to index potentially around 10 million web pages. In the near future the number will
probably be more like 2 million, but if Xapian does well, I'll index a lot more. 

I'm using the quartz backend, and wondering if I'm going to run into issues like queries slowing down as I
scale up to those numbers, database file size issues, etc.

Any pointers, docs, rtfm, etc. would be very helpful. I've been very pleased with the flexibility of the
software so far. I'm just hoping it can scale to our needs. Thanks for the good work!

-- Dan
========================================================================
   Daniel Hanks - Systems/Database Administrator
   About Inc., Web Services Division
========================================================================

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge
James Aylett | 21 May 2003 11:52

Re: Xapian/Omega limits

On Tue, May 20, 2003 at 12:50:04PM -0600, Daniel Hanks wrote:

> I just recently stumbled across Xapian and am wondering about any
> limits as to size of databases. I would like to be able to use it to
> index potentially around 10 million web pages. In the near future
> the number will probably be more like 2 million, but if Xapian does
> well, I'll index a lot more.

The short answer is that it'll work. Omsee (the software Xapian has
evolved from) was used for the Webtop search engine, with over 500
million documents. Searches took less than a second.

The only caveat here is that they were using the legacy muscat36
backend; you're using Quartz. However there are people using it for
fairly large systems - I don't have figures, though.

Things will be impacted by the size of the database, obviously, but
I'm not really the right person to say how. (If we can start getting
some info on this, I'll put together a page on this sort of thing for
the website.)

Cheers,
James

--

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org

-------------------------------------------------------
(Continue reading)

Sam Liddicott | 21 May 2003 19:09

Re: Xapian/Omega limits

James Aylett wrote:
On Tue, May 20, 2003 at 12:50:04PM -0600, Daniel Hanks wrote:
I just recently stumbled across Xapian and am wondering about any limits as to size of databases. I would like to be able to use it to index potentially around 10 million web pages. In the near future the number will probably be more like 2 million, but if Xapian does well, I'll index a lot more.
The short answer is that it'll work. Omsee (the software Xapian has evolved from) was used for the Webtop search engine, with over 500 million documents. Searches took less than a second. The only caveat here is that they were using the legacy muscat36 backend; you're using Quartz. However there are people using it for fairly large systems - I don't have figures, though. Things will be impacted by the size of the database, obviously, but I'm not really the right person to say how. (If we can start getting some info on this, I'll put together a page on this sort of thing for the website.)
At Orange we used it successfully for a few million documents.
We did a trial at gmane with 10 million and came accross some internal quartz limits at a few million because we were indexing a lot.
Olly can give you the low down on this, but we also found to keep up with the gmame posting rate at peak times we had to have many small DB's as it becomes progrssively slower to insert into large DB's.

Luckily xapian can combine searches accross DB's dynamically so the only problem was working out which DB an article was in when it wanted deleting.

Sam
Cheers, James

Olly Betts | 22 May 2003 16:22
Favicon
Gravatar

Re: Xapian/Omega limits

On Tue, May 20, 2003 at 12:50:04PM -0600, Daniel Hanks wrote:
> I just recently stumbled across Xapian and am wondering about any
> limits as to size of databases. I would like to be able to use it to
> index potentially around 10 million web pages. In the near future the
> number will probably be more like 2 million, but if Xapian does well,
> I'll index a lot more. 
> 
> I'm using the quartz backend, and wondering if I'm going to run into
> issues like queries slowing down as I scale up to those numbers,
> database file size issues, etc.

The quartz database backend (which is the backend you'd usually use) stores the
indexes in several files containing B-tree tables.  If you're indexing with
positional information (for phrase searching) the term positions table is
usually the largest.

The current limitations which I'm aware of are:

* Xapian uses unsigned 32-bit ints for document ids, so you're limited to
  just over 4 billion documents in a database.  The other limits will cut in
  first for a single database, but when searching over multiple databases are
  done by interleaving the document ids, so this might start to matter
  (especially if one database is much larger than the others).

* If you search many databases concurrently, you may hit the per-process
  file-descriptor limit - each quartz database uses 5 fds.  Some Unix-like OSes
  allow this limit to be raised.  Another way to avoid it (and to spread the
  search load) is to use the remote backend to search databases on a cluster of
  machines.  Quartz could be made to not open fds for tables which aren't being
  used during search (values and positions may not be), or to juggle fds - the
  record table is typically only used for results, while the posting table is
  typically only used during matching.

* If the OS has a filesize limit, that obviously applies to Xapian (a 2GB limit
  is common for older operating systems).  The old Muscat36 system supported
  splitting the B-tree over multiple 1GB files, but native support for larger
  files is available in any serious modern OS so quartz doesn't support this
  (something similar could be implemented, but it no longer seems necessary).

* The B-trees use a 32-bit signed block count.  The default blocksize is 8K
  which limits you to 16GB tables.  You can increase the blocksize if this
  is a problem, but it's best to do it before you create the database as
  otherwise you need to use copydatabase to rebuild the database with the
  new blocksize, and that will take a while.  This count should probably
  be unsigned, not signed, which would buy another factor of 2 (so 32GB
  limit with the default blocksize).  I believe this is the limit Sam
  referred to.

* Quartz stores the total length (i.e. number of terms) of all the documents in
  a database so it can calculate the average document length.  This is
  currently stored as an unsigned 64-bit quantity so it's not likely to be a
  limit you'll hit.  I've listed it for completeness.

* Not a hard limit, but Quartz update isn't as fast as it could be, and this
  currently can become a bottleneck as the database grows.  This is due to the
  way new entries are merged in - I know how it could be improved, but haven't
  had time to work on it yet.

In general, I'd suggest a strategy of splitting the data over several databases
like Sam suggested.  Once each has finished being updated, run it through
quartzcompact to optimise it for small size and fast searching (quartz usually
leaves some spare space in each block so that updates are more efficient - you
can update a compacted database, but the first few updates will cause a lot of
block splitting).

A multiple-database scheme works particularly well if you want a rolling web
index where the contents of the oldest database can be rechecked and live links
put back into a new database.  It's also good for a news-type application where
older documents should expire from the index.

As a more general point, for a large search application, I/O will end up being
the limiting factor.  So you want a RAID setup optimised for fast reading,
lots of RAM in the box so the OS can cache lots of disk blocks (the access
patterns typically mean that you only need to cache a few percent of the
database to eliminate most disk accesses).

It also means that reducing the database size is usually a win, which is why
quartzcompact is useful.  Quartz compacts the information in the tables in
ways which work well given the nature of the data (e.g. lists of sorted docids
are stored as differences with smaller values encoded in fewer bytes).  This
could be improved a little using a bitstream instead of a bytestream.

I've also toyed with the idea of implementing an ultra-compact read-only
backend which would take a quartz b-tree and convert it to something like a cdb
hashed file.  The same information would be stored, but with less overhead than
quartz b-trees (which need to allow for update).

> Any pointers, docs, rtfm, etc. would be very helpful. I've been very pleased
> with the flexibility of the software so far. I'm just hoping it can scale to
> our needs. Thanks for the good work!

Scalability is probably our most frequently asked question.  I don't think
anything on the website addresses the issue in much depth, but as James said,
we should probably write something (the above is a start).

If you have a budget, I can provide consultancy and customisation services
should you need specific assistance setting up a large search-related project
(or indeed a small one!)

Cheers,
Olly

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge
Sam Liddicott | 22 May 2003 22:31

Re: Xapian/Omega limits

Olly Betts wrote:

>I've also toyed with the idea of implementing an ultra-compact read-only
>backend which would take a quartz b-tree and convert it to something like a cdb
>hashed file.  The same information would be stored, but with less overhead than
>quartz b-trees (which need to allow for update).
>
If you did this it would also be a great benefit if records could still 
be nullified/deleted/removed/made-not-there etc
There are a few strong cases where a libellous/dangerous record might 
want removing out of the fixed DB while a new version might be added to 
the normal DB.

>
>If you have a budget, I can provide consultancy and customisation services
>should you need specific assistance setting up a large search-related project
>(or indeed a small one!)
>
Daniel, I recommend Olly to you on this; he did some customisation for 
us at Orange when I worked there.  He added sorting of results in a 
similar manner to the old muscat product, as well as other things.  I 
was very pleased with what he did for us.

Sam

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge

Gmane