Re: Xapian/Omega limits
Olly Betts <olly <at> survex.com>
2003-05-22 14:22:10 GMT
On Tue, May 20, 2003 at 12:50:04PM -0600, Daniel Hanks wrote:
> I just recently stumbled across Xapian and am wondering about any
> limits as to size of databases. I would like to be able to use it to
> index potentially around 10 million web pages. In the near future the
> number will probably be more like 2 million, but if Xapian does well,
> I'll index a lot more.
>
> I'm using the quartz backend, and wondering if I'm going to run into
> issues like queries slowing down as I scale up to those numbers,
> database file size issues, etc.
The quartz database backend (which is the backend you'd usually use) stores the
indexes in several files containing B-tree tables. If you're indexing with
positional information (for phrase searching) the term positions table is
usually the largest.
The current limitations which I'm aware of are:
* Xapian uses unsigned 32-bit ints for document ids, so you're limited to
just over 4 billion documents in a database. The other limits will cut in
first for a single database, but when searching over multiple databases are
done by interleaving the document ids, so this might start to matter
(especially if one database is much larger than the others).
* If you search many databases concurrently, you may hit the per-process
file-descriptor limit - each quartz database uses 5 fds. Some Unix-like OSes
allow this limit to be raised. Another way to avoid it (and to spread the
search load) is to use the remote backend to search databases on a cluster of
machines. Quartz could be made to not open fds for tables which aren't being
used during search (values and positions may not be), or to juggle fds - the
record table is typically only used for results, while the posting table is
typically only used during matching.
* If the OS has a filesize limit, that obviously applies to Xapian (a 2GB limit
is common for older operating systems). The old Muscat36 system supported
splitting the B-tree over multiple 1GB files, but native support for larger
files is available in any serious modern OS so quartz doesn't support this
(something similar could be implemented, but it no longer seems necessary).
* The B-trees use a 32-bit signed block count. The default blocksize is 8K
which limits you to 16GB tables. You can increase the blocksize if this
is a problem, but it's best to do it before you create the database as
otherwise you need to use copydatabase to rebuild the database with the
new blocksize, and that will take a while. This count should probably
be unsigned, not signed, which would buy another factor of 2 (so 32GB
limit with the default blocksize). I believe this is the limit Sam
referred to.
* Quartz stores the total length (i.e. number of terms) of all the documents in
a database so it can calculate the average document length. This is
currently stored as an unsigned 64-bit quantity so it's not likely to be a
limit you'll hit. I've listed it for completeness.
* Not a hard limit, but Quartz update isn't as fast as it could be, and this
currently can become a bottleneck as the database grows. This is due to the
way new entries are merged in - I know how it could be improved, but haven't
had time to work on it yet.
In general, I'd suggest a strategy of splitting the data over several databases
like Sam suggested. Once each has finished being updated, run it through
quartzcompact to optimise it for small size and fast searching (quartz usually
leaves some spare space in each block so that updates are more efficient - you
can update a compacted database, but the first few updates will cause a lot of
block splitting).
A multiple-database scheme works particularly well if you want a rolling web
index where the contents of the oldest database can be rechecked and live links
put back into a new database. It's also good for a news-type application where
older documents should expire from the index.
As a more general point, for a large search application, I/O will end up being
the limiting factor. So you want a RAID setup optimised for fast reading,
lots of RAM in the box so the OS can cache lots of disk blocks (the access
patterns typically mean that you only need to cache a few percent of the
database to eliminate most disk accesses).
It also means that reducing the database size is usually a win, which is why
quartzcompact is useful. Quartz compacts the information in the tables in
ways which work well given the nature of the data (e.g. lists of sorted docids
are stored as differences with smaller values encoded in fewer bytes). This
could be improved a little using a bitstream instead of a bytestream.
I've also toyed with the idea of implementing an ultra-compact read-only
backend which would take a quartz b-tree and convert it to something like a cdb
hashed file. The same information would be stored, but with less overhead than
quartz b-trees (which need to allow for update).
> Any pointers, docs, rtfm, etc. would be very helpful. I've been very pleased
> with the flexibility of the software so far. I'm just hoping it can scale to
> our needs. Thanks for the good work!
Scalability is probably our most frequently asked question. I don't think
anything on the website addresses the issue in much depth, but as James said,
we should probably write something (the above is a start).
If you have a budget, I can provide consultancy and customisation services
should you need specific assistance setting up a large search-related project
(or indeed a small one!)
Cheers,
Olly
-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge