Alexander Lind | 1 Dec 18:31

writabledatabase_delete_document()

Hi guys

I have implemented xapian on a website, and it currently has about 2M
items in its index.

It's all been working quite nicely so far, until I tried removing some
old items from the index (removing items when the index was smaller was
no problem at all).

When I try to remove them now (using writabledatabase_delete_document()
via php), it halfway freezes up the machine, and the apache httpd runs
amok spawning more and more children, until I break the php script that
is trying to remove documents from xapian.

From my layman's point of view, it seems that the xapian delete document
function freezes up the OS on a filesystem level. Is this a correct
assessment?

Any ideas on how I can get xapian _not_ to freeze up the system like
this when deleting documents?

I'm using php 5.2.0, and xapian 0.9.9.

This is a 'ls -lha' of the xapian index dir:
total 1.3G
drwxr-xr-x    2 malte    users         408 Dec  1 17:22 .
drwxr-xr-x   16 malte    users         400 Nov 15 20:21 ..
-rw-------    1 malte    users           0 Dec  1 17:22 db_lock
-rw-r--r--    1 malte    users          10 Nov 24 00:39 meta
-rw-r--r--    1 malte    users        468M Dec  1 17:22 position_DB
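-rw-r--r--    1 malte    users        7.4K Dec  1 16:46 position_baseB
-rw-r--r--    1 malte    users        324M Dec  1 17:23 postlist_DB
-rw-r--r--    1 malte    users        5.1K Dec  1 16:46 postlist_baseB
-rw-r--r--    1 malte    users         61M Dec  1 17:22 record_DB
-rw-r--r--    1 malte    users         956 Dec  1 16:46 record_baseB
-rw-r--r--    1 malte    users        339M Dec  1 17:22 termlist_DB
-rw-r--r--    1 malte    users        5.4K Dec  1 16:46 termlist_baseB
-rw-r--r--    1 malte    users         99M Dec  1 17:22 value_DB
-rw-r--r--    1 malte    users        1.6K Dec  1 16:46 value_baseB

Thanks for your help.
Alec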

Alexander Lind | 1 Dec 18:44

Re: writabledatabase_delete_document()

I should add that I use Xapian's own document_id as the argument to
writabledatabase_delete_document().
I have made it so the indexing part of my script saves the Xapian
document_id in the SQL db.

I just read a post somewhere on the net about how you can use term names
to ID and delete items out of a Xapian index instead of using a
document_id. Is this faster?  (It would seem strange if it was.)

Alec

Alexander Lind wrote:
> Hi guys
>
> I have implemented xapian on a website, and it currently has about 2M
> items in its index.
>
> Its all been working quite nicely so far, until I tried removing some
> old items from the index (removing items when the index was smaller was
> no problems at all).
>
> When I try to remove them now (using writabledatabase_delete_document()
> via php), it halfway freezes up the machine, and the apache httpd runs
> amok spawning more and more children, until I break the php script that
> is trying to remove documents from xapian.
>
> >From my laymens point of view, it seems that the xapian delete document
> function freezes up the OS on a filesystem level. Is this a correct
> assessment?
>

Alexander Lind | 1 Dec 21:44

Re: writabledatabase_delete_document()

Is the answer to my question here to split the data into multiple databases?
I have tried to read as much as I can find on this subject on xapian.org and elsewhere, but can't really seem to find the answer to how it should be done.
Technically I know how to do it, but not logically. How many databases should I aim for - ie, should I aim for them not to be over a certain size, contain a certain amount of documents, or something else?
Should I distribute documents sequentially into them, or randomly, or use some other scheme?

The data in the documents are products, and I add, delete and update them continuously. There is no first-in-first-out rule, any product can be updated or removed at any time, so putting them in the order they arrive is of no particular use.

Starting to feel a little lonely in my thread here :p

Alec

Alexander Lind wrote:
> I should add that I use xapians own document_id as the argument to
> writabledatabase_delete_document(). I have made it so the indexing part
> of my script saves the xapian document_id in the sql db.
>
> I just read a post somewhere on the net about how you can use term names
> to ID and delete items out of an xapian index instead of using a
> document_id. Is this faster? (would seem strange if it was).
>
> Alec
>
> Alexander Lind wrote:
>> Hi guys
>>
>> I have implemented xapian on a website, and it currently has about 2M
>> items in its index.
>>
>> Its all been working quite nicely so far, until I tried removing some
>> old items from the index (removing items when the index was smaller was
>> no problems at all).
>>
>> When I try to remove them now (using writabledatabase_delete_document()
>> via php), it halfway freezes up the machine, and the apache httpd runs
>> amok spawning more and more children, until I break the php script that
>> is trying to remove documents from xapian.
>>
>> From my laymens point of view, it seems that the xapian delete document
>> function freezes up the OS on a filesystem level. Is this a correct
>> assessment?
>>
>> Any ideas on how I can get xapian _not_ to freeze up the system like
>> this when deleting documents?
>>
>> I'm using php 5.2.0, and xapian 0.9.9.
>>
>> This is a 'ls -lha' of the xapian index dir:
>> total 1.3G
>> drwxr-xr-x    2 malte    users         408 Dec  1 17:22 .
>> drwxr-xr-x   16 malte    users         400 Nov 15 20:21 ..
>> -rw-------    1 malte    users           0 Dec  1 17:22 db_lock
>> -rw-r--r--    1 malte    users          10 Nov 24 00:39 meta
>> -rw-r--r--    1 malte    users        468M Dec  1 17:22 position_DB
>> -rw-r--r--    1 malte    users        7.4K Dec  1 16:46 position_baseB
>> -rw-r--r--    1 malte    users        324M Dec  1 17:23 postlist_DB
>> -rw-r--r--    1 malte    users        5.1K Dec  1 16:46 postlist_baseB
>> -rw-r--r--    1 malte    users         61M Dec  1 17:22 record_DB
>> -rw-r--r--    1 malte    users         956 Dec  1 16:46 record_baseB
>> -rw-r--r--    1 malte    users        339M Dec  1 17:22 termlist_DB
>> -rw-r--r--    1 malte    users        5.4K Dec  1 16:46 termlist_baseB
>> -rw-r--r--    1 malte    users         99M Dec  1 17:22 value_DB
>> -rw-r--r--    1 malte    users        1.6K Dec  1 16:46 value_baseB
>>
>> Thanks for your help.
>> Alec
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Alexander Lind | 2 Dec 08:29

document_id globally incrementing

Hi All

I have made my xapian indexer automatically create new indexes once it
reaches X documents in each, and for each document that I add to each
sub-index, I record its document_id and its index_id (relating to what
index the document ended up in).

writabledatabase_add_document()  returns document_id:s beginning from 0
for each new index when you add new documents, like you would expect.

So far all good.

Here is the problem: when I search through the indexes together (using
database_add_database() on each sub-index), all the document_id:s are
numbered globally, so it seems while writabledatabase_add_document()
reset the document_id counter for each subindex, in the subindexes they
were never reset.

So instead of having 10 sub-indexes with 50 documents in each,
document_id:s ranging from 0-50 in each, I seem to end up with 10
sub-indexes with 50 documents in each, document_id:s ranging from 0-49
in subindex 1, 50-99 in subindex 2, and so on. This would not be a
problem if writabledatabase_add_document() returned these globally
incrementing document_id:s, but it doesn't.

Sorry for the lengthy email here, but is this a bug or a feature or am I
doing something wrong?

Thank you so much for your help.
Alec
Alexander Lind | 3 Dec 07:16

Re: document_id globally incrementing

And another revision (phew!):

I solved it. All I had to do was cast the document_id to an int. Xapian tried to delete the document via a term, since I unwittingly passed the document_id in as a string. Doh :p

Alec

Alexander Lind | 3 Dec 07:07

Re: document_id globally incrementing

Revision on this;
I am mistaken below about the 'global' incrementing of document_id:s
across split databases - however they become incremented like that when
combined in a read only database opener. I assume this is something
Xapian does out of necessity, and that doesn't constitute a problem
since the db:s have their true document_id:s set correctly.

My question then instead is, how come that even though I call
writabledatabase_delete_document()
with the correct index pointer for x sub-index, and a certain
document_id within that index, that the entry for this document_id still
seems to persist in this index?

It was my assumption that this function worked the way I thought it
would that led me to the erroneous conclusion about the document_id:s
below. Since the documents seemed to stay in the indexes after I called
the delete function, I assumed that the document_ids must not be
matching.

Very grateful for anyone's input or ideas on this.

Alec

Alexander Lind wrote:
> Hi All
>
> I have made my xapian indexer automatically create new indexes once it
> reaches X documents in each, and for each document that I add to each
> sub-index, I record its document_id and its index_id (relating to what
> index the document ended up in).
>
> writabledatabase_add_document()  returns document_id:s beginning from 0
> for each new index when you add new documents, like you would expect.
>
> So far all good.
>
> Here is the problem: when I search through the indexes together (using
> database_add_database() on each sub-index), all the document_id:s are
> numbered globally, so it seems while writabledatabase_add_document()
> reset the document_id counter for each subindex, in the subindexes they
> were never reset.
>
> So instead of having 10 sub-indexes with 50 documents in each,
> document_id:s ranging from 0-50 in each, I seem to end up with 10
> sub-indexes with 50 documents in each, document_id:s ranging from 0-49
> in subindex 1, 50-99 in subindex 2, and so on. This would not be a
> problem if writabledatabase_add_document() returned these globally
> incrementing document_id:s, but it doesn't.
>
> Sorry for the lenghty email here, but is this a bug or a feature or am I
> doing something wrong?
>
> Thank you so much for your help.
> Alec
>
>   
Olly Betts | 4 Dec 19:53

Re: Re: writabledatabase_delete_document()

On Fri, Dec 01, 2006 at 12:44:37PM -0800, Alexander Lind wrote:
> When I try to remove them now (using writabledatabase_delete_document()
> via php), it halfway freezes up the machine, and the apache httpd runs
> amok spawning more and more children, until I break the php script that
> is trying to remove documents from xapian.

I take it you're trying to do this from a PHP script run through apache?

Flushing updates to a large database can take a while - it's easy to
update 4 of the 5 tables, but for the postlist table we need to amend
an entry for each term which indexed the document, which is potentially
a lot of entries widely spread through the table.

Note that the cost here scales sublinearly - i.e. it's a lot less than 100
times as expensive to delete 100 documents and flush them in one go than
to delete and flush one document at a time.  Perhaps that's something
you can bear in mind when designing how deletions are handled?

> From my laymens point of view, it seems that the xapian delete document
> function freezes up the OS on a filesystem level. Is this a correct
> assessment?

No.  My best guess is you're probably either suffering from a lot of
swapping, or just disk I/O overload.  Or there may be issues to do with
this running inside Apache too.

Calling WritableDatabase::delete_document() doesn't do much work at all.
The postlist table updates are saved up for when you call
WritableDatabase::flush() (either explicitly, or implicitly as happens
when you close a database, or after 1000 updates).  Some memory is
required to buffer up these changes, but I doubt that's causing you
to swap unless you're really tight on memory.

A flush can require a lot of I/O, so if the other tasks apache is doing
are also I/O bound, you could cause these to run more slowly such that
more try to run at once and things get slow.  But this is not really Xapian's
fault - the server is just overloaded.  You either need to add more RAM
to improve caching and reduce I/O, beef up the disk subsystem so I/O
requests complete sooner, or move work off to another server.

As for running this inside Apache, if I were designing something like
this, I'd lean towards having the web interface lodge a request with an
index management process which does the real work behind the scenes
(this could be as simple as saving a file in a directory saying to
delete document N, and having the index management process scan the
directory for new files).  Then the index management process can batch
up multiple requests, and different users can make requests at the same
time without locking issues.
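
The spool-directory idea above can be sketched roughly as follows. The thread's code is PHP; this sketch uses Python only so that it is self-contained. The function names, the ".del" file suffix, and the delete_document/flush callbacks are all hypothetical stand-ins for illustration, not Xapian API calls.

```python
import os
import tempfile


def lodge_delete_request(spool_dir, docid):
    """Front-end side: drop a small file naming the docid to delete."""
    # Write under a temporary name first, then rename, so the scanner
    # never picks up a half-written request file.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        f.write(str(int(docid)))
    unique = os.path.basename(tmp_path)[len(".tmp-"):]
    final_path = os.path.join(spool_dir, unique + ".del")
    os.rename(tmp_path, final_path)
    return final_path


def process_spool(spool_dir, delete_document, flush):
    """Index-manager side: apply every pending request, then flush once."""
    names = sorted(n for n in os.listdir(spool_dir) if n.endswith(".del"))
    docids = []
    for name in names:
        with open(os.path.join(spool_dir, name)) as f:
            docids.append(int(f.read().strip()))
    for docid in docids:
        delete_document(docid)  # hypothetical callback into the index
    if docids:
        flush()                 # one flush for the whole batch
    # Only remove the request files once the batch has been flushed.
    for name in names:
        os.remove(os.path.join(spool_dir, name))
    return len(docids)
```

Because only one process ever writes to the index, requests from many users simply accumulate as files, and the single flush per batch takes advantage of the sublinear cost mentioned earlier.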

> I just read a post somewhere on the net about how you can use term names
> to ID and delete items out of an xapian index instead of using a
> document_id. Is this faster?  (would seem strange if it was).

Using a term will be slower.  Currently the internal implementation is
to just open the postlist for the term and then call delete_document for
each docid but there's scope for optimising this in the future.  But
even with this optimised I expect it will be faster to delete by docid.

Hmm, if you're calling from PHP, make sure that you're passing a PHP
integer to delete_document, not a PHP string.  If you pass a string
you'll be calling the "delete_document by term" variant.  If you
want to make sure, you can force a value to be an integer by using:
intval($docid)
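
To illustrate why the argument type matters, here is a small sketch in Python (used only so it runs standalone; the thread's code is PHP). The delete_document function below is a hypothetical stand-in that mimics the two variants described above, not the real binding, and the row dictionary is an assumed example of an id fetched from an SQL table, where values often come back as strings.

```python
def delete_document(arg):
    """Hypothetical stand-in: picks a variant based on the argument type."""
    if isinstance(arg, int):
        return ("by_docid", arg)   # cheap: delete one known document
    if isinstance(arg, str):
        return ("by_term", arg)    # treats the value as a term name
    raise TypeError("expected an int docid or a str term")


# An id read back from a database row is frequently a string.
row = {"xapian_docid": "1234"}

wrong = delete_document(row["xapian_docid"])       # string: term variant
right = delete_document(int(row["xapian_docid"]))  # coerced: docid variant
```

Coercing at the call site, as intval($docid) does in PHP, keeps the choice of variant explicit no matter where the value came from.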

> Is the answer to my question here to split the data into multiple databases?

It might help, since you say this worked with smaller databases, but I'd
try moving other work off the server first.  Your database isn't that
enormous.

> Technically I know how to do it, but not logically. How many databases
> should I aim for - ie, should I aim for them not to be over a certain
> size, contain a certain amount of documents, or something else?

It depends on the spec of the server really.  When I'm building gmane's
index from scratch, I build a number of databases of 1 million documents
each and then merge them at the end, as I found that was fastest (though
I didn't do extensive trials).

> Should I distribute documents sequentially into them, or randomly, or
> use some other scheme?

If you ever want to search a particular subset, you could put that in
its own database (or databases).  And if you need to remove old
documents periodically (e.g. anything over a year old) doing it by
date makes sense.  Otherwise it's probably rather arbitrary.

Cheers,
    Olly
Olly Betts | 4 Dec 20:48

Re: Re: document_id globally incrementing

On Sat, Dec 02, 2006 at 10:16:40PM -0800, Alexander Lind wrote:
> writabledatabase_add_document()  returns document_id:s beginning from 0
> for each new index when you add new documents, like you would expect.

Docid 0 is invalid - the returned docids should start at 1.  Perhaps
that's what you meant, but I thought I should clarify.

> So instead of having 10 sub-indexes with 50 documents in each,
> document_id:s ranging from 0-50 in each, I seem to end up with 10
> sub-indexes with 50 documents in each, document_id:s ranging from 0-49
> in subindex 1, 50-99 in subindex 2, and so on. This would not be a
> problem if writabledatabase_add_document() returned these globally
> incrementing document_id:s, but it doesn't.

When you search over a combination of databases, the document ids are
simply interleaved to avoid clashes.  You can easily reverse the
formula used:

http://article.gmane.org/gmane.comp.search.xapian.general/1375
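
For reference, a sketch of the interleaving in Python (chosen only so it runs standalone). It assumes the usual scheme, in which sub-database number i (counting from 0, in the order the databases were added) out of n holds combined docids i+1, i+1+n, i+1+2n and so on. The function names are made up for illustration, and the formula should be checked against the linked article before relying on it.

```python
def to_combined(local_docid, db_index, n_dbs):
    """Map a docid inside one sub-database to the combined docid."""
    if local_docid < 1:
        raise ValueError("docids start at 1")
    if not 0 <= db_index < n_dbs:
        raise ValueError("db_index out of range")
    return (local_docid - 1) * n_dbs + db_index + 1


def from_combined(combined_docid, n_dbs):
    """Reverse the mapping: return (local_docid, db_index)."""
    if combined_docid < 1:
        raise ValueError("docids start at 1")
    db_index = (combined_docid - 1) % n_dbs
    local_docid = (combined_docid - 1) // n_dbs + 1
    return local_docid, db_index
```

With 10 sub-databases, a combined docid from a search result can be turned back into the pair (local docid, sub-database) needed to open the right writable database for an update or delete.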

When writing, you're just accessing the docids in a single database.  It
would be feasible to allow writing to multiple databases managed by a
single object, but nobody's yet done the work required to support this.
There was some discussion of that in this thread:

http://thread.gmane.org/gmane.comp.search.xapian.general/3464

Cheers,
    Olly
Alexander Lind | 4 Dec 20:54

Re: Re: writabledatabase_delete_document()


> I take it you're trying to do this from a PHP script run through apache?

It's via a PHP CLI script.

> Flushing updates to a large database can take a while - it's easy to update 4 of the 5 tables, but for the postlist table we need to amend an entry for each term which indexed the document, which is potentially a lot of entries widely spread through the table. Note that the cost here scales sublinearly - i.e. it's a lot less than 100 times as expensive to delete 100 documents and flush them in one go than to delete and flush one document at a time. Perhaps that's something you can bear in mind when designing how deletions are handled?

Definitely. I'm not calling flush anywhere, just letting a script do all the updates and deletes in a row and letting Xapian flush on its own. Should I make it so all deletes are done by themselves, i.e. without updates interwoven?

>> From my laymens point of view, it seems that the xapian delete document function freezes up the OS on a filesystem level. Is this a correct assessment?
> No. My best guess is you're probably either suffering from a lot of swapping, or just disk I/O overload. Or there may be issues to do with this running inside Apache too.

Apache is not involved here, but swapping and disk I/O are both likely points of failure, especially since the machine is quite busy serving a pretty large site at the same time.

> Calling WritableDatabase::delete_document() doesn't do much work at all. The postlist table updates are saved up for when you call WritableDatabase::flush() (either explicitly, or implicitly as happens when you close a database, or after 1000 updates). Some memory is required to buffer up these changes, but I doubt that's causing you to swap unless you're really tight on memory. A flush can require a lot of I/O, so if the other tasks apache is doing are also I/O bound, you could cause these to run more slowly such that more try to run at once and things get slow. But this is not really Xapian's fault - the server is just overloaded. You either need to add more RAM to improve caching and reduce I/O, beef up the disk subsystem so I/O requests complete sooner, or move work off to another server.

Yep, moving to a different machine that can be dedicated to this is probably my best shot.

> As for running this inside Apache, if I were designing something like this, I'd lean towards having the web interface lodge a request with an index management process which does the real work behind the scenes (this could be as simple as saving a file in a directory saying to delete document N, and having the index management process scan the directory for new files). Then the index management process can batch up multiple requests, and different users can make requests at the same time without locking issues.

It's free-standing PHP CLI scripts doing this behind-the-scenes job for me, and they also use a bunch of tables in a MySQL db to know what they should do - add, update or delete documents in the Xapian index.
I know the MySQL server (on another machine) is not the bottleneck; rather, the server that I run the scripts on must simply be overloaded, just as you say.
In fact I know this is often the case, because when I kill other resource-hogging scripts the Xapian indexer speeds up. However, that made no difference at all when I was trying to delete documents the other day. But that of course was because I was passing the docid as a string, not an int. Smart eh? :p

>> I just read a post somewhere on the net about how you can use term names to ID and delete items out of an xapian index instead of using a document_id. Is this faster? (would seem strange if it was).
> Using a term will be slower. Currently the internal implementation is to just open the postlist for the term and then call delete_document for each docid but there's scope for optimising this in the future. But even with this optimised I expect it will be faster to delete by docid.

That's what I thought. Cool.

> Hmm, if you're calling from PHP, make sure that you're passing a PHP integer to delete_document, not a PHP string. If you pass a string you'll be calling the "delete_document by term" variant. If you want to make sure, you can force a value to be an integer by using: intval($docid)

Yep, that was the root of a long session of bug-hunting the other night.
Can't believe I didn't figure it out sooner though, given that I had already made sure docids are passed as ints to the update function.

>> Is the answer to my question here to split the data into multiple databases?
> It might help, since you say this worked with smaller databases, but I'd try moving other work off the server first. Your database isn't that enormous.

Can't do that until I have set up a new machine, but the multiple-db patch is already done and implemented, so I will see how that pans out. The index is rebuilding right now. Exciting :)

>> Technically I know how to do it, but not logically. How many databases should I aim for - ie, should I aim for them not to be over a certain size, contain a certain amount of documents, or something else?
> It depends on the spec of the server really. When I'm building gmane's index from scratch, I build a number of databases of 1 million documents each and then merge them at the end, as I found that was fastest (though I didn't do extensive trials).

I tried 250k docs per sub-db this time. But I have made it so I can adjust this limit without rebuilding the entire db, so it can change later.

Question: how do you merge the sub-dbs at the end, for the search functions?

The machine I use is a dual P3 Xeon at 1200 MHz with 3.5 GB of RAM.
It is quite overworked serving a busy website, various stats-generating scripts, and other misc scripts. Moving the Xapian stuff to a different server is my next step.

>> Should I distribute documents sequentially into them, or randomly, or use some other scheme?
> If you ever want to search a particular subset, you could put that in its own database (or databases). And if you need to remove old documents periodically (e.g. anything over a year old) doing it by date makes sense. Otherwise it's probably rather arbitrary.

Yeah, documents could stay in the index for a day or for 5 years, so no need for me to think about that then.

Thank you so much for your help Olly, very much appreciated.

Alec

> Cheers,
>     Olly
Alexander Lind | 4 Dec 20:58

Re: Re: document_id globally incrementing


> Docid 0 is invalid - the returned docids should start at 1. Perhaps that's what you meant, but I thought I should clarify.

Yes, sorry, my mistake.

>> So instead of having 10 sub-indexes with 50 documents in each, document_id:s ranging from 0-50 in each, I seem to end up with 10 sub-indexes with 50 documents in each, document_id:s ranging from 0-49 in subindex 1, 50-99 in subindex 2, and so on. This would not be a problem if writabledatabase_add_document() returned these globally incrementing document_id:s, but it doesn't.
> When you search over a combination of databases, the document ids are simply interleaved to avoid clashes. You can easily reverse the formula used: http://article.gmane.org/gmane.comp.search.xapian.general/1375

Good to know. I figured this out after a while. Sorry for spamming the mailing list with redundant questions.

> When writing, you're just accessing the docids in a single database. It would be feasible to allow writing to multiple databases managed by a single object, but nobody's yet done the work required to support this. There was some discussion of that in this thread: http://thread.gmane.org/gmane.comp.search.xapian.general/3464

I have implemented exactly that in my Xapian index class, but it's PHP only of course. Perhaps something I could submit to the Xapian wiki pages later? Not until I have tested it a bit more so I know it's stable, though.

Thanks
Alec

> Cheers,
>     Olly
