Olly Betts | 11 Aug 2004 23:00

goodbye QuartzBufferedTable

I've finally managed to sort out the problems with my patch to eliminate
QuartzBufferedTable and checked it in.  I'm currently building a large
database to test this.  The rate was quite a bit faster initially but
after around 200K documents it has settled down to being very slightly
faster than before.

Even if the eventual rate is much the same, this at least improves the
time taken to build smaller databases and reduces the complexity of the
code in the quartz backend.

Once this test has completed, I'll try upping the number of documents
per batch and see if that improves the rate.

Cheers,
    Olly
Olly Betts | 12 Aug 2004 19:47

Re: goodbye QuartzBufferedTable

On Wed, Aug 11, 2004 at 10:00:20PM +0100, Olly Betts wrote:
> Once this test has completed, I'll try upping the number of documents
> per batch and see if that improves the rate.

Upping the number of documents per batch to 2000 increases the
throughput quite dramatically - once things get going the gain is
50-60%, and it appears to be doing relatively better as the database
size increases (currently this run has done about 700K documents).

Cheers,
    Olly
Arjen van der Meijden | 12 Aug 2004 22:32

Re: goodbye QuartzBufferedTable

On 12-8-2004 19:47, Olly Betts wrote:

> On Wed, Aug 11, 2004 at 10:00:20PM +0100, Olly Betts wrote:
> 
>>Once this test has completed, I'll try upping the number of documents
>>per batch and see if that improves the rate.
> 
> 
> Upping the number of documents per batch to 2000 increases the
> throughput quite dramatically - once things get going the gain is
> 50-60%, and it appears to be doing relatively better as the database
> size increases (currently this run has done about 700K documents).

That sounds very good. Let me know whenever that is test-ready and you'd 
like a/another test on some of (our) real-life data :)
Does upping the limit further result in even more throughput?

Best regards,

Arjen van der Meijden
Olly Betts | 13 Aug 2004 01:56

Re: goodbye QuartzBufferedTable

On Thu, Aug 12, 2004 at 10:32:07PM +0200, Arjen van der Meijden wrote:
> On 12-8-2004 19:47, Olly Betts wrote:
> 
> >On Wed, Aug 11, 2004 at 10:00:20PM +0100, Olly Betts wrote:
> >
> >>Once this test has completed, I'll try upping the number of documents
> >>per batch and see if that improves the rate.
> >
> >Upping the number of documents per batch to 2000 increases the
> >throughput quite dramatically - once things get going the gain is
> >50-60%, and it appears to be doing relatively better as the database
> >size increases (currently this run has done about 700K documents).
> 
> That sounds very good. Let me know whenever that is test-ready and you'd 
> like a/another test on some of (our) real-life data :)

Pretty much now I think.  I should add something so the batch size can
be set without recompiling though.

> Does upping the limit further result in even more throughput?

I don't know - I'm letting each build run for a while to see how well
the changes scale.  So far I've only tried 1000 and 2000, but I'll keep
going to see roughly what is optimal.  I suspect it'll depend on the
data and the hardware (RAM size in particular).  Sadly I doubt it'll be
possible to make it self-tune well.

Cheers,
    Olly
Arjen van der Meijden | 13 Aug 2004 09:10

Re: goodbye QuartzBufferedTable

On 13-8-2004 1:56, Olly Betts wrote:
> On Thu, Aug 12, 2004 at 10:32:07PM +0200, Arjen van der Meijden wrote:
> 
>>On 12-8-2004 19:47, Olly Betts wrote:
>>That sounds very good. Let me know whenever that is test-ready and you'd 
>>like a/another test on some of (our) real-life data :)
> 
> Pretty much now I think.  I should add something so the batch size can
> be set without recompiling though.

I'll watch the cvs-commits for this. Will you also allow a switch (or an 
environment value or whatever) on scriptindex to adjust this?

>>Does upping the limit further result in even more throughput?
> 
> I don't know - I'm letting each build run for a while to see how well
> the changes scale.  So far I've only tried 1000 and 2000, but I'll keep
> going to see roughly what is optimal.  I suspect it'll depend on the
> data and the hardware (RAM size in particular).  Sadly I doubt it'll be
> possible to make it self-tune well.

Making it runtime/startuptime adjustable will at least allow easier 
searching for semi-optimal values. Finding the real-optimal values will 
probably cost a lot of extra time, while not really improving the 
performance that much. Apart from the fact that it may vary over time; 
database size and structure and data input can differ quite a bit over a 
long index batch.
Currently we allow scriptindex to either run with 1000 documents or a 
set of documents that results in 16MB of data (whichever limit comes 
first) and that makes scriptindex use amounts in the range of 150-250MB 
Olly Betts | 13 Aug 2004 13:54

Re: goodbye QuartzBufferedTable

On Fri, Aug 13, 2004 at 09:10:44AM +0200, Arjen van der Meijden wrote:
> On 13-8-2004 1:56, Olly Betts wrote:
> > I should add something so the batch size can be set without
> > recompiling though.
> 
> I'll watch the cvs-commits for this. Will you also allow a switch (or an 
> environment value or whatever) on scriptindex to adjust this?

For the time being, I'll probably just pull the value from an
environment variable inside quartz itself.  We should also look at
whether a document count based flush is the best approach - now that
we only cache changed postings in memory, counting the number of
cached postings might be more appropriate since that'll mostly
dictate memory usage and how much work the merging step does.

> Making it runtime/startuptime adjustable will at least allow easier 
> searching for semi-optimal values. Finding the real-optimal values will 
> probably cost a lot of extra time, while not really improving the 
> performance that much.

I believe we can pick a reasonable default for most users.  If you've
got 10,000,000 documents, it's worth your while spending a bit of time
tuning.

Also, with a smaller collection, it's nice to be able to see documents
searchable while the indexer is still running.  With a large collection
you'd rather get the indexing done sooner.

Perhaps omindex (and maybe scriptindex) ought to force a flush after 10,
100, 1000 documents or something like that.  Mind you, my first batch of
Olly Betts | 13 Aug 2004 19:19

Re: goodbye QuartzBufferedTable

On Fri, Aug 13, 2004 at 12:54:40PM +0100, Olly Betts wrote:
> On Fri, Aug 13, 2004 at 09:10:44AM +0200, Arjen van der Meijden wrote:
> > On 13-8-2004 1:56, Olly Betts wrote:
> > > I should add something so the batch size can be set without
> > > recompiling though.
> > 
> > I'll watch the cvs-commits for this. Will you also allow a switch (or an 
> > environment value or whatever) on scriptindex to adjust this?
> 
> For the time being, I'll probably just pull the value from an
> environment variable inside quartz itself.  We should also look at
> whether a document count based flush is the best approach - now that
> we only cache changed postings in memory, counting the number of
> cached postings might be more appropriate since that'll mostly
> dictate memory usage and how much work the merging step does.

I don't have easy access to the number of cached postings (I'd have to
tally them myself), but I do have access to the change in document
length which is similar (except it adds the wdfs for the postings
rather than counting them) so I've added an option to flush on that
too.
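
To illustrate the difference between the two measures, here is a minimal
Python sketch (not Xapian code; the function name and the whitespace
tokenisation are assumptions made only for this example).  It counts the
postings a document would contribute and, separately, sums their wdfs,
which is the document length:

```python
from collections import Counter

def posting_stats(text):
    """Return (number of postings, document length) for one document.

    Hypothetical helper: terms are just lowercased whitespace-separated
    words, which is far simpler than a real indexer.
    """
    wdf = Counter(text.lower().split())  # term -> within-document frequency
    num_postings = len(wdf)              # one posting per distinct term
    doc_length = sum(wdf.values())       # sum of the wdfs of those postings
    return num_postings, doc_length

if __name__ == "__main__":
    print(posting_stats("the cat sat on the mat"))
```

The two numbers only coincide when every term occurs exactly once, so a
length-based threshold fires somewhat earlier than a posting-count one
would for repetitive text.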

Set XAPIAN_FLUSH_THRESHOLD=5000 in the environment (and export it!) to
flush every 5000 documents or XAPIAN_FLUSH_THRESHOLD_LENGTH=1000000 to
flush every 1000000 total change in document length.  Set both to flush
whichever is reached first.  Set neither and the default is to flush
every 1000 documents as before.
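
The rule above can be sketched roughly as follows.  This is a Python
illustration only, not the quartz implementation: the class and method
names are hypothetical, and it assumes that setting only the length
threshold disables the document-count limit, which the description
above does not spell out.

```python
import os

DEFAULT_DOC_THRESHOLD = 1000  # default when neither variable is set

class FlushPolicy:
    """Hypothetical sketch of 'flush on whichever threshold is reached first'."""

    def __init__(self, environ=os.environ):
        docs = environ.get("XAPIAN_FLUSH_THRESHOLD")
        length = environ.get("XAPIAN_FLUSH_THRESHOLD_LENGTH")
        self.doc_limit = int(docs) if docs else None
        self.length_limit = int(length) if length else None
        # Assumption: the default only applies when neither variable is set.
        if self.doc_limit is None and self.length_limit is None:
            self.doc_limit = DEFAULT_DOC_THRESHOLD
        self.docs = 0
        self.length = 0

    def add_document(self, length_change):
        """Record one changed document; return True if a flush is now due."""
        self.docs += 1
        self.length += length_change
        due = ((self.doc_limit is not None and self.docs >= self.doc_limit) or
               (self.length_limit is not None and self.length >= self.length_limit))
        if due:
            # Counters start again after each flush.
            self.docs = 0
            self.length = 0
        return due
```

An indexing loop would call add_document() once per document and flush
the database whenever it returns True.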

I'm now trying flushing every 5000 documents - this seems to work
very well.  After an initial period I so far get a sustained 160
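documents per second (currently it's done 519K documents).  I'm using
CVS HEAD with the "dangerous" quartz patch (which I sent to the mailing
list about 2 months ago).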

Samuel Liddicott | 13 Aug 2004 20:56

Re: goodbye QuartzBufferedTable


Olly Betts wrote:

>
>I'm now trying flushing every 5000 documents - this seems to work
>very well.  After an initial period I so far get a sustained 160
>documents per second (currently it's done 519K documents).  I'm using
>CVS HEAD with the "dangerous" quartz patch (which I sent to the mailing
>list about 2 months ago).
>  
>
Whoa! Stop there. This is exciting. You are getting 160 documents per 
second even with half a million in the index?
Wow. That's TOO much. [stops to jump up and down crazily] Wow.

Gosh.

Does "dangerous" mean it could corrupt the btree if interrupted?

Sam
Olly Betts | 13 Aug 2004 23:04

Re: goodbye QuartzBufferedTable

On Fri, Aug 13, 2004 at 07:56:42PM +0100, Sam Liddicott wrote:
> Whoa! Stop there. This is exciting. You are getting 160 documents per 
> second even with half a million in the index?

Yes.  160/sec was sustained until around 600K docs, then it started to
tail off a little.  Currently it's doing 110/sec with 1800K docs.

I'll try flushing after 10K docs next I guess.

> Wow. That's TOO much. [stops to jump up and down crazily] Wow.

I think there's still lots of scope for improvement.  I've only tried
performance tuning a few things so far.

> Does "dangerous" mean it could corrupt the btree if interrupted?

Yes - if the power fails or the program is shut down uncleanly, the
database might be corrupt (normally a Xapian database should survive
either event).  But that doesn't give a huge boost - compare the two
graphs on the old box:

http://www.survex.com/~olly/gmaneindexrate.html

"No positional information and O_STREAMING, reuse btree object, no
fdatasync, "dangerous" patch applied:"

vs:

"No positional information and O_STREAMING, reuse btree object, no
fdatasync:"
Arjen van der Meijden | 13 Aug 2004 23:31

Re: goodbye QuartzBufferedTable

On 13-8-2004 19:19, Olly Betts wrote:
> On Fri, Aug 13, 2004 at 12:54:40PM +0100, Olly Betts wrote:
> 
>>You should find the memory usage is a lot lower now.  Before we were
>>buffering changes to the tables in memory, which used a lot of memory.
>>Now we just update the Btree and leave it to the OS to cache blocks
>>from the Btree which appears to be a better use of the memory.
> 
> As a data point, my indexer (a custom one, but of similar complexity to
> scriptindex) seems to level off at around 60MB with CVS HEAD.

I ran a small test, using an existing database (915788 documents, 13GB).
The test consisted of 1824 documents, totalling 103MB of textual data
for scriptindex.

The normal 0.8.1:
real    54m31.446s
user    7m44.730s
sys     0m54.070s

The cvs-head:
real    23m17.052s
user    6m1.310s
sys     0m13.110s

The first one consumed up to 521MB of memory (480MB resident), while the
second only used 85MB; that alone accounts for quite a performance boost,
I suppose (the machine only has 1GB of memory). The second was set to
flush after 5000 documents (i.e. after all the work was done).
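
For a rough sense of scale, the wall-clock figures above can be turned
into a speedup and an indexing rate.  The Python below is only
arithmetic on the numbers quoted in this message; the variable and
function names are made up for the example.

```python
def seconds(minutes, secs):
    """Convert a 'real' time of the form XmY.Zs into seconds."""
    return minutes * 60 + secs

DOCS = 1824                     # documents added in the test

old_real = seconds(54, 31.446)  # 0.8.1
new_real = seconds(23, 17.052)  # cvs-head

speedup = old_real / new_real   # how many times faster in wall-clock terms
old_rate = DOCS / old_real      # documents per second, 0.8.1
new_rate = DOCS / new_real      # documents per second, cvs-head

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"0.8.1:    {old_rate:.2f} docs/sec")
    print(f"cvs-head: {new_rate:.2f} docs/sec")
```

That works out to a bit more than twice as fast for this particular
run, on a database that already held over 900K documents.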
