Kevin Duraj | 1 Jul 03:12

Re: Filesystems

Based on my observation, Flint runs best on ext3 filesystems, but you
might want to experiment with the filesystem block size. Depending on
your data size, you might try increasing the filesystem block size,
which may improve performance, since fewer reads need to be performed
when reading a Xapian Flint database. Try reformatting your hard disk
with a larger ext3 filesystem block size and let us know.
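
Before reformatting, it may help to check which block size the
filesystem holding the database currently reports. A minimal sketch,
assuming a Unix-like system where Python's os.statvfs is available; the
function name and the "." path are placeholders, not anything from
Xapian itself:

```python
import os


def filesystem_block_size(path):
    """Return the block size, in bytes, reported by the filesystem holding `path`."""
    return os.statvfs(path).f_bsize


if __name__ == "__main__":
    # "." is a placeholder; point it at the directory holding the Flint database.
    print(filesystem_block_size("."))
```

Comparing this value before and after a reformat confirms that the new
block size actually took effect.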

Kevin Duraj
http://myhealthcare.com/

On Tue, Jun 23, 2009 at 6:10 AM, Olly Betts<olly <at> survex.com> wrote:
> On Tue, Jun 23, 2009 at 05:49:44PM +0930, Frank J Bruzzaniti wrote:
>> Just wondering if anyone, from their experience, has any advice regarding
>> which filesystems perform best with Xapian's Flint backend.
>
> I'm not aware of any comparative tests (I've been intending to perform
> some myself for ages, but never actually got around to it).  If anyone
> has any figures, I'd be interested to hear.
>
> Cheers,
>    Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss <at> lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
James Aylett | 1 Jul 03:18

Re: Filesystems

On Tue, Jun 30, 2009 at 06:12:55PM -0700, Kevin Duraj wrote:

> Based on my observation Flint runs best on ext3 filesystems

I don't suppose you're able to share any of your numbers from this? I
know it's not always possible, but having something on the wiki would
be useful to people, if only to point them in useful directions of how
to construct their own testing. (I assume you were comparing against
JFS and XFS, maybe Reiser?)

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
Kevin Duraj | 1 Jul 03:38

Re: Deleting documents with scriptindex

I was also wondering how you can quickly delete a document using
scriptindex. I tried all the possibilities, but was not able to figure
it out. Can somebody write a simple command-line example?

Thanks
-Kevin

On Thu, Jun 18, 2009 at 9:08 AM, Olly Betts<olly <at> survex.com> wrote:
> On Thu, Jun 18, 2009 at 11:24:14AM -0400, Jason Chapin wrote:
>> I can't seem to find an answer for this, or I'm going about it wrong.
>> The situation is this, I use scriptindex on a dump of MySQL product data
>> to update the Xapian database. The product catalog updates about once a
>> month, wherein existing products can be updated or deleted and new
>> products added.
>
> If you feed scriptindex a record which only specifies a unique id then
> it will remove any record with that unique id (documented in
> scriptindex.html under the unique action).
>
>> Would it be best to just delete the database and rebuild it for each
>> product update?
>
> That would work too.  If most products change then it's probably quicker
> to just rebuild from scratch.
>
> Cheers,
>    Olly
>
> _______________________________________________
> Xapian-discuss mailing list
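
Olly's description above (a record that specifies only the unique id
removes the matching document) can be sketched as follows. This is an
illustration under stated assumptions: the field name "id", the file
names, and the database path are hypothetical, and the index script must
apply the unique action to that field.

```python
# Index script for scriptindex (hypothetical file name: products.script).
# "id" is an assumed field name; Q is the conventional unique-id prefix.
INDEX_SCRIPT = "id : boolean=Q unique=Q\n"


def deletion_records(ids, field="id"):
    """Build scriptindex input in which each record holds only the unique id."""
    # Records are separated by a blank line; a record with nothing but the
    # unique id asks scriptindex to remove the matching document.
    return "\n".join(f"{field}={doc_id}\n" for doc_id in ids)


if __name__ == "__main__":
    # Save this output as e.g. deletions.dump, save INDEX_SCRIPT as
    # products.script, then run (database path is a placeholder):
    #   scriptindex /path/to/db products.script deletions.dump
    print(deletion_records(["1001", "1002"]), end="")
```

For a monthly catalog update, the same dump file could carry full
records for new or changed products alongside id-only records for
deleted ones.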

Kevin Duraj | 1 Jul 04:23

Re: Filesystems

I don't have any performance test numbers. I have been running the Xapian
search engine on different Linux distros with different filesystems
and felt that ext3 was the best. However, Wikipedia says that ext3
might be slower than other filesystems, in which case my assumption
would be wrong. It would be good to have some tests performed with the
same data on different filesystems and see what happens.

The ext3 or third extended filesystem ...
Although its performance (speed) is less attractive than competing
Linux filesystems such as JFS, ReiserFS and XFS, it has a significant
advantage in that it allows in-place upgrades from the ext2 file
system without having to back up and restore data. Ext3 also uses less
CPU power than ReiserFS and XFS. It is also considered safer than the
other Linux file systems due to its relative simplicity and wider
testing base.

Reference: http://en.wikipedia.org/wiki/Ext3

Kevin Duraj
http://myhealthcare.com/

On Tue, Jun 30, 2009 at 6:18 PM, James Aylett<james-xapian <at> tartarus.org> wrote:
> On Tue, Jun 30, 2009 at 06:12:55PM -0700, Kevin Duraj wrote:
>
>> Based on my observation Flint runs best on ext3 filesystems
>
> I don't suppose you're able to share any of your numbers from this? I
> know it's not always possible, but having something on the wiki would
> be useful to people, if only to point them in useful directions of how
> to construct their own testing. (I assume you were comparing against

Matt Chen | 1 Jul 05:37

Re: Database flush problem

On Mon, Jun 29, 2009 at 5:43 PM, Matt Chen<ceator <at> gmail.com> wrote:
>> You might want to read this thread:
>>
>>    http://lists.xapian.org/pipermail/xapian-discuss/2009-June/006804.html
>>
>> rgh
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss <at> lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>
>
> Thanks for the reply.
>
> I have read that thread before, and I added flush() at the end of the
> database update, but the search results don't change unless I
> restart the search daemon. Also, flush() returns void, so how can I
> know whether it was flushed?
>
> Is there something I'm missing? I'm using the Python bindings, and I
> flush the database like this:
>
> self.database.flush()
>
>
> Matt
>

I did the test several times. The API docs said it will be changed

Re: Filesystems

My colleague is testing several filesystems on our new search-machine. 
He has been looking at a few of the filesystems available in the 2.6.30 
linux kernel, ext2/3/4, xfs, btrfs, nilfs2 and reiser4.

"Unfortunately" the new machine has 24GB ram and 4x ssd in raid5. To get 
somewhat IO-bound results we had to cripple the machine (by making sure 
it couldn't use 20gb of those 24gb for file-cache) *and* cherry-pick our 
queries (only the heaviest with phrase-queries and such).

In the normal scenario of having the full ram (or even 4gb) available 
and the ssd's backing up any cache-miss, it is simply cpu-bound. And 
that is with the fastest x86 2-socket cpu's available right now, a pair 
of intel X5570's. The good news is that it actually appears to scale 
very well when using more cpu-cores (this one has 8 cores with 8 
hyper-threading cores) and that we can get about 90 searches per second 
out of it, which is more than we do now per minute (and we haven't 
benchmarked the compacted database yet).

I.e. our results indicate that for our reads it hardly matters which 
filesystem we pick; most of the database will be in RAM anyway.

With the crippled, extra-IO read scenario, we do see differences in 
performance between the filesystems tested.

When finished, we'll have numbers for linear writes (copying the 25GB 
database from another disk array), non-linear writes (updating the 
database) and semi-linear writes (compacting the database) with 
semi-linear (memory-backed) reads.
And of course the numbers for the crippled read-scenario.


Re: Filesystems

Arjen,

Are you using any mount option optimisations like noatime or noboundary 
with xfs?
Are you using any mount option optimisations with ext4?

 
Arjen van der Meijden wrote:
> My colleague is testing several filesystems on our new search-machine. 
> He has been looking at a few of the filesystems available in the 2.6.30 
> linux kernel, ext2/3/4, xfs, btrfs, nilfs2 and reiser4.
>
> "Unfortunately" the new machine has 24GB ram and 4x ssd in raid5. To get 
> somewhat IO-bound results we had to cripple the machine (by making sure 
> it couldn't use 20gb of those 24gb for file-cache) *and* cherry-pick our 
> queries (only the heaviest with phrase-queries and such).
>
> In the normal scenario of having the full ram (or even 4gb) available 
> and the ssd's backing up any cache-miss, it is simply cpu-bound. And 
> that is with the fastest x86 2-socket cpu's available right now, a pair 
> of intel X5570's. The good news is that it actually appears to scale 
> very well when using more cpu-cores (this one has 8 cores with 8 
> hyper-threading cores) and that we can get about 90 searches per second 
> out of it, which is more than we do now per minute (and we haven't 
> benchmarked the compacted database yet).
>
> I.e. our results indicate that for our reads it hardly matters which 
> filesystem we pick; most of the database will be in RAM anyway.
>
> With the crippled, extra-io, read-scenario, we do see differences in 

Re: Filesystems

Apart from noatime, we haven't used any specific additional mount or 
mkfs options yet. But we're going to check a few to see how they do.

Best regards,

Arjen

On 1-7-2009 12:00, Frank John Bruzzaniti wrote:
> Arjen,
> 
> Are you using any mount options optimisations like noatime or noboundary 
> with xfs?
> Are you using any mount option optimisations with ext4?
> 
> 
> Arjen van der Meijden wrote:
>> My colleague is testing several filesystems on our new search-machine. 
>> He has been looking at a few of the filesystems available in the 
>> 2.6.30 linux kernel, ext2/3/4, xfs, btrfs, nilfs2 and reiser4.
>>
>> "Unfortunately" the new machine has 24GB ram and 4x ssd in raid5. To 
>> get somewhat IO-bound results we had to cripple the machine (by making 
>> sure it couldn't use 20gb of those 24gb for file-cache) *and* 
>> cherry-pick our queries (only the heaviest with phrase-queries and such).
>>
>> In the normal scenario of having the full ram (or even 4gb) available 
>> and the ssd's backing up any cache-miss, it is simply cpu-bound. And 
>> that is with the fastest x86 2-socket cpu's available right now, a 
>> pair of intel X5570's. The good news is that it actually appears to 
>> scale very well when using more cpu-cores (this one has 8 cores with 8 

Olly Betts | 3 Jul 06:24

Re: Database flush problem

On Wed, Jul 01, 2009 at 11:37:38AM +0800, Matt Chen wrote:
> I did the test several times. The API docs said it will be changed
> automatically after closing the database, but it didn't change anyway.
> I also added flush() before the index process ends; it didn't work.

The thread Richard referred to was talking about the situation where the
index process is persistent and also services search requests.

> Should I reconstruct the database object that the search daemon holds
> every time? Any suggestions would be great.

You can just call Database::reopen() which is more efficient than
reopening from scratch - see the API docs for details.
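
A minimal sketch of that approach for a search daemon using the Python
bindings. It assumes the daemon holds a xapian.Database opened for
reading; the SearchService class and the run_query callable are
hypothetical names used only for illustration:

```python
class SearchService:
    """Keeps one long-lived read-only database handle for a search daemon."""

    def __init__(self, database):
        # `database` is expected to be a xapian.Database opened for reading.
        self.database = database

    def search(self, run_query):
        # Pick up any revisions committed by the indexer since the last
        # search, without constructing a new Database object.
        self.database.reopen()
        # `run_query` is a hypothetical callable that builds an Enquire on
        # the handle and returns the results.
        return run_query(self.database)
```

Calling reopen() before every search is the simplest policy; a busy
daemon might instead reopen on a timer or when the indexer signals it.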

Cheers,
    Olly
James Cauwelier | 5 Jul 09:58

advanced sorting suggestions

Hi,

Let me first describe my problem:

PROBLEM

  - I want to query Xapian (no problem here).
  - I want to sort by a different kind of relevance, namely the
    popularity of the document relative to the search query. I want the
    sort order to be influenced by clicking patterns after a query has
    been executed. If a search for 'tomato' gives a set of results and
    90% of the users click on the third result, then I want to move this
    document to the top of the result set. I would need a logger that
    records this behavior, but I would also need to sort by a generated
    key, if I am not mistaken. By default, I would like to fall back to
    Xapian's standard relevance sorting.
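
The desired ordering can be sketched as a post-processing step over an
already-ranked result list. This is not a Xapian::Sorter subclass; it is
a plain re-ranking written in Python purely for illustration, and the
function name, the click_counts mapping (document id to clicks logged
for this query), and min_clicks are all assumptions:

```python
def rerank_by_clicks(ranked_ids, click_counts, min_clicks=1):
    """Move clicked documents ahead of unclicked ones.

    `ranked_ids` is the result list in relevance order. Documents with
    fewer than `min_clicks` clicks keep their relevance order, and ties
    between equally clicked documents are broken by relevance position.
    """
    def sort_key(item):
        position, doc_id = item
        clicks = click_counts.get(doc_id, 0)
        if clicks < min_clicks:
            clicks = 0
        # More clicks sort first; position preserves relevance order.
        return (-clicks, position)

    ordered = sorted(enumerate(ranked_ids), key=sort_key)
    return [doc_id for _, doc_id in ordered]


if __name__ == "__main__":
    # The third result was clicked most often, so it moves to the top.
    print(rerank_by_clicks([10, 20, 30], {30: 9}))
```

With no logged clicks for a query, the list comes back unchanged, which
gives the fallback to relevance ordering.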

I am a PHP programmer and that's pretty much all I do (I've done
some Perl programming in the past too), so I think I am in way over my
head with this.

MY QUESTION

Can I subclass Xapian::Sorter in PHP, as suggested in the
documentation? Or do I need to learn C++? Has this already been done
in a free, open-source solution? I do not want to reinvent the
wheel...

Thanks a lot for taking the time to answer this question.  I  
appreciate it.

