Olly Betts | 1 Nov 02:02

Re: Perl threads and Xapian - incompatibility?

On Sat, Oct 20, 2007 at 04:30:46AM +0100, Olly Betts wrote:
> I just had a go at writing a testcase to ensure this works, but some
> classes still SEGV.  I suspect it's those which are implemented only by
> XS code.  I'll investigate.

Yes, it was indeed just the Stem class which was only an XS wrapper.

I've now added CLONE_SKIP subs to all the classes and written a test
which checks that this actually works.

It appears Perl 5.8.8 has a bug - the documentation says objects of a
class whose CLONE_SKIP returns true are copied as unblessed undef values,
but they aren't undef.  I've filed a bug report in Perl's bug system, but
this doesn't stop CLONE_SKIP being useful.

So this should be resolved in Search::Xapian 1.0.4.0, which I've just
uploaded to CPAN.

Cheers,
    Olly
Ron Kass | 1 Nov 02:09

Re: Perl threads and Xapian - incompatibility?

Brilliant.

And good to see that Perl gets better in the process as well :)

I hope (while not holding my breath) that the upcoming Perl 6 doesn't
carry its own set of glitches with it.

Best regards,

Ron

Olly Betts wrote:

> On Sat, Oct 20, 2007 at 04:30:46AM +0100, Olly Betts wrote:
>   
>> I just had a go at writing a testcase to ensure this works, but some
>> classes still SEGV.  I suspect it's those which are implemented only by
>> XS code.  I'll investigate.
>>     
>
> Yes, it was indeed just the Stem class which was only an XS wrapper.
>
> I've now added CLONE_SKIP subs to all the classes and written a test
> which checks that this actually works.
>
> It appears Perl 5.8.8 has a bug - the documentation says objects of a
> class whose CLONE_SKIP returns true are copied as unblessed undef values,
> but they aren't undef.  I've filed a bug report in Perl's bug system, but
> this doesn't stop CLONE_SKIP being useful.
>
> So this should be resolved in Search::Xapian 1.0.4.0, which I've just
> uploaded to CPAN.

Olly Betts | 1 Nov 07:09

Re: Perl threads and Xapian - incompatibility?

On Thu, Nov 01, 2007 at 03:09:15AM +0200, Ron Kass wrote:
> I hope (while not holding my breath) that the upcoming Perl 6 doesn't
> carry its own set of glitches with it.

My (somewhat hazy) understanding is that Perl 6 is intended to have a
cleaner native extension API, but with a compatibility wrapper to allow
XS modules to work.

I don't think Search::Xapian does anything particularly weird with XS
- the only unusual aspect I'm aware of is using C++ with XS - so
hopefully it won't require much effort to get it to work with Perl 6.

Cheers,
    Olly
Olly Betts | 1 Nov 17:58

Search::Xapian 1.0.4.0 released

I've uploaded Search::Xapian 1.0.4.0 to CPAN.  For your convenience
(especially since files can take a while to propagate to the CPAN
mirrors) I've also uploaded a copy to oligarchy.co.uk - both copies are
linked to from the Xapian download page:

http://www.xapian.org/download.php

The main changes in this release are documentation improvements and
fixes to work with 'use threads;' (provided you're using Perl >= 5.8.7).

Cheers,
    Olly
Aleph Thomas | 1 Nov 19:53

Re: Search::Xapian 1.0.4.0 released

I now have a new question. My app is written in Java, and I want to
index all the files on my PC. Can I use Xapian from multiple threads?
Any examples or suggestions about what I need to do?

  Thanks

      al
Olly Betts | 1 Nov 22:11

Java threads

On Thu, Nov 01, 2007 at 01:53:28PM -0500, Aleph Thomas wrote:
> I now have a new question

If you have a new question, please start a new thread.  If you just
reply to a random unrelated message, it's harder to follow the
discussion and more likely your mail will be overlooked.  If you
make it easier for people to help you, you're more likely to get
help.

> My app is written in Java, and I want to index all the files on my
> PC. Can I use Xapian from multiple threads? Any examples or
> suggestions about what I need to do?

I'm not aware of any examples showing threaded use from Java.

The key thing to note is that a Xapian object can't be safely used from
multiple threads at once - you either need to use separate objects in
each thread, or protect the objects with a mutex.
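
As a rough illustration of the mutex option, here is a minimal Java sketch.
UnsafeCounter is a hypothetical stand-in for any object that isn't safe to
share between threads (it is not a Xapian class), and LockedCounter and
runDemo are names made up for this example.  The point is only that every
access to the shared object goes through the same lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for an object that must not be used from
// several threads at once (not part of the Xapian API).
final class UnsafeCounter {
    private int value = 0;

    void increment() {
        value = value + 1; // read-modify-write, not atomic
    }

    int get() {
        return value;
    }
}

// Wraps the unsafe object so that every access happens under one lock.
final class LockedCounter {
    private final UnsafeCounter inner = new UnsafeCounter();
    private final ReentrantLock lock = new ReentrantLock();

    void increment() {
        lock.lock();
        try {
            inner.increment();
        } finally {
            lock.unlock();
        }
    }

    int get() {
        lock.lock();
        try {
            return inner.get();
        } finally {
            lock.unlock();
        }
    }

    // Starts `threads` threads which each call increment() `perThread` times,
    // waits for them all, and returns the final count.
    static int runDemo(int threads, int perThread) {
        LockedCounter counter = new LockedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter.get();
    }
}
```

Because no update can interleave with another, LockedCounter.runDemo(4, 1000)
returns 4000.  The other option (a separate object per thread) needs no lock
at all.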

For indexing, you can only have one WritableDatabase object for each
database, so you probably want to have a single thread handling updates,
and other threads passing Document objects to it to be added.

But assuming by "PC" you mean a typical desktop machine, I doubt it's
worth multithreading indexing of files - the main bottleneck is likely
to be reading data from the disks, and multithreading that probably
won't help performance.  Most PCs don't have a RAID for the disk
subsystem, so interleaving file accesses will probably just send the
disk head seeking back and forth.


Andrey | 2 Nov 05:49

weight question

Hi,

How can I achieve Boolean-style matching for a query whose terms carry
weights?  I think my question may be unclear or misleading, so here is an
example.

I have a document that reads:
"I am eating an apple while using apple computer"

My Xapian query:
apple(weight:4)
computer(weight:3)

Instead of getting a weight of 11 for this doc (2 x apple, 1 x computer),
how can I make the matching work in a Boolean way so that I get a weight
of 7 for this document?

Is it possible to add a "penalty" to a query?
docA = "How to eat an apple while using apple computer"
docB = "I am eating an apple while using apple computer"

Query(apple:4,computer:3,how:-1) << is it possible to apply a penalty
(subtract weight) when a doc has the term "how", so that docB ranks higher?

How heavy would it be if I added a value of "hash(md5  HTML<title> X
websiteDomain)" to each document, and then used this key to collapse
duplicate titles within a domain using set_collapse_key? Is it way too heavy?

Thanks and really appreciated
Andrey K. 

Olly Betts | 2 Nov 06:43

Re: weight question

On Thu, Nov 01, 2007 at 09:49:39PM -0700, Andrey wrote:
> I have a document that reads:
> "I am eating an apple while using apple computer"
> 
> My Xapian query:
> apple(weight:4)
> computer(weight:3)
> 
> Instead of getting a weight of 11 for this doc (2 x apple, 1 x computer),
> how can I make the matching work in a Boolean way so that I get a weight
> of 7 for this document?

If I understand correctly, you want to ignore the wdf of terms - you can
do that by setting BM25's k1 parameter to 0:

http://www.xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#_details

That's not what I'd call "boolean" weighting though, so perhaps I'm
misunderstanding you...
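
To make the effect of k1 = 0 concrete, here is a small Java sketch of the
wdf part of the textbook Okapi BM25 formula.  It is an illustration only:
Xapian's BM25Weight may differ in detail, the class and method names are
made up, and the per-term weights 4 and 3 are simply the numbers from the
example above (real BM25 derives term weights from collection statistics).

```java
// Sketch of how BM25's k1 parameter controls the influence of wdf
// (within-document frequency).  Not Xapian code.
final class Bm25WdfSketch {
    // Textbook BM25 wdf factor.  normLen is the document length divided
    // by the average document length; b controls length normalisation.
    static double wdfFactor(int wdf, double k1, double b, double normLen) {
        if (wdf <= 0) {
            return 0.0; // term absent from the document
        }
        double lengthNorm = (1.0 - b) + b * normLen;
        return (k1 + 1.0) * wdf / (k1 * lengthNorm + wdf);
    }

    // Sums termWeight * wdfFactor over the query terms.
    static double score(double[] termWeights, int[] wdfs,
                        double k1, double b, double normLen) {
        double total = 0.0;
        for (int i = 0; i < termWeights.length; i++) {
            total += termWeights[i] * wdfFactor(wdfs[i], k1, b, normLen);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] weights = {4.0, 3.0}; // apple, computer
        int[] wdfs = {2, 1};           // "apple" occurs twice, "computer" once
        // With k1 = 0 every present term contributes its weight exactly once.
        System.out.println(score(weights, wdfs, 0.0, 0.5, 1.0));
    }
}
```

With k1 = 0 the factor is 1 for any term that occurs at all, so the example
document scores 4 + 3 = 7 however many times "apple" appears.  With k1 > 0
the factor grows with wdf but levels off rather than growing linearly.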

> Is it possible to add a "penalty" to a query?
> docA = "How to eat an apple while using apple computer"
> docB = "I am eating an apple while using apple computer"
> 
> Query(apple:4,computer:3,how:-1) << is it possible to apply a penalty
> (subtract weight) when a doc has the term "how", so that docB ranks higher?

I don't think that's currently possible without indexing each document
which doesn't contain "how" with a "XNOThow" term, or something similar.
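
The marker-term workaround can be sketched at index time roughly as follows.
This is plain Java with no Xapian calls; the class, the helper names and the
whitespace tokeniser are all assumptions made for illustration, and only the
"XNOT" prefix comes from the suggestion above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: add an "XNOT<term>" marker to every document that lacks <term>,
// so a query can reward the marker instead of penalising the term.
final class NegationMarkerSketch {
    // Very naive tokeniser: lower-case and split on whitespace.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Returns the terms to index for a document: its own terms, plus the
    // marker "XNOT" + penalised when the document does not contain it.
    static Set<String> termsToIndex(String text, String penalised) {
        Set<String> terms = tokenize(text);
        if (!terms.contains(penalised)) {
            terms.add("XNOT" + penalised);
        }
        return terms;
    }
}
```

Here docA ("How to eat an apple ...") is indexed without the marker and docB
("I am eating an apple ...") gets "XNOThow", so adding "XNOThow" to the
query with a positive weight favours docB.  The cost is that the set of
penalisable terms has to be known when indexing.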


Olly Betts | 2 Nov 07:06

Re: Search performance issues and profiling/debugging

On Thu, Oct 25, 2007 at 08:27:33PM +0200, Ron Kass wrote:
> By the way, this is the weight scheme we used.
> # Governs the importance of within document frequency.
> # Must be >= 0. 0 means ignore wdf. Default is 1.
> my $k1 = 1;
> # Compensation factor for the high wdf values in large documents.
> # Must be >= 0. 0 means no compensation. Default is 0.
> my $k2 = 25;
> # Governs the importance of within query frequency.
> # Must be >= 0. 0 means ignore wqf. Default is 1.
> my $k3 = 1;
> # Relative importance of within document frequency and document length.
> # Must be >= 0 and <= 1. Default is 0.5.
> my $b = 0.01;
> # Specifies a cutoff on the minimum value that can be used for a
> # normalised document length - smaller values will be forced up to this
> # cutoff. This prevents very small documents getting a huge bonus weight.
> # Default is 0.5.
> my $min_normlen = 0.5;
> [...]
> Then we went back to the old server.. Same speed as before (0.9-1.0sec 
> per search) and this time estimates are stable. So, weight scheme is the 
> cause of the inaccurate estimates. Why?

I've made some progress here.  It looks like there's a bug in BM25Weight
where one of the statistics isn't being set correctly, but by default
this doesn't matter since k2 is 0.  If k2 is set to non-zero (as you've
done) then this manifests as an unpredictable factor in the weights.

I've not yet tracked down where this value comes from, but it shouldn't
take long now I've got it happening in front of me with a four line
testcase!

Was the segfaulting case also using BM25 with non-zero k2?


James Aylett | 2 Nov 12:20

Re: Java threads

On Thu, Nov 01, 2007 at 09:11:05PM +0000, Olly Betts wrote:

> For indexing, you can only have one WritableDatabase object for each
> database, so you probably want to have a single thread handling updates,
> and other threads passing Document objects to it to be added.

That's what I'd do. Use a fast thread-safe one-reader queue (this
allows optimisations that multi-reader queues don't), throw your fresh
Document objects at it and scoop them up into a (slower) writer thread.
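
A minimal Java sketch of that arrangement, under some loud assumptions:
plain strings stand in for Document objects, an ArrayList stands in for the
single WritableDatabase, and SingleWriterSketch is a made-up name.
LinkedBlockingQueue is a general-purpose queue from java.util.concurrent
rather than a specialised one-reader queue, so treat it as a placeholder
for whatever faster queue you pick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: several producer threads hand "documents" to one writer thread.
final class SingleWriterSketch {
    static List<String> run(int producers, int docsPerProducer) {
        // An empty Optional is the "no more documents" signal.
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        // Stand-in for the database; only the writer thread touches it
        // until that thread has been joined.
        List<String> written = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Optional<String> next = queue.take();
                    if (!next.isPresent()) {
                        break; // stop signal received
                    }
                    written.add(next.get());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        Thread[] workers = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            workers[p] = new Thread(() -> {
                for (int d = 0; d < docsPerProducer; d++) {
                    // The queue is unbounded, so add() never fails here.
                    queue.add(Optional.of("p" + id + "-d" + d));
                }
            });
            workers[p].start();
        }

        try {
            for (Thread t : workers) {
                t.join();
            }
            queue.add(Optional.empty()); // all producers are done
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return written;
    }
}
```

The producers never touch the database stand-in, so it needs no locking of
its own.  A bounded queue would be a reasonable change if the producers can
outrun the writer and memory is a concern.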

> But assuming by "PC" you mean a typical desktop machine, I doubt it's
> worth multithreading indexing of files - the main bottleneck is likely
> to be reading data from the disks, and multithreading that probably
> won't help performance.  Most PCs don't have a RAID for the disk
> subsystem, so interleaving file accesses will probably just send the
> disk head seeking back and forth.

If you have lots of small files and a large amount of memory you could
read things into memory out of one thread, and have multiple threads
indexing. That's only going to improve over naive single threading
where the indexing of each document takes a long time. (And your read
thread will stall regularly there.) It also requires plenty of cores
to make it worthwhile, I suspect (unless your OS can do funky IO
rescheduling of threads, which I don't think will actually help in
this case because only two of them are doing IO anyway).

There's some interesting discussion going on at the moment about how
to do things that are IO-bound in a threaded context. I don't think there
are any final conclusions yet, but this is something that smart people
are working on, and once there's consensus it should influence the