Marinos Yannikos | 1 Nov 01:59 2010

floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)

I am using BM25 with k1=0 and min_normlen=1 to get weights unaffected by 
document length and term frequency in the document (min_normlen=1 isn't 
necessary I guess) and am expecting single-term weights to be identical for all 
matches. I have added a document value to steer such general search queries and 
it works fine, except that for some search terms, I get results like:

             weight (BM25)               value
-----------------------------------------------
1. xxx (6.3564210045800955128925125 + 4.000000)
2. xxx (6.3564210045800955128925125 + 4.000000)
3. xxx (6.3564210045800955128925125 + 3.500000)
4. xxx (6.3564210045800946247140928 + 7.000000)
5. xxx (6.3564210045800946247140928 + 6.500000)
6. xxx (6.3564210045800946247140928 + 6.000000)
7. xxx (6.3564210045800946247140928 + 6.000000)
8. xxx (6.3564210045800946247140928 + 6.000000)
9. xxx (6.3564210045800946247140928 + 6.000000)
10. xxx (6.3564210045800946247140928 + 6.000000)
	
The weights then always seem to differ after the 14th/15th fractional digit, and
only a small number of results are affected (3 out of ~16000 with a slightly
lower weight in one case, 4 out of ~70000 with a slightly higher one in
another). The platform is Debian Lenny 64-bit, AMD Opteron CPUs, core-1.2.3
patched to r15140, using chert. This also happens with complex queries where
groups of results are expected to have identical weights.
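
For reference, the query setup is roughly this (a simplified sketch rather than
my actual code; the value slot number and the k2/k3/b parameters here are just
illustrative):

    #include <xapian.h>

    // BM25 with k1 = 0 so the term weight ignores wdf and document length,
    // breaking ties on a document value.
    Xapian::MSet search(Xapian::Database &db, const Xapian::Query &query) {
        Xapian::Enquire enquire(db);
        enquire.set_query(query);

        // BM25Weight(k1, k2, k3, b, min_normlen)
        enquire.set_weighting_scheme(Xapian::BM25Weight(0, 0, 1, 0.5, 1));

        // Weight first, then value slot 0; 'true' reverses the value order.
        enquire.set_sort_by_relevance_then_value(0, true);

        return enquire.get_mset(0, 10);
    }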

FIX: I found a simple fix for this issue, at least for my test cases:

I added


Olly Betts | 1 Nov 11:58 2010

Re: floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)

On Mon, Nov 01, 2010 at 01:59:42AM +0100, Marinos Yannikos wrote:
> This apparently prevents floating point precision issues in the last line 
> of get_sumpart() [which calculates termweight * wdf_double * 1 / 
> wdf_double].

Yes, for some values of wdf_double and termweight, this doesn't give
exactly termweight.  We should do the division, and scale termweight by
the result.
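
A standalone illustration of why (this isn't Xapian code, just the same shape
of expression with made-up values; on typical IEEE 754 double arithmetic it
prints "no"):

    #include <cstdio>

    int main() {
        double termweight = 0.1;   // illustrative values only
        double wdf_double = 3.0;

        // The product termweight * wdf_double is rounded before the
        // division, so the round trip isn't always exact.
        double w = termweight * wdf_double * 1 / wdf_double;

        std::printf("termweight = %.25g\n", termweight);
        std::printf("result     = %.25g\n", w);
        std::printf("equal?       %s\n", w == termweight ? "yes" : "no");
        return 0;
    }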

I've reproduced this issue and I'm currently working on a fix.

> It also speeds up my case slightly. ;-)

How much is "slightly"?  Or did you just mean it's doing less work,
rather than that there's a measurable speed-up.

> In order to prevent more such issues, it might be a good idea to round 
> weights to a few fractional digits (10 should be enough) before using 
> them as sort keys.

Rounding isn't a magic solution to such issues, and explicitly rounding
all the weights is extra work.  I think it's better to focus on getting
the calculations right rather than trying to disguise any problems.

Cheers,
    Olly
Onur Çelebi | 1 Nov 12:49 2010

Freaking fast song search, thanks Xapian

Hi all,
I just want to share with you a deployment of Xapian I have made at spool.fm.
I used the C++ bindings and bound the binary directly to lighttpd via FastCGI
for minimum latency. The result is quite impressive in Europe; I invite you
to test it at http://spool.fm (just type a search in the top right corner, or
try the search directly with this URL: http://spool.fm/engine/fast/?s=test )

I think the result is not bad at all in America either, although there is
more network latency.

Thank you to all the Xapian developers and maintainers for such a great product!

Onur
David | 2 Nov 11:41 2010

How to make QueryParser select entire word like "H.O.T"

Hi,

I'm using Xapian to build my search engine, but I've run into a problem.
The code snippet looks like this:
----------------------Code begin-------------------------------------------------------------
Xapian::QueryParser qp;
qp.add_prefix("Singer", "S");
Xapian::Query query = qp.parse_query("Singer:s.h.e",
Xapian::QueryParser::FLAG_PARTIAL|Xapian::QueryParser::FLAG_AUTO_MULTIWORD_SYNONYMS
|Xapian::QueryParser::FLAG_PHRASE );
cout << "Performing query `" << query.get_description() << "'" << endl;
----------------------Code end---------------------------------------------------------------

See the output on stdout:
----------------------Output begin---------------------------------------------------------
Performing query `Xapian::Query((Ss:(pos=1) PHRASE 3 Sh:(pos=2) PHRASE 3 Se:(pos=3)))'
----------------------Output end-----------------------------------------------------------

See the problem? "s.h.e" is actually a music band from Taiwan, and I want to use it as a
single query term to search in the singer field.
So does anyone know how to make the parser produce "Ss.h.e" rather than a split-up query?

 

Thanks :)
Kevin Duraj | 9 Nov 03:40 2010

Re: Large database problem

Henry,

I've got 300 million documents indexed; searching for "Henry Henka",
the search results come back in 23 milliseconds.

http://find1friend.com/search?q=Henry+Henka

real	0m0.238s
user	0m0.020s
sys	0m0.010s

Kevin Duraj
http://find1friend.com

On Wed, Oct 20, 2010 at 11:58 PM, Henry C. <henka <at> cityweb.co.za> wrote:
> On Thu, October 21, 2010 06:47, 程诚 wrote:
>> Hi!
>> There are 2,000,000 documents in the database and the query matches about
>> 200,000 documents. It takes about 4 seconds the first time, and the same
>> query takes less than 0.5 seconds the next time. Please explain this for me.
>> What method can shorten the time of the first query? Thanks!
>
> The first time, Xapian has to read from disk, which is slow.
> The second time, the data is cached in RAM, which is faster.
>
> This is common across all applications.
>
> Regards
> Henry
>

Olly Betts | 9 Nov 03:56 2010

Re: How to make QueryParser select entire word like "H.O.T"

On Tue, Nov 02, 2010 at 06:41:24PM +0800, David wrote:
> I'm using Xapian to build my search engine, but I've run into a problem.
> The code snippet looks like this:
> ----------------------Code begin-------------------------------------------------------------
> Xapian::QueryParser qp;
> qp.add_prefix("Singer", "S");
> Xapian::Query query = qp.parse_query("Singer:s.h.e",
> Xapian::QueryParser::FLAG_PARTIAL|Xapian::QueryParser::FLAG_AUTO_MULTIWORD_SYNONYMS
> |Xapian::QueryParser::FLAG_PHRASE );
> cout << "Performing query `" << query.get_description() << "'" << endl;
> ----------------------Code end---------------------------------------------------------------
>  
> See the output on stdout:
> ----------------------Output begin---------------------------------------------------------
> Performing query `Xapian::Query((Ss:(pos=1) PHRASE 3 Sh:(pos=2) PHRASE 3 Se:(pos=3)))'
> ----------------------Output end-----------------------------------------------------------
>  
> See the problem? "s.h.e" is actually a music band from Taiwan, and I want to
> use it as a single query term to search in the singer field.
> So does anyone know how to make the parser produce "Ss.h.e" rather than a split-up query?

QueryParser doesn't allow you to customise how it interprets word
boundaries currently - this is ticket #113:

http://trac.xapian.org/ticket/113

There's currently some special handling for acronyms punctuated with
".", but only if capitalised, so this would work as you want:

    Singer:S.H.E
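
For example, something along these lines (same prefix as your snippet, default
parser flags) should give a single prefixed term rather than a three-term
phrase:

    #include <xapian.h>
    #include <iostream>

    int main() {
        Xapian::QueryParser qp;
        qp.add_prefix("Singer", "S");

        // Capitalised acronym: the dots get special handling, so this
        // should not be split into a phrase.
        Xapian::Query query = qp.parse_query("Singer:S.H.E");
        std::cout << "Performing query `" << query.get_description() << "'" << std::endl;
        return 0;
    }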

goran kent | 15 Nov 09:35 2010

Stopword addition and stemming

Hi,

Two questions which I'm unsure about:

Stemming:  I've turned on stemming, etc, but how can I confirm that
it's being used in searches?  What should I look/search for?

Stopwords:  I'm trying out Xapian on a regional dataset (e.g. searching
data from a *.co.us TLD).  I've noticed that searching for [bob
co.us] results in *very* slow search times (tens of seconds), since it
seems to be searching for two extremely common terms, "co" and "us"
(almost every document will have something.co.us in it), and the
not-so-common "bob".  Searching only for "bob" is quick.

Would it make sense to add "co" and "us" to the stopword list to
prevent that kind of catastrophic slowdown in search time?  Since the
dataset is obviously about ".co.us" I feel it's kind of redundant to
be searching for something you know is there...

Thanks
Olly Betts | 15 Nov 11:48 2010

Re: Stopword addition and stemming

On Mon, Nov 15, 2010 at 10:35:59AM +0200, goran kent wrote:
> Stemming:  I've turned on stemming, etc, but how can I confirm that
> it's being used in searches?  What should I look/search for?

Look for Z-prefixed terms in the output of query.get_description().
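
For example, a throwaway check along these lines (the stemmer setup here is an
assumption - use whatever you configured):

    #include <xapian.h>
    #include <iostream>

    int main() {
        Xapian::QueryParser qp;
        qp.set_stemmer(Xapian::Stem("english"));
        qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);

        Xapian::Query query = qp.parse_query("searching stemmed words");
        // With stemming active, the description contains Z-prefixed terms
        // such as Zsearch rather than the unstemmed "searching".
        std::cout << query.get_description() << std::endl;
        return 0;
    }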

> Stopwords:  I'm trying out Xapian on a regional dataset (e.g. searching
> data from a *.co.us TLD).  I've noticed that searching for [bob
> co.us] results in *very* slow search times (tens of seconds), since it
> seems to be searching for two extremely common terms, "co" and "us"
> (almost every document will have something.co.us in it), and the
> not-so-common "bob".  Searching only for "bob" is quick.
> 
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time?  Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

It often does make sense to choose stopwords based on the vocabulary of
the text collection you are working with.  And "us" would probably be a
stopword in English anyway.
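
If you do want to try that, the stopword sketch looks something like this
(words chosen purely for illustration) - though note the caveat about phrases
below:

    #include <xapian.h>
    #include <iostream>

    int main() {
        Xapian::SimpleStopper stopper;
        stopper.add("co");
        stopper.add("us");

        Xapian::QueryParser qp;
        // The stopper must stay alive for as long as the parser uses it.
        qp.set_stopper(&stopper);

        // "co" and "us" are dropped when they occur as loose terms.
        std::cout << qp.parse_query("bob co us").get_description() << std::endl;
        return 0;
    }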

But here bob.co.us is interpreted as a phrase, and stopwords are included
in phrases by the QueryParser.

In this case, I'm not sure you would want to ignore the ".co.us" part
anyway - "bob.co.us" probably has a meaning sufficiently distinct from
that of "bob" that you wouldn't want to conflate them.

If you aren't already using Xapian 1.2, phrase searching should be faster
with the new default chert backend.

goran kent | 15 Nov 12:56 2010

Re: Stopword addition and stemming

top-post apology to Kevin Duraj for hijacking his thread.  I have no
idea what the hell I did to cause it.  I want to blame it on overly
permissive stateless web stuff instead of my own stupidity, but...

On 11/15/10, Olly Betts <olly <at> survex.com> wrote:
> It often does make sense to choose stopwords based on the vocabulary of
> the text collection you are working with.  And "us" would probably be a
> stopword in English anyway.
>
> But here bob.co.us is interpreted as a phrase, and stopwords are included
> in phrases by the QueryParser.

Just to clarify, the search string was [bob_co.us], where "_" is a
space, but I gather from what you're saying that this would be a term
(bob) and a phrase ("co.us") type search anyway, correct?
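
(A quick way to double-check, I suppose, is to print the parsed query with
default settings and see what it does with "co.us":)

    #include <xapian.h>
    #include <iostream>

    int main() {
        Xapian::QueryParser qp;
        std::cout << qp.parse_query("bob co.us").get_description() << std::endl;
        return 0;
    }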

> In this case, I'm not sure you would want to ignore the ".co.us" part
> anyway - "bob.co.us" probably has a meaning sufficiently distinct from
> that of "bob" that you wouldn't want to conflate them.

See above.

>
> If you aren't already using Xapian 1.2, phrase searching should be faster
> with the new default chert backend.

Using 1.2.3 trunk.

> The patch in this ticket can also make a huge difference to slow phrase
> cases:

goran kent | 15 Nov 13:18 2010

Re: Stopword addition and stemming

Also meant to ask: can I apply that patch to the search code only, or
must it also go into the indexing code?
