1 Nov 2010 01:59
floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)
Marinos Yannikos <mjy <at> pobox.com>
2010-11-01 00:59:42 GMT
I am using BM25 with k1=0 and min_normlen=1 to get weights unaffected by
document length and term frequency in the document (min_normlen=1 isn't
necessary I guess) and am expecting single-term weights to be identical for all
matches. I have added a document value to steer such general search queries and
it works fine, except that for some search terms, I get results like:
        weight (BM25)                 value
-----------------------------------------------
1. xxx (6.3564210045800955128925125 + 4.000000)
2. xxx (6.3564210045800955128925125 + 4.000000)
3. xxx (6.3564210045800955128925125 + 3.500000)
4. xxx (6.3564210045800946247140928 + 7.000000)
5. xxx (6.3564210045800946247140928 + 6.500000)
6. xxx (6.3564210045800946247140928 + 6.000000)
7. xxx (6.3564210045800946247140928 + 6.000000)
8. xxx (6.3564210045800946247140928 + 6.000000)
9. xxx (6.3564210045800946247140928 + 6.000000)
10. xxx (6.3564210045800946247140928 + 6.000000)
The weights always seem to differ after the 14th/15th fractional digit, and
only a small number of results are affected (3 out of ~16000 with a slightly
lower weight in one case, 4 out of ~70000 with a slightly higher one in
another). The platform is Debian Lenny 64-bit on AMD Opteron CPUs, with
core-1.2.3 patched to r15140 and the chert backend. This also happens with
complex queries where groups of results are expected to have identical weights.
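To put a number on the difference, here is a minimal Python sketch using the two weights copied from the listing above. It checks that they are adjacent double-precision values (one unit in the last place apart) and shows how a weight-first, value-second ordering then ranks a hit with value 3.5 above one with value 7.0. The `hits` list and the `sorted` call are only a hypothetical stand-in for a relevance-then-value sort; this is not Xapian's code.

```python
# The two distinct weights, copied from the result listing above.
w_hi = 6.3564210045800955128925125
w_lo = 6.3564210045800946247140928

# Doubles in [4, 8) are spaced 2**-50 apart, so these two are neighbours.
gap = w_hi - w_lo
print(gap == 2.0 ** -50)  # True

# (weight, value) pairs: a subset of the listing above, deliberately shuffled.
hits = [(w_lo, 7.0), (w_hi, 3.5), (w_lo, 6.0), (w_hi, 4.0), (w_lo, 6.5)]

# Hypothetical stand-in for relevance-then-value ordering:
# highest weight first, value used only to break exact weight ties.
ranked = sorted(hits, key=lambda h: (h[0], h[1]), reverse=True)
values_in_rank_order = [value for _, value in ranked]
print(values_in_rank_order)  # [4.0, 3.5, 7.0, 6.5, 6.0]
```

Because the value only breaks exact ties, a difference in the last bit of the weight is enough to override it completely.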
FIX: I found a simple fix for this issue, at least for my test cases:
I added
How much is "slightly"? Or did you just mean it's doing less work,
rather than that there's a measurable speed-up?
> In order to prevent more such issues, it might be a good idea to round
> weights to a few fractional digits (10 should be enough) before using
> them as sort keys.
Rounding isn't a magic solution to such issues, and explicitly rounding
all the weights is extra work. I think it's better to focus on getting
the calculations right rather than trying to disguise any problems.
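One possible way to see why rounding is not a cure-all, sketched in Python rather than taken from Xapian: two doubles that sit on either side of a rounding boundary are pushed further apart by rounding, not merged. The value `mid` below is a made-up boundary near the weights in this thread, chosen purely for illustration.

```python
# Hypothetical boundary: a decimal exactly halfway between two 10-digit values.
mid = 6.35642100455
ulp = 2.0 ** -50  # spacing of doubles in [4, 8)

lo = mid - ulp  # the double just below the stored boundary value
hi = mid + ulp  # the double just above the stored boundary value

# Round both to 10 fractional digits, as suggested in the quoted text.
r_lo = round(lo, 10)
r_hi = round(hi, 10)

print(r_lo, r_hi)    # 6.3564210045 6.3564210046
print(r_lo != r_hi)  # True: the rounded sort keys still differ
```

Here the inputs differ by two units in the last place, yet the rounded keys differ by 1e-10, so weights that were meant to be equal can still sort apart.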
Cheers,
Olly