Olly Betts | 7 Jul 02:41

Re: [Xapian-commits] 10821: trunk/xapian-core/ trunk/xapian-core/api/

On Sun, Jul 06, 2008 at 11:57:40PM +0100, richard wrote:
> api/omenquire.cc: When calculating percentages, round to the
> nearest integer, rather than rounding down.  There was a FIXME
> about this, but no explanation of why it hadn't already been
> done, and I can see no bad side effects so far.  The most obvious
> positive effect is that queries which should get precisely 100%
> will no longer be assigned 99% due to rounding errors.

Well, one issue is that queries which shouldn't get precisely 100% now
can...

I don't know how common an issue that is, but then I don't know how
common the issue you mention is either.

Cheers,
    Olly

Richard Boulton | 7 Jul 09:06

Re: [Xapian-commits] 10821: trunk/xapian-core/ trunk/xapian-core/api/

Olly Betts wrote:
> On Sun, Jul 06, 2008 at 11:57:40PM +0100, richard wrote:
>> api/omenquire.cc: When calculating percentages, round to the
>> nearest integer, rather than rounding down.  There was a FIXME
>> about this, but no explanation of why it hadn't already been
>> done, and I can see no bad side effects so far.  The most obvious
>> positive effect is that queries which should get precisely 100%
>> will no longer be assigned 99% due to rounding errors.
> 
> Well, one issue is that queries which shouldn't get precisely 100% now
> can...
> 
> I don't know how common an issue that is, but then I don't know how
> common the issue you mention is either.

The test case I committed yesterday suffered from this problem for me, 
and I've certainly seen it before (generally with large queries), but I 
couldn't guess at a rate at which it occurs.

I don't think it's unreasonable to return 100% for a document which 
matches well enough to get 99.5%; and it's certainly more reasonable 
than returning 99% for a document which actually got 99.999999%.

I suppose we could instead round up only very slightly, so that a 
document needed to get at least 99.9999% or so to be returned with 100%. 
I'm not sure whether that would be better or worse than rounding to 
nearest, but either is better than the rounding down which we had.
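
The three options discussed above can be compared side by side. This is only a minimal sketch, not the code in api/omenquire.cc: the function names and the 0.0001 tolerance are assumptions chosen for illustration, and `pct` stands for the unrounded percentage.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; none of these names
// come from Xapian.  `pct` is the unrounded percentage (0 to 100).

// Old behaviour: truncate, so 99.999999 becomes 99.
int percent_round_down(double pct) {
    return static_cast<int>(pct);
}

// Behaviour after the commit: round to the nearest integer,
// so anything from 99.5 upwards becomes 100.
int percent_round_nearest(double pct) {
    return static_cast<int>(std::floor(pct + 0.5));
}

// Alternative floated above: round down, but add a tiny tolerance
// first so that only values within about 0.0001 of the next integer
// are bumped up.  The tolerance value is an assumption.
int percent_round_tolerant(double pct, double tolerance = 0.0001) {
    return static_cast<int>(pct + tolerance);
}
```

With these definitions, 99.6 comes out as 99, 100 and 99 respectively, while 99.999999 comes out as 99, 100 and 100.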

-- 
Richard

Olly Betts | 13 Jul 11:38

Re: [Xapian-commits] 10821: trunk/xapian-core/ trunk/xapian-core/api/

On Mon, Jul 07, 2008 at 08:06:32AM +0100, Richard Boulton wrote:
> Olly Betts wrote:
> > On Sun, Jul 06, 2008 at 11:57:40PM +0100, richard wrote:
> >> api/omenquire.cc: When calculating percentages, round to the
> >> nearest integer, rather than rounding down.  There was a FIXME
> >> about this, but no explanation of why it hadn't already been
> >> done, and I can see no bad side effects so far.  The most obvious
> >> positive effect is that queries which should get precisely 100%
> >> will no longer be assigned 99% due to rounding errors.
> > 
> > Well, one issue is that queries which shouldn't get precisely 100% now
> > can...
> > 
> > I don't know how common an issue that is, but then I don't know how
> > common the issue you mention is either.
> 
> The test case I committed yesterday suffered from this problem for me, 
> and I've certainly seen it before (generally with large queries), but I 
> couldn't guess at a rate at which it occurs.

I can't reproduce this issue with the patch reversed, and it makes
handling of percentage cutoff inconsistent - setting the cutoff to n%
doesn't return documents which would have got n% by being rounded up.

So I've reversed it for now (and added a testcase pctcutoff3 to show the
issue, which failed with the patch applied).

> I don't think it's unreasonable to return 100% for a document which 
> matches well enough to get 99.5%; and it's certainly more reasonable 
> than returning 99% for a document which actually got 99.999999%.

Richard Boulton | 14 Jul 12:04

Re: [Xapian-commits] 10821: trunk/xapian-core/ trunk/xapian-core/api/

Olly Betts wrote:
> I can't reproduce this issue with the patch reversed, and it makes
> handling of percentage cutoff inconsistent - setting the cutoff to n%
> doesn't return documents which would have got n% by being rounded up.

Ah, good point.

> If we're going to round, we need to fix how the percentage cut-off is
> handled by the matcher to account for the 0.5% shift.

Yes, that makes sense.
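
To make the inconsistency concrete, here is a minimal sketch of a cutoff test before and after accounting for the 0.5% shift. It assumes the displayed percentage is rounded to nearest and the cutoff is an integer percentage; the function names are hypothetical and this is not how the Xapian matcher is actually written.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only.

// Percentage shown to the user when rounding to nearest.
int displayed_percent(double pct) {
    return static_cast<int>(std::floor(pct + 0.5));
}

// Cutoff applied to the unrounded value: a document at 79.6% is
// displayed as 80% but is dropped by an 80% cutoff.
bool passes_cutoff_unshifted(double pct, int cutoff) {
    return pct >= cutoff;
}

// Cutoff shifted down by 0.5 so it keeps exactly the documents whose
// displayed percentage is at least `cutoff`.
bool passes_cutoff_shifted(double pct, int cutoff) {
    return pct >= cutoff - 0.5;
}
```

For an integer cutoff, `passes_cutoff_shifted(pct, cutoff)` agrees with `displayed_percent(pct) >= cutoff`, which is the consistency property the cutoff needs if percentages are rounded to nearest.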

> Do you have a repeatable testcase where this happens?

No, I don't.  I'll keep an eye out and try and build a test case if/when 
I see it happening again.

-- 
Richard

Henrik Brix Andersen | 29 Jul 20:53

xapian-omega runfilter.cc patch

Hi,

The following patch for runfilter.cc is needed for building
xapian-omega on FreeBSD:

--- runfilter.cc.orig	2008-07-03 21:16:54.000000000 +0200
+++ runfilter.cc	2008-07-03 21:18:48.000000000 +0200
@@ -25,6 +25,7 @@
 #include "safeerrno.h"
 #include <sys/types.h>
 #include <stdio.h>
+#include <signal.h>
 #include "safefcntl.h"

 #ifdef HAVE_SYS_TIME_H

Attempting to build xapian-omega 1.0.7 (and 1.0.6 for that matter)
without the above patch results in the following compile error:

c++ -DHAVE_CONFIG_H -I. -I./common  -DCONFIGFILE_SYSTEM=\"/usr/local/etc/omega.conf\"
-I/usr/local/include -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align
-Wno-long-long -Wformat-security -Wconversion -fno-gnu-keywords -Wundef -Wshadow -Winit-self
-Wstrict-overflow=5 -fvisibility=hidden -O2 -pipe -march=prescott -fno-strict-aliasing
-I/usr/local/include -MT runfilter.o -MD -MP -MF .deps/runfilter.Tpo -c -o runfilter.o runfilter.cc
runfilter.cc: In function 'std::string stdout_to_string(const std::string&)':
runfilter.cc:69: error: 'SIGCHLD' was not declared in this scope
runfilter.cc:69: error: 'SIG_DFL' was not declared in this scope
runfilter.cc:69: error: 'signal' was not declared in this scope
*** Error code 1
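
For reference, the three identifiers named in the errors are all declared by <signal.h>. The sketch below is a self-contained illustration of that dependency on a POSIX system, not the code at line 69 of runfilter.cc; `reset_sigchld` is a hypothetical name.

```cpp
#include <signal.h>  // declares signal(), SIGCHLD and SIG_DFL

// Hypothetical helper: restore the default SIGCHLD disposition, as
// code that is about to spawn and wait for a child process might do.
// Without the #include above, this fails to compile on systems where
// no other header happens to pull in <signal.h>.
bool reset_sigchld() {
    return signal(SIGCHLD, SIG_DFL) != SIG_ERR;
}
```

On platforms where another included header drags in <signal.h> indirectly, the missing include goes unnoticed, which would explain why the build only broke on some systems.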


Olly Betts | 30 Jul 02:11

Re: xapian-omega runfilter.cc patch

On Tue, Jul 29, 2008 at 08:53:36PM +0200, Henrik Brix Andersen wrote:
> The following patch for runfilter.cc is needed for building
> xapian-omega on FreeBSD:

Thanks, applied to trunk and added to the release notes.  I'll backport
the fix to 1.0.x when I'm next doing a batch.

Cheers,
    Olly

Reini Urban | 31 Jul 09:53

Re: [Xapian-discuss] Dealing with image PDF's

2008/7/30 Frank Bruzzaniti <frank.bruzzaniti <at> gmail.com>:
>    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>    string safefile = shell_protect(file);
>    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>    try {
>        dump = stdout_to_string(cmd);
>    } catch (ReadError) {
>        cout << "\"" << cmd << "\" failed - skipping\n";
>        return;
>    }

Can we finally please use configure checks for such weird helper apps,
to avoid runtime exceptions where the system clearly has no such app.

I once provided a huge patch to do that.
http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Applied to 1.0.5 it is attached. But there's much more in this patch
so some parts may be stripped. See ChangeLog.
TEXTCAT support for language and charset detection, cached virtual
directories (zip,msg,pst,...) to name a few. Works fine for me for two
years and I haven't touched it since 0.9.6.
-- 
Reini Urban
http://phpwiki.org/ http://murbreak.at/
Attachment (xapian-omega-1.0.5a.patch.gz): application/x-gzip, 41 KiB

Richard Boulton | 31 Jul 10:55

Re: [Xapian-discuss] Dealing with image PDF's

Reini Urban wrote:
> 2008/7/30 Frank Bruzzaniti <frank.bruzzaniti <at> gmail.com>:
>>    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>    string safefile = shell_protect(file);
>>    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>    try {
>>        dump = stdout_to_string(cmd);
>>    } catch (ReadError) {
>>        cout << "\"" << cmd << "\" failed - skipping\n";
>>        return;
>>    }
> 
> Can we finally please use configure checks for such weird helper apps,
> to avoid runtime exceptions where the system clearly has no such app.
> 
> I once provided a huge patch to do that.
> http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Perhaps the patch should go in a ticket; that way, we're less likely to 
forget about it.

> Applied to 1.0.5 it is attached. But there's much more in this patch
> so some parts may be stripped. See ChangeLog.
> TEXTCAT support for language and charset detection, cached virtual
> directories (zip,msg,pst,...) to name a few. Works fine for me for two
> years and I haven't touched it since 0.9.6.

Sounds useful.  However, I'm not sure that configure time is the right 
place to check for the existence of helper apps.  In particular, quite 

Reini Urban | 31 Jul 13:40

Re: [Xapian-discuss] Dealing with image PDF's

2008/7/31 Richard Boulton <richard <at> lemurconsulting.com>:
> Reini Urban wrote:
>>
>> 2008/7/30 Frank Bruzzaniti <frank.bruzzaniti <at> gmail.com>:
>>>
>>>   // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>>   string safefile = shell_protect(file);
>>>   string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>>   try {
>>>       dump = stdout_to_string(cmd);
>>>   } catch (ReadError) {
>>>       cout << "\"" << cmd << "\" failed - skipping\n";
>>>       return;
>>>   }
>>
>> Can we finally please use configure checks for such weird helper apps,
>> to avoid runtime exceptions where the system clearly has no such app.
>>
>> I once provided a huge patch to do that.
>> http://thread.gmane.org/gmane.comp.search.xapian.devel/783/
>
> Perhaps the patch should go in a ticket; that way, we're less likely to
> forget about it.

Ticket? Uh my fault. I never thought about that. Sounds useful :)
http://trac.xapian.org/ticket/285
Should probably be split into multiple tickets, patches.

>> Applied to 1.0.5 it is attached. But there's much more in this patch
>> so some parts may be stripped. See ChangeLog.

Richard Boulton | 31 Jul 13:54

Re: [Xapian-discuss] Dealing with image PDF's

(Moving this thread to just Xapian-devel - it's not really general 
discussion about xapian, and I don't think cross posting it further is 
particularly useful.)

Reini Urban wrote:
> Ticket? Uh my fault. I never thought about that. Sounds useful :)
> http://trac.xapian.org/ticket/285
> Should probably be split into multiple tickets, patches.

That would be helpful, yes.

> I solved the preconfigured binary package problem with packaging dependencies.

But that requires all the potential helper packages to be installed 
whenever omega is.  That would be fairly annoying for a user who, say, 
just wanted to index some HTML pages.  It also doesn't help if a helper 
program isn't available as a package (which could certainly be the case 
for a new helper program, but I don't know if we're using any such 
helpers currently).

> A cache would be overkill.

A simple (in-memory) cache of helper program paths is hardly 
heavyweight.  But, in any case, I'm not convinced that such a cache is 
needed - I don't expect the time taken to look through $PATH for files 
to be a big part of the indexing time.
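
As a rough illustration of how small such a runtime lookup and cache can be, here is a sketch for a POSIX system. It is not taken from omega or from the patch in ticket 285; `find_in_path` and `helper_path` are hypothetical names, and the lookup rule (first regular, executable file in a colon-separated list of directories) is an assumption modelled on how a shell resolves commands.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: search the colon-separated directory list
// `path` for an executable regular file called `name`.  Returns the
// full path of the first match, or an empty string if there is none.
std::string find_in_path(const std::string& name, const std::string& path) {
    std::istringstream dirs(path);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        if (dir.empty()) dir = ".";  // empty entry means current directory
        std::string candidate = dir + "/" + name;
        struct stat st;
        if (stat(candidate.c_str(), &st) == 0 && S_ISREG(st.st_mode) &&
            access(candidate.c_str(), X_OK) == 0) {
            return candidate;
        }
    }
    return std::string();
}

// Hypothetical in-memory cache: each helper is looked up in $PATH at
// most once per process, and a failed lookup is remembered too.
const std::string& helper_path(const std::string& name) {
    static std::map<std::string, std::string> cache;
    std::map<std::string, std::string>::iterator it = cache.find(name);
    if (it == cache.end()) {
        const char* path = std::getenv("PATH");
        std::string found = find_in_path(name, path ? path : "");
        it = cache.insert(std::make_pair(name, found)).first;
    }
    return it->second;
}
```

An indexer could call `helper_path("gocr")` before building a command line and skip the file type when the result is empty, instead of catching a failure after the command has been run.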

> Another advantage of such a config setting would be to hardcode the
> actual helper location and not search the whole PATH at runtime for it.
