Olly Betts | 1 May 01:05

Re: bootstrap: macro `AM_CXXFLAGS' not found in library

On Sat, Apr 28, 2007 at 09:26:57PM -0500, Kenneth Loafman wrote:
> automake (GNU automake) 1.4-p6 <--- THIS IS OLD, SB 1.9.5 <---

Indeed.

> The version problem was not reported by bootstrap.

The bootstrap script doesn't check versions, but configure.ac does
specify a minimum version for automake of 1.9.5:

AM_INIT_AUTOMAKE([1.9.5 -Wportability])

And Makefile.am also specifies a minimum version higher than the one
you were using:

AUTOMAKE_OPTIONS = 1.5 subdir-objects

In case you're wondering, the reason it's different in each place is
that each one specifies the minimum version which that particular file
requires (or at least is believed to).

If automake 1.4-p6 doesn't give a message telling you that a newer
automake is required, I think we've done all we reasonably can.
After all, we do clearly document the required and recommended
versions in "HACKING", which we recommend people building from SVN
should read.

I'd rather not start adding parallel version checks to the bootstrap
script as that increases maintenance work, especially as they're rather
fiddly to do in shell script.

Olly Betts | 1 May 01:10

Re: Omega display times

On Mon, Apr 30, 2007 at 01:49:54PM +0100, Richard Boulton wrote:
> Are you using the "topterms" facility?  Turning that off can give a 
> large speed boost.

This issue seems to crop up on the list a few times a year, and I
suspect more people hit it but don't report it (either they find the
solution via the list archives, or just condemn Xapian as slow).

I've been fixing a problem in expand, and I think it could be made
somewhat more efficient, but I wonder if we should remove $topterms from
the default "query" template, and offer a new "topterms" template with
it in.

Thoughts?

Cheers,
    Olly

Duncan McQueen | 1 May 03:15

Re: Simple Question on Omega

Thank you!  Figured it was an easy thing that didn't need a change to the
code ;)

On 4/30/07, Olly Betts <olly <at> survex.com> wrote:
>
> On Sat, Apr 28, 2007 at 02:05:01PM -0500, Duncan McQueen wrote:
> > When my results return from the web search, I see the format as
> > /fileserver/[path to file].   I would rather see it as
> > F:\[path to file].
> > Is there a way to change this easily?
>
> You should just be able to use omindex's --url option, probably
> something like this (I'm mostly guessing the syntax for "file:" URLs
> with a drive letter):
>
> omindex --db /path/to/mydatabase --url 'file://F:/' /fileserver/
>
> (So, do I get the $20?
>
> http://www.rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequest.asp?lngBidRequestId=674021
> )
>
> Cheers,
>     Olly
>

Josef Novak | 1 May 04:16

BTREE_MAX_KEY_LEN=252

Hi,

  I have a quick question about the BTREE_MAX_KEY_LEN variable and
what happens when it is exceeded.

  I have an app which is indexing a very large set of (Japanese)
documents, and some of the keys are rather long, garbage-like 400+
byte nuggets of text.  When my app attempts to index these guys Xapian
balks and throws an exception:

Exception: Key too long: length was 446 bytes, maximum length of a key
is BTREE_MAX_KEY_LEN bytes

  If I make sure to check that the new posting token (key) does not
exceed the 252 byte maximum specified here,
http://www.xapian.org/docs/sourcedoc/html/btree_8h.html#b8d8c0c3cbbcec113aa5e3f5edace5dd

I have no problems.  However, I noticed that if I run the program
without checking the posting token length before attempting to add it,
it will sometimes throw the exception and keep on trucking, yet
sometimes it will throw the exception and then throw a segmentation
fault and unceremoniously die.

  As far as I can tell it is the over-long posting token that is
causing the problem in both cases.  It may be that the exception then
causes the posting index to also get out of sync?

// Add each token as a posting, using its index as the term position.
for (size_t i = 0; i < index_tokens.size(); ++i)
      newdocument.add_posting(index_tokens[i], i);


Olly Betts | 1 May 04:27

Re: BTREE_MAX_KEY_LEN=252

On Tue, May 01, 2007 at 11:16:35AM +0900, Josef Novak wrote:
> Exception: Key too long: length was 446 bytes, maximum length of a key
> is BTREE_MAX_KEY_LEN bytes
> 
>  If I make sure to check that the new posting token (key) does not
> exceed the 252 byte maximum specified here,
> http://www.xapian.org/docs/sourcedoc/html/btree_8h.html#b8d8c0c3cbbcec113aa5e3f5edace5dd

The btree keys need to contain a term and other information, so the
actual safe term length limit is less than 252 bytes - I recommend
imposing a limit of 240.  Read this for the full details:

http://article.gmane.org/gmane.comp.search.xapian.general/3656
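
As a rough illustration of that advice (a sketch, not code from Xapian
itself), a caller could drop over-long terms before they ever reach
add_posting().  The names MAX_SAFE_TERM_LEN and usable_terms are made up
for this example; only the 240-byte figure comes from the recommendation
above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed limit: the 240-byte figure recommended above.  This is not a
// constant provided by Xapian.
const std::size_t MAX_SAFE_TERM_LEN = 240;

// Hypothetical helper: returns (term, position) pairs for the tokens that
// are short enough to index.  Positions are the original token indices, so
// skipping a token leaves a gap rather than renumbering the rest.
std::vector<std::pair<std::string, unsigned> >
usable_terms(const std::vector<std::string>& tokens)
{
    std::vector<std::pair<std::string, unsigned> > result;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        // std::string::size() counts bytes, which is the unit of the limit.
        if (tokens[i].size() <= MAX_SAFE_TERM_LEN)
            result.push_back(
                std::make_pair(tokens[i], static_cast<unsigned>(i)));
    }
    return result;
}
```

Each returned pair could then be passed to add_posting(term, position) on
the document.  Silently skipping a term is only one option; truncating or
hashing the long token may suit some applications better.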

> I have no problems.  However, I noticed that if I run the program
> without checking the posting token length before attempting to add it,
> it will sometimes throw the exception and keep on trucking, yet
> sometimes it will throw the exception and then throw a segmentation
> fault and unceremoniously die.

You shouldn't get a SEGV, so that sounds like a bug.  Can you provide a
small self-contained example which demonstrates this?

>  As far as I can tell it is the over-long posting token that is
> causing the problem in both cases.  It may be that the exception then
> causes the posting index to also get out of sync?

The exception should cause the current batch of unapplied changes to be
abandoned, so it shouldn't be possible for the tables to get out of
sync.

Josef Novak | 1 May 05:20

Re: BTREE_MAX_KEY_LEN=252

Hi,

> The btree keys need to contain a term and other information, so the
> actual safe term length limit is less than 252 bytes - I recommend
> imposing a limit of 240.  Read this for the full details:

That's funny, I just ran into this problem and had to set it to 240 to
get past it!  Thanks though.

> http://article.gmane.org/gmane.comp.search.xapian.general/3656

Thanks.

>
> > I have no problems.  However, I noticed that if I run the program
> > without checking the posting token length before attempting to add it,
> > it will sometimes throw the exception and keep on trucking, yet
> > sometimes it will throw the exception and then throw a segmentation
> > fault and unceremoniously die.
>
> You shouldn't get a SEGV, so that sounds like a bug.  Can you provide a
> small self-contained example which demonstrates this?

I tried doing this a bunch of times while trying to figure out what
the problem was (before I realized it was the BTREE_MAX_KEY_LEN
problem).  I tried taking out a large chunk of text around the
offending key (1000 documents before and after) and reindexing just
this subsection.  However I can't seem to reproduce the problem.  I'm
sure I have the right subsection, where the problem is occurring,
because when I index the entire set, it invariably segmentation faults
at the same line.  Yet, aside from the exception regarding the byte
length, the subsection seems to get indexed properly.  Incidentally
this time, using the same code I posted in the previous mail, I also

Kenneth Loafman | 1 May 13:48

Re: bootstrap: macro `AM_CXXFLAGS' not found in library

Olly Betts wrote:
> On Sat, Apr 28, 2007 at 09:26:57PM -0500, Kenneth Loafman wrote:
>> automake (GNU automake) 1.4-p6 <--- THIS IS OLD, SB 1.9.5 <---
> 
> Indeed.
> 
>> The version problem was not reported by bootstrap.
> 
> The bootstrap script doesn't check versions, but configure.ac does
> specify a minimum version for automake of 1.9.5:

I understand that now.

> If automake 1.4-p6 doesn't give a message telling you that a newer
> automake is required, I think we've done all we reasonably can.
> After all, we do clearly document the required and recommended
> versions in "HACKING", which we recommend people building from SVN
> should read.

Agreed.  Bootstrap is not for the 'casual' builder.

> I'd rather not start adding parallel version checks to the bootstrap
> script as that increases maintenance work, especially as they're rather
> fiddly to do in shell script.

Agreed.  Would almost need a configure script for bootstrap to use in 
building the final configure, i.e. bootstrap the bootstrap.

>>> Alternatively, you can just try building from one of the automatic 
>>> snapshots, available at http://www.oligarchy.co.uk/xapian/trunk/

Olly Betts | 1 May 16:25

Re: bootstrap: macro `AM_CXXFLAGS' not found in library

On Tue, May 01, 2007 at 06:48:36AM -0500, Kenneth Loafman wrote:
> No need to test for 1.0.0, but I do need to gen a new version with my 
> own weighting scheme.

You can implement your own weighting scheme just by subclassing
Xapian::Weight - no modifications to the library should be necessary.

For an example, see "MyWeight" in tests/api_db.cc.

Is there something extra you need access to that isn't currently
available?

> If you can tokenize it, you can search it, and some weighting schemes
> work better than others if the corpus is other than a collection of
> documents.

If the weighting scheme is likely to be useful to others, I'd encourage
you to submit it for inclusion in Xapian.

Cheers,
    Olly

Kenneth Loafman | 1 May 19:07

Re: bootstrap: macro `AM_CXXFLAGS' not found in library

Olly Betts wrote:
> On Tue, May 01, 2007 at 06:48:36AM -0500, Kenneth Loafman wrote:
>> No need to test for 1.0.0, but I do need to gen a new version with my 
>> own weighting scheme.
> 
> You can implement your own weighting scheme just by subclassing
> Xapian::Weight - no modifications to the library should be necessary.
> 
> For an example, see "MyWeight" in tests/api_db.cc.

Thanks for the pointer!

> Is there something extra you need access to that isn't currently
> available?

Not sure at this time.  I have a few ideas to test out.

>> If you can tokenize it, you can search it, and some weighting schemes
>> work better than others if the corpus is other than a collection of
>> documents.
> 
> If the weighting scheme is likely to be useful to others, I'd encourage
> you to submit it for inclusion in Xapian.

That is my plan.

...Ken

Peter Karman | 2 May 18:23

Re: Re: Xapian document matching


Denis Kuzmenok scribbled on 4/30/07 11:07 AM:
> Denis Kuzmenok <denis.kuzmenok <at> gmail.com> writes:
> 
>> Hi, I'm wondering whether it is possible to do what ABCSok does
>> (http://nyheter.abcsok.no/): show a "Main article" with the "Same
>> articles" collapsed under it.
>> http://news.google.com/?hl=en does the same thing: a "Parent" and the
>> "same article on other sites" (they differ from each other a little).
>> Maybe somebody knows how to do that, or where to read the theory
>> behind such things.
>> Thank you
>>
> 
> I found this module on CPAN:
> http://search.cpan.org/~sid/WordNet-Similarity-1.04/lib/WordNet/Similarity.pm
> That's what I mean: finding whether there is a similar document in the
> database and collapsing follow-ups into a thread. Is there any
> implementation in Xapian?
> Thanks
> 

I don't believe there's a similarity implementation in Xapian. There was a 
similar thread on the Swish list a couple years ago:

  http://swish-e.org/archive/2005-02/8977.html

which seemed to suggest using the Levenshtein distance algorithm to determine 
similarity before indexing. Maybe the use of a 'similarity' field (value?) in 
Xapian could achieve something similar.
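
For reference, here is a minimal sketch of the Levenshtein distance itself,
using the standard two-row dynamic programming formulation. It is not taken
from Xapian or Swish; the function name is just illustrative, and it
compares bytes rather than characters, so multi-byte UTF-8 text would need
decoding first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Number of single-byte insertions, deletions and substitutions needed to
// turn a into b.  Uses O(|a| * |b|) time and O(|b|) memory.
std::size_t levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);

    // Distance from the empty prefix of a to each prefix of b.
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // Delete all i bytes of a's prefix.
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            std::size_t del = prev[j] + 1;
            std::size_t ins = cur[j - 1] + 1;
            cur[j] = std::min(subst, std::min(del, ins));
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}
```

The quadratic cost means this is probably only practical on short strings
such as titles or leading sentences, not on whole article bodies.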