Olly Betts | 1 Dec 2005 01:36

Re: Easy deletion of documents?

On Wed, Nov 30, 2005 at 01:55:05PM -0600, Tony Lambiris wrote:
> Is there an easy way to delete, or lookup a docid, then delete out of 
> the database? I've been hunting around but came up empty. Basically I 
> have a bunch of files, all unique filenames, and some files will become 
> "out-of-date" if you will, I just need a simple way of reaching into the 
> database and deleting that doc based on the filename.

Just add the filename as a term to each document, with a prefix to
distinguish it from terms from text.  "Q" is the usual prefix for
such uniQue terms, but nothing in the core library relies on this.

Then you can directly call delete_document or replace_document with a
termname instead of a document id.

If you have long filenames, beware of the termname length limit.  I
don't remember off the top of my head what the precise limit is, but
240 characters is certainly safe.

Omega handles longer terms by truncating and adding a hash of the rest
of the string in this case, which is a simple and robust approach.  See
function make_url_term in omindex.cc for details.
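The truncate-and-hash idea can be sketched roughly as follows. This is not a copy of make_url_term; the helper name make_unique_term, the use of SHA-1, and the 240-byte budget (taken from the "certainly safe" figure above) are all assumptions made for illustration.

```python
import hashlib

# Assumed budget: the "certainly safe" figure quoted above, counted in bytes.
MAX_TERM_BYTES = 240


def make_unique_term(filename, prefix=b"Q", max_bytes=MAX_TERM_BYTES):
    """Build a unique term for a filename (hypothetical helper).

    Short names are used as-is after the prefix. Long names are truncated
    and the hex SHA-1 of the dropped tail is appended, so two long names
    that differ only near the end still map to different terms.
    """
    raw = prefix + filename.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    digest_len = hashlib.sha1().digest_size * 2  # 40 hex characters
    if max_bytes <= digest_len:
        raise ValueError("max_bytes is too small to hold the hash")
    keep = max_bytes - digest_len
    tail_hash = hashlib.sha1(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + tail_hash
```

The resulting term is what you would then pass to delete_document or replace_document in place of a document id.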

Cheers,
    Olly
Olly Betts | 1 Dec 2005 02:09

Re: queryparser characters question

On Wed, Nov 30, 2005 at 03:03:52PM -0800, Eric Parusel wrote:
> I noticed that @#$%^ gets split on when in a query, but & doesn't...
> 
> eg: search term: aaa!bbb@ccc#ddd$eee%fff%ggg^hhh&iii
> query: Xapian::Query((aaa:(pos=1) OR bbb:(pos=2) OR ccc:(pos=3) OR
> ddd:(pos=4) OR eee:(pos=5) OR fff:(pos=6) OR ggg:(pos=7) OR
> hhh&iii:(pos=8)))

Yes, the rationale is that company names are often abbreviated to
terms containing "&", such as "AT&T", "M&S", "C&W", etc.

We also keep trailing "+" (e.g. C++), "#" (e.g. C#), and "-" (e.g. SO42-,
as in a sulphate ion).  The last is of debatable benefit, and has a
tendency to glue a hyphen onto the preceding word when the author
misses out a space, or from hardcopy formatted text where words are
hyphenated over linebreaks.  I've been wondering about dropping that
rule.

There's special handling for capital letters with dots between (so
I.B.M. is treated as IBM).

And I think we split on everything else which isn't alphanumeric, then
generate phrase searches when query terms are separated by one or
more of .-/':\_@ which covers contractions ("doesn't" etc), common URLs,
most email addresses, IP addresses (both v4 and v6), hostnames,
filenames, identifiers (like LD_LIBRARY_PATH), and classnames in most OO
languages.
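For matching up an importer's splitting, a simplified approximation of the rules described above might look like this. It is only a sketch and not the QueryParser code: the function name split_terms and the regular expression are assumptions, it handles ASCII only, does not change case, and leaves out the dotted-capitals rule, the trailing "-" rule, and phrase generation.

```python
import re

# A word is a run of ASCII alphanumerics, optionally joined by "&"
# (AT&T), optionally followed by trailing "+" or "#" characters (C++, C#)
# as long as those are not immediately followed by another alphanumeric.
_TERM_RE = re.compile(
    r"[A-Za-z0-9]+(?:&[A-Za-z0-9]+)*(?:[+#]+(?![A-Za-z0-9]))?"
)


def split_terms(text):
    """Split text into terms using the simplified rules above."""
    return _TERM_RE.findall(text)
```

With these rules, the search string from the question splits into aaa, bbb, ccc, ddd, eee, fff, ggg and hhh&iii, which is the same split shown in the quoted query.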

> Is there somewhere where this is documented?  I'd like to try to match
> up my importing "splitting" to the queryparser.

Tony Lambiris | 1 Dec 2005 23:05

Ideas for faster DB searching

We at work recently built a RAID level 0 array in an Apple Xraid, and have it 
mounted under Linux 2.6 with the noatime option... are there any other 
general recommendations for speeding up Xapian DB searching? Is there 
anything I can store in memory for faster lookups? Also we have a bunch 
of separate DBs (one per day, each one ~100k documents)... is it 
recommended to just have one big DB? A 2-word search query spanning 
10 DBs takes about 25-30 seconds.

Any insight is greatly appreciated.
Tony Lambiris | 1 Dec 2005 23:29

Re: Ideas for faster DB searching

I finally came across that quest utility, and I have to say the 
speed-ups are incredible... why would omega have that much overhead?

Still, any additional ideas for speed-ups are very much appreciated.

Thanks.

Tony Lambiris wrote:
> We at work recently built a RAID level 0 array in an Apple Xraid, and have it 
> mounted under Linux 2.6 with the noatime option... are there any other 
> general recommendations for speeding up Xapian DB searching? Is there 
> anything I can store in memory for faster lookups? Also we have a bunch 
> of separate DBs (one per day, each one ~100k documents)... is it 
> recommended to just have one big DB? A 2-word search query spanning 
> 10 DBs takes about 25-30 seconds.
> 
> Any insight is greatly appreciated.
Tony Lambiris | 2 Dec 2005 00:44

Searching Databases

When I use quest, if I search for "don't have", it appears as if Xapian 
is parsing for "don" and "t" -- is there a way to modify or change this 
behavior?
Jason Toffaletti | 2 Dec 2005 01:27

identifying matched terms

I'm having trouble finding information on how to identify which terms
within a document triggered a match for a query. I'd like to use this
information to highlight the terms when displaying query results to a
user, much like how gmane or google do. Is there an API for this
purpose?

P.S. - I'm not subscribed to the list so please reply to me in addition
to the list.

Thanks,
Jason T.
Olly Betts | 2 Dec 2005 01:45

Re: Ideas for faster DB searching

On Thu, Dec 01, 2005 at 04:29:42PM -0600, Tony Lambiris wrote:
> I finally came across that quest utility, and I have to say the 
> speed-ups are incredible... why would omega have that much overhead?

Did you try disabling "topterms"?  Do so by removing the $topterms bits
of the default query template.

The topterms can take a while to compute - it seems to be especially
a problem with multiple databases from what I've seen, but I've not
investigated why.

> We at work recently built a RAID level 0 array in an Apple Xraid, and have it 
> mounted under Linux 2.6 with the noatime option...

I doubt noatime makes much difference, as there aren't a lot of files
involved (unlike INN or similar), but it can't hurt.  If you have
figures showing it makes a measurable difference, let me know and I'll
add it as general advice.

> are there any other general recommendations for speeding up Xapian DB
> searching? Is there anything I can store in memory for faster lookups?

We leave disk caching to the OS, since it has a much better overview of
the VM system.  This also simplifies the library code.  This approach
seems to work very well on modern OSes.  Linux in particular is
generally very eager to cache disk blocks, occasionally (I've found) to
the extent of swapping out code you didn't want swapped out.

It's possible you can tune your OS's VM system for this - I know Linux
2.6 has some knobs, but I've not tried using them myself.
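Since caching is left to the OS, one thing you could experiment with is reading the database files once before serving queries, so their blocks are likely to be in the page cache already. The sketch below is an assumption for illustration, not something measured in this thread: the helper name warm_page_cache is hypothetical, and whether the blocks stay cached depends on how much free memory the machine has.

```python
import os


def warm_page_cache(db_dir, chunk_size=1 << 20):
    """Read every file under db_dir once and return the total bytes read.

    Reading the files sequentially gives the OS a chance to keep their
    blocks in its disk cache; the data itself is discarded.
    """
    total = 0
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total
```

Timing the same query before and after such a pass would show whether it helps on a given setup.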

Olly Betts | 2 Dec 2005 01:51

Re: identifying matched terms

On Thu, Dec 01, 2005 at 04:27:32PM -0800, Jason Toffaletti wrote:
> I'm having trouble finding information on how to identify which terms
> within a document triggered a match for a query. I'd like to use this
> information to highlight the terms when displaying query results to a
> user, much like how gmane or google do. Is there an API for this
> purpose?

Yes:

http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#a18

Hmm, it seems you should really be able to call methods on an
MSetIterator to get this information - the Enquire class isn't the most
logical place for them.  I'll have a look and see if there's a good
reason why this is part of Enquire.
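Once you have the matching terms for a hit, the highlighting itself could be sketched like this. The list of terms is assumed to have been fetched already via the Enquire methods linked above; the function name highlight and the default <b> markers are hypothetical. Note that indexed terms may be stemmed or prefixed, in which case they won't appear verbatim in the text and would need mapping back first.

```python
import re


def highlight(text, matched_terms, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive occurrences of the terms in tags."""
    terms = [t for t in matched_terms if t]
    if not terms:
        return text
    # Try longer terms first so that a longer term wins over one of
    # its own prefixes when both could match at the same position.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, text)
```

The original casing of the text is preserved; only the markers are added around each match.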

Cheers,
    Olly
Francis Irving | 5 Dec 2005 19:51

Error compiling Perl bindings

I'm installing Search-Xapian-0.9.2.2 (from
http://search.cpan.org/~kilinrax/Search-Xapian-0.9.2.2/Xapian.pm)
on top of xapian-core-0.9.2

I get the following errors:

g++ -c   -D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2   -DVERSION=\"0.9.2.1\"
-DXS_VERSION=\"0.9.2.1\" -fPIC "-I/usr/lib/perl/5.8/CORE"   Xapian.c
XS/WritableDatabase.xs: In function ‘void
XS_Search__Xapian__WritableDatabase_new3(PerlInterpreter*, CV*)’:
XS/WritableDatabase.xs:32: error: ‘InMemory’ has not been declared
/usr/include/fcntl.h:76: error: too few arguments to function ‘int open(const char*, int, ...)’
XS/WritableDatabase.xs:32: error: at this point in file
XS/WritableDatabase.xs:32: error: no match for ‘operator=’ in ‘* RETVAL = open(<expression error>)’
/usr/local/include/xapian/database.h:261: note: candidates are: void
Xapian::WritableDatabase::operator=(const Xapian::WritableDatabase&)
make: *** [Xapian.o] Error 1

Any ideas?

Francis
Olly Betts | 5 Dec 2005 20:11

Re: Error compiling Perl bindings

On Mon, Dec 05, 2005 at 06:51:07PM +0000, Francis Irving wrote:
> I'm installing Search-Xapian-0.9.2.2 (from
> http://search.cpan.org/~kilinrax/Search-Xapian-0.9.2.2/Xapian.pm)
> on top of xapian-core-0.9.2
> 
> I get the following errors:
> 
> g++ -c   -D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe
> -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2   -DVERSION=\"0.9.2.1\"
> -DXS_VERSION=\"0.9.2.1\" -fPIC "-I/usr/lib/perl/5.8/CORE"   Xapian.c

This seems to think it's 0.9.2.1 not 0.9.2.2 (could be Alex failed to
update the version, or it could be you aren't using the version you
think you are).

> XS/WritableDatabase.xs: In function ‘void
> XS_Search__Xapian__WritableDatabase_new3(PerlInterpreter*, CV*)’:
> XS/WritableDatabase.xs:32: error: ‘InMemory’ has not been declared
> /usr/include/fcntl.h:76: error: too few arguments to function ‘int open(const char*, int, ...)’
> XS/WritableDatabase.xs:32: error: at this point in file
> XS/WritableDatabase.xs:32: error: no match for ‘operator=’ in ‘* RETVAL = open(<expression error>)’
> /usr/local/include/xapian/database.h:261: note: candidates are: void
> Xapian::WritableDatabase::operator=(const Xapian::WritableDatabase&)
> make: *** [Xapian.o] Error 1

Well, my guess from the information here would be it's GCC 4.0.1 being
fussier about C++ syntax.

What does "g++ -v" report?


