Olly Betts | 1 Oct 01:21

Re: C++ parser for doc.get_data() result.

On Wed, Sep 30, 2009 at 03:04:44PM -0700, Kevin Duraj wrote:
> Has anybody written, and would like to share, routines that parse the
> result from doc.get_data() into key/value pairs in C++?

Omega has such code:

http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2060

Cheers,
    Olly
James Aylett | 1 Oct 12:48

Re: C++ parser for doc.get_data() result.

On Wed, Sep 30, 2009 at 03:04:44PM -0700, Kevin Duraj wrote:

> Has anybody written, and would like to share, routines that parse the
> result from doc.get_data() into key/value pairs in C++?

There's code in omega's query.cc (around line 2050 on trunk).

However if you're indexing yourself (as well as searching yourself),
you may prefer to use another serialisation format for which you have
an off-the-shelf library. There's nothing magical about the omega
convention for use of document data -- it is purely a convention. (And
has some limitations if you want to stuff arbitrary data inside it.)
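
A minimal sketch of such a parser, assuming the document data holds one
"field=value" pair per line. The function name parse_document_data is
hypothetical, and this is not the code from omega's query.cc. Skipping
lines without '=' and joining repeated fields with a tab are choices
made for this sketch only.

```cpp
#include <map>
#include <string>

// Parse newline-separated "field=value" lines into a map.
// Assumptions of this sketch: a line is split at its first '=',
// lines without '=' are skipped, and the values of a repeated
// field are joined with a tab character.
std::map<std::string, std::string>
parse_document_data(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::string::size_type start = 0;
    while (start < data.size()) {
        // Find the end of the current line (or the end of the data).
        std::string::size_type end = data.find('\n', start);
        if (end == std::string::npos) end = data.size();

        // Only accept an '=' that lies inside the current line.
        std::string::size_type eq = data.find('=', start);
        if (eq != std::string::npos && eq < end) {
            std::string name = data.substr(start, eq - start);
            std::string value = data.substr(eq + 1, end - eq - 1);
            auto it = fields.find(name);
            if (it == fields.end()) {
                fields.emplace(name, value);
            } else {
                it->second += '\t';
                it->second += value;
            }
        }
        start = end + 1;
    }
    return fields;
}
```

It would be called as parse_document_data(doc.get_data()). A value
containing a newline cannot be represented in this layout, which is one
example of the limitations mentioned above.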

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
Tom Mortimer | 1 Oct 12:56

QueryParser and booleans with spaces

Hi,

Does anyone know if it's possible to use the add_boolean_prefix
technique in QueryParser with terms which include a space? I want to
filter on (say) The Guardian, but none of these work:

title:The Guardian
title:"The Guardian"
title:(The Guardian)

I can work around it with a string substitution, e.g.
title:The_Guardian, but I'd rather be able to keep the space, if
possible.

cheers,
Tom
double | 1 Oct 16:35

The revision being read has been discarded

Hello,

What does this error mean? What to do is clear :-), but what
was the cause?

The revision being read has been discarded - you should call
Xapian::Database::reopen() and retry the operation

Thanks a lot
Marcus
Olly Betts | 1 Oct 19:37

Re: QueryParser and booleans with spaces

On Thu, Oct 01, 2009 at 11:56:33AM +0100, Tom Mortimer wrote:
> Does anyone know if it's possible to use the add_boolean_prefix
> technique in QueryParser with terms which include a space? I want to
> filter on (say) The Guardian, but none of these work:
> 
> title:The Guardian
> title:"The Guardian"
> title:(The Guardian)
> 
> I can work around it with a string substitution, e.g.
> title:The_Guardian, but I'd rather be able to keep the space, if
> possible.

I already answered on IRC, but for the archives, this isn't currently
possible.  And there is a ticket for it:

http://trac.xapian.org/ticket/128

Cheers,
    Olly
Andreas Marienborg | 2 Oct 07:53

Re: The revision being read has been discarded


On 1 Oct 2009, at 16:35, double wrote:

> Hello,
>
> What does this error mean? What to do is clear :-), but what
> was the cause?
>
> The revision being read has been discarded - you should call
> Xapian::Database::reopen() and retry the operation

This is usually a result of the database changing during the read, or
in between calling enquire and get_mset (at least that's when I get it).

I usually (in Perl) wrap it in an eval inside a loop and retry a few
times, calling reopen each time, but I index new items several times
each minute, so I guess I'm getting the exception more often than some
other use cases would :)
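
The same retry-and-reopen pattern can be sketched in C++. The helper
name retry_with_reopen and the default of three attempts are
assumptions made for illustration. The exception type is a template
parameter so that the block stays self-contained. With Xapian it would
be Xapian::DatabaseModifiedError, and the reopen callback would call
Xapian::Database::reopen().

```cpp
// Run `operation`; if it throws ModifiedError, call `reopen` and try
// again, for at most `max_attempts` attempts in total. The exception
// from the final failed attempt is rethrown to the caller.
template <typename ModifiedError, typename Operation, typename Reopen>
auto retry_with_reopen(Operation operation, Reopen reopen,
                       int max_attempts = 3) -> decltype(operation()) {
    for (int attempt = 1; ; ++attempt) {
        try {
            return operation();
        } catch (const ModifiedError&) {
            if (attempt >= max_attempts) throw;  // give up
            reopen();  // move the reader to the latest revision
        }
    }
}
```

With a Xapian::Database db and a Xapian::Enquire enquire, a call might
look like:

    retry_with_reopen<Xapian::DatabaseModifiedError>(
        [&] { return enquire.get_mset(0, 10); },
        [&] { db.reopen(); });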

- andreas
Jason Tackaberry | 4 Oct 03:15

Filtering queries with many boolean terms

Hi all,

I'm thinking of using xapian to index my growing mail collection, which
I suppose is a reasonably common use-case.  (The fact that gmane has
used xapian to index some tens of millions of emails is what piqued my
interest.)

First I want to start off with how very impressed I've been with Xapian.
The API is nice, feature-complete bindings are available for my
preferred language, it's very fast, and it generally Just Works exactly
how I expect.  It's very good quality software.

One of the features I'd like to have is gmail-like tagging of threads.
This means I need to be able to add and remove tags to arbitrarily many
documents after they have been indexed.

Unfortunately, xapian is quite slow at updating many documents.
Updating 10k documents by adding an XTAGfoo term takes 40 seconds
(including db flush) on my system.  One use-case for updating this many
documents is, for example, applying a tag to existing emails in a
mailing list.

So, I am looking into an alternative approach.  I'm already maintaining
a table of all threads (gmail would call them conversations) in an sql
database.  All emails in the xapian database are given an XTID<n> term
corresponding to the thread id in the sql database.  If I want to search
for all documents with the word "foo" and with tag "bar", I figured I'd
first look up all thread ids with tag "bar" in SQL, and then use those
ids as filters with the Xapian search.  So the Xapian query might look
like:

James Aylett | 5 Oct 20:27

Re: Filtering queries with many boolean terms

On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:

>         (foo) (tid:0 tid:1 tid:2 tid:3)

What are the brackets for? The '*' in the output is, I think,
OP_SCALE_WEIGHT, which doesn't seem a good query structure for what
you're trying to do.

You actually want a FILTER query at top level; which you've figured
out how to generate. I don't know why this isn't behaving as fast as
you want (some tabulated figures on this, with information about the
corpus size and so on might help others respond here).
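
As a rough sketch of the first step towards such a FILTER query, the
thread ids looked up in SQL can be turned into XTID<n> terms directly,
without going through a query string. The function name
thread_filter_terms is hypothetical, and the block deliberately uses
only the standard library.

```cpp
#include <string>
#include <vector>

// Build one "XTID<n>" term per thread id, matching the term
// convention described in the original message.
std::vector<std::string>
thread_filter_terms(const std::vector<unsigned long>& thread_ids) {
    std::vector<std::string> terms;
    terms.reserve(thread_ids.size());
    for (unsigned long id : thread_ids) {
        terms.push_back("XTID" + std::to_string(id));
    }
    return terms;
}
```

With Xapian, the terms could then be combined as
Xapian::Query(Xapian::Query::OP_OR, terms.begin(), terms.end()) and
used as the right-hand side of an OP_FILTER query:
Xapian::Query(Xapian::Query::OP_FILTER, Xapian::Query("foo"), tags).
Here tags is a made-up name for the OP_OR query.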

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
Olly Betts | 5 Oct 22:50

Re: Filtering queries with many boolean terms

On Mon, Oct 05, 2009 at 07:27:38PM +0100, James Aylett wrote:
> On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:
> 
> >         (foo) (tid:0 tid:1 tid:2 tid:3)
> 
> What are the brackets for? The '*' in the output is, I think,
> OP_SCALE_WEIGHT, which doesn't seem a good query structure for what
> you're trying to do.

Yes, "*" means OP_SCALE_WEIGHT.

> You actually want a FILTER query at top level; which you've figured
> out how to generate. I don't know why this isn't behaving as fast as
> you want (some tabulated figures on this, with information about the
> corpus size and so on might help others respond here).

As of 1.0.4, any scale factor is pushed down to the leaf level by the
query optimiser when it builds the postlist tree, and in 1.1.x it is
pushed into the weighting scheme.  So the different query
representations aren't *necessarily* actually executed differently.  In
particular, OP_FILTER should be executed the same as OP_AND with
OP_SCALE_WEIGHT 0 on one branch.

But I've not had time to look at this in detail yet I'm afraid.

Cheers,
    Olly
Olly Betts | 6 Oct 12:00

Re: Filtering queries with many boolean terms

On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:
> Unfortunately, xapian is quite slow at updating many documents.
> Updating 10k documents by adding an XTAGfoo term takes 40 seconds
> (including db flush) on my system.  One use-case for updating this many
> documents is, for example, applying a tag to existing emails in a
> mailing list.

Currently we rewrite all the terms if any are changed, which is why it
is a bit slow.  This could be improved:

http://trac.xapian.org/ticket/250

So one approach would just be to live with the 40 seconds for now and
you'll get a nice speed up when that optimisation is implemented.

> So, I am looking into an alternative approach.  I'm already maintaining
> a table of all threads (gmail would call them conversations) in an sql
> database.  All emails in the xapian database are given an XTID<n> term
> corresponding to the thread id in the sql database.  If I want to search
> for all documents with the word "foo" and with tag "bar", I figured I'd
> first look up all thread ids with tag "bar" in SQL, and then use those
> ids as filters with the Xapian search.  So the Xapian query might look
> like:
> 
>         (foo) (tid:0 tid:1 tid:2 tid:3)

Note that you really shouldn't build up query strings mechanically to
pass to the QueryParser.  It doesn't have a rigidly defined syntax, but
is really a heuristic parser aiming to handle potentially poorly formed
human written input (and those heuristics can change between releases)

