Simulating Fields


I'm very new to Xapian.

I'm trying to index a set of documents (pretty big). They have titles,
categories, tags, and of course, the document bodies themselves. I would like to
be able to query for any of those "fields". In particular, in the case of the
titles, I'd like to be able to retrieve the document with _this exact title_.

First, am I right to assume that Xapian has no concept of "fields" and that they
must be simulated with term prefixes? I couldn't find a lot of documentation
about it...

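If they must be simulated, what's the best way to do it? Currently, I'm indexing
the documents like this:

=== python ===
  indexer.set_document(doc)
  # Body text: unprefixed terms, without positional information.
  indexer.index_text_without_positions(page.text)
  # Title: once unprefixed, so that general searches also match it...
  indexer.index_text(page.title)
  # ...and once under the XTITLE prefix (wdf increment of 1).
  indexer.index_text(page.title, 1, "XTITLE")
===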

delve shows the title terms indexed with the XTITLE prefix, but, how do I write
a query to search for those? How can I write a query to retrieve a document
given the title (but not those with similar titles)?

-- 
Luis Zarrabeitia
Facultad de Matemática y Computación, UH
http://profesores.matcom.uh.cu/~kyrie

James Aylett | 3 May 00:32

Re: Simulating Fields

On Sat, May 02, 2009 at 05:19:56PM -0400, Luis Alberto Zarrabeitia Gomez wrote:

> delve shows the title terms indexed with the XTITLE prefix, but, how
> do I write a query to search for those? How can I write a query to
> retrieve a document given the title (but not those with similar
> titles)?

You can use the QueryParser for this; set a term prefix before parsing
the query, for instance. Note that if you're matching on exact titles,
you probably want a phrase search (or, if you're not doing anything
else, possibly to construct the term as XTITLE<title> and match it as
a single (boolean) term, without using the QueryParser at all).
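
A minimal sketch of the single-term option, assuming the title is lowercased and whitespace-collapsed before the prefix is added. The names `exact_title_term`, `add_exact_title` and `MAX_TERM_BYTES` are hypothetical, and the 245-byte cap is an assumption about the backend's term length limit. `doc` only needs an `add_term()` method, as `xapian.Document` provides.

```python
# Assumed cap on term length in bytes; check the limit for your backend.
MAX_TERM_BYTES = 245


def exact_title_term(title, prefix="XTITLE"):
    """Build one boolean term that stands for the whole title."""
    # Lowercase and collapse whitespace so indexing and lookup agree.
    normalised = " ".join(title.lower().split())
    term = prefix + normalised
    if len(term.encode("utf-8")) > MAX_TERM_BYTES:
        raise ValueError("title too long to store as a single term")
    return term


def add_exact_title(doc, title):
    """Attach the whole-title term to a document at indexing time."""
    # doc is expected to provide add_term(), as xapian.Document does.
    doc.add_term(exact_title_term(title))
```

At query time the same helper can build the term handed to `xapian.Query(...)`, so the QueryParser is not involved. If the per-word title terms also use XTITLE, a separate prefix for the whole-title term would keep a one-word title from colliding with a word term.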

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org

Re: Simulating Fields


Quoting James Aylett <james-xapian <at> tartarus.org>:

> On Sat, May 02, 2009 at 05:19:56PM -0400, Luis Alberto Zarrabeitia Gomez
> wrote:
> 
> > delve shows the title terms indexed with the XTITLE prefix, but, how
> > do I write a query to search for those? How can I write a query to
> > retrieve a document given the title (but not those with similar
> > titles)?
> 
> You can use the QueryParser for this; set a term prefix before parsing
> the query, for instance.

I tried to do this (qp is a QueryParser instance, with the same stemmer)

===
  qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
  qp.add_boolean_prefix("realtitle",'XTITLE')
  qp.add_prefix("title","XTITLE")
===

but then... how do I construct the query?

  title:Sex and the City

produces the query:
  Xapian::Query((XTITLEsex:(pos=1) OR Zand:(pos=2) OR Zthe:(pos=3) OR city:(pos=4)))

(i.e, only "sex" in the title), while

Olly Betts | 4 May 04:51

Re: Simulating Fields

On Sun, May 03, 2009 at 10:04:03PM -0400, Luis Alberto Zarrabeitia Gomez wrote:
> 
> Quoting James Aylett <james-xapian <at> tartarus.org>:
> 
> > You can use the QueryParser for this; set a term prefix before parsing
> > the query, for instance.
> 
> I tried to do this (qp is a QueryParser instance, with the same stemmer)
> 
> ===
>   qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
>   qp.add_boolean_prefix("realtitle",'XTITLE')
>   qp.add_prefix("title","XTITLE")
> ===

I'm not sure what you think "realtitle" is going to be useful for.  It
doesn't generally make sense to have the same term prefix generated by
both boolean and probabilistic prefixes.

> but then... how do I construct the query?
> 
>   title:Sex and the City
> 
> produces the query:
>   Xapian::Query((XTITLEsex:(pos=1) OR Zand:(pos=2) OR Zthe:(pos=3) OR city:(pos=4)))
> 
> (i.e, only "sex" in the title), while
> 
>   title:'Sex and the City'


Re: Simulating Fields


Quoting Olly Betts <olly <at> survex.com>:

> On Sun, May 03, 2009 at 10:04:03PM -0400, Luis Alberto Zarrabeitia Gomez
> wrote:
> > 
> > Quoting James Aylett <james-xapian <at> tartarus.org>:
> > 
> > > You can use the QueryParser for this; set a term prefix before parsing
> > > the query, for instance.
> > 
> > I tried to do this (qp is a QueryParser instance, with the same stemmer)
> > 
> > ===
> >   qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
> >   qp.add_boolean_prefix("realtitle",'XTITLE')
> >   qp.add_prefix("title","XTITLE")
> > ===
> 
> I'm not sure what you think "realtitle" is going to be useful for.  It
> doesn't generally make sense to have the same term prefix generated by
> both boolean and probabilistic prefixes.

Neither do I :D. I was experimenting to see which kind of prefix would help
me... and as the probabilistic query wasn't working (see below), I tried a
boolean one.

 
> If you want a phrase, the syntax is double quotes:
> 

George Wu | 4 May 06:56

Concurrency during update/merge

Dear all,

I am new to Xapian and am evaluating whether it is suitable
for our project.

One of our concerns is concurrency during update/index flush operations.
Can a Xapian instance handle reads during an update/index flush operation?

Also, real-time search is important in our application.  In a previous
post I learnt that only one Xapian instance can have write access.
For the read-only instances, the database has to be reopened
before newly inserted data becomes visible to them.  In
this case, how fast and how efficient is the index-reopen operation?

Thanks a lot!
George Wu

Olly Betts | 4 May 12:59

Re: Simulating Fields

On Mon, May 04, 2009 at 12:26:43AM -0400, Luis Alberto Zarrabeitia Gomez wrote:
> Now, what would you recommend to match the document titled "sex and the city",
> but not "sex and the city 2: the return"?

I'm not sure I understand why the sequel isn't a relevant result (albeit
one which you would want to rank lower than the exact match).  Since I
don't really seem to understand the aim, I suspect I may be missing the
point of what you're trying to do.

> Adding a value to the document and
> then checking it for the documents in the result set?

That would avoid the length limit of a term.

But I think I'd try just setting a percentage cut-off at 100%.  With the
default BM25 parameters, that will only give you the shortest document
which matches all the terms in the query (or multiple documents if there
is a tie, such as the case of two documents with the same title).

That would return "sex and the city" for a query for 'sex city' (unless
there was a better match), but I'd think that was desirable.  You can
always vet the matching document to check if it was exact or not if you
want.
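
A rough sketch of the cut-off-then-vet idea, assuming `enquire` is an `xapian.Enquire` with the query already set. `exact_title_docids`, `normalise` and `get_title` are hypothetical names; `get_title` stands for whatever callable pulls the stored title out of a document (from the document data or a value slot, say).

```python
def normalise(title):
    """Lowercase and collapse whitespace so titles compare reliably."""
    return " ".join(title.lower().split())


def exact_title_docids(enquire, wanted_title, get_title, limit=10):
    """Return docids of 100% matches whose stored title equals wanted_title."""
    # Drop every match scoring below 100%.
    enquire.set_cutoff(100)
    wanted = normalise(wanted_title)
    # Vet each surviving match against the title we were asked for.
    return [
        match.docid
        for match in enquire.get_mset(0, limit)
        if normalise(get_title(match.document)) == wanted
    ]
```

An empty list then means no document carries exactly that title, even if the cut-off let a close match through.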

Cheers,
    Olly

Olly Betts | 4 May 13:26

Re: Concurrency during update/merge

On Mon, May 04, 2009 at 12:56:13PM +0800, George Wu wrote:
> Can a Xapian instance handle reads during an update/index flush operation?

Yes.  A reader doesn't "lock down" its current version of the database
to ensure it remains valid indefinitely (if it did, the database could
easily become bloated with old revisions), so to be robust you should
catch any DatabaseModifiedError exception in the reader and recover by
calling reopen() on the database and retrying the failed operation.
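
The recovery pattern just described could be sketched as follows. `run_with_reopen` is a hypothetical helper, and the exception class is passed in as a parameter so that the sketch itself does not depend on the bindings.

```python
def run_with_reopen(db, operation, modified_error, max_retries=3):
    """Run operation(db); on modified_error, reopen db and try again."""
    for attempt in range(max_retries + 1):
        try:
            return operation(db)
        except modified_error:
            if attempt == max_retries:
                raise  # give up after the final retry
            db.reopen()  # move the reader to the latest revision
```

With the Python bindings this would be called as `run_with_reopen(db, do_search, xapian.DatabaseModifiedError)`, where `do_search` is your own function taking the database.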

> Also, real-time search is important in our application.  In a previous
> post I learnt that only one Xapian instance can have write access.
> For the read-only instances, the database has to be reopened
> before newly inserted data becomes visible to them.  In
> this case, how fast and how efficient is the index-reopen operation?

Yes, readers look at a consistent version of the database so need to be
told to reopen() to see the latest version.

And reopen() is designed to be efficient.  If you want to know how long
it takes in microseconds, you'll have to profile it on your own
hardware.  If that seems too slow, send us some profiling data:

    http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly

James Aylett | 4 May 15:35

Re: Simulating Fields

On Sun, May 03, 2009 at 10:04:03PM -0400, Luis Alberto Zarrabeitia Gomez wrote:

>   title:'Sex and the City'
> 
> produces
>   Xapian::Query((Ztitl:(pos=1) OR sex:(pos=2) OR Zand:(pos=3) OR Zthe:(pos=4) OR
> city:(pos=5)))

If you use double quote marks:

>>> print(qp.parse_query('title:"Sex and the City"'))
Xapian::Query((XTITLEsex:(pos=1) PHRASE 4 XTITLEand:(pos=2) PHRASE 4
XTITLEthe:(pos=3) PHRASE 4 XTITLEcity:(pos=4)))

then you get what you want. (Note that XTITLE is perhaps overkill;
Omega uses S as a prefix for subject, which is probably semantically
the same for you. Not that it's going to make an enormous difference.)

> > Note that if you're matching on exact titles,
> > you probably want a phrase search (or, if you're not doing anything
> > else, possibly to construct the term as XTITLE<title> and match it as
> > a single (boolean) term, without using the QueryParser at all).
> 
> I guess that'd mean that during the indexing, I would have to use
> the whole title as a single term? (just to be clear), instead of
> 'indexer.index_text(page.title,1,"XTITLE")'. What function should I
> call, then?  Could you provide me an example?

Don't use the TermGenerator if you're creating your own terms, just
Document.add_term() / Document.add_posting() them instead. (In this

Luis Zarrabeitia | 4 May 16:08

Re: Simulating Fields

On Monday 04 May 2009 06:59:24 am Olly Betts wrote:
> On Mon, May 04, 2009 at 12:26:43AM -0400, Luis Alberto Zarrabeitia Gomez
> wrote:
> > Now, what would you recommend to match the document titled "sex and the
> > city", but not "sex and the city 2: the return"?
>
> I'm not sure I understand why the sequel isn't a relevant result (albeit
> one which you would want to rank lower than the exact match).  Since I
> don't really seem to understand the aim, I suspect I may be missing the
> point of what you're trying to do.

Yes, you are missing the point, but that's my fault for not explaining it :D. 
And amazingly, even if you are missing it, you are giving me helpful hints!

Anyway:

I'm indexing a set of documents and storing them. The documents have a title
(and a category, and so on). I can retrieve an individual document by its
docid, but then I'd need to know the docid beforehand. The titles, however,
may be known in advance (e.g., from a link pointing to them somewhere).
Following your example, if I want to read the paper named "sex and the city",
I need the system to retrieve _that_ paper, and hopefully suggest the articles
"sex city" and "sex and the city 2: the return". I'm not _searching_ the
collection, I'm "browsing" through it. (Of course, I also need to search the
collection, otherwise I wouldn't be using Xapian instead of a relational
database.) Think of the "search" and "go" buttons in most wikis. "Go" should
do a search through the titles only if an exact match is not found. And a
link within the wiki itself should never do a search: if there is no
document by that name, it should give a 404.
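
The "go" versus link behaviour described above could be sketched like this. All names are hypothetical: `find_exact` is assumed to return a docid or `None`, and `search_titles` is assumed to return a list of suggestions.

```python
def resolve_title(title, find_exact, search_titles, is_link=False):
    """Decide what a request for `title` should return.

    Returns ("document", docid), ("results", suggestions)
    or ("not_found", 404).
    """
    docid = find_exact(title)
    if docid is not None:
        return ("document", docid)
    if is_link:
        # Links inside the wiki never fall back to a search.
        return ("not_found", 404)
    # "Go" button: no exact match, so search through the titles.
    return ("results", search_titles(title))
```

Keeping the exact lookup and the title search as separate callables means either one can be backed by a boolean term, a phrase query or a stored value without changing this dispatch logic.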
