Richard Boulton | 16 Sep 11:56

Re: Some Questions From the beginner of Xapian

liminghit wrote:
> (1) I see the Xapian::Document has a method
> 
> *  void  add_value (Xapian::valueno valueno, const std::string &value)*
> 
>   What's the purpose of this method?  A Document is related to its 
> terms, but what's the purpose of this?

Values are extra pieces of information which can be used during the 
search to modify the search in some way.  For example, they can be used 
to add an extra weight to some documents, or to sort the results in a 
different order, or to collapse results from a single website.

> (2) The add_posting method adds a term to a document.
> 
> *   void add_posting (const std::string &tname, Xapian::termpos tpos, 
> Xapian::termcount wdfinc=1)*
> 
> I noticed that
> 
> Xapian::TermGenerator has the following method
> 
> *  void  index_text (const Xapian::Utf8Iterator &itor, Xapian::termcount 
> weight=1, const std::string &prefix="")*
> 
>  What are the differences and the relationship between these two functions?

I've just added a FAQ which should answer this.
http://trac.xapian.org/wiki/FAQ/TermGenerator

Dave Spencer | 17 Sep 08:13

Re: Some Questions From the beginner of Xapian

Richard Boulton <richard <at> lemurconsulting.com> writes:

> 
> liminghit wrote:
> > (1) I see the Xapian::Document has a method
> > 
> > *  void  add_value (Xapian::valueno valueno, const std::string &value)*
> > 
> >   What's the purpose of this method?  A Document is related to its 
> > terms, but what's the purpose of this?
> 
> Values are extra pieces of information which can be used during the 
> search to modify the search in some way.  For example, they can be used 
> to add an extra weight to some documents, or to sort the results in a 
> different order, or to collapse results from a single website.

I've been meaning to ask the same basic question.

One other thing, I believe, is that values can be retrieved just as they were
stored, so it can be a form of additional "structured" metadata associated with
a document. So I believe you can do things like

doc.add_value(0, "URL");
doc.add_value(1, "Title");
doc.add_value(2, "Author");

etc.

It would be nice if there was some page on "concepts" that covered this, or 
at least an update to the developer docs:

Olly Betts | 17 Sep 09:26

Re: Some Questions From the beginner of Xapian

On Wed, Sep 17, 2008 at 06:13:40AM +0000, Dave Spencer wrote:
> It would be nice if there was some page on "concepts" that covered this

http://xapian.org/docs/glossary

> I've wondered what the intent of get_data and set_data was, esp why have
> the indexed values (the index being the first arg to get/add value) whereas
> with data it's just a single value -- why not have multiple "data" values,
> or why not get rid of "data" and just let the get/add value calls cover it?

Use values if you need fast access during the match process itself (e.g.
for sorting, collapsing, etc).  Then Xapian knows to store the data such
that this can be done efficiently.  If you're sorting by date, Xapian
only needs date information and doesn't want to have to fetch extraneous
data to get it - this is why there are multiple value slots (the current
implementation doesn't make best use of this but I'm working on that at
the moment as it happens!)
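
As a rough illustration of a date value suitable for sorting, here is a minimal sketch. It assumes that values are compared as plain strings when sorting, so the date is encoded at a fixed width. The helper name date_to_sort_key is hypothetical and not part of Xapian.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Xapian): encode a calendar date as a
// fixed-width "YYYYMMDD" string. Zero padding keeps every key the same
// length, so comparing keys as strings orders them chronologically.
std::string date_to_sort_key(int year, int month, int day) {
    char buf[40];  // large enough for any three int values plus the NUL
    std::snprintf(buf, sizeof buf, "%04d%02d%02d", year, month, day);
    return std::string(buf);
}
```

The resulting string could then be stored with add_value() in whichever slot the application reserves for dates; the slot number is the application's choice.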

Optimising the storage scheme for this use case will hurt other access
patterns, so we advise against storing arbitrary "data fields" in value
slots.  If you need to store other data which isn't needed in this way
(e.g. you want it for displaying results), serialise it into the
document data instead.

There are already plenty of existing ways to serialise structured data
into a single string, so when we were originally building Xapian we just
chose a simple approach which allows you to pick an existing solution
you like (some examples: XML, Python's pickle, JSON, Omega's
"name=value" scheme) and allowed us to get on with the rest of the job.
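
For illustration, a minimal sketch of one such serialisation: one "name=value" pair per line, packed into a single string. The helper names pack_fields and unpack_fields are hypothetical, nothing is escaped, and this is not claimed to match Omega's exact format.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helpers (not part of Xapian): pack named fields into one
// string, one "name=value" pair per line, and unpack them again.
// Simplification: names must not contain '=' or '\n', and values must
// not contain '\n'; nothing is escaped.
std::string pack_fields(const std::map<std::string, std::string>& fields) {
    std::string out;
    for (const auto& field : fields) {
        out += field.first + "=" + field.second + "\n";
    }
    return out;
}

std::map<std::string, std::string> unpack_fields(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::istringstream lines(data);
    std::string line;
    while (std::getline(lines, line)) {
        // Split at the first '=' so values may themselves contain '='.
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip malformed lines
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

The packed string is the sort of thing that would be handed to set_data() at index time, with unpack_fields() applied to the result of get_data() when displaying results.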

liminghit | 18 Sep 04:02

Re: Some Questions From the beginner of Xapian

 
Thanks for your enthusiastic replies, guys. Really helpful!
 
Cheers,
Ming
 
 
 

On 2008-09-17, "Olly Betts" <olly <at> survex.com> wrote:
>On Wed, Sep 17, 2008 at 06:13:40AM +0000, Dave Spencer wrote:
>> It would be nice if there was some page on "concepts" that covered this
>
>http://xapian.org/docs/glossary
>
>> I've wondered what the intent of get_data and set_data was, esp why have
>> the indexed values (the index being the first arg to get/add value) whereas
>> with data it's just a single value -- why not have multiple "data" values,
>> or why not get rid of "data" and just let the get/add value calls cover it?
>
>Use values if you need fast access during the match process itself (e.g.
>for sorting, collapsing, etc). Then Xapian knows to store the data such
>that this can be done efficiently. If you're sorting by date, Xapian
>only needs date information and doesn't want to have to fetch extraneous
>data to get it - this is why there are multiple value slots (the current
>implementation doesn't make best use of this but I'm working on that at
>the moment as it happens!)
>
>Optimising the storage scheme for this use case will hurt other access
>patterns, so we advise against storing arbitrary "data fields" in value
>slots. If you need to store other data which isn't needed in this way
>(e.g. you want it for displaying results), serialise it into the
>document data instead.
>
>There are already plenty of existing ways to serialise structured data
>into a single string, so when we were originally building Xapian we just
>chose a simple approach which allows you to pick an existing solution
>you like (some examples: XML, Python's pickle, JSON, Omega's
>"name=value" scheme) and allowed us to get on with the rest of the job.
>
>At some point I think we probably will add support for some sort of
>document fields. Verbosity is more of an issue here than in most
>situations, so it's not just a case of reinventing the wheel, and
>we may be able to reuse an existing solution anyway.
>
>A numerically subscripted array of strings doesn't add much generality
>though - if you want to store any other sort of structure or any
>non-string data, you're still going to have to serialise it to one or
>more strings. I think we probably should aim higher.
>
>There's a ticket tracking this issue:
>
>http://trac.xapian.org/ticket/53
>
>> I'm guessing the intent of 'data' is to store some key piece of info
>> about a document such as the URL of a doc that represents a web page.
>
>One *or more* pieces of information, but otherwise yes.
>
>Cheers,
> Olly
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

Eric Sellin | 1 Oct 16:28

C++ MatchDecider::operator() is const

Hello all,

Why is MatchDecider::operator() const in the C++ API?

An implementation of MatchDecider might want to store stuff in member
variables so the decision on one document can depend on other decisions
already made on previous documents.

There must be a valid reason why it's const but I can't see it.

Thanks,
Eric.

James Aylett | 1 Oct 16:36

Re: C++ MatchDecider::operator() is const

On Wed, Oct 01, 2008 at 03:28:08PM +0100, Eric Sellin wrote:

> Why is MatchDecider::operator() const in the C++ API?
> 
> An implementation of MatchDecider might want to store stuff in member
> variables so the decision on one document can depend on other decisions
> already made on previous documents.

Wouldn't you need to guarantee the order in which documents are
processed for this to work?

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org

Eric Sellin | 1 Oct 16:47

Re: C++ MatchDecider::operator() is const


> Wouldn't you need to guarantee the order in which documents are
> processed for this to work?

Yes, quite. So that's why it's const. Because documents are processed in
no guaranteed order, each document must be considered for inclusion on a
purely individual basis, otherwise we get unpredictable results.

Many thanks,
Eric.

Richard Boulton | 1 Oct 18:54

Re: C++ MatchDecider::operator() is const

Eric Sellin wrote:
> Hello all,
> 
> Why is MatchDecider::operator() const in the C++ API?
> 
> An implementation of MatchDecider might want to store stuff in member
> variables so the decision on one document can depend on other decisions
> already made on previous documents.
> 
> There must be a valid reason why it's const but I can't see it.

That's a good question.  I'm actually not certain it should be const, 
and perhaps it's something which should be fixed in future.  However, 
fixing it would be a fairly intrusive API change, and would require any 
user subclasses to be updated, so I wouldn't be happy doing this until 
the 2.0 release (and I'm not convinced it's worth the effort, overall). 
It's easy enough to mark members as mutable in the meantime, if you 
really want to.
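
A minimal sketch of the mutable approach follows. Doc and Decider are stand-in types written for this illustration, not Xapian's classes; only the shape of the const call operator is meant to match. The counter is pure bookkeeping, and the decision itself depends only on the current document.

```cpp
#include <string>

// Stand-in types for illustration only; these are NOT Xapian's classes.
struct Doc {
    std::string data;
};

struct Decider {
    virtual ~Decider() {}
    // const call operator, mirroring the situation discussed above.
    virtual bool operator()(const Doc& doc) const = 0;
};

// Accepts documents with non-empty data and counts how often it is called.
class CountingDecider : public Decider {
    // 'mutable' lets a const member function modify this member.
    mutable unsigned calls_ = 0;

  public:
    bool operator()(const Doc& doc) const override {
        ++calls_;
        return !doc.data.empty();
    }

    unsigned calls() const { return calls_; }
};
```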

Note, however, that Xapian does not give you any guarantees about the 
order in which the match decider is applied to documents, and doesn't 
guarantee that it will be called with all the potential matching 
documents.  (At least, I don't think the documentation makes any guarantee 
about the order in which it will be presented with documents - I'm willing 
to be proved wrong!  It certainly often doesn't see 
all the potential matches.)  Therefore, it's not safe to base the 
decision on previously seen documents.  This is why the const was 
originally introduced, as far as I can recall - to try and make this 
clear to users.

Since matchdeciders were introduced, the "matchspy" parameter has been 
added to get_mset().  This takes a match decider, and can be used to 
select a subset of the potential matches, but is called a bit earlier in 
the match process, and it is guaranteed to be called on at least 
"checkatleast" documents.  The intention is that this allows the 
"matchspy" MatchDecider subclass to be used to count features of the 
supplied documents (for example, the matchspy branch contains some 
matchspies which count the occurrences of values in value slots, which 
can be used to present relevant facets or tags for refining the search 
results).  Of course, a matchspy can only really be useful if it has 
mutable members, which is why I feel it might be reasonable to remove 
the const.
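
To make the counting idea concrete, here is a minimal sketch of a spy that tallies the values seen in one slot. ValueDoc and ValueCountSpy are hypothetical stand-ins written for this illustration; they are not Xapian's classes and not the code in the matchspy branch.

```cpp
#include <map>
#include <string>

// Stand-in for a document's value slots; NOT Xapian's Document class.
struct ValueDoc {
    std::map<unsigned, std::string> values;  // slot number -> value
};

// Tallies how often each value occurs in one slot across the documents it
// is shown. This sketch only observes; it never rejects a document.
class ValueCountSpy {
    unsigned slot_;
    // Updated from the const call operator, hence 'mutable'.
    mutable std::map<std::string, unsigned> counts_;

  public:
    explicit ValueCountSpy(unsigned slot) : slot_(slot) {}

    bool operator()(const ValueDoc& doc) const {
        std::map<unsigned, std::string>::const_iterator it =
            doc.values.find(slot_);
        if (it != doc.values.end()) ++counts_[it->second];
        return true;
    }

    const std::map<std::string, unsigned>& counts() const { return counts_; }
};
```

After the match has run, counts() holds one entry per distinct value, which is the raw material for a list of facets or tags.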

-- 
Richard
