Olly Betts | 1 Mar 2012 07:43

Re: Starting to work for xapian

On Tue, Feb 28, 2012 at 09:54:36PM +0530, Rajan Walia wrote:
> So for replacing this with OpenBSD getopt are we going to statically
> compile the newer one and replace the gnu getopt completely (also the
> GNU C Library one) and for that I will have to change the files in
> xapian-core/common/getopt.cc right?

There's certainly an argument for using the same implementation
everywhere, but there's also one for using the C library implementation
where there is one.  If GNU getopt and OpenBSD getopt are API
compatible, I'm not sure what is best to do.

Also, presumably on OpenBSD their getopt implementation is available in
libc, so we probably don't want to use a bundled version there.
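
For reference, the interface in question looks roughly like this from the caller's side. This is only a sketch: it assumes the getopt_long() call and <getopt.h> header as provided by GNU libc (and, as far as I know, by the BSD C libraries too), and the Options struct, parse_options() and the option names are made up for illustration rather than taken from any Xapian tool.

```cpp
#include <cassert>
#include <getopt.h>
#include <string>

// Hypothetical options for an example tool (not a real Xapian utility).
struct Options {
    bool verbose = false;
    std::string db_path;
};

// Parse "--verbose"/"-v" and "--db PATH"/"-d PATH" using getopt_long().
Options parse_options(int argc, char** argv) {
    assert(argc >= 1 && argv != nullptr);
    static const struct option long_opts[] = {
        {"verbose", no_argument,       nullptr, 'v'},
        {"db",      required_argument, nullptr, 'd'},
        {nullptr,   0,                 nullptr, 0}
    };
    Options opts;
    optind = 1;  // start scanning from the first argument after the program name
    int c;
    while ((c = getopt_long(argc, argv, "vd:", long_opts, nullptr)) != -1) {
        switch (c) {
            case 'v': opts.verbose = true; break;
            case 'd': opts.db_path = optarg; break;
            default: break;  // unknown option: getopt_long() has already reported it
        }
    }
    return opts;
}
```

If two implementations accept the same calls and globals (optind, optarg), code like this should build unchanged against either, which is what makes the "bundled versus libc" choice mostly a packaging question.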

> And also currently the getopt.cc is modified by Olly for compilation
> with C++, so what will be the right path for me to go ahead?

Currently we compile all the C code as C++ (aside from tools used
only during the build, like lemon and snowball).  In some cases we
need to tweak the C code a little to get this to work.
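
As an illustration of the kind of tweak this can mean (a generic example, not a change taken from getopt.cc): C allows a void* to convert implicitly to any object pointer, while C++ does not, so malloc() results need a cast. The copy_string() helper below is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a malloc()ed copy of s, or nullptr on failure.
char* copy_string(const char* s) {
    assert(s != nullptr);
    const std::size_t len = std::strlen(s) + 1;
    // Valid C:   char *p = malloc(len);
    // C++ has no implicit conversion from void*, so a cast is needed.
    // (Code that must still build as C would use a C-style cast instead.)
    char* p = static_cast<char*>(std::malloc(len));
    if (p != nullptr) std::memcpy(p, s, len);
    return p;
}
```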

The main reason we went this route was that on some platforms configure
could end up picking an incompatible C and C++ compiler combination by
default.  We were running a lot of automated builds, and trying to stop
them doing this without manual intervention was a pain.  That may not
really be an issue now, so compiling C code as C is probably OK if
that's easier.

> I have compiled the source from the bleeding edge git repo and have
> also familiarized myself with the usage of gnu getopt by building a

Olly Betts | 1 Mar 2012 07:47

Re: Starting to work for xapian

On Wed, Feb 29, 2012 at 09:33:19PM +0530, Rajan Walia wrote:
>    I wanted to know how do we use and test xapian as a standalone program.

Running "make check" in the source tree runs a lot of automated tests.
This will exercise some command-line options on some of the tools too,
so if that passes all tests, you're probably doing well.

Cheers,
    Olly
Han Jiang | 3 Mar 2012 17:08

GSoC 2012: Backend for Lucene format indexes

Hi All,
I'm Billy, a senior undergraduate student at Peking University, working in the area of Information Retrieval and Web Mining. When going through the ideas list, I became quite interested in the "Backend for Lucene format indexes" project. I have been using Java Lucene for about a year, but my subsequent work favours C++ code, so this project would be a good way to smooth the transition.
As far as I know, the handling of index files, e.g. in IndexReader, has changed quite a bit (Lucene 3.5 File Format), while the ideas page itself still links to the old 3.0 version. Since coping with all the versions doesn't look simple, shall we just support the old 3.0 format, or a more stable version?
Thank you!

--
Han Jiang
 
EECS, Peking University, China
Every Effort Creates Smile
 
Senior Student

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Olly Betts | 4 Mar 2012 05:31

Re: GSoC 2012: Backend for Lucene format indexes

On Sun, Mar 04, 2012 at 12:08:47AM +0800, Han Jiang wrote:
> As far as I know, the operation of index file, e.g. IndexReader, has
> changed quite some (Lucene3.5 File
> Format<http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html#Index%20File%20Formats>)
> , while the idea page itself still linked to an old 3.0
> version<http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/fileformats.html>.
> Since it doesn't seem a simple work to cope with all the versions, shall we
> just implement to support the old 3.0 format, or a more stable version?

Thanks for noticing this - this idea was carried over from last year's
list, and that's why the link points to an old version.  I've updated
the link on the wiki to the newer one you gave above.

I think it makes sense to support the latest version as the priority,
with support for older versions possibly useful if there's time.

Cheers,
    Olly
Han Jiang | 4 Mar 2012 07:31

Re: GSoC 2012: Backend for Lucene format indexes

ojwb also suggested this on the IRC channel, thanks Olly!

Cheers.
Billy

Akshay M S | 5 Mar 2012 19:09

Interested in IR, Getting started with Xapian

Hi everyone, 

I'm Akshay, an Information Science undergrad from Bangalore. I'm interested in Information Retrieval and I'd like to contribute to Xapian as a part of GSoC and later to feed my interests. 

I liked the idea of adding more weighting schemes (Project #2). I did a project last semester on Document Retrieval on Hadoop using TF-IDF and Cosine Similarity (the query had to be a document).
I read about BM25 from the resources. I don't have a good idea about DFR. I'm referring to [1] and [2] for more information on DFR in addition to the resources mentioned on the page.

And in the project description, I couldn't understand this - "Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project)."
I'm thinking components refer to the per-document term weights and the DF/IDF weights? Any elaboration would be really helpful.

Can someone please point me to patches/bugs that are related to this project so I can understand the existing code better, especially related to Xapian::Weight class or anything else that can get me started with Xapian codebase? 

References - 
Olly Betts | 7 Mar 2012 00:44

Re: Interested in IR, Getting started with Xapian

On Mon, Mar 05, 2012 at 11:39:31PM +0530, Akshay M S wrote:
> And in the project description, I couldn't understand this - *"Additionally,
> for faster searching, an upper bound on each component is needed (each
> database stores a number of summary statistics to help with this - if
> additional statistics would be useful, you could add them as part of the
> project)."*
> I'm thinking components refer to the per-document term weights and the
> DF/IDF weights? Any elaboration would be really helpful.

Xapian assumes that a document's weight is the sum of a contribution
from each term from the query which indexes it, plus an optional
per-document contribution which is independent of query.  You can
also add in other contributions (e.g. from link analysis, click-through
data, how much an advertiser has paid you to boost their pages (yuck),
etc) using a PostingSource.
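
In code terms, the assumption amounts to something like the sketch below. It is not Xapian's internal code: document_weight() and its parameter names are invented for illustration, and the non-negativity checks are an assumption of this sketch.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Total weight of one document for one query, under the additive model:
// one contribution per matching query term, plus an optional per-document
// part that is independent of the query, plus anything supplied by a
// PostingSource-style extra source.
double document_weight(const std::vector<double>& term_weights,
                       double extra_weight = 0.0,
                       double posting_source_weight = 0.0) {
    // Assumption of this sketch: contributions are never negative.
    assert(extra_weight >= 0.0 && posting_source_weight >= 0.0);
    const double terms =
        std::accumulate(term_weights.begin(), term_weights.end(), 0.0);
    return terms + extra_weight + posting_source_weight;
}
```

Because the total is a plain sum, an upper bound on each part gives an upper bound on the whole, which is what the optimisations rely on.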

For each of these contributions, we want an upper bound on what the
contribution can be for a particular term (or PostingSource) in a
particular query.

A concrete example is probably the best way to show how this can be
useful, so consider a query: moon OR cheese

Say we want to get the best 10 matches (i.e. highest weights).  We
consider documents in ascending order of the numeric Xapian document
id.  Once we have found 10, we can ignore anything which has a weight
less than the lowest weight of those 10, and as we gradually improve
our candidate set of 10, that minimum weight rises.

If we have bounds on the weights which "moon" and "cheese" can return,
then at some point we can spot that both terms will need to match to
achieve that minimum weight - for example, if "moon" can return at most
2.0 and "cheese" 3.0, and the lowest weight in our current best 10 is
4.0.  Once this happens, we can convert that OR to AND, and that allows
us to skip through the document ids more quickly.
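
The test for that conversion can be sketched as below. This is only an illustration of the reasoning, not the matcher's real code: the Decision enum and decide() are hypothetical names, and the sketch treats a document whose weight merely equals the current minimum as still able to qualify.

```cpp
#include <cassert>

// What to do with "a OR b" given the current state of the result set.
enum class Decision { KeepOr, ConvertToAnd, Stop };

// max_a, max_b: upper bounds on the weight each term can contribute.
// min_weight:   lowest weight among the best N documents found so far.
Decision decide(double max_a, double max_b, double min_weight) {
    assert(max_a >= 0.0 && max_b >= 0.0);
    // Even a document matching both terms cannot reach the threshold.
    if (max_a + max_b < min_weight) return Decision::Stop;
    // Neither term can reach the threshold alone, so both must match.
    if (max_a < min_weight && max_b < min_weight) return Decision::ConvertToAnd;
    // At least one term could still qualify a document by itself.
    return Decision::KeepOr;
}
```

With the numbers from the example, decide(2.0, 3.0, 4.0) gives ConvertToAnd, whereas with a minimum weight of 2.5 "cheese" could still qualify a document alone, so the OR has to stay.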

There are a number of optimisations we have which make use of these
bounds - in some cases we can even spot that there's no chance of
getting another candidate and stop early.

The tighter the bound you can calculate for a weighting scheme, the
more effective these optimisations can be.

If implementing a new backend, you could just return the maximum value
of the type and Xapian will work, but it won't be as fast.  One of the
improvements the chert backend has over the flint backend is that it
keeps track of bounds on document length and within-document-frequency
which allow tighter bounds to be calculated for BM25.

If you implement a new weighting scheme, you might find that keeping
track of one or more additional statistics would allow a tighter
bound to be calculated, so modifying the backend to track these could
be worthwhile (and older backends could just return a really high or
low value, depending if it's an upper or lower bound).

> Can someone please point me to patches/bugs that are related to this
> project so I can understand the existing code better, especially related to
> Xapian::Weight class or anything else that can get me started with Xapian
> codebase?

I don't think there are any patches or bugs related to this one.

This page talks a bit more about the optimisations, but isn't currently
completely up to date:

http://xapian.org/docs/matcherdesign.html

And you can see an implementation of coordinate weighting (probably the
simplest weighting scheme after BoolWeight) here:

http://trac.xapian.org/browser/trunk/xapian-core/tests/api_db.cc#L1755

Cheers,
    Olly
Sean Mikalson | 11 Mar 2012 21:51

GSOC 2012: Dynamic Snippets and QueryParser Reimplementation

Hello,
My name is Sean Mikalson. I am a second year Software Engineering student with a combined degree in Philosophy. I am interested in participating with Xapian in GSOC this year and a couple of projects have initially caught my eye:

  • Dynamic Snippets
  • QueryParser Reimplementation

I have good working knowledge of C/C++, Java and SQL (specifically Transact-SQL). In order to determine where my skills and interests are best suited, and to write the best possible proposal for my application, I would like to get better acquainted with the Xapian code base. I have looked at the FAQS/Snippets link for Dynamic Snippets, and have explored the QueryParser documentation and source code, which is all the information provided on the ideas page.

Is there a way that I can modify and play around with the code on my own machine in order to really get familiar with how it works? Would anyone like to discuss the above project ideas in more depth? After exploring the code I am sure I will have questions and ideas!

Thanks,

Sean Mikalson

Akshay M S | 12 Mar 2012 13:10

Re: Interested in IR, Getting started with Xapian

Hi Olly, 

Thanks for the reply. I have a small question - 

> There are a number of optimisations we have which make use of these
> bounds - in some cases we can even spot that there's no chance of
> getting another candidate and stop early.

In the moon OR cheese example, when can this occur? How can you infer that there is no chance of getting a better candidate? 


--
Regards,
Akshay
Olly Betts | 12 Mar 2012 22:30

Re: Interested in IR, Getting started with Xapian

On Mon, Mar 12, 2012 at 05:40:49PM +0530, Akshay M S wrote:
> Thanks for the reply. I have a small question -
> 
> > There are a number of optimisations we have which make use of these
> > bounds - in some cases we can even spot that there's no chance of
> > getting another candidate and stop early.
> 
> In the moon OR cheese example, when can this occur? How can you infer that
> there is no chance of getting a better candidate?

Say that we run out of documents which are indexed by "moon", but there
are still some for "cheese".  If the max weight which "cheese" can
return is too low to make it into the result set we are building, we
can just stop.
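
A toy version of that loop is sketched below, to make the stopping condition concrete. It is a simplification under stated assumptions, not how the matcher is actually written: Posting, TermList, TopKResult and top_k_or() are hypothetical names, the posting lists are plain in-memory vectors, and only the weights of the best k documents are tracked.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

struct Posting { unsigned docid; double weight; };

struct TermList {
    std::vector<Posting> postings;  // sorted by ascending docid
    double max_weight;              // upper bound on any weight in postings
};

struct TopKResult {
    double min_weight;              // lowest weight among the kept documents
    std::size_t docs_considered;    // how many documents were scored
};

// Evaluate "a OR b", keeping the k highest weights, stopping as soon as no
// remaining document could enter the result set.
TopKResult top_k_or(const TermList& a, const TermList& b, std::size_t k) {
    assert(k > 0);
    // Min-heap, so top() is the lowest weight currently in the best k.
    std::priority_queue<double, std::vector<double>, std::greater<double>> best;
    std::size_t i = 0, j = 0, considered = 0;
    while (i < a.postings.size() || j < b.postings.size()) {
        const bool a_left = i < a.postings.size();
        const bool b_left = j < b.postings.size();
        // Only terms that still have postings can contribute anything.
        const double bound = (a_left ? a.max_weight : 0.0) +
                             (b_left ? b.max_weight : 0.0);
        if (best.size() == k && bound < best.top()) break;  // stop early

        // Take the lowest remaining docid and sum the terms that index it.
        unsigned docid;
        if (a_left && b_left) {
            docid = std::min(a.postings[i].docid, b.postings[j].docid);
        } else {
            docid = a_left ? a.postings[i].docid : b.postings[j].docid;
        }
        double w = 0.0;
        if (a_left && a.postings[i].docid == docid) w += a.postings[i++].weight;
        if (b_left && b.postings[j].docid == docid) w += b.postings[j++].weight;
        ++considered;

        if (best.size() < k) {
            best.push(w);
        } else if (w > best.top()) {
            best.pop();
            best.push(w);
        }
    }
    return TopKResult{best.empty() ? 0.0 : best.top(), considered};
}
```

For instance, take k = 2, "moon" with postings {1: 2.0, 2: 1.5} and bound 2.0, and "cheese" with postings {1: 3.0, 2: 2.5, 5: 1.0, 7: 0.5} and bound 3.0. After documents 1 and 2 the lowest kept weight is 4.0, "moon" is exhausted, and "cheese" alone can give at most 3.0, so documents 5 and 7 are never scored.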

Cheers,
    Olly
