Parth Gupta | 1 Apr 11:18 2011

New Idea on Ranking in IR

Hello,

I would like to discuss an idea about ranking in IR systems that I think could be a good extension to Xapian. If I am not too late to raise it, please consider it. First, some brief background: I am a Master's student working on my thesis in Information Retrieval. Just today I received an email about GSoC, and more precisely about Xapian, from a professor in Europe whom I am going to join for my Ph.D.

Generally, ranking is unsupervised: the ranked list is produced from the score given by an unsupervised ranking function such as BM25 or TF-IDF, and the documents are returned in decreasing order of that score.

Learning to rank, by contrast, involves supervised learning. If we can extract features for each pair of a query and an initially retrieved document, then we can learn which document should come above which. A search engine needs the relevant documents at the top, because users generally never bother to click through to the next page of results; they choose to modify the query instead.

In Learning to Rank (Letor) we prepare features that represent a query-document pair. After the initial retrieval we take, say, the first 20 or 30 documents and represent them as feature vectors; then, based on the training data, the supervised learner assigns a score to each document for a particular query. For example, if the learner is a regression, we have to learn a weight vector 'W' that scores a document by taking the dot product with the document's feature vector.

Here the features can be term frequency, TF-IDF score, BM25 score and so on; we can use as many as we like. Many machine learning techniques are available for the learning step.
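
To make the regression case concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the feature values, the relevance labels and the helper names (`fit_linear`, `rerank`) are made up, and plain batch gradient descent stands in for whatever learner would really be used.

```python
# Toy pointwise learning-to-rank: learn a weight vector w by least-squares
# regression, then score each candidate document with a dot product.
# All feature values and relevance labels are made up for illustration.

def dot(w, x):
    """Dot product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_linear(X, y, lr=0.1, epochs=2000):
    """Fit w by batch gradient descent on the mean squared error."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, y):
            err = dot(w, x) - target
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def rerank(w, candidates):
    """Sort (doc_id, feature_vector) pairs by decreasing learned score."""
    return sorted(candidates, key=lambda c: dot(w, c[1]), reverse=True)

# Each row: [normalised term frequency, normalised BM25 score, bias term]
train_X = [
    [0.9, 0.8, 1.0],
    [0.2, 0.9, 1.0],
    [0.8, 0.1, 1.0],
    [0.1, 0.2, 1.0],
]
train_y = [1.0, 1.0, 0.0, 0.0]  # hypothetical relevance judgements

w = fit_linear(train_X, train_y)

# Top documents from the initial (unsupervised) retrieval for a new query.
candidates = [
    ("doc_a", [0.7, 0.2, 1.0]),
    ("doc_b", [0.3, 0.9, 1.0]),
    ("doc_c", [0.5, 0.5, 1.0]),
]
ranked = [doc_id for doc_id, _ in rerank(w, candidates)]
print(ranked)  # prints ['doc_b', 'doc_c', 'doc_a']
```

In this toy data the labels follow the BM25 feature much more closely than the term frequency, so the learned weight on BM25 dominates and the reranked order follows it.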

Regards,
Parth Gupta,
M.Tech Candidate,
DA-IICT, Gandhinagar,
India.

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Zhang Fan | 1 Apr 18:44 2011

Applying the Google Summer code

Hi all:


Glad to meet you!

My name is Zhang Fan, and I am a PhD student at Nankai University, China. I have been doing information retrieval research for 3 years, and I have several papers published in top-tier computer science conferences such as WSDM, VLDB, CIKM and ACL. I have many years of coding experience and have participated in several search engine projects.

I want to take part in the suggested project "weighting schemes". It is a good chance for me to contribute to the open source community and to bring my ideas to Xapian.

Besides DfR, I would like to add two more interesting weighting schemes: term proximity and document structure information.
Term proximity means that a document in which the query terms appear close to each other should receive a higher relevance score. Some research work has already supported this idea.
Document structure information means that we distinguish the different parts of a document and assign different weights to the title, body, anchor text and URL.
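
As a rough illustration of these two ideas, here is a small Python sketch. The bonus formula, the field weights and the function names are assumptions chosen for the example, not an existing Xapian API or a scheme taken from the papers below.

```python
# Two toy scoring ingredients. The formula and weights are illustrative
# assumptions, not an existing Xapian weighting scheme.

def min_pair_distance(positions_a, positions_b):
    """Smallest distance between any occurrence of term A and of term B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)

def proximity_bonus(positions_a, positions_b):
    """Bonus in [0, 1]: larger when the two query terms occur closer together."""
    if not positions_a or not positions_b:
        return 0.0  # one of the terms is missing from the document
    return 1.0 / max(1, min_pair_distance(positions_a, positions_b))

# Hypothetical per-field weights: a match in the title counts for more
# than a match in the body.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "url": 1.5, "body": 1.0}

def weighted_term_frequency(field_counts):
    """Combine per-field term counts into one field-weighted frequency."""
    return sum(FIELD_WEIGHTS.get(field, 1.0) * count
               for field, count in field_counts.items())

# Word positions of "open" and "source" in two documents.
near = proximity_bonus([3, 40], [4])   # terms are adjacent
far = proximity_bonus([2], [30])       # terms are 28 positions apart
wtf = weighted_term_frequency({"title": 1, "body": 2})
print(near, far, wtf)
```

Either quantity could then be combined with a base score such as BM25; how to combine them is a separate design question.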

I have two papers involving weighting schemes; please refer to the following:

Fan Zhang, Shuming Shi, Hao Yan, and Ji-Rong Wen. Revisiting Globally Sorted Indexes for Efficient Document Retrieval. Third ACM International Conference on Web Search and Data Mining (WSDM'10), New York, 2010.

Hao Yan, Shuming Shi, Fan Zhang, Torsten Suel and Ji-Rong Wen. Efficient Term Proximity Search with Term-Pair Indexes. In CIKM'10


Please give me some feedback on my ideas. Thank you very much.
--
My Homepage: http://sites.google.com/site/zhfan555/

PhD Student at Nankai U and Intern at MSRA
Zhang Fan | 1 Apr 19:10 2011

Apply the google summer code (additional idea)

Hi all:

As I have gone through the Xapian-devel archives, it seems that many people would like to do the "weighting schemes" project and few would like to do the CJK project.

I am a native speaker of Chinese and have learned a little Korean and Japanese, so if possible, I would like to apply for this project too.


Fan Zhang
--
My Homepage: http://sites.google.com/site/zhfan555/

PhD Student at Nankai U and Intern at MSRA
Sumith Matharage | 2 Apr 09:06 2011

Re: GSOC 2011 : Weighting Schemes

Hi Olly,


I have submitted my application.

Please have a look at it in your free time and let me know your feedback.

Thanks,
Sumith

On Thu, Mar 31, 2011 at 1:30 PM, Olly Betts <olly <at> survex.com> wrote:
On Thu, Mar 31, 2011 at 11:44:25AM +1100, Sumith Matharage wrote:
> "If someone on the mailing lists or IRC has explicitly offered to mentor
> your project, please tell us who".

Mentors have to check a box in melange to indicate their willingness to
mentor before the admin can assign them, so this isn't a useful question
really as we get that information direct anyway.  I've removed it from
the wiki template, so please just skip it.

It was a bit backwards too - generally I'll want to read the proposal
before deciding!

Anyway, thanks for highlighting this - we have a shorter application
form now!

Cheers,
   Olly

Olly Betts | 2 Apr 10:16 2011

Re: Apply the google summer code (additional idea)

On Sat, Apr 02, 2011 at 01:10:52AM +0800, Zhang Fan wrote:
> As I have gone through the Xapian-devel archives, it seems many people
> would like to do the project "weight schemes" and few would like to do the
> CJK project.

I think the weighting schemes idea has attracted the most interest,
especially early on.  It was first on our ideas list originally, but we
moved it to spread applications out over the ideas a bit (other orgs
reported the first idea often gets undue attention).

> I am a native speaker of Chinese and I learned a little Korean and Japanese,
> so if possible, I would like to apply for this project too.

Sure, feel free to.

Cheers,
    Olly
Zhang Fan | 2 Apr 10:54 2011

Re: Apply the google summer code (additional idea)

Dear Olly Betts:

Thank you. Should I update the application I submitted to Google Summer of Code?

And is there anything else I should do at present?
--
My Homepage: http://sites.google.com/site/zhfan555/

PhD Student at Nankai U and Intern at MSRA


On Sat, Apr 2, 2011 at 4:16 PM, Olly Betts <olly <at> survex.com> wrote:
On Sat, Apr 02, 2011 at 01:10:52AM +0800, Zhang Fan wrote:
> As I have gone through the Xapian-devel archives, it seems many people
> would like to do the project "weight schemes" and few would like to do the
> CJK project.

I think the weighting schemes idea has attracted the most interest,
especially early on.  It was first on our ideas list originally, but we
moved it to spread applications out over the ideas a bit (other orgs
reported the first idea often gets undue attention).

> I am a native speaker of Chinese and I learned a little Korean and Japanese,
> so if possible, I would like to apply for this project too.

Sure, feel free to.

Cheers,
   Olly

Olly Betts | 2 Apr 11:06 2011

Re: Apply the google summer code (additional idea)

On Sat, Apr 02, 2011 at 04:54:20PM +0800, Zhang Fan wrote:
> Thank you. Should I update my application form submitted to Google
> Summer of Code?

Better to just submit a second application for the new idea I think.
Changing an existing application so drastically is more likely to
cause confusion.

> And is there anything else I should do at present?

You're welcome to discuss your approach on #xapian or here if you want.

Cheers,
    Olly
saurabh kumar | 2 Apr 20:15 2011

Re: Interested in GSOC projects

Respected sir,

I have made several changes to the proposal.

- I no longer rule out the idea of a tool-generated parser.

- Regarding the timeline, I had already proposed weekly report submission. I have now written a much more detailed timeline.

- I have made several other changes. Please go through it once.

Looking forward to hearing from you.

Thanking you

Saurabh Kumar

On Fri, Apr 1, 2011 at 2:40 AM, saurabh kumar <saurabh.catch <at> gmail.com> wrote:
Respected sir,

I have submitted my proposal on gsoc-melange site.
Please have a look and suggest some improvements.

Meanwhile I am spending time understanding the source code
of query parser class from xapian-core.

Looking forward to your comments on my proposal.

Thanking you
Saurabh Kumar



On Thu, Mar 31, 2011 at 8:38 AM, Olly Betts <olly <at> survex.com> wrote:
On Thu, Mar 31, 2011 at 01:30:30AM +0530, saurabh kumar wrote:
> I have some doubts :
>
> 1) Why is using the tools like yacc, bison not a good approach? Can you
> illustrate with an example?

The parser needs to be forgiving, since the input is typed by (often
non-technical) humans.  The input isn't expected to be program code, and
"Syntax Error" is rarely an acceptable response (better to correct the
query and say "Searched for 'XXX' instead", with a "Did you mean 'YYY'?"
if there's an alternative plausible fix up).

Good error recovery in generated parsers is hard to do well, and usually
results in adding extra rules to the parser description, and that
obfuscates what we're actually trying to do.

The grammar is also not something we can always restrain in ways to suit
the parser generator.

For a formally specified grammar (like a language standard perhaps),
there's usually a BNF description of the grammar rules, so it's handy
to have the parser description mirror it.  That's not the case here.

Currently the lexer does things like tracking the "mode", which is
really an indication of where in the grammar we are.

> 2) In the proposed project are we NOT going to use any tools like YACC etc.?

Well, you're welcome to propose what you like, but you'll need to do
a harder sell on this one.

If you want to use a parser generator, we currently use lemon, which has
a clearer syntax than bison/yacc, and is structured such that the lexer
calls the parser (rather than the parser calling the lexer, as in
bison/yacc).  That allows the lexer to be simpler, since it doesn't need
to "keep its place" with explicit state.  So I'd suggest we probably
don't want to move back to using bison (one reason we moved away
originally was the lack of reentrancy in bison-generated parsers, but
that at least now seems to have been addressed).
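
A toy Python sketch of that arrangement may help. It only illustrates the shape: the lexer loop pushes tokens into a parser object, and the parser repairs bad input instead of rejecting it. The class and function names are made up, this is not lemon's generated interface or Xapian's actual grammar, and the implicit OR between adjacent terms is an assumption for the example.

```python
# Toy "push" parser: the lexer drives the loop and hands each token to the
# parser, so the lexer needs no explicit saved state between calls.
# Names and grammar are illustrative only.

class PushParser:
    def __init__(self):
        self.output = []        # repaired query, as a flat token list
        self.pending_op = None  # operator seen but not yet applied
        self.notes = []         # what we fixed, for a "Searched for ..." hint

    def feed(self, kind, text):
        """Accept one token from the lexer."""
        if kind == "OP":
            if not self.output or self.pending_op is not None:
                # Operator with nothing on its left: drop it, don't fail.
                self.notes.append(f"ignored stray operator {text!r}")
            else:
                self.pending_op = text
        else:  # "TERM"
            if self.output:
                # Adjacent terms with no operator get an implicit OR.
                self.output.append(self.pending_op or "OR")
            self.output.append(text)
            self.pending_op = None

    def finish(self):
        """Called by the lexer at end of input."""
        if self.pending_op is not None:
            self.notes.append(f"ignored trailing operator {self.pending_op!r}")
        return " ".join(self.output), self.notes

def lex_and_parse(query):
    """The lexer: split the input and push each token into the parser."""
    parser = PushParser()
    for word in query.split():
        kind = "OP" if word in ("AND", "OR") else "TERM"
        parser.feed(kind, word)
    return parser.finish()

fixed, notes = lex_and_parse("AND apple banana OR")
print(fixed)  # prints apple OR banana
print(notes)
```

The returned notes are the kind of information a front end could use to say "Searched for 'apple OR banana' instead" rather than reporting a syntax error.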

> Should I mail my proposal to the mailing list or just submit it at google
> SOC site? Because certainly I would require your
> comments to improve upon the first draft.

Just submit it to the site - we can comment there and you can revise it
up until the deadline (April 8th, 19:00 UTC).

Cheers,
   Olly


Parth Gupta | 3 Apr 09:07 2011

Re: New Idea on Ranking in IR

Hey Olly and Richard,

Many papers have shown that incorporating learning into ranking improves results on standard Information Retrieval evaluation measures such as MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision). So I would certainly like to investigate it and incorporate it into the Xapian project.
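
For reference, both measures can be computed from per-query lists of relevance judgements in rank order. The Python sketch below uses made-up judgements and assumes that every relevant document for a query appears somewhere in its ranked list, so average precision is divided by the number of relevant documents retrieved.

```python
# MRR and MAP from binary relevance judgements listed in rank order.
# The judgements below are made up for illustration.

def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant document, or 0 if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_relevance):
    """Mean of precision@k taken at each rank k holding a relevant document.

    Assumes all relevant documents for the query appear in the list.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean(values):
    return sum(values) / len(values)

# One list per query: 1 = relevant, 0 = not relevant, in ranked order.
runs = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]
mrr = mean([reciprocal_rank(r) for r in runs])
map_score = mean([average_precision(r) for r in runs])
print(mrr, round(map_score, 3))  # prints 0.75 0.667
```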

Please give your feedback on the possibility of exploration of the idea so that I can incorporate those things in my application.

I look forward to your feedback.

Regards,
Parth

Olly Betts | 3 Apr 16:40 2011

Re: New Idea on Ranking in IR

On Fri, Apr 01, 2011 at 02:48:28PM +0530, Parth Gupta wrote:
> In Learning to Rank (Letor) we prepare the features which can represent a
> query document pair. So now after the initial retrieval we take say first 20
> or 30 documents and represent them in form of feature vectors, now based on
> the training data our supervised learning will give a score to each document
> for a particular query. For example if this learning is from regression then
> we have to learn 'W' vector which will give a score to the document vector
> by dot product.
> 
> Here the features can be term frequency, TF-IDF score, BM25 Score etc, as
> good as many. For Learning there are many machine learning techniques
> available.

What would be your plan for gathering data to train with?  Some sort of
click-through measurements?

On Sun, Apr 03, 2011 at 12:37:27PM +0530, Parth Gupta wrote:
> Please give your feedback on the possibility of exploration of the idea so
> that I can incorporate those things in my application.

It seems an interesting project to me, though I'm not sure I know enough
about the area to offer much in the way of useful insights.  I can
probably ask some stupid questions though.

But I'm certainly happy to consider an application from you for working
on this.

Cheers,
    Olly
