Harini Raghavan | 2 Jan 2006 08:26

Re: Query Scoring

Yes, I was referring to how IDF is used in the Highlighter code to 
prioritize fragments of the documents.

My requirement is to show the relevant fragments of the news article for 
each company along with the search results. But the Highlighter API 
sometimes picks fragments that are not very relevant to the news 
article/company. I would like to know if there is any way I can modify 
the scoring/ranking of these fragments so that news items with a company 
name and keywords in the headline are assigned a very strong relevancy 
ranking, closely followed by a company name mention in the first 
paragraph and multiple mentions within the entire story. Something like 
headline = 5 points, first paragraph = 4, etc.
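
To make the weighting concrete, here is a minimal sketch in plain Java 
(no Lucene calls). The class and method names are made up for 
illustration, and the 1 point per mention in the rest of the story is an 
assumed placeholder; only the 5 and the 4 come from the example above.

```java
import java.util.Locale;

/** Hypothetical scorer: weights a company mention by where it appears in a story. */
final class PositionWeightedScorer {
    static final int HEADLINE_POINTS = 5;        // from the example above
    static final int FIRST_PARAGRAPH_POINTS = 4; // from the example above
    static final int BODY_MENTION_POINTS = 1;    // assumption, not from the thread

    /** 'rest' is the story text after the first paragraph. Matching is case-insensitive. */
    static int score(String headline, String firstParagraph, String rest, String company) {
        String needle = company.toLowerCase(Locale.ROOT);
        int total = 0;
        if (contains(headline, needle)) {
            total += HEADLINE_POINTS;
        }
        if (contains(firstParagraph, needle)) {
            total += FIRST_PARAGRAPH_POINTS;
        }
        // Every further mention in the rest of the story adds a small amount.
        total += countOccurrences(rest.toLowerCase(Locale.ROOT), needle) * BODY_MENTION_POINTS;
        return total;
    }

    private static boolean contains(String text, String lowerNeedle) {
        return !lowerNeedle.isEmpty() && text.toLowerCase(Locale.ROOT).contains(lowerNeedle);
    }

    /** Counts non-overlapping occurrences of lowerNeedle in lowerText. */
    static int countOccurrences(String lowerText, String lowerNeedle) {
        if (lowerNeedle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = lowerText.indexOf(lowerNeedle);
        while (from >= 0) {
            count++;
            from = lowerText.indexOf(lowerNeedle, from + lowerNeedle.length());
        }
        return count;
    }
}
```

A score like this would have to be computed outside the Highlighter, for 
example to re-order the hits or fragments after the search has run.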

Thanks,
Harini

markharw00d wrote:

> Sorry to contradict, Erik, but the Highlighter's QueryScorer will make 
> use of IDF, given a reader, in order to better prioritise which are 
> the "best" bits of a document.
> However, in the particular example given, the criteria include 
> several non-text fields which are not useful for IDF and general 
> scoring purposes - these are perhaps better expressed using a filter 
> of some form. Otherwise, why should the scarcity of a particular date 
> in the given range boost one matching document above others? These 
> numeric-type fields are simply mandatory boolean "hygiene factors" 
> and should ideally play no part in highlight selection or results 
> ordering in general based on their IDF or TF.

Chris Hostetter | 2 Jan 2006 08:56

Re: Query Scoring


: My requirement is to show the relevant fragments of the news article for
: each company along with the search results. But the Highlighter API
: sometimes picks fragments that are not very relevant to the news
: article/company. I would like to know if there is any way I can modify
: the scoring/ranking of these fragments so that news items with a company
: name and keywords in the headline are assigned a very strong relevancy
: ranking, closely followed by a company name mention in the first
: paragraph and multiple mentions within the entire story. Something like
: headline = 5 points, first paragraph = 4, etc.

Well, the sample query you mentioned isn't checking any company names, or
doing anything with a "keywords" field.  I'm not too familiar with the way
the highlighter package works, but I imagine that with the types of
queries you said you are using, if you are highlighting the "Content"
field, the CompanyId and the FilingDate clauses of your query will be
fairly irrelevant (because they are numbers, not because they are different
field names).

An idea I've suggested before (but I don't remember if anyone ever said
whether it is a viable use of the Highlighter or not) is to give the
highlighter a completely different Query object than the one you used to
get your search results.

i.e., if your search query (what you want used to compute score) is...

  +(CompanyId:10 CompanyId:20) Content:"cost saving" Content:outsource

...but once you've gotten those results, what you really care about is

Erik Hatcher | 2 Jan 2006 11:12

Re: Problems with sandbox - can't find org.apache.lucene.store.IndexInput

I haven't checked the specifics, but many of the contrib (the  
"sandbox" is the old name for it) projects have upgraded their latest  
code to be against the trunk of Lucene, which is destined to be  
Lucene 1.9.  You'll need to either grab a previous JAR built before  
the codebase changed, or upgrade yourself to the trunk of Lucene's  
subversion repository all the way around.

	Erik

On Dec 31, 2005, at 10:21 AM, Colin Young wrote:

> I'm attempting to compile Lucene with some sandbox code -- specifically
> the Berkeley DB index storage -- and I'm running into an issue where the
> code is attempting to import IndexInput (apparently located in
> org.apache.lucene.store.IndexInput) but I can't find it in the source
> anywhere. I'm not sure if the sandbox code is maybe using a more recent
> version of the Lucene code, or if I'm missing something obvious. My
> personal guess is that it's the latter.
>
> I'm using Lucene 1.4.3 source and the db directory from the source
> repository at the apache site.
>
> Thanks for any tips.
>
> Colin

Daniel Cortes | 2 Jan 2006 13:59

My first question in 2006 :D

Hello everybody and happy new year!
My first question about Lucene in 2006 is this:

What should I do about the message "No tvx file"? Every night I run a 
complete indexing process for a phpBB forum.
For example, when indexing 93 documents (posts in the phpBB forum) I 
see 4 "No tvx file" messages in my logs.
The call that produces the message is this (contents is a String):
            Field CONTENTS = Field.UnStored("CONTENTS",contents,true);
            CONTENTS.setBoost((float) 0.2);
            lucene_doc.add(CONTENTS);

The problem is that when I'm working with an active forum I can get 
nearly 200 "No tvx file" messages.
My index uses setCompoundFile(true).

What am I doing wrong? Or what can I do to avoid these messages?

Thanks for any reply

Bernhard Messer | 2 Jan 2006 15:21

Re: My first question in 2006 :D

Daniel,

You can simply ignore this message. It only says that you have term 
vectors enabled and added one or more "empty" documents without a body. 
If you don't need term vectors for any special operations on index 
terms, switch this feature off.

Bernhard

Daniel Cortes wrote:

> Hello everybody and happy new year!
> My first question about Lucene in 2006 is this:
>
> What should I do about the message "No tvx file"? Every night I run a 
> complete indexing process for a phpBB forum.
> For example, when indexing 93 documents (posts in the phpBB forum) I 
> see 4 "No tvx file" messages in my logs.
> The call that produces the message is this (contents is a String):
>            Field CONTENTS = Field.UnStored("CONTENTS",contents,true);
>            CONTENTS.setBoost((float) 0.2);
>            lucene_doc.add(CONTENTS);
>
> The problem is that when I'm working with an active forum I can get 
> nearly 200 "No tvx file" messages.
> My index uses setCompoundFile(true).
>
> What am I doing wrong? Or what can I do to avoid these messages?
>
> Thanks for any reply

Colin Young | 2 Jan 2006 21:51

RE: Problems with sandbox - can't find org.apache.lucene.store.IndexInput

That would probably explain things. Is 1.9 close, or are we still
talking months away? Unfortunately, what I'm trying to do is use the
code for Berkeley DB Java Edition which, as best I can tell, was only
ported against the 1.9 code, so it looks like my choices are to do the
port myself, or check out 1.9 to see what the current issues are and see
how stable it is for my purposes.

Thanks

Colin Young

-----Original Message-----
From: Erik Hatcher [mailto:erik <at> ehatchersolutions.com] 
Sent: 2 January, 2006 05:12
To: java-user <at> lucene.apache.org
Subject: Re: Problems with sandbox - can't find
org.apache.lucene.store.IndexInput

I haven't checked the specifics, but many of the contrib (the "sandbox"
is the old name for it) projects have upgraded their latest code to be
against the trunk of Lucene, which is destined to be Lucene 1.9.  You'll
need to either grab a previous JAR built before the codebase changed, or
upgrade yourself to the trunk of Lucene's subversion repository all the
way around.

	Erik

On Dec 31, 2005, at 10:21 AM, Colin Young wrote:

> I'm attempting to compile Lucene with some sandbox code -- 

Erik Hatcher | 3 Jan 2006 03:02

Re: Problems with sandbox - can't find org.apache.lucene.store.IndexInput

Trunk of Lucene is very stable, more so than 1.4.3 I've heard.

Is the 1.9 release close?  Hard to say.  It could be.  No  
substantial changes to the trunk before 1.9 is officially released  
are planned that I know of.

	Erik

On Jan 2, 2006, at 3:51 PM, Colin Young wrote:

> That would probably explain things. Is 1.9 close, or are we still
> talking months away? Unfortunately, what I'm trying to do is use the
> code for Berkeley DB Java Edition which, as best I can tell, was only
> ported against the 1.9 code, so it looks like my choices are to do the
> port myself, or check out 1.9 to see what the current issues are and
> see how stable it is for my purposes.
>
> Thanks
>
> Colin Young
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik <at> ehatchersolutions.com]
> Sent: 2 January, 2006 05:12
> To: java-user <at> lucene.apache.org
> Subject: Re: Problems with sandbox - can't find
> org.apache.lucene.store.IndexInput
>
> I haven't checked the specifics, but many of the contrib (the  

Colin Young | 3 Jan 2006 03:54

RE: Problems with sandbox - can't find org.apache.lucene.store.IndexInput

That's good enough for me. At this point, going with a reasonably stable
branch rather than using my code appears to be the more conservative
option considering our release timeframe (which allows for extensive
testing).

Thanks for the help (and the excellent book).

Colin

-----Original Message-----
From: Erik Hatcher [mailto:erik <at> ehatchersolutions.com] 
Sent: 2 January, 2006 21:03
To: java-user <at> lucene.apache.org
Subject: Re: Problems with sandbox - can't find
org.apache.lucene.store.IndexInput

Trunk of Lucene is very stable, more so than 1.4.3 I've heard.

Is the 1.9 release close?  Hard to say.  It could be.  No substantial
changes to the trunk before 1.9 is officially released are planned that
I know of.

	Erik

On Jan 2, 2006, at 3:51 PM, Colin Young wrote:

> That would probably explain things. Is 1.9 close, or are we still 
> talking months away? Unfortunately, what I'm trying to do is use the 
> code for Berkeley DB Java Edition which, as best I can tell, was only 
> ported against the 1.9 code, so it looks like my choices are to do the

Harini Raghavan | 3 Jan 2006 05:47

Re: Query Scoring

Thank you, Chris. That seems like a good suggestion. I will try to pass a 
different Query object to the Highlighter API than the one used for 
searching.

I plan to break down the HTML document and store the title, subtitle and 
content in different fields of the index. So if I create a new query 
comparing company name and keywords against the title and content 
fields, I am assuming that the Highlighter API will rank a fragment 
where both terms of the query match higher than fragments where just 
one term (either title or content) matches. I am also assuming that, 
even if I do not increase the boost factor of any of the terms, the API 
will take care of this ranking.
This is my understanding of the scoring/ranking algorithm. Any comments, 
anyone?
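
A toy model of that assumption in plain Java, not the Highlighter's 
actual code: each distinct query term found in a fragment contributes 
its weight once, so a fragment matching two terms outscores a fragment 
that repeats one term. The class name, the tokenization and the equal 
weights are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of fragment scoring by distinct matching query terms. */
final class FragmentTermScorer {

    /**
     * Sums the weight of every distinct query term that occurs in the fragment.
     * termWeights maps a lower-case term to its weight (for example a boost
     * or an IDF-derived value); repeated occurrences are counted only once.
     */
    static float score(String fragment, Map<String, Float> termWeights) {
        Set<String> seen = new HashSet<String>();
        float total = 0f;
        // Crude tokenization on non-word characters; a real analyzer does more.
        for (String token : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
            Float weight = termWeights.get(token);
            if (weight != null && seen.add(token)) {
                total += weight;
            }
        }
        return total;
    }
}
```

Under this model no boost is needed for the two-term fragment to win, 
but whether the real scorer behaves the same way is exactly what I would 
like confirmed.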

Thanks,
Harini

Chris Hostetter wrote:

>: My requirement is to show the relevant fragments of the news article for
>: each company along with the search results. But the Highlighter API
>: sometimes picks fragments that are not very relevant to the news
>: article/company. I would like to know if there is any way I can modify
>: the scoring/ranking of these fragments so that news items with a company
>: name and keywords in the headline are assigned a very strong relevancy
>: ranking, closely followed by a company name mention in the first
>: paragraph and multiple mentions within the entire story. Something like
>: headline = 5 points, first paragraph = 4, etc.

Steven Pannell | 3 Jan 2006 10:54

how to handle plurals

Hi,

Does anyone know how I can handle plurals in Lucene?  If I search for dog
and then dogs I get two different search results.  I would like the same
results regardless of the plural.  Can this be done?

thanks,
Steve.
