Jamal jamalator | 1 Nov 2007 02:23

Problem understanding the hits.score


 Hi 

I have indexed these HTML documents 
=============z1========================
<html>
  <body>
<h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1><br>
<h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1><br>
<h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1>
  </body>
</html>
=============z2=========================
 <html>
   <body>
 <h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1><br>
 <h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1><br>
   </body>
 </html>
=============z3==========================
 <html>
   <body>
 <h1>zo zo zo zo zo zo zo zo zo zo zo zo </h1><br>
   </body>
 </html>
=========================================
with this code

Field contentK1 = new Field("htmlcontent", httpd.getContentKeywords(), Field.Store.NO, Field.Index.TOKENIZED);

Chris Hostetter | 1 Nov 2007 03:05

Re: Hits.score mystery


: I'm returning the document plus hits.score(i) * 100 but when the

NOTE: the score returned by Hits is not a "percentage" ... it is an 
arbitrary number no greater than 1.  it might be the "raw score" of the document, 
or it might be the result of dividing the "raw score" by the "raw score" 
of the highest scoring document, if the raw score of the highest scoring 
document is greater than 1

(kinda silly huh?)

basically it's just a way to ensure you always have a number no greater than 1 
-- but a score of 0.9 from one query isn't necessarily better than a 
score of 0.1 from another query.
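
To make the normalization described above concrete, here is a minimal plain-Java sketch. It does not call Lucene: the class and method names (HitsScoreSketch, normalize) and the raw scores are made up for illustration, and it only models the rule stated above (divide by the top raw score when that score is greater than 1).

```java
import java.util.Arrays;

final class HitsScoreSketch {
    // Scale raw scores by 1/max only when the top raw score exceeds 1;
    // otherwise hand the raw scores back unchanged.
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float norm = (max > 1.0f) ? 1.0f / max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical raw scores from two unrelated queries.
        System.out.println(Arrays.toString(normalize(new float[] {4.0f, 2.0f, 1.0f})));
        System.out.println(Arrays.toString(normalize(new float[] {0.6f, 0.3f})));
    }
}
```

In the first call the top hit is scaled to 1.0; in the second the raw scores pass through untouched. The two result lists therefore sit on different scales, which is why scores from different queries cannot be compared.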

PS...

http://people.apache.org/~hossman/#threadhijack
When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking

-Hoss
Tom Conlon | 1 Nov 2007 08:00

RE: Hits.score mystery

Hi Grant,

> but you should have a look at Searcher.explain() 

I was half-expecting this answer. :(

The query is very basic and the scoring seems completely arbitrary.
Documents with the same number of occurrences and (seemingly) the same
distribution are being given widely different scores. 

> Chris Hostetter
> NOTE: the score returned by Hits is not a "percentage" ... 
> a score of 0.9 from 1 query isn't better than a score of 0.1 from
> another query.

Thanks for re-emphasizing these points (I was aware of them in any
event).

Tom

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers <at> apache.org] 
Sent: 31 October 2007 19:17
To: java-user <at> lucene.apache.org
Subject: Re: Hits.score mystery

Not sure what UI you are referring to, but you should have a look at
Searcher.explain() for giving you information about why a particular
document scored the way it does


Daniel Naber | 1 Nov 2007 09:18

Re: Hits.score mystery

On Wednesday 31 October 2007 19:14, Tom Conlon wrote:

> 119.txt 17.865013        97%    (13 occurrences)
> 45.txt  8.600986         47%  (18 occurrences)

45.txt might be a document with more terms, so its score is lower 
although it contains more matches.
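
A rough sketch of why that can happen, assuming the default similarity, where term frequency contributes sqrt(freq) and field length contributes 1/sqrt(number of terms). The document lengths below are invented for illustration (the thread does not give them), the names LengthNormSketch and weight are hypothetical, and idf, boosts and the lossy encoding of norms are all ignored.

```java
final class LengthNormSketch {
    // Simplified per-term weight: sqrt(freq) * 1/sqrt(numTerms).
    static double weight(int freq, int numTerms) {
        return Math.sqrt(freq) / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical lengths: a short document with 13 matches
        // versus a much longer one with 18 matches.
        double shortDoc = weight(13, 100);
        double longDoc = weight(18, 1000);
        System.out.println("short doc: " + shortDoc);
        System.out.println("long doc:  " + longDoc);
        System.out.println("short doc wins: " + (shortDoc > longDoc));
    }
}
```

Under these assumptions the shorter document outweighs the longer one despite having fewer matches; Searcher.explain() shows the actual factors for a real index.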

Regards
 Daniel

-- 
http://www.danielnaber.de
Tom Conlon | 1 Nov 2007 10:17

RE: Hits.score mystery

Thanks Daniel,

I'm using Searcher.explain() & luke to try to understand the reasons for the score.

-----Original Message-----
From: Daniel Naber [mailto:lucenelist2007 <at> danielnaber.de] 
Sent: 01 November 2007 08:19
To: java-user <at> lucene.apache.org
Subject: Re: Hits.score mystery

On Wednesday 31 October 2007 19:14, Tom Conlon wrote:

> 119.txt 17.865013        97%    (13 occurrences)
> 45.txt  8.600986         47%  (18 occurrences)

45.txt might be a document with more terms, so its score is lower although it contains more matches.

Regards
 Daniel

--
http://www.danielnaber.de
Sonu SR | 1 Nov 2007 10:45

Question regarding proximity search

Hi,
I am confused about proximity search. I am getting different results for the
queries TTL:"test device"~2 and TTL:"device test"~2.
I expected the same result for both queries. Does the position of the terms
matter in a proximity query? Could anybody explain exactly how Lucene
handles proximity queries? Is there any good documentation for
it?

Thanks,
Sonu
Daniel Naber | 1 Nov 2007 11:43

Re: Question regarding proximity search

On Thursday 01 November 2007 10:45, Sonu SR wrote:

> I am confused about proximity search. I am getting different results for
> the queries TTL:"test device"~2 and TTL:"device test"~2

Order is significant; this is described here:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/PhraseQuery.html#setSlop(int)
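
As a rough illustration of why order matters: the setSlop documentation describes slop as an edit distance counted in moves of query terms out of position, and notes that swapping two adjacent words costs two moves. The sketch below is a simplified model of that idea, not Lucene's implementation. SlopSketch and requiredSlop are hypothetical names, and it only considers the first occurrence of each query term in the document.

```java
import java.util.Arrays;
import java.util.List;

final class SlopSketch {
    // For each query term, compute how far its document position is shifted
    // relative to its position in the query. The spread between the largest
    // and smallest shift is the slop this simplified model requires.
    // Returns -1 if a query term does not occur in the document.
    static int requiredSlop(String[] queryTerms, String[] docTokens) {
        List<String> doc = Arrays.asList(docTokens);
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int q = 0; q < queryTerms.length; q++) {
            int p = doc.indexOf(queryTerms[q]); // first occurrence only
            if (p < 0) {
                return -1;
            }
            int shift = p - q;
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift;
    }

    public static void main(String[] args) {
        String[] doc = {"test", "foo", "device"};
        System.out.println(requiredSlop(new String[] {"test", "device"}, doc));
        System.out.println(requiredSlop(new String[] {"device", "test"}, doc));
    }
}
```

Under this model a field containing "test foo device" needs a slop of 1 for "test device" but 3 for "device test", so with ~2 only the first query would match it.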

regards
 Daniel

-- 
http://www.danielnaber.de
Tom Conlon | 1 Nov 2007 16:30

RE: Hits.score mystery

The reason seems to be my analyser: I found I needed to implement one that lowercases terms but does *not*
drop trailing characters such as # and + (i.e. I needed to match C# and C++).

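import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Lowercases terms and keeps trailing characters such as # and +.
// LowercaseWhitespaceTokenizer is a custom class (not shown), not part of Lucene.
public final class LowercaseWhitespaceAnalyzer extends Analyzer 
{
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowercaseWhitespaceTokenizer(reader);
  }
} 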

The problem now is that "system," etc. is not matched against "system".

Can anyone point to an example of an analyser/tokeniser combination (or another method) that gets around
this, please?
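
One possible approach, sketched in plain Java so it runs without Lucene: split on whitespace, lowercase, and then strip trailing characters unless they are letters, digits, '#' or '+'. The class and method names (TrailingPunctuationSketch, normalize, tokenize) are hypothetical, and wiring this clean-up into an actual tokenizer or token filter is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TrailingPunctuationSketch {
    // Lowercase a token and drop trailing characters that are not
    // letters, digits, '#' or '+'.
    static String normalize(String token) {
        int end = token.length();
        while (end > 0) {
            char c = token.charAt(end - 1);
            if (Character.isLetterOrDigit(c) || c == '#' || c == '+') {
                break;
            }
            end--;
        }
        return token.substring(0, end).toLowerCase(Locale.ROOT);
    }

    // Split on whitespace and normalize each token, skipping empty results.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String t = normalize(raw);
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("System, C# and C++."));
    }
}
```

With this rule "system," is reduced to "system" while "C#" and "C++" keep their suffixes. Leading punctuation and characters inside a token are deliberately left alone in this sketch.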

Thanks,
Tom

-----Original Message-----
From: Tom Conlon [mailto:tomc <at> 2ls.com] 
Sent: 01 November 2007 09:18
To: java-user <at> lucene.apache.org
Subject: RE: Hits.score mystery

Thanks Daniel,

I'm using Searcher.explain() & luke to try to understand the reasons for the score.


Cool Coder | 1 Nov 2007 18:09

Re: Best way to count tokens

This is what I am looking for, prior to adding to the index, so that I can display on my site the 10
tokens that have the most occurrences in my index. In other words, the user can add weight to these terms.

  - BR

Karl Wettin <karl.wettin <at> gmail.com> wrote:

On 31 Oct 2007 at 15:18, Cool Coder wrote:

> Hi Group,
> I need to display on my site a list of the tokens (tags) 
> that have the most occurrences in my index. One way I can think 
> of is to keep track of all tokens during analysis and display 
> them accordingly. Is there any other way? E.g. if I want to display 
> tokens in order of their occurrences as well as their weight.

Are you looking for the term frequency vector?

IndexReader.getTermFreqVector(int, java.lang.String)

If you are using 2.3 the TermVectorMapper might save you a couple of 
clock ticks sorting.

Or is this something you want to do prior to adding the document to 
the index?

-- 
karl


Karl Wettin | 1 Nov 2007 19:55

Re: Best way to count tokens


On 1 Nov 2007 at 18:09, Cool Coder wrote:

> prior to adding into index

Easiest way out would be to add the document to a temporary index and  
extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document and pass the data to a  
TermVectorMapper. You could consider replacing the fields of the  
document with CachedTokenStreams if you got the RAM to spare and  
don't want to waste CPU analyzing the document twice. I welcome  
TermVectorMappingCachedTokenStreamFactory. Even cooler would be to  
pass code down the IndexWriter.addDocument using a command pattern or  
something, allowing one to extend the document at the time of the  
analysis.
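
For the simplest case, counting tokens before the document reaches the index can also be sketched without Lucene at all. The example below swaps in a plain HashMap count over whitespace-separated, lowercased tokens instead of MemoryIndex or a TermVectorMapper; TokenCountSketch, countTokens and topTerms are hypothetical names, and a real setup would run the text through the same analyzer used for indexing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TokenCountSketch {
    // Count how often each lowercased, whitespace-separated token occurs.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return up to n tokens, most frequent first; ties are broken alphabetically.
    static List<String> topTerms(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countTokens(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(topTerms("lucene index lucene search index lucene", 2));
    }
}
```

Counts kept this way per document could be summed across documents to get the site-wide top 10, at the cost of tracking them outside the index.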

-- 
karl
