Karolina Bernat | 1 Feb 2011 10:59

Re: Token position vs. token offset - how to bring them together?

Hi,

perhaps someone else is trying to do the same thing, so I'll just write
down how I got along with this problem. It is NOT the most elegant
solution, but it works for me. I don't really know yet how my search
will perform, but the tests look OK so far.

For my phrase search I actually used the SpanQuery. I read that the
QueryParser can't handle this kind of query, so I do it manually by
checking whether the user entered the search text within " ".
When handling a SpanQuery, one has access to the query spans, and the spans
give you the start and end positions of the searched words.
Furthermore, I could find the positions and the offset information of the
WeightedTerms by using:
QueryTermExtractor.getIdfWeightedTerms(...)  and  TermPositionVector

Because there is no way (or none that I know of) to get the offset
information when you only know a term's position, I thought of saving all
the term positions and the term offset information. Since I get the span
start and end positions from a SpanQuery, I can look up the index in the
term positions array at which the span position occurs, and then read the
entry at the same index in my array of term offset information.
With this information I can get the start offset of the first term in the
SpanQuery and the end offset of the last term, and I can highlight that
range as one continuous passage.
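
The lookup described above can be sketched in plain Java. This is only an
illustration under assumptions that are not stated in the post: the positions
and offsets have already been copied out of the term vector into parallel
arrays sorted by position, each position occurs only once, and the span end
is exclusive (one past the last matching position). The class and method
names are hypothetical.

```java
import java.util.Arrays;

/** Sketch: map a span's token positions to character offsets via parallel arrays. */
final class SpanOffsetLookup {

    // positions[i], startOffsets[i] and endOffsets[i] describe the same token.
    private final int[] positions;    // assumed sorted ascending, no duplicates
    private final int[] startOffsets; // character offset where token i starts
    private final int[] endOffsets;   // character offset where token i ends

    SpanOffsetLookup(int[] positions, int[] startOffsets, int[] endOffsets) {
        this.positions = positions;
        this.startOffsets = startOffsets;
        this.endOffsets = endOffsets;
    }

    /**
     * Returns {startOffset, endOffset} covering the span [spanStart, spanEnd),
     * or null if the span is empty or one of its boundary positions is unknown.
     */
    int[] highlightRange(int spanStart, int spanEnd) {
        if (spanEnd <= spanStart) {
            return null;
        }
        int first = Arrays.binarySearch(positions, spanStart);
        int last = Arrays.binarySearch(positions, spanEnd - 1);
        if (first < 0 || last < 0) {
            return null;
        }
        return new int[] { startOffsets[first], endOffsets[last] };
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        SpanOffsetLookup lookup = new SpanOffsetLookup(
                new int[] { 0, 1, 2, 3 },     // token positions
                new int[] { 0, 4, 10, 16 },   // start offsets
                new int[] { 3, 9, 15, 19 });  // end offsets
        int[] range = lookup.highlightRange(1, 3); // span over "quick brown"
        System.out.println(text.substring(range[0], range[1])); // prints "quick brown"
    }
}
```

The binary search replaces a linear scan over the positions array; it only
works if the array is sorted, which is why that is listed as an assumption.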

That is really not the best way to proceed, but I couldn't find a better one.

Please let me know, if there is any other (better) way to do it.

Ophir Cohen | 1 Feb 2011 11:18

Fwd: Payloads API and support

Hi Guys,

I've been using Lucene for more than 5 years and it is a great tool - 
great job! Thanks for everything...

Lately I encountered the new payloads support and it looks like a great 
solution for my project.

*The problem:*

The use case is as follows:

I need to support a way to calculate statistics on web pages.

Each page has a few metrics that come with it (how many users saw it, what 
was the average time on page, etc.).

The requirement is to support queries such as:

How many users saw pages containing the tokens 'house' and 'white'?

Or

What was the average time on pages containing the tokens 'horse' and 'pony'?

*First solution:*

Add pages to Lucene, index the words and store the metrics.

*The problem: performance.*

Tsvika Rabkin | 1 Feb 2011 14:27

Using different field when overriding computeNorm

Hi,

I would like to override default similarity's computeNorm to work with
a different field, other than the query field.

Here is the DefaultSimilarity implementation:

  @Override
  public float computeNorm(String field, FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
      numTerms = state.getLength() - state.getNumOverlap();
    else
      numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
  }

any ideas how to do that?

Thanks,

Tsvika

Ryan Aylward | 1 Feb 2011 19:51

RE: Using different field when overriding computeNorm

I have had to do similar things to other methods of Similarity. In my example, I wanted to have different
behavior for the tf() method for each field. The tf method does not include a field parameter as an input to
it. The only solution I could come up with was to add a thread local to set the field and then check the thread
local within the tf function. Here's the tf function...

	public float tf(float freq) {

		// Get the value of the thread local...
		String field = FieldThreadLocal.getField();

		if ("fieldA".equals(field)) {
			// always return 1 for field A
			return 1;
		} else {
			// otherwise, use the normal tf function
			return super.tf(freq);
		}
	}

tf() is used during scoring so I had to override the TermQuery (and TermWeight and TermScorer) to be able to
set and clear the thread local at the appropriate times. This is a pretty ugly hack, but I couldn't find
another way to make this work.
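
The FieldThreadLocal helper is not shown in the message. A minimal sketch of
what such a holder might look like follows; the class body, the clear()
method, and the FieldAwareTf stand-in are assumptions for illustration, not
the poster's code. The fallback uses sqrt(freq), which is what
DefaultSimilarity.tf() computes.

```java
/** Hypothetical holder for the field that the calling thread is currently scoring. */
final class FieldThreadLocal {
    private static final ThreadLocal<String> FIELD = new ThreadLocal<String>();

    private FieldThreadLocal() {
    }

    static void setField(String field) {
        FIELD.set(field);
    }

    static String getField() {
        return FIELD.get();
    }

    static void clear() {
        FIELD.remove();
    }
}

/** Stand-in for the overridden tf(); it does not extend any Lucene class. */
final class FieldAwareTf {

    static float tf(float freq) {
        String field = FieldThreadLocal.getField();
        if ("fieldA".equals(field)) {
            return 1f; // constant tf for field A
        }
        return (float) Math.sqrt(freq); // same formula as DefaultSimilarity.tf()
    }

    /** Sets the field, scores, and always clears the thread local afterwards. */
    static float tfForField(String field, float freq) {
        FieldThreadLocal.setField(field);
        try {
            return tf(freq);
        } finally {
            FieldThreadLocal.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(tfForField("fieldA", 9f)); // prints 1.0
        System.out.println(tfForField("fieldB", 9f)); // prints 3.0
    }
}
```

Clearing in a finally block matters when threads are pooled, since a stale
field name would otherwise leak into the next query scored on that thread.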

computeNorm() is calculated at index creation time, but you could try to do something similar.

Would be curious if other people had a better suggestion as to how to do this.

-----Original Message-----
From: Tsvika Rabkin [mailto:tsvika.rabkin <at> gmail.com] 
Sent: Tuesday, February 01, 2011 5:27 AM

Robert Muir | 1 Feb 2011 20:09

Re: Using different field when overriding computeNorm

On Tue, Feb 1, 2011 at 1:51 PM, Ryan Aylward <ryan <at> glassdoor.com> wrote:
> I have had to do similar things to other methods of Similarity. In my example, I wanted to have different
behavior for the tf() method for each field. The tf method does not include a field parameter as an input to
it. The only solution I could come up

In Lucene's trunk, Similarity can now be controlled on a per-field
basis, see https://issues.apache.org/jira/browse/LUCENE-2236

The only exceptions are things like coord() which apply to e.g.
BooleanQuery (which might span multiple fields) and remain top-level
in the new SimilarityProvider.
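
The split between per-field and top-level factors can be illustrated with a
plain-Java sketch. The types below are hypothetical stand-ins and are not the
trunk API; see the JIRA issue for the real interfaces. Only the coord formula
(overlap / maxOverlap) is taken from DefaultSimilarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-field scoring factors (not a Lucene type). */
interface FieldScoring {
    float tf(float freq);
}

/** Hypothetical provider: per-field lookup plus a query-level coord(). */
final class PerFieldScoringProvider {
    private final Map<String, FieldScoring> perField = new HashMap<String, FieldScoring>();
    private final FieldScoring fallback;

    PerFieldScoringProvider(FieldScoring fallback) {
        this.fallback = fallback;
    }

    void register(String field, FieldScoring scoring) {
        perField.put(field, scoring);
    }

    /** Returns the scoring registered for the field, or the fallback. */
    FieldScoring get(String field) {
        FieldScoring scoring = perField.get(field);
        return scoring != null ? scoring : fallback;
    }

    /** Query-level factor; it takes no field because a query may span several. */
    float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

With this shape, a field-specific tf() no longer needs a thread local: the
caller asks the provider for the field's scoring object and calls tf() on it.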

Ophir Cohen | 1 Feb 2011 08:59

Payloads API and support

Hi Guys,

I've been using Lucene for more than 5 years and it is a great tool - 
great job! Thanks for everything...

Lately I encountered the new payloads support and it looks like a great 
solution for my project.

*The problem:*

The use case is as follows:

I need to support a way to calculate statistics on web pages.

Each page has a few metrics that come with it (how many users saw it, what 
was the average time on page, etc.).

The requirement is to support queries such as:

How many users saw pages containing the tokens 'house' and 'white'?

Or

What was the average time on pages containing the tokens 'horse' and 'pony'?

*First solution:*

Add pages to Lucene, index the words and store the metrics.

*The problem: performance.*

Grant Ingersoll | 2 Feb 2011 17:07

Re: Payloads API and support


On Feb 1, 2011, at 2:59 AM, Ophir Cohen wrote:

> Hi Guys,
> 
> I've been using Lucene for more than 5 years and it is a great tool - great job! Thanks for everything...

Thanks.

Just so you know going forward, please be patient in expecting answers, especially for complex questions
like this that involve fairly expert usages of Lucene.  From what I can tell, you have sent the same question
3 times in a matter of less than a day.  Sending more than once in a 2-3 day period is just going to make it less
likely that you will get help, not more likely.

Some suggestions inline below.

> 
> 
> Lately I encountered the new payloads support and it looks its a great solution for my project.
> 
> 
> *The problem:*
> 
> The use case is as follows:
> 
> I need to support a way to calculate statistics on web pages.
> 
> Each page has few metrics that comes with it (how many user saw it, what was the average time on page etc...).
> 
> 

Jason Rutherglen | 2 Feb 2011 19:03

Storing an ID alongside a document

I'm curious if there's a new way (using flex or term states) to store
IDs alongside a document and retrieve the IDs of the top N results?
The goal would be to minimize HD seeks, and not use field caches
(because they consume too much heap space) or the doc stores (which
require two seeks).  One possible way using the pre-flex system is to
place the IDs into a payload posting that would match all documents,
and then [somehow] retrieve the payload only when needed.

Alex vB | 2 Feb 2011 20:35

Storing payloads without term-position and frequency


Hello everybody,

I am currently using Lucene 3.0.2 with payloads. I store extra information
about the term, such as frequencies, in the payloads, and therefore I don't
need the frequencies and term positions that Lucene normally stores. I would
like to set f.setOmitTermFreqAndPositions(true), but then I am not able to
retrieve payloads. Would it be hard to "hack" Lucene for my requirements?
Also, I only store one payload per term, if that information makes it easier.

Best regards
Alex
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Storing-payloads-without-term-position-and-frequency-tp2408094p2408094.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Yuhan Zhang | 2 Feb 2011 20:52

Re: Storing payloads without term-position and frequency

Hi Alex,

you can specify the information to be stored on a field by setting
Field.TermVector.NO:

doc.add(new Field(TEXT_FIELD_NAME, text, Field.Store.NO,
Field.Index.ANALYZED, Field.TermVector.NO));

On Wed, Feb 2, 2011 at 11:35 AM, Alex vB <mail <at> avomberg.de> wrote:

>
> Hello everybody,
>
> I am currently using Lucene 3.0.2 with payloads. I store extra information
> in the payloads about the term like frequencies and therefore I don't need
> frequencies and term positions stored normally by Lucene. I would like to
> set f.setOmitTermFreqAndPositions(true) but then I am not able to retrieve
> payloads. Would it be hard to "hack" Lucene for my requests? Anymore I only
> store one payload per term if that information makes it easier.
>
> Best regards
> Alex
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Storing-payloads-without-term-position-and-frequency-tp2408094p2408094.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe <at> lucene.apache.org
> For additional commands, e-mail: java-user-help <at> lucene.apache.org

