Paul Cowan (JIRA) | 1 Sep 2008 04:27

[jira] Created: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Proposal: introduce more sensible sorting when a doc has multiple values for a term
-----------------------------------------------------------------------------------

                 Key: LUCENE-1372
                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.3.2
            Reporter: Paul Cowan
            Priority: Minor

At the moment, FieldCacheImpl produces somewhat disconcerting results when sorting on a field that has
multiple values for one document. For example, imagine a field "fruit" which is added to a document
multiple times, with the values as follows:

doc 1: {"apple"}
doc 2: {"banana"}
doc 3: {"apple", "banana"}
doc 4: {"apple", "zebra"}

If one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and
similarly for the other methods in the various FieldCacheImpl caches) does the following:

          while (termDocs.next()) {
            retArray[termDocs.doc()] = t;
          }

which means that we loop over the terms in their natural order and, for each one, overwrite retArray[doc]
for every document containing that term. Effectively, this overwriting means that a string sort in
[...]
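To make the overwrite concrete, here is a minimal sketch in plain Java that replays the quoted loop on the "fruit" example. It does not use the actual FieldCacheImpl code: the class name OverwriteSortDemo, the method lastTermWins and the in-memory postings map are illustrative assumptions standing in for the term enumeration and termDocs.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for the cache-filling loop; not Lucene code.
class OverwriteSortDemo {

    // Visits terms in natural (sorted) order and, for each term, overwrites
    // the slot of every document that contains it, as the quoted loop does.
    static String[] lastTermWins(SortedMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                retArray[doc] = entry.getKey(); // a later term replaces an earlier one
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        // term -> documents containing it, for docs 1..4 of the example
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {1, 3, 4});
        postings.put("banana", new int[] {2, 3});
        postings.put("zebra", new int[] {4});

        String[] cache = lastTermWins(postings, 5);
        for (int doc = 1; doc <= 4; doc++) {
            System.out.println("doc " + doc + " -> " + cache[doc]);
        }
    }
}
```

With the four example documents, doc 3 ends up cached as "banana" and doc 4 as "zebra": the last term in natural order is the only value the sort sees for a multi-valued document.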

Paul Cowan (JIRA) | 1 Sep 2008 04:31

[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term


     [ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Cowan updated LUCENE-1372:
-------------------------------

    Attachment: lucene-multisort.patch

Patch which deals with this in the case of Strings, with a test case. This is a proof-of-concept example; if people are happy
with the approach I'll implement it for the other types (float, int, etc.), as I think it makes sense there also.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: lucene-multisort.patch
>
>

Apache Hudson Server | 1 Sep 2008 04:32

Build failed in Hudson: Lucene-trunk #573

See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/573/changes

Changes:

[kalle] Javadocs fix

------------------------------------------
[...truncated 9731 lines...]
    [junit] No terms in bi field for: kxor*
    [junit] No terms in bi field for: kxork*
    [junit] No terms in bi field for: kxor*
    [junit] ------------- ---------------- ---------------
   [delete] Deleting: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/surround/test/junitfailed.flag
     [echo] Building swing...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

[...]

Michael McCandless | 1 Sep 2008 11:09

Re: Build failed in Hudson: Lucene-trunk #573


This is the 2nd time in recent memory that we've had a false failure  
from HighlighterTest.testEncoding -- here's the traceback this time:

     [junit] Testcase: testEncoding(org.apache.lucene.search.highlight.HighlighterTest): Caused an ERROR
     [junit] Connection refused
     [junit] java.net.ConnectException: Connection refused
     [junit] 	at java.net.PlainSocketImpl.socketConnect(Native Method)
     [junit] 	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     [junit] 	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     [junit] 	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     [junit] 	at java.net.Socket.connect(Socket.java:520)
     [junit] 	at java.net.Socket.connect(Socket.java:470)
     [junit] 	at sun.net.NetworkClient.doConnect(NetworkClient.java:157)
     [junit] 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:388)
     [junit] 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:523)
     [junit] 	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
     [junit] 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:957)
     [junit] 	at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
[...]

ext-vinay.thota | 1 Sep 2008 14:50

Multi Phrase Search at the Beginning of a field

Hi,

Can someone please help me find a solution to my problem?

I have a single field defined in my document. I want to run a MultiPhraseQuery, but anchored at the beginning of the field.

For example, suppose there are 3 documents whose single field (say 'title') has the values "Hello Love you", "Love You Sister" and "Love Yoxyz".

Then my search for "Love yo*" is a MultiPhraseQuery whose first term is "Love" (using addTerm("Love")) and whose next terms are the expansions of "Yo*" (using addTerms(...), after getting all the terms 'You' and 'Yoxyz' using IndexReader.terms(Yo)). It should return only the documents "Love You Sister" and "Love Yoxyz", but not "Hello Love you".

Can someone please tell me how to get this done?


Regards,
Vinay

Andraz Tori | 1 Sep 2008 15:38

Re: Multi Phrase Search at the Beginning of a field

You can use a standard trick.

Insert a special token at the beginning of every field you are indexing,
and add that special token to the beginning of every query.

Since this token will not occur anywhere else in the field, you will
know that your queries match only the beginnings of fields.
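As a rough illustration of why the sentinel works, here is a self-contained sketch in plain Java that models phrase matching over token positions. It deliberately uses no Lucene classes; the sentinel value "_start_", the class AnchoredPhraseSketch and its methods are all hypothetical names chosen for this example. In a real index the sentinel would also have to survive the analyzer unchanged, which is assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the sentinel-token trick; not Lucene code.
class AnchoredPhraseSketch {

    // Assumed marker that never occurs in real field text.
    static final String START = "_start_";

    // "Index time": lowercase, split on whitespace, prepend the sentinel.
    static List<String> tokens(String fieldValue) {
        List<String> out = new ArrayList<>();
        out.add(START);
        for (String t : fieldValue.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Each phrase slot is a set of acceptable terms, as in a multi-phrase query.
    // The phrase matches if all slots line up at consecutive positions.
    static boolean matches(List<String> docTokens, List<Set<String>> phrase) {
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = phrase.get(i).contains(docTokens.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    static Set<String> slot(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        // "Query time": the sentinel goes first, then "love", then the
        // expansions of "yo*" that were looked up in the term dictionary.
        List<Set<String>> anchored =
            Arrays.asList(slot(START), slot("love"), slot("you", "yoxyz"));

        for (String title : new String[] {"Hello Love you", "Love You Sister", "Love Yoxyz"}) {
            System.out.println(title + " -> " + matches(tokens(title), anchored));
        }
    }
}
```

Because the sentinel only ever sits at position 0, the anchored phrase can only line up at the start of the field, so "Hello Love you" is rejected while the other two titles match.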

bye
andraz

On Mon, 2008-09-01 at 15:50 +0300, ext-vinay.thota <at> nokia.com wrote:
> Hi,
> 
> Can some one please help me in providing a solution for my problem:
> 
> I have a single field defined in my document. Now I want to do a
> MultiPhraseQuery - but at the beginning of the field. 
> 
> For e.g: If there are 3 documents with single field ( say 'title' )
> has the values -> "Hello Love you", "Love You Sister", "Love Yoxyz" 
> 
> Then my search for "Love yo*" ->  MultiPhraseQuery with first term
> "Love" ( using addTerm("Love") and the next terms ( using
> addTerms("Yo*"  - after getting all terms 'You' and 'Yoxyz' using
> IndexReader.terms(Yo) ) should return only the documents "Love You
> Sister", "Love Yoxyz" - but not "Hello Love you".
> 
> Can some one please help me on how to get it done.
> 
> 
> Regards, 
> Vinay
> 
-- 
Andraz Tori, CTO
Zemanta Ltd, London, Ljubljana
www.zemanta.com
mail: andraz <at> zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test

Chris Harris (JIRA) | 1 Sep 2008 22:40

[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated


     [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated LUCENE-1370:
---------------------------------

    Attachment: LUCENE-1370.patch

Fixing to merge cleanly against changes made in r687359. The patch file will also now have a proper name, LUCENE-1370.patch.

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently, if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token
> long, then ShingleFilter.next() won't return any tokens. This patch provides a new option,
> outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then
> ShingleFilter will return that token, regardless of the setting of outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (using outputUnigrams==true), thereby
> expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and
> outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the
> following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup. Without the
> outputUnigramIfNoNgrams option, a single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try
> to find out.
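
The behaviour described in the issue can be sketched in plain Java as follows. This is not the ShingleFilter source or the attached patch, only a small model of bigram shingling over a token list; the class ShingleSketch and the method shingles are hypothetical, and the two boolean parameters simply mirror the option names used in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of bigram shingling; not the ShingleFilter implementation.
class ShingleSketch {

    static List<String> shingles(List<String> tokens,
                                 boolean outputUnigrams,
                                 boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i)); // the token on its own
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1)); // the bigram starting here
            }
        }
        // Fallback from the issue: a one-token stream yields no bigram, so emit
        // the lone token instead of returning nothing.
        if (!outputUnigrams && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence =
            Arrays.asList("please", "divide", "this", "sentence", "into", "shingles");
        List<String> single = Arrays.asList("please");

        System.out.println("index time:       " + shingles(sentence, true, false));
        System.out.println("query time:       " + shingles(sentence, false, true));
        System.out.println("single, no flag:  " + shingles(single, false, false));
        System.out.println("single, fallback: " + shingles(single, false, true));
    }
}
```

In this model the index-time call yields the eleven unigrams and bigrams listed in the description, the query-time call yields the five bigrams, and the single-word query yields nothing unless the fallback flag is set.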

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Chris Harris (JIRA) | 1 Sep 2008 22:42

[jira] Issue Comment Edited: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated


    [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627533#action_12627533 ]

ryguasu edited comment on LUCENE-1370 at 9/1/08 1:40 PM:
--------------------------------------------------------------

Fixing to merge cleanly against changes made in r687357. The patch file will also now have a proper name, LUCENE-1370.patch.

      was (Author: ryguasu):
    Fixing to merge cleanly against changes made in r687359. The patch file will also now have a proper name, LUCENE-1370.patch.

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: LUCENE-1370.patch, ShingleFilter.patch
>
>

Chris Harris (JIRA) | 1 Sep 2008 22:44

[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated


     [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated LUCENE-1370:
---------------------------------

    Attachment: LUCENE-1370.patch

Getting rid of Windows-style newlines

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>

Apache Hudson Server | 2 Sep 2008 05:21

Hudson build is back to normal: Lucene-trunk #574

See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/574/changes
