Robert Muir (JIRA) | 1 Feb 2010 10:23

[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality


     [
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

Apologies for the large patch.

This patch does the following:
* deprecates RussianTokenizer, RussianStemmer, RussianStemFilter, DutchStemmer, DutchStemFilter,
FrenchStemmer, FrenchStemFilter
* uses Snowball in the above analyzers instead, depending upon version.
* doesn't deprecate GermanStemmer, but uses Snowball instead (which is maintained and relevance-tested
and supports things like u+umlaut = ue, etc). The old stemmer is kept because it is a different algorithm (an alternate).
* the DutchStemmer had 'dictionary based stemming override' support, so to implement this, adds
StemmerOverrideFilter, which does this in a generic way with KeywordAttribute
* adds KeywordAttribute support to SnowballFilter
* deprecates SnowballAnalyzer in favor of language-specific analyzers.
* adds Romanian and Turkish stopword lists, since Snowball is missing them.
* implements language-specific analyzers in place of all the ones Snowball tried to do at once before.
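
The dictionary-based override idea in the list above can be pictured with a small stand-alone sketch. This is not the code from the patch and it does not use Lucene's TokenFilter or KeywordAttribute; the class name, the toy suffix-stripping stemmer, and the sample dictionary entry are assumptions made up for illustration. The only point is that a token found in the override dictionary is emitted as-is and never reaches the stemmer.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the override-then-stem idea; not Lucene API.
final class StemOverrideSketch {

    // Returns the dictionary entry if one exists, otherwise the stemmer's output.
    static String stem(String token, Map<String, String> overrides, UnaryOperator<String> stemmer) {
        String override = overrides.get(token);   // exact, case-sensitive lookup
        if (override != null) {
            return override;                      // treated as a keyword: the stemmer is skipped
        }
        return stemmer.apply(token);
    }

    // Toy stemmer (assumption): strips one trailing "s". Real stemmers are far more involved.
    static String toyStem(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Collections.singletonMap("fiets", "fiets");
        System.out.println(stem("fiets", overrides, StemOverrideSketch::toyStem));   // overridden
        System.out.println(stem("tokens", overrides, StemOverrideSketch::toyStem));  // stemmed
    }
}
```

In a real filter chain the "skip the stemmer" step would presumably be signalled through KeywordAttribute rather than by returning early, so that any later stemming filter can honor it.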

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java

Robert Muir (JIRA) | 1 Feb 2010 10:57

[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality


    [
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828054#action_12828054
] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

Here is a short explanation of what I figure might be the controversial part: adding all the
language-specific analyzers.

I think it's too difficult for a non-English user to use Lucene.
Let's take the Romanian case. Sure, it's supported by SnowballAnalyzer, but:
* where are the stopwords? If the user is smart enough they can google this and find Savoy's list... but it
contains some stray nouns that should not be in there, and will they get the encoding correct?
* for some languages (French, Dutch, Turkish) we already want to do something different. For
French we need the elision filter to tokenize correctly; for Dutch, the special dictionary-based
exclusions (I have been told by some that any stemmer that does not handle fiets correctly is useless); for
Turkish we need the special lowercasing.
* for other languages (German, Swedish, ...) I think we REALLY want to implement decompounding support in
the future. For German at least, there is a public domain wordlist just itching to be used for this.
* oh yeah, and all the javadocs are in English, so writing your own analyzer is another barrier to entry.
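
As an aside on the Turkish point in the list above: the need for special lowercasing can be seen with plain JDK calls. The sketch below only shows how java.lang.String behaves under a Turkish locale versus the root locale; it is not the Lucene filter, and the class and method names are made up for illustration.

```java
import java.util.Locale;

// Illustration with JDK string methods only; not Lucene API.
final class TurkishLowercaseSketch {

    // Locale-neutral lowercasing: "I" becomes "i".
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish lowercasing: "I" becomes dotless i (U+0131),
    // and capital I with dot above (U+0130) becomes plain "i".
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("I").equals("i"));
        System.out.println(lowerTurkish("I").equals("\u0131"));
        System.out.println(lowerTurkish("\u0130").equals("i"));
    }
}
```

A generic lowercasing step therefore folds two distinct Turkish letters together, which is one reason a per-language analyzer can be preferable to a one-size-fits-all chain.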

So I think instead it's best to have a "recommended default" organized by language, preferably one we have
relevance tested or that is already published. Many of the existing Snowball stemmers have published
relevance results available already, thus my bias towards them. Sure, it won't meet everyone's needs, and
users should still think about using them as a template. But digging up your own stoplist, writing
your own analyzer, and figuring out that your language support is really buried in Snowball, combined with
documentation not in your native tongue, adds up to a barrier to entry that is simply too high.


Uwe Schindler (JIRA) | 1 Feb 2010 12:41

[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality


    [
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828071#action_12828071
] 

Uwe Schindler commented on LUCENE-2055:
---------------------------------------

Several bugs:
- StemmerOverrideFilter should be final or have a final incrementToken()
- The mixed usage of a CharArraySet and a conventional dictionary Map is buggy. The CAS uses a different
contains algorithm and lowercasing, so you can get a hit in the CAS while the Map returns null -> NPE. I would not use a
CAS to begin with and would always convert to String for now; I will open an issue for extending CAS to be a CharArrayMap
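
The failure mode described in the second point can be reproduced outside Lucene. In the sketch below a case-insensitive TreeSet stands in for the CharArraySet and a plain HashMap stands in for the dictionary; both are assumptions chosen to show the mismatch, not the classes used in the patch, and the method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Stand-in types only: the set matches case-insensitively, the map does not.
final class MixedLookupSketch {

    // Buggy pattern: membership test and value lookup use different matching rules.
    static int buggyLength(Set<String> keys, Map<String, String> dict, String token) {
        if (keys.contains(token)) {
            return dict.get(token).length();   // NullPointerException when only the set matched
        }
        return -1;
    }

    // Safer pattern: a single lookup, with the null case handled explicitly.
    static int safeLength(Map<String, String> dict, String token) {
        String stem = dict.get(token);
        return stem == null ? -1 : stem.length();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("fiets", "fiets");
        Set<String> keys = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        keys.addAll(dict.keySet());

        System.out.println(safeLength(dict, "Fiets"));   // no match, no exception
        try {
            buggyLength(keys, dict, "Fiets");
        } catch (NullPointerException e) {
            System.out.println("NPE: the set matched \"Fiets\" but the map did not");
        }
    }
}
```

Either making both structures agree on case handling or using one structure for both the test and the lookup removes the mismatch.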

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch
>
>
> would like to remove stemmers in the following packages, and in their analyzers use a
> SnowballStemFilter instead.

Robert Muir (JIRA) | 1 Feb 2010 13:41

[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality


    [
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828079#action_12828079
] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

bq. StemmerOverrideFilter should be final or have a final incrementToken()

OK, thanks, will fix this now.

bq. The mixed usage of a CharArraySet and a conventional dictionary Map is buggy. The CAS uses a different
contains algorithm and lowercasing, so you can get a hit in the CAS while the Map returns null -> NPE. I would not use a
CAS to begin with and would always convert to String for now; I will open an issue for extending CAS to be a CharArrayMap

No. Instead I will force it to be case-sensitive to avoid this; it is silly to have the Dutch stem filter be
horribly slow because of some theoretical case like this.


Robert Muir (JIRA) | 1 Feb 2010 13:47

[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality


     [
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

Makes StemmerOverrideFilter final and hardcodes it as case-sensitive (it makes sense for it to come after some
sort of LowerCaseFilter anyway).

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch, LUCENE-2055.patch
>
>
> would like to remove stemmers in the following packages, and in their analyzers use a
> SnowballStemFilter instead.
> * analyzers/fr
> * analyzers/nl

Chris Harris (JIRA) | 1 Feb 2010 19:32

[jira] Commented: (LUCENE-2232) Use VShort to encode positions


    [
https://issues.apache.org/jira/browse/LUCENE-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828210#action_12828210
] 

Chris Harris commented on LUCENE-2232:
--------------------------------------

I have a little bit of sampling profiling data from YourKit that may be relevant. (Paul encouraged me to post
anyway.) Note that the queries submitted were not limited to those requiring PRX data, although some of
them (30%? 40%?) did. This data is _without_ applying this LUCENE-2232 patch. YourKit was set to time
java.io.RandomAccessFile.readBytes and .read with wall clock time.

1. I replayed about 1000 queries taken from our user query logs on a test system that uses rotating drives,
without first submitting any battery of warmup queries.

{code}
 SegmentTermPositions.readDeltaPosition()
     IndexInput.readVInt() <----------
{code}

I looked at the time spent in the marked call to IndexInput.readVInt(). 93% of the time in this readVInt()
was spent in I/O, leaving a maximum of 7% that could theoretically be wasted on the CPU decoding VInts.

2. I profiled one of our live Solr servers that uses SSD drives, after the system had warmed up a bit. Here is
the resulting profiling data, with times relative to SegmentTermPositions.readDeltaPosition():

{code}
SegmentTermPositions.readDeltaPosition() - 100%
  IndexInput.readVInt - 100%

Yonik Seeley (JIRA) | 1 Feb 2010 19:50

[jira] Commented: (LUCENE-2232) Use VShort to encode positions


    [
https://issues.apache.org/jira/browse/LUCENE-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828218#action_12828218
] 

Yonik Seeley commented on LUCENE-2232:
--------------------------------------

bq. Has it been tried to unroll getVInt()? Unrolling getVInt into 5 explicit getByte()'s has the advantage
that the later conditional jumps are easier to predict.

IIRC, there was a JIRA issue in the past that tried this.  It showed a speedup in micro-benchmarks but a
slowdown when it was actually used in lucene for something like termdocs.
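
For readers unfamiliar with the two variants being compared, here is a stand-alone sketch of a variable-length int decoder in loop form and in unrolled form. It assumes the usual layout of seven payload bits per byte, low-order group first, with the high bit as a continuation flag. The class, its byte-array cursor, and the encode helper are made up for illustration and are not Lucene's IndexInput; the sketch says nothing about which form is faster.

```java
import java.io.ByteArrayOutputStream;

// Stand-alone sketch of 7-bits-per-byte variable-length ints; not Lucene's classes.
final class VIntSketch {
    private final byte[] buf;
    private int pos;

    VIntSketch(byte[] buf) {
        this.buf = buf;
    }

    byte readByte() {
        return buf[pos++];
    }

    // Loop form: keep reading while the high (continuation) bit is set.
    int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Unrolled form: at most five explicit readByte() calls for a 32-bit int.
    int readVIntUnrolled() {
        byte b = readByte();
        int i = b & 0x7F;
        if (b >= 0) return i;              // high bit clear: value is complete
        b = readByte();
        i |= (b & 0x7F) << 7;
        if (b >= 0) return i;
        b = readByte();
        i |= (b & 0x7F) << 14;
        if (b >= 0) return i;
        b = readByte();
        i |= (b & 0x7F) << 21;
        if (b >= 0) return i;
        b = readByte();
        i |= (b & 0x7F) << 28;             // fifth byte carries the remaining top bits
        return i;
    }

    // Helper (assumption): encodes non-negative ints in the same layout.
    static byte[] encode(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }
}
```

Both decoders return the same values; the difference is only in how the branches are laid out, which is why micro-benchmarks and whole-system measurements can disagree as described above.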

> Use VShort to encode positions
> ------------------------------
>
>                 Key: LUCENE-2232
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2232
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Paul Elschot
>         Attachments: LUCENE-2232-nonbackwards.patch, LUCENE-2232-nonbackwards.patch
>
>
> Improve decoding speed for typical case of two bytes for a delta position at the cost of increasing the size
> of the proximity file.


Paul Elschot (JIRA) | 1 Feb 2010 21:08

[jira] Commented: (LUCENE-2232) Use VShort to encode positions


    [
https://issues.apache.org/jira/browse/LUCENE-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828251#action_12828251
] 

Paul Elschot commented on LUCENE-2232:
--------------------------------------

Yonik:

I searched a bit for that past issue; it was probably LUCENE-639 on what was then readVInt.
The issue was closed as won't fix, inconclusive.
One could conclude that it is probably not worthwhile to unroll the loop here; I'll provide another
version of the patch for that.

Chris:

Indeed the hope is that for larger docs the amount of I/O will hardly increase.

It is good to know that readDeltaPosition spends 93% of time in readVInt when using a disk.
It might well be that the C++ compiler used by Zhang is not as good at preventing jumps that are difficult to predict
as the current Hotspot JIT.

That 93% in readVInt falling to 69% on SSD is a nice confirmation that moving from disk to SSD makes
decoding speed more important.

Could you also measure how much time is spent in total in query search, for example in
IndexSearcher.search()?
That would give an indication of an upper bound on the practical gain from this.


Paul Elschot (JIRA) | 1 Feb 2010 21:12

[jira] Commented: (LUCENE-2232) Use VShort to encode positions


    [
https://issues.apache.org/jira/browse/LUCENE-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828256#action_12828256
] 

Paul Elschot commented on LUCENE-2232:
--------------------------------------

While looking around for unrolling I also found HARMONY-3076 on unrolling in 2007.
Any ideas on whether Harmony would be worth trying here?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Chris Harris (JIRA) | 1 Feb 2010 22:14

[jira] Issue Comment Edited: (LUCENE-2232) Use VShort to encode positions


    [
https://issues.apache.org/jira/browse/LUCENE-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828210#action_12828210
] 

Chris Harris edited comment on LUCENE-2232 at 2/1/10 9:13 PM:
--------------------------------------------------------------

I have a little bit of sampling profiling data from YourKit that may be relevant. (Paul encouraged me to post
anyway.) Note that the queries submitted were not limited to those requiring PRX data, although some of
them (30%? 40%?) did. This data is _without_ applying this LUCENE-2232 patch. YourKit was set to time
java.io.RandomAccessFile.readBytes and .read with wall clock time.

1. I replayed about 1000 queries taken from our user query logs on a test system that uses rotating drives,
without first submitting any battery of warmup queries.

{code}
 SegmentTermPositions.readDeltaPosition()
     IndexInput.readVInt() <----------
{code}

I looked at the time spent in the marked call to IndexInput.readVInt(). 93% of the time in this readVInt()
was spent in I/O, leaving a maximum of 7% that could theoretically be wasted on the CPU decoding VInts.

2. I profiled one of our live Solr servers that uses SSD drives, after the system had warmed up a bit. Here is
the resulting profiling data, with times relative to SegmentTermPositions.readDeltaPosition():

{code}
SegmentTermPositions.readDeltaPosition() - 100%
  IndexInput.readVInt - 100%

