Samphan Raruenrom (JIRA) | 1 May 2006 05:48

[jira] Commented: (LUCENE-503) Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene

    [ http://issues.apache.org/jira/browse/LUCENE-503?page=comments#action_12377206 ] 

Samphan Raruenrom commented on LUCENE-503:
------------------------------------------

> -It uses the english stop words, does that make sense?

Yes. Thai writers often mix English words into Thai text, so English stop words should apply, though this
is arguable. I'll consult the developer community.

> -Could you write some test cases, similar maybe to those for the French analyzer? 

OK. I'm thinking of writing them.

> Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene
> ---------------------------------------------------------------
>
>          Key: LUCENE-503
>          URL: http://issues.apache.org/jira/browse/LUCENE-503
>      Project: Lucene - Java
>         Type: New Feature

>   Components: Analysis
>     Versions: 1.4
>     Reporter: Samphan Raruenrom
>  Attachments: ThaiAnalyzer.java, ThaiWordFilter.java
>
> Thai text doesn't have spaces between words. Usually, a dictionary-based algorithm is used to break a string
> into words. For Lucene to be usable for Thai, an Analyzer that knows how to break Thai words is needed.
> I've implemented such an Analyzer, ThaiAnalyzer, using ICU4j DictionaryBasedBreakIterator for word [...]

karl wettin | 1 May 2006 09:06

Re: this == that


On 30 Apr 2006, at 04:48, Tatu Saloranta wrote:

>> JIT takes care of it when the first line of code in
>> the equals method
>> is if (this == that) return true;
>
> In case where (this == that) is true, this may well be
> correct, but:
>
>>
>> Please correct me if I'm wrong.
>
> ... you are then assuming 100% match rate: if so this
> might be true. But in (this != that) case difference
> will be more significant

Oh, I didn't think that far. Silly me. : )
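
For anyone following along, the pattern under discussion might look like the
sketch below. TermKey and its fields are hypothetical names made up for
illustration, not Lucene classes. As I read the exchange, the identity check
only short-circuits the call when both references point at the same object;
when they differ, it is one extra comparison before the field-by-field work.

```java
import java.util.Objects;

// Hypothetical value class, used only to illustrate the pattern discussed above.
final class TermKey {
    private final String field;
    private final String text;

    TermKey(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object that) {
        // Fast path: identical references are always equal.
        if (this == that) return true;
        // Slow path: taken whenever the references differ, whether or not
        // the two objects turn out to be equal.
        if (!(that instanceof TermKey)) return false;
        TermKey other = (TermKey) that;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals: same fields, same order.
        return Objects.hash(field, text);
    }
}
```

How much the fast path saves depends on how often callers compare an object
with itself, which is the match-rate point raised above.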

Perhaps there should be a Wiki page that discusses code style.
Things like this. And the use of bytes as floats. And stuff.

By the way, I have about eight working days to spend on Lucene
starting tomorrow. I will work on the instantiated index, the rich
positioning and the contextual spell suggester, in that order. So I
hope at least some of the above will be tested and working really
soon. Time will tell.

Nicholaus Shupe (JIRA) | 1 May 2006 17:14

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

    [ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377225 ] 

Nicholaus Shupe commented on LUCENE-529:
----------------------------------------

I'm experiencing a memory leak in Lucene 1.9.1 that might be related to this problem.  Looks like I'll be
creating my own patched version of Lucene in the meantime.

> TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks =>  OutOfMemoryException
> -------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-529
>          URL: http://issues.apache.org/jira/browse/LUCENE-529
>      Project: Lucene - Java
>         Type: Bug

>   Components: Index
>     Versions: 1.9
>  Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer......will apply to 1.9 code 
>     Reporter: Andy Hind
>  Attachments: ThreadLocalTest.java
>
> TermInfosReader uses an instance level ThreadLocal for enumerators.
> This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to current JVMs, 
> not just an old JVM issue as described in the finalizer of the 1.9 code.
> There is also an instance level thread local in SegmentReader....which will have the same issue.
> There may be other uses which also need to be fixed.
> I don't understand the intended use for these variables.....however
> Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal source code. Each instance of
> TermInfosReader will be creating an instance of the thread local. All this does is create an instance [...]

robert engels (JIRA) | 1 May 2006 17:25

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

    [ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377228 ] 

robert engels commented on LUCENE-529:
--------------------------------------

As was pointed out in the email threads related to this, the submitter's test cases are incorrect.

This bug should be closed as invalid.


Marvin Humphrey | 1 May 2006 18:03

Returning a minimum number of clusters

Greets,

I'm toying with the idea of implementing clustering of search results  
based on comparison of document vectors constrained by field.  For  
instance, you could cluster based on "topic", or "domain", or  
"content".  "domain" would be easy, as it's presumably a single value  
field.  "content" would be much more involved.

The problem I'm trying to solve is how to return a minimum number of  
clusters from a search.  Say the most relevant 100 documents for a  
query are all from the same domain, but you want a maximum of two  
results per domain, a la Google.  I don't see any alternative to  
rerunning the query an indeterminate number of times until you've  
accumulated sufficient clusters, because the search logic doesn't  
know what cluster a document belongs to until the document vector is  
retrieved.
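
A rough sketch of that re-query loop, assuming a single-valued "domain" field
that is only known once a hit has been fetched. Hit, RankedSearch and
DomainCollapser are hypothetical names invented for this sketch, not Lucene
APIs; fetch() stands in for rerunning the query with a deeper window.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit: a document id plus the value of its "domain" field.
final class Hit {
    final int doc;
    final String domain;

    Hit(int doc, String domain) {
        this.doc = doc;
        this.domain = domain;
    }
}

// Hypothetical stand-in for a search call: returns the ranked hits in
// positions [offset, offset + count), or fewer if the results run out.
interface RankedSearch {
    List<Hit> fetch(int offset, int count);
}

final class DomainCollapser {
    // Keeps at most maxPerDomain hits per domain, fetching further batches
    // until minClusters distinct domains have been seen or hits are exhausted.
    static List<Hit> collect(RankedSearch search, int minClusters,
                             int maxPerDomain, int batchSize) {
        Map<String, Integer> seenPerDomain = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        int offset = 0;
        while (seenPerDomain.size() < minClusters) {
            List<Hit> batch = search.fetch(offset, batchSize);
            if (batch.isEmpty()) {
                break; // no more hits to look at
            }
            for (Hit hit : batch) {
                int seen = seenPerDomain.getOrDefault(hit.domain, 0);
                if (seen < maxPerDomain) {
                    kept.add(hit); // still under the per-domain cap
                }
                seenPerDomain.put(hit.domain, seen + 1);
            }
            offset += batch.size();
        }
        return kept;
    }
}
```

The number of fetches is unbounded in the worst case, which is exactly the
cost I would like to avoid.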

Is there a better way?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Nicholaus Shupe (JIRA) | 1 May 2006 18:16

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

    [ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377238 ] 

Nicholaus Shupe commented on LUCENE-436:
----------------------------------------

I will be trying this patch on Lucene 1.9.1 to see if it fixes my production memory leak with Tomcat 5.5.15 /
Redhat / Java 1.5.0_06.

> [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception
> ----------------------------------------------------------------
>
>          Key: LUCENE-436
>          URL: http://issues.apache.org/jira/browse/LUCENE-436
>      Project: Lucene - Java
>         Type: Improvement

>   Components: Index
>     Versions: 1.4
>  Environment: Solaris JVM 1.4.1
> Linux JVM 1.4.2/1.5.0
> Windows not tested
>     Reporter: kieran

>
> We've been experiencing terrible memory problems on our production search server, running lucene (1.4.3).
> Our live app regularly opens new indexes and, in doing so, releases old IndexReaders for garbage collection.
> But...there appears to be a memory leak in org.apache.lucene.index.TermInfosReader.java.
> Under certain conditions (possibly related to JVM version, although I've personally observed it under
> both linux JVM 1.4.2_06, and 1.5.0_03, and SUNOS JVM 1.4.1) the ThreadLocal member variable,
> "enumerators" doesn't get garbage-collected when the TermInfosReader object is gc-ed.
> [...]

Nicholaus Shupe (JIRA) | 1 May 2006 18:22

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

    [ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377239 ] 

Nicholaus Shupe commented on LUCENE-529:
----------------------------------------

This bug seems to be a dupe of http://issues.apache.org/jira/browse/LUCENE-436, invalid or not.

> TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks =>  OutOfMemoryException
> -------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-529
>          URL: http://issues.apache.org/jira/browse/LUCENE-529
>      Project: Lucene - Java
>         Type: Bug

>   Components: Index
>     Versions: 1.9
>  Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer......will apply to 1.9 code 
>     Reporter: Andy Hind
>  Attachments: ThreadLocalTest.java
>
> TermInfosReader uses an instance level ThreadLocal for enumerators.
> This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to current JVMs, 
> not just an old JVM issue as described in the finalizer of the 1.9 code.
> There is also an instance level thread local in SegmentReader....which will have the same issue.
> There may be other uses which also need to be fixed.
> I don't understand the intended use for these variables.....however
> Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal source code. Each instance of
> TermInfosReader will be creating an instance of the thread local. All this does is create an instance
> variable on each thread when it accesses the thread local. Setting it to null in the finaliser will set it to [...]

robert engels (JIRA) | 1 May 2006 18:37

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

    [ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377242 ] 

robert engels commented on LUCENE-436:
--------------------------------------

The bug is invalid, as was pointed out on the lucene-dev list.

When the thread dies, all thread locals will be cleaned-up.

If the thread does not die, the thread-locals are still cleaned-up PERIODICALLY. And the value is always
cleaned up - since it is held in a WeakReference, the only thing that "hangs around" is the key - which
consumes VERY LITTLE memory.

The submitter should be able to provide a testcase that demonstrates the ThreadLocal bug. The one provided
with bug 529 is invalid (as was pointed out on the dev email thread).

IMHO, there is NO BUG HERE. The users are doing something else that is causing an OOM.

This bug should be closed as invalid.
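
For context, the usage pattern both issues describe can be sketched as
follows. SketchReader and getEnum() are hypothetical stand-ins for
illustration, not the actual TermInfosReader code. The sketch only shows that
an instance-level ThreadLocal gives every (reader, thread) pair its own value;
it does not try to reproduce or refute the reported leak, since when an entry
is reclaimed after the reader becomes unreachable depends on the JVM's
ThreadLocal implementation.

```java
// Hypothetical stand-in for a reader that caches one enumerator per thread.
final class SketchReader {
    // Instance-level ThreadLocal: every reader instance owns its own
    // ThreadLocal, so each (reader, thread) pair gets a separate value.
    private final ThreadLocal<StringBuilder> enumerators = new ThreadLocal<>();

    // Returns the calling thread's cached value, creating it on first use.
    // StringBuilder is just a placeholder for a real enumerator object.
    StringBuilder getEnum() {
        StringBuilder cached = enumerators.get();
        if (cached == null) {
            cached = new StringBuilder();
            enumerators.set(cached);
        }
        return cached;
    }

    // Explicit cleanup for the calling thread only; entries created by
    // other threads are not touched by this call.
    void close() {
        enumerators.remove();
    }
}
```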


robert engels (JIRA) | 1 May 2006 18:39

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

    [ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377243 ] 

robert engels commented on LUCENE-436:
--------------------------------------

The finalize() method should be removed from the Segment/Term readers.

In most cases this will actually make things worse, as the objects will need to stay around until the
finalizer thread can run.


Grant Ingersoll | 1 May 2006 19:21

Re: Returning a minimum number of clusters

You might be interested in the Carrot project, which has some Lucene 
support.  I don't know if it solves your second problem, but it already 
implements clustering and may allow you to get to an answer for the 
second problem quicker.  I have, just recently, started using it for a 
clustering task I am working on related to search results.  I think the 
author of Carrot is on the user list from time to time.

Marvin Humphrey wrote:
> Greets,
>
> I'm toying with the idea of implementing clustering of search results 
> based on comparison of document vectors constrained by field.  For 
> instance, you could cluster based on "topic", or "domain", or 
> "content".  "domain" would be easy, as it's presumably a single value 
> field.  "content" would be much more involved.
>
> The problem I'm trying to solve is how to return a minimum number of 
> clusters from a search.  Say the most relevant 100 documents for a 
> query are all from the same domain, but you want a maximum of two 
> results per domain, a la Google.  I don't see any alternative to 
> rerunning the query an indeterminate number of times until you've 
> accumulated sufficient clusters, because the search logic doesn't know 
> what cluster a document belongs to until the document vector is 
> retrieved.
>
> Is there a better way?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/