Chris Hostetter | 1 Jul 02:40 2009

Re: Order of fields within a Document in Lucene 2.4+


Hmmm... i'm not an expert on the internals of indexing, and i don't use 
FieldSelectors much, but this seems like a pretty big bug to me ... or at 
the very least: a change in behavior that completely eliminates the value 
of LOAD_AND_BREAK.

https://issues.apache.org/jira/browse/LUCENE-1727

: The Lucene FAQ says...
:  
: What is the order of fields returned by Document.fields()?
: * Fields are returned in the same order they were added to the document. 
: (now getFields() as fields is deprecated)
:  
: However I think this may no longer be the case in 2.4 
:  
: We are indexing documents in a specific order so that we can LOAD_AND_BREAK
: out of our FieldSelector as early as possible.
: i.e. we have typically 50 indexed fields for a document, but when we are
: loading results with .doc(), we know we only need 4 of them.
:
: So, our code ensures that these are added to the index first - and once the
: 4th field is loaded we break out of the selector.
:
: This speeds us up by an order of magnitude.
:
: However, we are finding that our field selector is processing fields in
: alphabetical order, not order of addition.  This means that we'd have to
: rename our fields to 'aaa..' in order to guarantee they'd be

Mark Miller | 1 Jul 03:20 2009

Re: Order of fields within a Document in Lucene 2.4+

Yeah, I've heard rumblings about this issue before. I can't remember what
patch changed it though - one of Mike M's I think?

On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter
<hossman_lucene <at> fucit.org>wrote:

>
> Hmmm... i'm not an expert on the internals of indexing, and i don't use
> FieldSelectors much, but this seems like a pretty big bug to me ... or at
> the very least: a change in behavior that completely eliminates the value
> of LOAD_AND_BREAK.
>
> https://issues.apache.org/jira/browse/LUCENE-1727
>
>
>
> : The Lucene FAQ says...
> :
> : What is the order of fields returned by Document.fields()?
> : * Fields are returned in the same order they were added to the document.
> : (now getFields() as fields is deprecated)
> :
> : However I think this may no longer be the case in 2.4
> :
> : We are indexing documents in a specific order so that we can
> LOAD_AND_BREAK out of our FieldSelector as early as possible.
> : i.e. we have typically 50 indexed fields for a document, but when we are
> loading results with .doc(), we know we only need 4 of them.
> :
> : So, our code ensures that these are added to the index first - and once

Harsha1 | 1 Jul 07:23 2009

Stop Word list File processing


Hi,
I have a string from which I need to filter out some words (say, stop words).
But I want to use WhitespaceAnalyzer, so I have created a custom analyzer
with the capabilities of WhitespaceAnalyzer plus filtering of unwanted words.
Since the String array of stop words keeps growing, I would like to put it
into a file and process that file.

How should I go about this?

-- 
View this message in context: http://www.nabble.com/Stop-Word-list-File-processing-tp24284338p24284338.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Ganesh | 1 Jul 07:39 2009

Re: Term Frequency vector consumes memory

Thanks for your reply.

My requirement is to fetch the list of the most frequent terms indexed in a
day. I used the approach described in the article linked below:
http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index

I enabled term vectors for a field and indexed the content, and I am able to
retrieve the list of top indexed terms for a day / date range.

When an IndexReader / Searcher is opened, will it load all the term vector
frequencies?

Suppose I have enabled this option and indexed, say, 5GB. Now I don't want
the Reader / Searcher to load term vectors, so I want to switch this feature
off. Is that possible without re-indexing?

Regards
Ganesh

----- Original Message ----- 
From: "Grant Ingersoll" <gsingers <at> apache.org>
To: <java-user <at> lucene.apache.org>
Sent: Tuesday, June 30, 2009 9:48 PM
Subject: Re: Term Frequency vector consumes memory

> In Lucene, a Term Vector is a specific thing that is stored on disk  
> when creating a Document and Field.  It is optional and off by  
> default.  It is separate from being able to get the term frequencies  
> for all the docs in a specific field.  The former is decided at  
> indexing time and there is no way to remove it w/o reindexing.   
> Furthermore, it is not loaded into memory by the IndexReader.  Term  

Shayak Sen | 1 Jul 08:29 2009

Re: Stop Word list File processing

Hi,
When you write the custom filter for removing stop words, load the stop
word list from the file in the filter's constructor, then use the list the
same way you used the array before.
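A minimal sketch of the suggestion above, assuming a plain text file with one stop word per line. The file layout, the '#' comment convention and the class name StopWordFile are assumptions for illustration, not anything from this thread. It only uses the JDK; the resulting Set is what you would then hand to your custom filter or analyzer in its constructor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: loads stop words from a text file into a Set.
final class StopWordFile {

    // Reads one stop word per line; skips blank lines and lines starting with '#'.
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file so the example runs on its own.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("# sample list", "the", "", " and ", "of"),
                StandardCharsets.UTF_8);
        System.out.println(load(file));
        Files.delete(file);
    }
}
```

Matching against this set is case-sensitive, which lines up with a whitespace tokenizer that does not lowercase; if that is not what you want, lowercase both the file entries and the tokens.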

On Wed, Jul 1, 2009 at 1:23 PM, Harsha1<99harsha.h.n99 <at> gmail.com> wrote:
>
> Hi,
> I have a string through i need to filter off some of words (say stop words).
> But I want to use WhiteSpaceAnalyser. So I have created a custom analyser
> with capability of whitespaceAnalyser and filtering unwanted words.
> Since the String array of Stop words is increasing, i would like put this
> into a file and process this file.
>
> How to go about this?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe <at> lucene.apache.org
> For additional commands, e-mail: java-user-help <at> lucene.apache.org
>
>
Amin Mohammed-Coleman | 1 Jul 09:57 2009

IndexWriter

Hi

This question has probably been asked before so apologies for asking it
again.  Just to confirm: is it OK to use a single IndexWriter in a web
application and only close that single instance on application shutdown?  As
IndexWriter is thread safe there is no need for any external
synchronisation.  Am I correct in thinking this?

I have configured via Spring a single IndexWriter which is used in the
application and I use the same one for adding and updating documents.  This
IndexWriter is closed when the Spring application context shuts down.

Cheers
Amin
Michael McCandless | 1 Jul 11:22 2009

Re: Order of fields within a Document in Lucene 2.4+

Sorry, yes, this was my fault with the indexing speedups in 2.3
(LUCENE-843): as of 2.3, if any fields have term vectors enabled, the
fields are sorted lexicographically.  As of 2.4 (LUCENE-1301,
refactoring the indexing core), that sort happens even without term
vectors.

Hoss, I see you've opened an issue for this (LUCENE-1727);
I'll take that & fix for 2.9.

Sorry,

Mike
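To make the effect concrete, here is a small plain-Java simulation. It uses no Lucene classes, and the field names and the fieldsVisited helper are made up for illustration. It counts how many stored fields a selector has to walk through before it has seen the last wanted field and could return LOAD_AND_BREAK, first in insertion order and then after a lexicographic sort, using the 50 fields / 4 wanted fields from the original report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation only; this is not Lucene's FieldSelector API.
final class FieldOrderDemo {

    // Counts the fields visited up to and including the last wanted field,
    // i.e. the point where a selector could break out of loading.
    static int fieldsVisited(List<String> storedOrder, Set<String> wanted) {
        int remaining = wanted.size();
        int visited = 0;
        for (String name : storedOrder) {
            visited++;
            if (wanted.contains(name) && --remaining == 0) {
                break; // corresponds to returning LOAD_AND_BREAK
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical names: the 4 wanted fields are added first, then 46 others.
        List<String> added =
                new ArrayList<>(Arrays.asList("title", "url", "summary", "score"));
        for (int i = 0; i < 46; i++) {
            added.add("extra" + i);
        }
        Set<String> wanted =
                new HashSet<>(Arrays.asList("title", "url", "summary", "score"));

        // Same fields after a lexicographic sort by name.
        List<String> sorted = new ArrayList<>(added);
        Collections.sort(sorted);

        System.out.println("insertion order: " + fieldsVisited(added, wanted));
        System.out.println("sorted order:    " + fieldsVisited(sorted, wanted));
    }
}
```

With these made-up names the first count is 4 and the second is 50, because all four wanted names happen to sort after the "extra" fields. That is the worst case for breaking out early, and it is why renaming fields to sort first was the only workaround available to the original poster.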

On Tue, Jun 30, 2009 at 9:20 PM, Mark Miller<markrmiller <at> gmail.com> wrote:
> Yeah, I've heard rumblings about this issue before. I can't remember what
> patch changed it though - one of Mike M's I think?
>
> On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter
> <hossman_lucene <at> fucit.org>wrote:
>
>>
>> Hmmm... i'm not an expert on the internals of indexing, and i don't use
>> FieldSelectors much, but this seems like a pretty big bug to me ... or at
>> the very least: a change in behavior that completely eliminates the value
>> of LOAD_AND_BREAK.
>>
>> https://issues.apache.org/jira/browse/LUCENE-1727
>>
>>
>>

Ganesh | 1 Jul 12:39 2009

Re: IndexWriter

Yes. A single IndexWriter can be maintained in an app and closed when the app shuts down.

Regards
Ganesh

----- Original Message ----- 
From: "Amin Mohammed-Coleman" <aminmc <at> gmail.com>
To: <java-user <at> lucene.apache.org>
Sent: Wednesday, July 01, 2009 1:27 PM
Subject: IndexWriter

> Hi
> 
> This question has probably been asked before so apologies for asking it
> again.  Just to confirm that it is ok to use a single index writer in a web
> application and only close that single instance on application shutdown?  As
> the indexwriter is thread safe there is no need for any external
> synchronisation.  Am I correct in thinking this?
> 
> I have configured via spring a single index writer which is used in the
> application and I use the same one for adding and updating documents.  This
> index writer is closed when the spring application context shutsdown.
> 
> 
> Cheers
> Amin
>
Simon Willnauer | 1 Jul 12:47 2009

Re: IndexWriter

You might want to take care of the write.lock file in the index
directory if your application breaks down. If you do not close the
writer and restart your app you might get a LockObtainFailedException.

simon
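A rough sketch of the pattern discussed in this thread: one writer instance shared by several threads and closed exactly once at shutdown. FakeWriter is a hypothetical stand-in so the example runs with only the JDK; it is not Lucene's IndexWriter, and in a real application the close would live in whatever shutdown callback the container provides.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedWriterDemo {

    // Hypothetical stand-in for a thread-safe index writer.
    static final class FakeWriter implements Closeable {
        private final AtomicInteger docs = new AtomicInteger();
        private final AtomicBoolean closed = new AtomicBoolean();

        void addDocument(String doc) {
            if (closed.get()) {
                throw new IllegalStateException("writer is closed");
            }
            docs.incrementAndGet();
        }

        int docCount() { return docs.get(); }

        boolean isClosed() { return closed.get(); }

        @Override
        public void close() { closed.set(true); } // safe to call more than once
    }

    // Several threads add documents through the one shared writer;
    // the writer is closed once, after all of them have finished.
    static FakeWriter run(int threads, int docsPerThread) throws InterruptedException {
        FakeWriter writer = new FakeWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    for (int i = 0; i < docsPerThread; i++) {
                        writer.addDocument("doc"); // no external synchronisation
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } finally {
            writer.close(); // the single close at "application shutdown"
        }
        return writer;
    }

    public static void main(String[] args) throws InterruptedException {
        FakeWriter writer = run(4, 1000);
        System.out.println(writer.docCount() + " docs, closed=" + writer.isClosed());
    }
}
```

Closing in a finally block (or in the container's destroy callback) is what gives the writer the chance to release its lock on a normal shutdown; a hard crash can still leave write.lock behind, as noted above.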

On Wed, Jul 1, 2009 at 12:39 PM, Ganesh<emailgane <at> yahoo.co.in> wrote:
> Yes. Single IndexWriter could be maintained in a App and it could be closed when the App is shutdown.
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Amin Mohammed-Coleman" <aminmc <at> gmail.com>
> To: <java-user <at> lucene.apache.org>
> Sent: Wednesday, July 01, 2009 1:27 PM
> Subject: IndexWriter
>
>
>> Hi
>>
>> This question has probably been asked before so apologies for asking it
>> again.  Just to confirm that it is ok to use a single index writer in a web
>> application and only close that single instance on application shutdown?  As
>> the indexwriter is thread safe there is no need for any external
>> synchronisation.  Am I correct in thinking this?
>>
>> I have configured via spring a single index writer which is used in the
>> application and I use the same one for adding and updating documents.  This
>> index writer is closed when the spring application context shutsdown.

Toke Eskildsen | 1 Jul 13:31 2009

Re: Scaling out/up or a mix

On Tue, 2009-06-30 at 22:59 +0200, Marcus Herou wrote:
> The number of concurrent users today is insignificant but once we push
> for the service we will get into trouble... I know that since even one
> simple faceting query (which we will use to display trend graphs) can
> take forever (talking about SOLR btw).

Ah, faceting. That could very well be the defining requirement for your
selection of hardware. As far as I remember, Solr supports two different
ways of faceting (depending on whether there are few or many tags in a
facet), where at least one of them uses counters corresponding to the
number of documents in the index. That scheme is similar to the approach
we're taking and in our experience this quickly moves the bottleneck to
RAM access speed. Now, I'm not at all a Solr expert so they might have
done something clever in that area; I'd recommend that you also state
your question on the Solr mailing list and mention what kind of faceting
you'll be performing.

> "Normal" Lucene queries (title:blah OR description:blah) timing is
> reasonable for the current hardware but not good (Currently 8 machines
> 2GB RAM each serving 130G index). It takes less than 10 secs at all
> times which of course is very bad user experience.

In your first post you stated "I currently have an index which is 16GB
per machine (8 machines = 128GB)" so it's a bit confusing to me what you
have?

> Example of a public query (no sorting on publisheddate but rather on
> relevance = faster):
> http://blogsearch.tailsweep.com/search.do?wa=test&la=all


