Ganesh | 1 Mar 09:56 2010

Re: If you could have one feature in Lucene...

Replication support, like in Solr.

Regards
Ganesh

----- Original Message ----- 
From: "Grant Ingersoll" <gsingers <at> apache.org>
To: <java-user <at> lucene.apache.org>
Sent: Wednesday, February 24, 2010 7:12 PM
Subject: If you could have one feature in Lucene...

> What would it be?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe <at> lucene.apache.org
> For additional commands, e-mail: java-user-help <at> lucene.apache.org
>
Koji Sekiguchi | 1 Mar 15:42 2010

Re: Highlighting large documents (Lucene 3.0.0)

-Arne- wrote:
> Hi,
>
> I'm using Lucene 3.0.0 and have large documents to search (logfiles
> 0.5-20 MB). For better search results the query tokens are truncated left and
> right: a search for "user" becomes "*user*". The performance of searching,
> even for complex queries with more than one search term, is quite good. But
> highlighting the search results takes quite a while. I have tried the default
> Highlighter, which doesn't seem to be fast enough, and the
> FastVectorHighlighter, which seems to be fast enough but doesn't return
> fragments for truncated queries; for non-truncated queries I do get fragments.
> Could anybody please tell me what is the best way to highlight large
> documents and, if the FastVectorHighlighter is the solution for faster
> highlighting, how to highlight truncated search queries?
>
> Thanks in advance,
> -Arne-
>   
I'm not sure this is the best way, but could you index and search
the highlighting field with NGram? Since FVH supports
highlighting NGram fields, you can query for "user" as just "user"
(rather than "*user*") and highlight the NGram field.

Koji

-- 
http://www.rondhuit.com/en/
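
To illustrate the idea, a minimal sketch assuming Lucene 3.0's contrib
NGramTokenizer (the class and sample text here are illustrative, not from the
thread): a fixed 2-gram tokenizer splits a plain query term into overlapping
bigrams, so a phrase match against the NGram field replaces the
leading/trailing wildcards.

import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class BigramDemo {
  public static void main(String[] args) throws Exception {
    // "user" -> "us", "se", "er"; any text containing "user" (e.g. "username")
    // yields the same consecutive bigrams, so no wildcards are needed.
    NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("user"), 2, 2);
    TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
    while (tokenizer.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
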
-Arne- | 1 Mar 16:34 2010

Re: Highlighting large documents (Lucene 3.0.0)


Hi Koji,
thanks for your answer. Can you help me once again? What exactly am I
supposed to do?

Koji Sekiguchi-2 wrote:
> 
> -Arne- wrote:
>> Hi,
>>
>> I'm using Lucene 3.0.0 and have large documents to search (logfiles
>> 0.5-20 MB). For better search results the query tokens are truncated left
>> and right: a search for "user" becomes "*user*". The performance of
>> searching, even for complex queries with more than one search term, is
>> quite good. But highlighting the search results takes quite a while. I have
>> tried the default Highlighter, which doesn't seem to be fast enough, and
>> the FastVectorHighlighter, which seems to be fast enough but doesn't return
>> fragments for truncated queries; for non-truncated queries I do get
>> fragments. Could anybody please tell me what is the best way to highlight
>> large documents and, if the FastVectorHighlighter is the solution for
>> faster highlighting, how to highlight truncated search queries?
>>
>> Thanks in advance,
>> -Arne-
>>   
> I'm not sure this is the best way, but can you index and search
> the highlighting field with NGram? Since FVH supports
(Continue reading)

Koji Sekiguchi | 1 Mar 17:17 2010

Re: Highlighting large documents (Lucene 3.0.0)

-Arne- wrote:
> Hi Koji,
> thanks for your answer. Can you help me once again? What exactly am I
> supposed to do?
>
>   
Here is the concrete program I have in mind:

public class TestHighlightTruncatedSearchQuery {

  static Directory dir = new RAMDirectory();
  static Analyzer analyzer = new BiGramAnalyzer();
  static final String[] DOCS = {
    "import org.apache.lucene.analysis.Analyzer;",
    "import org.apache.lucene.analysis.TokenStream;",
    "import org.apache.lucene.analysis.ngram.NGramTokenizer;",
    "import org.apache.lucene.index.IndexWriter;",
    "import org.apache.lucene.index.IndexWriter.MaxFieldLength;",
    "import org.apache.lucene.store.Directory;",
    "import org.apache.lucene.store.RAMDirectory;"
  };
  static final String F = "f";

  public static void main(String[] args) throws Exception {
    makeIndex();
    searchIndex();
  }

  static void makeIndex() throws IOException {
    IndexWriter writer = new IndexWriter( dir, analyzer, true, 
(Continue reading)
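
Koji's program is cut off above; below is a sketch of how the remaining
pieces of such a program might look. The BiGramAnalyzer and the makeIndex()
and searchIndex() bodies are assumptions, not his actual code, written
against the Lucene 3.0 APIs (FastVectorHighlighter, NGramTokenizer).

  // Sketch only: additional imports assumed beyond those shown, e.g.
  // java.io.*, org.apache.lucene.document.*,
  // org.apache.lucene.queryParser.QueryParser, org.apache.lucene.search.*,
  // org.apache.lucene.search.vectorhighlight.*, org.apache.lucene.util.Version.

  static class BiGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new NGramTokenizer(reader, 2, 2);  // fixed 2-grams
    }
  }

  static void makeIndex() throws IOException {
    IndexWriter writer = new IndexWriter(dir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (String d : DOCS) {
      Document doc = new Document();
      // FVH needs term vectors with positions and offsets on the highlighted field.
      doc.add(new Field(F, d, Field.Store.YES, Field.Index.ANALYZED,
          Field.TermVector.WITH_POSITIONS_OFFSETS));
      writer.addDocument(doc);
    }
    writer.close();
  }

  static void searchIndex() throws Exception {
    // The bigram analyzer turns the plain term "lucene" into the phrase
    // "lu uc ce en ne", so no leading/trailing wildcards are required.
    QueryParser parser = new QueryParser(Version.LUCENE_30, F, analyzer);
    Query query = parser.parse("lucene");

    IndexSearcher searcher = new IndexSearcher(dir, true);
    TopDocs docs = searcher.search(query, 10);

    FastVectorHighlighter fvh = new FastVectorHighlighter();
    FieldQuery fieldQuery = fvh.getFieldQuery(query);
    for (ScoreDoc sd : docs.scoreDocs) {
      System.out.println(sd.doc + " : "
          + fvh.getBestFragment(fieldQuery, searcher.getIndexReader(), sd.doc, F, 100));
    }
    searcher.close();
  }
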

Mark Ferguson | 1 Mar 21:35 2010

Reverse Search

Hello,

I am trying to figure out the best search strategy for my situation and am
looking for advice. I will be processing short bits of text (Tweets for
example), and need to search them to see if they contain certain terms. The list of
terms is a set of locations (towns, cities) and is quite long, approximately
500 different entries, and terms can contain spaces.

My typical approach would be to index each Tweet and then search the
resulting document index for each search term. However, I'm not sure this is
the best solution in this situation for two reasons: first, the list of
locations is quite long so we are talking about a large number of queries,
which may grow even larger so I see scalability issues. Second, my Tweet
index is not stable as I am just interested in each Tweet as it comes in,
and can discard it after, so I have no need really to index each entry. It
is actually my list of locations which is stable and searchable.

My thought is to do some kind of reverse search, in which I index the
locations, and then I pass each Tweet to that index as my query. I am not
exactly sure how to go about this though in a way that will do the search in
the way I want. I am also concerned about locations that contain spaces and
how to have these recognised.

As an example, if my locations list is as follows: {"New York", "Chicago",
"Los Angeles"} and my text is the following: "Fire burning in Los Angeles",
I would like to be able to send that _text_ as a query to my indexed location
list, and get a hit.

Is this something that is doable, or does someone envision a different
approach to the problem? Thanks for your time.
(Continue reading)

Digy | 1 Mar 22:00 2010

RE: Reverse Search

MoreLikeThis in contrib may help.

DIGY
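
A rough sketch of how that might be wired up, assuming Lucene 3.0's contrib
MoreLikeThis run against an index of location names; the "name" field, index
path and tuning values below are illustrative, not from the thread.

import java.io.File;
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LikeThisTweet {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("locations-index"));  // illustrative path
    IndexReader reader = IndexReader.open(dir, true);
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_30));
    mlt.setFieldNames(new String[] { "name" });
    mlt.setMinTermFreq(1);  // tweets and location names are tiny
    mlt.setMinDocFreq(1);

    // Build a query from the tweet's terms and run it against the location index.
    Query query = mlt.like(new StringReader("Fire burning in Los Angeles"));
    ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
    for (ScoreDoc hit : hits) {
      System.out.println(searcher.doc(hit.doc).get("name"));
    }
    searcher.close();
    reader.close();
  }
}
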

-----Original Message-----
From: Mark Ferguson [mailto:mark.a.ferguson <at> gmail.com] 
Sent: Monday, March 01, 2010 10:35 PM
To: java-user <at> lucene.apache.org
Subject: Reverse Search

Hello,

I am trying to figure out the best search strategy for my situation and am
looking for advice. I will be processing short bits of text (Tweets for
example), and need to search them to see if they contain certain terms. The list of
terms is a set of locations (towns, cities) and is quite long, approximately
500 different entries, and terms can contain spaces.

My typical approach would be to index each Tweet and then search the
resulting document index for each search term. However, I'm not sure this is
the best solution in this situation for two reasons: first, the list of
locations is quite long so we are talking about a large number of queries,
which may grow even larger so I see scalability issues. Second, my Tweet
index is not stable as I am just interested in each Tweet as it comes in,
and can discard it after, so I have no need really to index each entry. It
is actually my list of locations which is stable and searchable.

My thought is to do some kind of reverse search, in which I index the
locations, and then I pass each Tweet to that index as my query. I am not
exactly sure how to go about this though in a way that will do the search in
(Continue reading)

Steven A Rowe | 1 Mar 22:01 2010

RE: Reverse Search

Hi Mark,

On 03/01/2010 at 3:35 PM, Mark Ferguson wrote:
> I will be processing short bits of text (Tweets for example), and
> need to search them to see if they contain certain terms.

You might consider, instead of performing reverse search, just querying all of your locations against one
document at a time using Lucene's MemoryIndex, which is very fast: 

<http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html>

If you decide to go the reverse search route, Lucene's InstantiatedIndex is also very fast, and unlike
MemoryIndex, can handle more than one document at a time:

<http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/store/instantiated/package-summary.html>

Steve
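
A minimal sketch of the MemoryIndex approach, assuming Lucene 3.0 with the
contrib lucene-memory jar; the "text" field name and the hard-coded locations
are illustrative. Parsing each location as a phrase query keeps multi-word
names like "Los Angeles" intact.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class TweetLocationMatcher {
  static final String[] LOCATIONS = { "New York", "Chicago", "Los Angeles" };

  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    // Parse each location once as a phrase query; reuse the queries for every tweet.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
    Query[] queries = new Query[LOCATIONS.length];
    for (int i = 0; i < LOCATIONS.length; i++) {
      queries[i] = parser.parse("\"" + LOCATIONS[i] + "\"");
    }

    String tweet = "Fire burning in Los Angeles";
    // Throwaway in-memory index holding just this one tweet.
    MemoryIndex index = new MemoryIndex();
    index.addField("text", tweet, analyzer);
    for (int i = 0; i < LOCATIONS.length; i++) {
      if (index.search(queries[i]) > 0.0f) {
        System.out.println("Tweet mentions: " + LOCATIONS[i]);
      }
    }
  }
}
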
Mark Ferguson | 1 Mar 22:14 2010

Re: Reverse Search

Hi Steve,

MemoryIndex appears to be exactly what I'm looking for. Thank you!

Mark

On Mon, Mar 1, 2010 at 2:01 PM, Steven A Rowe <sarowe <at> syr.edu> wrote:

> Hi Mark,
>
> On 03/01/2010 at 3:35 PM, Mark Ferguson wrote:
> > I will be processing short bits of text (Tweets for example), and
> > need to search them to see if they contain certain terms.
>
> You might consider, instead of performing reverse search, just querying all
> of your locations against one document at a time using Lucene's MemoryIndex,
> which is very fast:
>
> <
> http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html
> >
>
> If you decide to go the reverse search route, Lucene's InstantiatedIndex is
> also very fast, and unlike MemoryIndex, can handle more than one document at
> a time:
>
> <
> http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/store/instantiated/package-summary.html
> >
>
(Continue reading)

ajay_gupta | 2 Mar 14:27 2010

Lucene Indexing out of memory


Hi,
It might be a general question, but I couldn't find the answer yet. I
have around 90k documents totaling around 350 MB. Each document contains a
record which has some text content. For each word in this text I want to
store and index that word's context, so I read each document and, for each
word in it, append a fixed number of surrounding words. To do that I first
search the existing index to see if the word already exists; if it does, I
get the content, append the new context and update the document. If no
context exists yet, I create a document with the fields "word" and "context",
using the word and its context as the field values.

I tried this in RAM, but after a certain number of docs it gave an
out-of-memory error, so I switched to the FSDirectory approach; surprisingly,
after 70k documents it also gave an OOM error. I have enough disk space but
am still getting this error, and I am not sure why disk-based indexing is
giving it. I thought disk-based indexing would be slower but at least scalable.
Could someone suggest what the issue could be?

Thanks
Ajay
Erick Erickson | 2 Mar 14:39 2010

Re: Lucene Indexing out of memory

I'm not following this entirely, but these docs may be huge by the
time you add context for every word in them. You say that you
"search the existing indices then I get the content and append....".
So is it possible that after 70K documents your additions become
so huge that you're blowing up? Have you taken any measurements
to determine how big the docs get as you index more and more
of them?

If the above is off base, have you tried setting
IndexWriter.setRAMBufferSizeMB?

HTH
Erick
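
For reference, a minimal sketch of that setting, assuming Lucene 3.0; the
64 MB figure, index path and field names are only illustrative.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ContextIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("context-index")),  // illustrative path
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    // Flush buffered documents to disk once they reach ~64 MB instead of
    // letting them accumulate in RAM.
    writer.setRAMBufferSizeMB(64.0);

    Document doc = new Document();
    doc.add(new Field("word", "user", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("context", "login failed for user admin",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    writer.close();
  }
}
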

On Tue, Mar 2, 2010 at 8:27 AM, ajay_gupta <ajay978 <at> gmail.com> wrote:

>
> Hi,
> It might be a general question, but I couldn't find the answer yet. I
> have around 90k documents totaling around 350 MB. Each document contains a
> record which has some text content. For each word in this text I want to
> store and index that word's context, so I read each document and, for each
> word in it, append a fixed number of surrounding words. To do that I first
> search the existing index to see if the word already exists; if it does, I
> get the content, append the new context and update the document. If no
> context exists yet, I create a document with the fields "word" and "context",
> using the word and its context as the field values.
>
> I tried this in RAM, but after a certain number of docs it gave an
> out-of-memory
(Continue reading)

