Berlin Brown | 1 Sep 04:52 2007

Re: Indexing in pieces?

So I am assuming it is not just a matter of "indexing" to the same
directory you "indexed" to before.

So, based on what you are saying, you would have to reload the
previous index (e.g., INDEX_DIR_OLD) and then index the new content.
When I say "index", I mean actually invoking Lucene to
merge the content.

For example, it isn't just a matter of indexing to index_dir_old and
then to index_dir_new and then copying the Lucene index files into
another directory, index_dir_cur.

On 8/31/07, Chris Lu <chris.lu <at> gmail.com> wrote:
> I think you can simply change your SQL to select only the recently updated
> messages, and add to your existing index. Although adding to an existing
> large index also takes a long time, it should be quicker than re-building
> the whole index.
>
> If your index continues to grow, you may need to have a dedicated server for
> indexing and searching.
>
> --
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>

Chris Lu | 1 Sep 05:27 2007

Re: Indexing in pieces?

Generally you are right, except that you don't exactly copy the old index: you
first index the new content in a new directory, then merge the old index with
the new one.

There is a quicker index-merging method, IndexWriter.addIndexesNoOptimize(),
but in general merging indexes is slow, although quicker than re-indexing.
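A minimal sketch of that approach, assuming Lucene's string-path IndexWriter
constructor; the directory names and analyzer are placeholders, not anything
specific to your setup:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeNewIndex {
    public static void main(String[] args) throws Exception {
        // Assume the new content has already been indexed into index_dir_new.
        Directory newDir = FSDirectory.getDirectory("index_dir_new", false);

        // Open the existing index for appending (create = false), then merge.
        IndexWriter writer = new IndexWriter("index_dir_old",
                new StandardAnalyzer(), false);
        writer.addIndexesNoOptimize(new Directory[] { newDir });
        writer.close();
    }
}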

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 8/31/07, Berlin Brown <berlin.brown <at> gmail.com> wrote:
>
> So I am assuming it is not just a matter of "indexing" to the same
> directory you "indexed" to before.
>
> So, based on what you are saying, you would have to reload the
> previous index (e.g., INDEX_DIR_OLD) and then index the new content.
> When I say "index", I mean actually invoking Lucene to
> merge the content.
>
> For example, it isn't just a matter of indexing to index_dir_old and
> then to index_dir_new and then copying the Lucene index files into
> another directory, index_dir_cur.
>
> On 8/31/07, Chris Lu <chris.lu <at> gmail.com> wrote:

Samir Abdou | 1 Sep 12:13 2007

Re: Is there a Term ID for each distinctive term indexed in Lucene?

Hi,

You should extend the SegmentReader (or IndexReader) class to implement the
following method:

public long termID(Term t) throws IOException {
    return tis.getPosition(t);
}

which will give you a means to get the ID of a given term. This ID is simply
the position of that term within the ".ti?" file.
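Since tis and getPosition() are Lucene internals, subclassing is required for
the above. If that is inconvenient, a rough equivalent ordinal can be computed
from the public API by walking the term dictionary in sorted order; this is
only an illustrative sketch (class and method names are mine) and will be slow
on a large index:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermOrdinal {
    // Returns the ordinal position of the term in the sorted term dictionary,
    // or -1 if the term is not present in the index.
    public static long termId(IndexReader reader, Term target) throws IOException {
        TermEnum terms = reader.terms();
        long ordinal = 0;
        try {
            while (terms.next()) {
                if (terms.term().equals(target)) {
                    return ordinal;
                }
                ordinal++;
            }
        } finally {
            terms.close();
        }
        return -1;
    }
}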

Hope this will help

Regards,

Samir

2007/9/1, Tao Cheng <tcheng3 <at> gmail.com>:
>
> Hi all,
>
> I found that instead of storing a term ID for a term in the index, Lucene
> stores the actual term string value. I am wondering if there ever is such a
> "term ID" for each distinctive term indexed in Lucene, similar to the "doc ID"
> for each distinctive document indexed in Lucene.
>
> In other words, I am looking for a method like " int termId(string term)"

tom | 1 Sep 12:14 2007

Re: Re: Is there a Term ID for each distinctive term indexed in Lucene?

Tom Roberts is out of the office until 3rd September 2007 and will get back to you on his return.

http://www.luxonline.org.uk
http://www.lux.org.uk
tom | 1 Sep 12:15 2007

Re: Re: Re: Is there a Term ID for each distinctive term indexed in Lucene?

Tom Roberts is out of the office until 3rd September 2007 and will get back to you on his return.

http://www.luxonline.org.uk
http://www.lux.org.uk
Erick Erickson | 1 Sep 21:06 2007

Re: OutOfMemoryError tokenizing a boring text file

I can't answer the question of why the same token
takes up memory, but I've indexed far more than
20M of data in a single document field. As in on the
order of 150M. Of course I allocated 1G or so to the
JVM, so you might try that....

Best
Erick

On 8/31/07, Per Lindberg <per <at> implior.com> wrote:
>
> I'm creating a tokenized "content" Field from a plain text file
> using an InputStreamReader and new Field("content", in);
>
> The text file is large, 20 MB, and contains zillions of lines,
> each with the same 100-character token.
>
> That causes an OutOfMemoryError.
>
> Given that all tokens are the *same*,
> why should this cause an OutOfMemoryError?
> Shouldn't StandardAnalyzer just chug along
> and just note "ho hum, this token is the same"?
> That shouldn't take too much memory.
>
> Or have I missed something?
>
>
>
>

Askar Zaidi | 1 Sep 21:50 2007

Re: OutOfMemoryError tokenizing a boring text file

I have indexed around 100 MB of data with 512 MB given to the JVM heap. So
that gives you an idea. If every token is the same word in one file,
shouldn't the tokenizer recognize that?

Try using Luke. It helps solve lots of issues.

-
AZ

On 9/1/07, Erick Erickson <erickerickson <at> gmail.com> wrote:
>
> I can't answer the question of why the same token
> takes up memory, but I've indexed far more than
> 20M of data in a single document field. As in on the
> order of 150M. Of course I allocated 1G or so to the
> JVM, so you might try that....
>
> Best
> Erick
>
> On 8/31/07, Per Lindberg <per <at> implior.com> wrote:
> >
> > I'm creating a tokenized "content" Field from a plain text file
> > using an InputStreamReader and new Field("content", in);
> >
> > The text file is large, 20 MB, and contains zillions of lines,
> > each with the same 100-character token.
> >
> > That causes an OutOfMemoryError.
> >

Karl Wettin | 1 Sep 22:00 2007

Re: OutOfMemoryError tokenizing a boring text file

I believe the problem is that the text value is not the only data
associated with a token; there is, for instance, the position offset.
Depending on your JVM, each instance reference consumes 64 bits or so,
so even if the text value is flyweighted by String.intern() there is
a cost. I doubt that a document is flushed to the segment before a
field's token stream has been exhausted.

-- 
karl

On 1 Sep 2007, at 21.50, Askar Zaidi wrote:

> I have indexed around 100 MB of data with 512 MB given to the JVM heap. So
> that gives you an idea. If every token is the same word in one file,
> shouldn't the tokenizer recognize that?
>
> Try using Luke. It helps solve lots of issues.
>
> -
> AZ
>
> On 9/1/07, Erick Erickson <erickerickson <at> gmail.com> wrote:
>>
>> I can't answer the question of why the same token
>> takes up memory, but I've indexed far more than
>> 20M of data in a single document field. As in on the
>> order of 150M. Of course I allocated 1G or so to the
>> JVM, so you might try that....

Per Lindberg | 3 Sep 15:04 2007

SV: OutOfMemoryError tokenizing a boring text file

Aha, that's interesting. However...

Setting writer.setMaxFieldLength(5000) (the default is 10000)
seems to eliminate the risk of an OutOfMemoryError,
even with a JVM given only 64 MB of max memory.
(I have tried larger values for JVM max memory, too.)

(The name is IMHO slightly misleading; I would have
called it setMaxFieldTerms or something like that.)

Still, 64 bits x 10000 = 78 KB. I can't see why that should
eat up 64 MB, unless the 100-character tokens are also multiplied.

(The 20 MB text file contains roughly 200 000 copies of the
same 100 char string).

To me, it appears that simply calling

   new Field("content", new InputStreamReader(in, "ISO-8859-1"))

on a plain text file causes Lucene to buffer it *all*.
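A small sketch of that workaround, assuming the string-path IndexWriter
constructor; the index path, file name, charset, and field name here are
purely illustrative:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CappedIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index_dir", new StandardAnalyzer(), true);
        writer.setMaxFieldLength(5000);   // stop indexing a field after 5000 terms

        Document doc = new Document();
        InputStreamReader in = new InputStreamReader(
                new FileInputStream("big-file.txt"), "ISO-8859-1");
        doc.add(new Field("content", in)); // tokenized from the Reader, not stored
        writer.addDocument(doc);

        writer.close();
    }
}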

> -----Original message-----
> From: Karl Wettin [mailto:karl.wettin <at> gmail.com]
> Sent: 1 September 2007 22:00
> To: java-user <at> lucene.apache.org
> Subject: Re: OutOfMemoryError tokenizing a boring text file
>
> I believe the problem is that the text value is not the only data
> associated with a token; there is, for instance, the position offset.

Guilherme Barile | 3 Sep 15:46 2007

JdbcDirectory

Hello,
	We're starting a new project that basically catalogs everything we
have in the department (different objects with different metadata),
and since I've used Lucene before, I'm preparing a presentation for the
team, as I think it would really simplify the storage of metadata and
documents.
	The system will be pretty straightforward: all items will be
cataloged, and most of them won't be changed too much (I'll raise
this question later).
	So, here are my main concerns; hope you can help.

1) Storing all data (index and content) wasn't recommended in the
past, as the index could become corrupted. Do I have this problem if
I use a JdbcDirectory (PostgreSQL backend)? I have already read about the
performance degradation when using a database as the main storage, but
this won't be a problem.

2) Lucene doesn't support incremental editing (a new Document will be
created when someone edits an item), so is it possible to manage some
kind of versioning? Has anyone ever implemented something this way?

Thanks a lot for your attention.

Guilherme Barile
Prosoma Informática
