Marvin Humphrey | 1 Nov 2005 01:31

Re: bytecount as String and prefix length

I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck.  Still 20% slower than the current implementation.   
The algo is below, for reference.

It's entirely possible that my patches are doing something dumb  
that's causing this, given my limited experience with Java.  But if  
that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer  
before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the  
length VInt yields a significant benefit over the standard techniques  
for reading in UTF-8.  That wouldn't be hard to believe.  Without  
that number, there's a lot of guesswork involved.  English requires  
about 1.1 bytes per UTF-8 code point; Japanese, 3.  Multiple memory  
allocation ops may be required as bytes get read in, especially if  
the final String object kicked out HAS to use the bare minimum amount  
of memory.  I don't suppose there's any way for me to snoop just  
what's happening under the hood in these CharsetDecoder classes or  
String constructors, is there?
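
To make the guesswork concrete, here is a minimal Java sketch (an illustration only, not code from the patch; the class and method names are hypothetical). With only a byte count in hand, a reader has to scan the UTF-8 once to learn how many chars to allocate, or else over-allocate and trim afterwards. The sketch assumes well-formed UTF-8 and does no error checking.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or of the patch under discussion.
final class Utf8Lengths {

    // Number of Java chars (UTF-16 code units) needed to hold the given UTF-8.
    // Assumes the bytes are well-formed; nothing is validated.
    static int utf16Length(byte[] utf8, int offset, int length) {
        int chars = 0;
        for (int i = offset; i < offset + length; i++) {
            int b = utf8[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                chars++;            // every non-continuation byte starts a code point
            }
            if (b >= 0xF0) {
                chars++;            // a 4-byte sequence becomes a surrogate pair
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        byte[] en = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] ja = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8); // three code points
        System.out.println(en.length + " bytes -> " + utf16Length(en, 0, en.length) + " chars");
        System.out.println(ja.length + " bytes -> " + utf16Length(ja, 0, ja.length) + " chars");
    }
}
```

A char-count prefix makes this extra pass unnecessary, which is one way the current format could come out ahead.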

Scanning through a SegmentTermEnum with next() doesn't seem to be any  
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs  
benchmarker doesn't slow down that much when IndexInput is changed to  
use a String constructor that accepts UTF-8 bytes rather than chars.   
However, it's possible that the modified toTerm method of TermBuffer  

Robert Engels | 1 Nov 2005 02:15

RE: bytecount as String and prefix length

All of the JDK source is available via download from Sun.


jacky | 1 Nov 2005 04:49

word count in a doc

Hi,
  e.g. in test.txt:
  hello world, hello friend.

  When I search for the word "hello", can I get the count 2 from Lucene?
  I found that the word count is a field of the index data structure, but I can't find the API to get it.
     Best Regards.

Robert Engels | 1 Nov 2005 05:49

RE: word count in a doc

It is the term frequency: see TermDocs.freq().
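
To show what that number means for the test.txt example, here is a small plain-Java sketch. It does not use Lucene at all; the class name and the tokenizer (lowercase, split on non-letters) are assumptions made for illustration and are not Lucene's analyzer. In Lucene itself the per-document count should be what TermDocs.freq() returns once positioned on a document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of "term frequency within one document".
final class TermFrequencyDemo {

    // Counts how often each token occurs in the text.
    // Tokenizing rule (an assumption): lowercase, then split on runs of non-letters.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;           // a leading separator yields one empty token
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = termFrequencies("hello world, hello friend.");
        System.out.println("hello -> " + freqs.get("hello"));
    }
}
```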

Marvin Humphrey | 1 Nov 2005 07:36

Re: bytecount as String and prefix length


On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

> All of the JDK source is available via download from Sun.

Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high loop overheads,  
since they have to keep track of the member variables of ByteBuffer  
and CharBuffer objects and prepare to return result objects on each  
loop iter.  Also, they have robust error-checking for malformed  
source data, which Lucene traditionally has not.  The algo below my  
sig should be faster.
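
For readers of the archive, where the algo itself is cut off: the following is a minimal sketch of the general idea, not the code from this message. It decodes into a char[] allocated up front, keeps all state in local variables, and skips validation entirely, so malformed input produces garbage or an ArrayIndexOutOfBoundsException. The class and method names are hypothetical, and it handles standard UTF-8 only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a tight, unchecked UTF-8 decoder.
final class FastUtf8 {

    static String utf8ToString(byte[] in, int offset, int length) {
        // Well-formed UTF-8 never yields more chars than it has bytes,
        // so the byte count is a safe upper bound for the char buffer.
        char[] out = new char[length];
        int o = 0;
        int i = offset;
        int end = offset + length;
        while (i < end) {
            int b = in[i++] & 0xFF;
            if (b < 0x80) {                       // 1 byte: ASCII
                out[o++] = (char) b;
            } else if (b < 0xE0) {                // 2 bytes
                out[o++] = (char) (((b & 0x1F) << 6) | (in[i++] & 0x3F));
            } else if (b < 0xF0) {                // 3 bytes
                out[o++] = (char) (((b & 0x0F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F));
            } else {                              // 4 bytes: emit a surrogate pair
                int cp = ((b & 0x07) << 18)
                        | ((in[i++] & 0x3F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F);
                cp -= 0x10000;
                out[o++] = (char) (0xD800 | (cp >> 10));
                out[o++] = (char) (0xDC00 | (cp & 0x3FF));
            }
        }
        return new String(out, 0, o);
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00E9llo \u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8ToString(bytes, 0, bytes.length));
    }
}
```

Note that the final String constructor still copies the chars, so the intermediate buffer is not free.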

I wrote...

> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.

Ok, this time we made a little headway.  We're down from 20% slower
to around 10% slower indexing than the current implementation.  But I
don't see how I'm going to get it any faster.  There's maybe one  
conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now.   
The byteBuf and charBuf have to be kept somewhere.  Currently, I'm  
allocating a ByteBuffer for each TermInfosWriter and a charBuf for  
each TermBuffer.  That's something of a memory hit, though it's hard  

Otis Gospodnetic | 1 Nov 2005 08:51

Faking index merge by modifying segments file?

Hello,

I spent most of today talking to some people about Lucene, and one of
them said how they would really like to have an "instantaneous index
merge", and how he is thinking he could achieve that by simply opening
the segments file of one index, and adding segment names of the other
index/indices, plus adjusting the segment size (SegSize in
fileformats.html), thus creating a single (but unoptimized) index.

Any reactions to that?

I imagine this isn't quite that simple to implement, as one would have
to renumber all documents, in order to avoid having multiple documents
with the same document id.

Can anyone think of any other problems with this approach, or perhaps
offer ideas for possible document renumbering?

Thanks,
Otis
Paul Elschot | 1 Nov 2005 09:20

Re: Faking index merge by modifying segments file?

On Tuesday 01 November 2005 08:51, Otis Gospodnetic wrote:
> Hello,
> 
> I spent most of today talking to some people about Lucene, and one of
> them said how they would really like to have an "instantaneous index
> merge", and how he is thinking he could achieve that by simply opening
> segments file of one index, and adding segment names of the other
> index/indices, plus adjusting the segment size (SegSize in
> fileformats.html), thus creating a single (but unoptimized) index.
> 
> Any reactions to that?
> 
> I imagine this isn't quite that simple to implement, as one would have
> to renumber all documents, in order to avoid having multiple documents
> with the same document id.
> 
> Can anyone think of any other problems with this approach, or perhaps
> offer ideas for possible document renumbering?

Document numbers within segments are determined dynamically in the
index reader, so these should not be a problem. Each segment simply numbers
its documents from zero. IIRC, the segment names determine the order
of the segments for an index reader.
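
A rough sketch of what "determined dynamically" can look like, in plain Java rather than Lucene's actual reader code. It assumes the reader simply stacks the segments in order, so a document's index-wide number is its segment's starting offset plus its number within the segment. All names here are hypothetical.

```java
// Hypothetical illustration: mapping index-wide document numbers
// to (segment, local document number) from the segment sizes alone.
final class SegmentDocMap {

    // starts[i] is the index-wide number of segment i's document 0;
    // the last entry holds the total document count.
    private final int[] starts;

    SegmentDocMap(int[] segmentSizes) {
        starts = new int[segmentSizes.length + 1];
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i + 1] = starts[i] + segmentSizes[i];
        }
    }

    int segmentOf(int globalDoc) {
        if (globalDoc < 0 || globalDoc >= starts[starts.length - 1]) {
            throw new IndexOutOfBoundsException("no such document: " + globalDoc);
        }
        int segment = 0;
        while (globalDoc >= starts[segment + 1]) {
            segment++;              // linear scan; also skips empty segments
        }
        return segment;
    }

    int localDoc(int globalDoc) {
        return globalDoc - starts[segmentOf(globalDoc)];
    }

    public static void main(String[] args) {
        SegmentDocMap map = new SegmentDocMap(new int[] { 3, 5, 2 });
        System.out.println("doc 4 -> segment " + map.segmentOf(4) + ", local " + map.localDoc(4));
    }
}
```

Under that assumption, appending a segment only extends the offsets table; nothing stored inside the segments needs renumbering.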

I think creating a new index by adding segments from an existing one should
be fairly straightforward. Some care will be needed to avoid
clashes in the segment names. Also what should happen with
the index from which the segments are taken? Should the shared segments
be copied between indexes?
It's possible to share segments between indexes when the file system allows

Marvin Humphrey | 1 Nov 2005 09:52

Re: bytecount as String and prefix length

I wrote:

> I've got one more idea... time to try overriding readString and  
> writeString in BufferedIndexInput and BufferedIndexOutput, to take  
> advantage of buffers that are already there.

Too complicated to be worthwhile, it turns out.  I think it's time to
throw in the towel.  Frustrating, because I was shooting for a win-win
solution (bytecounts and valid UTF-8 both make things a lot easier on
the Perl side).  My valid-UTF-8 patches from a month ago perform better
than the current Java implementation, but bytecounts make for slower
indexing.  Time to re-rethink the goal of index-compatibility.  How
about, compatible for ASCII-only indexes?  :|

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Robert Engels | 1 Nov 2005 15:34

RE: Faking index merge by modifying segments file?

The problem is that the terms need to be sorted within a single segment.


Kevin Oliver | 1 Nov 2005 18:17

RE: Faking index merge by modifying segments file?

Hello Otis,

I worked on a similar issue a couple of months ago. I've included our
email conversation below. 

Hopefully, your thread will prompt more interest from the mailing list. 

-Kevin

Sort of -- but only within a very controlled situation, and with some
hackery, can you comment out both of them.

Here's what I had to do -- in pseudo code.

Create a new IndexWriter subclass (call it IndexWriter2) that gets its
segmentCounter initialized to the real (the actual pre-existing index)
index's segmentCounter - 1. I suspect this part is not very robust, as I
don't completely understand why I needed to subtract 1 (it has something
to do with the temporary RAMDirectory that gets used before actually
getting written to disk).

Add the documents into our IndexWriter2 so that they get properly
written to a separate place on the file system and get the correct
"next" segment names that would appear in the real index.

Within the addIndexes() loop over the dirs, you move all the newly
created files from their current location over to the real index's file
directory. This part in particular feels very hacked.
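
The file-moving step on its own might look roughly like the sketch below. This is plain java.nio.file code with hypothetical names, not the original hack and not Lucene API; it does nothing about the segmentCounter adjustment or the segments file. It refuses to overwrite, so a clash in segment file names fails loudly instead of destroying part of the real index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: move freshly written segment files into the real index directory.
final class SegmentFileMover {

    // Moves every regular file from stagingDir into indexDir and returns how many moved.
    // Files.move without REPLACE_EXISTING throws FileAlreadyExistsException on a name clash.
    static int moveAll(Path stagingDir, Path indexDir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(stagingDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);   // collect first, so we never move while listing
                }
            }
        }
        for (Path file : files) {
            Files.move(file, indexDir.resolve(file.getFileName()));
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path index = Files.createTempDirectory("index");
        Files.createFile(staging.resolve("_5.cfs"));   // made-up file name for the demo
        System.out.println("moved " + moveAll(staging, index) + " file(s)");
    }
}
```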

Finally, instead of calling optimize() at the end of addIndexes(), you

