1 Nov 2005 01:31
Re: bytecount as String and prefix length
Marvin Humphrey <marvin <at> rectangular.com>
2005-11-01 00:31:14 GMT
I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck. Still 20% slower than the current implementation. The algo is below, for reference.

It's entirely possible that my patches are doing something dumb that's causing this, given my limited experience with Java. But if that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the length VInt yields a significant benefit over the standard techniques for reading in UTF-8. That wouldn't be hard to believe. Without that number, there's a lot of guesswork involved. English requires about 1.1 bytes per UTF-8 code point; Japanese, 3. Multiple memory allocation ops may be required as bytes get read in, especially if the final String object kicked out HAS to use the bare minimum amount of memory.

I don't suppose there's any way for me to snoop just what's happening under the hood in these CharsetDecoder classes or String constructors, is there?

Scanning through a SegmentTermEnum with next() doesn't seem to be any slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs benchmarker doesn't slow down that much when IndexInput is changed to use a String constructor that accepts UTF-8 bytes rather than chars. However, it's possible that the modified toTerm method of TermBuffer...
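To make the second explanation concrete, here is a minimal sketch of the two reading strategies being compared. It is not the patch under discussion and not Lucene's actual code: the class name `Utf8PrefixSketch` and both helper methods are hypothetical, and the hand-rolled decoder assumes well-formed input with 1- to 3-byte sequences only (no surrogate pairs). With a char-count prefix the output buffer can be sized exactly before any byte is decoded; with a byte-count prefix the decoder has to work out the char count itself.

```java
import java.nio.charset.StandardCharsets;

public class Utf8PrefixSketch {

    // Hypothetical helper: the char count is known up front (as with a length
    // VInt), so the char[] is allocated once at its final size.
    // Assumes valid UTF-8 made of 1- to 3-byte sequences only.
    static char[] decodeKnownCharCount(byte[] utf8, int charCount) {
        char[] out = new char[charCount];
        int in = 0;
        for (int i = 0; i < charCount; i++) {
            int b = utf8[in++] & 0xFF;
            if (b < 0x80) {
                // single-byte sequence (ASCII)
                out[i] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {
                // two-byte sequence
                out[i] = (char) (((b & 0x1F) << 6) | (utf8[in++] & 0x3F));
            } else {
                // three-byte sequence
                out[i] = (char) (((b & 0x0F) << 12)
                        | ((utf8[in++] & 0x3F) << 6)
                        | (utf8[in++] & 0x3F));
            }
        }
        return out;
    }

    // Hypothetical helper: only the byte count is known, so sizing the
    // result is left to the standard String constructor.
    static String decodeKnownByteCount(byte[] utf8, int byteCount) {
        return new String(utf8, 0, byteCount, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String english = "index";
        String japanese = "\u691c\u7d22"; // two CJK characters
        for (String s : new String[] {english, japanese}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            String viaChars = new String(decodeKnownCharCount(bytes, s.length()));
            String viaBytes = decodeKnownByteCount(bytes, bytes.length);
            System.out.println(s.length() + " chars, " + bytes.length
                    + " bytes, match=" + (viaChars.equals(s) && viaBytes.equals(s)));
        }
    }
}
```

The two sample strings show why a byte count alone is a weak size hint: "index" is 5 chars in 5 bytes, while the two-character Japanese string takes 6 bytes.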