Nick Smith | 1 Sep 2003 17:04

Issue with Similarity and negative numbers

Hi Luceners!

I am misusing the document score for date sorting (I display news
headlines in a chronological list).

As the document score is ultimately encoded as a byte the maximum
possible number of values is 256 minus the special value of 0
(document not found).

In the current implementation, all negative float values get
rounded up to zero by Similarity.floatToByte(), and only for byte
values in the range 1 to 127 does Similarity.byteToFloat() return
a value greater than the decode of the next lower byte value.

i.e. 
Similarity.byteToFloat(byteVal+1) > Similarity.byteToFloat(byteVal)

For my application, having 255 possible scores from searches is better
than 127, so:

I have patched the Similarity class to encode negative floats into
the negative byte values and to decode the negative byte values back
into negative floats.

The encoding of the positive values is unchanged by this patch.

Could this version please be checked into CVS by someone with commit
rights?  Or is there a more formal procedure for submitting patches,
say via Bugzilla?
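
To make the idea concrete, here is a minimal sketch of a sign-preserving
byte encoding. It is NOT the patched Similarity class and does not use
Lucene's actual byte format; the class name SignedByteScore and the simple
linear rounding are assumptions made only for illustration. It shows how
using the negative byte values yields 255 distinct, ordered scores
(-127 to 127) with 0 kept as the special value.

```java
// Hypothetical illustration only: a linear, sign-preserving float/byte codec.
// Lucene's real Similarity.floatToByte() uses a different, more compact format.
final class SignedByteScore {
    private SignedByteScore() {}

    /** Encodes f as a byte: 0 stays 0, the sign is kept, magnitudes clamp to 1..127. */
    static byte floatToByte(float f) {
        if (f == 0.0f || Float.isNaN(f)) {
            return 0; // special value, reserved
        }
        long rounded = Math.round((double) Math.abs(f));
        int magnitude = (int) Math.min(127L, Math.max(1L, rounded));
        return (byte) (f < 0 ? -magnitude : magnitude);
    }

    /** Decodes a byte back to a float; negative bytes decode to negative floats. */
    static float byteToFloat(byte b) {
        return (float) b;
    }
}
```

With a codec of this shape, byteToFloat(b + 1) > byteToFloat(b) holds for
every b from -127 to 126, not just for the positive half.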

otis | 2 Sep 2003 14:38

cvs commit: jakarta-lucene/xdocs resources.xml

otis        2003/09/02 05:38:43

  Modified:    docs     resources.html
               xdocs    resources.xml
  Log:
  - Added a link to lucenedotnet project on SF.

  Revision  Changes    Path
  1.33      +3 -0      jakarta-lucene/docs/resources.html

  Index: resources.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/resources.html,v
  retrieving revision 1.32
  retrieving revision 1.33
  diff -u -r1.32 -r1.33
  --- resources.html	2 Sep 2003 12:20:00 -0000	1.32
  +++ resources.html	2 Sep 2003 12:38:43 -0000	1.33
  @@ -157,6 +157,9 @@
                   <li><a href="http://sourceforge.net/projects/nlucene/">NLucene</a>
                       <br /> - .NET implementation of Lucene
                   </li>
  +                <li><a href="http://sourceforge.net/projects/lucenedotnet">Lucene.net</a>
  +                    <br /> - .NET implementation of Lucene
  +                </li>
                   <li><a href="http://www.divmod.org/Lupy/">Lupy</a>
                       <br /> - Python implementation of Lucene
                   </li>

  

otis | 2 Sep 2003 14:20

cvs commit: jakarta-lucene/docs resources.html

otis        2003/09/02 05:20:00

  Modified:    xdocs    resources.xml
               docs     resources.html
  Log:
  - Added a link to Erik's java.net article.

  Revision  Changes    Path
  1.13      +3 -1      jakarta-lucene/xdocs/resources.xml

  Index: resources.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/resources.xml,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- resources.xml	3 Jun 2003 06:07:29 -0000	1.12
  +++ resources.xml	2 Sep 2003 12:20:00 -0000	1.13
  @@ -24,7 +24,9 @@
                           href="http://www-106.ibm.com/developerworks/library/j-lucene/">Parsing, indexing, and searching XML with Digester and Lucene</a>
                       <br/> - IBM developerWorks, June 2003
                   </li>
  -
  +                <li><a href="http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html">Lucene Intro</a>
  +                    <br/> - java.net, July 2003
  +                </li>
                   <li><a
                           href="http://www.chedong.com/tech/lucene.html">Lucene introduction in Chinese</a>
                   </li>

Otis Gospodnetic | 2 Sep 2003 14:54

Re: cvs commit: jakarta-lucene build.xml

Those changes are fine by me.
You could also shorten that 'testcase' part (e.g. test, ut, tc, t).

Otis

--- Erik Hatcher <lists <at> ehatchersolutions.com> wrote:
> FYI - this enables running a single unit test like this:
> 
> 	ant test-unit -Dtestcase=TestRussianAnalyzer
> 
> Anyone object to me changing test-unit target name to "test" and the 
> current "test" one to "test-compile"?  Wouldn't it make more sense to
> 
> type "ant test" to run the tests, not just compile them?
> 
> Also, if there is anything I broke with the build, by all means let
> me 
> know - this is my first foray into committing to Lucene.
> 
> 
> On Tuesday, August 12, 2003, at 06:49  AM, ehatcher <at> apache.org wrote:
> 
> > ehatcher    2003/08/12 03:49:44
> >
> >   Modified:    .        build.xml
> >   Log:
> >   allow isolation of a single unit test
> >
> >   Revision  Changes    Path
> >   1.40      +4 -1      jakarta-lucene/build.xml

otis | 2 Sep 2003 15:58

cvs commit: jakarta-lucene/docs fileformats.html

otis        2003/09/02 06:58:01

  Modified:    xdocs    fileformats.xml
               docs     fileformats.html
  Log:
  - Corrected(?) the documentation about normalization factors.

  Revision  Changes    Path
  1.4       +3 -2      jakarta-lucene/xdocs/fileformats.xml

  Index: fileformats.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/fileformats.xml,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -r1.3 -r1.4
  --- fileformats.xml	6 Mar 2003 19:14:17 -0000	1.3
  +++ fileformats.xml	2 Sep 2003 13:58:01 -0000	1.4
  @@ -1071,12 +1071,13 @@
                   </p>
               </subsection>
               <subsection name="Normalization Factors">
  -                <p>The .nrm file contains,
  +                <p>There's a norm file for each indexed field with a byte for
  +                   each document.  The .n[0-9]* file contains,
                       for each document, a byte that encodes a value that is multiplied
                       into the score for hits on that field:
                   </p>
                   <p>Norms
  -                    (.nrm)    --&gt; &lt;Byte&gt;<sup>SegSize</sup>

bugzilla | 2 Sep 2003 16:21

DO NOT REPLY [Bug 22469] - org.apache.lucene.search.Query.toString(String field) ignores it's only parameter

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=22469>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=22469

org.apache.lucene.search.Query.toString(String field) ignores it's only parameter

------- Additional Comments From otis <at> apache.org  2003-09-02 14:21 -------
Please provide some examples of what you are describing, so I can understand the
problem.  Runnable code samples in the form of a (j)unit test would be even
better.

Erik Hatcher | 3 Sep 2003 05:34

ASL in test cases

Is there a reason why the Apache license isn't in any (or at least 
several) of the test case source files?

I'll go paste it in if no one objects (although I'm not sure why you'd 
object; in fact it's probably a mandatory thing in these parts).

	Erik

Christoph Goller | 3 Sep 2003 15:58

PATCH: SegmentsReader/SegmentsTermEnum

Hi Lucene Developers,

First, let me thank you all for this excellent piece of software
that you created. I am using Lucene in several projects and I
am currently also building more advanced text mining applications
on top of it. Because of that I have spent a lot of time studying
the Lucene sources, and I will come up with a couple of proposals
for bug fixes in the coming days. Here is the first one:

I think I can fix a bug in SegmentsTermEnum.
One can create a TermEnum from an IndexReader in two ways:

indexReader.terms()
indexReader.terms(t)

If one gets a TermEnum starting at a specified term t, one does not
have to call enum.next() before using it. The enum is valid from the
beginning. Calling enum.next() switches to the next term. However, this
behaviour only holds if the index consists of a single segment. If the
index consists of several segments, term t is delivered twice: the first
time by enum.term() right after calling indexReader.terms(t), the second
time after calling enum.next(). Furthermore, the initial document
frequency may be wrong (if t occurs in more than one segment). The problem
can be fixed by calling next() in the constructor of SegmentsTermEnum.
I attach a test that demonstrates the problem and a patch that fixes it.
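
The effect of priming the enumeration in the constructor can be sketched
with a small stand-alone model. This is not Lucene code and not the
attached patch: MergedTermEnum, its Cursor helper and the use of TreeMap
as a stand-in for a segment's term dictionary (term -> document frequency)
are all assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical model of a term enumeration merged over several segments.
final class MergedTermEnum {
    /** Position inside one segment's sorted term list. */
    private static final class Cursor {
        final Iterator<Map.Entry<String, Integer>> it;
        Map.Entry<String, Integer> current;

        Cursor(Iterator<Map.Entry<String, Integer>> it) { this.it = it; }

        boolean advance() {
            current = it.hasNext() ? it.next() : null;
            return current != null;
        }
    }

    private final PriorityQueue<Cursor> queue =
        new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current.getKey()));
    private String term;
    private int docFreq;

    MergedTermEnum(List<TreeMap<String, Integer>> segments, String start) {
        for (TreeMap<String, Integer> segment : segments) {
            Cursor c = new Cursor(segment.tailMap(start, true).entrySet().iterator());
            if (c.advance()) {
                queue.add(c);
            }
        }
        // Prime the enum: merge the first term across all segments so that
        // term() and docFreq() are valid immediately, and a later next()
        // moves on instead of delivering the same term again.
        next();
    }

    /** Moves to the next distinct term, summing its frequency over all segments. */
    boolean next() {
        if (queue.isEmpty()) {
            term = null;
            docFreq = 0;
            return false;
        }
        term = queue.peek().current.getKey();
        docFreq = 0;
        while (!queue.isEmpty() && queue.peek().current.getKey().equals(term)) {
            Cursor c = queue.poll();
            docFreq += c.current.getValue();
            if (c.advance()) {
                queue.add(c);
            }
        }
        return true;
    }

    String term() { return term; }

    int docFreq() { return docFreq; }
}
```

In this model, omitting the next() call from the constructor would leave
the start term sitting in the queue, which is the situation that leads to
it being delivered twice.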

kind regards,
Christoph


Christoph Goller | 3 Sep 2003 17:21

PATCH: IndexWriter


IndexWriter implements the method docCount() which reads the number
of documents from the SegmentInfos of the index. However, it delivers
incorrect values if documents get deleted from the index. The reason for
this is that SegmentInfo.docCounts are updated in an incorrect way when
segments get merged. The new value is taken from the old SegmentInfos.
It would be better to take the value from the reader instead. In this
way indexWriter.docCount() would deliver the same value as
indexReader.maxDoc().
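
The difference between the two ways of counting can be shown with a tiny
model. This is only a sketch: MergeDocCountSketch and its Segment class are
hypothetical and do not mirror the real IndexWriter or SegmentInfo code;
the one thing assumed from Lucene is that merging drops deleted documents.

```java
// Hypothetical model: compares a count summed from pre-merge bookkeeping
// with the count of the segment that actually results from the merge.
final class MergeDocCountSketch {
    /** A segment with maxDoc document slots, some of them marked deleted. */
    static final class Segment {
        final int maxDoc;
        final int deleted;

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }

        int numDocs() { return maxDoc - deleted; }
    }

    /** Merging keeps only live documents, so the result has no deletions. */
    static Segment merge(Segment... segments) {
        int live = 0;
        for (Segment s : segments) {
            live += s.numDocs();
        }
        return new Segment(live, 0);
    }

    /** Sum of the old per-segment counts; still includes deleted documents. */
    static int countFromOldInfos(Segment... segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.maxDoc;
        }
        return total;
    }
}
```

For example, merging a 10-document segment with 3 deletions and a
5-document segment without deletions gives a merged segment whose maxDoc
is 12, while the sum of the old counts is 15.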

test and patch are attached,
Christoph

-- 
*****************************************************************
* Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
* Detego Software GmbH       Mobile: +49 179 1128469            *
* Keuslinstr. 13             Fax.:   +49 721 151516176          *
* 80798 München, Germany     Email:  goller <at> detego-software.de  *
*****************************************************************
Index: IndexWriter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.14
diff -u -r1.14 IndexWriter.java
--- IndexWriter.java	12 Aug 2003 15:05:03 -0000	1.14
+++ IndexWriter.java	3 Sep 2003 14:55:33 -0000
@@ -355,7 +355,7 @@