lucene-cvs | 1 Jun 08:26 2004

[Jakarta Lucene Wiki] Updated: InformationRetrieval

   Date: 2004-05-31T23:26:12
   Editor: 81.92.64.5 <>
   Wiki: Jakarta Lucene Wiki
   Page: InformationRetrieval
   URL: http://wiki.apache.org/jakarta-lucene/InformationRetrieval

   Added two more great books

Change Log:

------------------------------------------------------------------------------
@@ -1,3 +1,5 @@
 = Books =

   * Managing Gigabytes
+  * Modern Information Retrieval
+  * Readings in Information Retrieval
ehatcher | 1 Jun 11:27 2004

cvs commit: jakarta-lucene-sandbox/contributions/db/src/java/org/apache/lucene/store/db Block.java DbDirectory.java DbInputStream.java DbOutputStream.java File.java

ehatcher    2004/06/01 02:27:04

  Modified:    contributions/db/src/java/org/apache/lucene/store/db
                        Block.java DbDirectory.java DbInputStream.java
                        DbOutputStream.java File.java
  Log:
  applied latest patch from Andi Vajda

  Revision  Changes    Path
  1.2       +14 -9     jakarta-lucene-sandbox/contributions/db/src/java/org/apache/lucene/store/db/Block.java

  Index: Block.java
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/db/src/java/org/apache/lucene/store/db/Block.java,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- Block.java	14 Jan 2004 00:29:45 -0000	1.1
  +++ Block.java	1 Jun 2004 09:27:03 -0000	1.2
  @@ -99,27 +99,32 @@
       protected void seek(long position)
           throws IOException
       {
  -        position = position >>> DbOutputStream.BLOCK_SHIFT;
           byte[] data = key.getData();
  -        int last = data.length - 1;
  +        int index = data.length - 8;

  -        for (int i = 0; i < 8; i++) {
  -            data[last - i] = (byte) (position & 0xff);

lucene-cvs | 1 Jun 17:29 2004

[Jakarta Lucene Wiki] Updated: DateRangeQueries

   Date: 2004-06-01T08:29:33
   Editor: 62.23.147.191 <>
   Wiki: Jakarta Lucene Wiki
   Page: DateRangeQueries
   URL: http://wiki.apache.org/jakarta-lucene/DateRangeQueries

   Note added on the importance of keeping the same instance of IndexReader

Change Log:

------------------------------------------------------------------------------
@@ -18,4 +18,5 @@
   ||1901||6||
   ||1904||8||
   ||1906||6||
-                                                                                                                                                    
+             
+  Erik Hatcher also wrote([http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg07016.html message]): One more point... caching is done by the IndexReader used for the search, so you will need to keep that instance (i.e. the IndexSearcher) around to benefit from the caching.
Bernhard Messer | 2 Jun 15:08 2004

IndexReader.getCurrentVersion() and IndexReader.lastModified()

Hi,

I'm sending a patch that should help fix a problem with the new
method IndexReader.getCurrentVersion(). As far as I understand the
current Lucene documentation, developers should use this new method to
verify whether an index is out of date. The older method
IndexReader.lastModified() is deprecated and therefore a possible
candidate for deletion.

The problem with getCurrentVersion() is that its base is 0 when a new
index is created. The version number will therefore be identical if
you delete an index and recreate it from the same document set,
regardless of whether the document content has changed or a different
analyzer is used. The idea of the patch is to initialize the version
number with the current time in milliseconds as the base when creating
a new SegmentInfos object, so it is "nearly" impossible to get the same
version number again.

Without this patch, it's impossible for developers to store an
IndexReader in a cache and check its validity through getCurrentVersion().
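
To make the problem concrete, here is a minimal sketch of a version counter with a configurable base. It is a toy model, not the attached patch: the class VersionBaseSketch and its methods are hypothetical names, and the millisecond values are purely illustrative.

```java
// Toy model of a version counter. Class and method names are hypothetical
// and are not taken from SegmentInfos or from the attached patch.
class VersionBaseSketch {
    private long version;

    // 'base' is the value the counter starts from when the index is created.
    VersionBaseSketch(long base) {
        this.version = base;
    }

    // Each committed change to the index bumps the version by one.
    void commit() {
        version++;
    }

    long getVersion() {
        return version;
    }

    // Creates an index with the given base, applies 'commits' changes,
    // and returns the resulting version number.
    static long versionAfter(long base, int commits) {
        VersionBaseSketch index = new VersionBaseSketch(base);
        for (int i = 0; i < commits; i++) {
            index.commit();
        }
        return index.getVersion();
    }

    public static void main(String[] args) {
        // 0-based: the original index and its rebuilt replacement end up
        // with the same version, so a cache cannot tell them apart.
        long original = versionAfter(0L, 3);
        long rebuilt = versionAfter(0L, 3);
        System.out.println("0-based versions equal: " + (original == rebuilt));

        // Time-based: the two indexes were created at different times
        // (illustrative millisecond values), so their versions differ.
        long createdFirst = 1_086_181_680_000L;
        long createdLater = createdFirst + 60_000L;
        long originalTimed = versionAfter(createdFirst, 3);
        long rebuiltTimed = versionAfter(createdLater, 3);
        System.out.println("time-based versions equal: " + (originalTimed == rebuiltTimed));
    }
}
```

In this model a collision is still possible when the gap between the two creation times, in milliseconds, happens to equal the difference in the number of commits, which is presumably why the scheme is only "nearly" collision-free.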

Attached are the patch and a JUnit TestCase that tests the
scenario with a sample implementation of an IndexReader cache.

As far as I can see, there are no negative side effects from applying
this patch. But let's see what the Lucene specialists think ;-)

best regards
Bernhard


Dmitry Serebrennikov | 2 Jun 19:14 2004

Re: IndexReader.getCurrentVersion() and IndexReader.lastModified()

Well, I know I didn't think of this case back when we were discussing
this change. As a recap, the issue was mainly that on some
architectures, the clock was not granular enough to detect updates
reliably, so some test cases were failing some of the time. You are
right, Bernhard, we didn't consider longer-running systems where entire
indexes might be deleted and recreated while the cache was still around.

I don't know, having the version start out as a date and then get
incremented as a version leaves a bad taste in my mouth somehow. At the 
time, we discussed other ideas that would use the date "most of the 
time" but would increment it explicitly if the clock was seen as not 
being granular enough. But the simple 0-based version number was seen as 
a much cleaner and superior solution when it was proposed.

Perhaps it would be cleaner to leave the version number 0-based and add 
an index creation date that would be explicitly available? This would 
mean that checking index validity would require checking the date and 
then the version. I would guess that only some applications or general 
purpose cache implementations would have to go to such an extent, while 
the majority can continue using just the getCurrentVersion() method by 
itself. How does this sound? Is there (should there be) an isCurrent() 
method on the IndexReader that could encapsulate this process?
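
The two-step check described above could look roughly like the following sketch. It assumes a hypothetical IndexStampSketch value holding the creation time and the 0-based version; neither the class nor its isCurrent() method exists in Lucene.

```java
// Hypothetical sketch of the "check the date, then the version" idea.
// None of these names exist in Lucene.
class IndexStampSketch {
    final long creationTime; // when the index was created, in milliseconds
    final long version;      // 0-based counter, incremented on each change

    IndexStampSketch(long creationTime, long version) {
        this.creationTime = creationTime;
        this.version = version;
    }

    // Returns true only if 'latest' describes the same index (same creation
    // time) in the same state (same version) as this stamp.
    boolean isCurrent(IndexStampSketch latest) {
        if (creationTime != latest.creationTime) {
            return false; // the index was deleted and recreated
        }
        return version == latest.version; // same index: any updates since?
    }

    public static void main(String[] args) {
        IndexStampSketch cached = new IndexStampSketch(1000L, 3L);
        System.out.println(cached.isCurrent(new IndexStampSketch(1000L, 3L))); // unchanged
        System.out.println(cached.isCurrent(new IndexStampSketch(1000L, 4L))); // updated
        System.out.println(cached.isCurrent(new IndexStampSketch(2000L, 3L))); // recreated
    }
}
```

Applications that never delete and recreate an index could keep comparing the version alone; only a general-purpose cache would need the combined check.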

Dmitry.


ehatcher | 3 Jun 05:04 2004

cvs commit: jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight HighlighterTest.java

ehatcher    2004/06/02 20:04:10

  Modified:    contributions/highlighter/src/java/org/apache/lucene/search/highlight
                        Fragmenter.java QueryScorer.java Scorer.java
               contributions/highlighter/src/test/org/apache/lucene/search/highlight
                        HighlighterTest.java
  Log:
  aesthetic/javadoc fixups

  Revision  Changes    Path
  1.2       +1 -1      jakarta-lucene-sandbox/contributions/highlighter/src/java/org/apache/lucene/search/highlight/Fragmenter.java

  Index: Fragmenter.java
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/highlighter/src/java/org/apache/lucene/search/highlight/Fragmenter.java,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- Fragmenter.java	9 Apr 2004 00:34:31 -0000	1.1
  +++ Fragmenter.java	3 Jun 2004 03:04:10 -0000	1.2
  @@ -33,7 +33,7 @@

   	/**
   	 * Test to see if this token from the stream should be held in a new TextFragment
  -	 * @param token
  +	 * @param nextToken
   	 * @return
   	 */
   	public boolean isNewFragment(Token nextToken);


Doug Cutting | 3 Jun 06:44 2004

Re: question on design for ordering of field names written to FieldInfos object

Peter M Cipollone wrote:
> I have a question about the following code from
> org.apache.lucene.index.SegmentMerger.  I would like to know if the ordering
> of the fields as they are stored to the FieldInfos object is critical to
> some other purpose.
> 
> In the code below (from a week+/- ago CVS pull), the fields are stored in
> the following order:
> 1. fields indexed=true, termVectors=true
> 2. fields indexed=true, termVectors=false
> 3. fields stored=true, indexed=false

I don't think it is critical to any other purpose, but nor do I think
one should require these to always be ordered the same way without a
very good reason.  They didn't use to be ordered this way, and
implementations of Lucene in other languages might order them differently.

Doug
Bernhard Messer | 3 Jun 12:15 2004

Re: IndexReader.getCurrentVersion() and IndexReader.lastModified()

Hi Dmitry,

From the point of view of keeping the interface clean, it would be much
better to have a separate method in IndexReader like "isCurrent()" or,
even nicer, "isValid()" that combines the system time of the index
creation (stored in SegmentInfos) and the current version number. I
think the implementation is not too difficult and can be done in a
short period of time. If wanted, I can try to provide a new patch
implementing a new method in IndexReader, "isValid()", that does
exactly that.

Bernhard


Christoph Goller | 3 Jun 16:07 2004

Re: IndexReader.getCurrentVersion() and IndexReader.lastModified()

Bernhard Messer wrote:
> Hi Dmitry,
> 
> From the point of view of keeping the interface clean, it would be much
> better to have a separate method in IndexReader like "isCurrent()" or,
> even nicer, "isValid()" that combines the system time of the index
> creation (stored in SegmentInfos) and the current version number. I
> think the implementation is not too difficult and can be done in a
> short period of time. If wanted, I can try to provide a new patch
> implementing a new method in IndexReader, "isValid()", that does
> exactly that.
> 
> Bernhard

As Dmitry said, we didn't have the case of deleted indices in mind when we
introduced the version number. I think that the solution of initializing
the version number with the current time in milliseconds is very elegant
(a minimal change with all the desired effects).

For a clean API I propose the following:
*) Keep Bernhard's initialization of the version number
*) Remove (deprecate) static IndexReader.getCurrentVersion() methods.
*) Introduce a new public non-static IndexReader member function
boolean isValid() that is similar to the current aquireLock and checks
whether the IndexReader is still valid.
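
A rough sketch of how the three points could fit together, under stated assumptions: IndexStateSketch stands in for whatever persists the version number, ReaderSketch stands in for IndexReader, and all names and signatures here are hypothetical rather than part of the Lucene API or of any submitted patch.

```java
// Hypothetical stand-in for the stored index state; not a Lucene class.
class IndexStateSketch {
    private long version;

    // The version is initialized with the creation time in milliseconds.
    IndexStateSketch(long nowMillis) {
        this.version = nowMillis;
    }

    // Deleting and recreating the index resets the version to a new time base.
    void recreate(long nowMillis) {
        this.version = nowMillis;
    }

    // Each committed change increments the version.
    void commit() {
        version++;
    }

    long currentVersion() {
        return version;
    }
}

// Hypothetical stand-in for IndexReader with a non-static isValid().
class ReaderSketch {
    private final IndexStateSketch index;
    private final long versionAtOpen;

    // Remembers the version seen when the reader was opened.
    ReaderSketch(IndexStateSketch index) {
        this.index = index;
        this.versionAtOpen = index.currentVersion();
    }

    // True while the index still has the version this reader was opened on.
    boolean isValid() {
        return index.currentVersion() == versionAtOpen;
    }

    public static void main(String[] args) {
        IndexStateSketch index = new IndexStateSketch(1000L);
        ReaderSketch reader = new ReaderSketch(index);
        System.out.println("freshly opened: " + reader.isValid());

        index.commit();
        System.out.println("after a commit: " + reader.isValid());

        ReaderSketch reopened = new ReaderSketch(index);
        index.recreate(5000L);
        System.out.println("after recreation: " + reopened.isValid());
    }
}
```

Because the check is an instance method, callers no longer need to read and compare version numbers themselves, which is what makes deprecating the static accessors plausible.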

Christoph
Bernhard Messer | 3 Jun 16:24 2004

Re: IndexReader.getCurrentVersion() and IndexReader.lastModified()


+1 to Christoph's proposal ;-)


