bugzilla | 1 Aug 2002 10:48

DO NOT REPLY [Bug 11359] New: - wildcard query lowercase

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=11359>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=11359

wildcard query lowercase

           Summary: wildcard query lowercase
           Product: Lucene
           Version: 1.2
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: QueryParser
        AssignedTo: lucene-dev <at> jakarta.apache.org
        ReportedBy: leon <at> level7.ro

We have a product which indexes some files. The indexer and the query parser use
the same analyzer. This analyzer applies the LowerCaseFilter to the terms. The
procedure works just fine for most of our queries, but there's a problem when a
more complex query is issued. I will describe the problem in the following examples:

Query: term1 +term2 term3
Result: Works

Doug Cutting | 1 Aug 2002 23:20

Re: setBoost Q.

Mike Tinnes wrote:
> I've been working on tying in a PageRank algo to
> my web crawler using lucene and have a few problems. If I don't know the
> boost factor until AFTER the crawl is it possible to still set the boost?

Why not: (1) crawl, saving pages to disk; (2) analyze links and compute 
boosts; then, finally, (3) build the Lucene index?

The API does not currently let you change a field's boost after a 
document is indexed.  It is in theory possible, but would require 
overwriting .fXX files, which further complicates inter-process 
synchronization of index access.  Perhaps this can be added as a caveat 
emptor API, but, in the meantime, I suggest the above approach.

> Also what does setBoost() actually do to the rank?

The rank is the position of a document in a hit list: the first hit has 
rank one, and so on.  Hits are sorted by score.  The boost is multiplied 
into the score of each hit.  So a boost greater than 1.0 will tend to move 
hits on that field toward the top of the list, while a boost less than 
1.0 will tend to move them toward the bottom.
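
To make the effect of a boost concrete, here is a minimal sketch in plain Java. It is not Lucene code: the `BoostSketch` and `Hit` classes, the raw scores, and the boost values are all hypothetical, and the only behavior modeled is what the reply describes (the boost is multiplied into the score, and hits are sorted by score).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: a hit's final score is its raw score
// multiplied by a boost, and hits are ranked by descending final score.
class BoostSketch {

    static final class Hit {
        final String id;
        final float rawScore;
        final float boost;

        Hit(String id, float rawScore, float boost) {
            this.id = id;
            this.rawScore = rawScore;
            this.boost = boost;
        }

        // The boost is multiplied into the score.
        float score() {
            return rawScore * boost;
        }
    }

    /** Returns hit ids ordered from rank one (highest score) downward. */
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score(), a.score()));
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("a", 0.5f, 1.0f),   // unboosted
                new Hit("b", 0.4f, 2.0f),   // boost > 1.0 lifts this hit
                new Hit("c", 0.6f, 0.5f));  // boost < 1.0 pushes this hit down
        System.out.println(rank(hits));
    }
}
```

In this made-up data set, hit "b" has the lowest raw score but ends up first once its boost of 2.0 is applied, while "c" has the highest raw score and ends up last.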

Doug
cutting | 5 Aug 2002 19:15

cvs commit: jakarta-lucene/src/test/org/apache/lucene/search TestPositionIncrement.java

cutting     2002/08/05 10:15:00

  Modified:    src/java/org/apache/lucene/analysis Token.java
               src/java/org/apache/lucene/index DocumentWriter.java
  Added:       src/test/org/apache/lucene/search TestPositionIncrement.java
  Log:
  Added support for Token.setPositionIncrement(int).

  Revision  Changes    Path
  1.2       +37 -0     jakarta-lucene/src/java/org/apache/lucene/analysis/Token.java

  Index: Token.java
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/analysis/Token.java,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- Token.java	18 Sep 2001 16:29:50 -0000	1.1
  +++ Token.java	5 Aug 2002 17:14:59 -0000	1.2
  @@ -74,6 +74,8 @@
     int endOffset;				  // end in source text
     String type = "word";				  // lexical type

  +  private int positionIncrement = 1;
  +
     /** Constructs a Token with the given term text, and start & end offsets.
         The type defaults to "word." */
     public Token(String text, int start, int end) {
  @@ -89,6 +91,41 @@
       endOffset = end;

cutting | 5 Aug 2002 19:39

cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis Token.java

cutting     2002/08/05 10:39:03

  Modified:    .        CHANGES.txt
               src/java/org/apache/lucene/analysis Token.java
  Log:
  Improved documentation.

  Revision  Changes    Path
  1.29      +16 -1     jakarta-lucene/CHANGES.txt

  Index: CHANGES.txt
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/CHANGES.txt,v
  retrieving revision 1.28
  retrieving revision 1.29
  diff -u -r1.28 -r1.29
  --- CHANGES.txt	29 Jul 2002 19:11:14 -0000	1.28
  +++ CHANGES.txt	5 Aug 2002 17:39:03 -0000	1.29
  @@ -58,6 +58,21 @@
        for longer fields.  Once the index is re-created, scores will be
        as before. (cutting)

  + 13. Added new method Token.setPositionIncrement().
  +
  +     This permits, for the purpose of phrase searching, placing
  +     multiple terms in a single position.  This is useful with
  +     stemmers that produce multiple possible stems for a word.
  +
  +     This also permits the introduction of gaps between terms, so that
  +     terms which are adjacent in a token stream will not be matched by

cutting | 5 Aug 2002 20:05

cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java

cutting     2002/08/05 11:05:56

  Modified:    .        CHANGES.txt
  Added:       src/java/org/apache/lucene/search QueryFilter.java
  Log:
  Added QueryFilter class.

  Revision  Changes    Path
  1.30      +13 -2     jakarta-lucene/CHANGES.txt

  Index: CHANGES.txt
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/CHANGES.txt,v
  retrieving revision 1.29
  retrieving revision 1.30
  diff -u -r1.29 -r1.30
  --- CHANGES.txt	5 Aug 2002 17:39:03 -0000	1.29
  +++ CHANGES.txt	5 Aug 2002 18:05:56 -0000	1.30
  @@ -71,7 +71,18 @@
        have been removed.

        Finally, repeating a token with an increment of zero can also be
  -     used to boost scores of matches on that token.
  +     used to boost scores of matches on that token.  (cutting)
  +
  + 14. Added new Filter class, QueryFilter.  This constrains search
  +     results to only match those which also match a provided query.
  +     Results are cached, so that searches after the first on the same
  +     index using this filter are very fast.
  +

Scott Ganyo | 5 Aug 2002 20:12

RE: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java

I assume that this means that my suggestion (for making Hits work as a
Filter) was discarded.  Was there any particular reason why?  Just
curious...

Scott

> -----Original Message-----
> From: cutting <at> apache.org [mailto:cutting <at> apache.org]
> Sent: Monday, August 05, 2002 1:06 PM
> To: jakarta-lucene-cvs <at> apache.org
> Subject: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search
> QueryFilter.java
> 
> 
> cutting     2002/08/05 11:05:56
> 
>   Modified:    .        CHANGES.txt
>   Added:       src/java/org/apache/lucene/search QueryFilter.java
>   Log:
>   Added QueryFilter class.
>   
>   Revision  Changes    Path
>   1.30      +13 -2     jakarta-lucene/CHANGES.txt
>   
>   Index: CHANGES.txt
>   ===================================================================
>   RCS file: /home/cvs/jakarta-lucene/CHANGES.txt,v
>   retrieving revision 1.29
>   retrieving revision 1.30
>   diff -u -r1.29 -r1.30

Doug Cutting | 5 Aug 2002 20:23

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java

Scott Ganyo wrote:
> I assume that this means that my suggestion (for making Hits work as a
> Filter) was discarded.  Was there any particular reason why?  Just
> curious...

Sorry I never got back to you.

This filter keeps a BitSet of all the matching documents.  Hits does not 
do this, and it would cause it to use a lot more memory to make it do 
so.  All of the queries which would never be used as a filter would thus 
pay this penalty.

Perhaps we should add a getQuery() method to Hits so that folks can 
extract the query and use it to construct a filter or another query. 
Would that meet your needs?

Note however that filters do not affect scoring.  Using a QueryFilter is 
not the same as requiring that same query in a BooleanQuery: the ranking 
may be different.
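
The difference between filtering and requiring a clause can be illustrated with a small sketch in plain Java. It is not Lucene code: the `FilterSketch` class, the per-document score arrays, and in particular the additive way the two clause scores are combined are assumptions made for illustration, not Lucene's actual scoring formula. The point it shows is only that a bit-set filter decides which documents are eligible without touching their scores, whereas a required clause contributes to the score.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical model, not Lucene code. Scores are indexed by document
// number; a score of 0 means the document does not match.
class FilterSketch {

    /** Ranks the eligible documents that have a positive score, best first. */
    static List<Integer> rankByScore(float[] scores, BitSet eligible) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = eligible.nextSetBit(0); doc >= 0; doc = eligible.nextSetBit(doc + 1)) {
            if (doc < scores.length && scores[doc] > 0f) {
                docs.add(doc);
            }
        }
        docs.sort((a, b) -> Float.compare(scores[b], scores[a]));
        return docs;
    }

    /** Filter: restricts which documents may match; scores are unchanged. */
    static List<Integer> searchWithFilter(float[] mainScores, BitSet filterBits) {
        return rankByScore(mainScores, filterBits);
    }

    /** Required clause: must match, and its score is added in (an assumption). */
    static List<Integer> searchWithRequiredClause(float[] mainScores, float[] clauseScores) {
        float[] combined = new float[mainScores.length];
        BitSet both = new BitSet(mainScores.length);
        for (int doc = 0; doc < mainScores.length; doc++) {
            if (mainScores[doc] > 0f && clauseScores[doc] > 0f) {
                combined[doc] = mainScores[doc] + clauseScores[doc];
                both.set(doc);
            }
        }
        return rankByScore(combined, both);
    }

    public static void main(String[] args) {
        float[] mainScores = {0.9f, 0.5f, 0.7f};
        float[] clauseScores = {0.1f, 0.8f, 0.0f};

        // The filter's bits mark the documents that match the second query.
        BitSet filterBits = new BitSet();
        filterBits.set(0);
        filterBits.set(1);

        System.out.println("filter:          " + searchWithFilter(mainScores, filterBits));
        System.out.println("required clause: " + searchWithRequiredClause(mainScores, clauseScores));
    }
}
```

With these made-up numbers both searches return documents 0 and 1, but in opposite orders: the filter keeps the main query's ordering, while the required clause lets the second query's score reorder the hits.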

Doug
Scott Ganyo | 5 Aug 2002 20:51

RE: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java

> This filter keeps a BitSet of all the matching documents.  Hits does not 
> do this, and it would cause it to use a lot more memory to make it do 
> so.  All of the queries which would never be used as a filter would thus 
> pay this penalty.

My thought was that the Filter.bits() method on Hits would only resolve the
BitSet if it was asked for (and probably wouldn't even cache it), so in the
common case Hits wouldn't suffer any ill effect.  Would that work?  (I feel
like I'm missing something obvious...)

Thanks,
Scott

> -----Original Message-----
> From: Doug Cutting [mailto:cutting <at> lucene.com]
> Sent: Monday, August 05, 2002 1:23 PM
> To: Lucene Developers List
> Subject: Re: cvs commit:
> jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
> 
> 
> Scott Ganyo wrote:
> > I assume that this means that my suggestion (for making Hits work as a
> > Filter) was discarded.  Was there any particular reason why?  Just
> > curious...
> 
> Sorry I never got back to you.
> 

Doug Cutting | 5 Aug 2002 22:23

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java

Scott Ganyo wrote:
> My thought was that the Filter.bits() method on Hits would only resolve the
> BitSet if it was asked for (and probably wouldn't even cache it), so in the
> common case Hits wouldn't suffer any ill effect.  Would that work?  (I feel
> like I'm missing something obvious...)

One could do this, but I'm not sure what the advantage would be.

In your original message on this topic, you wrote:

Scott Ganyo wrote:
 > But instead of adding a new class, why not change Hits to
 > inherit from Filter and add the bits() method to it?
 > Then one could "pipe" the output of one Query into another
 > search without modifying the Queries...

If that's the goal, then a bits() method is not a great way to do this, 
as it ignores the ranking in the first search when ranking the second. 
Since that is a material difference, I prefer to make it explicit.

Filters are not designed for searching within an arbitrary result set. 
For that you really should take the ranking for the first query into 
account: a new query should be formed by adding clauses to the original 
query.  Filters are instead designed to search subsets of an index 
defined by boolean criteria, criteria that do not affect ranking, like 
date, language, postal code, document type, etc.  They are particularly 
useful when the same criterion is used repeatedly, and the bit vector 
can be cached, as the construction and storage of a new bit-vector per 
query is expensive.  Thus the canonical uses of a filter should be to 
implement things like "modified in the last week" or "written in English" 

Scott Ganyo | 7 Aug 2002 16:27

IndexReader.lastModified() not correct

The current implementation of IndexReader.lastModified() does not return the
results I am expecting.  Here is the implementation:

  /** Returns the time the index in the named directory was last modified. */
  public static long lastModified(File directory) throws IOException {
    return FSDirectory.fileModified(directory, "segments");
  }

The problem is that the "segments" file is apparently not updated when just
doing a delete from an index.  This is causing me problems because I am
attempting to rely on the lastModified() method for IndexSearcher
caching.  The only solution that I have thought of so far (without changing
other parts of Lucene) is to make lastModified() look at all the
files in the last segment and return the most recent modification date.
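
A rough sketch of that idea, using only `java.io.File`, might look like the following. It is a simplification rather than a proposed patch: it scans every file in the index directory instead of only the files of the last segment, and the `IndexModifiedSketch` class and `newestModified` method are hypothetical names, not part of Lucene.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Lucene: reports the newest modification
// time of any regular file in an index directory, so that changes which do
// not rewrite the "segments" file are still noticed.
class IndexModifiedSketch {

    /** Returns the most recent modification time (ms since epoch) in the directory. */
    static long newestModified(File directory) throws IOException {
        File[] files = directory.listFiles();
        if (files == null) {
            // listFiles() returns null if the path is not a directory or cannot be read.
            throw new IOException("Not a readable directory: " + directory);
        }
        long newest = 0L;
        for (File f : files) {
            if (f.isFile()) {
                newest = Math.max(newest, f.lastModified());
            }
        }
        return newest;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; defaults to the current directory.
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(newestModified(dir));
    }
}
```

A searcher cache could compare this value with the one recorded when the IndexSearcher was opened, and reopen the searcher when it has advanced. Scanning the whole directory costs one stat call per file, which is an assumption that may or may not be acceptable for large indexes.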

Thoughts?

Thanks,
Scott
