Kevin A. Burton | 1 Apr 2004 04:23

Re: RE : Performance of hit highlighting and finding term positions for a specific document

Rasik Pandey wrote:

> Kevin,
>
>> http://home.clara.net/markharwood/lucene/highlight.htm
>>
>> Trying to do hit highlighting. This implementation uses another
>> Analyzer to find the positions for the result terms. This seems
>> very inefficient, since Lucene already knows the frequency and
>> position of given terms in the index.
>
> Can you explain in more detail what you mean here?

It runs the StandardAnalyzer over the text again to find tokens... when it finds a token that matched the search request, it highlights it.

It works... just not too efficient.

Kevin
-- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
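
The approach Kevin describes above (re-tokenize the stored text, then mark every token that matches a query term) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the highlighter's actual code: the class name, the `highlight` method, the letter/digit tokenization and the `<b>` markup are all assumptions made for the example. It does show why the cost grows with the length of the text of each hit, since every character is scanned again at display time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not the Lucene highlighter API.
class RetokenizeHighlightSketch {

    // Scan the raw text, cut it into letter/digit runs, and wrap each run whose
    // lower-cased form is one of the query terms in <b>...</b>.
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String token = text.substring(start, i);
                if (queryTerms.contains(token.toLowerCase(Locale.ROOT))) {
                    out.append("<b>").append(token).append("</b>");
                } else {
                    out.append(token);
                }
            } else {
                // Separators (spaces, punctuation) are copied through unchanged.
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("missile", "defense"));
        // Prints: <b>Missile</b> <b>defense</b>, and more.
        System.out.println(highlight("Missile defense, and more.", terms));
    }
}
```

The query terms are expected in lower case here, mirroring a lower-casing analyzer; a real analyzer may also stem or split tokens differently.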
Kevin A. Burton | 1 Apr 2004 04:43

Re: Performance of hit highlighting and finding term positions for

Doug Cutting wrote:

> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev <at> jakarta.apache.org&msgId=1413989
>
> According to these, if your documents average 16k, then a 10-hit 
> result page would require just 66ms to generate highlights using 
> SimpleAnalyzer.

The whole search takes only 300ms... this means that if I highlight 5 
docs I've doubled my search time.

Note that Google has a whole subsection of their cluster dedicated to 
keyword in context extraction.

Kevin


Kevin A. Burton | 1 Apr 2004 04:46

Re: RE : Performance of hit highlighting and finding term positions for a specific document

Rasik Pandey wrote:

> Hello,
>
>> I've been meaning to look into good ways to store token offset
>> information to allow for very efficient highlighting, and I believe Mark
>> may also be looking into improving the highlighter via other means such
>> as temporary RAM indexes. Search the archives to get background on some
>> of the ideas we've tossed around ('Dmitry's Term Vector stuff, plus
>> some' and 'Demoting results' come to mind as threads that touch on this
>> topic).
>
> It would be nice if CachingRewrittenQueryWrapper.java, which I sent to
> lucene-dev (see below) last week, became part of these highlighting
> efforts, if appropriate. We use it to collect terms for a query that
> searches multiple indices.

Actually, I had to write one for my tests with the highlighter. I'm using a MultiSearcher and a WildcardQuery, which the highlighter didn't support.

My impl was fairly basic, so I wouldn't suggest contributing it... I'm sure yours is better.  The suggested changes to the highlighter for providing tokens would make these work well together.

Kevin

Kevin A. Burton | 1 Apr 2004 04:50

Re: [patch] MultiSearcher should support getSearchables()

Erik Hatcher wrote:

>
> No question that it'd be unwise to do.  We could say the same argument 
> for making everything public access as well and say it'd be stupid to 
> override this method, but we made it public anyway.  I'd rather opt on 
> the side of safety.
>
> Besides, you haven't provided a use case for why you need to get the 
> searchers back from a MultiSearcher :)
>
Just ease of use really... I have our MultiSearcher reload transparently,
and in this case I can verify that I'm using the right array of searchers,
not one that's already been reloaded behind me.

I can add some code to preserve the original searcher array but it's a pain.

Kevin


Doug Cutting | 1 Apr 2004 05:22

Re: Performance of hit highlighting and finding term positions for

Kevin A. Burton wrote:
> Doug Cutting wrote:
> 
>> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev <at> jakarta.apache.org&msgId=1413989
>>
>> According to these, if your documents average 16k, then a 10-hit 
>> result page would require just 66ms to generate highlights using 
>> SimpleAnalyzer.
> 
> The whole search takes only 300ms... this means that if I highlight 5 
> docs I've doubled my search time.

My math was wrong, but yours seems even more so!  I meant 110ms to 
highlight ten docs.  If you only highlight 5, then it's 55ms.  If your 
query is taking 300ms, then this adds less than 20%.

> Note that Google has a whole subsection of their cluster dedicated to 
> keyword in context extraction.

I think that's for i/o reasons, not because it requires a lot of
computation.

Doug
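
Doug's corrected arithmetic above can be checked with a few lines of code. This is only a sketch: the class and method names are made up for the example, and the inputs (11 ms per highlighted document, 5 documents, a 300 ms query) are the figures quoted in the thread. Plugging in a different per-document cost shows how strongly the choice of analyzer affects the result.

```java
import java.util.Locale;

// Hypothetical helper for the back-of-the-envelope calculation in the thread.
class HighlightOverheadSketch {

    // Highlighting time as a percentage of the base query time.
    static double overheadPercent(double perDocMs, int docs, double queryMs) {
        return 100.0 * (perDocMs * docs) / queryMs;
    }

    public static void main(String[] args) {
        // 5 docs * 11 ms = 55 ms on top of a 300 ms query.
        // Prints: 18.3%
        System.out.println(String.format(Locale.ROOT, "%.1f%%", overheadPercent(11.0, 5, 300.0)));
    }
}
```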
markharw00d | 1 Apr 2004 10:57

Re: Performance of hit highlighting and finding term positions for

730 msecs is the correct number for 10 * 16k docs with StandardTokenizer!
The 11ms per doc figure in my post was for highlighting using a
lower-case-filter-only analyzer. 5ms of this figure was the cost of the
lower-case-filter-only analyzer.

73 msecs is the cost of JUST StandardTokenizer (no highlighting).
StandardAnalyzer uses StandardTokenizer, so it is probably used in a lot of apps. It
tries to keep certain text, e.g. email addresses, as one term. I can live without it, and
I suspect most apps can too. I haven't looked into why it's slow, but I notice it does
make use of Vectors. I think a lot of people's highlighter performance issues may
stem from this.

Cheers
Mark
(apologies for the cross-post to lucene-dev - wrong group!)
Stefan Duffner | 1 Apr 2004 13:58

Merge multiple indeces like a sql join

Hi,

I am testing Lucene for a project. I am new to Lucene and I have a
question. I read the documentation and searched the mailing list, but I
did not find an answer to my question.

My problem:
I have 2 indices (I made a simple test example): the first index stores
information about books and the second stores information about
authors. Both indices have a field named number, which is a unique
identifier for an author. If these 2 indices were tables in a database, I
would normally join the tables, where number is the key of the author
table and the foreign key of the book table. Now I want to do the same
with Lucene. I used the MultiSearcher and I get 2 hits (one from the book
index and the other from the author index) if I use a unique number. I
think this is normal behaviour, but I am not sure. Can you tell me
whether this is correct? What can I do to solve this problem? (The only
idea I have for the moment is to determine which terms are for the one
index and which are for the other, and split them. Then search one index
and get the unique numbers from this search result. Then use these
unique numbers with the other query parameters for the next search in
the other index.)

-- 
Thanks for any comments
Stefan
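
The two-step idea Stefan outlines above (search one index for the join keys, then use those keys to restrict the search in the other) can be sketched without any Lucene code. Everything below is an assumption made for illustration: the indexes are modelled as in-memory lists of field maps, and the field names `name` and `title` are invented; only the shared `number` field comes from the post. With real Lucene indexes, step 2 would presumably become a query on the `number` field of the book index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for two indexes; none of this is Lucene API.
class TwoStepJoinSketch {

    // Build a "document" from alternating field names and values.
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    // Step 1: search the author "index" and collect the matching join keys.
    static Set<String> authorNumbers(List<Map<String, String>> authors, String name) {
        Set<String> numbers = new LinkedHashSet<>();
        for (Map<String, String> author : authors) {
            if (name.equals(author.get("name"))) {
                numbers.add(author.get("number"));
            }
        }
        return numbers;
    }

    // Step 2: search the book "index", restricted to the keys from step 1.
    static List<String> titlesFor(List<Map<String, String>> books, Set<String> numbers) {
        List<String> titles = new ArrayList<>();
        for (Map<String, String> book : books) {
            if (numbers.contains(book.get("number"))) {
                titles.add(book.get("title"));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Map<String, String>> authors = new ArrayList<>();
        authors.add(doc("number", "1", "name", "Author One"));
        authors.add(doc("number", "2", "name", "Author Two"));

        List<Map<String, String>> books = new ArrayList<>();
        books.add(doc("number", "1", "title", "First Book"));
        books.add(doc("number", "2", "title", "Second Book"));
        books.add(doc("number", "1", "title", "Third Book"));

        // Prints: [First Book, Third Book]
        System.out.println(titlesFor(books, authorNumbers(authors, "Author One")));
    }
}
```

The cost of this approach depends on how many keys step 1 returns, since each key becomes part of the second search.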
Terry Steichen | 1 Apr 2004 14:51

Re: Wierd Search Behavior

I did some more checking and uncovered what appears to be a serious Lucene
problem. (Either that or my merge code - below - is wrong.)  Appreciate any
help in figuring out what's wrong.  Here are the facts as I see them:

1) I put together a large number of canned queries (some rather complex) for
routine testing purposes.
2) I created a new compound file index and tested the queries.  All worked
fine.
3) I then indexed some new documents and merged the new index with the
original index.
4) I then tried the queries again.  Each time I did this, about 1-3% of the
queries no longer worked - the actual number appears to vary with each
merge.
5) The specific queries that fail change with each merge. Ones that failed
after the previous merge almost always appear to work again with the next
merge (which produces a new batch of failures).
6) In all cases I've so far examined, the offending part of the affected
queries is a single quoted phrase (even though there may be several such
phrases in the query) - remove it, and the (now modified) query works fine.
7) I tried the same thing using the original multi-file index format, with
the same results.
8) About a week and a half ago, I migrated from 1.3final to the latest CVS
head.
9) I've only just started checking this, so I don't know how long this
behavior has been going on.  The small percentage of errors and (apparent)
randomness of which query is affected make it hard to detect.
10) I have about 32 fields per document, most of which are tokenized,
indexed and stored.
11) My merge code (for the multi-file index format) is this:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

class MergeIndices {
  public static void main(String[] args) {

    // args[0]: relative path to main index
    // args[1]: relative path to new index (to be merged with main)

    try {
      // create=false: open the existing main index rather than overwriting it
      IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), false);
      // writer.setUseCompoundFile(true); // used for compound format
      FSDirectory dir = FSDirectory.getDirectory(args[1], false);
      FSDirectory[] dirs = new FSDirectory[1];
      dirs[0] = dir;
      writer.addIndexes(dirs);
      writer.optimize();
      writer.close();
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

}

----- Original Message -----
From: "Terry Steichen" <terry <at> net-frame.com>
To: "Lucene Users List" <lucene-user <at> jakarta.apache.org>
Sent: Wednesday, March 31, 2004 11:47 AM
Subject: Re: Wierd Search Behavior

> No, they're typos in the e-mail.  In the application, all the colons are
> properly placed.  (Guess I was/am so frustrated I can't write right any
> more).
>
> Terry
>
> ----- Original Message -----
> From: "Erik Hatcher" <erik <at> ehatchersolutions.com>
> To: "Lucene Users List" <lucene-user <at> jakarta.apache.org>
> Sent: Wednesday, March 31, 2004 9:55 AM
> Subject: Re: Wierd Search Behavior
>
>
> > On Mar 31, 2004, at 9:49 AM, Terry Steichen wrote:
> > > I'm experiencing some very puzzling search behavior.  I am using the
> > > CVS head I pulled about a week ago.  I use the StandardAnalyzer and
> > > QueryParser.  I have a collection of XML documents indexed.  One field
> > > is "subhead", and here's what I find with different queries:
> > > subhead:(missile defense)    - works fine
> > > subhead("missile" "defense") - works fine
> > > subhead("missile defense") - fails
> > > subhead(missile defense "missile defense") - fails
> > > subhead(missile defense "missile dork") - works fine
> > > subhead(missile defense "missile defens") - works fine (note
> > > misspelling)
> >
> > I presume the missing colons on all but the first example is just a
> > typo in your e-mail?  If not, might that be the problem?
> >
> > Erik
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe <at> jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help <at> jakarta.apache.org
Tate Avery | 1 Apr 2004 17:30

Searching in "all"

Hello,

If I have, for example, 3 fields in my document (title, body, notes)... is there some easy way to search 'all'?

Below are the only 2 ideas I currently have/use:

1) If I want to search for 'x' in all, I do something like:
	title:x OR body:x OR notes:x

... but this does not really work if you are searching for (a AND b) and a is in the title and b is in the notes, etc...
leading to an explosion of boolean combinations, it seems.

2) Actually index an 'all' field for my document by just concatenating the content from the title, body, and
notes fields.
... but this doubles my index size.  :(

So, is there a better way out there?

Thanks,
Tate
Erik Hatcher | 1 Apr 2004 17:42

Re: Searching in "all"

On Apr 1, 2004, at 10:30 AM, Tate Avery wrote:
> 2) Actually index an 'all' field for my document by just concatenating 
> the content from the title, body, and notes fields.
> ... but this doubles my index size.  :(

This is actually my recommended approach.  As for doubling the index 
size - not necessarily.  If you only use stored fields for the ones 
that truly need to be retrievable, things are more efficient space-wise. 
For example, you do not need to store the "all" field (use 
Field.UnStored).

	Erik
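
For cases where a separate "all" field is not wanted, the first idea from Tate's message can also be handled by expanding each term across the fields and then AND-ing the per-term groups, so the query grows with fields times terms and not with every combination. The sketch below only builds the query string; the class and method names are assumptions, and it does no escaping of special query characters, so it is a starting point and not a finished helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds a query string, it does not call Lucene.
class AllFieldsQuerySketch {

    // For each term, build one OR group over all fields, then AND the groups.
    static String expand(String[] fields, String[] terms) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        String[] fields = {"title", "body", "notes"};
        String[] terms = {"a", "b"};
        // Prints: (title:a OR body:a OR notes:a) AND (title:b OR body:b OR notes:b)
        System.out.println(expand(fields, terms));
    }
}
```

This matches a document where a is in the title and b is in the notes, which is the case the plain per-field OR query misses.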
