Re: Lucene search benchmark/stress test tool
2006-05-01 00:28:20 GMT
On Apr 26, 2006, at 9:34 AM, Otis Gospodnetic wrote:
> I'm about to write a little command-line Lucene search benchmark
> tool. I'm interested in benchmarking search performance and the
> ability to specify concurrency level (# of parallel search threads)
> and response timing, so I can calculate min, max, average, and mean
> times. Something like 'ab' (Apache Benchmark) tool, but for Lucene.
>
> Has anyone already written something like this?
I'm about to. The predecessor to the indexing benchmarker tests I recently published results for was enormously helpful while streamlining the indexing process. Now that I'm considering modifications to search logic and file format which may have a substantial impact on search-time performance, I'll need a search benchmarker to complement the indexing benchmarker.
I'll be writing both a Perl/KinoSearch and a Java Lucene version, and they will use the Reuters corpus.
Where are you at with your app?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: SpanFirstQuery and SpanNotQuery
2006-05-01 18:39:51 GMT
After consulting LIA, I find the following sentence at the end of 5.4.2
"Finding spans at the beginning of a field" ...
The resulting span matches are the same as the original SpanQuery
spans, ...
...which indicates that it is working as documented -- but it remains to
be seen if that's what was intended or not.
In any case, it raises the question: how would someone go about building a
SpanQuery where the intent is:
SpanQuery X must be in the first N positions
AND
SpanQuery Z must not come before SpanQuery X
The best I can figure would be something like...
SpanQuery q = new SpanNotQuery(
    new SpanFirstQuery(X, N),
    new SpanFirstQuery(
        new SpanNearQuery(new SpanQuery[]{Z, (SpanQuery) X.clone()}, N-1, true),
        N-1));
...but this doesn't seem to work at the moment because of LUCENE-560.
Even if it did work, it seems kind of kludgy. Does anyone have a better
suggestion?
: I'm looking at SpanQueries as I work on new test cases for LUCENE-557, and
Re: Lucene search benchmark/stress test tool
2006-05-01 19:34:59 GMT
Marvin,
I wrote my Lucene search benchmarker, but will have to check with my employer about contributing it to Lucene. It's rather simple: I used the Java 1.5 concurrency package's ThreadPoolExecutor to execute N parallel search requests, measured the elapsed time for each request, and then, when all searches were done, calculated min/max/median/percentile/etc.
Otis
----- Original Message ----
From: Marvin Humphrey <marvin <at> rectangular.com>
To: java-user <at> lucene.apache.org
Sent: Sunday, April 30, 2006 8:28:20 PM
Subject: Re: Lucene search benchmark/stress test tool
On Apr 26, 2006, at 9:34 AM, Otis Gospodnetic wrote:
> I'm about to write a little command-line Lucene search benchmark
> tool. I'm interested in benchmarking search performance and the
> ability to specify concurrency level (# of parallel search threads)
> and response timing, so I can calculate min, max, average, and mean
> times. Something like 'ab' (Apache Benchmark) tool, but for Lucene.
>
> Has anyone already written something like this?
I'm about to. The predecessor to the indexing benchmarker tests I recently published results for was enormously helpful while streamlining the indexing process. Now that I'm considering modifications to search logic and file format which may have a substantial impact on search-time performance, I'll need a search benchmarker to complement the indexing benchmarker. I'll be writing
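The approach Otis describes (N parallel requests on a thread pool, per-request timing, summary statistics at the end) could look roughly like the sketch below. This is not Otis's tool, which was not posted. The class name, the method names, and the Runnable standing in for a real Lucene search call are all assumptions made for illustration, and the percentile uses the nearest-rank method, which is one of several common definitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: names are illustrative, not from the thread.
final class SearchBenchmarkSketch {

    // Runs `requests` invocations of `search` on a pool of `threads` workers
    // and returns the per-request elapsed times in nanoseconds, sorted ascending.
    static long[] run(int threads, int requests, Runnable search) {
        // newFixedThreadPool is backed by a ThreadPoolExecutor.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    search.run(); // stand-in for e.g. a searcher.search(query) call
                    return System.nanoTime() - start;
                });
            }
            // invokeAll blocks until every task has completed.
            List<Future<Long>> futures = pool.invokeAll(tasks);
            long[] nanos = new long[requests];
            for (int i = 0; i < requests; i++) {
                nanos[i] = futures.get(i).get();
            }
            Arrays.sort(nanos);
            return nanos;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Nearest-rank percentile over an ascending, non-empty array; p in [0, 100].
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Placeholder workload; a real run would execute a Lucene query here.
        long[] t = run(4, 100, () -> Math.sqrt(System.nanoTime()));
        System.out.println("min    (ns): " + t[0]);
        System.out.println("median (ns): " + percentile(t, 50));
        System.out.println("p95    (ns): " + percentile(t, 95));
        System.out.println("max    (ns): " + t[t.length - 1]);
    }
}
```

Timing inside the task measures only the search itself; time spent waiting in the pool's queue is excluded, which may or may not be what a stress test wants.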
Stemming terms in SpanQuery
2006-05-01 19:37:50 GMT
Hi,
I'm trying to build a SpanQuery using word stems. Is parsing each term with a QueryParser, constructed with an Analyzer that produces a stemmed TokenStream, the right approach? It just seems to me that QueryParser is designed to parse queries, and so my hunch is that there might be a better way.
Any help will be much appreciated.
Michael
Kneobase: open source, enterprise search
2006-05-02 10:38:46 GMT
Hi list,
I’m glad to announce Colaborativa.net has released Kneobase, an open source "enterprise search" product, based on Lucene.
Kneobase can accept many data sources as searchable elements, and can provide search results in multiple formats, including SOAP, which might make it a good search engine for use in a service-oriented environment, because it doesn't need search indexes published to a web server.
See this article in TSS for more information,
Cheers!
--mariano
OutOfMemoryError while enumerating through reader.terms(fieldName)
2006-05-02 10:55:06 GMT
Hi,
I am getting an OutOfMemoryError while enumerating through a TermEnum after invoking reader.terms(fieldName).
Just to provide you more information, I have almost 10000 unique terms in field A. I can successfully enumerate around 5000 terms, but later I get an OutOfMemoryError.
I set the JVM max memory to 512MB. Of course, my index is bigger than this memory, around 1GB-2GB.
How can I ask Lucene to clean up the loaded index and traverse the remaining terms? It seems that while enumerating, memory always grows in steps of some MBs.
Any help would be really appreciated.
Thanks in advance,
Jelda
Re: Kneobase: open source, enterprise search
2006-05-02 12:52:25 GMT
I was quickly looking at its web page earlier today and it looks good so far! Good news!
However, I have one question: does Kneobase contain any kind of web crawler functionality (like Nutch), or do I have to feed it with all sources *manually*? How much of the web data gathering can be automated?
Maybe these questions should go to the Kneobase mailing list or directly to the TSS site, but they require registration... and I am too lazy to give out information about my budget and position details right now.
Regards,
Lukas
On 5/2/06, Mariano Barcia <mariano.barcia <at> colaborativa.net> wrote:
> Hi list,
>
> I'm glad to announce Colaborativa.net has released Kneobase, an open
> source "enterprise search" product, based on Lucene.
>
> *Kneobase can accept many data sources as searchable elements, and can
> provide search results in multiple formats, including SOAP, which might make
> it a good search engine for use in a service-oriented environment, because
> it doesn't need search indexes published to a web server.*
>
> See this article in TSS <http://www.theserverside.com/news/thread.tss?thread_id=40160> for more information,
>
> Cheers!
>
> --mariano
>
> <http://www.colaborativa.net/>
> Lideramos la innovación
>
> *Mariano Barcia*
> *Founder, CEO*
> *COLABORATIVA.NET*
> Córdoba 1147 Piso 6 Oficinas 3 y 4
> (S2000AWO) Rosario, SF,
> Argentina
>
> mariano.barcia <at> colaborativa.net
> IM: mariano_barcia <at> hotmail.com
> Skype: mbarcia
> http://www.colaborativa.net
> http://www.kneobase.com
>
> tel: +54 (341) 528 9987
> mobile: +54 9 (341) 587 6695
RE: OutOfMemoryError while enumerating through reader.terms(fieldName)
2006-05-02 13:07:52 GMT
Hi,
I just debugged it closely. Sorry, I am getting the OutOfMemoryError not because
of reader.terms()
but because of invoking the QueryFilter.bits() method for each unique term.
I will try to explain with pseudocode.
// First loop: collect the text of every term in the field.
while (term != null) {
    if (!term.field().equals(fieldName)) {
        break;  // ran past the last term of this field
    }
    keys.addElement(term.text());
    if (!te.next()) {
        break;  // no more terms in the index
    }
    term = te.term();
}
// Second loop: build and cache one BitSet per unique term.
for (Iterator iter = keys.iterator(); iter.hasNext();) {
    String termText = (String) iter.next();
    TermQuery termQuery = new TermQuery(new Term(fieldName, termText));
    QueryFilter filter = new QueryFilter(termQuery);
    BitSet bits = filter.bits(ciaoReader.getIndexReader());
    cache.put(termText, bits);
}
The second for loop, which gets a BitSet from QueryFilter for each term, is
what now throws the OutOfMemoryError.
Any advice is really welcome.
Thx,
Jelda
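A back-of-the-envelope sketch of why the second loop can exhaust the heap. It assumes each cached BitSet is sized to the reader's maxDoc() (one bit per document, stored in 64-bit words) and that nothing is evicted from the cache. The thread does not state the number of documents in the index, so the figure of 1,000,000 below is purely hypothetical, as are the class and method names.

```java
// Hypothetical estimate: all names and the document count are illustrative.
final class BitSetCacheEstimate {

    // Approximate payload of one java.util.BitSet holding maxDoc bits:
    // bits are stored in 64-bit words, 8 bytes each (object overhead ignored).
    static long bytesPerBitSet(int maxDoc) {
        long words = ((long) maxDoc + 63) / 64;
        return words * 8;
    }

    // Total bytes held if one BitSet is cached per unique term.
    static long cacheBytes(int uniqueTerms, int maxDoc) {
        return (long) uniqueTerms * bytesPerBitSet(maxDoc);
    }

    // How many such BitSets fit into a heap of the given size,
    // ignoring everything else the JVM and Lucene need.
    static long bitSetsThatFit(long heapBytes, int maxDoc) {
        return heapBytes / bytesPerBitSet(maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;            // assumed; not given in the thread
        int uniqueTerms = 10_000;          // "almost 10000 unique terms"
        long heap = 512L * 1024 * 1024;    // -Xmx512m

        System.out.println("bytes per BitSet: " + bytesPerBitSet(maxDoc));
        System.out.println("bytes for cache:  " + cacheBytes(uniqueTerms, maxDoc));
        System.out.println("BitSets in heap:  " + bitSetsThatFit(heap, maxDoc));
    }
}
```

Under these assumptions the cache alone would need more than the configured heap, so the loop would be expected to fail partway through. If most terms match only a few documents, storing the matching document numbers instead of a full BitSet per term may be considerably cheaper.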
> -----Original Message-----
> From: Ramana Jelda [mailto:ramana.jelda <at> ciao-group.com]
> Sent: Tuesday, May 02, 2006 12:55 PM
> To: java-user <at> lucene.apache.org
> Subject: OutOfMemoryError while enumerating through
> reader.terms(fieldName)
>
> Hi,
> I am getting OutOfMemoryError , while enumerating through
> TermEnum after invoking reader.terms(fieldName).
>
> Just to provide you more information, I have almost 10000
> unique terms in field A. I can successfully enumerate around
> 5000 terms, but later I get an OutOfMemoryError.
>
> I set the JVM max memory to 512MB. Of course, my index is bigger
> than this memory, around 1GB-2GB.
> How can I ask Lucene to clean up the loaded index and traverse
> the remaining terms? It seems that while enumerating, memory
> always grows in steps of some MBs.
>
> Any help would be really appreciated.
>
> Thanks in advance,
> Jelda
creating indexReader object
2006-05-02 13:38:13 GMT
I am trying to create an IndexReader object that reads my index. I
need this to generate the document and term frequency vectors.
However, when I try to print the contents of the documents (doc.get("contents")),
it shows null.
Any suggestions?
If I can't read the contents, then I cannot create the other vectors.
Any help will be appreciated.
cheers,
trupti mulajkar
MSc Advanced Computer Science