Antony Bowesman | 1 Dec 2006 01:10

Re: indexing performance issue

Grant Ingersoll wrote:
> 
> On Nov 30, 2006, at 10:54 AM, spinergywmy wrote:
> 
>>    For my scenario will be every time the users upload the single file, I
>> need to index that particular file. Previously was because the previous
>> version of pdfbox integrate with log4j.jar file and I believe is the
>> log4j.jar cause the indexing performance and takes up a lot of memory
>> resources. However, the latest version of pdfbox doesn't need to 
>> integrate
>> with log4j.jar, and I thought that will actually speed up the indexing
>> performance but the result was no.
>>
> 
> I would isolate PDFBox and do some performance testing on it, then 
> submit your questions on the PDFBox forums, as they will know better 
> about PDFBox performance.

I had a performance problem when I first started with PDFBox, which I raised on
the PDFBox forum:

http://sourceforge.net/forum/message.php?msg_id=3947448

but then I traced it to a logging issue (logging appeared to be on, but no
appenders were configured).  I've not tried the non-log4j version, so I don't
know whether it is free of the problem.  After I disabled the logging correctly,
performance was comparable to the other Java offerings I tried.

Antony

Antony Bowesman | 1 Dec 2006 01:44

Re: indexing performance issue

spinergywmy wrote:
>    I have posted this question before and this time I found that it could be
> pdfbox problem and this pdfbox I downloaded doesn't use the log4j.jar. To
> index the app 2.13mb pdf file took me 17s and total time to upload a file is
> 18s.

Re: PDFBox.

I have a 2.5MB test file that extracts in ~4 seconds.  The CPU is an AMD Athlon
3500+ at 2200MHz.  While converting, it creates a 4.7MB file in the temp
directory (pdfboxnnnnn.tmp).

Another 3.7MB test file takes over 20 seconds to extract and creates a 15MB
temp file.

It seems to depend on the data, but Cygwin's pdftotext also has a similar
performance profile across the two documents.  I would go back to the PDFBox
forum to continue the discussion.

Antony

Van Nguyen | 1 Dec 2006 02:25

any ides on this type of analyzer?

I've been trying to brainstorm on this but could not figure out a way to
go about this.

Let's say I'm searching for "batman".  I want results that include:

batman

bat man

bat-man

etc.

Or, if I search for "screwdriver", I would want results to include:

screwdriver

screw drivers

etc.

I've tried using the SnowballAnalyzer.  I've thought about creating a
"SynonymAnalyzer" as described in the Lucene in Action book (but that
would mean I would have to know all the synonyms for each word I need to
index, which at this point I do not).  Any suggestions on how to go about
this?

Van


Dennis Watson | 1 Dec 2006 02:35

Re: any ides on this type of analyzer?

NGramAnalyzer should do this.  I think there is one in the contribs area or in 
LUA.
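
To illustrate why character n-grams help here, below is a minimal sketch in plain Java. It is not the contrib analyzer itself; the class name NGramSketch and the choice of trigrams are assumptions made for the example. It shows that "batman", "bat man" and "bat-man" share the trigrams "bat" and "man", so a query built from n-grams can match all three spellings.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled n-gram generator, not a Lucene analyzer.
class NGramSketch {
    /** Returns the distinct character n-grams of the lowercased text, in order of first appearance. */
    static Set<String> ngrams(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Returns the n-grams that both inputs have in common. */
    static Set<String> shared(String a, String b, int n) {
        Set<String> common = ngrams(a, n);
        common.retainAll(ngrams(b, n));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(shared("batman", "bat man", 3)); // prints [bat, man]
        System.out.println(shared("batman", "bat-man", 3)); // prints [bat, man]
        System.out.println(shared("screwdriver", "screw drivers", 3));
    }
}
```

The trade-off is a larger index and looser matching, since unrelated words that happen to share n-grams will also match to some degree.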

Dennis

On Thursday 30 November 2006 17:25, Van Nguyen wrote:
> I've been trying to brainstorm on this but could not figure out a way to
> go about this.
>
>
>
> Let's say I'm searching for "batman".  I want results that include:
>
>
>
> batman
>
> bat man
>
> bat-man
>
> etc.
>
>
>
> or if I search screwdriver, I would want results to include:
>
>
>
> screwdriver

Yonik Seeley | 1 Dec 2006 03:06

Re: 2.1-dev memory leak?

On 11/30/06, Michael McCandless <lucene <at> mikemccandless.com> wrote:
> I tested this in 1.9.1, 1.9.2, 2.0.0, and trunk, and all of these
> versions would run out of descriptors.  So I'm at a loss so far on
> where the regression is here ...

Right... no regression, just a Java limitation.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Otis Gospodnetic | 1 Dec 2006 06:16

Re: 2.1-dev memory leak?

Yeah, in this case, I'm running out of memory, and open file descriptors are, I think, just an indicator that
IndexSearchers are not getting closed properly.  I've already increased the open file descriptors
limit, but I'm limited to 2GB of RAM on a 32-bit box.

I'll try explicitly closing searchers now....some number of minutes later - bingo, that takes care of the leak.
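
The pattern of explicitly closing a searcher can be sketched roughly as follows. This is not the application's code: StandInSearcher is a hypothetical stand-in for IndexSearcher so that the example runs without Lucene on the classpath, and the counter only exists to make leaked instances visible.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

class SearcherCloseSketch {
    /** Counts stand-in searchers that are currently open. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a searcher that holds file descriptors until closed. */
    static class StandInSearcher implements Closeable {
        StandInSearcher() { OPEN.incrementAndGet(); }
        int search(String query) { return query.length(); } // placeholder for real search work
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Runs one search and always releases the searcher, even if the search throws. */
    static int searchOnce(String query) {
        StandInSearcher searcher = new StandInSearcher();
        try {
            return searcher.search(query);
        } finally {
            searcher.close(); // do not rely on garbage collection to release resources
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10000; i++) {
            searchOnce("lucene");
        }
        System.out.println("open searchers: " + OPEN.get());
    }
}
```

With tens of thousands of indices per box, the point of the finally block is that each searcher's descriptors and buffers are released deterministically rather than whenever finalization happens to run.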

For the curious, this is what I think happened (the app in question is http://www.simpy.com/ ).  The app was
burning the CPU like crazy (imagine nearly double-digit load).  The IO didn't seem crazy (vmstat/iostat
didn't show much bi/bo).  Still, after some digging around the source code, I found a place where I was just
going through a few thousand Hits on each search against one particular Lucene index.  I fixed that problem
(nothing to do with IndexSearcher opening/closing), and at the same time I upgraded to Lucene 2.1-dev. 
This fix doubled the performance of the whole service.  Instead of handling N requests/second, it was now
handling 2-3*N requests/second.  The CPU is still running pretty hot, but the service is handling more
search traffic.  My guess is that this increase in performance and throughput simply increased the number
of different Lucene indices (there are tens of thousands of them per box) being searched, and exposed an
issue that was either hidden before, or that the JVM was able to handle before, and now it no longer could.  I
wish I had a little better grip on what exactly happened here, but I've got a stable application again.

Thanks for the help!

Otis

----- Original Message ----
From: Yonik Seeley <yonik <at> apache.org>
To: java-user <at> lucene.apache.org
Sent: Thursday, November 30, 2006 6:06:50 PM
Subject: Re: 2.1-dev memory leak?

On 11/30/06, Chris Hostetter <hossman_lucene <at> fucit.org> wrote:
> : IndexSearchers open.  The other ones I "let go" without an explicit

Otis Gospodnetic | 1 Dec 2006 06:31

Re: 2.1-dev memory leak?

Hi Mike,

Thanks for looking into this.  I think your stress test may match my production environment.
I think System.gc() never guarantees that anything will happen; it's just a hint.

I've got the following in one of my classes now.  Maybe you can stick it in your stress test and look at that last
number at the bottom of this snippet.

    // Needs java.lang.management.ManagementFactory and java.lang.management.MemoryMXBean
    private static final MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
...
        LOG.info("Bean: " + bean);
        LOG.info("Heap memory usage: " + bean.getHeapMemoryUsage());
        LOG.info("Non-heap memory usage: " + bean.getNonHeapMemoryUsage());
        LOG.info("Objects pending finalization: " + bean.getObjectPendingFinalizationCount());
        long freeMemBefore = Runtime.getRuntime().freeMemory();
        LOG.info("Running garbage collector via bean...");
        bean.gc();
        long freeMemAfter = Runtime.getRuntime().freeMemory();
        LOG.info("Total Memory      : " + Runtime.getRuntime().totalMemory());
        LOG.info("Max Memory        : " + Runtime.getRuntime().maxMemory());
        LOG.info("Free Memory Before: " + freeMemBefore);
        LOG.info("Free Memory After : " + freeMemAfter);
        // Difference in free memory across the gc() call, not the current free memory
        long memDelta = freeMemAfter - freeMemBefore;
        LOG.info("Free Memory Delta : " + memDelta);
        LOG.info("Heap memory usage: " + bean.getHeapMemoryUsage());
        LOG.info("Non-heap memory usage: " + bean.getNonHeapMemoryUsage());
        LOG.info("Objects pending finalization: " + bean.getObjectPendingFinalizationCount());

Otis


Inderjeet Kalra | 1 Dec 2006 08:17

Full text searching on documents saved in database as BLOB

Hi,

I have a question about full-text searching on documents saved in a
database as BLOBs. In our application, we are planning to save our
documents in the database as BLOBs, and we have a requirement to search
for a document by its metadata and by its content, i.e. to search
within the document.

Will Lucene serve our purpose? I have gone through the Lucene FAQs and
various issues on the site, but I am unable to find any references to
documents saved in a database as BLOBs.

1. How can I search on the metadata of a document saved in the
database as a BLOB?

2. How can I search on the content of the document, i.e. search within
a document saved in the database as a BLOB?

If anyone has implemented the same functionality, could you suggest some
sample code so that I can start using Lucene for this?

There is another search engine, Oracle Text, which does the same thing.
Which one is better?

Thanks & Regards

Inderjeet


Soeren Pekrul | 1 Dec 2006 11:07

Re: any ides on this type of analyzer?

Hello Van,

This looks like a compound-word splitting problem. The topic was discussed in
the thread "Analysis/tokenization of compound words"
(http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).

The main idea is as follows:
You have a corpus (lexicon/dictionary) and a word you want to split. Split
the word letter by letter into two parts and look both parts up in the
corpus. If you find them, you can split the word at that position. You can
do that recursively to find all sub-parts. This method finds the smallest
possible words. [1]
If your corpus contains no compound words, you can get better quality by
searching for the largest possible words: look up the whole word first and
split it only if you cannot find it. Do this recursively as well. [2]

Warning: The results are usually not linguistically correct. However,
the quality should be good enough for searching.
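
A minimal sketch of the second variant (whole word first, then the longest matching prefix, recursively) could look like this. It is only an illustration under simple assumptions: the class name CompoundSplitterSketch is made up, the lexicon is a plain in-memory set of lowercase words, and there is no memoization, so it is not tuned for long inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompoundSplitterSketch {
    private final Set<String> lexicon;

    CompoundSplitterSketch(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    /** Returns the parts of the word, or an empty list if it cannot be fully decomposed. */
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (word.isEmpty()) {
            return parts;
        }
        // Largest possible word first: keep the word whole if the lexicon knows it.
        if (lexicon.contains(word)) {
            parts.add(word);
            return parts;
        }
        // Otherwise try the longest known prefix and split the remainder recursively.
        for (int i = word.length() - 1; i >= 1; i--) {
            String head = word.substring(0, i);
            if (!lexicon.contains(head)) {
                continue;
            }
            List<String> tail = split(word.substring(i));
            if (!tail.isEmpty()) {
                parts.add(head);
                parts.addAll(tail);
                return parts;
            }
        }
        return parts; // empty: no decomposition found
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("screw", "driver", "bat", "man"));
        CompoundSplitterSketch splitter = new CompoundSplitterSketch(lexicon);
        System.out.println(splitter.split("screwdriver")); // prints [screw, driver]
        System.out.println(splitter.split("batman"));      // prints [bat, man]
    }
}
```

Indexing both the original token and its parts would let a search for "batman" also find "bat man", at the cost of some false matches from accidental splits.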

Sören

[1] C. Monz and M. De Rijke: Shallow Morphological Analysis in 
Monolingual Information Retrieval for Dutch, German and Italian, 
Language & Inference Technology, University of Amsterdam.

[2] Pasqualino Imbemba: A Splitter for German Compound Words, Free 
University Of Bolzano, Bolzano, 2006

Van Nguyen wrote:

Nils Höller | 1 Dec 2006 11:31

Incremental Index and Comparing different Scores from different Index

Hi,

I have some questions about the scoring function and about how different
scores can be compared.

I use Lucene for indexing an archive of the web.

/archive/day1/differentsites => indexday1
/archive/day2/differentsites => indexday2
/archive/day3/differentsites => indexday3
/archive/day4/differentsites => indexday4
/archive/day5/differentsites => indexday5

Now I have a continuous query for a certain topic.

Querying indexday1 gives me some Hits with the best having score a
Querying indexday2 gives me some Hits with the best having score b
and so on .....

Now how can I compare those scores, since the regular idf will be
different in the formula for every index? Am I right here?

I've read about incremental idf. Is there a way to normalize
the scores so that scores from different indexes can be compared?
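
To make the concern concrete, here is a small sketch assuming the idf formula of Lucene's DefaultSimilarity, ln(numDocs / (docFreq + 1)) + 1. The document counts are invented for illustration, and the sketch uses double where Lucene returns a float.

```java
class IdfSketch {
    /** idf in the form used by Lucene's DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: the same term in two daily indexes of equal size,
        // rare on day 1 and common on day 2.
        System.out.println("indexday1 idf: " + idf(9, 1000));
        System.out.println("indexday2 idf: " + idf(99, 1000));
    }
}
```

Because the term's idf differs per index, the same document text would score differently on the two days even before score normalization is considered.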

My goal is to plot how the score of a query moves over time.
So when I look at the hits for a certain day, I must be sure that
all the hits come from that day's archive.

I hope you understand my problem and can help me.