Otis Gospodnetic | 1 Apr 2003 01:07

Re: distributed search engine

Certainly doable.
Yes, indices can be merged.  I think I described that in the second
Lucene article on ONJava.com.  The method is
IndexWriter.addIndexes(Directory[]), or something similar.

Otis
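
For readers wondering what merging involves: in Lucene releases of this period the call appears to be IndexWriter.addIndexes(Directory[]), which folds several existing indices into the writer's own index. The sketch below is not Lucene code. It is a toy inverted index in plain Java (the class ToyIndex and its fields are made up for illustration) showing the core step any such merge has to perform: document numbers from each partial index are shifted by the number of documents already merged, so that postings built on different machines do not collide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy index: NOT a Lucene class, illustration only.
final class ToyIndex {
    final int docCount;                         // number of documents in this index
    final Map<String, List<Integer>> postings;  // term -> ascending local doc ids

    ToyIndex(int docCount, Map<String, List<Integer>> postings) {
        this.docCount = docCount;
        this.postings = postings;
    }

    // Merge partial indices in order, renumbering each part's doc ids
    // by the count of documents that came before it.
    static ToyIndex merge(List<ToyIndex> parts) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        int base = 0;
        for (ToyIndex part : parts) {
            for (Map.Entry<String, List<Integer>> e : part.postings.entrySet()) {
                List<Integer> target = merged.get(e.getKey());
                if (target == null) {
                    target = new ArrayList<>();
                    merged.put(e.getKey(), target);
                }
                for (int docId : e.getValue()) {
                    target.add(base + docId);   // shift into the merged id space
                }
            }
            base += part.docCount;
        }
        return new ToyIndex(base, merged);
    }
}
```

With two partial indices of two documents each, merge() yields an index of four documents, and a term that occurred in document 0 of the second part ends up at document 2 of the result.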

--- Michael Wechner <michael.wechner <at> wyona.org> wrote:
> Hi
> 
> A friend of mine pointed me to the idea of building a distributed
> search engine (similar to SETI@home), where the crawling, or for
> instance the real-time indexing of weblogs (http://www.weblogs.com),
> and the searching could be distributed on various machines provided
> by "volunteers".
> 
> To start the thinking very pragmatically: Is it possible to merge 
> different Lucene indices?
> 
> Btw, I have found these projects, which sound very similar, but seem
> to be a bit outdated:
> 
> http://www.hyperbee.com
> 
> http://harvest.sourceforge.net/harvest/doc/index.html
> 
> Thanks

Leo Galambos | 1 Apr 2003 14:25

Re: distributed search engine

On Tue, 1 Apr 2003, Michael Wechner wrote:

> A friend of mine pointed me to the idea of building a distributed search 
> 
> Btw, I have found these projects, which sound very similar, but seem to 
> be a bit outdated:
> 
> http://www.hyperbee.com
> 
> http://harvest.sourceforge.net/harvest/doc/index.html

One of the active projects is http://egothor.sf.net, which is designed for 
P2P or any other sort of distribution (see the Dockyard, Dynamizer, 
Distributor, and Group classes and interfaces). The latest version (in CVS) 
emulates/simulates Harvest (or even Lucene) as one of its possible 
configurations, so I think it is ideal for dIRs.

-g-

A newbie..

Hi Gurus,

I am using Lucene to search my set of HTML documents. I have indexed my folder, and searching returns a
set of HTMLDocument objects. The problem I am facing concerns the content I retrieve from the search.
It gives me the "Url" and the "title" of the document, but when I fetch the "summary" for the document,
it comes back in encoded form. I tried to decode it using the Entities.decode() method, but it returned
nothing. Please let me know where I am going wrong.

Any sort of help would be extremely useful.

Regards
Chandrashekhar
Amit Kapur | 2 Apr 2003 11:43

Behaviour of Lucene during Stress/Scalability Test

Hi everybody,

I am trying to index documents using Lucene, generating about 30 MB of index (optimized), which can be
raised to about 100 MB or more (but that would be on a high-end server machine).

Description of the current case:
#---Each Document has four fields (one Text field and three Keyword fields).
#---The analyzer is based on a StopFilter and a PorterStemFilter.
#---I am using a Compaq PIII, 650 MHz, 128 MB RAM.
#---mergeFactor is set to 25, and I am optimizing the index after about every 20 Documents added.
#---Using Lucene release 1.2.

Problem faced:
After adding about 4000 Documents, generating an index of 30 MB, I initially got an error saying ****
couldn't rename segments.new to segments ****, after which neither an IndexReader nor an IndexWriter
could be opened on the current index.

Then I changed a couple of settings:
#---mergeFactor=20, and optimize was called after every 10 documents.
#---Using Lucene release 1.3.

Problem faced:
After adding about 1500 Documents, generating an index of 10 MB, I initially got an error saying ****
F:\Program Files\OmniDocs Server\ftstest\_3cf.fnm (Too many open files) ****, after which an
IndexWriter could not be opened on the current index.

In practice my requirement is for a much larger index, and I am at the point where these errors are
occurring unpredictably.

I would appreciate it if anyone could guide me on this ASAP.

Amit Kapur | 3 Apr 2003 07:13

Problem while indexing


Hi all,

I am facing the problems mentioned below while indexing. If anyone has any
help to offer, I would be obliged.
**** couldn't rename segments.new to segments ****
**** F:\Program Files\OmniDocs Server\ftstest\_3cf.fnm (Too many open
files) ****

I am trying to index documents using Lucene, generating about 30 MB of index
(optimized), which can be raised to about 100 MB or more (but that would be
on a high-end server machine).

Description of the current case:
#---Each Document has four fields (one Text field and three Keyword
fields).
#---The analyzer is based on a StopFilter and a PorterStemFilter.
#---I am using a Compaq PIII, 650 MHz, 128 MB RAM.
#---mergeFactor is set to 25, and I am optimizing the index after about
every 20 Documents added.
#---Using Lucene release 1.2.

Problem faced:
After adding about 4000 Documents, generating an index of 30 MB, I initially
got an error saying **** couldn't rename segments.new to segments ****,
after which neither an IndexReader nor an IndexWriter could be opened on the
current index.

Then I changed a couple of settings:
#---mergeFactor=20, and optimize was called after every 10 documents.


Searching using lucene!

Hi Gurus,

I have indexed the top-level folder in which the search is to happen, and I have a question: how can I
search these indexes so that my search is limited to a subfolder inside the indexed top-level
folder? Is there any API for this?

Any help would be extremely useful.

Regards
Chandrashekhar 
Tatu Saloranta | 3 Apr 2003 17:00

Re: Searching using lucene!

On Wednesday 02 April 2003 23:45, Gupta, Chandrashekhar (CAP, ELCOE) wrote:
> Hi Gurus,
>
> I have indexed the top-level folder in which the search is to happen, and I
> have a question: how can I search these indexes so that my search is limited
> to a subfolder inside the indexed top-level folder? Is there any API for
> this?

You may need to read all the documentation to fully understand how Lucene 
works. It has no concept of folders or even files. It just indexes a set of 
documents with one or more fields each, and allows searching based on the 
contents of one or more of those fields. So, in a way, the answer is no: 
there is no such API. However, implementing what you want is fairly easy 
with Lucene.

In your case you need to recursively index the contents of all the files, and 
add one or more fields recording the file hierarchy used to reach each file.
One possibility is to store the filename (with its path) in one untokenized 
field, and then use a prefix query (in addition to the actual query you want) 
to limit results to just the documents that match files in the specified 
directory or directories. The clause that matches the filename/directory 
should be made required (not optional) so that irrelevant results are 
filtered out completely.
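
A minimal sketch of the filtering logic described above, written in plain Java rather than against Lucene's API. In Lucene itself the two clauses would presumably be a PrefixQuery on the path field and the user's own query, combined in a BooleanQuery with both marked as required. The names FolderScopedSearch and Doc are hypothetical, and the substring match merely stands in for a real content query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; these are not Lucene classes.
final class FolderScopedSearch {

    // Stand-in for an indexed document: the path is kept verbatim
    // (like an untokenized field), the contents are what gets searched.
    static final class Doc {
        final String path;
        final String contents;

        Doc(String path, String contents) {
            this.path = path;
            this.contents = contents;
        }
    }

    // Returns the paths of documents that satisfy BOTH clauses:
    // the path starts with folderPrefix AND the contents contain term.
    static List<String> search(List<Doc> docs, String folderPrefix, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean inFolder = d.path.startsWith(folderPrefix);  // required clause
            boolean matches = d.contents.toLowerCase(Locale.ROOT).contains(needle);
            if (inFolder && matches) {
                hits.add(d.path);
            }
        }
        return hits;
    }
}
```

One detail worth noting: the prefix should end with the path separator ("site/a/" rather than "site/a"), otherwise a sibling folder such as "site/ab" would match as well.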

Hope this helps,

-+ Tatu +-

PS. I think your question really belongs on the users list, not the developers list.
Shah, Vineel | 3 Apr 2003 22:04

RE: Lucene stress Testing (Searchable Index of 40 MB)

My system:
My search code runs in a JSP hosted by Tomcat, which calls Lucene as an added library. I'm currently using a
development build somewhere between 1.2 and 1.3 RC1. My index is 200 MB and ~270,000 records. The index is
stored in a disk directory and periodically read into RAM.

1. To get it to run at all, I had to change my Java runtime options to give the VM more RAM. I'm using -Xmx512m
-Xms512m, which sets both the minimum and maximum heap size to 512 megabytes when Tomcat invokes the JVM.

2. Incremental indexing, meaning adding and deleting candidates, works fine in RAM.

3. Optimizing the index after updating doesn't work. When my dataset was small (30 MB) it was fine, but when I
moved to the 200 MB set, it choked. I could probably raise the RAM allotment even more to make it work, but I
have limits on that machine.

4. I have a separate command-line process that updates the disk index. I run this with a large RAM allotment
as well, and it manages the optimization just fine. Every 30 updates, I reload the index into RAM.

Incidentally, I was getting 0.1 searches/sec on disk with 10 concurrent users, with no delay between users.
When I searched on a RAMDirectory instead, I started getting 8-9 searches/sec.
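
The "reload into RAM every 30 updates" arrangement can be sketched roughly as follows. This is an assumption about the shape of such a setup, not the poster's actual code: ReloadingSnapshot is a hypothetical helper, and the loader is left abstract. In a Lucene application the loader would presumably copy the disk index into a RAMDirectory and open a searcher on it. Searches read the current snapshot without locking, and the snapshot is swapped only after the configured number of updates.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: holds an in-memory snapshot and reloads it
// after every N recorded updates to the underlying (on-disk) data.
final class ReloadingSnapshot<T> {
    private final Supplier<T> loader;         // builds a fresh in-memory copy
    private final int updatesPerReload;       // e.g. 30 in the setup above
    private final AtomicReference<T> current; // what searches read from
    private int pendingUpdates = 0;           // guarded by 'this'

    ReloadingSnapshot(Supplier<T> loader, int updatesPerReload) {
        this.loader = loader;
        this.updatesPerReload = updatesPerReload;
        this.current = new AtomicReference<>(loader.get());
    }

    // Lock-free read used by searching threads.
    T get() {
        return current.get();
    }

    // Call once after each update to the disk copy.
    // Returns true when this call triggered a reload.
    synchronized boolean recordUpdate() {
        pendingUpdates++;
        if (pendingUpdates < updatesPerReload) {
            return false;
        }
        current.set(loader.get());  // swap in the fresh snapshot
        pendingUpdates = 0;
        return true;
    }
}
```

The trade-off is staleness: searches may miss up to updatesPerReload - 1 recent updates, in exchange for serving every query from memory.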

I hope this helps.

Vineel Shah

-----Original Message-----
From: Amit Kapur [mailto:amitkapur <at> newgen.co.in]
Sent: Monday, March 31, 2003 6:21 AM
To: lucene-dev <at> jakarta.apache.org
Subject: Lucene stress Testing (Searchable Index of 40 MB)


Lixin Meng | 4 Apr 2003 02:56

Highlight Search Result

While looking for a solution that can highlight the query terms in the
search results, I came across the following one:

http://www.iq-computing.de/lucene/highlight.jsp

It sounds like a good solution to me. However, to make it work, one needs to
modify the Lucene source code (such as changing some private declarations to
public). I guess you guys already know about it. I just wonder whether there
is any plan (or any procedure) to incorporate the suggestions into the Lucene
code base?

If the answer is no, does anybody know of another solution for highlighting
that doesn't require code changes?

I am hesitant to fork from the Lucene mainstream, since I would have to
patch it every time Lucene has a new release. After all, I just want
to use it.

Regards,
Lixin
none none | 4 Apr 2003 03:42

Re: Highlight Search Result

Hi,
There are some plans (I guess), and there are a couple of proposals already on the mailing list.
Search the dev mailing list for "term collector"; you'll find a couple of zips.
The version you are talking about (www.iq-computing.de) doesn't work with release 1.3.
Ciao

--

On Thu, 3 Apr 2003 16:56:14, Lixin Meng wrote:
>While looking for a solution that can highlight the query terms in the
>search results, I came across the following one:
>
>http://www.iq-computing.de/lucene/highlight.jsp
>
>It sounds like a good solution to me. However, to make it work, one needs to
>modify the Lucene source code (such as changing some private declarations to
>public). I guess you guys already know about it. I just wonder whether there
>is any plan (or any procedure) to incorporate the suggestions into the Lucene
>code base?
>
>If the answer is no, does anybody know of another solution for highlighting
>that doesn't require code changes?
>
>I am hesitant to fork from the Lucene mainstream, since I would have to
>patch it every time Lucene has a new release. After all, I just want
>to use it.
>
>Regards,
>Lixin

