Otis Gospodnetic | 2 Jan 10:24 2007

Re: Clustering Lucene with 40 Servers

Yes, if you think about how blogs and Technorati work (new blog post -> ping -> ping server -> Technorati ->
index -> searchable blog post), Adam is correct.  Since Doug's implementation we (Technorati) have
changed our clusters a LOT (think major rewrites).  I can't talk about the details, obviously, but the
first paragraph in Mark's email sounds like what we have.

Biggy will want to provide more details about how they want things to work (e.g. how quickly do changes need
to be visible to searchers?  Is this a write + read system, or are there deletes? ...) before others can make
useful suggestions.

Otis

----- Original Message ----
From: Adam Fleming <aflem26 <at> hotmail.com>
To: java-user <at> lucene.apache.org
Sent: Thursday, December 28, 2006 2:33:37 AM
Subject: RE: Clustering Lucene with 40 Servers

Hello, 

I saw that Doug Cutting had an interesting solution for his Technorati website: 
http://www.mail-archive.com/lucene-user <at> jakarta.apache.org/msg12709.html

It sounds like a single-writer, many-readers type of system, but one that is quite robust and efficient.
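(For reference, the single-writer, many-readers split at its simplest looks something like the sketch below. It is only an illustration against the Lucene 2.0 API: the index path, field names and refresh policy are invented, and distributing the index copies to the search boxes is a separate problem.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

// The one process that is allowed to open an IndexWriter.
public class WriterNode {
    public static void indexPost(String indexPath, String url, String body) throws Exception {
        // false = append to an existing index
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", body, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close(); // changes become visible to readers opened after this
    }
}

// Any number of read-only processes; they pick up changes by re-opening.
class SearchNode {
    public static IndexSearcher refresh(IndexSearcher old, String indexPath) throws Exception {
        if (old != null) {
            old.close();
        }
        return new IndexSearcher(indexPath);
    }
}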

Cheers, 

Adam 

----------------------------------------
> Date: Wed, 27 Dec 2006 10:46:47 -0800
(Continue reading)

Antony Bowesman | 2 Jan 10:29 2007

Re: IOException - The handle is invalid

Hi Mike,

>> I saw Mike McCandless's JIRA issue
>>
>> http://issues.apache.org/jira/browse/LUCENE-669
>>
>> Is the patch referenced there useful for a 2.0 system?  I would like 
>> to use the lockless commit stuff, but am waiting until I get the core 
>> system working well.
>>
>> I am also getting IOException in some of my classes, but from the JIRA 
>> comments, it seems that Lucene may be the culprit.
> 
> This does sound very much like LUCENE-669 (and that bug is indeed
> present in Lucene 2.0).  That patch was fairly simple; it may apply
> cleanly or require only small fixes.  I would recommend trying it &
> seeing if it resolves your IOExceptions.

I applied the patch, re-ran the tests, and the IOExceptions have gone away, so this 
fix looks good, though with Christmas and New Year between the previous test and 
these ones, I'm not 100% sure I had the same setup...

I'm glad you fixed this one already ;) !

Thanks
Antony
Michael McCandless | 2 Jan 18:54 2007

Re: IOException - The handle is invalid

Antony Bowesman wrote:
> Hi Mike,
> 
>>> I saw Mike McCandless's JIRA issue
>>>
>>> http://issues.apache.org/jira/browse/LUCENE-669
>>>
>>> Is the patch referenced there useful for a 2.0 system?  I would like 
>>> to use the lockless commit stuff, but am waiting until I get the core 
>>> system working well.
>>>
>>> I am also getting IOException in some of my classes, but from the 
>>> JIRA comments, it seems that Lucene may be the culprit.
>>
>> This does sound very much like LUCENE-669 (and that bug is indeed
>> present in Lucene 2.0).  That patch was fairly simple; it may apply
>> cleanly or require only small fixes.  I would recommend trying it &
>> seeing if it resolves your IOExceptions.
> 
> I applied the patch, re-ran the tests, and the IOExceptions have gone away, 
> so this fix looks good, though with Christmas and New Year between the 
> previous test and these ones, I'm not 100% sure I had the same setup...
> 
> I'm glad you fixed this one already ;) !

Awesome, I'm glad to hear that!  Let's tentatively hope that indeed
you were hitting that bug and now it's resolved :)

Mike
(Continue reading)

sdeck | 2 Jan 23:32 2007

Speed of grouped queries


Thanks in advance for any insight on this one.

I have a fairly large query to run, and it takes roughly 20-40 seconds to
complete the way that I have it.
Here is the best example I can give.

I have a set of roughly 25K documents indexed

I have queries that get documents matching a particular actor.

Then I have a movie query that takes all of the documents found for each
actor query and combines them to say: here are all the documents that are
relevant for this movie.

Then, and here is the time hog, I have a genre query that says: take all
movies, get their results, and combine them into this genre result set.

The problem is that, at indexing time, I do not have a way to say whether a
document belongs to a particular genre, actor, movie, etc.  If, for the genre
query, I try to get all documents and then filter them against the movie and
actor queries, I get heap space memory issues.

The query for collecting a specific actor takes around 200-300 milliseconds,
and the movie one, which actually runs each actor query, takes roughly 500-700
milliseconds. Yet for a genre, where you may have 50-100 movies, it takes
500 milliseconds * the number of movies.
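(For reference, the nested composition described above could be expressed as a single BooleanQuery that is built once per genre and executed once, rather than once per movie. The sketch below is only an illustration: the "actor" field and the actor/movie grouping are assumptions, not taken from the index described here.)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class GroupedQueries {

    // One clause per actor (the "actor" field name is hypothetical).
    static Query actorQuery(String actor) {
        return new TermQuery(new Term("actor", actor));
    }

    // A movie OR-combines its actor queries.
    static Query movieQuery(String[] actors) {
        BooleanQuery movie = new BooleanQuery();
        for (int i = 0; i < actors.length; i++) {
            movie.add(actorQuery(actors[i]), BooleanClause.Occur.SHOULD);
        }
        return movie;
    }

    // A genre OR-combines its movie queries into one query object, so the
    // searcher walks the index once instead of once per movie.  With many
    // movies/actors the default 1024-clause limit may need raising via
    // BooleanQuery.setMaxClauseCount().
    static Query genreQuery(String[][] actorsByMovie) {
        BooleanQuery genre = new BooleanQuery();
        for (int i = 0; i < actorsByMovie.length; i++) {
            genre.add(movieQuery(actorsByMovie[i]), BooleanClause.Occur.SHOULD);
        }
        return genre;
    }
}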

Any ideas on how I could run these queries differently? For a given actor
(Continue reading)

escher2k | 3 Jan 03:32 2007

Customize scoring for additive effect...


I am trying to build a scoring function which is additive across the multiple
fields that are searched. For instance, if a user searches for "Web PHP", I
want the search to happen over fld1 and fld2 and then compute the score as

   score = similarity score(fld1) + similarity score(fld2) + <some system constant>

I think I have figured out how to customize the similarity for the fields, but
I still don't know how to get this additive effect. I didn't see any direct
way to plug in a custom scorer.
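(For reference, a minimal sketch of one way to approximate that formula with stock Lucene 2.0: a BooleanQuery with one SHOULD clause per field sums the scores of the matching clauses, and overriding Similarity.coord() to return 1.0 keeps the sum plain. The index path is a placeholder and the <some system constant> term is left out.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class AdditiveFieldScoring {

    // One parsed clause per field; every SHOULD clause that matches a
    // document contributes its score to that document's total.
    static Query buildQuery(String userQuery) throws Exception {
        BooleanQuery combined = new BooleanQuery();
        combined.add(new QueryParser("fld1", new StandardAnalyzer()).parse(userQuery),
                     BooleanClause.Occur.SHOULD);
        combined.add(new QueryParser("fld2", new StandardAnalyzer()).parse(userQuery),
                     BooleanClause.Occur.SHOULD);
        return combined;
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        // With coord() forced to 1.0 the BooleanQuery score is a plain sum
        // of the per-field clause scores.
        searcher.setSimilarity(new DefaultSimilarity() {
            public float coord(int overlap, int maxOverlap) {
                return 1.0f;
            }
        });
        searcher.search(buildQuery("Web PHP")); // returns Hits in Lucene 2.0
    }
}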

Thanks.
--

-- 
View this message in context: http://www.nabble.com/Customize-scoring-for-additive-effect...-tf2911491.html#a8134948
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Peter W. | 3 Jan 05:58 2007

Re: Clustering Lucene with 40 Servers

Hello,

I don't have any of the scalability requirements mentioned in this thread,
but the problem is an interesting one. Lucene needs a connection pool
equivalent IMHO, or a best-practices method for load balancing.

Opening, locking, reading and writing to remote indexes over RMI seems good
on paper but is likely to melt under anything approaching the kind of web
traffic seen by a popular site. This is why you see people running (so
many) JVMs locally. Solr helps, but passing long XML or JSON URLs for
thousands or millions of requests between your own machines to maintain a
Lucene index looks redundant to me.

Adding messaging layers to propagate changes or updates introduces more
points of failure.

I wonder if a system would work where just a few machines capture, say,
100k updates at a time in memory and then write .gz files to locally
attached external drives. These separated data files would be exposed
through a web service where load-balanced remote boxes access them using
servlets.

They would connect in rotation, downloading batched index updates. Heck,
start splitting up big files using Hadoop's HDFS and make it a party!
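(A rough sketch of the pull side of that idea, nothing more: each search box fetches the latest batch file from the writer's web service and drops it locally for later unpacking. The URL and paths are invented.)

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class BatchPuller {

    // Download one batched .gz index update to the local disk.
    static void pull(String batchUrl, String localFile) throws Exception {
        InputStream in = new URL(batchUrl).openStream();
        OutputStream out = new FileOutputStream(localFile);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();
    }

    public static void main(String[] args) throws Exception {
        pull("http://writer-01/batches/latest.gz", "/data/incoming/latest.gz");
    }
}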

Regards,
(Continue reading)

Laurent Hoss | 3 Jan 11:57 2007

Re: New Lucene QueryParser

Hi Mark

As said in a previous mail, I'm very interested in your parser, and I'm 
happy to hear you made progress and implemented paragraph/sentence 
proximity search functionality. :)
This is the killer feature for me!
If the execution of the resulting query (a mix containing SpanQuery 
objects) is not (much) slower than using BooleanQuery/PhraseQuery combos, 
it would allow me to forget our current "1 lucene-doc per paragraph" 
indexing model.
The other features are very cool too, like the DateParsing, which I 
strongly miss in the standard QueryParser!

So let me know when you have a version ready to be tested, and how I 
can help.

-Laurent

PS: Some other notes
> Query-time thesaurus expansion / General token to query expansion : 
> Takes advantage of a general find/replace feature, "expand" might map 
> to "(expander | expanded)" ... or any other valid syntax. 
I could also use this, if it can also do the following.
Right now I have a little utility class which expands special strings 
(syntax still to be discussed) into all combinations:
"fest[,e] hypothek[,en,a]"
-> fest hypothek; fest hypotheken; fest hypotheka; feste hypothek; feste 
hypotheken; feste hypotheka
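(A hypothetical version of such a utility, matching the example strings above; the bracket-syntax handling and the class name are made up for illustration.)

import java.util.ArrayList;
import java.util.List;

public class SuffixExpander {

    // Expands e.g. "fest[,e] hypothek[,en,a]" into every suffix combination.
    static List expand(String pattern) {
        List results = new ArrayList();
        results.add("");
        String[] tokens = pattern.split("\\s+");
        for (int t = 0; t < tokens.length; t++) {
            String token = tokens[t];
            String stem = token;
            String[] suffixes = new String[] { "" };
            int open = token.indexOf('[');
            if (open >= 0 && token.endsWith("]")) {
                stem = token.substring(0, open);
                suffixes = token.substring(open + 1, token.length() - 1).split(",", -1);
            }
            List next = new ArrayList();
            for (int i = 0; i < results.size(); i++) {
                String prefix = (String) results.get(i);
                for (int s = 0; s < suffixes.length; s++) {
                    next.add(prefix.length() == 0 ? stem + suffixes[s]
                                                  : prefix + " " + stem + suffixes[s]);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // -> [fest hypothek, fest hypotheken, fest hypotheka,
        //     feste hypothek, feste hypotheken, feste hypotheka]
        System.out.println(expand("fest[,e] hypothek[,en,a]"));
    }
}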

> Note that there may be some limitations...but so far this has proved 
(Continue reading)

Joost Schouten | 3 Jan 12:42 2007

Field.TermVector usage

Hi,

I've just started with the implementation of Lucene in my Shale-Hibernate
application. From the demo I understand most of the Field constructor:

Field(String name, String value, Field.Store store, Field.Index index,
Field.TermVector termVector)

The exception is what the Field.TermVector termVector argument does and how
it is used. Can anyone give me a Lucene-newbie description of why and how I
should use this argument?
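(In case it helps, a minimal sketch of that constructor in use; the field names and values are made up. A term vector stores the document's own list of terms, optionally with positions and offsets, and is only worth the space if something reads it back later, e.g. via IndexReader.getTermFreqVector; Field.TermVector.NO skips it.)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TermVectorExample {

    static Document build(String title, String body) {
        Document doc = new Document();
        // No term vectors: only the inverted index entries are kept.
        doc.add(new Field("title", title,
                          Field.Store.YES, Field.Index.TOKENIZED,
                          Field.TermVector.NO));
        // Full term vectors with positions and offsets: the document's own
        // term list is stored too, readable later per document and field.
        doc.add(new Field("body", body,
                          Field.Store.NO, Field.Index.TOKENIZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        return doc;
    }
}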

Thank you,
Joost Schouten
Director
 
JS Portal
Dasstraat 21
2623CB Delft
the Netherlands
P: +31 6 160 160 14
E: joost <at> jsportal.com
W: www.jsportal.com 
Mark Miller | 3 Jan 14:00 2007

Re: New Lucene QueryParser

Hey Laurent,

I am actually pretty much ready for a beta/preview release right about 
now. All of the features are in and I am pretty happy with most of the 
work. Over the past month I have been squashing bugs and could certainly 
use as much help as I can get making sure this thing is as perfect as it 
can be. I am currently in the middle of migrating to a new laptop, so I 
may take a couple days to get a distribution jar together with some 
simple documentation, but I plan on doing that as soon as I get a chance.

>
>> Query-time thesaurus expansion / General token to query expansion : 
>> Takes advantage of a general find/replace feature, "expand" might map 
>> to "(expander | expanded)" ... or any other valid syntax. 
> I could also use this, if it can also do the following.
> Right now I have a little utility class which expands special strings 
> (syntax still to be discussed) into all combinations:
> "fest[,e] hypothek[,en,a]"
> -> fest hypothek; fest hypotheken; fest hypotheka; feste hypothek; feste 
> hypotheken; feste hypotheka
>
I require a similar feature, although in the form mark{s es ing} -> 
marks markes marking. Unfortunately, the way I have done it (in the 
JavaCC grammar) is not easily configurable.

>> Note that there may be some limitations...but so far this has proved 
>> to be pretty powerful
> Would still be good to know the limitations you see right now...
>
I mentioned there might be limitations because I kept running into new 
(Continue reading)

Grant Ingersoll | 3 Jan 14:16 2007

Re: Customize scoring for additive effect...

This _may_ help: http://lucene.apache.org/java/docs/scoring.html

It has links into the javadocs for creating Custom Query/Scorers, etc.

-Grant

On Jan 2, 2007, at 9:32 PM, escher2k wrote:

>
> I am trying to build a scoring function which is additive across the
> multiple fields that are searched. For instance, if a user searches for
> "Web PHP", I want the search to happen over fld1 and fld2 and then compute
> the score as
>
>    score = similarity score(fld1) + similarity score(fld2) + <some system constant>
>
> I think I have figured out how to customize the similarity for the fields,
> but I still don't know how to get this additive effect. I didn't see any
> direct way to plug in a custom scorer.
>
> Thanks.
> -- 
> View this message in context: http://www.nabble.com/Customize- 
(Continue reading)

