Daniel Taurat | 1 Nov 2004 16:29

jaspq: dashed numerical values tokenized differently

Hi,
I have just another stupid parser question:
In Lucene 1.4 RC3, at least, StandardAnalyzer seems to handle the dash
sign "-" differently from Lucene 1.2.

Examples (1.4RC3):

A document containing the string "dash-test" is matched by the following
search expressions:
dash
test
dash*
dash-test
It is _not_ matched by the following search expressions:
dash-*
dash-t*

If the string after the dash consists of digits, the behavior is
different.
E.g., a document containing the string "dash-123" is matched by:
dash*
dash-*
dash-123
It is not matched by:
dash
123

Question:
Is this, esp. the different behavior when parsing digits and characters,

Willy De Waele | 1 Nov 2004 17:55

RE: Search.jhtml ? (ok!)


	The lucene-demo is now working fine outside the tomcat/webapps dir.
	I created a folder, let's say 'c:/Test', and copied the lucene-doxs to
	that folder.
	Afterwards I deployed the war file to this folder.
	I created an index in 'c:/test/index' and modified configuration.jsp.
	Finally, I added an application to Tomcat (lucene.xml in the conf
	folder).
	And ... Lucene is working fine!

	Thanks a lot for your help
	Regards
	Willy

-----Original Message-----
From: Erik Hatcher [mailto:erik <at> ehatchersolutions.com] 
Sent: Sunday, 31 October 2004 09:55
To: Lucene User
Subject: Fwd: Search.jhtml ?

Begin forwarded message:

> From: Erik Hatcher <erik <at> ehatchersolutions.com>
> Date: October 31, 2004 3:53:39 AM EST
> To: <willydw <at> pandora.be>
> Subject: Re: Search.jhtml ?
>
> On Oct 30, 2004, at 9:07 AM, Willy De Waele wrote:
>> 	Indeed, when I follow the procedure it is working fine ...

Jeff Munson | 1 Nov 2004 21:02

Search speed

I'm looking for tips on speeding up searches since I am a relatively new
user of Lucene.  

I've created a single index with 4.5 million documents.  The index has
about 22 fields and one of those fields is the contents of the body tag
which can range from 5K to 35K.  When I create the field (named
"contents") that houses the contents of the body tag, the field is
stored, indexed, and tokenized.  The term position vectors are not
stored.  

Single word searches return pretty fast, but when I try phrases,
searching seems to slow considerably.  When constructing the query I am
using the standard query object where analyzer is the StandardAnalyzer:

Code Example:
Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer);

For example, the query contents:Zanesville returns over
163,000 hits in 78 milliseconds.

However, the query contents:"all parts including picture tube
guaranteed" returns hits in 2890 milliseconds.  Other phrases take
longer as well.

My question is, are there any indexing tips (storing term vectors?) or
query tips that I can use to speed up the searching of phrases?

Thanks in advance for any tips.

Jackson Earnst | 2 Nov 2004 04:10

commit lock, graceful handler

I'm testing fault tolerance aspects of an application using Lucene.
Consider the case where power is pulled from the server/workstation and it
immediately shuts down hard or crashes.

I'm faced with a commit.lock file left behind in the temp
directory.  Lucene throws an exception when a writer is first
created against this index: an IOException with the message "Lock
obtain timed out".

Reading some docs and FAQs, I see that this file can be deleted and the
index will be in a usable state.

Any advice/comments/thoughts?

Is there a graceful way to handle this?  

Thanks

sergiu gordea | 2 Nov 2004 07:56

Re: jaspq: dashed numerical values tokenized differently

Daniel Taurat wrote:

>Hi,
>I have just another stupid parser question:
>There seems to be a special handling of the dash sign "-" different from
>Lucene 1.2 at least in Lucene 1.4.RC3
>StandardAnalyzer.
>  
>
From the behaviour you describe, I think that the dash sign is removed
from the text by the analyzer.
This is quite correct, because a dash is used to separate two words.
If it were not removed, you would not be able to
get "dash-test" in the results when you search for dash and/or test.

I suggest you use Luke (see the contributors page) to see what
exactly you have in the index; then you will understand
why the search works like that.
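
A rough illustration of what the index may contain in the two cases from the original post. This is a toy model, not Lucene code: the class DashTokenSketch and its rule (split on the dash unless one of the parts contains a digit) are assumptions that merely reproduce the behaviour reported for StandardAnalyzer in 1.4 RC3. It also assumes that prefix queries such as dash-* are compared with the indexed terms as typed, without passing through the analyzer.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the tokenization reported in this thread; NOT Lucene's real grammar. */
final class DashTokenSketch {

    /** Splits a word on '-', but keeps it whole if any part contains a digit (assumed rule). */
    static List<String> tokenize(String word) {
        String lower = word.toLowerCase();
        String[] parts = lower.split("-");
        for (String part : parts) {
            if (part.chars().anyMatch(Character::isDigit)) {
                return List.of(lower);           // "dash-123" stays a single term
            }
        }
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);                // "dash-test" becomes "dash" and "test"
            }
        }
        return tokens;
    }

    /** A prefix query is matched against the indexed terms exactly as typed. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        return indexedTerms.stream().anyMatch(term -> term.startsWith(prefix));
    }

    public static void main(String[] args) {
        for (String text : new String[] {"dash-test", "dash-123"}) {
            List<String> terms = tokenize(text);
            System.out.println(text + " -> " + terms
                    + ", dash* matches: " + prefixMatches(terms, "dash")
                    + ", dash-* matches: " + prefixMatches(terms, "dash-"));
        }
    }
}
```

Under these assumptions "dash-test" is indexed as the two terms dash and test, so dash-* finds nothing, while "dash-123" is indexed as one term, so neither dash nor 123 alone finds it. Luke will show whether the real index looks like this.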

 Sergiu

>Examples (1.4RC3):
>
>A document containing the string "dash-test" is matched by the following
>search expressions:
>dash
>test
>dash*
>dash-test
>It is _not_ matched by the following search expressions:

Morus Walter | 2 Nov 2004 08:17

Re: Locks and Readers and Writers

yahootintin.1247688 <at> bloglines.com writes:
> Hi Christoph,
> 
> That's what I thought.  But what I'm seeing is this:
> - open reader for searching
>   (the reader is opening an index on a remote machine
>   (via UNC) which takes a couple seconds)
> - meanwhile the other service opens an IndexWriter and adds a document
>   (the index writer determines that it needs to merge so it tries to
>   get a lock.  since the reader is still opening, the IO exception is
>   thrown)
>
> I believe that increasing the merge factor will reduce the opportunity
> for this to occur.  But it will still occur at some point.
>
I'm not sure what you mean by `opening an index on a remote machine (via
UNC)', but have you made sure that the lock files are put in the same directory
for both processes (see the mailing list archive for details)?
Also note that Lucene's locking is known not to work on NFS (also see the
list archive). I don't know if it works on SMB mounts.

Morus

Paul Elschot | 2 Nov 2004 09:04

Re: Search speed

On Monday 01 November 2004 21:02, Jeff Munson wrote:
> I'm looking for tips on speeding up searches since I am a relatively new
> user of Lucene.  
> 
> I've created a single index with 4.5 million documents.  The index has
> about 22 fields and one of those fields is the contents of the body tag
> which can range from 5K to 35K.  When I create the field (named
> "contents") that houses the contents of the body tag, the field is
> stored, indexed, and tokenized.  The term position vectors are not
> stored.  
> 
> Single word searches return pretty fast, but when I try phrases,
> searching seems to slow considerably.  When constructing the query I am
> using the standard query object where analyzer is the StandardAnalyzer:
> 
> Code Example:
> Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer);
> 
> For example, the query contents:Zanesville returns over
> 163,000 hits in 78 milliseconds.
>
> However, the query contents:"all parts including picture tube
> guaranteed" returns hits in 2890 milliseconds.  Other phrases take
> longer as well.
> 
> My question is, are there any indexing tips (storing term vectors?) or
> query tips that I can use to speed up the searching of phrases?

Term vectors should not influence search times for phrases.
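
To illustrate why phrases cost more than single terms, here is a toy positional index. It is a sketch of the general idea only, not Lucene's implementation: PhraseSketch and its methods are hypothetical names. A single-term search is one lookup, while a phrase search has to check term positions in every candidate document. Those positions live in the postings of the inverted index, which is a separate structure from stored term vectors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index; illustrates the idea only, not Lucene's implementation. */
final class PhraseSketch {

    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Indexes a document by splitting on whitespace and recording each word's position. */
    void add(int docId, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Single-term search: one lookup, no position data is examined. */
    List<Integer> term(String term) {
        return new ArrayList<>(postings.getOrDefault(term, Map.of()).keySet());
    }

    /** Phrase search: each candidate document needs a position check for every term. */
    List<Integer> phrase(String... terms) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> first = postings.getOrDefault(terms[0], Map.of());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            for (int start : entry.getValue()) {
                if (followsAt(entry.getKey(), start, terms)) {
                    hits.add(entry.getKey());
                    break;                       // one occurrence per document is enough
                }
            }
        }
        return hits;
    }

    /** True if terms[1..] occur at consecutive positions after 'start' in the document. */
    private boolean followsAt(int docId, int start, String[] terms) {
        for (int i = 1; i < terms.length; i++) {
            List<Integer> positions = postings.getOrDefault(terms[i], Map.of()).get(docId);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

The work in phrase() grows with the number of occurrences of the phrase's terms, so a phrase made of very common words ("all", "parts") is the expensive case.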


Morus Walter | 2 Nov 2004 09:21

Re: jaspq: dashed numerical values tokenized differently

Daniel Taurat writes:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.
> 
> Examples (1.4RC3):
> 
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
> 
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
> 
> Question:

Nader Henein | 2 Nov 2004 10:58

Re: commit lock, graceful handler

Graceful, no. I started a discussion on this about two years ago. What
I'm doing is batched indexing, so if a crash occurs, the next time the
application starts up I have a LuceneInit class that goes and ensures
that all indices have no locks on them by simply deleting the lock file
and optimizing the index. This has worked well for us for the past two
years in a production environment. The next indexing run will pick up
the same batch and re-index it, which doesn't hurt the index because
every time I add a document to the index, I actually delete it first to
ensure that there are no repetitions. We've never had an index go
corrupt on us, but we do have six indices being updated in parallel, in
addition to nightly backups by our hosting facility during a one-hour
window where we do no updates/deletes on the index to ensure that the
backup is kosher.

It may not be as graceful as Oracle rollback tables, but it's functional and
a lot less complicated.
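
A minimal sketch of such a startup cleanup, using only the JDK file API. It is not the actual LuceneInit class: StaleLockCleaner is a hypothetical name, and the glob "lucene-*.lock" is an assumption about how the lock files are named, so check the names in your own lock directory first. It must only run when no other process can be using the index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical startup cleanup in the spirit of the LuceneInit class described above. */
final class StaleLockCleaner {

    /**
     * Deletes leftover lock files from lockDir and returns the names that were removed.
     * The glob "lucene-*.lock" is an assumed naming pattern, not a documented guarantee.
     */
    static List<String> removeStaleLocks(Path lockDir) throws IOException {
        List<String> removed = new ArrayList<>();
        try (DirectoryStream<Path> locks = Files.newDirectoryStream(lockDir, "lucene-*.lock")) {
            for (Path lock : locks) {
                if (Files.deleteIfExists(lock)) {
                    removed.add(lock.getFileName().toString());
                }
            }
        }
        return removed;
    }
}
```

Since the lock file in the original report sat in the temp directory, a call at application start could pass Paths.get(System.getProperty("java.io.tmpdir")) and log the returned names before the first writer is opened.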

Nader

Jackson Earnst wrote:

>I'm testing fault tolerance aspects of an application using Lucene.
>Consider the case where power is pulled from the server/workstation and it
>immediately shuts down hard or crashes.
>
>I'm faced with a commit.lock file left behind in the temp
>directory.  Lucene throws an exception when a writer is first
>created against this index: an IOException with the message "Lock
>obtain timed out".
>

Marcel Hofmann | 2 Nov 2004 16:18

Content-based similarity search in vector-space for Lucene

Hello!

For my diploma thesis (available in German), I have written a similarity
search that, for a given document (the query), returns documents whose
content is similar to the query document to varying degrees. With this
functionality, e.g. different versions of a document, plagiarized copies of a
publication, or related articles in the archive of a scientific journal
can be found.

The documents were indexed with Lucene 1.4 and represented as
term vectors inside the Lucene index. For searching, a real
vector-space retrieval model (not an advanced boolean model) based on the
SMART retrieval system by Gerard Salton was implemented, including
tf-idf weighting and a cosine similarity function. The whole search space
is explored; no heuristic methods are used at present, but they can be retrofitted.
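
For readers unfamiliar with the terms, a minimal sketch of tf-idf weighting and cosine similarity follows. It is not the thesis code: CosineSketch is a hypothetical name, and the weighting variant tf * ln(N / df) is an assumption, since SMART defines several variants. The sketch also assumes that the document being weighted is itself part of the corpus, so df is never zero.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy vector-space similarity; the weighting variant is an assumption, not the thesis code. */
final class CosineSketch {

    /** Builds tf-idf weights for one document of the corpus: weight = tf * ln(N / df). */
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            weights.merge(term, 1.0, Double::sum);            // raw term frequency
        }
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            // document frequency: number of corpus documents containing the term
            long df = corpus.stream().filter(d -> d.contains(entry.getKey())).count();
            entry.setValue(entry.getValue() * Math.log((double) corpus.size() / df));
        }
        return weights;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        return Math.sqrt(v.values().stream().mapToDouble(x -> x * x).sum());
    }
}
```

Ranking every document of the collection by cosine(query, document) corresponds to the exhaustive search described above; a heuristic would restrict the candidates before this step.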

I have deployed a shortened version of the diploma prototype, which
includes a GUI and one sample document collection (CIA Factbook), but not
the sources of the project:

http://www.informatik.htw-dresden.de/~s4328/pub/diploma_Marcel_Hofmann.zip

The prototype can be started with prototype/deploy/diploma.bat
(sorry to all non-Windows users). The included readme.txt lists the
original content of the prototype, not the shortened version.

I would like to contribute a library to the Lucene project, which contains
the core of the implementation (vector space, cosine similarity, ...).
All you have to do is answer this mail, ask for this library, and give
me some hints...