Yagnesh Shah | 1 Apr 2005 01:36

RE: HTML pages highlighter

Hi Erik,
	I have modified the try section of HTMLDocument.java to use doc.add(Field.Text("contents", l)); It
compiles, with the following warning about a deprecated API, but I am still unable to see any value.

############ compile warning #########
compile-demo:
    [javac] Compiling 1 source file to /opt/dynamo/trunk/build/classes/demo
    [javac] Note: /opt/dynamo/trunk/src/demo/org/apache/lucene/demo/HTMLDocument.java uses or overrides a deprecated API.
    [javac] Note: Recompile with -deprecation for details.

jar-demo:
      [jar] Building jar: /opt/dynamo/trunk/build/lucene-demos-1.9-rc1-devYS.jar

############### code change #############

    try {
      fis = new FileInputStream(f);
      HTMLParser parser = new HTMLParser(fis);

      // Add the tag-stripped contents as a Reader-valued Text field so it will
      // get tokenized and indexed.
//      doc.add(new Field("contents", parser.getReader()));
      LineNumberReader reader = new LineNumberReader(parser.getReader());
      for (String l = reader.readLine(); l != null; l = reader.readLine())
//        System.out.println(l);
      doc.add(Field.Text("contents", l));

      // Add the summary as a field that is stored and returned with
      // hit documents for display.

Erik Hatcher | 1 Apr 2005 03:03

Re: HTML pages highlighter


On Mar 31, 2005, at 6:36 PM, Yagnesh Shah wrote:
>     try {
>       fis = new FileInputStream(f);
>       HTMLParser parser = new HTMLParser(fis);
>
>       // Add the tag-stripped contents as a Reader-valued Text field so it will
>       // get tokenized and indexed.
> //      doc.add(new Field("contents", parser.getReader()));
>       LineNumberReader reader = new LineNumberReader(parser.getReader());
>       for (String l = reader.readLine(); l != null; l = reader.readLine())
> //        System.out.println(l);
>       doc.add(Field.Text("contents", l));

Notice that your loop here is adding a "contents" field for *every* 
line read since that is where the first semi-colon is.

Look at using Luke to explore your index.  Try indexing just a dummy 
String:

	doc.add(Field.Text("contents", "some dummy text"));

to show that it works.  Always always always simplify a complicated 
situation by doing the most obvious thing that _should_ work.
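
A minimal sketch of one way around the per-line field, in plain Java so it runs without Lucene: collect all lines into one string inside a braced loop, then add a single "contents" field afterwards. The class and method names (ContentsJoiner, readAll) are hypothetical; the Lucene call appears only as a comment, copied from the quoted code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ContentsJoiner {
    // Reads every line from the reader and joins them into one string,
    // so the caller can add one field after the loop has finished.
    static String readAll(Reader in) {
        StringBuilder contents = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The braces make the loop body explicit.
                contents.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("first line\nsecond line"));
        System.out.print(text);
        // In the demo code this would be followed, once, by:
        // doc.add(Field.Text("contents", text));
    }
}
```

Whether a single field or several same-named fields is preferable depends on the application; the point here is only that the loop body is what the braces say it is.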

Also, the demo Lucene code is not really designed to be used in a 
production application (sadly), so you're better off borrowing code 

Erik Hatcher | 1 Apr 2005 03:08

Re: Analyzer don't work with wildcard queries, snowball analyzer.

On Mar 31, 2005, at 12:26 PM, Ernesto De Santis wrote:
> Hi Erik

Finally, my name spelled correctly..... :))

> Ok, in PrefixQuery cases, non analyze is right.
>
> But you think that non analyze in WildcardQuery is right?

Do I think it's right?  That's just the way it is.  Whether that is 
right or not I don't know for sure.  I don't think analyzing a wildcard 
expression is going to do the right thing in most cases - consider 
analyzers that split on special characters like ? and * - in fact I'd 
bet your analyzer currently does that!
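
To illustrate the point about special characters, here is a small plain-Java sketch. It is not a Lucene analyzer; letterTokens is a hypothetical stand-in for any analyzer that keeps only runs of letters and treats everything else as a separator.

```java
import java.util.ArrayList;
import java.util.List;

class WildcardAnalysisDemo {
    // Toy tokenizer: lowercased runs of letters become tokens,
    // every other character (including ? and *) is a separator.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("ex?mple")); // [ex, mple]
        System.out.println(letterTokens("ex*ple"));  // [ex, ple]
    }
}
```

With a tokenizer like this the wildcard characters disappear and the expression falls apart into two unrelated terms, which is why passing a wildcard expression through such an analyzer is unlikely to help.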

> You search "example" and obtain x results.
> You search "ex?mple" and don't obtain any result.
> This is correct for you?
> It is difficult to analyze wildcard queries in lucene code?

You're free to subclass QueryParser and override getWildcardQuery and 
analyze the term text.  I suspect you won't have much success though.  
Please let us know what you find.

	Erik
Erik Hatcher | 1 Apr 2005 03:15

Re: using different analyzer for searching


On Mar 31, 2005, at 11:44 AM, pashupathinath wrote:

>   is it possible to index using a predefined analyzer
> and search using a custom analyzer ??

Yes, it's perfectly fine to do so with the caveat that you end up 
searching for the terms exactly as they were indexed.

I end up doing this in most applications, actually, primarily because 
untokenized fields need to use the KeywordAnalyzer during searching.

>   i'm searching using the built in whitespace
> analyser. the problem is when i'm searching for a part
> of a string the search results are zero.
>   i'm using white space analyzer. for example if the
> statement is "my name is abc123" the search for abc or
> 123 doesnt return any hits.
>   anyinsight into this ??

The exact terms indexed using WhitespaceAnalyzer are like this (using 
the Lucene in Action AnalyzerDemo - "ant AnalyzerDemo"):

     [input] String to analyze: [This string will be analyzed.]
my name is abc123
      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "my name is abc123"
      [java]   WhitespaceAnalyzer:
      [java]     [my] [name] [is] [abc123]
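
The effect of those terms on searching can be sketched in plain Java. This is only an approximation of whitespace tokenization and exact term lookup, not Lucene code; the names WhitespaceTermDemo, whitespaceTokens, hasTerm and hasPrefix are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceTermDemo {
    // Splits on runs of whitespace only, leaving each token untouched.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Exact term lookup: the query must equal an indexed term.
    static boolean hasTerm(List<String> terms, String query) {
        return terms.contains(query);
    }

    // Prefix lookup: some indexed term must start with the prefix.
    static boolean hasPrefix(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = whitespaceTokens("my name is abc123");
        System.out.println(terms);                    // [my, name, is, abc123]
        System.out.println(hasTerm(terms, "abc"));    // false
        System.out.println(hasTerm(terms, "abc123")); // true
        System.out.println(hasPrefix(terms, "abc"));  // true
    }
}
```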


pashupathinath | 1 Apr 2005 05:49

Re: using different analyzer for searching

Hi Erik,
   I'm creating a blogging application where users
can create blogs, upload pictures, post comments
and so on.
   I'm storing all the information in a MySQL
database. I'm indexing the database contents and
searching on this index, using Lucene to implement
this feature.
   I give the user options to search based on
BlogTitle, Blogdesc and blogcategory. The main purpose of
the search is that whenever a user enters a query related
to blogtitle, blogdesc or blogcategory, it should
return all the documents matching that search
string.
   The real problem I'm facing is that whenever the user
enters only part of the main string, the search returns
zero hits, because I was using WhitespaceAnalyzer, which
needs the complete string. I should look into using
WildcardQuery, which I think will solve my problem to
some extent.
   I should do more analysis, as you suggested,
before deciding which analyzer to use. What about
writing a custom analyzer to solve this? How can I go
about implementing the logic in a custom analyzer, so
that the search returns all the documents that contain
even a part of the search string?
   Any insight into this would be very helpful,
especially in terms of performance.


Morus Walter | 1 Apr 2005 08:25

Re: Analyzer don't work with wildcard queries, snowball analyzer.

Ernesto De Santis writes:
> Hi Erik
> 
> Ok, in PrefixQuery cases, non analyze is right.
> 
It creates the same problems.
'example*' should find 'example' but does not if 'example' is stemmed
to 'exampl' and you don't analyze the prefix query.

> 
> You search "example" and obtain x results.
> You search "ex?mple" and don't obtain any result.
> This is correct for you?
> It is difficult to analyze wildcard queries in lucene code?
> 
This has nothing to do with Lucene code.
If you can write such an analyzer, do so. Erik already showed you how
to integrate it with QP. If you're successful, share the code.
It looks easy for 'ex?ample' but how would you analyze 'exampl?s'?
Assuming 'examples' gets stemmed to 'exampl' you would have to guess
that ? might expand to 'e' and 'exampl?s' should be analyzed to 'exampl'
and 'exampl?s' (and probably 'exampl?'). Or 'exa*s'?

IMO you have to either avoid stemming or wildcards or live with the 
different handling.
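
The mismatch can be made concrete with a small plain-Java sketch. It assumes, as in the example above, that 'example' and 'examples' are both indexed as the stem 'exampl'. The wildcard-to-regex translation is a simplified stand-in for wildcard matching, not Lucene's implementation, and the names (WildcardVsStemDemo, toRegex, matches) are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardVsStemDemo {
    // Translates a wildcard term into a regex:
    // '?' matches one character, '*' matches any run of characters.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the unanalyzed wildcard term matches the indexed term.
    static boolean matches(String wildcard, String indexedTerm) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    public static void main(String[] args) {
        String indexed = "exampl"; // assumed stem of 'example' and 'examples'
        String[] queries = {"example*", "ex?mple", "exampl?s", "exa*s", "exampl*"};
        for (String q : queries) {
            System.out.println(q + " -> " + matches(q, indexed));
        }
    }
}
```

Only 'exampl*' matches the stemmed term; every query written against the unstemmed word misses it, which is the different handling one has to live with when combining stemming and wildcards.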

Morus
Karthik N S | 1 Apr 2005 10:41

LUCENE & eCommerce

 

Hi guys,

Apologies.

Has anybody on the forum implemented the Lucene API in eCommerce (search-based shopping),

something similar to http://www.bizrate.com/ ?

Please help me.

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]

Karthik N S | 1 Apr 2005 10:45

RE: using different analyzer for searching

Hi,

First try the AnalysisDemo.java code from the java.net article at
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread
on the content you want to experiment with, and verify which analyzer to use.

This will probably give you some idea about analyzers.

With regards,
Karthik

--- Erik Hatcher <erik <at> ehatchersolutions.com> wrote:
>
> On Mar 31, 2005, at 11:44 AM, pashupathinath wrote:
>
> >   is it possible to index using a predefined analyzer
> > and search using a custom analyzer ??
>
> Yes, it's perfectly fine to do so with the caveat that you end up
> searching for the terms exactly as they were indexed.
>
> I end up doing this in most applications, actually, primarily because
> untokenized fields need to use the KeywordAnalyzer during searching.
>
> >   i'm searching using the built in whitespace
> > analyser. the problem is when i'm searching for a part
> > of a string the search results are zero.
> >   i'm using white space analyzer. for example if the
> > statement is "my name is abc123" the search for abc or
> > 123 doesnt return any hits.
> >   anyinsight into this ??
>
> The exact terms indexed using WhitespaceAnalyzer are like this (using
> the Lucene in Action AnalyzerDemo - "ant AnalyzerDemo"):
>
>      [input] String to analyze: [This string will be analyzed.]
> my name is abc123
>       [echo] Running lia.analysis.AnalyzerDemo...
>       [java] Analyzing "my name is abc123"
>       [java]   WhitespaceAnalyzer:
>       [java]     [my] [name] [is] [abc123]
>
>       [java]   SimpleAnalyzer:
>       [java]     [my] [name] [is] [abc]
>
>       [java]   StopAnalyzer:
>       [java]     [my] [name] [abc]
>
>       [java]   StandardAnalyzer:
>       [java]     [my] [name] [abc123]
>
> So you indexed "abc123" and searches must search for that term
> *exactly*.  You can search for "abc*" as a PrefixQuery or WildcardQuery
> and find "abc123".  "*123" will also find it though QueryParser does
> not support leading wildcard characters (but the API does).  Wildcard
> queries are not ideally what you want as they tend to be much slower for
> large indexes.
>
> You may need to do specialized analysis.  Perhaps you could share your
> real needs with the list and we could offer recommendations.  It is
> possible to index "abc123", "abc", and "123" all within the same
> position in the index if you do some clever analysis and that meshes
> with what you're after.
>
> 	Erik
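
The idea of indexing "abc123", "abc" and "123" at the same position can be sketched in plain Java. This is not a Lucene TokenFilter; it only shows the splitting logic and the position increments such a filter might assign (Lucene tokens carry a position increment, and an increment of 0 places a token at the same position as the previous one). The names SubTokenDemo, Tok and expand are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SubTokenDemo {
    // A token text plus its position increment;
    // 0 means "same position as the previous token".
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // Emits the original token, then each digit/non-digit run at the same position.
    static List<Tok> expand(String token) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(token, 1));

        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean boundary = run.length() > 0
                    && Character.isDigit(c) != Character.isDigit(run.charAt(run.length() - 1));
            if (boundary) {
                parts.add(run.toString());
                run.setLength(0);
            }
            run.append(c);
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        // Only add sub-tokens when the token actually splits.
        if (parts.size() > 1) {
            for (String part : parts) {
                out.add(new Tok(part, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("abc123")); // [abc123(+1), abc(+0), 123(+0)]
        System.out.println(expand("name"));   // [name(+1)]
    }
}
```

With terms indexed this way, plain term queries for "abc" or "123" would find the document without resorting to wildcards, at the cost of a larger index.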
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe <at> lucene.apache.org
For additional commands, e-mail: java-user-help <at> lucene.apache.org
Fabien Le Floc'h | 1 Apr 2005 11:43

indexing performance of little documents

Hello,

I want to index a 1GB file that contains lines of
approximately 100 characters each, so that I can later retrieve the lines
containing some particular text. The natural way of doing it with Lucene
would be to create one Lucene Document per line. That works well, except it
is too slow for my needs, even after tweaking all possible parameters of
IndexWriter and using the CVS version of Lucene.

I can get 10x the indexing performance by indexing the file as a single Lucene
Document. Lucene builds a good index with all the terms, and I am able to
get the number of terms matching a query but not their absolute position
in the original file (I only get the relative token position). A minor
quirk with this approach is that I need to split the document in order
to avoid running out of memory when the document is too big. It would
probably be possible for me to customize Lucene for my needs (create a more
flexible Term class), but that would just be a hack. I was wondering why there
should be such a performance difference.

I see that for each document plenty of work is done, but that seems
necessary, and then there is even more work while merging segments.
Things could probably be faster if documents were first aggregated and
then work done on them. But I think this would imply huge changes in
Lucene. Any advice for indexing millions of tiny docs?
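
One possible middle ground between "one Document per line" and "one Document per file" is sketched below in plain Java, as an assumption rather than a tested Lucene recipe: group lines into batches, remember the line number at which each batch starts, and rescan only the batches that match to recover absolute line numbers. All names (LineBatchDemo, Batch, batch, matchingLines) and the batch size are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineBatchDemo {
    // A group of consecutive lines plus the 0-based number of its first line.
    static final class Batch {
        final int firstLine;
        final List<String> lines;

        Batch(int firstLine, List<String> lines) {
            this.firstLine = firstLine;
            this.lines = lines;
        }
    }

    // Splits the input into batches of at most batchSize lines.
    static List<Batch> batch(List<String> lines, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Batch> out = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += batchSize) {
            int end = Math.min(start + batchSize, lines.size());
            out.add(new Batch(start, new ArrayList<>(lines.subList(start, end))));
        }
        return out;
    }

    // Rescans one batch and returns the absolute line numbers containing the text.
    static List<Integer> matchingLines(Batch b, String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < b.lines.size(); i++) {
            if (b.lines.get(i).contains(text)) {
                hits.add(b.firstLine + i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alpha", "beta", "gamma", "delta", "epsilon");
        List<Batch> batches = batch(lines, 2);
        System.out.println(batches.size());                         // 3
        System.out.println(matchingLines(batches.get(1), "delta")); // [3]
    }
}
```

Each batch would become one Document with its first line number kept in a stored field; whether this recovers enough of the 10x difference is something only a measurement on the real data can tell.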

 

Regards,

Fabien.
Sven Duzont | 1 Apr 2005 15:09

Re[2]: Analyzer don't work with wildcard queries, snowball analyzer.

Hello Erik,

Since wildcard queries are not analyzed, how can we deal with accents?
For instance (in French), a query like "ingé*" will not match documents containing
"ingénieur", but the query "inge*" will.

Thanks

---
 sven
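
One way to handle the accent case is to fold accents out of the wildcard term before the query is built. The sketch below is plain Java using java.text.Normalizer; it assumes the index-time analyzer strips accents (which the "inge*" behaviour described above suggests), and the names AccentFoldDemo and stripAccents are hypothetical.

```java
import java.text.Normalizer;

class AccentFoldDemo {
    // Decomposes accented characters (NFD) and drops the combining marks,
    // leaving wildcard characters such as * and ? untouched.
    static String stripAccents(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e with an acute accent, written as an escape to avoid encoding issues.
        System.out.println(stripAccents("ing\u00e9*"));      // inge*
        System.out.println(stripAccents("ing\u00e9nieur"));  // ingenieur
    }
}
```

Such a function could be applied to the term text in a QueryParser subclass before the wildcard query is created. Unlike stemming, accent folding works character by character, so it does not suffer from the guessing problem that makes stemming wildcard terms impractical.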

On Thursday, 31 March 2005 at 17:51:25, you wrote:

EH> Wildcard terms simply are not analyzed.  How could it be possible to do
EH> this?  What if I search for "a*" - how could you stem that?

EH> 	Erik

EH> On Mar 31, 2005, at 9:51 AM, Ernesto De Santis wrote:

>> Hi
>>
>> I get an unexpected behavior when use wildcards in my queries.
>> I use a EnglishAnalyzer developed with SnowballAnalyzer. version 
>> 1.1_dev from Lucene in Action lib.
>>
>> Analysis case:
>> When use wildcards in the middle of one word, the word in not analyzed.
>> Examples:
>>
>>            QueryParser qp = new QueryParser("body", analyzer);
>>            Query q = qp.parse("ex?mple");
>>            String strq = q.toString();
>>            assertEquals("body:ex?mpl", strq);
>> //FAIL strq == body:ex?mple
>>
>>            qp = new QueryParser("body", analyzer);
>>            q = qp.parse("ex*ple");
>>            strq = q.toString();
>>            assertEquals("body:ex*pl", strq);
>> //FAIL strq == body:ex*ple
>>
>> With this behavior, the search does not find any document.
>>
>> Bye
>> Ernesto.
>>
>> -- 
>> Ernesto De Santis - Colaborativa.net
>> Córdoba 1147 Piso 6 Oficinas 3 y 4
>> (S2000AWO) Rosario, SF, Argentina.