Erik Hatcher | 1 Feb 2004 13:22

Re: cvs commit: jakarta-lucene/src/test/org/apache/lucene/search TestBasics.java TestSimilarity.java

On Jan 30, 2004, at 5:10 PM, cutting <at> apache.org wrote:
> cutting     2004/01/30 14:10:00
>
>   Modified:    .        CHANGES.txt
>                src/java/org/apache/lucene/search Similarity.java
>                src/test/org/apache/lucene/search TestBasics.java
>                         TestSimilarity.java
>   Added:       src/java/org/apache/lucene/search/spans NearSpans.java
>                         SpanFirstQuery.java SpanNearQuery.java
>                         SpanNotQuery.java SpanOrQuery.java SpanQuery.java
>                         SpanQueue.java SpanScorer.java SpanTermQuery.java
>                         SpanWeight.java Spans.java package.html
>   Log:
>   Added new span-based query API.

Wow!  You're simply amazing, Doug.  This is a very sweet addition to 
Lucene.

	Erik
ehatcher | 1 Feb 2004 20:55

cvs commit: jakarta-lucene-sandbox/contributions/miscellaneous/src/test - New directory

ehatcher    2004/02/01 11:55:22

  jakarta-lucene-sandbox/contributions/miscellaneous/src/test - New directory
ehatcher | 1 Feb 2004 20:56

cvs commit: jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org - New directory

ehatcher    2004/02/01 11:56:29

  jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org - New directory
ehatcher | 1 Feb 2004 20:56

cvs commit: jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache - New directory

ehatcher    2004/02/01 11:56:49

  jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache - New directory
ehatcher | 1 Feb 2004 20:56

cvs commit: jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache/lucene - New directory

ehatcher    2004/02/01 11:56:58

  jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache/lucene - New directory
ehatcher | 1 Feb 2004 20:57

cvs commit: jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache/lucene/misc - New directory

ehatcher    2004/02/01 11:57:08

  jakarta-lucene-sandbox/contributions/miscellaneous/src/test/org/apache/lucene/misc - New directory
karl wettin | 1 Feb 2004 22:07

N-gram layer


Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy
to contribute.

I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by language. The
algorithm is very similar to the "TextCat" C library.
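
A minimal sketch of a TextCat-style guesser, for readers unfamiliar with the approach. It is not Karl's code: the class name, the 1- to 3-gram range, and the profile size of 300 are all assumptions made for illustration. Each language gets a ranked list of its most frequent character N-grams, and a text is assigned to the language whose ranking differs least from its own (the "out-of-place" distance).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical rank-order N-gram language guesser (illustrative only). */
class NGramLanguageGuesser {

    private static final int MAX_N = 3;          // assumed: use 1- to 3-grams
    private static final int PROFILE_SIZE = 300; // assumed: keep the top 300 n-grams

    private final Map<String, List<String>> profiles = new HashMap<>();

    /** Returns the text's n-grams, most frequent first (ties broken alphabetically). */
    static List<String> profile(String text) {
        // Collapse every run of non-letters into '_' so word boundaries become n-gram context.
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = counts.get(b) - counts.get(a);
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams.size() > PROFILE_SIZE
                ? new ArrayList<>(grams.subList(0, PROFILE_SIZE))
                : grams;
    }

    /** Stores a ranked profile for one language, built from sample text. */
    void train(String language, String sampleText) {
        profiles.put(language, profile(sampleText));
    }

    /** Out-of-place distance: summed rank differences; unseen n-grams cost PROFILE_SIZE. */
    static int distance(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            total += (r == null) ? PROFILE_SIZE : Math.abs(r - i);
        }
        return total;
    }

    /** Returns the trained language with the smallest distance, or null if none is trained. */
    String guess(String text) {
        List<String> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In practice the profiles would be trained on much larger samples per language; short inputs give unreliable rankings.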

And then I thought, maybe it would be possible to use the same N-gram 
classifier to make an automatic stemmer that works on all languages. 
Hopefully I'll have something up and running for tests by next weekend.

The same classifier could be used for a simple metaphone index.

However, I need some help on understanding the Analyzer. Where can I
find some tutorials on how to write my own? I didn't check with Google,
maybe I should before posting here. Since the stemmer (and metaphone)
data would have to be indexed in their own field(?), querying the stemmed
field would require one to stem the query too. Can I create a subclass of 
Query (or so), or do I need to create my own Query-class that handles
the stemming all the way for the user? The last option is my current
approach, so I would appreciate some hints and pointers here.
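
The point about stemming the query can be sketched without any Lucene classes. The toy index below is purely illustrative: `StemmedIndexSketch` and `toyStem` are hypothetical names, and the suffix stripper is a placeholder, not a real stemmer. What it shows is that one normalization function is applied both when terms are indexed and when a query term is looked up, so the two sides always agree.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Hypothetical toy inverted index; not Lucene's API. */
class StemmedIndexSketch {

    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    StemmedIndexSketch(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Placeholder stemmer: lowercases, then strips a trailing "ing" or "s". */
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        if (t.length() > 5 && t.endsWith("ing")) {
            return t.substring(0, t.length() - 3);
        }
        if (t.length() > 3 && t.endsWith("s")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    /** Index time: every token is stemmed before it is stored. */
    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(stemmer.apply(token), k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query time: the query term goes through the same stemmer before lookup. */
    Set<Integer> search(String queryTerm) {
        return postings.getOrDefault(stemmer.apply(queryTerm), new TreeSet<>());
    }
}
```

With `new StemmedIndexSketch(StemmedIndexSketch::toyStem)`, a document containing "Indexing" and another containing "index" are both found by a search for either form, because both sides reduce to the same stored term.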

Great project! 

karl

Otis Gospodnetic | 1 Feb 2004 22:12

Re: N-gram layer

The best Analyzer documentation so far is Erik Hatcher's "Parser Rulez"
article.  The link is on the Resources page of Lucene's site.

Looking forward to the contribution.

Otis


Robert Engels | 2 Feb 2004 05:15

RE: N-gram layer

Actually, you do not always need to store it in a field.

See the Phonetic Query patch I posted (which does Soundex, Metaphone, and
can actually do any 'secondary' info query).
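
For context, here is a minimal sketch of the classic American Soundex encoding, one of the phonetic codes mentioned above. It is not taken from the Phonetic Query patch, whose contents are not shown in this thread; the class name `SoundexSketch` is hypothetical. Names that sound alike map to the same four-character key, which is the kind of 'secondary' information a query can match on.

```java
import java.util.Locale;

/** Hypothetical stand-alone Soundex encoder (illustrative only). */
class SoundexSketch {

    // Codes for 'a'..'z': digits 1-6 are consonant classes,
    // '0' marks vowel-like letters, '-' marks h and w.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = 0; // code of the most recent letter that was not h or w
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch < 'a' || ch > 'z') {
                continue; // ignore anything outside a-z
            }
            char code = CODES.charAt(ch - 'a');
            if (out.length() == 0) {
                // The first letter is kept as-is; its code still suppresses a repeat.
                out.append(Character.toUpperCase(ch));
                last = code;
            } else if (code == '-') {
                // h and w are skipped and do not separate equal codes.
            } else if (code == '0') {
                last = '0'; // a vowel separates equal codes on either side
            } else {
                if (code != last) {
                    out.append(code);
                }
                last = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys with zeros
        }
        return out.toString();
    }
}
```

For example, `SoundexSketch.encode("Robert")` and `SoundexSketch.encode("Rupert")` both return `R163`.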

Robert Engels


karl wettin | 2 Feb 2004 05:44

Re: N-gram layer

On Sun, 1 Feb 2004 22:15:26 -0600
"Robert Engels" <rengels <at> ix.netcom.com> wrote:

> Actually, you do not always need to store it in a field.
> 
> See the Phonetic Query patch I posted (which does Soundex, Metaphone,
> and can actually do any 'secondary' info query).

Now it hits me: I really don't need to store the stemmed document at all.
Stemming the indexed data in real time would save quite a bit of disk space.

Silly me.

Is it an Analyzer or Query I want to subclass?

-- 
karl

http://sf.net/projects/silvertejp/ 

[abstract Human]<|--+--[Woman]<>-- +mother +child {0..*} --[Human]
                    \--[Man]<>-- +father +child {0..*} --[Human]

"arghhh .. it's all in geek" - objectmonkey.com 
