Erik Hatcher | 1 Jun 2005 01:08

Re: Indexing multiple languages

Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
handles English as usual (removing stop words, lowercasing, and such)
and also emits each CJK character as its own token.

     Erik

On May 31, 2005, at 5:49 PM, jian chen wrote:

> Hi,
>
> Interesting topic. I thought about this as well. I wanted to index
> Chinese text with English, i.e., I want to treat the English text
> inside Chinese text as English tokens rather than Chinese text tokens.
>
> Right now I think I may have to write a special analyzer that takes
> the text input and checks whether each character is ASCII: if it is,
> assemble consecutive ASCII characters into one token; if not, emit
> the character as a Chinese word token.
>
> So, bottom line is, just one analyzer for all the text and do the
> if/else statement inside the analyzer.
>
> I would like to hear more thoughts on this!
>
> Thanks,
>
> Jian
>
> On 5/31/05, Tansley, Robert <robert.tansley <at> hp.com> wrote:
>
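
The single-analyzer idea quoted above (ASCII runs become one token, every other character becomes its own token) can be sketched in plain Java. This is only an illustration of the if/else logic: it does not use Lucene's Analyzer or Tokenizer classes, the class and method names are made up for this example, and treating every non-ASCII letter as a "Chinese" token is the simplification from the message, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;

public class MixedTextTokenizer {

    /** Splits text into lowercased ASCII word tokens and single-character non-ASCII tokens. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder asciiRun = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128 && Character.isLetterOrDigit(cp)) {
                // ASCII letter or digit: keep assembling the current English token.
                asciiRun.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // Anything else ends the current ASCII run.
            flush(asciiRun, tokens);
            if (cp >= 128 && Character.isLetterOrDigit(cp)) {
                // Assumption: a non-ASCII letter is a CJK character and becomes its own token.
                tokens.add(new String(Character.toChars(cp)));
            }
            // Whitespace and punctuation are dropped.
        }
        flush(asciiRun, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Chinese characters written as escapes to keep the source ASCII-only.
        System.out.println(tokenize("\u6211\u7528Lucene 1.9\u7d22\u5f15"));
    }
}
```

A real analyzer would also need stop-word handling and a decision about non-CJK, non-ASCII scripts, which this sketch ignores.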

Erik Hatcher | 1 Jun 2005 01:12

Re: Indexing multiple languages

Robert,

I'm very likely going to be using DSpace and some related  
technologies from the SIMILE project very soon :)

On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:
> Hi all,
>
> DSpace (www.dspace.org) currently uses Lucene to index metadata
> (Dublin Core standard) and extracted full-text content of documents
> stored in it.  Now that the system is being used globally, it needs to
> support multi-language indexing.
>
> I've looked through the mailing list archives etc. and it seems it's
> easy to plug in analyzers for different languages.
>
> What if we're trying to index multiple languages in the same site?  Is
> it best to have:
>
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so
> searches can be constrained to a particular language
> 3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can  
query with or without concern for language.
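
To make option #2 concrete, here is a small sketch of how a query string might be built with and without the language constraint. It uses only standard Lucene query syntax (`field:(...)` grouping and the `+` required-clause prefix); the field names `contents` and `language` and the helper itself are assumptions for illustration, not anything DSpace or Lucene defines.

```java
public class LanguageFilterQuery {

    /**
     * Builds a query string for a QueryParser-style syntax.
     * With a language, both clauses are required; without one, all languages match.
     * Note: userQuery is inserted as-is; real code would need to escape special characters.
     */
    public static String build(String userQuery, String language) {
        String text = "contents:(" + userQuery + ")";
        if (language == null || language.isEmpty()) {
            return text;
        }
        return "+" + text + " +language:" + language;
    }

    public static void main(String[] args) {
        System.out.println(build("dublin core", null));
        System.out.println(build("dublin core", "fr"));
    }
}
```

At indexing time each document would carry its language code in an untokenized `language` field so the second clause can match exactly.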

> I'm also not sure of the storage and performance consequences of 2/.


jian chen | 1 Jun 2005 01:16

Re: Indexing multiple languages

Hi, Erik,

Thanks for your info. 

No, I haven't tried it yet. I will give it a try and maybe produce
some Chinese/English text search demo online.

Currently I use Lucene as the indexing engine for a Velocity mailing
list search. I have a demo at www.jhsystems.net.

It is yet another mailing list search for Velocity, but it combines
date search with full-text search.

I only use Lucene for indexing the textual content, and combine
database search with Lucene search when returning the results.

The other interesting thought I have is: maybe it is possible to use
Lucene's segment-merging mechanism to write a simple Java-based file
system, which of course would not require a constant compaction
operation. The file system could be based on a single file, where
segments are just parts of the big file. It might be really efficient
when objects are added and deleted all the time.

Lastly, any comments on the www.jhsystems.net Velocity search are welcome.

Thanks,

Jian
www.jhsystems.net


Markus Wiederkehr | 1 Jun 2005 06:35

Re: managing docids for ParallelReader (was Augmenting an existing index)

On 5/31/05, Doug Cutting <cutting <at> apache.org> wrote:
> Matt Quail wrote:
> > I have wondered about this as well. Are there any *sure fire* ways of
> > creating (and updating) two indices so that doc numbers in one index
> > deliberately correspond to doc numbers in the other index?
> 
> If you add the documents in the same order to both indexes and perform
> the same deletions on both indexes then they'll have the same numbers.

The Javadoc says that ParallelReader is useful with collections that
have large fields which change rarely and small fields that change
more frequently. IMO that implies that you do *not* always apply the
same operations on both indexes.

> If this is not convenient, then you could add an id field to all
> documents in the primary index.  Then create (or re-create) the
> secondary index by iterating through the values in a FieldCache of this
> id field.

I guess I am too new to Lucene to understand how that is supposed to
work. What exactly is the purpose of a FieldCache and how is it
created and used? Could you elaborate on that, please?
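
One possible reading of the id-field suggestion, as a plain-Java model rather than real Lucene code: a FieldCache lookup on the id field yields, as far as I understand it, an array indexed by document number, so walking that array in order and adding one secondary document per slot keeps the two indexes aligned. The array, the map of small fields, and all names below are stand-ins invented for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelIndexRebuild {

    /**
     * idsByDocNumber[n] stands in for the cached id of primary document n.
     * Returns the secondary "index" as a list whose position n holds the
     * frequently changing field for that same document.
     */
    public static List<String> rebuildSecondary(String[] idsByDocNumber,
                                                Map<String, String> smallFieldById) {
        List<String> secondary = new ArrayList<>();
        for (String id : idsByDocNumber) {
            // Always add exactly one entry per primary slot, even when the
            // value is missing, so document numbers never drift apart.
            secondary.add(smallFieldById.getOrDefault(id, ""));
        }
        return secondary;
    }

    public static void main(String[] args) {
        String[] ids = {"b", "a", "c"};
        Map<String, String> tags = new HashMap<>();
        tags.put("a", "tagA");
        tags.put("b", "tagB");
        System.out.println(rebuildSecondary(ids, tags));
    }
}
```

The point of the model is only the ordering rule: the secondary index is rebuilt from scratch in primary doc-number order whenever the small fields change.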

> ParallelReader was not really designed to support incremental updates of
> fields, but rather to accelerate batch updates.  For incremental
> updates you're probably better served by updating a single index.

I would be happy with a single index if it were possible to change
fields of a document without affecting other fields. When I look up a
document using an IndexSearcher, manipulate some fields and save that
(Continue reading)

Paul Libbrecht | 1 Jun 2005 10:10

Re: Indexing multiple languages

On 1 June 05, at 01:12, Erik Hatcher wrote:
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so
>> searches can be constrained to a particular language
>> 3/ separate indices for each language?
> I would vote for option #2 as it gives the most flexibility - you can
> query with or without concern for language.

The way I've solved this is to use a different field name per language,
as our documents can be multilingual.
Query expansion is then done at query time: given a term query
for text, I duplicate it for each language the user accepts, with a
boost factor related to the user's preference for that language (e.g.
the q factor in the Accept-Language HTTP header). Presumably I could
use solution 2/ as well if my queries become too big, creating a
separate document for each language of the source document.
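
A rough sketch of that expansion, assuming per-language fields named `text_<lang>` (a naming scheme made up here) and using the standard `field:term^boost` query syntax. The header parsing is deliberately minimal: it keeps only the primary language subtag, ignores `*`, and drops entries with a malformed or zero q value.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

public class LanguageQueryExpander {

    /** Parses an Accept-Language header into language -> q factor, in header order. */
    public static Map<String, Double> parseAcceptLanguage(String header) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String lang = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (lang.isEmpty() || lang.equals("*")) {
                continue;
            }
            int dash = lang.indexOf('-');
            if (dash > 0) {
                lang = lang.substring(0, dash); // "en-US" -> "en"
            }
            double q = 1.0; // default weight when no q parameter is given
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed weight as "not accepted"
                    }
                }
            }
            if (q > 0) {
                prefs.merge(lang, q, Math::max);
            }
        }
        return prefs;
    }

    /** Duplicates one term across the per-language fields, boosted by preference. */
    public static String expand(String term, Map<String, Double> prefs) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, Double> e : prefs.entrySet()) {
            query.add(String.format(Locale.ROOT, "text_%s:%s^%.1f",
                    e.getKey(), term, e.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("dspace", parseAcceptLanguage("fr, en;q=0.8, de;q=0.5")));
    }
}
```

The query grows by one clause per accepted language and term, which is the size concern mentioned above.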

I think it's very important to care about guessing the accepted 
languages of the user. Typically, the default behaviour of Google is to 
only give you matches in your primary language but then allow expansion 
in any language.

>> On the other hand, if people are searching for proper nouns in
>> metadata (e.g. "DSpace") it may be advantageous to search all
>> languages at once.

This one may need particular treatment.


Dawid Weiss | 1 Jun 2005 10:24

Re: Clustering Carrot2 vs TermVector Analysis


Hi Andrew,

Coming up with an answer... sorry for the delay.

> By using the carrot demo: 
> http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
> 
> 
> I was able to easily cluster search results based on the fields used
> by Carrot (url, title, and summary). However I was wondering if there
> was a way to do something similar using term vector analysis and the
> built-in TermVector / Similarity API.

Yes, most clustering methods are based just on that (term-vector
matrix). Carrot also uses this internally, but builds its own data
structure from the provided data instead of relying on Lucene's. It
shouldn't be a problem to write a clustering plugin to Carrot that
actually uses the term-vector data from Lucene.

> After doing a typical Lucene search, how can I get the top 5 "key
> terms" for each of the top ten documents?  I was thinking that I could
> sum these and then have a type of cluster.

The question is ill-defined, I'm afraid. The "top 5 key terms" are very 
subjective and depend on the strategy of score calculation, the way 
you're pruning stop words, etc.
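
To illustrate how much the answer depends on the scoring strategy, here is one possible definition of "key terms": rank a document's terms by a plain TF-IDF weight, tf * ln(N / df). This is only one choice among many and not what Carrot or Lucene's Similarity computes; the term-frequency maps stand in for term-vector data, and every name below is invented for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyTermSketch {

    /** Counts, for every term, how many documents contain it. */
    public static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** TF-IDF weight of one term in one document (df must cover the document's terms). */
    public static double score(String term, Map<String, Integer> doc,
                               Map<String, Integer> df, int numDocs) {
        return doc.get(term) * Math.log((double) numDocs / df.get(term));
    }

    /** Returns up to k terms of the document, highest weight first, ties broken alphabetically. */
    public static List<String> topTerms(Map<String, Integer> doc,
                                        Map<String, Integer> df, int numDocs, int k) {
        List<String> terms = new ArrayList<>(doc.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score(b, doc, df, numDocs), score(a, doc, df, numDocs));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
                Map.of("lucene", 3, "index", 1, "the", 5),
                Map.of("the", 4, "cluster", 2),
                Map.of("the", 1, "lucene", 1));
        Map<String, Integer> df = documentFrequencies(docs);
        System.out.println(topTerms(docs.get(0), df, docs.size(), 2));
    }
}
```

Note that a term occurring in every document gets weight zero under this formula, so the stop-word policy changes the ranking, which is the point made above.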

I also don't get the: "each of the top ten documents". Do you mean: each 
of the ten top documents within a cluster?

Kevin Burton | 1 Jun 2005 10:30

Re: Ability to load a document with ONLY a few fields for performance?

Andrew Boyd wrote:

>The numbers look impressive.  If I build from the 1.9 trunk will I get the patch?

Funny... I went ahead and implemented this myself and it didn't work.  
Of course I may have implemented it incorrectly.  I'll look at the patch 
source and try it out!

Something fun to do tomorrow!  w00t!

Kevin

-- 

Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

   Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 

Reece Wilton | 1 Jun 2005 02:38

SpanTermQuery issue?

Hi,

Using a BooleanQuery to combine two SpanTermQuery objects causes
unexpected results on Lucene 1.9 RC1.  Is this a problem that is
already known about or has already been fixed?  

I have a test case and more info if this is a new issue.

Thanks.

Erik Hatcher | 1 Jun 2005 11:27

Re: SpanTermQuery issue?


On May 31, 2005, at 8:38 PM, Reece Wilton wrote:

> Hi,
>
> Using a BooleanQuery to combine two SpanTermQuery objects causes
> unexpected results on Lucene 1.9 RC1.  Is this a problem that is
> already known about or has already been fixed?
>
> I have a test case and more info if this is a new issue.

Interestingly I just encountered this very issue myself.  Yes, please  
open an issue and attach your test case.  I'll add the test case and  
start looking for the cause of this.

     Erik

Shey Rab Pawo | 1 Jun 2005 16:10

Re: Stemming at Query time

If your stemmer ran at indexing time, then won't the "breath" entry
automatically pick up all of these?  So, isn't the project unnecessary
and otiose?
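
A toy illustration of that point, and of the over-stemming caveat in the quoted message: when both indexed words and the query go through the same stemmer, "breath" finds all the variants without any query-side expansion, but a crude stemmer also conflates unrelated words such as "car" and "caring". The suffix-stripping rule below is a deliberately naive stand-in invented for this sketch, not Lucene's Porter stemmer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexTimeStemmingSketch {

    /** Crude stemmer: strips the first matching suffix if at least 3 letters remain. */
    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "es", "e", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Builds stem -> original words, mimicking an index whose terms were stemmed. */
    public static Map<String, Set<String>> index(String... words) {
        Map<String, Set<String>> postings = new TreeMap<>();
        for (String w : words) {
            postings.computeIfAbsent(stem(w), k -> new TreeSet<>()).add(w);
        }
        return postings;
    }

    /** Stems the query the same way and looks it up. */
    public static Set<String> search(Map<String, Set<String>> postings, String query) {
        return postings.getOrDefault(stem(query), Set.of());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> postings =
                index("breathe", "breathes", "breathing", "car", "caring");
        System.out.println(search(postings, "breath")); // all three "breath" variants
        System.out.println(search(postings, "car"));    // also matches "caring": over-stemming
    }
}
```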

On 5/31/05, Daniel Naber <lucenelist <at> danielnaber.de> wrote:
> On Monday 30 May 2005 18:54, Andrew Boyd wrote:
> 
> > Now that the QueryParser knows about position increments has anyone
> > used this to do stemming at query time and not at indexing time? I
> > suppose one would need a reverse stemmer. Given the query breath it
> > would need to inject breathe, breathes, breathing etc.
> 
> There are two things to consider: queries will get more complicated and
> thus slower and the implementation isn't that easy:  while stemming can be
> done with a simple algorithm (for English), you'll need a dictionary with
> at least part-of-speech information for adding suffixes. That's because
> you cannot just add "ing" to any word, otherwise you'd end up with car +
> ing  = caring. (But once you have this dictionary the quality of your
> solution can be better than that of a stemmer, as stemmers also suffer
> from over-stemming, i.e. mapping two unrelated words to the same form).
> 
> Regards
>  Daniel
> 
> --
> http://www.danielnaber.de

