1 Jun 2005 01:08
Re: Indexing multiple languages
Erik Hatcher <erik <at> ehatchersolutions.com>
2005-05-31 23:08:37 GMT
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It
handles English text the usual way (lowercasing, removing stop words,
and so on) and also splits CJK text into one token per character.
Erik
On May 31, 2005, at 5:49 PM, jian chen wrote:
> Hi,
>
> Interesting topic. I thought about this as well. I wanted to index
> Chinese text with English, i.e., I want to treat the English text
> inside Chinese text as English tokens rather than Chinese text tokens.
>
> Right now I think I may have to write a special analyzer that takes
> the text input and checks whether each character is an ASCII char. If
> it is, it assembles consecutive ASCII characters into one token; if
> not, it emits the character as a Chinese word token.
>
> So, the bottom line is: just one analyzer for all the text, with the
> if/else logic inside the analyzer.
>
> I would like to hear more thoughts on this!
>
> Thanks,
>
> Jian
>
> On 5/31/05, Tansley, Robert <robert.tansley <at> hp.com> wrote:
>