Andrzej Bialecki | 1 Jul 2010 07:54

Re: Document Order in IndexWriter.addIndexes

On 2010-06-30 22:16, Apoorv Sharma wrote:
> This implies there is no way to merge two parallel indexes(based on parallel
> reader) to get a new parallel index. Correct me if I am wrong.

Do you mean sequentially, using IW.addIndexes(new
IndexReader[]{index1,index2}), so that the fields from documents in one
index become attached to the fields in the other index under the same
doc IDs? No, it absolutely doesn't work this way.

Or do you mean to merge the two indexes in parallel? Then just open them
with ParallelReader ;) and submit this parallel reader to
IW.addIndexes(). ParallelReader doesn't expose its sub-readers to
SegmentMerger; it returns null from getSequentialSubReaders(), so
SegmentMerger has to use the documents as they are returned from
ParallelReader.document(int). The net effect is that fields from the
sub-readers are merged under the same doc ID and are recorded that way
in the output index.
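
To make the difference between the two cases concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: each "index" is modelled as a list of field maps, and the class and method names (ParallelMergeSketch, mergeSequential, mergeParallel) are made up for illustration. It only mirrors the doc ID alignment described above, under the assumption that both inputs hold the same number of documents in the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParallelMergeSketch {

    // Sequential merge: the documents of b are appended after those of a,
    // so fields from a and b never end up in the same document.
    static List<Map<String, String>> mergeSequential(List<Map<String, String>> a,
                                                     List<Map<String, String>> b) {
        List<Map<String, String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Parallel merge: document i of the result carries the fields of
    // document i from both inputs (hypothetical stand-in for what a
    // parallel reader presents to the merger).
    static List<Map<String, String>> mergeParallel(List<Map<String, String>> a,
                                                   List<Map<String, String>> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException(
                "parallel indexes must have the same number of documents");
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (int docId = 0; docId < a.size(); docId++) {
            Map<String, String> doc = new LinkedHashMap<>(a.get(docId));
            doc.putAll(b.get(docId));
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> titles = Arrays.asList(
            Collections.singletonMap("title", "A"),
            Collections.singletonMap("title", "B"));
        List<Map<String, String>> bodies = Arrays.asList(
            Collections.singletonMap("body", "x"),
            Collections.singletonMap("body", "y"));
        System.out.println(mergeSequential(titles, bodies).size()); // 4 documents
        System.out.println(mergeParallel(titles, bodies).size());   // 2 documents
    }
}
```

In the sequential case the result has four documents with one field each; in the parallel case it has two documents with both fields.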

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Lucene and Chinese language


Hi!

We are using Lucene in our project to search through information objects, which works fine. For indexing we
use the StandardAnalyzer.
Now we have to support the Chinese language. I found out that the Chinese words and characters are correctly
saved in the index, but the query to search for them does not work. Example: in English the query is
“text”, which we parse to “*text*”. If we search for Chinese words / phrases like
“佛山东方书城”, the query is “*佛山东方书城*” but there are no search results. If
the query places blanks between the single characters / symbols, like this: “*佛 山 东 方 书 城*”, we
get results. Does the StandardAnalyzer interpret each Chinese character as one word? What are the best
practices for this case? Should we use another analyzer (a Chinese analyzer)? Or is it better to replace the
query parser in this case?

Regards,
Jacqueline.
lwl | 1 Jul 2010 11:30

Re: Lucene and Chinese language

Yes, the StandardAnalyzer interprets each Chinese character as one word.
Better analyzers for Chinese are listed here:
http://hi.baidu.com/lewutian/blog/item/ca61060a06914b1394ca6b25.html

On 1 July 2010 at 5:19 PM, Kolhoff, Jacqueline - ENCOWAY <Kolhoff <at> encoway.de> wrote:

>
> Hi!
>
> We are using lucene in our project to search through information objects
> which works fine. For indexing we use the StandardAnalyzer.
> Now, we have to support the Chinese language. I found out that the Chinese
> words and letters are correctly saved in the index but the query to search
> for them does not work. Example: in English language the query is “text”
> which we parse to “*text*”. If we search for Chinese words / phrases like
> “佛山东方书城”the query is “*佛山东方书城*“ but there are no search results. If the
> query places blanks between the single letters / symbols like this “*佛 山 东 方
> 书 城*“ we are getting results. Does the StandardAnalyzer interpret each
> Chinese letter as one word? What are best practices for this case? Shall we
> use another analyzer (Chinese analyzer)? Or is it better to replace the
> query parser in this case?
>
> Regards,
> Jacqueline.
>
Danil ŢORIN | 1 Jul 2010 11:30

Re: Lucene and Chinese language

Try using the CJK analyzer for both indexing and searching Chinese text.
Then you won't need the "text" -> "*text*" transformation.

There might be some false positives in the results, though.
You may also want to try the smartcn analyzer, which is dictionary based, but
I have no expertise to evaluate its results (we still use CJK for Asian
languages, as there have been no complaints so far).
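
For readers unfamiliar with the CJK analyzer: it indexes overlapping two-character tokens (bigrams) rather than single characters. The sketch below is plain Java, not Lucene code; BigramSketch, bigrams and sharesAnyTerm are hypothetical names, and the code assumes the input consists only of CJK characters. It also shows one way a false positive can arise: an unrelated document that happens to share a bigram with the query.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Emits overlapping two-character tokens, e.g. "abcd" -> [ab, bc, cd].
    // Simplification: a single-character input is emitted as one token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // True if the query and the document have at least one bigram in common,
    // which is enough for a hit when the query terms are OR-ed together.
    static boolean sharesAnyTerm(String query, String document) {
        List<String> docTerms = bigrams(document);
        for (String term : bigrams(query)) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Bigrams of the book shop name from the question.
        System.out.println(bigrams("佛山东方书城"));
        // "山东大学" (Shandong University) is unrelated, but shares the bigram 山东.
        System.out.println(sharesAnyTerm("佛山东方书城", "山东大学"));
    }
}
```

A phrase query over the bigrams would reject the second document, so how many false positives you see depends on how the query is built.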


Robert Muir | 1 Jul 2010 12:33

Re: Lucene and Chinese language

This is a bug in the query parser
(https://issues.apache.org/jira/browse/LUCENE-2458).

The problem has nothing to do with your choice of analyzer; it has to do
with how the query is formed.

Currently the query parser uses a convoluted algorithm involving whitespace
(and not just the double quote operator, as you would expect) to form phrase
queries. So queries like this one, with no whitespace, always form phrase
queries.

The only workaround for reasonably good results consists of two steps:
1. At query time (only!) add an
org.apache.lucene.analysis.position.PositionFilter (from contrib/analyzers)
to your analyzer. Don't do this at index time, just at query time!
2. This makes all terms in the query "synonyms" of each other to bypass
the problem, but it will screw up scoring, so you might also want to extend
QueryParser in a custom way:

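 @Override
 protected BooleanQuery newBooleanQuery(boolean disableCoord) {
   // Intentionally ignore disableCoord: the PositionFilter hack makes the
   // parser treat the terms as synonyms and ask for coord() to be disabled,
   // so always create the query with coord() enabled.
   return new BooleanQuery(false);
 }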
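
Why the PositionFilter changes the kind of query that is built can be sketched in plain Java. This is a simplified model of the behaviour described above, not the parser's actual code: QueryShapeSketch, positionFilter and queryShape are invented names, tokens are reduced to their position increments, and the returned strings merely name the Lucene query classes involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryShapeSketch {

    // Models PositionFilter with its default setting: the first token keeps
    // its position increment, every later token gets increment 0.
    static List<Integer> positionFilter(List<Integer> increments) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < increments.size(); i++) {
            out.add(i == 0 ? increments.get(i) : 0);
        }
        return out;
    }

    // Simplified model of how the query parser picks a query type for the
    // tokens that the analyzer produced from one chunk of query text.
    static String queryShape(List<Integer> increments) {
        if (increments.isEmpty()) {
            return "none";
        }
        if (increments.size() == 1) {
            return "TermQuery";
        }
        int positionCount = 0;
        boolean severalTokensAtSamePosition = false;
        for (int inc : increments) {
            if (inc != 0) {
                positionCount += inc;
            } else {
                severalTokensAtSamePosition = true;
            }
        }
        if (!severalTokensAtSamePosition) {
            return "PhraseQuery";       // tokens at distinct positions
        }
        return positionCount == 1
            ? "BooleanQuery"            // all tokens stacked: treated as synonyms
            : "MultiPhraseQuery";       // a mix of stacked and distinct positions
    }

    public static void main(String[] args) {
        // Four Chinese characters, one token each, at consecutive positions.
        List<Integer> plain = Arrays.asList(1, 1, 1, 1);
        System.out.println(queryShape(plain));                 // PhraseQuery
        System.out.println(queryShape(positionFilter(plain))); // BooleanQuery
    }
}
```

Under this model the four single-character tokens form a phrase query without the filter and a BooleanQuery of optional terms with it.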



Re: Lucene and Chinese language

How can I add this PositionFilter? I can't see anything in the API. I use Lucene version 3.0.1, and this is my
query parser:

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname" , new StandardAnalyzer(Version.LUCENE_30));


Robert Muir | 1 Jul 2010 13:50

Re: Lucene and Chinese language

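You can make your own analyzer, or do something like the below at
query time.

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname",
    new PositionHackAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30)));

// Imports needed by the wrapper class (PositionFilter is in contrib/analyzers).
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter;

// Wraps any analyzer and puts all of its tokens at the same position.
// Use it at query time only, never at index time.
public class PositionHackAnalyzerWrapper extends Analyzer {
  private final Analyzer wrapped;

  public PositionHackAnalyzerWrapper(Analyzer wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = wrapped.tokenStream(fieldName, reader);
    return new PositionFilter(ts);
  }
}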



Re: Lucene and Chinese language

Hm, that doesn't work for me.

Maybe we misunderstood each other:

There aren't any double quotes in my query (I just added them to show that this is my query).

I search a field for a particular value (substring) like 在电力虎

My query looks like this:

+anotherfieldname:description +myfieldname:*在电力虎*

We always add the multiple-character wildcard (*).
The query string is 在电力虎

With the first approach

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname"
 , new StandardAnalyzer(Version.LUCENE_30));

we get no results.

If we do not add the *, it works with the StandardAnalyzer for Chinese; the query is:
+anotherfieldname:description +myfieldname:"在 电 力 虎"

As you can see, the query parser automatically added double quotes and blanks. But this does not work for our
English or German queries.

If I use the PositionHackAnalyzerWrapper and the case with *, I get no results; the query is:
+anotherfieldname:description +myfieldname:*在电力虎*

Robert Muir | 1 Jul 2010 14:35

Re: Lucene and Chinese language

2010/7/1 Kolhoff, Jacqueline - ENCOWAY <Kolhoff <at> encoway.de>

> As you can see, the query parser automatically added double quotes and
> blanks. But this does not work for our English or German queries.
>
> If I use the PositionHackAnalyzerWrapper and the case with * I got no
> results, query is:
> +anotherfieldname:description +myfieldname:*在电力虎*
>
> If I remove the * the query is:
> +anotherfieldname:description
> +(myfieldname:在 myfieldname:电 myfieldname:力 myfieldname:虎)
>
> and I got results but not for German or English queries.
>

> Weird?

It's working correctly. Your Chinese wildcard query doesn't make sense,
because you haven't indexed the text in a way that supports queries like
that (you have indexed individual characters).
In practice this is where you would use a Chinese phrase query, "在电力虎"
(with quotes), instead of *... but if you use the PositionFilter hack, you
can't do phrase queries.
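
The point can be illustrated with a small plain-Java model (no Lucene involved; UnigramWildcardSketch and its methods are hypothetical names). It assumes that a wildcard query of the form *x* is tested against each indexed term separately, and that the Chinese text was indexed as one term per character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnigramWildcardSketch {

    // One token per character: a model of how the Chinese text was indexed.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Models the wildcard query *needle*: it matches only if some single
    // indexed term contains the whole needle.
    static boolean wildcardMatches(List<String> indexedTerms, String needle) {
        for (String term : indexedTerms) {
            if (term.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Models a phrase query: the needle's tokens must occur consecutively.
    static boolean phraseMatches(List<String> indexedTerms, String needle) {
        List<String> queryTerms = unigrams(needle);
        return !queryTerms.isEmpty()
            && Collections.indexOfSubList(indexedTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        List<String> chinese = unigrams("佛山东方书城");
        System.out.println(wildcardMatches(chinese, "东方")); // false: no single term contains it
        System.out.println(phraseMatches(chinese, "东方"));   // true: the terms are adjacent

        // An English word is indexed as one term, so *text* can match inside it.
        List<String> english = Arrays.asList("some", "context");
        System.out.println(wildcardMatches(english, "text")); // true
    }
}
```

This is why the same *...* rewriting works for English and German words but finds nothing in text that was indexed character by character.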

--
Robert Muir
rcmuir <at> gmail.com


Re: Lucene and Chinese language

OK, understood!

So is it better to use another analyzer at index time for the Chinese case, or do you suggest using another
"QueryParser" at query time?
