Christopher Condit | 1 May 2010 01:05

RE: Modify TermQueries or Tokens

> 2) if I have to accept whole input string with all logic (AND, OR, ..) inside,
>    I think it is easier to change TermQuery afterwards than parsing the string,
>    since final result from query parser should be BooleanQuery (in your
> example),
>    then we iterate through each BooleanClause, if the clause is still
> BooleanQuery,
>    recursively go deep, if TermQuery, we may try to convert to
> WildCardQuery?

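protected static void wildcardQuery(Query query) {
	// Only BooleanQuery is handled; any other top-level query type is left untouched.
	if (query instanceof BooleanQuery) {
		BooleanQuery bQuery = (BooleanQuery) query;
		for (BooleanClause clause : bQuery.getClauses()) {
			if (clause.getQuery() instanceof TermQuery) {
				// Replace the term in place with a *term* wildcard on the same field.
				Term term = ((TermQuery) clause.getQuery()).getTerm();
				clause.setQuery(new WildcardQuery(new Term(term.field(), "*" + term.text() + "*")));
			} else {
				// Recurse into nested BooleanQuery clauses; other query types fall through unchanged.
				wildcardQuery(clause.getQuery());
			}
		}
	}
}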

Does the trick (as long as the original query is a BooleanQuery). Thanks for the suggestion!
-Chris
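
Editor's sketch: the method above does nothing when the parsed query is a bare TermQuery rather than a BooleanQuery. One possible way around that is to return a rewritten query instead of mutating clauses in place. The following is a minimal, self-contained illustration of that idea only. The types `Node`, `TermNode`, `PhraseNode` and `BoolNode` are hypothetical stand-ins, not Lucene classes, and the sketch makes no claim about how Lucene itself should be called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in query types for illustration only; these are NOT Lucene classes.
class WildcardRewriteSketch {

    interface Node {}

    static final class TermNode implements Node {
        final String field;
        final String text;
        TermNode(String field, String text) { this.field = field; this.text = text; }
        @Override public String toString() { return field + ":" + text; }
    }

    static final class PhraseNode implements Node {
        final String field;
        final String phrase;
        PhraseNode(String field, String phrase) { this.field = field; this.phrase = phrase; }
        @Override public String toString() { return field + ":\"" + phrase + "\""; }
    }

    static final class BoolNode implements Node {
        final List<Node> clauses;
        BoolNode(List<Node> clauses) { this.clauses = clauses; }
        @Override public String toString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < clauses.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(clauses.get(i));
            }
            return sb.append(')').toString();
        }
    }

    // Returns a rewritten copy, so a bare term at the top level is handled too.
    static Node wildcard(Node query) {
        if (query instanceof TermNode) {
            TermNode t = (TermNode) query;
            return new TermNode(t.field, "*" + t.text + "*");
        }
        if (query instanceof BoolNode) {
            List<Node> rewritten = new ArrayList<Node>();
            for (Node clause : ((BoolNode) query).clauses) {
                rewritten.add(wildcard(clause)); // recurse into nested clauses
            }
            return new BoolNode(rewritten);
        }
        return query; // phrases and anything else are left unchanged
    }

    public static void main(String[] args) {
        Node initial = new BoolNode(Arrays.<Node>asList(
                new TermNode("test1", "foo"),
                new PhraseNode("test1", "bar trumpet test"),
                new BoolNode(Arrays.<Node>asList(new TermNode("test1", "cheese")))));
        // Print both forms side by side for comparison.
        System.out.println("initial: " + initial);
        System.out.println("final:   " + wildcard(initial));
    }
}
```

Returning a new tree keeps the original query intact, which makes it easy to print the before and after forms next to each other when testing.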
Zhang, Lisheng | 1 May 2010 05:15

RE: Modify TermQueries or Tokens

Hi,

It looks good to me, though I have not tested it. When testing,
we could print both

initialQuery.toString() // query produced by QueryParser
finalQuery.toString()   // query after your new function

for comparison, in addition to checking the query results.

Best regards, Lisheng

-----Original Message-----
From: Christopher Condit [mailto:condit <at> sdsc.edu]
Sent: Friday, April 30, 2010 4:06 PM
To: java-user <at> lucene.apache.org
Subject: RE: Modify TermQueries or Tokens

> 2) if I have to accept whole input string with all logic (AND, OR, ..) inside,
>    I think it is easier to change TermQuery afterwards than parsing the string,
>    since final result from query parser should be BooleanQuery (in your
> example),
>    then we iterate through each BooleanClause, if the clause is still
> BooleanQuery,
>    recursively go deep, if TermQuery, we may try to convert to
> WildCardQuery?

protected static void wildcardQuery(Query query) {
		if (query instanceof BooleanQuery) {
			BooleanQuery bQuery = (BooleanQuery)query;

Koji Sekiguchi | 1 May 2010 05:18

FieldCache memory estimation - term values are interned?

Hello,

Are the Strings obtained via FieldCache.DEFAULT.getStrings( reader,
field ) interned?

Since I need FieldCaches for some fields in a 250M-document index,
I'd like to estimate the memory consumed by FieldCache.

By looking at the FieldCacheImpl source code, it seems that
field names are interned, but values are not?

Thank you,

Koji

-- 
http://www.rondhuit.com/en/
Yonik Seeley | 1 May 2010 16:27

Re: FieldCache memory estimation - term values are interned?

2010/4/30 Koji Sekiguchi <koji <at> r.email.ne.jp>:
> Are the Strings obtained via FieldCache.DEFAULT.getStrings( reader,
> field ) interned?
>
> Since I need FieldCaches for some fields in a 250M-document index,
> I'd like to estimate the memory consumed by FieldCache.
>
> By looking at FieldCacheImpl source code, it seems that
> field names are interned, but values are not?

Values are not interned, but in a single field cache entry (String[])
the same String object is used for all docs with that same value.

But... I think StringIndex is more commonly used in both Lucene and
Solr than String[] (sorting, faceting, etc) so double check that it's
not StringIndex you should be looking at.

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague
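
Editor's sketch: to make the String[] versus StringIndex distinction concrete, the following self-contained example models the general idea of a StringIndex-style layout: one array of unique values plus one int per document pointing into it. `StringIndexSketch` and its members are illustrative names chosen for this sketch, and reserving slot 0 for documents without a value is an assumption of the sketch; consult the FieldCache source for the actual layout.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustrative model only; not the Lucene FieldCache implementation.
class StringIndexSketch {
    final String[] lookup; // unique values in sorted order; slot 0 means "no value"
    final int[] order;     // order[doc] = position of that doc's value in lookup

    StringIndexSketch(String[] perDocValues) {
        // Collect the distinct non-null values in sorted order.
        TreeSet<String> unique = new TreeSet<String>();
        for (String v : perDocValues) {
            if (v != null) unique.add(v);
        }
        lookup = new String[unique.size() + 1];
        int next = 1;
        for (String v : unique) {
            lookup[next++] = v;
        }
        // Store one small int per document instead of one reference per document.
        order = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            String v = perDocValues[doc];
            order[doc] = (v == null) ? 0 : Arrays.binarySearch(lookup, 1, lookup.length, v);
        }
    }

    String value(int doc) {
        return lookup[order[doc]];
    }

    public static void main(String[] args) {
        StringIndexSketch idx =
                new StringIndexSketch(new String[] {"male", "female", null, "male"});
        System.out.println(Arrays.toString(idx.lookup)); // [null, female, male]
        System.out.println(Arrays.toString(idx.order));  // [2, 1, 0, 2]
    }
}
```

With this layout the per-document cost is a fixed-size int, and the String objects are paid for once per distinct value, which is why low-cardinality fields are the favourable case.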
Christopher Condit | 1 May 2010 19:42

RE: Modify TermQueries or Tokens

> It looks good to me, but I did not test, when testing, we may print out both
> 
> initialQuery.toString() // query produced by QueryParser
> finalQuery.toString()   // query after your new function
> 
> as comparison, besides testing the query result.

Yes - it's exactly what I wanted:
Test Input: "foo OR baz AND (\"bar trumpet test\" OR (cheese AND nacho~.8))"
initialQuery.toString(): (test1:foo test2:foo) +(test1:baz test2:baz) +((test1:"bar trumpet test"
test2:"bar trumpet test") (+(test1:cheese test2:cheese) +(test1:nacho~0.5 test2:nacho~0.5)
(test1:8 test2:8)))
finalQuery.toString(): (test1:*foo* test2:*foo*) +(test1:*baz* test2:*baz*) +((test1:"bar
trumpet test" test2:"bar trumpet test") (+(test1:*cheese* test2:*cheese*) +(test1:nacho~0.5
test2:nacho~0.5) (test1:*8* test2:*8*)))

Thanks again,
-Chris
Koji Sekiguchi | 2 May 2010 02:23

Re: FieldCache memory estimation - term values are interned?

Yonik Seeley wrote:
> 2010/4/30 Koji Sekiguchi <koji <at> r.email.ne.jp>:
>   
>> Are the Strings obtained via FieldCache.DEFAULT.getStrings( reader,
>> field ) interned?
>>
>> Since I need FieldCaches for some fields in a 250M-document index,
>> I'd like to estimate the memory consumed by FieldCache.
>>
>> By looking at FieldCacheImpl source code, it seems that
>> field names are interned, but values are not?
>>     
>
> Values are not interned, but in a single field cache entry (String[])
> the same String object is used for all docs with that same value.
>
>   
Yeah, you are right. I asked because any two Strings with the same value in
the String[] returned by getStrings() were identical (== is true), yet I
couldn't find an explicit intern() call for them. Can you elaborate on where
this is done?

> But... I think StringIndex is more commonly used in both Lucene and
> Solr than String[] (sorting, faceting, etc) so double check that it's
> not StringIndex you should be looking at.
>
>   
Yes, I think StringIndex is better than String[] for my requirement in terms

Yonik Seeley | 2 May 2010 02:42

Re: FieldCache memory estimation - term values are interned?

On Sat, May 1, 2010 at 8:23 PM, Koji Sekiguchi <koji <at> r.email.ne.jp> wrote:
> Yonik Seeley wrote:
>> Values are not interned, but in a single field cache entry (String[])
>> the same String object is used for all docs with that same value.
>
> Yeah, you are right. I asked because any two Strings with the same value in
> the String[] returned by getStrings() were identical (== is true), yet I
> couldn't find an explicit intern() call for them. Can you elaborate on where
> this is done?

FieldCacheImpl:675 on trunk.

For every term, a string value is constructed... then TermDocs (now
DocsEnum in trunk) steps over every doc with that value and sets the
same value.

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague
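
Editor's sketch: the fill loop described in this message can be modelled without Lucene to show why `==` holds between documents even though nothing is interned. In the example below, `postings` (a map from term text to the ids of the documents containing it) is a hypothetical stand-in for enumerating terms and stepping over their documents; it is not the FieldCacheImpl code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of "one String per term, shared by all docs with that term".
class SharedTermValueSketch {

    static String[] fill(int maxDoc, Map<String, int[]> postings) {
        String[] values = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            // One fresh String object is created per term; intern() is never called.
            String termValue = new String(entry.getKey());
            for (int doc : entry.getValue()) {
                values[doc] = termValue; // every doc with this term shares the object
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<String, int[]>();
        postings.put("female", new int[] {1});
        postings.put("male", new int[] {0, 2});
        String[] values = fill(3, postings);

        System.out.println(values[0] == values[2]);   // true: same object for docs 0 and 2
        System.out.println(values[0] == "male");      // false: not the interned literal
        System.out.println(values[0].equals("male")); // true: equal content
    }
}
```

The sharing is therefore local to one cache entry: a second entry built the same way would hold its own, separate String objects for the same values.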

>> But... I think StringIndex is more commonly used in both Lucene and
>> Solr than String[] (sorting, faceting, etc) so double check that it's
>> not StringIndex you should be looking at.
>>
>>
>
> Yes, I think StringIndex is better than String[] for my requirement in terms
> of memory consumption as well (an extreme case, male/female field etc.).
>

Koji Sekiguchi | 2 May 2010 04:03

Re: FieldCache memory estimation - term values are interned?

Yonik Seeley wrote:
> On Sat, May 1, 2010 at 8:23 PM, Koji Sekiguchi <koji <at> r.email.ne.jp> wrote:
>   
>> Yonik Seeley wrote:
>>     
>>> Values are not interned, but in a single field cache entry (String[])
>>> the same String object is used for all docs with that same value.
>>>       
>> Yeah, you are right. I asked because any two Strings with the same value in
>> the String[] returned by getStrings() were identical (== is true), yet I
>> couldn't find an explicit intern() call for them. Can you elaborate on where
>> this is done?
>>     
>
> FieldCacheImpl:675 on trunk.
>
> For every term, a string value is constructed... then TermDocs (now
> DocsEnum in trunk) steps over every doc with that value and sets the
> same value.
>
>   
Ah, I had looked at that code but didn't catch on.
Thank you very much for taking the time!

Koji

-- 
http://www.rondhuit.com/en/

Avi Rosenschein | 2 May 2010 14:50

Re: Relevancy Practices

On 4/30/10, Grant Ingersoll <gsingers <at> apache.org> wrote:
>
> On Apr 30, 2010, at 8:00 AM, Avi Rosenschein wrote:
>> Also, tuning the algorithms to the users can be very important. For
>> instance, we have found that in a basic search functionality, the default
>> query parser operator OR works very well. But on a page for advanced
>> users,
>> who want to very precisely tune their search results, a default of AND
>> works
>> better.
>
> Avi,
>
> Great example.  Can you elaborate on how you arrived at this conclusion?
> What things did you do to determine it was a problem?
>
> -Grant

Hi Grant,

Sure. On http://wiki.answers.com/, we use search in a variety of
places and ways.

In the basic search box (what you get if you look stuff up in the main
Ask box on the home page), we generally want the relevancy matching to
be pretty fuzzy. For example, if the user looked up "Where can you see
photos of the Aurora Borealis effect?" I would still want to show them
"Where can you see photos of the Aurora Borealis?" as a match.

However, the advanced search page,

Ahmet Arslan | 2 May 2010 21:48

Re: Lucene QueryParser and Analyzer

> I think I've figured out what the problem is. Given the inputs,
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 gets parsed as
> Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
> whereas Input2 gets parsed as
> Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text: "C8C9C10")
>
> That is, Lucene constructs the query and then passes the query text
> through the analyzer. Is there any way to force QueryParser to pass the
> input string through the analyzer before creating the query? That is,
> force Lucene to create Query2 for both Input1 and Input2.

QueryParser tokenizes the input query string on whitespace. You can alter this behavior in two ways:

1) escaping whitespace with a backslash, e.g. C1C2\ C3C4\ C5C6\ C7\ C8C9C10
2) using quotes, e.g. "C1C2  C3C4  C5C6  C7  C8C9C10"
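
Editor's sketch: both options can be applied programmatically before the string is handed to QueryParser. The helpers below, `escapeWhitespace` and `asPhrase`, are hypothetical names invented for this sketch and are not part of Lucene; they only prepare the query text. Which Query the parser then builds still depends on the analyzer in use, so treat this as a starting point rather than a guaranteed fix.

```java
// String preparation only; no Lucene classes are involved.
class QueryStringEscapeSketch {

    // Option 1: put a backslash in front of every whitespace character.
    static String escapeWhitespace(String input) {
        // In the replacement, "\\\\" is one literal backslash and $0 is the matched whitespace.
        return input.replaceAll("\\s", "\\\\$0");
    }

    // Option 2: wrap the input in double quotes, escaping embedded backslashes and quotes.
    static String asPhrase(String input) {
        String escaped = input.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        String input = "C1C2 C3C4 C5C6 C7 C8C9C10";
        System.out.println(escapeWhitespace(input)); // C1C2\ C3C4\ C5C6\ C7\ C8C9C10
        System.out.println(asPhrase(input));         // "C1C2 C3C4 C5C6 C7 C8C9C10"
    }
}
```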

      
