Chris Hostetter | 1 Jan 2010 04:26

Re: numFound is changing when query across distributed-seach with the same query.


: You probably have duplicates (docs on different shards with the same id).
: Deeper paging will detect more of them.
: It does raise the question of if we should be changing numFound, or
: indicating a separate duplicate count.  Duplicates aren't eliminated

random thought (from someone who's never really considered distributed 
searching in much depth) ...

why do we bother detecting/removing the duplicates?

strictly speaking, docs with duplicate IDs on multiple shards is a "garbage 
in" situation. i can understand Solr taking a little extra effort to 
not fail hard if this situation is encountered, but why update the 
numFound at all, or remove the duplicates from the list? ... why not leave 
them in as is?  (then numFound would never change)

-Hoss

Chris Hostetter | 1 Jan 2010 04:30

RE: Can solr web site have multiple versions of online API doc?


: This doesn't solve my problem.  I can't write my javadoc comments
: referencing to Solr API doc located in my local hard drive. 

Why not?

: are available online at well-defined URLs.  I'd like to have
: Solr API docs available in the similar manner.

patches (to the website) are welcome! ...

http://svn.apache.org/repos/asf/lucene/solr/trunk/src/site/
http://svn.apache.org/repos/asf/lucene/solr/trunk/site/
http://wiki.apache.org/solr/Website_Update_HOWTO

-Hoss

Chris Hostetter | 1 Jan 2010 04:36

RE: Search both diacritics and non-diacritics


: I have followed it, but if I query with diacritics it responds only with
: non-diacritics. But I want to query without diacritics and then have Solr
: respond with both diacritic and non-diacritic matches :( 

What is "it" that you have done? ... can you show us your config?

The diacritic folding issue is essentially the same as lowercasing ... you 
have a "strict" fieldtype where you don't lowercase/fold at index or query 
time, and then you have a "flattened" field where you lowercase/fold at 
index time but not at query time ... then you query both using dismax 
(possibly with tie=0.0)

if the user searches w/diacritics, they will only get a match against the 
strict field for documents that also had those exact diacritics (they won't 
get any match on the flattened field).  if they query w/o diacritics they 
will get a match on the strict field for any doc that didn't have 
diacritics, and they will also get matches on the flattened field for docs 
that did have diacritics (but still match because they were flattened)
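
A minimal sketch of the two-field idea in plain Python (not Solr config), assuming a toy inverted index and a hypothetical fold() helper that strips combining marks and lowercases. The "strict" index keeps tokens as-is, the "flattened" index folds at index time only, and neither folds the query:

```python
import unicodedata

def fold(token):
    # Hypothetical folding analyzer: decompose, drop combining marks, lowercase.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_index(docs, analyzer):
    # Map each analyzed token to the set of doc ids that contain it.
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(analyzer(token), set()).add(doc_id)
    return index

def search(query, strict_index, flat_index):
    # The query term is used verbatim against both fields (no query-time folding).
    return {
        "strict": strict_index.get(query, set()),
        "flattened": flat_index.get(query, set()),
    }

docs = {1: "caf\u00e9", 2: "cafe"}            # doc 1 has the diacritic, doc 2 does not
strict_index = build_index(docs, lambda t: t)  # no folding at index time
flat_index = build_index(docs, fold)           # folding at index time only

with_diacritic = search("caf\u00e9", strict_index, flat_index)
without_diacritic = search("cafe", strict_index, flat_index)
```

Here the query with the diacritic only hits doc 1 via the strict index, while the plain query hits doc 2 via the strict index and both docs via the flattened index.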

-Hoss

Chris Hostetter | 1 Jan 2010 04:39

Re: serialize SolrInputDocument to java.io.File and back again?


: I want to store a SolrInputDocument to the filesystem until it can be sent
: to the solr server via the solrj client.

out of curiosity: why? ... what is the anticipated delay that leads you to 
the expectation that there will be an "until it can be sent to the solr 
server" situation?

-Hoss

Chris Hostetter | 1 Jan 2010 04:47

Re: how to do a Parent/Child Mapping using entities


: You could easily write your own query parser (QParserPlugin, in Solr's
: terminology) that internally translates queries like
: 
: 	 q = res_url:url AND res_rank:rank
: 
: into
: 	q = res_ranked_url:"rank url"
: 
: thus hiding the res_ranked_url field from the user/client.
: 
: I'm not sure, but maybe it's possible to utilize the order of values within
: the multi-valued field res_url directly in the newly created parser. This

It is possible to use FieldMaskingSpanQuery ... it lets you build a 
SpanNearQuery that requires a match in one field to be "near" a match in 
another field (ie: at the same position, or within some amount of slop)

so then you could find all docs where "url:A" and "rank:2" both occur at 
the same position (in a multi-valued field) but SpanQueries don't play 
nicely with range queries, so you wouldn't be able to find docs where 
url:A and rank:[* TO 5] at the same position.
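
To make the "same position" constraint concrete, here is a plain-Python illustration (not the Lucene span API), assuming each doc stores res_url and res_rank as parallel multi-valued lists. The doc ids and values are made up for the example:

```python
def docs_with_aligned_match(docs, url, rank):
    # Return ids of docs where `url` and `rank` occur at the same list position.
    matches = []
    for doc_id, fields in docs.items():
        pairs = zip(fields["res_url"], fields["res_rank"])
        if any(u == url and r == rank for u, r in pairs):
            matches.append(doc_id)
    return matches

# Both docs contain url "A" and rank 2, but only doc1 has them at the same position.
docs = {
    "doc1": {"res_url": ["A", "B"], "res_rank": [2, 1]},
    "doc2": {"res_url": ["A", "B"], "res_rank": [1, 2]},
}

aligned = docs_with_aligned_match(docs, "A", 2)
```

A plain boolean query (url:A AND rank:2) would match both docs; the position-aware match keeps only doc1.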

-Hoss

Chris Hostetter | 1 Jan 2010 04:55

Re: Using IDF to find Collactions and SIPs . . ?


: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write

Assuming I understand what you want, I would look into using the 
ShingleFilter to build up Tokens consisting of N->M tokens, then you could 
just facet on that field to see the really common "phrases" or use the 
TermsComponent to get them as well...
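
A rough sketch of the idea in plain Python, assuming whitespace tokenization and made-up sample docs: build token n-grams ("shingles") of N to M tokens, then count how many docs contain each one, which is roughly what faceting on such a field would report.

```python
from collections import Counter

def shingles(tokens, min_size, max_size):
    # All contiguous runs of min_size..max_size tokens, joined with spaces.
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

def shingle_doc_counts(docs, min_size=2, max_size=3):
    # Count the number of docs each shingle appears in (not total occurrences).
    counts = Counter()
    for text in docs:
        counts.update(set(shingles(text.lower().split(), min_size, max_size)))
    return counts

docs = ["the quick brown fox", "a quick brown dog", "the lazy dog"]
counts = shingle_doc_counts(docs)
most_common_phrase = counts.most_common(1)
```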

: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging. 

...as i mentioned, you can use the TermsComponent to get terms and their 
document count ... it has a terms.maxcount param so you can use that to 
limit the output to only terms that appear in no more than X documents.
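
For illustration, a request along these lines could be assembled as below. The host, the /terms handler path, and the field name "phrases" are assumptions about a particular setup, not something given in this thread:

```python
from urllib.parse import urlencode

def terms_request_url(base_url, field, max_doc_count, limit=20):
    # Build a TermsComponent request that only returns terms whose
    # document count is at most max_doc_count.
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.maxcount": max_doc_count,
        "terms.limit": limit,
    }
    return base_url + "?" + urlencode(params)

# Assumed handler URL and field name, adjust for your own solrconfig/schema.
url = terms_request_url("http://localhost:8983/solr/terms", "phrases", 3)
```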

That said: These are possible ways of solving these types of problems 
using Solr, which can be handy if you are building a Solr for other things 
in general -- but if you are just trying to do a one-time analysis of a 
large corpus of data (or even a many-time analysis of a corpus that 
changes very frequently) w/o needing any of Solr's other features then you 
may find that you can accomplish this type of task much more simply (and 
probably faster) with some simple map/reduce jobs in Hadoop.

-Hoss

Chris Hostetter | 1 Jan 2010 05:04

RE:Delete, commit, optimize doesn't reduce index file size


: Is there another way to make this happen without making further changes 
: to the index? Maybe a bounce of the servlet server?

Lucene only actively tries to delete these files when it needs to write to 
the directory for some reason, so unfortunately no.

I've opened a Jira issue with some misc thoughts on how we might improve 
this...

https://issues.apache.org/jira/browse/SOLR-1691

-Hoss

Chris Hostetter | 1 Jan 2010 05:06

Re: Automating implementation of SolrInfoMBean


: Is there a standard way to automatically update the values returned by
: the methods in SolrInfoMBean? Particularly those concerning revision
: control etc. I'm assuming folks don't just update that by hand every
: commit...

I'm not sure that i understand your question, but it really depends on 
your revision control system...

http://svnbook.red-bean.com/en/1.4/svn.advanced.props.special.keywords.html
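
With Subversion keyword substitution enabled, a placeholder such as $Revision$ in a source file is expanded to something like "$Revision: 894345 $" on checkout. A small sketch of pulling the value back out of such a string, assuming your info methods return the raw keyword string (the revision number and helper name are made up):

```python
import re

# Matches "$Name$" (unexpanded) or "$Name: value $" (expanded).
KEYWORD_PATTERN = re.compile(r"^\$(\w+)(?::\s*(.*?))?\s*\$$")

def parse_svn_keyword(raw):
    # Return (keyword_name, value); value is None if the keyword is unexpanded.
    match = KEYWORD_PATTERN.match(raw)
    if match is None:
        return None
    name, value = match.groups()
    return name, value or None

expanded = parse_svn_keyword("$Revision: 894345 $")
unexpanded = parse_svn_keyword("$Revision$")
```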

-Hoss

johnson hong | 1 Jan 2010 09:31

Re: numFound is changing when query across distributed-seach with the same query.


Yonik Seeley-2 wrote:
> 
> On Thu, Dec 31, 2009 at 2:29 AM, johnson hong
> <hong.jincheng <at> goodhope.net> wrote:
>>
>> Hi, all.
>>    I found a problem with distributed search.
>>    When I use "?q=keyword&start=0&rows=20" to query across
>> distributed search, it returns numFound="181"; then when I
>>    change the start param from 0 to 100, it returns numFound="131".
> 
> You probably have duplicates (docs on different shards with the same id).
> Deeper paging will detect more of them.
> It does raise the question of if we should be changing numFound, or
> indicating a separate duplicate count.  Duplicates aren't eliminated
> from things like faceting or statistics, so it might be nice to have a
> number that was consistent with those numbers.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
Thank you Yonik, Happy New Year all.
I will check the index soon after the festival.


Yonik Seeley | 1 Jan 2010 16:10

Re: numFound is changing when query across distributed-seach with the same query.

On Thu, Dec 31, 2009 at 10:26 PM, Chris Hostetter
<hossman_lucene <at> fucit.org> wrote:
> why do we bother detecting/removing the duplicates?
>
> strictly speaking, docs with duplicate IDs on multiple shards is a "garbage
> in" situation. i can understand Solr taking a little extra effort to
> not fail hard if this situation is encountered, but why update the
> numFound at all, or remove the duplicates from the list? ... why not leave
> them in as is?  (then numFound would never change)

Distrib search keys some things off of the unique id, so when we
encountered duplicates in the past it failed hard.  IIRC only keeping
one doc with the same id was actually the easiest way to not fail
hard.
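
A simplified sketch of why numFound can shrink with deeper paging. This is a toy model in plain Python, not Solr's actual merge code: it assumes each shard reports its own hit count plus its top (start + rows) docs, and the coordinator keeps one doc per id and lowers the summed count for every duplicate it happens to see.

```python
def shard_response(all_hits, start, rows):
    # A shard reports its total hit count and only its top (start + rows) hits.
    return len(all_hits), all_hits[: start + rows]

def merge_shard_results(responses, start, rows):
    # Sum per-shard counts, then drop (and un-count) duplicate ids seen while merging.
    num_found = sum(count for count, _ in responses)
    candidates = [hit for _, hits in responses for hit in hits]
    seen, merged = set(), []
    for doc_id, score in sorted(candidates, key=lambda hit: -hit[1]):
        if doc_id in seen:
            num_found -= 1      # duplicate detected only if it was fetched
            continue
        seen.add(doc_id)
        merged.append((doc_id, score))
    return num_found, merged[start : start + rows]

# Hypothetical shards: d3 and d4 exist on both.
shard_a = [("d1", 9.0), ("d2", 8.0), ("d3", 7.0), ("d4", 6.0)]
shard_b = [("d5", 8.5), ("d6", 7.5), ("d3", 7.0), ("d4", 6.0)]

def query(start, rows):
    responses = [shard_response(s, start, rows) for s in (shard_a, shard_b)]
    return merge_shard_results(responses, start, rows)

first_page = query(0, 2)    # duplicates not fetched yet, so not detected
second_page = query(2, 2)   # deeper page pulls in d3/d4 from both shards
```

In this toy data the first page reports 8 hits and the second page reports 6, because the duplicates only become visible once they fall inside the fetched window.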

-Yonik
http://www.lucidimagination.com

