Zheng Lin Edwin Yeo | 13 Oct 11:04 2015

Highlighting content field problem when using JiebaTokenizerFactory


I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
Solr. The segmentation works fine when I use the Analysis function in the
Solr Admin UI.

However, when I try to do highlighting in Solr, it does not highlight in the
correct place. For example, when I search for 自然环境与企业本身,
it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的

Even when I search for an English word like responsibility, it highlights
 <em> *responsibilit<em>*y.

Basically, the highlighting is consistently off by one character or space.

This problem only happens in the content field, and not in any other field.
Does anyone know what could be causing the issue?

I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.

Zheng Lin Edwin Yeo | 13 Oct 10:35 2015

Indexing Solr in production


What is the best practice for indexing in Solr in a production system? I'm
using Solr 5.3.0.

I understand that post.jar does not have things like robustness checks and
retries, which are important in production, as certain records can sometimes
fail during indexing, and we need to retry the indexing for those records
that fail.

Normally, do we need to write a new custom handler in order to achieve all
of this? I want to find out what most people do before I decide on a method
and proceed to the next step.

Thank you.

vit | 12 Oct 22:34 2015

EdgeNGramFilterFactory for phrases

I use Solr 4.2.
I created a field with the following analyzer:
			<tokenizer class="solr.WhitespaceTokenizerFactory"/> 
			<filter class="solr.LowerCaseFilterFactory"/> 
			<filter class="solr.KStemFilterFactory" />
			<filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
This is used for both index and search.
Maybe KStem is overkill, but I do not think it is important here.
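
For reference, since the paste above is cut off, a complete field type along
these lines might look like the sketch below; the type name, the maxGramSize
value and the analyzer wrapper are assumptions on my part:

    <fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KStemFilterFactory"/>
        <!-- maxGramSize is assumed; the original paste is truncated at this point -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
      </analyzer>
    </fieldType>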

On a phrase search for "Peak physical" it returns the result:
"Peak Physical Therapy Physical Therapy Of Brooklyn"

For "Peak Physica"
it returns the same result.

BUT for "Pea Physical"
it does not return anything. Why?


Salman Ansari | 12 Oct 22:24 2015

AutoComplete Feature in Solr


I have been trying to get the autocomplete feature in Solr working, with no
luck so far. First I read that the "suggest component" is the recommended
way, as in the article below (and this is the exact functionality I am
looking for, which is to autocomplete multiple words).

Then I tried implementing the Suggester as described in the following
articles, in this order:
1) https://wiki.apache.org/solr/Suggester#SearchHandler_configuration
2) http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/  (I
implemented phrase suggestions)
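
For context, the kind of Suggester setup described in the first article looks
roughly like the sketch below; the component/handler names, the source field
and the lookup implementation are assumptions here, not my actual configuration:

    <searchComponent name="suggest" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <!-- the exact lookup implementation varies by Solr version; assumed here -->
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
        <!-- field to draw suggestions from (assumed name) -->
        <str name="field">name</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest</str>
        <str name="spellcheck.count">5</str>
        <!-- collate is what assembles multi-word suggestions into one phrase -->
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>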

Still no luck; after implementing each article, when I run my query as

I get
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>

This is despite the fact that I have an entry for Barack Obama in my index.
I am posting my Solr configuration as well.

(Continue reading)

Mark Fenbers | 12 Oct 21:37 2015

File-based Spelling


I'm attempting to use a file-based spell checker.  My sourceLocation is 
/usr/share/dict/linux.words, and my spellcheckIndexDir is set to 
./data/spFile.  BuildOnStartup is set to true, and I see nothing to 
suggest any sort of problem/error in solr.log.  However, in my 
./data/spFile/ directory, there are only two files: segments_2 with only 
71 bytes in it, and a zero-byte write.lock file.  For a source 
dictionary having 480,000 words in it, I was expecting a bit more 
substance in the ./data/spFile directory.  Something doesn't seem right 
with this.
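
For reference, the relevant part of my solrconfig.xml looks roughly like this
(the component name and the queryAnalyzerFieldType in this sketch are
assumptions; the paths and flags are as described above):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <!-- field type used to analyze query terms; assumed -->
      <str name="queryAnalyzerFieldType">text_en</str>
      <lst name="spellchecker">
        <str name="name">file</str>
        <str name="classname">solr.FileBasedSpellChecker</str>
        <str name="sourceLocation">/usr/share/dict/linux.words</str>
        <str name="characterEncoding">UTF-8</str>
        <str name="spellcheckIndexDir">./data/spFile</str>
        <str name="buildOnStartup">true</str>
      </lst>
    </searchComponent>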

Moreover, I ran a query on the word Fenbers, which isn't listed in the 
linux.words file, but there are several similar words.  The results I 
got back were odd, and suggestions included the following:
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

But I expected suggestions like fenders, embers, and fenberry, etc. I 
also ran a query on Mark (which IS listed in linux.words) and got back 
two suggestions in a similar format.  I played with configurables like 
changing the fieldType from text_en to string and the characterEncoding 
from UTF-8 to ASCII, etc., but nothing seemed to yield any different 
(Continue reading)

elisabeth benoit | 12 Oct 14:39 2015

catchall fields or multiple fields


We're using Solr 4.10 and storing all data in a catchall field. It seems to
me that one good reason for using a catchall field is when scoring with idf
(with idf, a word might not have the same score in all fields). We got rid
of idf and are now considering using multiple fields. I remember reading
somewhere that using a catchall field might speed up search. I was wondering
if some of you have any opinion (or experience) related to this subject.
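
For concreteness, the catchall setup we use is essentially a schema copyField
rule like the sketch below (the field and type names here are placeholders,
not our real names):

    <field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <!-- copy every source field into the single searchable catchall field -->
    <copyField source="*" dest="catchall"/>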

Best regards,

MOIS Martin (MORPHO) | 12 Oct 10:01 2015

Replication and soft commits for NRT searches


I am running Solr 5.2.1 in a cluster with 6 nodes. My collections have been created with
replicationFactor=2, i.e. I have one replica for each shard. Beyond that, I am using
autoCommit/maxDocs=10000 and autoSoftCommit/maxDocs=1 in order to achieve near-real-time search behavior.
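
For reference, those commit settings correspond to a solrconfig.xml section
roughly like the following (the openSearcher flag is an assumption on my part):

    <autoCommit>
      <maxDocs>10000</maxDocs>
      <!-- assumed: hard commits usually don't open a searcher when soft commits handle visibility -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxDocs>1</maxDocs>
    </autoSoftCommit>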

As far as I understand from section "Write Side Fault Tolerance" in the documentation
(https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance), I
cannot enforce that an update gets replicated to all replicas, but I can only get the achieved replication
factor by requesting the return value rf.

My question is now: what exactly does rf=2 mean? Does it only mean that the replica has written the update to
its transaction log? Or has the replica also performed the soft commit as configured with
autoSoftCommit/maxDocs=1? The answer is important for me, because if the update were only written to the
transaction log, I could not search for it reliably, as the replica may not have added it to the searchable index.

My second question is: does rf=1 mean that the update was definitely not successful on the replica, or could
it also represent a timeout of the replication request from the shard leader? If it could also represent a
timeout, then there would be a small chance that the replication was successful despite the timeout.

Is there a way to retrieve the replication factor for a specific document after the update in order to check
if replication was successful in the meantime?

Thanks in advance.

Best Regards,
Martin Mois
" This e-mail and any attached documents may contain confidential or proprietary information. If you are
not the intended recipient, you are notified that any dissemination, copying of this e-mail and any
(Continue reading)

Prasanna S. Dhakephalkar | 12 Oct 09:47 2015

How to formulate query


I am trying to formulate a Solr search query that returns results as
described below, but I have been unable to do so.

I have a search term, say "pit".

The results should be ranked in this order:

1) All docs that have "pit" as the first WORD in the search field  (pit\ *)+

2) All docs whose first WORD starts with "pit"  (pit*\  *)+

3) All docs that have "pit" as a WORD anywhere in the search field (except first)  (*\ pit\ *)+

4) All docs that have a WORD starting with "pit" anywhere in the search field (except first)  (*\ pit*\ *)+

5) All docs that have "pit" as a string anywhere in the search field, except the cases covered above  (*pit*)

Example :

Pit the pat 

Pit digger

Pitch ball

(Continue reading)

Arnon Yogev | 12 Oct 09:33 2015

Spell Check and Privacy


Our system supports many users from different organizations and with
different ACLs.
We are considering adding spell check ("did you mean") functionality using
DirectSolrSpellChecker. However, a privacy concern was raised, as this
might lead to private information being revealed between users via the
suggested terms. Using the FileBasedSpellChecker is another option, but
naturally a static list of terms is not optimal.
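
For reference, the DirectSolrSpellChecker we are considering would be
configured roughly as below; it draws suggestions directly from the terms of
the indexed field, which is exactly where the privacy concern comes from (the
field name in this sketch is a placeholder):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <!-- suggestions come straight from the terms indexed in this field,
             regardless of which user's documents contributed them -->
        <str name="field">text</str>
      </lst>
    </searchComponent>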

Is there a best practice or a suggested method for this kind of case?

Zheng Lin Edwin Yeo | 12 Oct 06:05 2015

Indexing logs when using post.jar


I am using Solr 5.3.0, and I would like to find out: are the logs for
indexing with post.jar stored anywhere in Solr?

I would need to know which files have been successfully indexed and which
have not, so that I can re-run the indexing for those files which were not
indexed successfully for various reasons.

Thank you.

Peter Sturge | 11 Oct 16:51 2015

Grouping facets: Possible to get facet results for each Group?

Hello Solr Forum,

I've been trying to coerce grouped faceting into giving some facet results
back for each group, but maybe this use case isn't catered for in Grouping?

So the Use Case is this:
Let's say I do a grouped search that returns, say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of the 'user'
field) are aggregated across all the groups, not on a 'per-group' basis (i.e.
they are returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values of 'user' are
in which group. If the number of doc hits is very large (it can easily be in
the hundreds of thousands), it's not practical to iterate through the docs
looking for unique values.
This Use Case requires the unique values within each group, rather than
the total doc hits.
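
To make the request concrete, the kind of query I am describing corresponds
roughly to the parameters sketched below as handler defaults (the handler
name is a placeholder; only group.field=host and facet.field=user come from
the description above):

    <requestHandler name="/grouped" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="group">true</str>
        <str name="group.field">host</str>
        <str name="facet">true</str>
        <str name="facet.field">user</str>
      </lst>
    </requestHandler>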

Is this possible with grouping, or in conjunction with another module?

Many thanks,