Dilemma - Very Frequent Synonym updates for Huge Index
Ravi Kiran <ravi.bhaskar <at> gmail.com>
2010-07-01 04:57:24 GMT
Hello,
Hoping some Solr guru can help me out here. We are a news
organization trying to migrate 10 million documents from FAST to Solr. The
plan is to have our editorial team add or modify synonyms multiple times a
day as they deem appropriate. Hence we plan on using query-time synonyms,
as we cannot reindex every time they modify the synonyms file (for the
entities extracted by OpenNLP, such as locations, organizations, and person
names, from the article body). Since the synonyms are for names, I am
concerned that the multi-word synonym issue will crop up with query-time
synonyms. For example, the synonyms could be as follows:
The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
DHS, D.H.S, D.H.S., Department of Homeland Security, Homeland Security
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
Barack Obama, Barack H. Obama, Barack Hussein Obama, President Obama
Hillary Clinton, Hillary R. Clinton, Hillary Rodham Clinton, Secretary Clinton, Sen. Clinton
William J. Clinton, William Jefferson Clinton, President Clinton, President Bill Clinton
Virginia, Va., VA
D.C, Washington D.C, District of Columbia
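
To make the multi-word concern concrete, here is a rough Python sketch. It is a simplified model and not Solr code: the function names are hypothetical, and it assumes the query parser splits the query on whitespace before the synonym lookup ever sees it. It contrasts looking up the whole query string as one token (the keyword-tokenizer behaviour) with looking up each word separately.

```python
def parse_synonyms(lines):
    """Map each lowercased variant to the full set of variants in its group."""
    table = {}
    for line in lines:
        variants = [v.strip() for v in line.split(",") if v.strip()]
        group = frozenset(v.lower() for v in variants)
        for variant in group:
            table[variant] = group
    return table


def expand_keyword(query, table):
    """Treat the whole query as a single token (keyword-tokenizer style)."""
    key = query.strip().lower()
    return table.get(key, frozenset([key]))


def expand_whitespace_split(query, table):
    """Look up each whitespace-separated word on its own (assumed parser behaviour)."""
    words = [w.lower() for w in query.split()]
    return [table.get(w, frozenset([w])) for w in words]


# Two groups taken from the example list above.
SYNONYMS = [
    "DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security",
    "Virginia, Va., VA",
]
TABLE = parse_synonyms(SYNONYMS)

if __name__ == "__main__":
    query = "Department of Homeland Security"
    print(sorted(expand_keyword(query, TABLE)))
    print([sorted(group) for group in expand_whitespace_split(query, TABLE)])
```

Under that assumption, the whole-string lookup finds the DHS group, while the word-by-word lookup finds nothing, because no single word of the phrase is an entry in the synonym table. Single-word entries such as Va. are unaffected.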
I have the following fieldType in schema.xml for the keywords/entities. What
issues should I be aware of? And is there a better way to achieve this
without having to reindex a million docs on each synonym change? NOTE that I
use tokenizerFactory="solr.KeywordTokenizerFactory" for the
SynonymFilterFactory to keep the words intact without splitting