Implementing custom analyzer for multi-language stemming
Eugene <beyondcompute <at> gmail.com>
2014-07-30 17:47:32 GMT
Hello, fellow Solr and Lucene users and developers!
In our project we receive text from users in different languages. We
detect the language automatically and use the Google Translate APIs a lot
(so having an arbitrary number of languages in our system doesn't concern
us). However, we need to be able to search using stemming. Listing nearly
a hundred fields (several fields for each language, each with a
language-specific stemmer) in our search query is not an option. So we
need a way to have a single index that holds stemmed tokens for different
languages. I have two questions:
1. Are there already (third-party) custom multi-language stemming
analyzers? (I doubt that we are the first to run into this issue.)
2. If I'm going to implement such an analyzer myself, could you please
suggest a good way to 'pass' the detected language value into it?
Detecting the language in the analyzer itself is not an option, because:
a) we already detect it elsewhere; b) we do it based on the combined values
of many fields ('name', 'topic', 'description', etc.), while the current
field can be too short for reliable detection; c) sometimes we just want to
specify the language explicitly. The obvious hack would be to prepend the
ISO 639-1 code to the field value. But I'd like to believe that Solr allows
for a cleaner solution. I can think of either: a) a custom query parameter
(but I guess it would require modifying request handlers, etc., which is
highly undesirable), or b) getting the value from another field (we
obviously have a 'language' field and we do not have mixed-language
records). If this is possible, could you please describe the mechanism for
doing it or point to relevant code examples?
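To make the prefix hack concrete, here is a minimal plain-Java sketch of
what I have in mind. It deliberately uses no Solr or Lucene API: the class
name, the '|' separator, the Map standing in for a record, and the toy
suffix-stripping "stemmers" are all made up for illustration. A real
analyzer would delegate to proper per-language stemmers instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of the dispatch logic only; not Solr/Lucene code. */
public class LangPrefixStemming {

    // Toy suffix strippers standing in for real per-language stemmers.
    static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
        "en", t -> t.endsWith("s") && t.length() > 3
                ? t.substring(0, t.length() - 1) : t,
        "de", t -> t.endsWith("en") && t.length() > 4
                ? t.substring(0, t.length() - 2) : t
    );

    /** Index-time step: prepend the record's 'language' value to a field. */
    static String tagWithLanguage(Map<String, String> doc, String field) {
        return doc.get("language") + "|" + doc.get(field);
    }

    /** Analysis step: strip the "xx|" prefix, then stem with that language. */
    static List<String> analyze(String tagged) {
        int sep = tagged.indexOf('|');
        boolean hasPrefix = sep == 2; // ISO 639-1 codes are two letters
        String lang = hasPrefix ? tagged.substring(0, 2) : "";
        String text = hasPrefix ? tagged.substring(3) : tagged;
        // Unknown or missing language: leave tokens unstemmed.
        UnaryOperator<String> stemmer =
            STEMMERS.getOrDefault(lang, UnaryOperator.identity());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(stemmer.apply(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
            Map.of("language", "en", "name", "Running shoes");
        // prints [running, shoe]
        System.out.println(analyze(tagWithLanguage(doc, "name")));
    }
}
```

The drawback is visible even in the sketch: the language code travels
inside the field value, so any text that happens to start with two
characters and a '|' would be misread as a prefix. That is why I would
prefer a mechanism that reads the 'language' field directly.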
Thank you very much and have a good day!