1 May 2012 14:06
Re: Multi language full text: Which stemming language should be used?
Richard R. Liu <richard.liu <at> pueo-owl.ch>
2012-05-01 12:06:54 GMT
@Manoj M,

By "multi language" which of these do you mean:

A. Each document in the corpus is in a single language, but not all documents are in the same language.
B. Passages (e.g., paragraphs) in a single document may be in different languages.

At any rate, the whole point of stemming is conflation, i.e., ignoring different forms of the same word. Stemming therefore depends on the language, since the language dictates how different forms of a word are built. So, assuming you have some way of determining the language of a word and you use the correct stemmer to stem it, you will have to store not only the stem but also its language: just as two languages may have words that are spelled the same, two stems in two different languages may also be spelled the same.

With regard to determining the language, if you are dealing with A (above), you might have some metadata with this information; however, you will have to judge how dependable it is. If you are dealing with B, there are some heuristics that operate on the sentence level. You could apply stop word lists in all the possible languages to the sentence, then select the language that matches the most (unique) stop words. Or you could apply the stemmers and select the language whose stemmer changes the most words.

On the query side, you also have the problem of determining the language -- presumably just one! -- of the query terms. In this case, if you resort to one of the heuristics described above, it would probably have to be the one based on stemmers, since people rarely use stop words in queries.

Regards,
Richard

On May 1, 2012, at 13:00 , snowball-discuss-request <at> lists.tartarus.org wrote:

> Send Snowball-discuss mailing list submissions to
> snowball-discuss <at> lists.tartarus.org
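
P.S. To make the stop word heuristic and the "store the language with the stem" point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three tiny stop word lists are made up and far too short for real use, and the names guess_language and index_key are hypothetical, not part of Snowball or any other library. It only shows the shape of the idea.

```python
import re

# Tiny illustrative stop word lists (assumption: real lists are much longer).
STOP_WORDS = {
    "english": {"the", "and", "of", "is", "in", "to", "a", "that"},
    "german": {"der", "die", "das", "und", "ist", "in", "zu", "nicht"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une"},
}


def guess_language(sentence, stop_words=STOP_WORDS):
    """Return the language whose stop word list matches the most unique
    tokens of the sentence, or None if no stop word matches at all."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(tokens & words) for lang, words in stop_words.items()}
    # On a tie, max() keeps the first language in dictionary order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def index_key(language, stem):
    """Key under which a stem is stored: the language is part of the key,
    so identically spelled stems from different languages stay separate."""
    return (language, stem)


if __name__ == "__main__":
    print(guess_language("The cat is in the garden"))
    print(guess_language("Der Hund ist nicht in dem Haus"))
    print(index_key("english", "die") == index_key("german", "die"))
```

A sentence with no stop words at all (which is typical of queries) yields None here, which is why the stemmer-based heuristic is probably the better fit on the query side.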