26 Apr 2010 14:22
German stemmer: hemmung -> hemmung, enzymhemmung -> enzymhemm
Richard R. Liu <richard.liu <at> pueo-owl.ch>
2010-04-26 12:22:40 GMT
Having read the German stemmer rules, I understand why "hemmung" is left as is while "enzymhemmung" becomes "enzymhemm". However, this anomaly causes problems in a procedure that I am adapting from English. The procedure should conflate "hemmung von enzymen", "enzymhemmung", and "enzymhemmend" by applying these steps to each:

1. Decompound
2. Stem each morpheme
3. Discard stopwords
4. Sort the morphemes

Thus:

* "hemmung von enzymen" -> "hemm von enzym" -> "enzym hemm"
* "enzymhemmung" -> "enzym hemmung" -> "enzym hemm"
* "enzymhemmend" -> "enzym hemmend" -> "enzym hemm"

Unfortunately, in the last two cases, the Snowball stemmer leaves "hemmung" and "hemmend" unchanged. Stemming before decompounding is not an option, because the decompounder is unsupervised and knowledge-free, i.e., it derives a list of words that occur in other words from the corpus itself, in which "hemm" does not occur as a word.

In effect, "hemmend" and "hemmung" are left unchanged because "end" and "ung" do not lie within R2. When "hemmung" becomes the tail of a compound, R2 is extended to the left, so "end" and "ung" now lie within R2 and are deleted. Thus, the Porter stemmer for German is liable to produce different stems for such words, depending on whether the word occurs alone or in a compound.

Is the consensus within the community nevertheless that the German stemmer is correct? Has any work been done to develop a stemmer that solves the problem described above? In "A Fast and Simple Stemming Algorithm for German Words", J. Caumanns, Freie Universität Berlin, 1998, an alternative is described that could deal with the problem. Does anybody know about more recent work done in this ...
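The R2 effect described in this message can be reproduced with a small Python sketch of the region marking. It is an approximation under stated assumptions, not the Snowball implementation: it skips the stemmer's prelude (replacing "ß" with "ss" and treating "u" and "y" between vowels as consonants), which does not affect the example words, and the function names are made up for illustration.

```python
# Sketch of R1/R2 region marking as described for the Snowball German stemmer.
# Assumption: the prelude (ß -> ss, u/y between vowels as consonants) is omitted.
# All function names here are hypothetical, chosen for this illustration.

VOWELS = set("aeiouyäöü")


def past_vowel_then_nonvowel(word, start):
    """Return the index just after the first non-vowel that follows a vowel,
    scanning from `start`; return len(word) if there is no such position."""
    n = len(word)
    i = start
    while i < n and word[i] not in VOWELS:  # advance to the first vowel
        i += 1
    while i < n and word[i] in VOWELS:      # advance past the vowel run
        i += 1
    return i + 1 if i < n else n


def regions(word):
    """Return (p1, p2), the start offsets of R1 and R2."""
    raw_p1 = past_vowel_then_nonvowel(word, 0)
    # R2 is found by continuing the scan from the unadjusted R1 start.
    p2 = past_vowel_then_nonvowel(word, raw_p1)
    # German adjustment: at least 3 letters must precede R1.
    p1 = max(raw_p1, min(3, len(word)))
    return p1, p2


def suffix_in_r2(word, suffix):
    """True if `word` ends with `suffix` and the suffix lies entirely in R2."""
    _, p2 = regions(word)
    return word.endswith(suffix) and len(word) - len(suffix) >= p2


if __name__ == "__main__":
    cases = [("hemmung", "ung"), ("enzymhemmung", "ung"),
             ("hemmend", "end"), ("enzymhemmend", "end")]
    for word, suffix in cases:
        p1, p2 = regions(word)
        print(word, "R1=" + repr(word[p1:]), "R2=" + repr(word[p2:]),
              suffix, "in R2:", suffix_in_r2(word, suffix))
```

For "hemmung" the sketch puts R2 at the final "g", so "ung" falls outside it; for "enzymhemmung" R2 begins at "hemmung", so "ung" falls inside and becomes eligible for deletion.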