Richard R. Liu | 26 Apr 2010 14:22

German stemmer: hemmung -> hemmung, enzymhemmung -> enzymhemm

Having read the German stemmer rules, I understand how "hemmung" is left as is,
but "enzymhemmung" becomes "enzymhemm".  However, this anomaly causes problems
in a procedure that I am adapting from English.  The procedure should conflate
"hemmung von enzymen", "enzymhemmung", "enzymhemmend" by applying these steps to
each: 
1.  Decompound
2.  Stem each morpheme
3.  Discard stopwords
4.  Sort the morphemes
Thus:
*   "hemmung von enzymen" -> "hemm von enzym" -> "enzym hemm"
*   "enzymhemmung" -> "enzym hemmung" -> "enzym hemm"
*   "enzymhemmend" -> "enzym hemmend" -> "enzym hemm"
Unfortunately, in the last two cases, the Snowball stemmer leaves "hemmung" and
"hemmend" unchanged.  Stemming before decompounding is not an option, because
the decompounder is unsupervised and knowledge-free, i.e., it derives a list of
words that occur in other words from the corpus itself, in which "hemm" does not
occur as a word.
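
A minimal Python sketch of the four steps above, to make the intended conflation concrete. The tables SPLITS, STEMS and STOPWORDS and the function conflation_key are hypothetical stand-ins written only for the three example phrases; STEMS encodes the stems the procedure is hoped to produce, not what the Snowball German stemmer actually returns.

```python
# Toy stand-ins, NOT real components: hand-written tables that cover only
# the three example phrases from the message above.
SPLITS = {
    "enzymhemmung": ["enzym", "hemmung"],
    "enzymhemmend": ["enzym", "hemmend"],
}
STEMS = {"hemmung": "hemm", "hemmend": "hemm", "enzymen": "enzym"}  # desired stems
STOPWORDS = {"von"}


def conflation_key(phrase):
    """Map a phrase to a key that is equal for phrases that should conflate."""
    morphemes = []
    for token in phrase.lower().split():
        morphemes.extend(SPLITS.get(token, [token]))      # 1. decompound
    stems = [STEMS.get(m, m) for m in morphemes]           # 2. stem each morpheme
    kept = [s for s in stems if s not in STOPWORDS]        # 3. discard stopwords
    return " ".join(sorted(kept))                          # 4. sort the morphemes


for phrase in ("hemmung von enzymen", "enzymhemmung", "enzymhemmend"):
    print(phrase, "->", conflation_key(phrase))
```

With a stemmer that leaves "hemmung" and "hemmend" unchanged in step 2, the three keys would no longer coincide, which is the problem described here.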

In effect, "hemmend" and "hemmung" are left unchanged because "end" and "ung" do
not occur in R2.  By creating a compound word from "hemmung", R2 is extended to
the left, and "end" and "ung" now lie within R2, so they are deleted.  Thus, the
Porter stemmer for German is liable to produce different stems for such words,
depending on whether the word occurs alone or in a compound.
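
The R2 effect can be checked with a short Python sketch. It follows my reading of the region definitions in the German stemmer description (vowels a e i o u y ä ö ü; R1 starts after the first non-vowel that follows a vowel and is moved right so that at least three letters precede it; R2 is found by repeating the scan from the unadjusted R1 position). It is not the Snowball code: the prelude is omitted, and strip_end_ung is a hypothetical helper that models only the deletion of "end" and "ung" in R2.

```python
VOWELS = set("aeiouyäöü")


def _past_vowel_then_nonvowel(word, start):
    """Index just after the first non-vowel that follows a vowel, scanning
    from `start`; len(word) if there is no such position."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1                      # advance to the first vowel
    while i < len(word) and word[i] in VOWELS:
        i += 1                      # advance to the first non-vowel after it
    return i + 1 if i < len(word) else len(word)


def regions(word):
    """Return (p1, p2), the start offsets of R1 and R2."""
    raw_p1 = _past_vowel_then_nonvowel(word, 0)
    p2 = _past_vowel_then_nonvowel(word, raw_p1)
    p1 = min(max(raw_p1, 3), len(word))   # at least 3 letters before R1
    return p1, p2


def strip_end_ung(word):
    """Delete a final 'end' or 'ung' only if the suffix lies entirely in R2."""
    _, p2 = regions(word)
    for suffix in ("end", "ung"):
        if word.endswith(suffix) and len(word) - len(suffix) >= p2:
            return word[: -len(suffix)]
    return word


for w in ("hemmung", "enzymhemmung", "hemmend", "enzymhemmend"):
    p1, p2 = regions(w)
    print(w, "R1 =", w[p1:], "R2 =", w[p2:], "->", strip_end_ung(w))
```

For "hemmung" R2 is only "g", so "ung" is kept; for "enzymhemmung" R2 is "hemmung", so "ung" is removed.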

Is the consensus within the community nevertheless that the German stemmer is
correct?  Has any work been done to develop a stemmer that solves the problem
described above?  In "A Fast and Simple Stemming Algorithm for German Words", J.
Caumanns, Freie Universität Berlin, 1998, an alternative is described that could
deal with the problem.  Does anybody know about more recent work done in this
[...]

Martin Porter | 27 Apr 2010 16:27

Re: German stemmer: hemmung -> hemmung, enzymhemmung -> enzymhemm


Richard,

The stemming anomalies you note don't matter so much for general IR work,
but do for the work that you are doing. It seems to me that you need a
German word splitter, so that enzymhemmung is split to enzym+hemmung etc.
Lemmatization systems do this. You can find sources by typing "german
lemmatization" into Google. In the past I've been involved with two
companies that do this work, Inxight and Teragram. Since working with them,
both have been taken over by larger companies. Their work was proprietary,
with a licence fee arrangement for their use. 

Are there open source solutions here? I do not know. If you, or anyone else,
can share better information than I have, it would be useful.

Martin
F Wolff | 27 Apr 2010 20:50

Re: German stemmer: hemmung -> hemmung, enzymhemmung -> enzymhemm

On Tue, 2010-04-27 at 16:27 +0200, Martin Porter wrote:
> Richard,
> 
> The stemming anomalies you note don't matter so much for general IR work,
> but do for the work that you are doing. It seems to me that you need a
> German word splitter, so that enzymhemmung is split to enzym+hemmung etc.
> Lemmatization systems do this. You can find sources by typing "german
> lemmatization" into Google. In the past I've been involved with two
> companies that do this work, Inxight and Teragram. Since working with them,
> both have been taken over by larger companies. Their work was proprietary,
> with a licence fee arrangement for their use. 
> 
> Are there open source solutions here? I do not know. If you, or anyone else,
> can share better information than I have it would be useful,
> 
> Martin

I am partly assuming that it will work, but the Hunspell spell checker
used in OpenOffice.org, Mozilla products and elsewhere can do
morphological analysis, which should include support for German
compounding. It might provide a good start.

Keep well
Friedel

Richard R. Liu | 28 Apr 2010 19:55

Re: Snowball-discuss Digest, Vol 63, Issue 2

Friedel,

Thanks for the suggestion.  I've spot-checked some words in the online version of Hunspell
(http://www.j3e.de/cgi-bin/spellchecker).  It fails on the kind of specialized vocabulary with which
I am dealing.  It seems very dependent not only on its dictionary, but also upon some rules which don't seem
to apply to my documents.  For instance, "Enzym" and "Hemmung" are correct, but not "Enzymhemmung".  Yet
the author of the paper that contains "Enzymhemmung" and those who read such papers understand what is meant.

Friedel, Martin,

I have created a decompounder that builds a vocabulary based on the actual text.  It determines how often a
word occurs in the corpus, both alone and as a substring in other words.  It then assigns such words -- I call
them "infixes" -- two values:  the frequency of occurrence and "stickiness," the ratio of the number of
times it occurs as a substring to its frequency.  When decompounding a word, I build all possible
decompositions, then use the log-average of the frequencies, the log-average of the stickinesses, and
the ratio of the two, to select the best one.  The search for decompositions is iterative and proceeds from
left to right, seeking infixes that match the beginning of the word, removing them, etc.  Whenever no such
infix can be found, I move to the end of the (rest of the) word and look for infixes that end at the end.  This
process ends when no such infixes can be found.  At that point, I might have some characters left.  I regard
them as a yet-unseen word.  Such words are assigned a frequency of 0.5 (rarer than the rarest) and a
stickiness of 30.  That is the mean over my list of infixes, but 99% of the infixes have a lower stickiness, so 30
means so sticky that the word almost never occurs alone.  This algorithm produces good results and is
language-agnostic.  Thus, it could be applied to English texts in the field of chemistry, where such
compounds are often "invented."
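
A rough Python sketch of the infix statistics described above, under stated assumptions: "frequency" is taken to be the number of standalone occurrences, the substring count is weighted by how often the containing word occurs, and "log-average" is read as the geometric mean. The names infix_stats, log_average and describe are hypothetical. How the two log-averages and their ratio are combined to pick the best decomposition is not spelled out in the message, so the sketch stops at computing them.

```python
import math
from collections import Counter

# Values quoted in the message for leftover, yet-unseen parts.
UNSEEN = (0.5, 30.0)  # (frequency, stickiness)


def infix_stats(tokens):
    """Map each word that also occurs inside other words to
    (frequency, stickiness). Quadratic in the vocabulary size: sketch only."""
    counts = Counter(tokens)
    stats = {}
    for word, freq in counts.items():
        inside = sum(n for other, n in counts.items()
                     if other != word and word in other)
        if inside:
            stats[word] = (freq, inside / freq)
    return stats


def log_average(values):
    """Geometric mean: exp of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def describe(parts, stats):
    """Log-average frequency and log-average stickiness of one candidate
    decomposition; unknown parts get the UNSEEN values."""
    pairs = [stats.get(p, UNSEEN) for p in parts]
    return (log_average([f for f, _ in pairs]),
            log_average([s for _, s in pairs]))


corpus = ["enzym", "enzym", "hemmung",
          "enzymhemmung", "enzymhemmung", "enzymhemmung"]
stats = infix_stats(corpus)
print(stats)
print(describe(["enzym", "hemmung"], stats))
```

In practice very short words would have to be excluded, since a one- or two-letter word occurs as a substring almost everywhere.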

With this I'm back to stemming.  I'm investigating an algorithm proposed by Jörg Caumanns, Freie
Universität Berlin, in 1998.  I am comparing its performance to the Porter stemmer for German.  I have run
it on the German Words data.  In order to measure how close the results are, I imagine that I have two
different clusterings, one determined by the Porter stemmer and one determined by mine.  In each case, a
cluster consists of all the words that have the same stem.  Then I compute the corrected Rand index of the two
[...]
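
The comparison of two stemmers as two clusterings can be sketched in Python as follows, assuming that "corrected Rand index" refers to the adjusted Rand index of Hubert and Arabie. The function name and the input format (one dict per stemmer, mapping each word to its stem) are assumptions made for illustration; the example stems are taken from this thread, not from the German Words data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(stems_a, stems_b):
    """Adjusted Rand index of the clusterings induced by two stemmers.
    stems_a and stems_b map the same words to their stems; words with the
    same stem form one cluster."""
    words = list(stems_a)
    n = len(words)
    if n < 2:
        return 1.0  # no pairs to compare
    # Contingency counts: how many words fall in cluster i of A and j of B.
    cells = Counter((stems_a[w], stems_b[w]) for w in words)
    rows = Counter(stems_a[w] for w in words)
    cols = Counter(stems_b[w] for w in words)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every word is its own cluster
    return (sum_cells - expected) / (maximum - expected)


stemmer_a = {"hemmung": "hemmung", "hemmend": "hemmend",
             "enzymhemmung": "enzymhemm", "enzymhemmend": "enzymhemm"}
stemmer_b = {"hemmung": "hemm", "hemmend": "hemm",
             "enzymhemmung": "enzymhemm", "enzymhemmend": "enzymhemm"}
print(adjusted_rand_index(stemmer_a, stemmer_a))
print(adjusted_rand_index(stemmer_a, stemmer_b))
```

A value of 1 means the two stemmers group the words identically, regardless of what the stems themselves look like; values near 0 mean agreement no better than chance.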

Xiao, Juan | 22 Apr 2010 19:44

errors during compilation

Hi there,

I’m trying to compile the source code from a Windows command window. It gives me 3 errors:

Any idea?

Thanks,

-Jenny

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
