1 Feb 2007 20:49
Re: Japanese stemmer?
Micah Bly <micah.j.bly <at> medtronic.com>
2007-02-01 19:49:10 GMT
Word-splitting is a real problem if you need to do it. In my case, though, I'm working with a set of pre-broken words (a terminology list), which I need to compare to a blob of text (usually a sentence). My goal is solely to determine whether the words in the list are present in the text blob. So I basically have a free pass on word-splitting.
When I do this with English, I have to stem both sets of strings. But with Japanese, I think it will be enough to stem the terminology list words, since we ignore whitespace in Japanese anyway.
For example, if we start with the word:
xxxx-sareta (it was xxxx-ed)
We want to get down to the xxxx word.
Other things we might run into:
xxxx-shita
xxxx-saseta ([I] forced [him/it/her] to xxxx)
xxxx-sasemashita (same as above, but polite verb ending)
xxxx-saserareta (I was forced to xxxx)
xxxx-saseraremashita (same as above, but polite verb ending)
xxxx-suru (I will xxxx)
xxxx-site-iru (I am xxxx'ing)
xxxx-site-ita (I was xxxx'ing)
xxxx-sasete-iru (I am forcing [him] to xxxx)
xxxx-saserarete-ita (I was being forced to xxxx)
etc etc.
plus
xxxx-da, xxxx-desu: [it is a xxxx]
Is it enough to simply put together a big list of possible verb endings, and remove them all? Is there a smart way to do something like that?
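To make the question concrete, here is a rough sketch of the "big list of endings" idea in Python. It is only an illustration under assumptions: the endings are the romanized ones listed above (real data would be in kana/kanji and the list would be far longer), and the names strip_ending and term_in_text are made up for this example. Endings are tried longest first, so that "sasemashita" is removed whole rather than being cut at "shita".

```python
# Endings copied from the examples in this message (romanized).
# Assumption: a real list would be in kana and much longer.
ENDINGS = [
    "sareta", "shita", "saseta", "sasemashita", "saserareta",
    "saseraremashita", "suru", "site-iru", "site-ita",
    "sasete-iru", "saserarete-ita", "da", "desu",
]

# Longest first, so "sasemashita" wins over its own tail "shita".
ENDINGS_BY_LENGTH = sorted(ENDINGS, key=len, reverse=True)


def strip_ending(term, min_stem=1):
    """Remove the longest known ending from term (hypothetical helper).

    The term is returned unchanged if no ending fits or if stripping
    would leave fewer than min_stem characters.
    """
    for ending in ENDINGS_BY_LENGTH:
        if term.endswith(ending) and len(term) - len(ending) >= min_stem:
            # Drop the ending, then the "-" used as a separator in the examples.
            return term[: -len(ending)].rstrip("-")
    return term


def term_in_text(term, text):
    """True if the stripped form of term occurs anywhere in text."""
    return strip_ending(term) in text


if __name__ == "__main__":
    for word in ["xxxx-sareta", "xxxx-sasemashita", "xxxx-site-iru", "xxxx-desu"]:
        print(word, "->", strip_ending(word))
```

One obvious weakness: a blind list also strips nouns that merely happen to end in "da" or "suru", so a real version would probably need a minimum stem length or an exception list.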
Micah Bly
On Jan 29, 2007, at 4:00 AM, Martin Porter wrote:
At least in principle, I'm interested myself in collaborating to make a Japanese stemmer. However I must add a few caveats. I am currently rather busy with other work, and I tried a little while ago to get into Arabic sufficiently to try coding up a stemmer, and eventually abandoned it. I found the language too difficult. So I'm not sure how well I'd get on with Japanese. And what about the problem of word-splitting?
_______________________________________________ Snowball-discuss mailing list Snowball-discuss <at> lists.tartarus.org http://lists.tartarus.org/mailman/listinfo/snowball-discuss