Micah Bly | 1 Feb 2007 20:49

Re: Japanese stemmer?

Word-splitting is a real problem if you need to do it.  In my case, though, I'm working with a set of pre-split words (a terminology list), which I need to compare against a blob of text (usually a sentence). My goal is solely to determine whether the words in the list are present in the text blob.  So I basically get a free pass on word-splitting. 

When I do this with English, I have to stem both sets of strings. But with Japanese, I think it will be enough to stem the terminology list words, since we ignore whitespace in Japanese anyway. 

For example, if we start with the word:
xxxx-sareta (it was xxxx-ed)

We want to get down to the xxxx word.

Other things we might run into:
xxxx-shita
xxxx-saseta ([I] forced [him/it/her] to xxxx.)
xxxx-sasemashita (same as above, but polite verb ending)
xxxx-saserareta (I was forced to xxxx)
xxxx-saseraremashita (same as above, but polite verb ending)
xxxx-suru (I will xxxx)
xxxx-site-iru (I am xxxx'ing)
xxxx-site-ita (I was xxxx'ing)
xxxx-sasete-iru (I am forcing [him] to xxxx)
xxxx-saserarete-ita (I was being forced to xxxx)
etc etc. 
plus
xxxx-da, xxxx-desu: [it is a xxxx]

Is it enough to simply put together a big list of possible verb endings, and remove them all? Is there a smart way to do something like that? 

Micah Bly

On Jan 29, 2007, at 4:00 AM, Martin Porter wrote:

At least in principle, I'm interested myself in collaborating to make a
Japanese stemmer. However I must add a few caveats. I am currently
rather busy with other work, and I tried a little while ago to get into
Arabic sufficiently to try coding up a stemmer, and eventually abandoned
it. I found the language too difficult. So I'm not sure how well I'd get
on with Japanese. 

And what about the problem of word-splitting? 

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Olly Betts | 7 Feb 2007 08:20

Re: Japanese stemmer?

On Thu, Feb 01, 2007 at 01:49:10PM -0600, Micah Bly wrote:
> For example, if we start with the word:
> xxxx-sareta (it was xxxx-ed)
> 
> We want to get down to the xxxx word.
> 
> Other things we might run into:
> xxxx-shita
> xxxx-saseta ([i] forced [him/it/her] to xxxx.)
> xxxx-sasemashita (same as above, but polite verb ending)
> xxxx-saserareta (I was forced to xxxx)
> xxxx-saseraremashita (same as above, but polite verb ending)
> xxxx-suru (I will xxxx)
> xxxx-site-iru (i am xxxx'ing)
> xxxx-site-ita (I was xxxx'ing)
> xxxx-sasete-iru (I am forcing [him] to xxxx)
> xxxx-saserarete-ita (I was being forced to xxxx)
> etc etc.
> plus
> xxxx-da, xxxx-desu: [it is a xxxx]
> 
> Is it enough to simply put together a big list of possible verb  
> endings, and remove them all?

One problem with this approach is when there are words which look like a
verb form but aren't.  For example, in English "herring" looks like a
verb form but isn't, "comply" looks like an adverb but isn't, etc.

Another is that sometimes words have the same linguistic root but the
meanings are now sufficiently different that you don't want to conflate
them.  An English example is "probe" and "probable".

It's probably worth doing a search for existing academic papers on the
subject - there are a number of specialised search sites: citebase,
citeseer, google scholar, etc. - and see if someone has already analysed
the situation.

> Is there a smart way to do something like that?

In snowball, you could do something like:

    backwardmode (
        // Strip one of the listed verb endings from the end of the word.
        define remove as (
            [substring] among (
                'sareta' 'shita' 'saseta' 'sasemashita' 'saserareta' (delete)
            )
        )
    )

If you understand how the endings build up, you can probably remove the
compound ones piece-by-piece to avoid combinatorial explosion of the
number of endings in the list.
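
To illustrate the piece-by-piece idea outside snowball, here is a rough
sketch in plain C.  It is only a sketch under assumptions: the function
names and the four tables of pieces are made up for this example, it works
on romanised ASCII text, it covers only a few of the forms listed above
(none of the -te-iru, -da or -desu ones), and it can over-strip a word that
merely looks like a verb form.

```c
#include <stddef.h>
#include <string.h>

/* If word[0..len) ends with suffix and a non-empty stem would remain,
 * return the shortened length; otherwise return len unchanged. */
static size_t drop_suffix(const char *word, size_t len, const char *suffix)
{
    size_t n = strlen(suffix);
    if (len > n && memcmp(word + len - n, suffix, n) == 0)
        return len - n;
    return len;
}

/* Try the candidates in order (longest first) and drop the first match. */
static size_t drop_one_of(const char *word, size_t len,
                          const char *const *candidates, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        size_t shorter = drop_suffix(word, len, candidates[i]);
        if (shorter != len)
            return shorter;
    }
    return len;
}

/* Hypothetical helper: peel one optional piece per layer, working inwards
 * from the end of the word.  The word is only truncated if a "suru" stem
 * piece is found at the end of the chain.  Returns 1 if the word was
 * stemmed, 0 if it was left alone. */
int strip_suru_endings(char *word)
{
    /* Illustrative tables only, not a complete description of Japanese. */
    static const char *const tense[] = { "mashita", "ta", "ru" };
    static const char *const passive[] = { "rare", "re" };
    static const char *const causative[] = { "se" };
    static const char *const suru_stem[] = { "shi", "sa", "su" };
    size_t len = strlen(word);
    size_t stem;

    len = drop_one_of(word, len, tense, 3);
    len = drop_one_of(word, len, passive, 2);
    len = drop_one_of(word, len, causative, 1);
    stem = drop_one_of(word, len, suru_stem, 3);
    if (stem == len)
        return 0;           /* no suru stem found: leave the word as it was */
    word[stem] = '\0';
    return 1;
}
```

With these tables, nine short pieces cover forms such as
"benkyousaseraremashita" and "benkyoushita" without listing every
combination as a separate ending.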

Cheers,
    Olly
Olly Betts | 11 Feb 2007 23:56

More patches

I'm currently updating Xapian to use UTF-8 stemmers generated by the
latest version of snowball.  I've patched the snowball compiler to
generate the stemmers as C++ classes, and I'm embedding the patched
compiler in the Xapian build system, so Xapian users can easily drop
in new stemmers.

For ease of forward maintenance I'd like to merge changes back into
the official snowball tree where possible, so I've unpicked the changes
into patches with a single purpose each, the first batch of which are
described and linked to below.

I'm not expecting you'll want to apply all the changes I've made,
but I'm offering them all for completeness.  It'll also serve as
documentation for what is different.

Let me know if you've any questions!

OK, here are the patches:

Fix a typo of a function name in a comment:

http://oligarchy.co.uk/xapian/patches/snowball-add_to_b-comment-correction.patch

This improves the shortcutting of backwards among - if there are fewer
characters available than the shortest string in the among, there's
no way it can match.  It also includes a cosmetic tweak (avoiding
generating "z->c - 0" in the output) which makes the generated source
a little more readable (of course the C compiler will optimise the "- 0"
away anyway):

http://oligarchy.co.uk/xapian/patches/snowball-min-length-shortcut-backwards-among.patch
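
The idea of the shortcut can be sketched in standalone C as follows.  This
is not the code from the patch or the snowball runtime: the struct, the
function name and the linear search are simplifications invented for this
example, and only the early return is the point.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for a stemmer's cursor state. */
struct cursor {
    const char *p;  /* text being stemmed */
    int c;          /* cursor: matching works backwards from here */
    int lb;         /* backward limit: matching must not go before this */
};

/* Return the 1-based index of the first string which matches the text
 * ending at the cursor, or 0 if none does.  min_len is the length of the
 * shortest string in the list. */
int match_suffix_among(const struct cursor *z, const char *const *strings,
                       int count, int min_len)
{
    int available = z->c - z->lb;
    int i;

    /* The shortcut: with fewer characters available than the shortest
     * candidate, nothing can match, so skip the search entirely. */
    if (available < min_len)
        return 0;

    for (i = 0; i < count; i++) {
        int n = (int)strlen(strings[i]);
        if (n <= available &&
            memcmp(z->p + z->c - n, strings[i], (size_t)n) == 0)
            return i + 1;
    }
    return 0;
}
```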

ISO C doesn't require that pointers to different types have the same
representation, except that void* and pointers to character types must
have the same representation, as must pointers to qualified and
unqualified versions of compatible types.

So, for example, if sizeof(int) is 4, it's possible that int* pointers
could contain the same bit pattern as a void* pointer to the same
object, but shifted right by 2 bits.

Therefore it's not portable to just cast the type of a function pointer
if the argument types are pointers with potentially different
representations, as the arguments won't get converted.  I'm not sure if
any current platforms actually use different pointer representations, so
the portability improvement may be a somewhat theoretical one, but it's
easy enough to make the code standard conforming:

http://oligarchy.co.uk/xapian/patches/snowball-sort-function-casting.patch
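
As a sketch of the standard-conforming pattern (not the code from the
patch), the comparison function can take "const void *" parameters and
convert them back inside its body, so no function pointer ever needs to be
cast.  The function names here are made up, and the standard qsort() is
used purely as a convenient caller of such a function.

```c
#include <stdlib.h>
#include <string.h>

/* Portable: the parameters have exactly the type the sort routine expects,
 * and the conversion back to a pointer to a string happens here, where the
 * compiler can do whatever the representation requires. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* The non-portable alternative would be to declare the function as
 *     int cmp(const char **a, const char **b)
 * and pass it as
 *     (int (*)(const void *, const void *))cmp
 * so that it is called through a pointer of a different function type. */

/* Hypothetical helper: sort an array of n strings in place. */
void sort_strings(const char **v, size_t n)
{
    qsort(v, n, sizeof v[0], compare_strings);
}
```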

This patch changes snowball to use the ISO C qsort() function instead
of the custom sort routine in sort.c.  My motivation is to keep
the number of source files down, but it seems very similar in
performance (cachegrind suggests sort.c is a few cycles faster on
x86-64 linux, but the difference is so small you couldn't measure it
by normal means) so maybe it's worth considering the change for the
mainstream snowball compiler too:

http://oligarchy.co.uk/xapian/patches/snowball-use-qsort.patch

This patch disables java support, which we don't use and so can save a
source file.  Really just included for completeness, though perhaps it
would be useful to apply with a "#define JAVA_SUPPORT" added to make it
easy for others to disable the java support if they don't need it:

http://oligarchy.co.uk/xapian/patches/snowball-disable-java-support.patch

Cheers,
    Olly
Richard Boulton | 12 Feb 2007 02:38

Re: More patches

Olly Betts wrote:
> OK, here are the patches:

> Fix a typo of a function name in a comment:
> 
> http://oligarchy.co.uk/xapian/patches/snowball-add_to_b-comment-correction.patch

Applied, thanks (in snowball revision 427).

> This improves the shortcutting of backwards among - if there are fewer
> characters available than the shortest string in the among, there's
> no way it can match.  It also includes a cosmetic tweak (avoiding
> generating "z->c - 0" in the output) which makes the generated source
> a little more readable (of course the C compiler will optimise the "- 0"
> away anyway):
> 
> http://oligarchy.co.uk/xapian/patches/snowball-min-length-shortcut-backwards-among.patch

This is the most interesting of the bunch, but I'm not awake enough to 
consider it in detail.  I'll take a closer look at this patch tomorrow.

> ISO C doesn't require that pointers to different types have the same
> representation, except that void* and pointers to character types must
> have the same representation, as must pointers to qualified and
> unqualified versions of compatible types.
> 
> So, for example, if sizeof(int) is 4, it's possible that int* pointers
> could contain the same bit pattern as a void* pointer to the same
> object, but shifted right by 2 bits.
> 
> Therefore it's not portable to just cast the type of a function pointer
> if the argument types are pointers with potentially different
> representations, as the arguments won't get converted.  I'm not sure if
> any current platforms actually use different pointer representations, so
> the portability improvement may be a somewhat theoretical one, but it's
> easy enough to make the code standard conforming:
> 
> http://oligarchy.co.uk/xapian/patches/snowball-sort-function-casting.patch

I've applied this (in snowball revision 428), but I also needed to 
change the signature of the sort function (to accept "const void *" 
pointers in the comparison function instead of "void *" pointers) to 
avoid a compiler warning.  Of course, you won't see this in Xapian 
because of the change to qsort.

> This patch changes snowball to use the ISO C qsort() function instead
> of the custom sort routine in sort.c.  My motivation is to keep
> the number of source files down, but it seems very similar in
> performance (cachegrind suggests sort.c is a few cycles faster on
> x86-64 linux, but the difference is so small you couldn't measure it
> by normal means) so maybe it's worth considering the change for the
> mainstream snowball compiler too:
> 
> http://oligarchy.co.uk/xapian/patches/snowball-use-qsort.patch

This seems a reasonable patch, but I'm very slightly concerned about 
whether qsort really is ubiquitous.  The man page indicates that it 
conforms to SVr4, 4.3BSD, C99, which is pretty good.  If it had C89 as 
well, I'd have no qualms.  Performance isn't particularly critical in 
the compiler, anyway, for present usage.

Reducing the amount of code needed to be maintained is the main benefit 
of this, and that's a worthy goal given the amount of time we have to 
maintain snowball.  I'm going to apply this patch for now - Martin, or 
anyone else, if you can think of any reason not to apply it, I can 
easily revert the change.

(Olly - applied in snowball revision 429 - I removed the prototype of 
"sort" entirely from snowball/compiler/header.h, rather than commenting 
it out, so I've also applied this change to Xapian to avoid confusion.)

> This patch disables java support, which we don't use and so can save a
> source file.  Really just included for completeness, though perhaps it
> would be useful to apply with a "#define JAVA_SUPPORT" added to make it
> easy for others to disable the java support if they don't need it:
> 
> http://oligarchy.co.uk/xapian/patches/snowball-disable-java-support.patch

In principle, I'm happy with the idea of having the Java support 
conditionally compiled.  I have a half completed patch to snowball lying 
around which adds a python output module, which I would also 
conditionally compile.  However, I think it would be better to "default 
to enabled", so that a straightforward compilation will enable all 
output modes, but certain backends could be explicitly turned off with a 
compiler option.  That way, we're unlikely to get support requests from 
people who've not managed to enable the backend they're trying to use.

I suggest the following patch instead:
http://www.tartarus.org/~richard/xapian-patches/snowball-disable-java-support2.patch
which I've applied to snowball.

Xapian's "languages/Makefile.mk" would then be modified to include 
"-DDISABLE_JAVA" on the compile line for the snowball compiler.
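
For illustration, the "default to enabled" arrangement might look roughly
like this in C.  Only the DISABLE_JAVA macro name comes from the patch; the
table and the function are invented for this sketch and are not taken from
the snowball sources.

```c
#include <stddef.h>
#include <string.h>

/* Every backend is listed unless it is explicitly switched off, so a plain
 * compile gets all of them and -DDISABLE_JAVA drops just the Java one. */
static const char *const enabled_backends[] = {
    "c",
#ifndef DISABLE_JAVA
    "java",
#endif
};

/* Hypothetical helper: return 1 if the named backend was compiled in. */
int backend_enabled(const char *name)
{
    size_t count = sizeof enabled_backends / sizeof enabled_backends[0];
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcmp(enabled_backends[i], name) == 0)
            return 1;
    }
    return 0;
}
```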

-- 
Richard
Marvin Humphrey | 12 Feb 2007 02:45

Re: More patches


On Feb 11, 2007, at 5:38 PM, Richard Boulton wrote:

> This seems a reasonable patch, but I'm very slightly concerned  
> about whether qsort really is ubiquitous.  The man page indicates  
> that it conforms to SVr4, 4.3BSD, C99, which is pretty good.  If it  
> had C89 as well, I'd have no qualms.

Rest easy.  :)

qsort() is C89: <http://www.schweikhardt.net/identifiers.html>

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Olly Betts | 12 Feb 2007 06:04

Re: More patches

On Mon, Feb 12, 2007 at 01:38:26AM +0000, Richard Boulton wrote:
> Olly Betts wrote:
> >This improves the shortcutting of backwards among - if there are fewer
> >characters available than the shortest string in the among, there's
> >no way it can match.  It also includes a cosmetic tweak (avoiding
> >generating "z->c - 0" in the output) which makes the generated source
> >a little more readable (of course the C compiler will optimise the "- 0"
> >away anyway):
> >
> >http://oligarchy.co.uk/xapian/patches/snowball-min-length-shortcut-backwards-among.patch
> 
> This is the most interesting of the bunch, but I'm not awake enough to 
> consider it in detail.  I'll take a closer look at this patch tomorrow.

I didn't say before but all the stemmers which have test vocabularies
behave the same way with this patch applied.

> I've applied this (in snowball revision 428), but I also needed to 
> change the signature of the sort function (to accept "const void *" 
> pointers in the comparison function instead of "void *" pointers) to 
> avoid a compiler warning.  Of course, you won't see this in Xapian 
> because of the change to qsort.

Ah yes, sorry.

> >http://oligarchy.co.uk/xapian/patches/snowball-use-qsort.patch
> 
> This seems a reasonable patch, but I'm very slightly concerned about 
> whether qsort really is ubiquitous.  The man page indicates that it 
> conforms to SVr4, 4.3BSD, C99, which is pretty good.  If it had C89 as 
> well, I'd have no qualms.  Performance isn't particularly critical in 
> the compiler, anyway, for present usage.

The qsort man page I have says "ISO 9899", which is actually just "the
ISO C standard", not C99 (the "99" is coincidental) - look at the
"Revises:" lists here, and you'll see that there's a 1990 version as
well as the current 1999 revision:

http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29237&showrevision=y

The "89" in "C89" is the date of the ANSI standard (1989).  ISO approved
it in 1990.

Anyway, as Marvin says, qsort was in the 1989/1990 standard
too.

> Reducing the amount of code needed to be maintained is the main benefit 
> of this, and that's a worthy goal given the amount of time we have to 
> maintain snowball.  I'm going to apply this patch for now - Martin, or 
> anyone else, if you can think of any reason not to apply it, I can 
> easily revert the change.

The merge sort is (I think) a stable sort, whereas qsort() may not be
(despite the "q" in the name, the standard doesn't require it to be
implemented as a quick sort, but does explicitly say that the order of
two elements which compare equal is unspecified).  But it's only used to
sort the strings in an "among", and the snowball manual says "The 'Si'
must all be different" (and adding a quick check to the comparison
function confirms this for the current algorithms).
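
A check along these lines can be sketched in C.  Note that this sketch
looks at adjacent entries after sorting rather than checking inside the
comparison function, and the function names are made up for the example.
If every string is different, no two elements compare equal, so it makes
no difference whether the sort is stable.

```c
#include <stdlib.h>
#include <string.h>

static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort the n strings in place, then return 1 if they
 * are all different or 0 if any two are equal.  After sorting, equal
 * strings must be neighbours, so one pass over adjacent pairs suffices. */
int sort_and_check_distinct(const char **v, size_t n)
{
    size_t i;

    qsort(v, n, sizeof v[0], compare_strings);
    for (i = 1; i < n; i++) {
        if (strcmp(v[i - 1], v[i]) == 0)
            return 0;
    }
    return 1;
}
```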

> (Olly - applied in snowball revision 429 - I removed the prototype of 
> "sort" entirely from snowball/compiler/header.h, rather than commenting 
> it out, so I've also applied this change to Xapian to avoid confusion.)

Sure.

> I suggest the following patch instead:
> http://www.tartarus.org/~richard/xapian-patches/snowball-disable-java-support2.patch
> which I've applied to snowball.
> 
> Xapian's "languages/Makefile.mk" would then be modified to include 
> "-DDISABLE_JAVA" on the compile line for the snowball compiler.

Works for me.

Thanks for handling these so promptly!

Cheers,
    Olly
Olly Betts | 12 Feb 2007 09:30

Re: More patches

OK, here are some more:

This changes the generated C code to avoid shadowed local variables.  In
particular it generates int c1, c2, c3 instead of int c and the same for
m.  Also, the existing use of m3 is renamed to mlimit (which is clearer
anyway).  The motivations are that this makes the generated code easier
to understand to the human reader, and allows it to be compiled without
warnings with the GCC option -Wshadow, which avoids some special casing
in Xapian's build system:

http://oligarchy.co.uk/xapian/patches/snowball-no-shadowed-variables.patch
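
To show the naming scheme, here is a hand-written sketch in the style of
the generated code.  It is not actual compiler output: the struct and the
function are invented, and the cursor arithmetic stands in for real
stemming steps.

```c
/* Hypothetical, cut-down environment: just a cursor position. */
struct env {
    int c;
};

/* Each nested block saves the cursor under its own name.  If both saved
 * values were called "c", the inner one would shadow the outer one and
 * GCC's -Wshadow would warn. */
int try_two_steps(struct env *z)
{
    {   int c1 = z->c;      /* cursor saved by the outer block */
        z->c += 2;          /* stand-in for an outer test */
        {   int c2 = z->c;  /* cursor saved by the inner block */
            z->c += 3;      /* stand-in for an inner test */
            z->c = c2;      /* restore the inner saved cursor */
        }
        z->c = c1;          /* restore the outer saved cursor */
    }
    return z->c;
}
```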

This patch eliminates a shadowed variable in the snowball compiler
itself, and 2 cases in the runtime:

http://oligarchy.co.uk/xapian/patches/snowball-rename-shadowed-variables-in-compiler-and-runtime.patch

This fixes various typos in the website (the change to r1r2.html
reflects that some of the stemmers now shipped don't actually use r1 and
r2 - lovins is the most obvious example):

http://oligarchy.co.uk/xapian/patches/snowball-website-typos.patch

This adds a "make check" rule which verifies that the UTF-8 and
ISO-8859-1 versions of the stemmers actually produce the expected
output on the test vocabulary.  To simplify the implementation
of this, the patch also converts all the voc.txt and output.txt
files to UTF-8 (the romanian ones were already) - I just ran them
through iconv with suitable options to do this:

http://oligarchy.co.uk/xapian/patches/snowball-make-check.patch

Cheers,
    Olly
Richard Boulton | 12 Feb 2007 11:17

Re: More patches

Olly Betts wrote:
> This fixes various typos in the website (the change to r1r2.html
> reflects that some of the stemmers now shipped don't actually use r1 and
> r2 - lovins is the most obvious example):
> 
> http://oligarchy.co.uk/xapian/patches/snowball-website-typos.patch

Thanks, applied.

> This adds a "make check" rule which verifies that the UTF-8 and
> ISO-8859-1 versions of the stemmers actually produce the expected
> output on the test vocabulary.  To simplify the implementation
> of this, the patch also converts all the voc.txt and output.txt
> files to UTF-8 (the romanian ones were already) - I just ran them
> through iconv with suitable options to do this:
> 
> http://oligarchy.co.uk/xapian/patches/snowball-make-check.patch

Martin and I discussed moving the sample data to UTF-8 long ago, but 
never quite got round to it.  I even had some of the sample data in my 
working directory converted.

I've committed this, thanks very much.

I'll look at the remaining three patches later, probably this evening.

-- 
Richard
Olly Betts | 12 Feb 2007 13:09

Re: More patches

I just thought we should be testing the koi8r version of the Russian
stemmer too (I also moved the utf8 versions to be checked first, as
they're probably the more interesting ones these days):

http://oligarchy.co.uk/xapian/patches/snowball-check-koi8r.patch

Cheers,
    Olly
Richard Boulton | 12 Feb 2007 13:31

Re: More patches

Olly Betts wrote:
> I just thought we should be testing the koi8r version of the Russian
> stemmer too (I also moved the utf8 versions to be checked first, as
> they're probably the more interesting ones these days):
> 
> http://oligarchy.co.uk/xapian/patches/snowball-check-koi8r.patch

Applied.

-- 
Richard
