iolalla | 21 Jun 2011 18:06

Article about using Snowball stemmer to do language identification

Hi,

        We've been experimenting with the Snowball stemmers for language identification. The results were very good, or at least very fast, so we decided to share the resulting analysis with everybody and give a little back to the community.

        As far as we've tested, it works smoothly and can be used with very small sets of words.

        Here you can read the article and leave your feedback: http://israelolalla.blogspot.com/2011/06/language-identification-system-how-to.html

Kind Regards.

---------------------------------------------------------------------------------
ISrael Olalla chiCOte
iolalla <at> isoco.com
#Linkedin  http://es.linkedin.com/in/iolalla

#M +34  664 300 116

#T  +34 948 102 408
Parque Tomás Caballero, 2, 6º 4ª

31006 Pamplona, España

iSOCO 
            enabling the networked economy
           
www.isoco.com

Please consider your environmental responsibility before printing this e-mail

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Pola Podzharsky | 28 Jun 2011 14:54

Stemmer for Ukrainian language

Good day, dear list.

I am looking for a Russian-Ukrainian stemmer. I've seen that a Russian stemmer exists in the Snowball project, but a Ukrainian one does not. I've also seen a relevant discussion on this mailing list [1] dating back to 2003. Has there been any progress on a Ukrainian stemmer since then?

Thank you in advance,
Pola.

iolalla | 30 Jun 2011 12:37

Re: Article about using Snowball stemmer to do language identification

Hi Martin,

        Yes, you are right: what we do is count the removed endings. But for very similar languages (Italian, Portuguese, Catalan, etc.) we need an additional criterion, and combining stopwords with stemming gives very good results.
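
A minimal sketch of the scoring idea described above, not the code from the article. It assumes toy suffix lists in place of real Snowball stemmers, tiny illustrative stopword lists, and an arbitrary `stopword_weight`. The names `SUFFIXES`, `STOPWORDS`, `strip_ending`, `score` and `identify` are all hypothetical.

```python
# Hypothetical toy data: short suffix and stopword lists standing in for
# real Snowball stemmers and full stopword lists (assumption, not the
# article's actual resources).
SUFFIXES = {
    "english": ("ing", "ed", "ly", "s"),
    "spanish": ("ando", "mente", "ción", "os", "as"),
}
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to"},
    "spanish": {"el", "la", "de", "y", "que"},
}


def strip_ending(word, suffixes):
    """Toy stemmer: remove the first matching suffix, keeping a stem of 3+ letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def score(text, stopword_weight=2.0):
    """Score each language by removed endings plus weighted stopword hits."""
    words = text.lower().split()
    scores = {}
    for lang, suffixes in SUFFIXES.items():
        # Count words the (toy) stemmer actually changed.
        removed = sum(1 for w in words if strip_ending(w, suffixes) != w)
        # Count words found in that language's stopword list.
        stop_hits = sum(1 for w in words if w in STOPWORDS[lang])
        scores[lang] = removed + stopword_weight * stop_hits
    return scores


def identify(text):
    """Return the language with the highest combined score."""
    scores = score(text)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(identify("the dogs walked quickly to the garden"))
    print(identify("la casa de los gatos es rápidamente cerrada"))
```

With a real setup, `strip_ending` would be replaced by one Snowball stemmer per candidate language, and the stopword weight would need tuning on test data, especially for closely related languages.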

        Together with Jan from Cominvent, we are looking into using a mix of n-grams (hunspell), Snowball and stopwords to make the decision. In the coming weeks we'll have a test environment to see which approach works best, and we will share the results and the code with everybody.

Kind Regards.
On 30/06/2011 12:00, Martin Porter wrote:
> Iolalla,
>
> Very interesting. In the past I've used stopword lists for language
> identification, and the results were adequate. But I did notice that
> in Finnish there aren't so very many stopwords!
>
> But I don't quite see how applying a stemmer leads to a language
> identification. Do you count the number of valid endings removed, and
> use that as a measure?
>
> Like Cominvent, I wondered how the mixing was done in the hybrid approach.
>
> Martin
Martin Porter | 30 Jun 2011 12:00

Re: Article about using Snowball stemmer to do language identification

Iolalla,

Very interesting. In the past I've used stopword lists for language
identification, and the results were adequate. But I did notice that
in Finnish there aren't so very many stopwords!

But I don't quite see how applying a stemmer leads to a language
identification. Do you count the number of valid endings removed, and
use that as a measure?

Like Cominvent, I wondered how the mixing was done in the hybrid approach.

Martin
Martin Porter | 30 Jun 2011 11:37

Re: Stemmer for Ukrainian language

Pola,

At the Snowball site there has been no progress with a Ukrainian
stemmer, I'm afraid.

Martin
