1 Sep 2006 15:00
1 Sep 2006 15:47
Re: SMS corpus
Emmanuel PROCHASSON <eprochasson <at> free.fr>
2006-09-01 13:47:36 GMT
Alexander Osherenko wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it
> should be emotional, but at first it doesn't matter much.

I have seen this one, in English:
http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/

There is also one in French, from the Université de Louvain, but it is not available at the moment:
http://www.smspourlascience.be/

Hope that helps,

--
Emmanuel Prochasson
1 Sep 2006 15:56
Re: SMS corpus
Iztok Kosem <iztok.kosem <at> volja.net>
2006-09-01 13:56:43 GMT
Hello
I think Caroline Tagg at the University of Birmingham is
doing research in the language of SMS messages and has compiled a corpus for
this purpose.
I think her email is cxt491 <at> bham.ac.uk.
Hope this helps
Best
Iztok Kosem
Aston University
----- Original Message -----
From: Alexander Osherenko
Sent: Friday, September 01, 2006 2:00 PM
Subject: [Corpora-List] SMS corpus

Hello,
has anybody heard of a text corpus with SMS messages? Actually it should
be emotional, but at first it doesn't matter much.
Best
Alexander
1 Sep 2006 16:06
Re: SMS corpus
Cédrick Fairon <cedrick.fairon <at> uclouvain.be>
2006-09-01 14:06:29 GMT
Dear Alexander,

The Centre for Natural Language Processing at the University of Louvain (http://cental.fltr.ucl.ac.be) has collected a corpus of 75,000 French SMS messages (more than 2,400 authors, aged 12 to 65). Details about the project are available online:
http://www.smspourlascience.be

A subset of this corpus (30,000 SMS) has been released on CD-ROM by the Louvain University Press and is available from http://www.i6doc.com/doc/sms (licence for non-profit organisations only; others may contact us).

Two interesting remarks about the corpus:
- it contains information about the authors' profiles (sex, age, occupation, mother tongue, second language, place of residence, etc.). These profiles are linked to the messages, so that you can select a subset of the corpus corresponding to given sociolinguistic details;
- each message is linked to a "transcribed" version in "standard" French, so that you can search for a word and get all the variants present in the corpus.

All the details are in:
C. Fairon, S. Paumier (2006). "A translated corpus of 30,000 French SMS". In Proceedings of LREC 2006. Genoa.

Best Regards,
Cédrick

On 1 Sep 2006, at 15:00, Alexander Osherenko wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it
> should be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander

Cédrick Fairon
cedrick.fairon <at> uclouvain.be
Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgium
tel: +32 10 47 37 88
fax: +32 10 47 26 06
http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be
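The two properties described above (author profiles linked to messages, and a standard-French transcription linked to each message) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Author` and `Message` record layouts, the helper names `select_by_profile` and `raw_forms_for`, and the three toy messages are invented and do not reflect the actual format or content of the Louvain CD-ROM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    # Hypothetical profile fields; the real corpus records more of them.
    author_id: int
    sex: str
    age: int
    mother_tongue: str


@dataclass(frozen=True)
class Message:
    # Hypothetical layout: the raw SMS plus its standard-French transcription.
    author_id: int
    raw: str
    standard: str


def select_by_profile(messages, authors, predicate):
    """Return the messages whose author satisfies the given predicate."""
    keep = {a.author_id for a in authors if predicate(a)}
    return [m for m in messages if m.author_id in keep]


def raw_forms_for(messages, word):
    """Return the raw SMS texts whose transcription contains the given word."""
    word = word.lower()
    return [m.raw for m in messages if word in m.standard.lower().split()]


# Invented toy data, for illustration only.
authors = [Author(1, "F", 17, "French"), Author(2, "M", 40, "French")]
messages = [
    Message(1, "slt sa va", "salut ça va"),
    Message(2, "salu a tous", "salut à tous"),
    Message(2, "a 2m1", "à demain"),
]

teen_messages = select_by_profile(messages, authors, lambda a: a.age < 20)
salut_variants = raw_forms_for(messages, "salut")
print(teen_messages)
print(salut_variants)
```

Searching on the transcription rather than on the raw text is what makes the variant lookup possible: the query word is matched in standard spelling, and the raw messages returned show how it was actually written.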
1 Sep 2006 16:20
Re: SMS corpus
Sébastien Paumier <sebastien.paumier <at> univ-mlv.fr>
2006-09-01 14:20:57 GMT
Hello,

the URL given by Cédrick Fairon is wrong. The correct one, which leads to a corpus of 30,000 French SMS, is:
http://www.i6doc.com/doc/smscd

Best regards,

Sébastien Paumier
Institut Gaspard-Monge - Université de Marne-la-Vallée

Alexander Osherenko wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it
> should be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander
1 Sep 2006 17:06
Re: SMS corpus
Min-Yen Kan <knmnyn <at> gmail.com>
2006-09-01 15:06:11 GMT
Hi all:

I think Emmanuel Prochasson already mentioned the corpus that we have collected at NUS. It is a medium-sized corpus with about 10K messages sent by students in Singapore.

We are still in the process of enlarging the corpus, but would also like to hear what corpus researchers are looking to find with such corpora. For example, would a collection of more messages from a few individuals be of more use than a collection with few messages from a wider variety of contributors?

Most of the messages that we have collected were self-selected by university students to be made public in the corpus, so we believe that there is likely a bias towards messages that are less personal than what actually occurs in real life. So you may have less luck finding emotional messages in our corpus.

Have you thought of supplementing your corpus studies with chat language? My past student was looking at some chat logs from commercial sites to supplement his studies and corpus collection.

The SMS corpus is here (as stated by Emmanuel):
http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/

Min-Yen Kan
Assistant Professor
Web / IR / NLP Group (WING), School of Computing
National University of Singapore

On 9/1/06, Alexander Osherenko <osherenko <at> gmx.de> wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it should
> be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander
1 Sep 2006 18:31
Re: SMS corpus
Susana Sotillo <sotillos <at> mail.montclair.edu>
2006-09-01 16:31:05 GMT
Hi,

I am also in the process of compiling an SMS corpus (in English). I know that there must be a very large Italian corpus of SMS since I purchased the TreoDeskTop in order to download all my messages. Apparently a lot has been written about the pragmatics of SMS among Italian teenagers. I have downloaded several articles.

Susana Sotillo

Iztok Kosem wrote:
> Hello
>
> I think Caroline Tagg at the University of Birmingham is doing
> research in the language of SMS messages and has compiled a corpus for
> this purpose.
> I think her email is cxt491 <at> bham.ac.uk.
>
> Hope this helps
>
> Best
>
> Iztok Kosem
> Aston University
>
> ----- Original Message -----
> From: Alexander Osherenko <osherenko <at> gmx.de>
> To: corpora <at> hd.uib.no
> Sent: Friday, September 01, 2006 2:00 PM
> Subject: [Corpora-List] SMS corpus
>
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it
> should be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander
2 Sep 2006 09:52
UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.
Cyrus Shaoul <cyrus.shaoul <at> ualberta.ca>
2006-09-02 07:52:51 GMT
Hello Again,

** IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED VERSION. SEE THE NOTE BELOW. **

A "thank you" to all the folk who downloaded the first version of our USENET word list. Some people made requests for a larger list of types, not restricted to my original dictionary. I have now finished the list of all types with a frequency greater than 3 tokens per million tokens. It is large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of the types in this list are URLs, e-mail addresses and other cruft that are artifacts of my overly simplistic text processing (delete punctuation, and split on whitespace).

I know this list is not for everyone, but if you are interested in seeing a lot of types, please download the file from here, and please send me any feedback you have. I sorted the list by decreasing type frequency.

http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

WARNING: File size is 28 MB, compressed.

** NOTE: In doing this run, I noticed that my corpus grew in size from 5.9 to 7.8 billion words, despite the fact that I was using the same raw data. I then discovered my bug: I forgot to count non-words in my original program. So if you downloaded the original list of 111,627 words, the corpus size and freq/million numbers are WRONG! The counts were correct, though. Please download the corrected list here (914k, compressed):

http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

I also sorted this list by decreasing frequency for ease of use.

Thanks for your understanding,

Cyrus

=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
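For readers who want to see why the bug described above changes the freq/million figures but not the raw counts, here is a minimal sketch. It is not the program used to build the USENET lists; the function names `tokenize` and `freq_per_million`, the `count_non_words` switch, and the sample sentence are all assumptions made for illustration. The tokenizer follows the description in the message (delete punctuation, split on whitespace) and does no case folding, since the message does not mention any.

```python
import string
from collections import Counter


def tokenize(text):
    """Naive tokenizer: delete ASCII punctuation, then split on whitespace."""
    return text.translate(str.maketrans("", "", string.punctuation)).split()


def freq_per_million(tokens, dictionary=None, min_per_million=0.0,
                     count_non_words=True):
    """Return (type, count, per-million) rows, sorted by decreasing count.

    dictionary: optional set of accepted words; None keeps every type.
    count_non_words: if False, the corpus size ignores tokens outside the
    dictionary, which reproduces the kind of bug described in the message.
    """
    counts = Counter(tokens)

    def in_dict(word):
        return dictionary is None or word in dictionary

    if count_non_words:
        total = len(tokens)
    else:
        total = sum(c for w, c in counts.items() if in_dict(w))

    rows = []
    for word, count in counts.items():
        if not in_dict(word):
            continue
        per_million = count * 1_000_000 / total
        if per_million > min_per_million:
            rows.append((word, count, per_million))
    # Decreasing count, with ties broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Invented sample text; the URL turns into the cruft type "httpxy".
tokens = tokenize("the cat, the dog. see http://x.y the end")
words = {"the", "cat", "dog"}
correct = freq_per_million(tokens, dictionary=words)
buggy = freq_per_million(tokens, dictionary=words, count_non_words=False)
print(correct)
print(buggy)
```

In both runs the count for each word is identical; only the denominator differs, so the buggy run reports a smaller corpus size and inflated per-million values.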
2 Sep 2006 13:23
New book: Corpus Technology and Language Pedagogy
Ute Römer <ute.roemer <at> engsem.uni-hannover.de>
2006-09-02 11:23:26 GMT
Dear all,

On behalf of Joybrato Mukherjee (University of Giessen), I would like to announce the publication of a new book that might be of interest to some of you (TOC attached below):

Sabine Braun, Kurt Kohn and Joybrato Mukherjee, eds. (2006): Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods. (English Corpus Linguistics, Volume 3) Frankfurt/Main: Peter Lang.
ISBN: 3-631-54720-X
38 EUR
http://www.peterlang.com

About the volume:
The use of corpora and corpus technology for language learning and teaching purposes has been on the agenda of researchers, lexicographers and pedagogues for more than two decades now. The present volume is intended to take stock of some major developments in corpus-informed language pedagogy and brings together a number of contributions, many of which were originally presented at the Language Technology section of the LearnTec Conference in Karlsruhe/Germany in 2005. The contributions present new resources, new tools and new methods for corpus-informed language pedagogy. In general, the papers demonstrate a noticeable shift from the more 'traditional' uses of corpora and corpus technology in linguistic research towards uses with specific pedagogical goals in mind.

Contents:
Sabine Braun, Kurt Kohn and Joybrato Mukherjee: Introduction
Joybrato Mukherjee: Corpus linguistics and language pedagogy: the state of the art - and beyond
Sabine Braun: ELISA: a pedagogically enriched corpus for language learning purposes
Sandra Götz and Joybrato Mukherjee: Evaluation of Data-Driven Learning in university teaching: a project report
Ulrike Gut: Learner speech corpora in language teaching
Josef Schmied: Corpus linguistics and grammar learning: tutor versus learner perspectives
Christopher Tribble: Using keywords to read the news
Christiane Brand and Susanne Kämmerer: The Louvain International Database of Spoken English Interlanguage (LINDSEI): compiling the German component
Nadja Nesselhauf: Researching L2 production with ICLE
Yvonne Breyer: My Concordancer: tailor-made software for language learners and teachers
Sebastian Hoffmann and Stefan Evert: BNCweb (CQP-edition): the marriage of two corpus tools
Christoph Müller and Michael Strube: Multi-level annotation of linguistic data with MMAX2

With best wishes...
Ute

************************************************************
Dr. Ute Römer
English Department
Leibniz University of Hanover
Königsworther Platz 1
30167 Hannover
Germany
Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
Please note NEW e-mail address: ute.roemer <at> engsem.uni-hannover.de
http://www.uteroemer.com
http://www.fbls.uni-hannover.de/angli/
NEW conference website: "Exploring the Lexis-Grammar Interface" (ELeGI, 5-7 October 2006)
http://www.elegi-2006.com