Alexander Osherenko | 1 Sep 2006 15:00

SMS corpus

Hello,

has anybody heard of a text corpus with SMS messages? Actually it should 
be emotional, but at first it doesn't matter much.

Best

Alexander

Emmanuel PROCHASSON | 1 Sep 2006 15:47

Re: SMS corpus

Alexander Osherenko wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it 
> should be emotional, but at first it doesn't matter much.
I have seen this one, in English:
http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/

There is also a French one from the Université de Louvain, but it is
not available at the moment:
http://www.smspourlascience.be/

Hope that helps,

-- 
Emmanuel Prochasson

Iztok Kosem | 1 Sep 2006 15:56

Re: SMS corpus

Hello
 
I think Caroline Tagg at the University of Birmingham is doing research on the language of SMS messages and has compiled a corpus for this purpose.
I think her email is cxt491 <at> bham.ac.uk.
 
Hope this helps
 
Best
 
Iztok Kosem
Aston University
 
 
----- Original Message -----
Sent: Friday, September 01, 2006 2:00 PM
Subject: [Corpora-List] SMS corpus

Hello,

has anybody heard of a text corpus with SMS messages? Actually it should
be emotional, but at first it doesn't matter much.

Best

Alexander

Cédrick Fairon | 1 Sep 2006 16:06

Re: SMS corpus

Dear Alexander,

The Centre for Natural Language Processing at the University of
Louvain (http://cental.fltr.ucl.ac.be) has collected a corpus of
75,000 French SMS messages (more than 2,400 authors, aged 12 to 65).
Details about the project are available online: http://www.smspourlascience.be

A subset of this corpus (30,000 SMS) has been released and published
on CD-ROM by the Louvain University Press and is available from
http://www.i6doc.com/doc/sms (licence for non-profit organisations
only; others may contact us).

Two interesting remarks about the corpus:
- it contains information about the authors' profiles (sex, age,
occupation, mother tongue, second language, place of residence, etc.).
These profiles are linked to the messages, so that you can select a
subset of the corpus corresponding to given sociolinguistic details;
- each message is linked to a "transcribed" version in "standard"
French, so that you can search for a word and get all the variants
present in the corpus.
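
The two properties above (profiles linked to messages, and raw messages linked to a standard transcription) can be sketched roughly as follows. This is only an illustration in Python, not the format of the released CD-ROM: the class names, field names and helper functions are all hypothetical, the sample messages are invented for the sketch, and the lookup works at message level (it returns the raw messages whose transcription contains the word, rather than word-aligned variants).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    # Hypothetical profile fields, loosely following the list above.
    author_id: int
    sex: str
    age: int
    mother_tongue: str


@dataclass(frozen=True)
class Message:
    author_id: int
    raw: str          # the SMS as it was typed
    transcribed: str  # its transcription in "standard" French


def select_by_profile(messages, authors, predicate):
    """Return the messages whose author's profile satisfies `predicate`."""
    wanted = {a.author_id for a in authors if predicate(a)}
    return [m for m in messages if m.author_id in wanted]


def raw_messages_for(messages, standard_word):
    """Return raw messages whose transcription contains `standard_word`."""
    word = standard_word.lower()
    return [m.raw for m in messages if word in m.transcribed.lower().split()]


# Toy data invented for this sketch; NOT taken from the corpus.
AUTHORS = [
    Author(1, "F", 17, "French"),
    Author(2, "M", 42, "French"),
]
MESSAGES = [
    Message(1, "slt sa va", "salut ça va"),
    Message(2, "salu a 2m1", "salut à demain"),
    Message(2, "a 2m1", "à demain"),
]

if __name__ == "__main__":
    teens = select_by_profile(MESSAGES, AUTHORS, lambda a: a.age < 20)
    print([m.raw for m in teens])
    print(raw_messages_for(MESSAGES, "salut"))
```

Searching on the transcription rather than on the raw text is what makes spelling variants retrievable: the query word only has to match the standard form.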

Full details are in C. Fairon, S. Paumier (2006). "A translated corpus of
30,000 French SMS". In Proceedings of LREC 2006. Genoa.

Best Regards,

Cédrick

On 1 Sep 2006, at 15:00, Alexander Osherenko wrote:

> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it  
> should be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander
>

Cédrick Fairon
cedrick.fairon <at> uclouvain.be

Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgium
tel: +32 10 47 37 88
fax: +32 10 47 26 06

http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be

Sébastien Paumier | 1 Sep 2006 16:20

Re: SMS corpus

Hello,
The URL given by Cédrick Fairon is wrong. The correct one, which leads to
the corpus of 30,000 French SMS, is:

http://www.i6doc.com/doc/smscd

Best regards,

Sébastien Paumier
Institut Gaspard-Monge - Université de Marne-la-Vallée

Alexander Osherenko wrote:

> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it 
> should be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander

Min-Yen Kan | 1 Sep 2006 17:06

Re: SMS corpus

Hi all:

I think Emmanuel Prochasson already mentioned the corpus that we have
collected at NUS.  It is a medium-sized corpus of about 10K messages
sent by students in Singapore.  We are still in the process of
enlarging the corpus, but we would also like to hear what corpus
researchers are hoping to find in such corpora.  For example, would
a collection of many messages from a few individuals be of more use
than a collection of a few messages each from a wider variety of
contributors?

Most of the messages that we have collected were self-selected by
university students to be made public in the corpus, so we believe
there is likely a bias towards messages that are less personal than
what actually occurs in real life.  So you may have less luck finding
emotional messages in our corpus.

Have you thought of supplementing your corpus studies with chat
language?  My past student was looking at some chat logs from
commercial sites to supplement his studies and corpus collection.

The SMS corpus is here (as stated by Emmanuel)

http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/

Min-Yen Kan
Assistant Professor
Web / IR / NLP Group (WING), School of Computing
National University of Singapore

On 9/1/06, Alexander Osherenko <osherenko <at> gmx.de> wrote:
> Hello,
>
> has anybody heard of a text corpus with SMS messages? Actually it should
> be emotional, but at first it doesn't matter much.
>
> Best
>
> Alexander
>
>

Susana Sotillo | 1 Sep 2006 18:31

Re: SMS corpus

Hi,

I am also in the process of compiling an SMS corpus (in English).  I 
know that there must be a very large Italian corpus of SMS since I 
purchased the TreoDeskTop in order to download all my messages.  
Apparently a lot has been written about the pragmatics of SMS among 
Italian teenagers.  I have downloaded several articles.

Susana Sotillo

Iztok Kosem wrote:
> Hello
>  
> I think Caroline Tagg at the University of Birmingham is doing 
> research in the language of SMS messages and has compiled a corpus for 
> this purpose.
> I think her email is cxt491 <at> bham.ac.uk.
>  
> Hope this helps
>  
> Best
>  
> Iztok Kosem
> Aston University
>  
>  
>
>     ----- Original Message -----
>     *From:* Alexander Osherenko <mailto:osherenko <at> gmx.de>
>     *To:* corpora <at> hd.uib.no <mailto:corpora <at> hd.uib.no>
>     *Sent:* Friday, September 01, 2006 2:00 PM
>     *Subject:* [Corpora-List] SMS corpus
>
>     Hello,
>
>     has anybody heard of a text corpus with SMS messages? Actually it
>     should
>     be emotional, but at first it doesn't matter much.
>
>     Best
>
>     Alexander
>

Cyrus Shaoul | 2 Sep 2006 09:52

UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.

Hello Again,

**
IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED 
VERSION. SEE THE NOTE BELOW.
**

  A "thank you" to all the folk who downloaded the first version of our 
USENET word list. Some people made requests for a larger list of types, 
not restricted to my original dictionary. I have now finished the list 
of all types with frequency greater than 3 tokens/million tokens. It is 
large (28 MB, compressed), with 5,609,086 types. Unfortunately most of 
the types in this list are URLs, e-mail addresses and other cruft that 
are artifacts of my overly simplistic text processing (delete 
punctuation and split on whitespace).
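
To make the source of the cruft concrete, here is a minimal Python sketch of the "delete punctuation, split on whitespace" approach. It is not the actual program used to build the list: the function names are made up, only ASCII punctuation is removed, and case is left untouched because the processing described above does not mention case folding.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def naive_tokens(text):
    """Delete punctuation, then split on whitespace."""
    return text.translate(_STRIP_PUNCT).split()


def type_counts(lines):
    """Count how often each type occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(naive_tokens(line))
    return counts


# Two invented lines; example.org is a reserved documentation domain.
SAMPLE = [
    "See http://example.org/page for details.",
    "Mail me: someone@example.org, thanks!",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(naive_tokens(line))
```

With this approach a URL or e-mail address collapses into a single long string once its punctuation is deleted, which is why such items end up as distinct types in the full list.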

I know this list is not for everyone, but if you are interested in 
seeing a lot of types, please download the file from here, and please 
send me any feedback you have. I sorted the list by decreasing type 
frequency.

http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

WARNING: File size is 28 MB, compressed

**
NOTE: In doing this run, I noticed that my corpus grew in size from 5.9 
to 7.8 billion words, despite the fact that I was using the same raw 
data. I then discovered my bug: I forgot to count non-words in my 
original program. So if you downloaded the original list of 111,627 
words, the corpus size and freq/million numbers are WRONG! The counts 
were correct, though. Please download the corrected list here (914k, 
compressed):

http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

I also sorted this list by decreasing frequency for ease of use.
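
For anyone wondering how the counts could be right while the freq/million numbers were wrong, the effect can be reproduced with a toy example. The Python below is only a sketch under assumptions: the counts, the dictionary and the function name are invented, and it is not the original program. Going by the two corpus sizes quoted above, the old per-million values would have been too high by a factor of about 7.8 / 5.9, i.e. roughly 1.3.

```python
def per_million(count, corpus_size):
    """Frequency of a type per million tokens of the corpus."""
    return count * 1_000_000 / corpus_size


# Invented counts: two dictionary words and two non-words.
COUNTS = {"the": 6, "corpus": 2, "xqzt": 1, "httpexampleorgpage": 1}
DICTIONARY = {"the", "corpus"}

# Bug: the corpus size only added up tokens that were dictionary words.
buggy_size = sum(c for w, c in COUNTS.items() if w in DICTIONARY)

# Fix: every token counts towards the corpus size, word or not.
correct_size = sum(COUNTS.values())

# Sorting by decreasing frequency, as in the released lists.
by_frequency = sorted(COUNTS.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    print(buggy_size, correct_size)
    print(per_million(COUNTS["the"], buggy_size))
    print(per_million(COUNTS["the"], correct_size))
    print(by_frequency[0])
```

The raw count for each type is the same in both cases; only the denominator changes, so every per-million value shrinks by the same ratio.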

Thanks for your understanding,

Cyrus

=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}

Ute Römer | 2 Sep 2006 13:23

New book: Corpus Technology and Language Pedagogy


Dear all, 

On behalf of Joybrato Mukherjee (University of Giessen), I would like to
announce the publication of a new book that might be of interest to some of
you (TOC attached below):

Sabine Braun, Kurt Kohn and Joybrato Mukherjee, eds. (2006):
Corpus Technology and Language Pedagogy: New Resources, New Tools, New
Methods.
(English Corpus Linguistics, Volume 3)
Frankfurt/Main: Peter Lang.
ISBN: 3-631-54720-X
38 EUR
http://www.peterlang.com

About the volume:
The use of corpora and corpus technology for language learning and teaching 
purposes has been on the agenda of researchers, lexicographers and 
pedagogues for more than two decades now. The present volume is intended to 
take stock of some major developments in corpus-informed language pedagogy 
and brings together a number of contributions, many of which were 
originally presented at the Language Technology section of the LearnTec 
Conference in Karlsruhe/Germany in 2005. The contributions present new 
resources, new tools and new methods for corpus-informed language pedagogy. 
In general, the papers demonstrate a noticeable shift from the more 
'traditional' uses of corpora and corpus technology in linguistic research 
towards uses with specific pedagogical goals in mind.

Contents:

Sabine Braun, Kurt Kohn and Joybrato Mukherjee:
Introduction

Joybrato Mukherjee:
Corpus linguistics and language pedagogy: the state of the art - and beyond

Sabine Braun:
ELISA: a pedagogically enriched corpus for language learning purposes

Sandra Götz and Joybrato Mukherjee:
Evaluation of Data-Driven Learning in university teaching: a project report

Ulrike Gut:
Learner speech corpora in language teaching

Josef Schmied:
Corpus linguistics and grammar learning: tutor versus learner perspectives

Christopher Tribble:
Using keywords to read the news

Christiane Brand and Susanne Kämmerer:
The Louvain International Database of Spoken English Interlanguage 
(LINDSEI): compiling the German component

Nadja Nesselhauf:
Researching L2 production with ICLE

Yvonne Breyer:
My Concordancer: tailor-made software for language learners and teachers

Sebastian Hoffmann and Stefan Evert:
BNCweb (CQP-edition): the marriage of two corpus tools

Christoph Müller and Michael Strube:
Multi-level annotation of linguistic data with MMAX2

With best wishes... Ute

************************************************************

Dr. Ute Römer
English Department
Leibniz University of Hanover
Königsworther Platz 1
30167 Hannover
Germany

Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
Please note NEW e-mail address: ute.roemer <at> engsem.uni-hannover.de
http://www.uteroemer.com
http://www.fbls.uni-hannover.de/angli/
NEW conference website: "Exploring the Lexis-Grammar Interface" (ELeGI, 5-7
October 2006) http://www.elegi-2006.com

