Diana Inkpen | 1 Apr 03:56 2009

Re: Sentence similarity

There is a data set of sentences with similarity scores produced by human judges, available at
The reference is:
Y. Li, D. McLean, Z. Bandar, J. O’Shea, and K. Crockett, “Sentence Similarity Based on Semantic Nets and Corpus Statistics,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 8, pp. 1138-1149, Aug. 2006.
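A common way to evaluate against such a data set is to correlate the system's scores with the human similarity judgments, e.g. with Pearson's r. A minimal Python sketch, where the two score lists are made up purely for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up scores for five sentence pairs, purely for illustration:
human = [0.1, 0.3, 0.5, 0.8, 0.9]    # gold similarity judgments
system = [0.2, 0.2, 0.6, 0.7, 0.95]  # scores from the system under test
print(round(pearson(human, system), 3))
```

A rank correlation (Spearman) computed the same way over rank positions is a common alternative when only the ordering of pairs matters.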


hamed wrote:
Dear Corpora members,

I have developed a system to measure similarity between sentences, but I do not know how to evaluate it. I'm looking for a sentence-similarity corpus, i.e., a collection of sentences with manually assigned similarities to other sentences. Any ideas?

Thank you very much.

Hamed Khanpour

Computer science student


_______________________________________________ Corpora mailing list Corpora <at> uib.no http://mailman.uib.no/listinfo/corpora

--
====================================================
Diana Inkpen
Associate Professor, PhD, PEng
University of Ottawa / School of Information Technology and Engineering
800 King Edward, Ottawa, ON, Canada, K1N 6N5
http://www.site.uottawa.ca/~diana
tel: 613-562-5800 ext. 6711
====================================================
Sunny Fugate | 1 Apr 07:25 2009

Re: Corpus which contains emoticons

The Naval Postgraduate School has a freely available, anonymized chat
corpus which I believe contains emoticons. Visit the NPS page on the
topic at http://faculty.nps.edu/cmartell/NPSChat.htm and contact
Craig Martell to request a copy. I believe this corpus is also
pre-packaged as part of the Natural Language Toolkit (NLTK).
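As a small, hedged illustration of filtering such a corpus for emoticons: the regex below is a simplistic Western-style emoticon matcher run over hypothetical chat lines, and the same predicate could be mapped over nltk.corpus.nps_chat.posts() once the corpus has been fetched with nltk.download('nps_chat').

```python
import re

# A rough Western-style emoticon pattern; tune it for a given corpus.
EMOTICON = re.compile(r"[:;=8][\-o^']?[)(\]\[DPp/\\|]")

def has_emoticon(text):
    """True if the text contains at least one emoticon match."""
    return EMOTICON.search(text) is not None

# Hypothetical chat lines standing in for corpus posts; the same filter
# could be applied to nltk.corpus.nps_chat.posts() (token lists) once
# the corpus has been installed via nltk.download('nps_chat').
lines = ["hi everyone :)", "brb", "that was great ;-D", "ok then"]
print([line for line in lines if has_emoticon(line)])
```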


Sunny Fugate
University of New Mexico Graduate Student

On Mar 31, 2009, at 11:17 AM, Smaranda Muresan wrote:

> Dear all
> Does anyone know of a freely available corpus which contains  
> emoticons?
> Ideally it would be chats, but blogs or other types of text would also work.
> Thank you,
> Smaranda Muresan
> Assistant Professor
> Rutgers University


K.Taraka Rama | 1 Apr 15:23 2009

Distance between natural languages

I know some researchers who work in this area.


This paper has a number of references which might be very useful to you.


This is a survey report on computational historical linguistics which might interest you.

With Regards,
Taraka Rama.

Matthew Purver | 1 Apr 15:13 2009

SIGDIAL 2009 Conference: Submission Deadline April 24

10th Annual Meeting of the Special Interest Group
on Discourse and Dialogue

Queen Mary University of London, UK - September 11-12, 2009
(right after Interspeech 2009)

Submission Deadline: April 24, 2009


The SIGDIAL venue provides a regular forum for the presentation of
cutting edge research in discourse and dialogue to both academic and
industry researchers. Due to the success of the nine previous SIGDIAL
workshops, SIGDIAL is now a conference. The conference is sponsored by
the SIGDIAL organization, which serves as the Special Interest Group in
discourse and dialogue for both ACL and ISCA. SIGDIAL 2009 will be
co-located with Interspeech 2009 as a satellite event.

In addition to presentations and system demonstrations, the program
includes talks by two invited speakers:

- Professor Janet Bavelas (University of Victoria): "What's unique about
- Professor Yorick Wilks (University of Sheffield): "Artificial
companions as dialogue agents".


We welcome formal, corpus-based, implementation, experimental, or
analytical work on discourse and dialogue including, but not restricted
to, the following themes:

1. Discourse Processing and Dialogue Systems

Discourse semantic and pragmatic issues in NLP applications such as text
summarization, question answering, and information retrieval, including
topics like:

- Discourse structure, temporal structure, information structure;
- Discourse markers, cues and particles and their use;
- (Co-)Reference and anaphora resolution, metonymy and bridging resolution;
- Subjectivity, opinions and semantic orientation;

Spoken, multi-modal, and text/web based dialogue systems including
topics such as:

- Dialogue management models;
- Speech and gesture, text and graphics integration;
- Strategies for preventing, detecting or handling miscommunication
(repair and correction types, clarification and under-specificity,
grounding and feedback strategies);
- Utilizing prosodic information for understanding and for disambiguation;

2. Corpora, Tools and Methodology

Corpus-based and experimental work on discourse and spoken, text-based
and multi-modal dialogue including its support, in particular:

- Annotation tools and coding schemes;
- Data resources for discourse and dialogue studies;
- Corpus-based techniques and analysis (including machine learning);
- Evaluation of systems and components, including methodology, metrics
and case studies;

3. Pragmatic and/or Semantic Modeling

The pragmatics and/or semantics of discourse and dialogue (i.e. beyond a
single sentence) including the following issues:

- The semantics/pragmatics of dialogue acts (including those which are
less studied in the semantics/pragmatics framework);
- Models of discourse/dialogue structure and their relation to
referential and relational structure;
- Prosody in discourse and dialogue;
- Models of presupposition and accommodation; operational models of
   conversational implicature.


The program committee welcomes the submission of long papers for full
plenary presentation as well as short papers and demonstrations. Short
papers and demo descriptions will be featured in short plenary
presentations, followed by posters and demonstrations.

- Long papers must be no longer than 8 pages, including title, examples,
references, etc. Up to two additional pages are allowed as an appendix,
which may include extended example discourses or dialogues, algorithms,
graphical representations, etc.
- Short papers and demo descriptions should be 4 pages or less
(including title, examples, references, etc.).

Please use the official ACL style files:

Papers that have been or will be submitted to other meetings or
publications must provide this information (see submission format).
SIGDIAL 2009 cannot accept for publication or presentation work that
will be (or has been) published elsewhere. Any questions regarding
submissions can be sent to the General Co-Chairs.

Authors are encouraged to make illustrative materials available, on the
web or otherwise. Examples might include excerpts of recorded
conversations, recordings of human-computer dialogues, interfaces to
working systems, and so on.


In order to recognize significant advancements in dialogue and discourse
science and technology, SIGDIAL will (for the first time) present best
paper awards. A selection committee consisting of prominent researchers
in the fields of interest will select the recipients of the awards.


The Young Researchers' Roundtable on Spoken Dialogue Systems, a SIGDIAL
satellite event, is also to be held at QMUL, on September 13-14
(i.e. immediately following the main conference). This is an annual
workshop designed for students, post docs, and junior researchers
working in research related to spoken dialogue systems in both academia
and industry.


Submission: April 24, 2009
Workshop: September 11-12, 2009
Young Researchers' Roundtable: September 13-14, 2009


SIGDIAL 2009 conference website:
SIGDIAL organization website: http://www.sigdial.org/
Interspeech 2009 website: http://www.interspeech2009.org/
Young Researchers' Roundtable website: http://www.yrrsds.org/


For any questions, please contact the appropriate members of the
organizing committee:

Pat Healey (Queen Mary University of London): ph <at> dcs.qmul.ac.uk
Roberto Pieraccini (SpeechCycle): roberto <at> speechcycle.com

Donna Byron (Northeastern University): dbyron <at> ccs.neu.edu
Steve Young (University of Cambridge): sjy <at> eng.cam.ac.uk

Matt Purver (Queen Mary University of London): mpurver <at> dcs.qmul.ac.uk

Tim Paek (Microsoft Research): timpaek <at> microsoft.com

Amanda Stent (AT&T Labs - Research): amanda.stent <at> gmail.com


Gregory Aist               Arizona State University, USA
Jan Alexandersson          DFKI GmbH, Germany
Jason Baldridge            University of Texas at Austin, USA
Srinivas Bangalore         AT&T Labs - Research, USA
Dan Bohus                  Microsoft Research, USA
Johan Bos                  Università di Roma "La Sapienza", Italy
Charles Calloway           University of Edinburgh, UK
Rolf Carlson               Royal Institute of Technology (KTH), Sweden
Mark Core                  University of Southern California, USA
David DeVault              University of Southern California, USA
Myroslava Dzikovska        University of Edinburgh, UK
Markus Egg                 Rijksuniversiteit Groningen, Netherlands
Stephanie Elzer            Millersville University, USA
Mary Ellen Foster          Technical University Munich, Germany
Kallirroi Georgila         University of Edinburgh, UK
Jonathan Ginzburg          King's College London, UK
Genevieve Gorrell          Sheffield University, UK
Alexander Gruenstein       Massachusetts Institute of Technology, USA
Pat Healey                 Queen Mary University of London, UK
Mattias Heldner            Royal Institute of Technology (KTH), Sweden
Beth Ann Hockey            University of California at Santa Cruz, USA
Kristiina Jokinen          University of Helsinki, Finland
Arne Jonsson               University of Linköping, Sweden
Simon Keizer               University of Cambridge, UK
John Kelleher              Dublin Institute of Technology, Ireland
Alexander Koller           University of Edinburgh, UK
Ivana Kruijff-Korbayová    Universität des Saarlandes, Germany
Staffan Larsson            Göteborg University, Sweden
Gary Geunbae Lee           Pohang University of Science and Technology, Korea
Fabrice Lefevre            University of Avignon, France
Oliver Lemon               University of Edinburgh, UK
James Lester               North Carolina State University, USA
Diane Litman               University of Pittsburgh, USA
Ramón López-Cózar          University of Granada, Spain
François Mairesse          University of Cambridge, UK
Michael McTear             University of Ulster, UK
Wolfgang Minker            University of Ulm, Germany
Sebastian Möller           Deutsche Telekom Labs and Technical
University Berlin, Germany
Vincent Ng                 University of Texas at Dallas, USA
Tim Paek                   Microsoft Research, USA
Patrick Paroubek           LIMSI-CNRS, France
Roberto Pieraccini         SpeechCycle, USA
Paul Piwek                 Open University, UK
Rashmi Prasad              University of Pennsylvania, USA
Matt Purver                Queen Mary University of London, UK
Laurent Romary             INRIA, France
Alex Rudnicky              Carnegie Mellon University, USA
Yoshinori Sagisaka         Waseda University, Japan
Ruhi Sarikaya              IBM Research, USA
Candy Sidner               BAE Systems AIT, USA
Ronnie Smith               East Carolina University, USA
Amanda Stent               AT&T Labs - Research, USA
Matthew Stone              Rutgers University, USA
Matthew Stuttle            Toshiba Research, UK
Joel Tetreault             Educational Testing Service, USA
Jason Williams             AT&T Labs - Research, USA


Matthew Purver - http://www.dcs.qmul.ac.uk/~mpurver/

Senior Research Fellow
Interaction, Media and Communication
Department of Computer Science
Queen Mary University of London, London E1 4NS, UK


Buabin, Emmanuel J | 1 Apr 20:22 2009

Medical Information/Documents for text mining


Please, I would appreciate it if anyone could help me find a dataset containing medical information/documents for text mining. Datasets with classification (class) labels would be preferred.

Emmanuel Buabin

Kevin B. Cohen | 1 Apr 20:26 2009

Re: Medical Information/Documents for text mining


If you really want medical documents, the only data set that I know of is this one:


It does have class labels, in the form of ICD-9-CM codes.

Best wishes,


On Wed, Apr 1, 2009 at 12:22 PM, Buabin, Emmanuel J <jojobuabin <at> yahoo.com> wrote:


Please, I would appreciate it if anyone could help with a dataset containing medical information/documents for text mining.  Datasets with classifications (class labels) would be most preferred.

Emmanuel Buabin


K. B. Cohen
Biomedical Text Mining Group Lead, Center for Computational Pharmacology
Lead Artificial Intelligence Engineer, The MITRE Corporation, Human Language Technology Division
303-916-2417 (cell) 303-377-9194 (home)

Buabin, Emmanuel J | 1 Apr 22:28 2009

NEWS TEXT CORPUS (5 or more years)


Please, I would appreciate it if anyone could help me find a dataset containing NEWS ARTICLES for text mining. A corpus spanning 5 or more years, with classification (class) labels, would be preferred.

Emmanuel Buabin

Megerdoomian, Karine | 1 Apr 20:49 2009

CFP: Computational Approaches to Arabic Script Languages (CAASL3)



                     THIRD WORKSHOP ON
     COMPUTATIONAL APPROACHES TO ARABIC SCRIPT-BASED LANGUAGES (CAASL3)

August 26, 2009

Machine Translation Summit XII

Ottawa, Ontario, Canada



The Organizing Committee of the Third Workshop on Computational Approaches to Arabic Script-based Languages invites proposals for presentation at CAASL3, being held in conjunction with MT Summit XII.




The first two workshops (2004 and 2007) brought together researchers working on the computer processing of Arabic script-based languages such as Arabic, Persian (Farsi and Dari), Pashto and Urdu, among others. The use of the Arabic script and the influence of Arabic vocabulary give rise to computational issues that are common to these languages despite their belonging to distinct language families: right-to-left direction, encoding variation, absence of capitalization, complex word structure, and a high degree of ambiguity due to the non-representation of short vowels in the writing system.


The third workshop (CAASL3), five years after the successful first workshop, will provide a forum for researchers, developers, practitioners, and users from academia, industry, and government to share their research and experience, with a focus on machine translation. It also provides an opportunity to assess the progress made since the first workshop in 2004.


The call for papers as well as future information on the workshop can be found at http://www.arabicscript.org.




Paper submission deadline: May 8, 2009

Notification of acceptance: June 12, 2009

Camera ready submissions: July 10, 2009




We welcome submissions in any area of NLP for Arabic script-based languages. However, preference will be given to papers that focus on machine translation applications for Arabic script-based languages. The main themes of this workshop include:


  • Statistical and rule-based machine translation

  • Translation aids

  • Evaluation methods and techniques for machine translation systems

  • MT of dialectal and conversational language

  • Computer-mediated communication (e.g., blogs, forums, chats)

  • Knowledge bases, corpora, and development of resources for MT applications

  • Speech-to-speech MT

  • MT combined with other technologies (speech translation, cross-language information retrieval, multilingual text categorization, multilingual text summarization, multilingual natural language generation, etc.)

  • Entity extraction

  • Tokenization and segmentation

  • Speech synthesis and recognition

  • Text to speech systems

  • Semantic analysis


SUBMISSION REQUIREMENTS                                                 


Papers should not have been presented or be under consideration for publication elsewhere, and should not identify the author(s). They should emphasize completed work rather than intended work. Each paper will be anonymously reviewed by three members of the program committee.


Papers must be submitted in PDF format to caasl3 <at> arabicscript.org by midnight on the due date. Submissions should be in English. Papers should be attached to an email indicating contact information for the author(s) and the paper's title. Papers should not exceed 8 pages including references and tables, and should follow the formatting guidelines posted at




For further information, please visit the workshop site at http://www.arabicscript.org/CAASL3 or contact the organizing committee at caasl3 <at> arabicscript.org.




Ali Farghaly, Oracle USA

Karine Megerdoomian,  The Mitre Corporation

Hassan Sawaf, AppTek Inc.




Jan W. Amtrup (Kofax Image Products)
Kenneth Beesley (SAP)
Mahmood Bijankhan (Tehran University, Iran)
Tim Buckwalter (University of Maryland)
Miriam Butt (Konstanz University, Germany)
Violetta Cavalli-Sforza (Al Akhawayn University, Morocco)
Sherri L. Condon (The MITRE Corporation)
Kareem Darwish (Cairo University, Egypt and IBM)
Mona Diab (Columbia University)
Joseph Dichy (Lyon University)
Andrew Freeman (The MITRE Corporation)
Nizar Habash (Columbia University)
Lamia Hadrich Belguith (University of Sfax, Tunisia)
Hany Hassan (IBM)
Sarmad Hussain (CRULP and FAST National University, Pakistan)
Simin Karimi (University of Arizona)
Martin Kay (Stanford University)
Mohamed Maamouri (Linguistic Data Consortium)
Shrikanth Narayanan (University of Southern California)
Hermann Ney (RWTH Aachen, Germany)
Farhad Oroumchian (University of Wollongong in Dubai)
Nick Pendar (H5 Technologies)
Kristin Precoda (SRI International)
Jean Sennellart (SYSTRAN)
Ahmed Rafea (The American University in Cairo)
Khaled Shaalan (The British University in Dubai)
Mehrnoush Shamsfard (Shahid Beheshti University, Iran)
Stephan Vogel (CMU)
Imed Zitouni (IBM)


Mark Davies | 2 Apr 00:52 2009

Corpus size and accuracy of frequency listings

I'm looking for studies that have considered how corpus size affects the accuracy of word frequency listings.

For example, suppose that one uses a 100 million word corpus and a good tagger/lemmatizer to generate a
frequency listing of the top 10,000 lemmas in that corpus. If one were to then take just every fifth word or
every fiftieth word in the running text of the 100 million word corpus (thus creating a 20 million or a 2
million word corpus), how much would this affect the top 10,000 lemma list? Obviously it's a function of
the size of the frequency list as well -- things might not change much in terms of the top 100 lemmas in going
from a 20 million word to a 100 million word corpus, whereas they would change much more for a 20,000 lemma
list. But that's precisely the type of data I'm looking for.
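Absent published studies, one way to get a feel for the effect is to simulate it: build a frequency list from a synthetic Zipf-distributed corpus, then from every fifth and every fiftieth token, and measure how much of the full-corpus top-n list each sub-corpus recovers. All sizes below are illustrative assumptions, scaled well down from the 100-million-word scenario:

```python
import random
from collections import Counter

def top_n(tokens, n):
    """Set of the n most frequent word types in a token stream."""
    return {w for w, _ in Counter(tokens).most_common(n)}

def overlap(full, sample, n):
    """Share of the full-corpus top-n list recovered from the sample."""
    return len(top_n(full, n) & top_n(sample, n)) / n

# A synthetic Zipf-distributed "corpus" stands in for the real one;
# vocabulary and corpus sizes are scaled down so the sketch runs fast.
random.seed(0)
ranks = range(5000)
vocab = [f"w{r}" for r in ranks]
weights = [1.0 / (r + 1) for r in ranks]  # Zipfian frequencies
corpus = random.choices(vocab, weights=weights, k=200_000)

fifth = corpus[::5]      # "every fifth word" sub-corpus
fiftieth = corpus[::50]  # "every fiftieth word" sub-corpus

for n in (100, 1000):
    print(n, round(overlap(corpus, fifth, n), 3),
             round(overlap(corpus, fiftieth, n), 3))
```

The qualitative pattern in the question shows up directly: short frequency lists are more stable under subsampling than long ones, and the denser sample recovers more of the list.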

Thanks in advance,

Mark Davies

Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **


Torsten Zesch | 2 Apr 09:47 2009

ACL/IJCNLP-2009 Workshop - Final CFP - The People's Web meets NLP: Collaboratively Constructed Semantic Resources

ACL/IJCNLP-2009 Workshop

"The People's Web meets NLP:
Collaboratively Constructed Semantic Resources"

Co-located with Joint conference of the 47th Annual Meeting of the
Association for Computational Linguistics and the 4th International
Joint Conference on Natural Language Processing of the Asian
Federation of Natural Language Processing

August 7th, 2009


In recent years, online resources collaboratively constructed by
users on the Web have considerably influenced the NLP community. In many
works, they have been used with great success as substitutes for
conventional semantic resources and as semantically structured corpora.
While conventional resources such as WordNet are developed by trained
linguists [1], online semantic resources can be automatically extracted
from content collaboratively created by users [2]. In this way, the
knowledge acquisition bottleneck and coverage problems pertinent to
conventional lexical semantic resources can be overcome.

The resource that has gained the greatest popularity in this respect
so far is Wikipedia. However, other resources recently discovered in
NLP, such as folksonomies, the multilingual collaboratively
constructed dictionary Wiktionary, or Q&A sites like WikiAnswers or
Yahoo! Answers are also very promising. Moreover, new wiki-based
platforms such as Citizendium or Knol have recently emerged that
offer features distinct from Wikipedia and are of high potential
in terms of their use in NLP.

The benefits of using Web-based resources come with new challenges,
such as interoperability with existing resources and the quality of the
knowledge they represent. As collaboratively created resources lack
editorial control, they are typically incomplete. For interoperability
with conventional resources, mappings between them have to be
investigated. The quality of collaboratively constructed resources is
often questioned, and information extraction remains a complicated task
because the content is incomplete and only semi-structured. Therefore,
the research community has begun to develop and provide tools for
accessing collaboratively constructed resources [2,5].

The challenges listed above also present an opportunity for NLP
techniques to improve the quality of Web-based semantic resources.
Researchers have therefore proposed techniques for link prediction [3]
and information extraction [4] that can be used to guide the "crowds"
to construct resources that are, in return, better suited for use
in NLP.
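As a toy illustration of the semi-structured content involved, internal links (one simple kind of extractable semantic information) can be pulled from raw wikitext with a regular expression; the markup snippet below is hypothetical:

```python
import re

# [[Target]] or [[Target|anchor text]] internal links in raw wikitext.
LINK = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Return (target, anchor) pairs; the anchor defaults to the target."""
    return [(t.strip(), (a or t).strip())
            for t, a in LINK.findall(wikitext)]

# A hypothetical snippet of wiki markup:
sample = ("'''Wikipedia''' is a [[wiki]]-based [[online encyclopedia|"
          "encyclopedia]] written [[Collaboration|collaboratively]].")
print(extract_links(sample))
```

Real extraction tools handle far more than links (infoboxes, categories, redirects, templates), which is exactly why dedicated access libraries [2,5] are worth having.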

[1] Christiane Fellbaum (ed.)
    WordNet: An Electronic Lexical Database.
    MIT Press, 1998.
[2] Torsten Zesch, Christof Mueller and Iryna Gurevych
    Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
    Proceedings of the Conference on Language Resources and Evaluation
    (LREC), 2008.
[3] Rada Mihalcea and Andras Csomai
    Wikify!: Linking Documents to Encyclopedic Knowledge.
    Proceedings of the Sixteenth ACM Conference on Information and
    Knowledge Management, CIKM 2007.
[4] Daniel S. Weld et al.
    Intelligence in Wikipedia.
    Twenty-Third Conference on Artificial Intelligence (AAAI), 2008.
[5] Kotaro Nakayama et al.
    Wikipedia Mining - Wikipedia as a Corpus for Knowledge Extraction.
    Proceedings of the Annual Wikipedia Conference (Wikimania), 2008.


The workshop will bring together researchers from both worlds: those
using collaboratively created resources in NLP applications such as
information retrieval, named entity recognition, or keyword extraction,
and those using NLP applications for improving the resources or
extracting different types of semantic information from them. Hopefully,
this will turn into a feedback loop, where NLP techniques improved by
collaboratively constructed resources are used to improve the resources
in exchange.

Specific topics include but are not limited to:
 * Different types of collaboratively constructed resources, such as
   wiki-based platforms, Q&A sites or folksonomies;
 * Using collaboratively constructed resources in NLP such as
   information retrieval, text categorization, information
   extraction, etc.;
 * Analyzing the properties of collaboratively constructed resources
   related to their use in NLP;
 * Interoperability of collaboratively constructed resources with
   conventional semantic resources and between themselves;
 * Converting unstructured information into structured lexical
   semantic information;
 * Tools for mining social and collaborative content;
 * Quality issues with respect to collaboratively constructed resources.

We also encourage the submission of short papers describing publicly
available tools for accessing or analyzing collaboratively created
resources. During the breaks, tables can be provided for demonstrations.


Rada Mihalcea, University of North Texas


Full paper submissions should follow the two-column format of ACL-IJCNLP
2009 proceedings without exceeding eight (8) pages of content plus one
extra page for references.  Short paper submissions should also follow
the two-column format of ACL-IJCNLP 2009 proceedings, and should not
exceed four (4) pages, including references.
We strongly recommend the use of ACL LaTeX style files or Microsoft
Word Style files tailored for this year's conference, which will be
available on the conference website. All submissions must conform to
the official ACL-IJCNLP 2009 style guidelines available at:

As the reviewing will be blind, the paper must not include the authors'
names and affiliations. Furthermore, self-references that reveal the
author's identity, e.g., "We previously showed (Smith, 1991) ...", must
be avoided. Instead, use citations such as "Smith previously showed
(Smith, 1991) ...". Papers that do not conform to these requirements
will be rejected without review.  

All accepted papers will be presented orally and published in the
workshop proceedings.
The deadline for all papers is May 1st, 2009 (GMT-12).

Submission is electronic using paper submission software at:


Paper submission deadline (full and short): May  1, 2009
Notification of acceptance of papers:       June 1, 2009
Camera-ready copy of papers due:            June 7, 2009
ACL-IJCNLP 2009 Workshop:                   Aug  7, 2009


Iryna Gurevych
Torsten Zesch

Ubiquitous Knowledge Processing Lab
Technical University of Darmstadt, Germany


Delphine Bernhard   Technische Universität Darmstadt
Paul Buitelaar      DERI, National University of Ireland, Galway
Razvan Bunescu      University of Texas at Austin
Pablo Castells      Universidad Autónoma de Madrid
Philipp Cimiano     Karlsruhe University
Irene Cramer        Dortmund University of Technology
Andras Csomai       Google Inc.
Ernesto De Luca     University of Magdeburg
Roxana Girju        University of Illinois at Urbana-Champaign
Andreas Hotho       University of Kassel
Graeme Hirst        University of Toronto
Ed Hovy             University of Southern California
Jussi Karlgren      Swedish Institute of Computer Science
Boris Katz          Massachusetts Institute of Technology
Adam Kilgarriff     Lexical Computing Ltd
Chin-Yew Lin        Microsoft Research
James Martin        University of Colorado Boulder
Olena Medelyan      University of Waikato
David Milne         University of Waikato
Saif Mohammad       University of Maryland
Dan Moldovan        University of Texas at Dallas
Kotaro Nakayama     University of Tokyo
Ani Nenkova         University of Pennsylvania
Guenter Neumann     DFKI Saarbruecken
Maarten de Rijke    University of Amsterdam
Magnus Sahlgren     Swedish Institute of Computer Science
Manfred Stede       Potsdam University
Benno Stein         Bauhaus University Weimar
Tonio Wandmacher    University of Osnabrueck
Rene Witte          Concordia University Montreal
Hans-Peter Zorn     European Media Lab, Heidelberg
