Peter Parise | 1 Nov 03:15 2011

Speech to text programs,

Dear corpora list

I am transcribing some audio recordings of group discussions (about 4 people) and would like to know what is out there in terms of transcription programs. I know there is a lot of work involved and would like to make the process as painless as possible. Windows or Linux based is fine with me. Free or with a fee is fine as well. Any information would be greatly appreciated.
 
Peter Parise   http://tesolpeter.wordpress.com/
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora
Majid Laali | 1 Nov 09:30 2011

Seeking for multi-words unit tagged corpus

Hi, 

For my M.S. thesis, I need a corpus in which multiword units are tagged. The only resource I have found on the web is multiword.sourceforge.net. I would be grateful if anyone could point me to other corpora in which these units are tagged, especially parallel corpora, which are my focus.
 
Majid Laali,
NLP Lab (http://ece.ut.ac.ir/NLP),
ECE Department,
University of Tehran,



Diana McCarthy | 1 Nov 10:04 2011

Re: Seeking for multi-words unit tagged corpus

Hi Majid

I don't know of any parallel corpora, but SemCor has multiwords (from WordNet) tagged, and I believe the OntoNotes data (from the LDC) does too. Which languages are you interested in?

best

Diana



--
===========================================================================
Diana McCarthy, http://www.dianamccarthy.co.uk/
Lexical Computing Ltd. http://www.sketchengine.co.uk/
===========================================================================
Benet Vincent | 1 Nov 10:44 2011

Re: Speech to text programs,

Hi Peter, 

One of my colleagues uses Dragon NaturallySpeaking and seems pretty happy with it.

Benet 

Majid Laali | 1 Nov 12:55 2011

Re: Seeking for multi-words unit tagged corpus

Dear Diana,

Thank you for your help. I actually found my first parallel corpus with the help of your email (the MultiSemCor corpus, which is also available in other languages).
Regarding the language of the corpora, I prefer English. However, for parallel corpora, any language will help me.

Majid Laali,
NLP Lab (http://ece.ut.ac.ir/NLP),
ECE Department,
University of Tehran,



Qasemizadeh, Behrang | 1 Nov 13:04 2011

Internship on relation extraction

The Unit for Natural Language Processing (http://nlp.deri.ie/) at the Digital Enterprise Research Institute (http://www.deri.ie/) of the National University of Ireland, Galway invites applications for an internship on Relation Extraction from Wikipedia and the ACL Anthology.

The successful candidate will contribute to the development of a research prototype in Semantic Structure Mining.

Essential skills:
- software development (any of Java, Perl, C, Prolog, Matlab, SQL) in a Web Framework
- client-server architecture

Desirable skills:
- Natural Language Processing
- Machine Learning
- Linked Data

The internship lasts 6 months (starting in February 2012), with a remuneration of up to EUR 1200 per month.

PLEASE NOTE: the applicant must be a registered student at their home institute for the complete duration of the internship.

Email your application to Behrang (dot) QasemiZadeh (at) deri [dot] org
- include your CV (up to 2 pages) and motivation letter, both in PDF.

The deadline for applications is 20 November 2011.

Krishnamurthy, Ramesh | 1 Nov 14:47 2011

(no subject)

Hi Peter

We used Exmaralda (http://www.exmaralda.org/en_index.html; the page is also available in German and French) for the GeWiss project:
http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/

It is free, and seemed to offer an excellent range of facilities. The team responsible for its creation are also very helpful to users.

best

 

Ramesh Krishnamurthy

Visiting Academic Fellow, School of Languages and Social Sciences, Aston University, Birmingham B4 7ET

Room: NX01. Tel: 0121-204-3812.
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/

Corpus Analyst:

(a) GeWiss (Volkswagen Foundation) project: http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/

(b) Discourse of Climate Change: http://www1.aston.ac.uk/lss/research/research-projects/discourse-of-climate-change-project/

(c) Feminism: http://acorn.aston.ac.uk/projects.html

(d) COMENEGO (Corpus Multilingüe de Economía y Negocios) - Multilingual Corpus of Business and Economics: http://dti.ua.es/comenego

(e) European Phraseology Project: http://labidiomas3.ua.es/phraseology/login/login.php


Josef Ruppenhofer | 1 Nov 15:49 2011

German SALSA corpus release 2.0

Dear all,

The second and final release of the SALSA corpus, a German corpus with
semantic role annotations in the Berkeley FrameNet paradigm, is available for
download at http://www.coli.uni-saarland.de/projects/salsa/corpus/.

The corpus was created by the SALSA project at Saarland University under the
direction of Manfred Pinkal. Work on the corpus was supported by funds from
a Leibniz prize awarded to Manfred Pinkal and by the German Science
Foundation (DFG; grants PI  154/9-3, PI 154/8-1).

The frame-semantic annotations are applied on top of the TIGER treebank,
a syntactically annotated German newspaper corpus. SALSA release 2 references
TIGER version 2.1.

More information on TIGER and FrameNet can be found here:

http://www.ims.uni-stuttgart.de/projekte/TIGER/
https://framenet.icsi.berkeley.edu/fndrupal/

SALSA uses frames of FrameNet releases 1.2 and 1.3 for the German annotation,
wherever available and appropriate. In addition, SALSA has developed a
number of "proto-frames", i.e., predicate-specific frames, to provide
coverage for predicate instances currently not covered by FrameNet. The
total size of the annotation is roughly 20,000 verbal instances and, new
in SALSA release 2, more than 17,000 nominal instances.

More information on SALSA can be found on the website:

http://www.coli.uni-saarland.de/projects/salsa/

The annotation scheme is described in:

A. Burchardt, K. Erk, A. Frank, A. Kowalski, S. Pado and M. Pinkal. The
SALSA Corpus: a German Corpus Resource for Lexical Semantics. In:
Proceedings of LREC 2006, Genoa, Italy.

If you have any questions, feel free to send an email to

salsa-mit <at> coli.uni-sb.de

Josef Ruppenhofer on behalf of SALSA


Katya Alahverdzhieva | 1 Nov 18:27 2011

Call for Papers: International Workshop on Formal and Computational Approaches to Multimodal Communication

============================================
International Workshop on Formal and Computational Approaches to
Multimodal Communication

held under the auspices of ESSLLI 2012

August 6-17 2012
Opole, Poland

--- First Call for Papers ---

Submission deadline: February 15, 2012
=============================================

Workshop details

Face-to-face dialogue is consistently accompanied by co-verbal
behaviours playing an active role in achieving a successful
communication. A non-exhaustive list of the co-verbal actions
that take part in communication includes hand gestures, head
movements, eye-gaze dynamics, etc. The workshop aims at
collecting state-of-the-art research in the area of multimodal
communication. The focus is on formal and computational
approaches that bridge the gap between novel research findings
and well-established methods from linguistics, human-human
interaction and human-machine interaction.

The study of co-verbal components of communication is still a
young research field, in particular regarding formal and
computational approaches. However, the interest in the field is
rapidly growing as the increasing number of researchers
interested in it and the various dedicated conferences show.

The workshop is of interest to researchers from various
backgrounds:

* for formal linguists working on natural language: the workshop
is an opportunity for formal linguists to extend their perspective
on the properties that define language. Co-verbal forms of
communication are deeply connected with the structures of natural
language -- numerous studies have found a systematic link between
co-verbal behaviour on one hand and prosodic, syntactic,
semantic, and discourse structures on the other. Studying the
interplay between the distinct modalities of communication not
only brings up new challenging research questions related to the
nature of language but it also sheds light on well-known problems
in linguistics (e.g. embodied nature of meaning, aspectuality,
referential ‘space’);

* for computational linguists and researchers interested in
human-machine interaction: the possibility of exploiting the
information coming from different sources is becoming an
important technique in the toolkit of natural language
understanding. Further to this, the increasing ubiquity of
multimodal devices is opening up new ways of interaction between
users and machines. In both cases, co-verbal behaviour is an
important factor that scientists have to take into account.
Collecting reliable multimodal data is becoming more and more
viable (for example with the introduction of technologies like
time-of-flight cameras and other similar devices) and it is a
necessary condition to design human-machine interaction systems.

The workshop aims at providing a platform to discuss these topics
among the researchers already interested in formal and
computational aspects of multimodal communication, and to
introduce the current state of research to the participants of
ESSLLI that are not familiar with these phenomena.

A non-exhaustive list of topics covered by the workshop includes:

* co-verbal actions and linguistic structures
* resources, e.g., tools, corpora, annotation guidelines, for
studying (the interplay between) the distinct modalities of
communication
* multimodal interfaces addressing the usage of gaze,
gesture, speech, etc
* modelling human behaviour for the purposes of
human-computer interaction systems
* experimental design for studying multimodal behaviour

Workshop Format

Abstracts are invited for 30-minute talks (20 min talk + 10 min
questions), 3 each day of the workshop. The workshop will be
opened by a 60-minute main session and closed by a 30-minute
session for concluding remarks and feedback.

Submission Details

Authors are invited to submit abstracts describing original and
unpublished work. Abstracts detailing ongoing work will be
considered at the discretion of the programme committee.
Submissions should be anonymous, and must not exceed 2 pages
including examples and references. All abstracts must be
submitted electronically in pdf form via Easychair:

https://www.easychair.org/conferences/?conf=focomc2012

Each submission will be anonymously reviewed by three members of
the program committee, and possibly by additional reviewers. The
accepted papers will appear in the workshop electronic
proceedings.

Important dates

Submission deadline: February 15, 2012
Notification of workshop contributors: April 15, 2012
Final submission of camera-ready copies: May 15, 2012
Workshop dates: exact dates to be announced soon, but within August
6-17, 2012

Practical Information

All workshop participants including the presenters will be
required to register for ESSLLI. The registration fee for authors
presenting a paper will correspond to the early student/workshop
speaker registration fee. Moreover, a number of additional fee
waiver grants will be made available by the ESSLLI Organising
Committee on a competitive basis and workshop participants are
eligible to apply for those. There will be no reimbursement for
travel costs and accommodation. Workshop speakers who have
difficulty in finding funding should contact the local organizing
committee to ask about the possibility of a grant.

Further Information about ESSLLI 2012:
http://www.esslli2012.pl

Workshop webpage:
http://xerxes.carleton.ca/~giorgolo/gesture-workshop/

Workshop Organisers
Gianluca Giorgolo (Carleton University)
Katya Alahverdzhieva (University of Edinburgh)

Workshop Program Committee
(to be announced)

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Linguistic Data Consortium | 1 Nov 21:05 2011

News from LDC

  Fall 2011 LDC Data Scholarships recipients

New publications:

LDC2011S08
2008 NIST Speaker Recognition Evaluation Test Set

LDC2011T11
Arabic Gigaword Fifth Edition 

LDC2011T12
Spanish Gigaword Third Edition


Fall 2011 LDC Data Scholarships recipients

LDC is pleased to announce the student recipients of the Fall 2011 LDC Data Scholarship program!  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  Data scholarships are offered twice a year to correspond to the Fall and Spring semesters.  Students are asked to complete an application which consists of a data use proposal and a letter of support from their academic adviser.

LDC received many strong applications from students attending universities across the globe.  We've reviewed all the applications, and after careful consideration, we have selected four scholarship recipients!   These students will receive no-cost copies of LDC data:

Haris B C - Indian Institute of Technology Guwahati (India), Electronics & Electrical Engineering.  Haris has been awarded a copy of 2005 NIST Speaker Recognition Evaluation Training Data (LDC2011S01) and 2005 NIST Speaker Recognition Evaluation Test Data (LDC2011S04) to evaluate the performance of a sparse representation speaker verification system.

Friðjón Guðjohnsen - Reykjavik University (Iceland), Computer Science.  Friðjón has been awarded a copy of Treebank-3 (LDC99T42) to be used in the development of tagging methods to improve the accuracy of tagging Icelandic texts.

Leili Javadpour - Louisiana State University (USA), Engineering Science.  Leili has been awarded a copy of BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33) and Message Understanding Conference (MUC) 7 (LDC2001T02) for her work in pronominal anaphora resolution.

Jad Makhlouta - American University of Beirut (Lebanon), Electrical and Computer Engineering.  Jad has been awarded a copy of LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01) for his work in Arabic text mining.

Please join us in congratulating our student recipients!   Look for our upcoming announcements about the submission deadlines for the Spring 2012 LDC Data Scholarship program.



New publications

(1) 2008 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation (SRE).

NIST SRE is part of an ongoing series of evaluations conducted by NIST.  They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel in an interview scenario.

LDC previously released the 2008 NIST SRE Training Set in two parts as LDC2011S05 and LDC2011S07.

The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English and bilingual English speakers. The telephone speech in this corpus is predominantly English, but also includes other languages. All interview segments are in English. Telephone speech represents approximately 368 hours of the data, whereas microphone speech represents the other 574 hours.

English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system.



*

(2) Arabic Gigaword Fifth Edition is a comprehensive archive of newswire text data that has been acquired from Arabic news sources over several years by LDC. Arabic Gigaword Fifth Edition includes all of the content of the fourth edition of Arabic Gigaword (LDC2009T30) plus new data covering the period from January 1, 2009 through December 31, 2010.

Nine distinct sources of Arabic newswire are represented in this distribution:

Asharq Al-Awsat (aaw_arb)

Agence France Presse (afp_arb)

Al-Ahram (ahr_arb)

Assabah (asb_arb)

Al Hayat (hyt_arb)

An Nahar (nhr_arb)

Al-Quds Al-Arabi (qds_arb)

Ummah Press (umh_arb)

Xinhua News Agency (xin_arb)

The seven-character codes shown above represent both the directory names where the data files are found, and the 7-letter prefix that appears at the beginning of every file name. The 7-letter codes consist of the three-character source name IDs and the three-character language code ("arb") separated by an underscore ("_") character. The three-character language code conforms to the ISO 639-3 standard.
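
The naming convention described above can be illustrated with a minimal Python sketch. It is not part of the LDC distribution; the function name `split_gigaword_code` and the file name `xin_arb_201012` are hypothetical, and the sketch assumes only what the announcement states: a three-character source ID, an underscore, and a three-character language code at the start of each directory and file name.

```python
import re

# Assumed shape of the code: three lowercase letters, "_", three lowercase letters.
CODE_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def split_gigaword_code(name):
    """Return (source_id, language_code) taken from the start of a name."""
    match = CODE_RE.match(name)
    if match is None:
        raise ValueError(f"no 7-character source code at start of {name!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # A directory name from the list above.
    print(split_gigaword_code("afp_arb"))
    # A hypothetical file name that carries the code as its prefix.
    print(split_gigaword_code("xin_arb_201012"))
```

Grouping files by the first element of the returned pair would give one bucket per news source, while the second element can be checked against the expected ISO 639-3 code.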

In addition to adding new data, the following updates were made:

Repeated documents in Asharq Al-Awsat data from 2008 were removed.

Document formatting and docid duplication problems were corrected in Agence France Presse data.

Significant duplication of content in 2007-2008 An Nahar data was detected, and the duplicated documents were removed.



*

(3) Spanish Gigaword Third Edition was produced by LDC. It is a comprehensive archive of Spanish newswire text data that has been acquired over several years by LDC. Spanish Gigaword Third Edition includes all of the content of the second edition (LDC2009T21) and adds data collected from January 1, 2009 through December 31, 2010.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

Agence France-Presse, Spanish (afp_spa) May 1994 - Dec 2010

Associated Press, Spanish (apw_spa) Nov 1993 - Dec 2010

Xinhua News Agency, Spanish (xin_spa) Sep 2001 - Dec 2010

The seven-letter codes in the parentheses above include the three-character source name abbreviations and the three-character language code ("spa") separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard.

All text data are presented in SGML/XML form, using a very simple, minimal markup structure; all text consists of printable ASCII, whitespace, and printable code points in the "Latin1 Supplement" character table, as defined by both ISO-8859-1 and the Unicode Standard (ISO 10646) for the "accented" characters used in Spanish. The Supplement/accented characters are rendered using UTF-8 encoding.
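
As a rough illustration of the character inventory described above, the following Python sketch checks whether a string stays within printable ASCII, whitespace, and the Latin-1 Supplement range, and shows that the accented characters take two bytes each in UTF-8. The function names are hypothetical, and treating U+00A0 through U+00FF as the "printable" part of the Latin-1 Supplement is an assumption made for this sketch, not a statement of how LDC validated the data.

```python
def is_allowed_char(ch):
    """True for whitespace, printable ASCII, or U+00A0..U+00FF (assumed range)."""
    code = ord(ch)
    return ch in "\t\n\r " or 0x21 <= code <= 0x7E or 0xA0 <= code <= 0xFF


def in_expected_charset(text):
    """True if every character of text falls in the assumed inventory."""
    return all(is_allowed_char(ch) for ch in text)


if __name__ == "__main__":
    sample = "El niño llegó mañana"
    print(in_expected_charset(sample))
    # Character count versus UTF-8 byte count: accented letters use two bytes.
    print(len(sample), len(sample.encode("utf-8")))
```

A check of this kind could be run over the text of each document to flag characters outside the documented inventory before further processing.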




Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: 1 (215) 573-1275
University of Pennsylvania        Fax: 1 (215) 573-2175
3600 Market St., Suite 810        ldc <at> ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu