HOANG Cong Duy Vu | 1 Feb 2011 07:06

Looking for parallel corpora of Asian languages

Hi,

I am looking for any freely available parallel corpora of Asian languages (e.g. English-Korean, English-Japanese, English-Malay, English-Chinese) for my own research. I am especially interested in parallel corpora in the travel domain. If anyone knows of any, please let me know.

I would greatly appreciate any ideas and suggestions. Thanks so much in advance!

--
Cheers,
Vu
_______________________________________________
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora
Suzan Verberne | 1 Feb 2011 10:25

Faster tool for WordNet Similarity measures

Hi all,

I have previously been using Pedersen's WordNet Similarity module (
http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm
) for calculating the similarity or relatedness between pairs of
words. I have now started using it again, but I noticed that it is far too
slow for a real-time application (which is what I need now).

I originally wrote a simple Perl script that calls the module (shown
below) but it takes almost five seconds to run. Almost all this time
is spent on calling the module so for batch scripts it is fine (then
the module is only called once for multiple requests), but I need it
to work in real time in a retrieval experiment and then 5 seconds is
too long.

Does anyone know of an alternative (fast!) tool for calculating
similarity and/or relatedness between two words? It could use
either a Wu & Palmer-like measure or a Lesk-type measure.

Thanks!
Suzan Verberne

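#!/usr/bin/perl
use strict;
use warnings;
use WordNet::QueryData;
use WordNet::Similarity::path;

# Load WordNet and set up the path measure (the slow part of the script).
my $wn = WordNet::QueryData->new;
my $measure = WordNet::Similarity::path->new($wn);

# Word senses are written as word#pos#sense.
my $value = $measure->getRelatedness("car#n#1", "bus#n#2");
print "car (sense 1) <-> bus (sense 2) = $value\n";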

-- 
Suzan Verberne, postdoctoral researcher
Centre for Language and Speech Technology
Radboud University Nijmegen
Tel: +31 24 3611134
Email: s.verberne <at> let.ru.nl
http://lands.let.ru.nl/~sverbern/
--

Jarmo Jantunen | 1 Feb 2011 14:35

Advance notice: Conference on Learner Language and Learner Corpora

Dear Colleagues,

Learner Language, Learner Corpora - LLLC 2012 - an international conference addressing learner language
and language corpora will take place on October 5-6, 2012 at the University of Oulu, Finland. This
conference is aimed at those who use electronic data in learner language analyses, as well as those who
study Finno-Ugric learner languages, their acquisition and teaching.

The conference will be organised to celebrate the 15th anniversary of the VIRSU project and to introduce
the CoLLU project (Corpus study on Learner Language Universals). VIRSU was originally a cooperative
project for teachers and researchers of Estonian and Finnish as a foreign language. It has recently
expanded its aims and thus seeks to create connections between all linguists working with any of the
Finno-Ugric target (foreign or second) languages, or the language situations in the countries where
Finno-Ugric languages are spoken.

The CoLLU project in turn focuses on the identification of Learner Language Universals that are present in
a learner language regardless of the learner's mother tongue. The project hosts the ICLFI Corpus, which
is a multi-mother tongue corpus of learner Finnish that is almost five years old.

The invited keynote speakers are 
Professor Fanny Meunier (Centre for English Corpus Linguistics, Université Catholique de Louvain), 
Professor Charlotte Gooskens (Scandinavian Department, University of Groningen), 
Postdoctoral Researcher Minna Suni (Academy of Finland, University of Jyväskylä). 

The languages of the conference will be Finnish and English.

Important dates:
August 31st 2011: First call for papers
December 31st 2011: Abstract submission
March 31st 2012: Notification of acceptance
October 5-6 2012: LLLC 2012 Conference

Further information on the conference and the venue will be given in the first call for papers. We are
looking forward to seeing you in Oulu in 2012!

Pirkko Muikku-Werner
Professor
School of Humanities (Joensuu)
University of Eastern Finland
pirkko.muikku-werner <at> uef.fi

Jarmo Harri Jantunen
Adjunct Professor
Faculty of Humanities
University of Oulu
jarmo.jantunen <at> oulu.fi

Marco Guerini | 1 Feb 2011 15:32

New Release of CORPS: a corpus of political speeches tagged with audience reactions

Dear Colleagues,

I'm pleased to announce the new release of CORPS, a corpus of
political speeches tagged with specific audience reactions, such as
APPLAUSE or LAUGHTER.

The new release contains more than 3600 speeches (all in native
English), about 7.9 million words, and more than 67 thousand
audience-reaction tags. The corpus can be usefully employed in
many fields, such as:

- Qualitative analysis of political communication
- NLP-based persuasive expression mining
- Automatic production of persuasive communication

The corpus is freely available for research purposes. For further
details please see: http://hlt.fbk.eu/corps

Best Regards
Marco Guerini

Ted Pedersen | 1 Feb 2011 19:06

Re: Faster tool for WordNet Similarity measures

Hi Suzan,

If you locate such an alternative I'm absolutely dying to know. :)

It would also be helpful to know how fast would be considered *fast*
or fast enough. I did a quick little benchmark using 10,000 randomly
generated noun pairs (using our randomPairs.pl program from
http://www.d.umn.edu/~tpederse/wordnet.html ). I used a rather elderly
machine with a 2.0 GHz Xeon processor, and ran Wu and Palmer (wup),
Adapted Lesk (lesk) and Shortest Path (path) on those 10,000 pairs.

path finished in 153 seconds, so about 65 pairs a second
wup finished in 268 seconds, so about 37 pairs a second
lesk finished in 434 seconds, so about 23 pairs a second
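
For anyone who wants to repeat the calibration on their own machine, here is a minimal Python sketch of the arithmetic. It only turns elapsed seconds for a run of 10,000 pairs into an average pairs-per-second figure; the helper name is made up for illustration, and the timings are simply the three quoted above.

```python
# Convert benchmark timings (seconds for 10,000 pairs) into pairs per second.
# The helper name is illustrative only; the timings are the ones quoted above.
PAIRS = 10_000

def pairs_per_second(elapsed_seconds, pairs=PAIRS):
    """Average throughput over the whole run, start-up time included."""
    return pairs / elapsed_seconds

timings = {"path": 153, "wup": 268, "lesk": 434}
rates = {name: pairs_per_second(t) for name, t in timings.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} pairs/second")
```

Because the start-up cost is folded into the average, a run with only a handful of pairs will report a much lower rate than a long batch run.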

Now this isn't blazing fast, but on the other hand it doesn't strike me
as unreasonably slow either. What would be your target for pairs per
second? That could help figure out an appropriate solution. Also, if
you are finding that it takes considerably longer than the times I
report above, please do let me know as there might be some other issue
that is slowing you down.

For those who are truly interested in WordNet::Similarity performance,
here are a few more thoughts - I think about this issue regularly, so
any ideas or alternative suggestions are very much appreciated.

You are right in noting that the "load time" is the big factor with a
small number of measurements, and that processing 1 pair is much much
more expensive than processing 100 or 1000. Performance related issues
like this have been discussed a bit on the WordNet Similarity mailing
list, although it sounds like you already understand the basics pretty
well.  Here's a sampling of those discussions....

http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00068.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00226.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00474.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00499.html

There are a few things to consider - lesk is among the slower of the
measures, because it does string overlap matching for each pair of
extended glosses that are constructed. These extended glosses can be
rather big (hundreds of words) so finding the string overlaps is
rather time consuming. Wu and Palmer (wup), on the other hand, is
fairly swift in that it relies on depth information (for each sense)
and we pre-compute those at run time (the depth of each concept is
found once and then re-used). So overall that's a fairly nice choice
in terms of speed. As you can see above, path is even quicker, because
it's only finding shortest paths between the senses.

It's often suggested that we should pre-compute the similarity and
relatedness values, and make them available as some sort of table or
database. That would certainly speed things up, but it's also a rather
enormous task in that we are talking about roughly a 200,000 x
200,000 relatedness matrix for lesk, vector, vector-pairs and hso.
Those measures can be computed between any parts of speech, so you
need to compute all the pairwise relatedness values for everything in
WordNet. Given that the relatedness measures are generally the slowest,
it would be a time consuming operation to build that matrix.

The similarity measures, on the other hand, are slightly more tractable
in that they are limited to doing noun-noun and verb-verb comparisons.
So that means we could pre-compute 140,000 x 140,000 matrices for the
nouns, and 25,000 x 25,000 for the verbs. In fact doing the verbs
isn't so bad, and if there's any interest I could surely run something
to generate those matrices. The noun similarity matrices and the
relatedness measures themselves would require a bit more patience. Note
the matrices are symmetric, so it's not *quite* as bad as it first
appears.
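
To put rough numbers on those matrix sizes, here is a small Python sketch. It counts the cells in an n x n matrix and the unique entries once symmetry is exploited (upper triangle plus diagonal), and gives a single-process time estimate for a steady rate. The function names are illustrative, and the rate is an input you would take from your own benchmark, not a property of any particular measure.

```python
def unique_pairs(n):
    """Entries in the upper triangle of an n x n symmetric matrix, diagonal included."""
    return n * (n + 1) // 2

def days_to_compute(n, pairs_per_second):
    """Rough single-process wall-clock estimate in days at a steady rate."""
    return unique_pairs(n) / pairs_per_second / 86_400

# Approximate sizes mentioned above.
sizes = {"verbs": 25_000, "nouns": 140_000, "all of WordNet": 200_000}
for label, n in sizes.items():
    print(f"{label}: {n * n:,} cells, {unique_pairs(n):,} unique pairs")
```

Symmetry roughly halves the work, and since every cell is independent the remaining pairs can be split across as many machines as are available.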

I should also say that a number of people have said they were going to
try and do this sort of pre-computation of values, but I can't recall
if anyone actually finished that and/or made it available. If so that
could be something to consider (and it might be nice for anyone who
has done that to remind us as it might be generally pretty useful).

The downside to pre-computing is that it seems to me there are many
options that can be used for some of the measures (all the relatedness
measures and the info content measures for sure) so a pre-computed set
of values would only reflect one particular way of using those
measures - so, one size may not fit all. For the path based similarity
measures (path, wup, lch) I don't think this is much of an issue as
there aren't so many options, and so in general everyone uses those in
the same way and a pre-computed table of similarity measures might be
very helpful.

Other suggestions have included - rewrite in Java, Python, C++, C,
Fortran, anything but Perl!! :) - I don't know if I see much advantage
to that in terms of speed - Perl uses a lot of memory and that can
cause performance to lag in some situations, but in general
WordNet::Similarity doesn't use much memory so reducing memory use
isn't much of a win (typically it's running with less than 200 MB of
RAM which doesn't seem onerous). That said, it is possible that lesk
could be sped up considerably with a faster string matching algorithm
(in Perl or C or something else), and it's also possible that there
might be a more clever way of encoding WordNet so that path searches
done in real time are much faster. So there are probably
areas of the WordNet::Similarity code that could benefit from an
efficient sub-routine in some other language, and we are very open to
including something like that in the package.

It is possible that you could run many many similarity/relatedness
measurements in parallel and pick up some speed that way. Each
measurement is independent of every other measurement, so it's a very
decomposable problem (embarrassingly parallel as some like to say) so
if you have the ability to divide up the work and collect the results
again that could have some possibilities. You might even consider
using our server mechanism (similarity_server.pl in the package) as a
way to handle multiple parallel queries - you could run the server and
then it can process up to as many children as you care to specify,
where each child would be computing a similarity pair.
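
The divide-and-collect idea can be sketched in a few lines of Python. This is only an illustration of the pattern, under stated assumptions: score_pair is a dummy placeholder rather than a real similarity measure, all names are made up, and a thread pool is used purely to keep the example self-contained. For CPU-bound scoring you would want separate processes or several server children instead.

```python
from concurrent.futures import ThreadPoolExecutor

def score_pair(pair):
    """Placeholder for a real similarity call; returns a dummy value."""
    a, b = pair
    return 1.0 if a == b else 0.0

def score_chunk(chunk):
    """Score one slice of the work list."""
    return [score_pair(p) for p in chunk]

def split(items, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def score_all(pairs, workers=4):
    """Score every pair in parallel chunks, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score_chunk, split(pairs, workers)))
    return [score for chunk in results for score in chunk]
```

Since no measurement depends on any other, the only bookkeeping needed is putting the chunk results back in their original order.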

Well, I hope this helps a little, and perhaps gives folks ideas about
what's happening. I do think it's important to calibrate things a bit,
and work out how many pairs a second you are able to get in your
current set up, and then set a specific target for what you need. If
you are getting 20 pairs a second and need 20,000, that's quite
different than if you need 100, and each expectation suggests
different sorts of solutions.

Cordially,
Ted


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

Nitin Madnani | 1 Feb 2011 19:20

Re: Faster tool for WordNet Similarity measures

Ted,

I agree that rewriting in Python is not likely to provide much of a
speedup. I have contributed to NLTK's (Python-based) WordNet similarity
implementation (which was inspired by WordNet::Similarity) and it
isn't that much faster than the Perl version.

I think that pre-computing is actually an interesting idea. It might
be possible to pre-compute the similarity matrices using a combination
of Hadoop and EC2.

I will concur that parallel-computation (either via a server or,
again, via Hadoop+EC2) is definitely a relatively easy solution. I
wasn't aware of similarity_server.pl but I have, in the past, used a
homegrown XML-RPC server to offset start-up cost for similarity
computation.

Nitin

-- 
Linguist, Desi Linguist
http://www.desilinguist.org

Mark Sammons | 1 Feb 2011 22:28

Re: Faster tool for WordNet Similarity measures

Hi, Suzan.

If what you are after is just a similarity score, you could
try the Cognitive Computation Group's WordNet-based similarity metric
(WNSim), written in C++ and -- anecdotally -- pretty fast.

It runs as an XML-RPC service, which imposes a certain network latency
overhead, *but* is language neutral, which is one appealing feature.  Within
our research group, many users call it and cache the response to reduce
processing time still further.  If you are coding in C++, then you could of course
call WNSim directly.
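
A minimal sketch of that call-and-cache pattern in Python follows. Everything in it is hypothetical: remote_similarity is a stand-in for whatever network call you make (it does not reflect the actual WNSim interface), and keying the cache on the sorted word pair assumes the metric is symmetric.

```python
from functools import lru_cache

def remote_similarity(word_a, word_b):
    """Stand-in for a network call to a similarity service (hypothetical)."""
    return 1.0 if word_a == word_b else 0.5

@lru_cache(maxsize=100_000)
def _cached(word_a, word_b):
    # Only reached on a cache miss, i.e. the first time a pair is seen.
    return remote_similarity(word_a, word_b)

def similarity(word_a, word_b):
    """Look up a score, caching on the sorted pair (assumes a symmetric metric)."""
    first, second = sorted((word_a, word_b))
    return _cached(first, second)
```

With this wrapper, similarity("car", "bus") and similarity("bus", "car") share one cache entry, so the second request never touches the network.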

We've used WNSim in a number of research projects, including our work on 
Recognizing Textual Entailment and on Distant Supervision.  Here's the page for 
the WNSim code:

http://cogcomptest.cs.illinois.edu/page/software_view/21

There's also a link to a technical report that describes the underlying methodology.

You can take a look at the output using the demo here:

http://cogcomp.cs.illinois.edu/demo/wnsim/

Regards,

Mark

Linguistic Data Consortium | 1 Feb 2011 22:39

New from LDC

New publications:

ACE 2005 English SpatialML Annotations Version 2

SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages


(1) ACE 2005 English SpatialML Annotations Version 2 was developed by researchers at The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus LDC2006T06. This second version eliminates a number of annotation inconsistencies and errors identified in ACE 2005 English SpatialML Annotations LDC2008T03. In addition, the SpatialML annotation schema has been updated from version 2.0 to version 3.0.1; the revised annotation guidelines are included in this release.

The ACE (Automatic Content Extraction) program focused on developing automatic content extraction technology to support automatic processing of human language in text form, specifically entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. It is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines.

SpatialML includes syntax for marking up PLACEs mentioned in text and for linking them to data from gazetteers and other databases. LINKs are used to express relations between places, and RLINKs to capture trajectories for relative locations. To the extent possible, SpatialML leverages ISO and other standards with the goal of making the scheme compatible with existing and future corpora. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such markup can be useful for disambiguation, integration with mapping services and spatial reasoning.

This corpus contains 210065 total words and 17821 unique words. Counts of unique words can be found in doc/ldc_wordcount.csv which includes all words that are not part of XML markup (e.g., without tag names, attribute names or values). Unique words are counted by comparing case insensitive transformations with preceding and trailing punctuation stripped off. "Words" consisting solely of punctuation are discarded.

The principal change in the annotation schema is that "PATH" has been generalized to "RLINK" for relative link. At the top level, there is now a version attribute on the root SpatialML tag to capture which version of SpatialML was used. A number of smaller changes have been made to the annotation specification; these are listed in Section 2 of the updated guidelines.


*

(2) SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages is a subset of OntoNotes Release 2.0 LDC2008T04 used in SemEval-2010 Task 1, Coreference Resolution in Multiple Languages. OntoNotes Release 2.0 consists of roughly 500,000 words of English broadcast and newswire data annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and compare automatic coreference resolution systems for six languages (Catalan, Dutch, English, German, Italian and Spanish) in four evaluation settings using four metrics. Further information about Task 1 can be found on the task description website.

The data is divided into three sets: the development set which contains 39 documents, 741 sentences and 17,044 tokens; the training set which contains 229 documents, 3,648 sentences and 79,060 tokens; and the test set  which contains 85 documents, 1,141 sentences and 24,206 tokens. The complete material for training systems is the sum of the development and training sets.

SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages is distributed via web download.

This data is available at no charge.  Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members.  The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address.

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: 1 (215) 573-1275
University of Pennsylvania     Fax: 1 (215) 573-2175
3600 Market St., Suite 810     ldc <at> ldc.upenn.edu
Philadelphia, PA 19104 USA     http://www.ldc.upenn.edu

_______________________________________________
Roberto Navigli | 2 Feb 2011 10:40

2 Ph.D. positions in NLP at Sapienza University of Rome

TWO PH.D. POSITIONS IN NATURAL LANGUAGE PROCESSING -- SAPIENZA UNIVERSITY OF
ROME (ITALY)

The Department of Computer Science of the Sapienza University of Rome 
invites applications for TWO PH.D. POSITIONS IN NATURAL LANGUAGE PROCESSING. 

The positions are part of a 5-year ERC Starting Grant on MULTILINGUAL
SEMANTIC PROCESSING funded by the European Research Council (ERC) and
headed by Prof. Roberto Navigli. The successful candidates will join 
a young and dynamic research team (including 2 faculty members, 
3 research fellows and 2 research associates) with solid experience 
and proven excellence in NLP research. Please visit the group's website:

http://lcl.uniroma1.it

The candidates are expected to perform cutting-edge research in one of
the following areas:

* MULTILINGUAL & DOMAIN-ORIENTED WORD SENSE DISAMBIGUATION & INDUCTION
* STATISTICAL MACHINE TRANSLATION FOR RESOURCE-RICH VS. 
  RESOURCE-POOR LANGUAGES
* SEMANTIC INFORMATION RETRIEVAL AND EXTRACTION

We are seeking qualified candidates with an excellent university degree
who are willing to research and develop state-of-the-art algorithms and 
approaches.

Applications are invited from suitably qualified candidates who have:

* A Master's degree in Computer Science, Computational Linguistics, 
  Mathematics, or a related discipline.
* Good communication skills in English. NOTE: knowledge of Italian is NOT 
  a requirement.
* Strong programming skills: Java and practical expertise with standard 
  NLP tools are required; knowledge of Perl and C++ is a plus.

A publication record in any of the above areas is a plus.

INFORMATION

* Application deadline: February 20, 2011
* Starting date: as soon as possible (successful candidates can start on a 
  paid internship before the official academic year -- i.e., before October
  2011)
* Duration: 3 years
* Salary: approx. 16800 euros per annum after taxes (around 1400 euros per 
  month)

Informal inquiries can be sent by email to Roberto Navigli 
(navigli <at> di.uniroma1.it).

HOW TO APPLY

Please send a detailed CV (possibly including a short research statement) 
and official transcripts of academic records to Roberto Navigli 
(navigli <at> di.uniroma1.it). Please include the position reference 
LCL-PHD-2011 in the subject line.

ABOUT LA SAPIENZA

The Sapienza University of Rome is a seven-century-old university in the 
heart of Rome. It is one of the largest universities in Europe, with around 
150,000 students. Its Faculty of Science (which includes the Department of 
Computer Science) has been ranked as the best school of science in Italy, 
7th in Europe, and 25th in the world according to the Times Higher World 
University Ranking 2009 for Science. Sapienza University is an equal 
opportunity employer.

ABOUT THE COMPUTER SCIENCE DEPARTMENT

The Department of Computer Science is a modern and well-equipped research
institution with a top-class faculty and a strong Ph.D. program. The 
Department comprises 42 faculty members, 15 postdocs and around 20 Ph.D. 
students. The successful candidate will be based in Rome, one of the most 
beautiful cities in the world. The Department is situated in a lively 
area with many cafes, bars and restaurants, within walking distance of 
the city centre. Prospective candidates need not be afraid of the language 
barrier, as Italians are in general very friendly.

_______________________________________________

Anne-Kathrin Schumann | 1 Feb 2011 12:49

lexical semantics and alchemy

Hi,

Ken wrote: "It would be nice if we could get some community-wide effort into this. We need a vehicle, perhaps
transforming Wiktionary. It would be nice if we could apply John's rules to Ted's compounds and *put those
findings into a dictionary* (lexicographers have only barely done so, while lexicologists need that
information). "

-- As far as time permits, I would be happy to contribute some German and Russian material (if needed). In my
view, comparing compounds in a Germanic and a Slavic language could yield interesting results. Russian
has multiple types of compounds with complex grammatical and semantic features and, as far as I know, all
of them are productive. There also might be some hope that comparison with Russian sheds light on
semantically opaque compounds in English because a big part of them will nicely translate into more
explicit multi-word units in Russian (at least in terminology - I'm not sure about the 'rubber duck'). But
this is just an idea, not a hypothesis. 
I also want to contribute some German examples to the rule vs. analogy discussion. "Vollzug" was a very nice
one because 'voll' is extremely productive on all levels of German discourse and across different POS
categories. In my view, there are two distinct senses to 'voll': one that expresses exhaustiveness or
completeness as in Vollmilch, Vollei, Vollkorn, Vollzug, Vollernte and older forms such as volljährig
etc. In colloquial (probably juvenile) speech, however, 'voll' seems to be more of an intensifier as in
Volltrottel, Vollpfosten, Vollidiot or even in phrases such as 'Das ist ja voll super'. 
I don't see a rule here. Certainly people could also say 'Komplettzug' or 'Supertrottel' and they actually
do (maybe not with these words, but I think the principle is clear), but the impact of these ad hoc forms
cannot be compared to relatively stable units with 'voll'. I think this really is a case of analogy. There
also might be metaphorical compounds (didn't we have 'rubber chicken'?).
Regards,
anne

_______________________________________________

