Looking for parallel corpora of Asian languages
2011-02-01 06:06:14 GMT
--
Cheers,
Vu
_______________________________________________ Corpora mailing list Corpora <at> uib.no http://mailman.uib.no/listinfo/corpora
Faster tool for WordNet Similarity measures
2011-02-01 09:25:23 GMT
Hi all,

I have previously used Pedersen's WordNet Similarity module ( http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm ) for calculating the similarity or relatedness between pairs of words. I have now started using it again, but I noticed that it is far too slow for a real-time application (which is what I need now).

I originally wrote a simple Perl script that calls the module (shown below), but it takes almost five seconds to run. Almost all of this time is spent on calling the module, so for batch scripts it is fine (the module is then only called once for multiple requests), but I need it to work in real time in a retrieval experiment, and there 5 seconds is too long.

Does anyone know an alternative (fast!) tool for calculating similarity and/or relatedness between two words? It could use either a Wu & Palmer-like measure or a Lesk-type measure.

Thanks!
Suzan Verberne

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WordNet::QueryData;
    use WordNet::Similarity::path;

    # Create the WordNet interface and the path measure once.
    my $wn = WordNet::QueryData->new;
    my $measure = WordNet::Similarity::path->new($wn);

    # Word senses are written as word#pos#sense.
    my $value = $measure->getRelatedness("car#n#1", "bus#n#2");
    print "car (sense 1) <-> bus (sense 2) = $value\n";

--
Suzan Verberne, postdoctoral researcher
Centre for Language and Speech Technology
Radboud University Nijmegen
Tel: +31 24 3611134
Email: s.verberne <at> let.ru.nl
http://lands.let.ru.nl/~sverbern/
Dear Colleagues,

Learner Language, Learner Corpora (LLLC 2012), an international conference addressing learner language and language corpora, will take place on October 5-6, 2012 at the University of Oulu, Finland. The conference is aimed at those who use electronic data in learner language analyses, as well as those who study Finno-Ugric learner languages, their acquisition and teaching.

The conference is organised to celebrate the 15th anniversary of the VIRSU project and to introduce the CoLLU project (Corpus study on Learner Language Universals). VIRSU was originally a cooperative project for teachers and researchers of Estonian and Finnish as a foreign language. It has recently expanded its aims and now seeks to create connections between all linguists working with any of the Finno-Ugric target (foreign or second) languages, or with the language situations in the countries where Finno-Ugric languages are spoken. The CoLLU project, in turn, focuses on identifying Learner Language Universals that are present in a learner language regardless of the learner's mother tongue. The project hosts the ICLFI Corpus, a multi-mother-tongue corpus of learner Finnish that is almost five years old.

The invited keynote speakers are:
- Professor Fanny Meunier (Centre for English Corpus Linguistics, Université Catholique de Louvain)
- Professor Charlotte Gooskens (Scandinavian Department, University of Groningen)
- Postdoctoral Researcher Minna Suni (Academy of Finland, University of Jyväskylä)

The languages of the conference will be Finnish and English.

Important dates:
- August 31st 2011: First call for papers
- December 31st 2011: Abstract submission
- March 31st 2012: Notification of acceptance
- October 5-6 2012: LLLC 2012 Conference

Further information on the conference and the venue will be given in the first call for papers.

We are looking forward to seeing you in Oulu in 2012!

Pirkko Muikku-Werner
Professor
School of Humanities (Joensuu)
University of Eastern Finland
pirkko.muikku-werner <at> uef.fi

Jarmo Harri Jantunen
Adjunct Professor
Faculty of Humanities
University of Oulu
jarmo.jantunen <at> oulu.fi
Dear Colleagues,

I'm pleased to announce the new release of CORPS, a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER. The new release contains more than 3600 speeches (all by native English speakers), about 7.9 million words, and more than 67 thousand tags marking audience reactions.

The corpus can be usefully employed in many fields, such as:
- qualitative analysis of political communication
- NLP-based persuasive expression mining
- automatic production of persuasive communication

The corpus is freely available for research purposes. For further details please see: http://hlt.fbk.eu/corps

Best Regards
Marco Guerini
Hi Suzan,

If you locate such an alternative I'm absolutely dying to know. :)

It would also be helpful to know how fast would be considered *fast*, or fast enough. I did a quick little benchmark using 10,000 randomly generated noun pairs (using our randomPairs.pl program from http://www.d.umn.edu/~tpederse/wordnet.html ). I used a rather elderly machine with a 2.0 GHz Xeon processor, and ran Wu and Palmer (wup), Adapted Lesk (lesk) and Shortest Path (path) on those 10,000 pairs.

- path finished in 153 seconds, so about 65 pairs a second
- wup finished in 268 seconds, so about 37 pairs a second
- lesk finished in 434 seconds, so about 27 pairs a second

Now this isn't blazing fast, but on the other hand it doesn't strike me as unreasonably slow either. What would be your target for pairs per second? That could help figure out an appropriate solution. Also, if you are finding that it takes considerably longer than the times I report above, please do let me know, as there might be some other issue that is slowing you down.

For those who are truly interested in WordNet::Similarity performance, here are a few more thoughts. I think about this issue regularly, so any ideas or alternative suggestions are very much appreciated.

You are right in noting that the "load time" is the big factor with a small number of measurements, and that processing 1 pair is much, much more expensive per pair than processing 100 or 1000. Performance-related issues like this have been discussed a bit on the WordNet Similarity mailing list, although it sounds like you already understand the basics pretty well. Here's a sampling of those discussions:

http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00068.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00226.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00474.html
http://www.mail-archive.com/wn-similarity <at> yahoogroups.com/msg00499.html

There are a few things to consider. lesk is among the slower of the measures, because it does string overlap matching for each pair of extended glosses that are constructed. These extended glosses can be rather big (hundreds of words), so finding the string overlaps is rather time consuming. Wu and Palmer (wup), on the other hand, is fairly swift in that it relies on depth information (for each sense), and we pre-compute that at run time (the depth of each concept is found once and then re-used). So overall that's a fairly nice choice in terms of speed. As you can see above, path is even quicker, because it's only finding shortest paths between the senses.

It's often suggested that we should pre-compute the similarity and relatedness values, and make them available as some sort of table or database. That would certainly speed things up, but it's also a rather enormous task, in that we are talking about roughly a 200,000 x 200,000 relatedness matrix for lesk, vector, vector-pairs and hso. Those measures can be computed between any parts of speech, so you need to compute all the pairwise relatedness values for everything in WordNet. Given that the relatedness measures are generally the slowest, it would be a time-consuming operation to build that matrix.

The similarity measures, on the other hand, are slightly more tractable in that they are limited to noun-noun and verb-verb comparisons. So that means we could pre-compute 140,000 x 140,000 matrices for the nouns, and 25,000 x 25,000 for the verbs. In fact doing the verbs isn't so bad, and if there's any interest I could surely run something to generate those matrices. The noun similarity matrices and the relatedness measures themselves would require a bit more patience. Note that the matrices are symmetric, so it's not *quite* as bad as it first appears.

I should also say that a number of people have said they were going to try this sort of pre-computation of values, but I can't recall if anyone actually finished it and/or made it available. If so, that could be something to consider (and it might be nice for anyone who has done that to remind us, as it might be generally pretty useful).

The downside to pre-computing is that there are many options that can be used for some of the measures (all the relatedness measures and the info content measures for sure), so a pre-computed set of values would only reflect one particular way of using those measures. One size may not fit all. For the path-based similarity measures (path, wup, lch) I don't think this is much of an issue, as there aren't so many options; in general everyone uses those in the same way, and a pre-computed table of similarity values might be very helpful.

Other suggestions have included rewriting in Java, Python, C++, C, Fortran, anything but Perl!! :) I don't know if I see much advantage to that in terms of speed. Perl uses a lot of memory, and that can cause performance to lag in some situations, but in general WordNet::Similarity doesn't use much memory, so reducing memory use isn't much of a win (typically it's running with less than 200 MB of RAM, which doesn't seem onerous). That said, it is possible that lesk could be sped up considerably with a faster string matching algorithm (in Perl or C or something else), and it's also possible that there might be a more clever way of encoding WordNet so that the path searches done in real time are much faster. So there are probably areas of the WordNet::Similarity code that could benefit from an efficient sub-routine in some other language, and we are very open to including something like that in the package.

It is possible that you could run many, many similarity/relatedness measurements in parallel and pick up some speed that way. Each measurement is independent of every other measurement, so it's a very decomposable problem (embarrassingly parallel, as some like to say). If you have the ability to divide up the work and collect the results again, that could have some possibilities. You might even consider using our server mechanism (similarity_server.pl in the package) as a way to handle multiple parallel queries: you could run the server, and it can then process up to as many children as you care to specify, where each child would be computing a similarity pair.

Well, I hope this helps a little, and perhaps gives folks ideas about what's happening. I do think it's important to calibrate things a bit: work out how many pairs a second you are able to get in your current set-up, and then set a specific target for what you need. If you are getting 20 pairs a second and need 20,000, that's quite different than if you need 100, and each expectation suggests different sorts of solutions.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
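The benchmark figures and matrix sizes in Ted's message lend themselves to a quick back-of-the-envelope check. The Python sketch below is not part of WordNet::Similarity; it only redoes the arithmetic, assuming the measured rates stay constant, that load time is negligible over long runs, and that each symmetric matrix needs only its upper triangle (diagonal included). Which measure is paired with which matrix is an illustrative assumption. Note that 434 seconds for 10,000 pairs works out to roughly 23 pairs a second rather than the 27 quoted for lesk, so one of those two figures may be a slip.

```python
# Back-of-the-envelope check of the benchmark and pre-computation figures.
# All inputs are numbers quoted in the message; nothing here is measured.

PAIRS = 10_000
SECONDS = {"path": 153, "wup": 268, "lesk": 434}
SECONDS_PER_DAY = 86_400


def pairs_per_second(pairs, seconds):
    """Throughput of one benchmark run."""
    return pairs / seconds


def unordered_pairs(n):
    """Cells in the upper triangle of a symmetric n x n matrix, diagonal included."""
    return n * (n + 1) // 2


def days_to_fill(n, rate):
    """Single-process days needed to fill the triangle at `rate` pairs per second."""
    return unordered_pairs(n) / rate / SECONDS_PER_DAY


rates = {name: pairs_per_second(PAIRS, secs) for name, secs in SECONDS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} pairs/s")

# (label, matrix dimension, measure used for the estimate): pairing is an assumption.
scenarios = [
    ("verb similarity", 25_000, "wup"),
    ("noun similarity", 140_000, "wup"),
    ("full relatedness", 200_000, "lesk"),
]
for label, n, measure in scenarios:
    days = days_to_fill(n, rates[measure])
    print(f"{label}: {unordered_pairs(n):,} pairs, about {days:,.0f} days with {measure}")
```

Under these assumptions the verb matrix alone is on the order of a hundred days of single-process computation at the wup rate, which is where splitting the independent measurements across parallel workers becomes attractive.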
Ted,

I agree that rewriting in Python is not likely to provide much of a speed-up. I have contributed to NLTK's (Python-based) WordNet similarity implementation, which was inspired by WordNet::Similarity, and it isn't that much faster than the Perl version.

I think that pre-computing is actually an interesting idea. It might be possible to pre-compute the similarity matrices using a combination of Hadoop and EC2.

I concur that parallel computation (either via a server or, again, via Hadoop+EC2) is definitely a relatively easy solution. I wasn't aware of similarity_server.pl, but I have, in the past, used a homegrown XML-RPC server to offset the start-up cost of similarity computation.

Nitin

--
Linguist, Desi Linguist
http://www.desilinguist.org
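The idea of a long-running server that pays the start-up cost once and then answers many queries can be sketched with Python's standard xmlrpc modules. This is neither Nitin's server nor similarity_server.pl: the SimilarityService class, its made-up depth table and the toy scoring formula are hypothetical stand-ins for a real WordNet measure.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class SimilarityService:
    """Loads an expensive resource once, then answers many cheap queries."""

    def __init__(self):
        # Stand-in for the slow start-up step; the depth values are made up.
        self.depth = {"car#n#1": 11, "bus#n#1": 11, "dog#n#1": 13}

    def similarity(self, a, b):
        # Toy score, NOT a real WordNet measure: 1 / (1 + depth difference).
        return 1.0 / (1 + abs(self.depth[a] - self.depth[b]))


def start_server(host="127.0.0.1", port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(SimilarityService())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_server()
host, port = server.server_address

# Any XML-RPC client, in any language, could issue this call.
with ServerProxy(f"http://{host}:{port}") as client:
    score = client.similarity("car#n#1", "dog#n#1")
print(score)

server.shutdown()
server.server_close()
```

The service object is constructed once when the server starts, so each later request pays only for the network round trip and the lookup itself.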
Hi, Suzan.

If what you are after is just a similarity score, you could try the Cognitive Computation Group's WordNet-based similarity metric, WNSim, which is written in C++ and is, anecdotally, pretty fast. It runs as an XML-RPC service, which imposes a certain network latency overhead *but* is language neutral, which is one appealing feature. Within our research group, many users call it and cache the response to reduce processing time still further. If you are coding in C++, then you could of course call WNSim directly.

We've used WNSim in a number of research projects, including our work on Recognizing Textual Entailment and on Distant Supervision.

Here's the page for the WNSim code:

http://cogcomptest.cs.illinois.edu/page/software_view/21

There's also a link to a technical report that describes the underlying methodology. You can take a look at the output using the demo here:

http://cogcomp.cs.illinois.edu/demo/wnsim/

Regards,
Mark
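Caching responses from a similarity service, as Mark describes, can be sketched as a small wrapper around whatever function performs the slow call. The sketch assumes the score is symmetric, so each unordered pair is stored once; CachedSimilarity and toy_remote_score are hypothetical names, and the letter-overlap score merely stands in for a real remote call to a service such as WNSim.

```python
class CachedSimilarity:
    """Memoises a symmetric pairwise score so repeated queries skip the slow call."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying function was called

    def __call__(self, a, b):
        # sim(a, b) == sim(b, a) is assumed, so store each pair under one ordered key.
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(*key)
        return self._cache[key]


def toy_remote_score(a, b):
    """Hypothetical stand-in for a remote call: Jaccard overlap of letter sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


similarity = CachedSimilarity(toy_remote_score)
first = similarity("car", "cart")
second = similarity("cart", "car")  # same unordered pair, served from the cache
print(first, second, similarity.misses)
```

For a measure that is not symmetric, the key would have to keep the argument order instead.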
- ACE 2005 English SpatialML Annotations Version 2
- SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
(1) ACE 2005 English SpatialML Annotations Version 2 was developed by researchers at The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus LDC2006T06. This second version eliminates a number of annotation inconsistencies and errors identified in ACE 2005 English SpatialML Annotations LDC2008T03. In addition, the SpatialML annotation schema has been updated from version 2.0 to version 3.0.1; the revised annotation guidelines are included in this release.
The ACE (Automatic Content Extraction) program focused on developing automatic content extraction technology to support automatic processing of human language in text form, specifically entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. It is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines.
SpatialML includes syntax for marking up PLACEs mentioned in text and for linking them to data from gazetteers and other databases. LINKs are used to express relations between places, and RLINKs to capture trajectories for relative locations. To the extent possible, SpatialML leverages ISO and other standards with the goal of making the scheme compatible with existing and future corpora. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such markup can be useful for disambiguation, integration with mapping services and spatial reasoning.
This corpus contains 210,065 total words and 17,821 unique words. Counts of unique words can be found in doc/ldc_wordcount.csv, which includes all words that are not part of XML markup (e.g., without tag names, attribute names or values). Unique words are counted by comparing case-insensitive forms with leading and trailing punctuation stripped off. "Words" consisting solely of punctuation are discarded.
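The counting rule described above can be sketched in a few lines of Python. This is not the script used to produce doc/ldc_wordcount.csv; it assumes whitespace tokenization, ASCII punctuation only, text from which the XML markup has already been removed, and that punctuation-only tokens are excluded from the total as well as from the unique count.

```python
import string


def normalize(token):
    """Strip leading and trailing punctuation, then fold case."""
    return token.strip(string.punctuation).lower()


def count_words(text):
    """Return (total words, unique words) under the rule sketched above."""
    total = 0
    unique = set()
    for token in text.split():
        word = normalize(token)
        if not word:  # the token was punctuation only, so discard it
            continue
        total += 1
        unique.add(word)
    return total, len(unique)


sample = 'The cat, the "Cat" -- sat.'
print(count_words(sample))
```

In the sample, "The" and "the" collapse into one entry, as do "cat," and the quoted "Cat", while the bare "--" is dropped.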
The principal change in the annotation schema is that "PATH" has been generalized to "RLINK" for relative link. At the top level, there is now a version attribute on the root SpatialML tag to capture which version of SpatialML was used. A number of smaller changes have been made to the annotation specification; these are listed in Section 2 of the updated guidelines.
*
(2) SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages is a subset of OntoNotes Release 2.0 LDC2008T04 used in SemEval-2010 Task 1, Coreference Resolution in Multiple Languages. OntoNotes Release 2.0 consists of roughly 500,000 words of English broadcast and newswire data annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and compare automatic coreference resolution systems for six languages (Catalan, Dutch, English, German, Italian and Spanish) in four evaluation settings using four metrics. Further information about Task 1 can be found on the task description website.
The data is divided into three sets: the development set which contains 39 documents, 741 sentences and 17,044 tokens; the training set which contains 229 documents, 3,648 sentences and 79,060 tokens; and the test set which contains 85 documents, 1,141 sentences and 24,206 tokens. The complete material for training systems is the sum of the development and training sets.
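The set sizes quoted above can be summed to see what "the sum of the development and training sets" amounts to, and to compare the overall token count with the roughly 120,000 words mentioned for this release. The sketch is plain arithmetic on the announced figures; the Split class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    documents: int
    sentences: int
    tokens: int

    def __add__(self, other):
        # Element-wise sum of two splits.
        return Split(
            self.documents + other.documents,
            self.sentences + other.sentences,
            self.tokens + other.tokens,
        )


development = Split(documents=39, sentences=741, tokens=17_044)
training = Split(documents=229, sentences=3_648, tokens=79_060)
test_set = Split(documents=85, sentences=1_141, tokens=24_206)

full_training = development + training
everything = full_training + test_set
print(full_training)
print(everything)
```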
TWO PH.D. POSITIONS IN NATURAL LANGUAGE PROCESSING -- SAPIENZA UNIVERSITY OF ROME (ITALY) The Department of Computer Science of the Sapienza University of Rome invites applications for TWO PH.D. POSITIONS IN NATURAL LANGUAGE PROCESSING. The positions are part of a 5-year ERC Starting Grant on MULTILINGUAL SEMANTIC PROCESSING funded by the European Research Council (ERC) and headed by Prof. Roberto Navigli. The successful candidates will join a young and dynamic research team (including 2 faculty members, 3 research fellows and 2 research associates) with solid experience and proven excellence in NLP research. Please visit the group's website: http://lcl.uniroma1.it The candidates are expected to perform cutting-edge research in one of the following areas: * MULTILINGUAL & DOMAIN-ORIENTED WORD SENSE DISAMBIGUATION & INDUCTION * STATISTICAL MACHINE TRANSLATION FOR RESOURCE-RICH VS. RESOURCE-POOR LANGUAGES * SEMANTIC INFORMATION RETRIEVAL AND EXTRACTION We are seeking for qualified candidates with an excellent university degree who are willing to research and develop state-of-the-art algorithms and approaches. Applications are invited from suitably qualified candidates who have: * A Master's degree in Computer Science, Computational Linguistics, Mathematics, or a related discipline. * Good communication skills in English. NOTE: knowledge of Italian is NOT a requirement. * Strong programming skills: Java and practical expertise with standard NLP tools are required, knowledge of Perl and C++ are a plus. Publication record in any of the above areas is a plus. INFORMATION * Application deadline: February 20, 2011 * Starting date: as soon as possible (successful candidates can start on a paid internship before the official academic year -- i.e., before October 2011) * Duration: 3 years * Salary: approx. 16800 euros per annum after taxes (around 1400 euros per month) Informal inquiries can be sent by email to Roberto Navigli (navigli <at> di.uniroma1.it). 
HOW TO APPLY Please send a detailed CV (possibly including a short research statement) and official transcripts of academic records to Roberto Navigli (navigli <at> di.uniroma1.it). Please include the position reference LCL-PHD-2011 in the subject line. ABOUT LA SAPIENZA The Sapienza University of Rome is a seven-century-old university in the heart of Rome. It is one of the largest universities in Europe, with around 150,000 students. Its Faculty of Science (which includes the Department of Computer Science) has been ranked as the best school of science in Italy, 7th in Europe, and 25th in the world according to the Times Higher World University Ranking 2009 for Science. Sapienza University is an equal opportunity employer. ABOUT THE COMPUTER SCIENCE DEPARTMENT The Department of Computer Science is a modern and well-equipped research institution with a top-class faculty and a strong Ph.D. program. The Department comprises 42 faculty members, 15 postdocs and around 20 Ph.D. students. The successful candidates will be based in Rome, one of the most beautiful cities in the world. The Department is situated in a nice, lively area with lots of cafes, bars and restaurants, within walking distance of the city centre. Prospective candidates need not be afraid of the language barrier, as Italians are in general very friendly.
Hi, Ken wrote: "It would be nice if we could get some community-wide effort into this. We need a vehicle, perhaps transforming Wiktionary. It would be nice if we could apply John's rules to Ted's compounds and *put those findings into a dictionary* (lexicographers have only barely done so, while lexicologists need that information)." -- As far as time allows, I would be happy to contribute some German and Russian material (if needed?). In my view, comparing compounds in a Germanic and a Slavic language could yield interesting results. Russian has multiple types of compounds with complex grammatical and semantic features and, as far as I know, all of them are productive. There might also be some hope that comparison with Russian sheds light on semantically opaque compounds in English, because many of them translate nicely into more explicit multi-word units in Russian (at least in terminology - I'm not sure about the 'rubber duck'). But this is just an idea, not a hypothesis. I also want to contribute some German examples to the rule vs. analogy discussion. "Vollzug" was a very nice one because 'voll' is extremely productive on all levels of German discourse and across different POS categories. In my view, there are two distinct senses of 'voll': one expresses exhaustiveness or completeness, as in Vollmilch, Vollei, Vollkorn, Vollzug, Vollernte and older forms such as volljährig etc. In colloquial (probably juvenile) speech, however, 'voll' seems to be more of an intensifier, as in Volltrottel, Vollpfosten, Vollidiot or even in phrases such as 'Das ist ja voll super'. I don't see a rule here. Certainly people could also say 'Komplettzug' or 'Supertrottel' and they actually do (maybe not with these words, but I think the principle is clear), but the impact of these ad hoc forms cannot be compared to that of relatively stable units with 'voll'. I think this really is a case of analogy. There also might be metaphorical compounds (didn't we have 'rubber chicken'?). 
Regards, anne