Čulo, Oliver | 1 Jun 2011 12:07

Deadline extended to June 15th for GSCL 2011-Workshop "CL-TS-MT -- what can we learn from each other?"

Dear all,

We have moved the deadline for our pre-GSCL workshop "Contrastive
Linguistics - Translation Studies - Machine Translation -- what can we
learn from each other?" in Hamburg to ***June 15th***.

Rather than re-posting the whole workshop description, I would like to
point you to the workshop website:

http://www.corpora.uni-hamburg.de/gscl2011/?Workshops:Contrastive_Linguistics_%E2%80%93_Translation_Studies_%E2%80%93_Machine_Translation%26nbsp%3B%E2%80%93%0Awhat_can_we_learn_from_each_other%3F

Best,
Oliver

------

Oliver Čulo
FTSK, Englische Sprach- und Übersetzungswissenschaft
Johannes Gutenberg-Universität Mainz
An der Hochschule 2
76726 Germersheim

culo <at> uni-mainz.de
http://www.staff.uni-mainz.de/culo

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list

António Branco | 1 Jun 2011 12:17

Workshop on Treebanks (TLT10), 2nd CfP


[apologies for multiple copies]

===    2nd Call for Papers    ===

TLT10 - The 10th International Workshop on Treebanks
and Linguistic Theories

Heidelberg University, Germany
6-7 January 2012

http://tlt10.cl.uni-heidelberg.de

TLT serves as a venue for new and ongoing high-quality work related
to treebanking, encompassing descriptive, theoretical, formal and
computational aspects of treebanks. The next edition will take place
in Heidelberg, Germany, on 6-7 January 2012, and is hosted by
the Institute for Computational Linguistics at Heidelberg University.

Submissions are invited for papers, posters, and demonstrations
which present research on treebanks and their intersection
with linguistics, natural language processing, and other related fields.

Motivation and Aims

Treebanks are language resources that provide annotations at various
levels of linguistic structure beyond the word level. They typically
provide syntactic constituent or dependency structures for sentences,
and often extend to functional and predicate-argument structure.
Treebanks have become crucially important for the development of

Xu Jiajin | 1 Jun 2011 16:28

WORD ALIGNMENT: Does it exist?

Hi all,

The other day, during an academic discussion, my colleague and I had a brief debate on WORD ALIGNMENT while we were talking about bilingual text alignment practices. When I was reviewing different levels of alignment, such as sentence alignment and word alignment, I commented that WORD ALIGNMENT IS A JOKE, as it is never likely to actually align words. My comment was immediately refuted by another professor, but I have not been convinced by his counterargument so far.

In my mind, word alignment is not realistic, since it is impossible to find a one-to-one correspondence between parallel texts at the word level. For instance, it is very likely that some words in a sentence are not translated, being either left implicit or assimilated into other words, constructions, idioms and so forth. I reckon this is also the case for parallel texts in cognate languages.

But on second thought, the alignment of a selection of words, say, lexical words or technical terms, across texts is not impossible. However, linguistic alignment, as I see it, has to be exhaustive. In saying so, I actually consider sentence alignment to be the canonical type of text alignment: each sentence is aligned to one or more sentences in the target text, and the other way round.

I am wondering whether there ARE word alignment implementations in practice. I would appreciate any pointers to relevant literature or tools, as well as clarification of the notion of alignment.

Perhaps, unbeknownst to me, word alignment has been a mature technology for many years. Could anyone tell me what the main uses of word alignment are? Bilingual lexicons? Any other applications?

Thanks in advance.

Cheers,

Jiajin XU
Ph.D., associate professor (discourse studies, corpus linguistics)
National Research Centre for Foreign Language Education
Beijing Foreign Studies University
Beijing 100089
China

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora

Alberto Simões | 1 Jun 2011 16:45

Re: WORD ALIGNMENT: Does it exist?

Hello

I would argue that word alignment is term alignment, possibly alignment
of terms with blanks or placeholders, and possibly alignment to empty
words (just as we have sentence alignment to empty sentences).

I agree that 100% word alignment might be difficult or even
impossible (for compound verbs, for instance).

But calling it a joke can be offensive to people working in that area
and getting some interesting results (I am one of them >:))
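
To make the notion of "alignment to empty words" concrete, here is a minimal sketch of one way to store word-level links so that many-to-one and empty alignments are possible. The sentence pair, the link set and the helper names are invented for illustration and are not taken from any particular tool.

```python
from collections import defaultdict

# Toy English-Portuguese pair, invented for illustration.
source = ["I", "gave", "up"]
target = ["Desisti"]

# Each link is (source_index, target_index); None stands for the empty word.
# Here the pronoun is left unaligned and the phrasal verb maps to one token.
links = [(0, None), (1, 0), (2, 0)]

def group_by_target(links, source, target):
    """Collect the source words attached to each target word (or to None)."""
    groups = defaultdict(list)
    for s, t in links:
        key = target[t] if t is not None else None
        groups[key].append(source[s] if s is not None else None)
    return dict(groups)

def is_exhaustive(links, n_source, n_target):
    """True if every token on both sides takes part in at least one link."""
    covered_s = {s for s, _ in links if s is not None}
    covered_t = {t for _, t in links if t is not None}
    return covered_s == set(range(n_source)) and covered_t == set(range(n_target))

aligned = group_by_target(links, source, target)
exhaustive = is_exhaustive(links, len(source), len(target))
```

With links of this kind an alignment can cover every token on both sides without being one-to-one, which is the distinction at issue in this thread.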

Cheers
ambs

-- 
Alberto Simoes
CCTC-UM / CEHUM

Afsaneh Fazly | 1 Jun 2011 16:58

Re: WORD ALIGNMENT: Does it exist?

Look at these two (among many others):

The Mathematics of Statistical Machine Translation: Parameter Estimation.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra,
Robert L. Mercer
Computational Linguistics, 1993.

The alignment template approach to statistical machine translation.
Franz Josef Och and Hermann Ney.
Computational Linguistics, 30:417–449. 2004.
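
As a rough illustration of what a statistical word aligner does, the following is a minimal sketch in the spirit of IBM Model 1, the simplest of the models described in the first paper above. It is not the authors' implementation: the toy corpus is invented, the empty (null) word is omitted for brevity, and the function names are assumptions made for this example.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate lexical probabilities t(f|e) with EM on tokenised sentence pairs."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts per target word e
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normaliser for this source word
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: renormalise per target word
            t[(f, e)] = c / total[e]
    return t

def align(fs, es, t):
    """Link each source word to its most probable target word."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

# Toy parallel corpus, invented for illustration.
pairs = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(pairs)
house_links = align(["das", "Haus"], ["the", "house"], t)
book_links = align(["ein", "Buch"], ["a", "book"], t)
```

Even on three sentence pairs the co-occurrence statistics are enough to separate "das" from "Haus". Practical systems add an empty word, distortion and fertility models, and symmetrisation on top of this.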

Adam Przepiorkowski | 1 Jun 2011 16:55

Re: question about storage of corpora

Dear Tine,

As already advocated by various people, and also mentioned by Adam
Radziszewski, we are using TEI P5 for the National Corpus of Polish
(http://nkjp.pl/).  Of course, that's a huge collection of very specific
recommendations from which you pick up only those that you need, so each
TEI P5 schema will be different.  Ours is documented here:

http://nlp.ipipan.waw.pl/TEI4NKJP/

and it takes care of metadata, text structure, segmentation into
sentences and word-like tokens, morphosyntactic and syntactic levels,
Named Entities and some Word Sense Disambiguation.  In terms of
gigabytes, this format takes a lot of space[*], but disk space is cheap
these days and XML files compress very well, so it hasn't been too much
of a problem so far.

Now, we haven't tried any native XML databases because of our previous
experiences, but Damir Ćavar mentioned in this thread that things have
changed recently.  Still, I would be interested whether anybody is using
native XML databases for corpora which are 1) in the range of *billions*
of tokens (ours currently has about 1 450 000 000 tokens) and 2)
linguistically annotated at least at the morphosyntactic level.

What we do instead in the National Corpus of Polish is compile the XML
files into a purpose-designed binary format used by our search engine,
Poliqarp (http://poliqarp.sourceforge.net/), and – independently –
convert them to a relational database.[**]  Admittedly, this compilation and
conversion takes time (in the range of days), which is a nuisance, but
not a major obstacle, as this is something that needs to be done only
occasionally.
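
For readers unfamiliar with this kind of conversion, here is a minimal sketch of loading token-level annotation from XML into a relational table. It is not the NKJP pipeline: the XML fragment is a heavily simplified, hypothetical stand-in for a TEI-style file, and the table layout, element names and attribute names are assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; a real TEI P5 corpus schema is far richer.
SAMPLE = """<text>
  <s id="s1">
    <w lemma="kot" pos="subst">kot</w>
    <w lemma="spać" pos="fin">śpi</w>
  </s>
</text>"""

def load_tokens(xml_string, conn):
    """Insert one row per token and index the lemma column for lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token "
        "(sent_id TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
    )
    root = ET.fromstring(xml_string)
    for sent in root.iter("s"):
        for i, w in enumerate(sent.iter("w")):
            conn.execute(
                "INSERT INTO token VALUES (?, ?, ?, ?, ?)",
                (sent.get("id"), i, w.text, w.get("lemma"), w.get("pos")),
            )
    conn.execute("CREATE INDEX IF NOT EXISTS token_lemma ON token (lemma)")
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
load_tokens(SAMPLE, conn)
rows = conn.execute("SELECT form FROM token WHERE lemma = ?", ("spać",)).fetchall()
```

At corpus scale the inserts would be batched and the index built once after loading, which is part of why such a conversion can take a long time.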

All best,

Adam

*  Specifically, around 240 GB for almost 1.5 billion words
   morphosyntactically annotated (with info about all possible
   interpretations for all tokens, about which one is selected in the
   context and about the tool performing the morphosyntactic annotation,
   and with additional marking of text structure, segmentation at
   various levels and rich metadata).

** The two search engines, with quite different functionalities, are
   employed here: http://nkjp.pl/index.php?page=6&lang=1.
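
The figures in the first footnote can be turned into a per-token cost with a line of arithmetic. The sketch below assumes that "GB" means 10^9 bytes and uses the token count given in the message body.

```python
# Figures from the message above; decimal gigabytes are an assumption.
corpus_bytes = 240 * 10**9
tokens = 1_450_000_000

# Average storage per token for the fully annotated XML.
bytes_per_token = corpus_bytes / tokens
```

That comes to roughly 165 bytes per annotated token before compression.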

Tine Lassen <tine.lassen <at> tdcadsl.dk>:

> Hi,
>
> I am in the process of compiling a series of domain corpora, and once the
> present text gathering phase is completed, of course I need to store the
> texts somehow. The texts need to be annotated with e.g. parts of speech and
> possibly phrase boundaries for term extraction purposes.
>
> My questions are: Would it be wiser to store the texts as XML or in a
> relational database format? Does a generally accepted corpus annotation
> XML schema exist? And do tools for annotation of and search in such files
> exist?
> How do you store your corpora?
>
> Any thoughts or ideas regarding the questions are very welcome :)
>
> Best,
> Tine Lassen
> Copenhagen Business School

-- 
Adam Przepiórkowski                          ˈadam ˌpʃɛpjurˈkɔfskʲi
http://clip.ipipan.waw.pl/ ____ Computational Linguistics in Poland
http://nlp.ipipan.waw.pl/ ____________ Linguistic Engineering Group
http://nkjp.pl/ _________________________ National Corpus of Polish

Damir Cavar | 1 Jun 2011 17:49

Re: question about storage of corpora

Sorry Tine, and the others,

Here are just a few comments on some of the recent arguments related to your posting and to some of the replies:

Space and XML as storage:
Well, I just bought a fast 2 TB SATA disk for around 100 Euro for private use, to extend my existing
1.5 TB and 1 TB disks and to back them up. A DB also takes space, and it is not true that the space is
reduced to almost nothing, just to maybe one Xth (is it 1/3 or 1/4 in your cases?). 240 GB or 40 GB, I
don't see the need to put time and effort into mapping the XML to a RelDB just to save some space.

What I mentioned about the CLC: it is raw TEI P5 XML, and I do not need to store it in a DB to get good
performance, as the online interface at http://riznica.ihjj.hr/ shows. Choose any of the subcorpora; the
"complete" one should be over 100 mil. Rendering the results usually involves an XSLT call on the raw
XML data to create the HTML view. The documents are raw TEI P5 XML files on the server, and the
rendering to HTML is done with every request, without our server contaminating Zagreb with smoke.
You'll probably wait for the connection, not for the server to do the job (except for the collocation
analyses in the extended menu). And there is no relational DB that I need to set up and maintain, just
a storage folder for the XML and a generated binary index.

Speed:
A decrease in speed that depends on the size of the data, in any DB and for any type of data storage, is
usually a sign of poor engineering and/or poor hardware. XML DBs and other DBs do not differ there: if
you index a field (XML attributes, full text, tags), a search goes through, for example, a hash
function and should be as fast as that hash function, and this is true for any DB, relational,
XML-based, etc. I cannot imagine that access to binary DB tables in a RelDB should be significantly
faster than direct access through the operating system's file I/O to XML files on a disk somewhere
(putting my full faith in current OSes and their good handling of file I/O and cache management, which
is true even for Linux nowadays).
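
The point about hashed indexes can be illustrated with a minimal sketch: an in-memory inverted index over token elements in raw XML files, in which a lookup is a single hash access whatever the corpus size. This is not the CLC index; the file names, the `<w>` element and the sample content are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature "corpus": file name -> raw XML content.
DOCS = {
    "a.xml": "<text><w>riba</w><w>ribi</w><w>grize</w><w>rep</w></text>",
    "b.xml": "<text><w>riba</w><w>pliva</w></text>",
}

def build_index(docs):
    """Map each word form to the (file, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, xml_string in docs.items():
        for pos, w in enumerate(ET.fromstring(xml_string).iter("w")):
            index[w.text].append((name, pos))
    return index

index = build_index(DOCS)      # one pass over the XML, done once
hits = index.get("riba", [])   # a single hash lookup at query time
```

Building the index costs one pass over the data. After that, query time depends on the number of hits rather than on the number of files.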

Evaluation:
If you want to test large sizes and speed, just put BaseX on your desktop (no complicated installation
procedures), fire it up, create a new database from your millions of XML files in some folder, set up
the indexes the right way, enjoy the power of XQuery, and measure the performance. I cannot do that for
our Polish colleagues with their more than a billion tokens; my corpus is just around 100 mil. tokens.
I am seriously working on a way to extend the existing CLC interface with new functionality that makes
use of XQuery and maybe BaseX, and I have no intention of touching RelDBs for the corpus work any time
soon.

ciao
DC

--
Dr. Damir Cavar
http://web.me.com/dcavar/
mobile +49 176 60928748
office +49 7531 885357
private (US): +1 (734) 330-2902
FaceTime: dcavar <at> me.com

António Branco | 1 Jun 2011 19:02

help on document comparison for historians


Dear all,

A friend of mine is working on medieval history and would
like to find a (user-friendly) tool that could help her with
the following functionality: one enters different documents and
the tool will deliver the excerpts (possibly several paragraphs
long) that are identical across documents.
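
For two documents at a time, the functionality described can be approximated with Python's standard `difflib` module. The sketch below is only an illustration: the function name, the `min_words` threshold and the sample sentences are invented, and texts are compared as whitespace-separated words.

```python
from difflib import SequenceMatcher

def shared_excerpts(text_a, text_b, min_words=5):
    """Return word sequences of at least min_words that occur in both texts."""
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

# Invented sample texts.
doc1 = "In the year of our Lord the king granted the abbey its lands and rights forever"
doc2 = "It is recorded that the king granted the abbey its lands and rights in that year"
excerpts = shared_excerpts(doc1, doc2)
```

The matching blocks returned by `difflib` appear in the same order in both texts, so a passage that was moved to a different place may be missed. Spelling variation, which is common in medieval sources, would need normalisation before comparison.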

Any hint or help will be most welcome. Please reply to me.
I'll post a summary.

Kind regards,

António Branco

Graeme Hirst | 1 Jun 2011 18:47

Re: WORD ALIGNMENT: Does it exist?

Also see Jörg Tiedemann's book "Bitext Alignment", which is about to be published (probably this week!)
by Morgan & Claypool (morganclaypool.com) in their HLT Synthesis series.  It includes a 45-page chapter
on word alignment.

--
::::  Graeme Hirst
::::  University of Toronto * Department of Computer Science


Re: help on document comparison for historians

How about eTBLAST http://etest.vbi.vt.edu/etblast3/ ?
Best,
Dina
