Stuart A Yeates | 1 Sep 09:55 2004

Re: language-specific harvesting of texts from the Web

Marco Baroni wrote:
>>One situation where your approach may not work so well, is when a 
>>language's websites use multiple character encodings.  Unfortunately, 
>>this is quite common in languages that have non-Roman writing systems, 
> At least for Japanese, our way to get around this problem in our
> web-mining scripts was to look for the charset declaration in the html
> code of each page, and then to convert (inside the script) the page from
> that charset to utf8.
> I would be interested in hearing about other ways to deal with multiple 
> encodings.

textcat is a language and encoding guesser which reliably guesses
text language and encoding based solely on examples and statistics.
It knows 69 natural languages and is open source.
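
For readers who want a feel for how a guesser "based solely on examples and statistics" can work, here is a minimal sketch of rank-ordered character n-gram profiles compared with an "out-of-place" distance, in the style of Cavnar and Trenkle. It is a toy illustration, not textcat's code: the function names, the `top_k` cutoff and the training samples are all assumptions made for the example. The same idea could in principle be run over raw bytes, with one profile per language/encoding pair.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k character n-grams (n = 1..max_n), most frequent first."""
    # Mark word boundaries with "_" so word-initial and word-final n-grams count.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess(text, profiles):
    """Return the name of the profile closest to the text (hypothetical helper)."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Tiny, made-up training samples; a real guesser needs far more text per language.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then runs away",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und rennt weg",
}
profiles = {name: ngram_profile(text) for name, text in samples.items()}
```

With profiles this small the guesses are only reliable on text very close to the samples; the point is the mechanism, not the accuracy.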

I've had good experience using the built-in Java encoding converters
(readers and writers shipped for ~100 encodings as standard) to convert
between encodings. Freely available.
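
As a rough illustration of the approach described in the quoted message (read the charset declaration from the HTML, then convert the page to UTF-8), here is a minimal sketch in Python. It is not the scripts mentioned above: the function names, the regular expression and the fallback-to-default behaviour are assumptions made for the example, and real pages can declare a charset that does not match their actual bytes.

```python
import re

# Matches charset=... inside a <meta> tag, covering both
# <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">.
_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(raw, default="utf-8"):
    """Return the charset named in the page's <meta> tag, or the default."""
    match = _CHARSET_RE.search(raw[:4096])  # declarations sit near the top
    return match.group(1).decode("ascii") if match else default


def to_utf8(raw, default="utf-8"):
    """Decode the page using its declared charset and re-encode it as UTF-8."""
    charset = declared_charset(raw, default)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows; fall back to the default.
        text = raw.decode(default, errors="replace")
    return text.encode("utf-8")
```

A statistical guesser is a useful cross-check for pages that carry no declaration or a wrong one.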


Stuart Yeates            stuart.yeates <at>
OSS Watch                        
Oxford Text Archive                   
Humbul Humanities Hub               

Adam Kilgarriff | 1 Sep 12:13 2004

RE: Searching BNC for adverbs followed by verb


The Sketch Engine is a new piece of software which does this.

Go to  and self-register.  Then you can
call up a word sketch for any word.  The word sketch for a verb includes
a listing of the adverbs that repeatedly and significantly modify it.
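
For anyone who has a POS-tagged copy of the corpus to hand, the underlying query can be sketched in a few lines of Python. This is only an illustration of the question being asked, not how the Sketch Engine works: it counts raw frequency rather than statistical significance, and the `(word, tag, lemma)` input format and the function name are assumptions. The tags are from the CLAWS C5 tagset used in the BNC (AV0 for general adverbs, VV* for lexical verbs).

```python
from collections import Counter

ADVERB_TAGS = {"AV0"}  # CLAWS C5 general adverb


def adverbs_before(tokens, verb_lemma):
    """Count adverbs immediately preceding any form of the given lexical verb.

    tokens is a list of (word, tag, lemma) tuples (a hypothetical format).
    """
    counts = Counter()
    # Walk over adjacent token pairs: (previous token, current token).
    for (prev_word, prev_tag, _), (_, tag, lemma) in zip(tokens, tokens[1:]):
        if tag.startswith("VV") and lemma == verb_lemma and prev_tag in ADVERB_TAGS:
            counts[prev_word.lower()] += 1
    return counts


# Made-up example data, not taken from the BNC.
example = [
    ("She", "PNP", "she"), ("quickly", "AV0", "quickly"), ("ran", "VVD", "run"),
    ("and", "CJC", "and"), ("he", "PNP", "he"), ("slowly", "AV0", "slowly"),
    ("runs", "VVZ", "run"), ("Quickly", "AV0", "quickly"), ("run", "VVB", "run"),
    ("often", "AV0", "often"), ("walks", "VVZ", "walk"),
]
result = adverbs_before(example, "run")
```

Calling `result.most_common()` then lists the adverbs by frequency.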

Adam Kilgarriff
adam <at>

-----Original Message-----
From: owner-corpora <at> [mailto:owner-corpora <at>] On
Behalf Of Gloria
Sent: 31 August 2004 17:05
Subject: [Corpora-List] Searching BNC for adverbs followed by verb

Dear colleagues,

I would like to know if it is possible to search the BNC in order to
find out which adverbs precede a particular verb.

It doesn't seem to be possible, or at least I haven't figured out the
way to query the database in order to retrieve this sort of information.

Thank you in advance for your help,

Andrew Kehoe | 1 Sep 14:49 2004

Restoration Drama / WebCorp

A colleague is looking for a corpus of Restoration Drama texts
(1660-1710). Does anyone know if such a corpus exists?

Secondly, our WebCorp system is currently
down for maintenance following our move to the University of Central
England in Birmingham. It will be available again later this month.

Andrew Kehoe
Research and Development Unit for English Studies
University of Central England

Priscilla Rasmussen | 1 Sep 19:11 2004

Ann Arbor: ACL-2005 Preliminary Call for Papers

                   ACL-05 Preliminary Call For Papers
43rd Annual Meeting of the Association for Computational Linguistics

                           June 25 - 30, 2005
                 University of Michigan, Ann Arbor, USA


            * * * Submission deadline: January 14, 2005 * * *

General Conference Chair: Kevin Knight (USC/Information Sciences Institute, USA)
Program Co-Chairs: Hwee Tou Ng (National University of Singapore, Singapore)
                   Kemal Oflazer (Sabanci University, Turkey)
Local Organization Chair: Dragomir Radev (University of Michigan, USA)

The Association for Computational Linguistics invites the submission
of papers for its 43rd Annual Meeting hosted jointly with the North
American Chapter of the ACL. Papers are invited on substantial,
original, and unpublished research on all aspects of computational
linguistics, including, but not limited to: pragmatics, discourse,
semantics, syntax, grammars and the lexicon; phonetics, phonology and
morphology; lexical semantics and ontologies; word segmentation,
tagging and chunking; parsing, generation and summarization; language
modeling, spoken language recognition and understanding; linguistic,
psychological and mathematical models of language; language-oriented
information retrieval, question answering, and information extraction;
machine learning for natural language; corpus-based modeling of
language, discourse and dialogue; multi-lingual processing, machine
translation and translation aids; multi-modal and natural language

Martin Wynne | 1 Sep 16:35 2004

RE: Restoration Drama / WebCorp

You can find Restoration drama texts in the Oxford Text Archive, for example
(with catalogue numbers) by Aphra Behn (1327, 2006), John Gay (1706), Thomas
Shadwell (1294), James Shirley (0601), plus a corpus of prologues and
epilogues of the Restoration (1325).

If you use the author name or the numbers in the simple search box at
http:/ you will find these texts.

You might also want to take a look at Voice of the Shuttle, which
categorises links to free e-texts by period and genre.

Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne <at>

> -----Original Message-----
> From: Andrew Kehoe [mailto:Andrew.Kehoe <at>]
> Sent: 01 September 2004 13:49
> To: corpora <at>
> Subject: [Corpora-List] Restoration Drama / WebCorp

Pete Whitelock | 2 Sep 17:59 2004

RE: Searching BNC for adverbs followed by verb

You can find this information using my site

Just enter your verb, press return, and click on 'V* ADV' in the top right panel. 


> -----Original Message-----
> From: owner-corpora <at> 
> [mailto:owner-corpora <at>]On Behalf Of Gloria
> Sent: Tuesday, August 31, 2004 5:05 PM
> Subject: [Corpora-List] Searching BNC for adverbs followed by verb
> Dear colleagues,
> I would like to know if it is possible to search the BNC in order to
> find out which adverbs precede a particular verb.
> It doesn't seem to be possible, or at least I haven't figured out the
> way to query the database in order to retrieve this sort of 
> information. 
> Thank you in advance for your help,
> Gloria

Magali Jeanmaire | 3 Sep 15:30 2004


ELRA - Language Resources Catalogue - Update

We are happy to announce that new speech
databases are available in our catalogue.

You will find below their short descriptions.
Please visit our on-line catalogue to get more
detailed information.

*** S0164 BAS GEO1 ***

The BAS GEO1 database contains the recordings
of location names in Germany, Austria and Switzerland,
together with their pronunciation coded in SAMPA.
Future updates will be distributed to all users automatically.

*** S0165 MICROAES ***

MICROAES is a Spanish microphone database, which
comprises the recordings from 300 different speakers (a
total of 30 hours of speech). Each speaker recorded a
corpus of 450 paragraphs in a quiet environment.
The database includes an orthographic and lexical
transcription, with a few details that represent audible
acoustic events (speech and non speech) present in the
corresponding waveform files. The lexicon has more than
7400 words with the corresponding pronunciation information

Carlos Areces | 6 Sep 19:09 2004

Call for Papers: Special Issue M4M

                       Call for Papers for 

                     "Methods for Modalities"

          A Special Issue of the Journal of Applied Logic

The workshop `Methods for Modalities' (M4M), organized every two
years, aims to bring together researchers interested in developing
proof tools and decision methods for modal logic broadly conceived,
including description logic, hybrid logic, temporal logic, etc.

During 2003, M4M-3, the third instance of the workshop, was organized
in Nancy, France. We have been invited by the Journal of Applied
Logic to prepare a Special Issue containing selected publications
from the workshop, in addition to other articles that might fit
within the topics of interest of M4M. For that reason we are now
distributing this Call for Papers.


The following list is provided as an example of suitable topics for
the Special Issue. All topics should concern modal-like logics,
broadly conceived.  The list is by no means exhaustive and is given 
in an arbitrary order:

* Automated theorem proving
* Decision methods
* Proof methods

Magali Jeanmaire | 7 Sep 09:35 2004

COCOSDA 2004 Workshop

Apologies for multiple postings
COCOSDA Workshop 2004 - 4th October 2004 - Jeju Island, Korea

                         In conjunction with ICSLP 2004

at the International Convention Center Jeju, Jeju Island, Korea
         Preliminary announcement and Call for contributions

The next COCOSDA workshop will be organized on Monday 4th October,
parallel to the INTERSPEECH 2004 - ICSLP conference, held from 4th to
8th October 2004.

The International Committee for the Co-ordination and Standardization of
Speech Databases and Assessment Techniques (COCOSDA) has been
established to encourage and promote international interaction and cooperation
in the foundation areas of Spoken Language Processing.

COCOSDA is an international organization for coordinating global
efforts in language resource development and speech technology evaluation.
The annual workshops of COCOSDA have been held as satellite events of ICSLP
and Eurospeech, now Interspeech. This year, the workshop will be organised on
Monday, 4th October at Interspeech 2004 - ICSLP conference in Korea.

COCOSDA is organized with a structure which reflects the two dimensions of its
functionalities: "Topic Domains" and "Regional Programs".
The former considers the dynamic technology environments, while the latter
considers regional efforts and differences.

Josh Matthews | 7 Sep 08:26 2004

Concordances / Thai script

Dear Sir/Madam, I am writing to inquire about the availability of software
programmes that enable Thai script to be entered into a corpus. I am
particularly interested in concordances and the Thai language.

Can you recommend software, available for purchase, that can process Thai script effectively?