True Friend | 1 Jun 2008 13:07

Cleaning text to take word frequency


Hi,
I am a corpus linguistics student, and I am learning C# for this purpose as well. I've created a simple application to find the frequency of a given word in two files. It is a practice version in C# of a Perl script that a respected subscriber of this list (Alexander Schutz) wrote for me at my request on this list. I needed it then; now I am trying to program myself, so I tried to implement that idea in C#.
I have done all that and it works, but it does not give me the full frequency of each word as the Perl script does. The application takes three files as input. One is a wordlist, which it reads line by line and stores in an array. The other two are plain text files, which are split with the C# String.Split() method using an array of delimiter characters such as ';' and ','. The resulting string array was cleaned of such characters, but I still couldn't match the Perl results: the frequencies of most words are lower than those from the Perl script (which does the same thing). Having tried on my own, I am asking here whether someone can help me. I am attaching both files (the Perl script and the C# .cs file) so you can examine the code and point out where I am wrong.
Regards
--
Muhammad Shakir Aziz محمد شاکر عزیز
using System;
using System.IO;
namespace Frequency
{
class Frequency
{
	public static string[] File1; // array for first file
	public static string[] File2; // array for second file
	public static string[] find = new string[50];  // array for wordlist file
	static void Main()
	{
		Run(); // main method 
	}
	public static void Run()
	{
		FileOne(); // invoke all file methods
		FileTwo();
		Find();
			for(int i=0; i<find.Length; i++) // main for loop runs until the list ends
			{
				int count1 = 0; // int for freq in file 1
				int count2 = 0; // int for freq in file 2
				for(int j=0; j<File1.Length; j++) //sub for loop 1 counts until main's 
												// word is not iterated through it
				{
					if(find[i].Equals(File1[j]) == true)
					{
						count1++;
					}
				}
				for(int j=0; j<File2.Length; j++) //sub for loop 2 counts until main's 
												// word is not iterated through it
				{
					if(find[i].Equals(File2[j]) == true)
					{
						count2++;
					}
				}
				Console.WriteLine("{0,-10}	{1}	{2}", find[i], count1, count2);
				StreamWriter ss = new StreamWriter(@"F:\file2.txt", true);
				ss.WriteLine("{0,-20}	{1}	{2}", find[i], count1, count2);
				ss.Close();
			}
		
		Console.ReadLine();
	}
	static void FileOne() // get file one by stream reader, put it equal to string, get it to lower and split to array
	{
		char[] delims = new char[]{ ' ', '.', ',',';', '"', ':', '(', ')', '[', ']', '\'' };
		string path = @"F:\AgriAll.txt";
		StreamReader sr = new StreamReader(path);
		string str = sr.ReadToEnd();
		sr.Close();
		string str1 = str.ToLower();
		File1 = str1.Split(delims);
	}
	static void FileTwo() // get file two by stream reader, put it equal to string, get it to lower and split to array
	{
		char[] delims = new char[]{ ' ', '.', ',',';', '"', ':', '(', ')', '[', ']', '\'', '/', '?' };
		string path = @"F:\PWE-AllAgri.txt";
		StreamReader sr = new StreamReader(path);
		string str = sr.ReadToEnd();
		sr.Close();
		string str1 = str.ToLower();
		File2 = str1.Split(delims);
	}
	static void Find() //get list by stream reader, put it equal to string, get it to lower and split to array
	{
		string path = @"F:\Agriculture\EditedAgriKeywords.txt";
		StreamReader sw = new StreamReader(path);
		for(int i=0; i<find.Length; i++)
		{
			find[i] = sw.ReadLine();
		}
		sw.Close();
	}
}
}
Attachment (wordlist_corpus_freq.pl): application/octet-stream, 5433 bytes
_______________________________________________
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora
Trevor Jenkins | 1 Jun 2008 14:57

Re: Cleaning text to take word frequency

On Sun, 1 Jun 2008, True Friend <true.friend2004 <at> gmail.com> wrote:

> ... version in C# of a Perl script a respected subscriber of this list
> (Alexander Schutz) ... now I am trying to programm myself so I tried to
> implement that idea in C#. I have done that all and it works also but it
> does not give me 100% frequency of the word as the Perl script does.

That is possible. In fact, it doesn't surprise me at all.

> ... The resulting string array was cleaned from such characters but I
> couldn't get the 100% result. The frequency of most words are less than
> that of Perl script (which does the same thing). ...

I'm neither a Perl wizard nor a C# tune-smith (I still use Snobol4) but I'd
suspect a major difference in the way the two languages process text. For
my money I'd believe Perl is giving you a more accurate result because the
language itself was designed to process text. I'd further believe that C#
(as Microsoft's attempt to have their own Java) doesn't deal with
character and/or textual data in the same way. What Perl accepts as text
C# may well be ditching. You may be right in citing the String.Split()
method; check very carefully what that method is intended to do and
then compare it with how the similar, but not necessarily identical,
function in Perl works. Assume absolutely nothing about the functionality
of either language or of functions with the same name. If in doubt blame
C# for the discrepancy.
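
One concrete thing to check can be sketched in C#. String.Split breaks the text only at the characters it is given, so a delimiter list without '\r', '\n' and '\t' leaves line breaks inside tokens, and a word at the end of one line is glued to the first word of the next. The delimiter list below is copied from FileOne() in the original post; the class name SplitCheck, the helper names and the sample text are made up for illustration, and whether this is the actual cause of the lower counts in the poster's data is only an assumption.

```csharp
using System;

public static class SplitCheck
{
    // Delimiter list copied from FileOne() in the original post.
    static readonly char[] OriginalDelims =
        { ' ', '.', ',', ';', '"', ':', '(', ')', '[', ']', '\'' };

    // Same list plus the white-space characters that separate lines and columns.
    static readonly char[] WithLineBreaks =
        { ' ', '.', ',', ';', '"', ':', '(', ')', '[', ']', '\'', '\r', '\n', '\t' };

    // Counts tokens that are exactly equal to word after lower-casing and splitting.
    static int Count(string text, string word, char[] delims)
    {
        int n = 0;
        foreach (string token in text.ToLowerInvariant().Split(delims))
        {
            if (token == word) n++;
        }
        return n;
    }

    public static int CountOriginal(string text, string word)
    {
        return Count(text, word, OriginalDelims);
    }

    public static int CountWithLineBreaks(string text, string word)
    {
        return Count(text, word, WithLineBreaks);
    }

    public static void Main()
    {
        // The second and third "wheat" are separated only by a line break.
        string text = "Wheat grows here, and wheat\r\nwheat sells.";
        Console.WriteLine(CountOriginal(text, "wheat"));       // prints 1
        Console.WriteLine(CountWithLineBreaks(text, "wheat")); // prints 3
    }
}
```

With the original list the token "wheat\r\nwheat" never equals "wheat", which would make the C# counts come out lower, not higher, than a tokenizer that treats line breaks as separators.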

Regards, Trevor

<>< Re: deemed!


Alexandre Rafalovitch | 1 Jun 2008 15:30

Re: Cleaning text to take word frequency

The way I would have approached this is by finding which words
generate count discrepancies and also exist in one, but not another
version of the result. Then, I would look for those words in the text
and see what context they are in.
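
A minimal C# sketch of that comparison, assuming both sets of results have already been loaded into dictionaries that map each word to its count. The class name CountDiff, the method name Differences and the sample numbers are hypothetical; loading the Perl and C# output files is left out.

```csharp
using System;
using System.Collections.Generic;

public static class CountDiff
{
    // Returns one tab-separated line per word whose count differs between
    // the two results; a word missing from one side is reported with count 0.
    public static List<string> Differences(
        IDictionary<string, int> first,
        IDictionary<string, int> second)
    {
        // Union of all words from both results, in a stable sorted order.
        var words = new SortedSet<string>(first.Keys, StringComparer.Ordinal);
        words.UnionWith(second.Keys);

        var report = new List<string>();
        foreach (string word in words)
        {
            int a, b;
            if (!first.TryGetValue(word, out a)) a = 0;
            if (!second.TryGetValue(word, out b)) b = 0;
            if (a != b) report.Add(word + "\t" + a + "\t" + b);
        }
        return report;
    }

    public static void Main()
    {
        // Made-up counts standing in for the Perl and C# outputs.
        var perl = new Dictionary<string, int> { { "wheat", 3 }, { "rice", 2 }, { "crop", 1 } };
        var cs = new Dictionary<string, int> { { "wheat", 1 }, { "rice", 2 } };
        foreach (string line in Differences(perl, cs))
        {
            Console.WriteLine(line); // lists "crop" and "wheat", but not "rice"
        }
    }
}
```

The words this reports are the ones worth searching for in the source text to see what surrounds them.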

What I suspect you will find is that your partial reimplementation of
Perl's [:punct:] class is causing problems. I would either do a complete
reimplementation of that (see:
http://en.wikipedia.org/wiki/Regular_expression ) or look into C#'s
regular expressions, which I am sure will offer an equivalent
of the [:punct:] class.

Finally, if you are working with languages other than English, you
most certainly should look into regular expression libraries. They
take into account Unicode's rules as well, something you really don't
want to have to duplicate in your own code.
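
As a hedged illustration in C#: as far as I know, .NET regular expressions do not have the POSIX bracket class itself, and the nearest building blocks are Unicode general categories such as \p{P} (punctuation), \p{S} (symbols) and \p{L} (letters). The sketch below extracts tokens as runs of letters, combining marks and digits instead of listing delimiters by hand. The class name UnicodeTokens and the sample sentence are made up; whether hyphens and apostrophes belong inside a word is a separate decision this sketch does not make.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnicodeTokens
{
    // \p{L}: any Unicode letter; \p{Mn}: non-spacing combining marks;
    // \p{Nd}: decimal digits. Any other character ends a token.
    static readonly Regex Word = new Regex(@"[\p{L}\p{Mn}\p{Nd}]+");

    // Returns the tokens of text in the order they occur.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Word.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // Punctuation, quotes, brackets and the line break all act as separators.
        List<string> tokens = Tokenize("Łódź, naïve; \"corpus\" (text)\r\nend.");
        Console.WriteLine(tokens.Count); // prints 5
    }
}
```

Because the pattern names what a word is made of, new punctuation characters in the corpus do not require the delimiter list to be extended.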

Regards,
    Alex.

-- 
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/

On Sun, Jun 1, 2008 at 7:07 AM, True Friend <true.friend2004 <at> gmail.com> wrote:
>
> HI
> I am a corpus linguistics student and learning C# for this purpose as well.
> I've created a simple application to find the frequency of a given word in
> two files. Actually this simple application is a practice version in C# of a
> Perl script a respected subscriber of this list (Alexander Schutz) written
> for me on my request on this list. I needed it then, now I am trying to
> programm myself so I tried to implement that idea in C#.


pcomp | 2 Jun 2008 03:31

Workshop CFP (Corrected submission date):Psychocomputational Models of Human Language Acquisition (PsychoCompLA-2008)

Apologies for multiple postings.

************************ 2nd Call for Abstracts ****************************

Psychocomputational Models of Human Language Acquisition (PsychoCompLA-2008)

July 23rd at CogSci 2008  - Washington, D.C.

Submission Deadline: June 15, 2008

http://www.colag.cs.hunter.cuny.edu/psychocomp/

Workshop Topic:

The workshop is devoted to psychologically-motivated computational models of language acquisition,
that is, models that are compatible with research in psycholinguistics, developmental psychology and linguistics.

Invited Speakers:

* Rens Bod, Institute for Logic, Language and Computation, University of Amsterdam, Netherlands
* Damir Cavar, University of Indiana, USA and Zadar University, Croatia
* Gary Marcus, New York University, USA
* Jeffery Lidz, University of Maryland, USA
* Josh Tenenbaum, Massachusetts Institute of Technology, USA

Workshop History:

This is the fourth meeting of the Psychocomputational Models of Human Language Acquisition workshop
following PsychoCompLA-2004, held in Geneva, Switzerland as part of the 20th International Conference
on Computational Linguistics (COLING-2004), PsychoCompLA-2005 as part of the 43rd Annual Meeting of
the Association for Computational Linguistics (ACL-2005) held in Ann Arbor, Michigan where the
workshop shared a joint session with the Ninth Conference on Computational Natural Language Learning
(CoNLL-2005), and PsychoCompLA-2007 held in Nashville, Tennessee as part of the 29th meeting of the
Cognitive Science Society (CogSci-2007).

Workshop Description:

The workshop will present research and foster discussion centered around psychologically-motivated
computational models of language acquisition, with an emphasis on the acquisition of syntax. In recent
decades there has been a thriving research agenda that applies computational learning techniques to
emerging natural language technologies and many meetings, conferences and workshops in which to
present such research. However, there have been only a few (but growing number of) venues in which
psychocomputational models of how humans acquire their native language(s) are the primary focus.

Psychocomputational models of language acquisition are of particular interest in light of recent
results in developmental psychology that suggest that very young infants are adept at detecting
statistical patterns in an audible input stream. Though, how children might plausibly apply
statistical 'machinery' to the task of grammar acquisition, with or without an innate language
component, remains an open and important question. One effective line of investigation is to
computationally model the acquisition process and determine interrelationships between a model and
linguistic or psycholinguistic theory, and/or correlations between a model's performance and data
from linguistic environments that children are exposed to.

Special Theme:

Although the workshop program speaks to many facets of psychocomputational language acquisition
modeling, the theme of the workshop this year is:

* Computational resources: How much is just right, and does it matter?

The computational resources (e.g., number of calculations per input datum, size of memory store, etc.)
employed by current psychocomputational modeling efforts vary tremendously from model to model.
However, two important questions have rarely been addressed. How well do a particular acquisition
model's resources parallel the resources employed by a human language learner? And, how relevant (or
not) is it to establish such a relationship?

Topics and Goals:

Abstracts that present research on (but not necessarily limited to) the following topics are welcome:

* Models that address the acquisition of word-order;
* Models that combine parsing and learning;
* Formal learning-theoretic and grammar induction models that incorporate psychologically plausible constraints;
* Comparative surveys that critique previously reported studies;
* Models that have a cross-linguistic or bilingual perspective;
* Models that address learning bias in terms of innate linguistic knowledge versus statistical regularity in the input;
* Models that employ language modeling techniques from corpus linguistics;
* Models that employ techniques from machine learning;
* Models of language change and its effect on language acquisition or vice versa;
* Models that employ statistical/probabilistic grammars;
* Computational models that can be used to evaluate existing linguistic or developmental theories (e.g.,
principles & parameters, optimality theory, construction grammar, etc.);
* Empirical models that make use of child-directed corpora such as CHILDES.

This workshop intends to bring together researchers from cognitive psychology, computational
linguistics, other computer/mathematical sciences, linguistics and psycholinguistics working on
all areas of language acquisition. Diversity and cross-fertilization of ideas is the central goal.

Workshop Organizer:
William Gregory Sakas, City University of New York
(sakas at hunter.cuny.edu)

Workshop Co-organizer:
David Guy Brizan, City University of New York
(dbrizan at gc.cuny.edu)

Program Committee:
Rens Bod, Institute for Logic, Language and Computation, University of Amsterdam, Netherlands
David Guy Brizan, City University of New York, USA
Damir Cavar, University of Indiana, USA and Zadar University, Croatia
Nick Chater, University College London, UK
Alex Clark, Royal Holloway University of London, UK
Rick Dale, University of Memphis, USA
Jeffery Lidz, University of Maryland, USA
Gary Marcus, New York University, USA
Lisa Pearl, University of California, Irvine, USA
William Gregory Sakas, City University of New York, USA
Josh Tenenbaum, Massachusetts Institute of Technology, USA
Charles D. Yang, University of Pennsylvania, USA

Submission details:

Authors are invited to submit abstracts of 1 page plus 1 page for data and other supplementary materials.
Abstracts should be anonymous, clearly titled and no more than 500 words in length. Text of the abstract
should fit on one page, with a second page for examples, table, figures, references, etc. The following
formats are accepted: PDF, PS, and MS Word. Please include a cover sheet (as a separate attachment)
containing the title of your submission, your name, contact details and affiliation. Send your
submission electronically to

Email: Psycho.Comp <at> hunter.cuny.edu.
      with  "PsychoCompLA-2008 Submission" somewhere in the subject line.

Publication:

The accepted abstracts will appear in the online workshop proceedings. Full papers of accepted abstracts
will be considered in Fall 2008 for inclusion in an issue of the new Cognitive Science Society Journal -
topiCS - whose focus will be psychocomputational modeling of human language acquisition.

Submission deadline: June 15, 2008

Contact: Psycho.Comp <at> hunter.cuny.edu
        with "PsychoCompLA-2008" somewhere in the subject line.


Stanislaw Roszkowski | 2 Jun 2008 10:38

PALC 2009 First Call

PALC 2009 FIRST CALL FOR PAPERS

The Department of English Language and Applied Linguistics is proud to
announce that the 7th international conference on PRACTICAL APPLICATIONS
IN LANGUAGE AND COMPUTERS (PALC 2009) will be held over 3 days, 6 to 8
April 2009 (arrival day 5 April) at the Lodz University Conference Centre
in Lodz, Poland.

For over a decade, the PALC conferences have served the international
community of corpus linguists by providing a useful forum for the exchange
of views and ideas on how corpora and computational tools can be
effectively employed to explore and advance our understanding of language.

The topics of the conference include, but are not limited to, the following:

Contrastive Studies and Language Corpora
Discourse and Language Corpora
ESP and Language Corpora
Expert, Retrieval and Analytical Systems
FLA/SLA and Language Corpora
Language Teaching Materials and Language (Learner) Corpora/ICT
Virtual Learning Environments
E-testing
Large (multilingual/multimodal) Corpora
Lexicography and Language Corpora
Cognition, Computers and Language
Computer Translation Tools
Machine Translation, Machine-aided Translation, Translation and Corpora
E-books and Corpora and Literature

Workshop sessions:

Combinatorics (patterning) in specialized discourses (organized by
Stanisław Goźdź-Roszkowski)

E-learning (organized by Jacek Waliński & Przemek Krakowian)

Exploring National Corpora (organized by Piotr Pęzik and Łukasz Dróżdż)

Official language of the conference will be English.

PLENARY SPEAKERS
The following scholars have accepted our invitation to address the
conference as plenary speakers:

Mark Davies, Brigham Young University, Provo, USA
Ken Hyland, University of London, UK
Ramesh Krishnamurthy, Aston University, Birmingham, UK
Margaret Rogers, University of Surrey, Guildford, UK
Terttu Nevalainen, University of Helsinki, Finland

ABSTRACTS
Abstracts of papers should be up to 500 words long and forwarded (by
e-mail or fax) to the organisers (see below). Deadline for submission is
31 December 2008. Presentations should last 30 minutes including
demonstrations, questions and discussion.

PROCEEDINGS
A selection of conference papers will be published by Peter Lang as
part of the Łódź Studies in Language series. Deadline for the submission
of papers is 1 July 2009.

COST
The cost of conference registration is 200 euros (170 euros on or before
15 February). This includes a conference pack, coffee breaks and
participation in sessions.

Participants can book subsidised accommodation at the Lodz University
Conference Centre complex, where the conference will be held
(http://www.csk.uni.lodz.pl/).
A conference package is available (including  registration fee,
accommodation [3 nights], full board and conference dinner) at 470 euros.
Please note that all accommodation bookings at the conference centre are
handled by the organizers and the conference centre will not accept
reservations coming directly from participants.

Alternative accommodation options can be found at http://www.hotel.lodz.pl/

We have got a limited number of bursaries for colleagues from low-GNP
countries (including Poland) and full-time students. Please apply by
emailing us at palc <at> uni.lodz.pl.

PAYMENT
Payment should be by bank transfer:

BANK: PKO S.A. II O/Lodz
ACCOUNT No:14124030281111001004347782
IBAN: PL14124030281111001004347782
SWIFT: PKO PPL PW

Alternatively, cash payment can be made on arrival. This will incur an
additional charge of 25 euros. We regret that neither cheques nor credit
cards can be handled.

IMPORTANT DATES
Abstracts due: 31 December 2008
Notification of acceptance 31 January 2009
Early bird registration ends 15 February, 2009
Submission of conference papers 1 July 2009

ORGANISING COMMITTEE
Prof. Barbara Lewandowska-Tomaszczyk
Dr Stanisław Goźdź-Roszkowski
Dr Przemysław Krakowian
Dr Krzysztof Kredens
Dr Jacek Waliński
Mr Łukasz Dróżdż
Ms Anna Kamińska
Mr Piotr Pęzik

Department of English Language and Applied Linguistics
University of Lodz
Email: palc <at> uni.lodz.pl; roszkowski <at> uni.lodz.pl
http://palc.ia.uni.lodz.pl
phone 48 42 6655220
fax    48 42 6655221

-- 
Dr Stanislaw Gozdz-Roszkowski
Assistant Professor
Department of English and Applied Linguistics
University of Lodz
Kosciuszki 65
90-514, Lodz, Poland
tel. 48 42 6655220
fax  48 42 6655221
cell phone 48 603 965 867
http://www.filolog.uni.lodz.pl/kja/Roszkowski.htm
http://www.linkedin.com/in/stanislawgozdzroszkowski

Afzal, Naveed | 2 Jun 2008 15:27

Biomedical Relations annotated corpus for Evaluation

Dear Members,
 
I have built an information extraction system using the Genia Corpus, and I am looking for a biomedical corpus in which relations are manually annotated, for evaluation purposes. I am already aware of the Genia Event annotation corpus, but I am looking for an alternative biomedical corpus. Any pointers in this regard will be very much appreciated.
 
Thanks for your time and help.
 
 
Kind Regards,
Naveed Afzal
PhD Student
Computational Linguistics Group
University of Wolverhampton
MB109 Stafford Street
Wolverhampton
WV1 1SB
United Kingdom
 
 

José Lopes Moreira Filho | 2 Jun 2008 21:22

RE: Corpora Digest, Vol 12, Issue 2

Hi 

I am a Corpus Linguistics student/researcher. Experience with: Shell, PHP,
ASP, VB6, VB.NET, and learning C#. I also have some applications built with
VB.net:
http://www.corpuslg.org/software/temp/kwicg1.1.exe
http://www.corpuslg.org/software/developer/?postid=17 

I think you don't need to clean text to take word frequency with C# or
VB.NET.

The 'Split' and 'Replace' operations are slow compared to what you can do using
'StringBuilder', 'RegularExpressions', and 'Dictionaries'.

Take a look at the simple VB.NET code below. It 'cleans' (actually extracts)
text without using the 'Replace' or 'Split' functions.
The same approach is possible for a C# function.

VB.NET CODE:
'+-------------------------------------------------------------------------
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Collections.Generic
'+-------------------------------------------------------------------------

'--------------------------------------------------------------------------
' Extracts every word from strText and returns them one per line.
Private Function Get_Words(strText As String) As String
    Dim strbResult As New StringBuilder
    ' Alternatives, tried in order: hyphenated words, words containing an
    ' apostrophe, words followed by an apostrophe, plain words.
    Dim regexpWord As String = "\b\w+-\w+\b|\b\w+\'\w+\b|\b\w+\'\b|\b\w+\b"
    Dim WordIndex As MatchCollection = Regex.Matches(strText, regexpWord)

    For i As Integer = 0 To WordIndex.Count - 1
        strbResult.AppendLine(WordIndex(i).Value)
    Next

    Return strbResult.ToString()
End Function
'---------------------------------------------------------------------------

In order to count words, you can use a 'Dictionary(Of String, Long)' object.
Instead of passing the words to a Stringbuilder object (as in the code
above), you pass the extracted words to a Dictionary of String, Long.
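
A rough C# sketch of that idea, under a few assumptions: the pattern is the one from the VB.NET function above, the text is lower-cased before matching (as in the original C# program), and the class name WordFrequency and method name Count are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFrequency
{
    // Same pattern as in the VB.NET function above.
    const string WordPattern = @"\b\w+-\w+\b|\b\w+\'\w+\b|\b\w+\'\b|\b\w+\b";

    // Returns a count for every distinct word matched in text.
    public static Dictionary<string, long> Count(string text)
    {
        var counts = new Dictionary<string, long>();
        foreach (Match m in Regex.Matches(text.ToLowerInvariant(), WordPattern))
        {
            long n;
            counts.TryGetValue(m.Value, out n); // n is 0 for a word not seen yet
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    public static void Main()
    {
        Dictionary<string, long> counts = Count("Wheat, wheat and well-known wheat.");
        Console.WriteLine(counts["wheat"]);      // prints 3
        Console.WriteLine(counts["well-known"]); // prints 1
    }
}
```

Looking up each wordlist entry in the dictionary then replaces the nested loops over the whole token array, so every file is scanned only once.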

I hope this can shed some light on the issue. 

Regards,
    José. 

-----Original Message-----
From: corpora-request <at> uib.no [mailto:corpora-request <at> uib.no]
Sent: Monday, 2 June 2008 10:00
To: corpora <at> uib.no
Subject: Corpora Digest, Vol 12, Issue 2

Today's Topics:

   1.  Cleaning text to take word frequency (Trevor Jenkins)
   2.  Cleaning text to take word frequency (Alexandre Rafalovitch)
   3.  Workshop CFP (Corrected submission date):Psychocomputational
      Models of Human Language Acquisition (PsychoCompLA-2008)
      (pcomp_AT_hunter.cuny.edu)
   4.  PALC 2009 First Call (Stanislaw Roszkowski)

----------------------------------------------------------------------

Message: 1
Date: Sun, 1 Jun 2008 12:57:13 +0000 (GMT)
From: Trevor Jenkins <trevor.jenkins_AT_suneidesis.com>
Subject: [Corpora-List] Cleaning text to take word frequency
To: Corpora list <corpora_AT_uib.no>

On Sun, 1 Jun 2008, True Friend <true.friend2004_AT_gmail.com> wrote:

> ... version in C# of a Perl script a respected subscriber of this list
> (Alexander Schutz) ... now I am trying to programm myself so I tried to
> implement that idea in C#. I have done that all and it works also but it
> does not give me 100% frequency of the word as the Perl script does.

That is possible. In fact doesn't surprise me at all.

> ... The resulting string array was cleaned from such characters but I
> couldn't get the 100% result. The frequency of most words are less than
> that of Perl script (which does the same thing). ...

I'm neither a perl wizard or a C# tune-smith (I still use Snobol4) but I'd
suspect a major difference in the way the two language process text. For
my money I'd believe perl is giving you a more accurate result because the
language itself was designed to process text. I'd further believer that C#
(as Microsoft's attempt to have their own Java) doesn't deal with
character and/or textual data in the same way. What perl accepts as text
C# may well be ditching. You may be right by citing the System.split()
function; check very carefully what that function is intended to do and
then compare it with how the similar, but not necessarily identical,
function in perl works. Assume absolutely nothing about the functionality
of either language or of functions with the same name. If in doubt blame
C# for the discrepancy.

Regards, Trevor

<>< Re: deemed!

------------------------------

Message: 2
Date: Sun, 1 Jun 2008 09:30:03 -0400
From: "Alexandre Rafalovitch" <arafalov_AT_gmail.com>
Subject: [Corpora-List] Cleaning text to take word frequency
To: "True Friend" <true.friend2004_AT_gmail.com>
Cc: corpora_AT_uib.no

The way I would have approached this is by finding which words
generate count discrepancies and also exist in one, but not another
version of the result. Then, I would look for those words in the text
and see what context they are in.

What I suspect you will find is that your partial reimplementation of
perl's :punct: class is causing problems. I would either do a complete
reimplementation of that (see:
http://en.wikipedia.org/wiki/Regular_expression ) or look into C#'s
regular expressions, which I am sure will contain the same definition
of the :punct: class.

Finally, if you are working with languages other than English, you
most certainly should look into regular expression libraries. They
take into account Unicode's rules as well, something you really don't
want to have to duplicate in your own code.

Regards,
    Alex.

-- 
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/

On Sun, Jun 1, 2008 at 7:07 AM, True Friend <true.friend2004_AT_gmail.com>
wrote:
>
> HI
> I am a corpus linguistics student and learning C# for this purpose as
well.
> I've created a simple application to find the frequency of a given word in
> two files. Actually this simple application is a practice version in C# of
a
> Perl script a respected subscriber of this list (Alexander Schutz) written
> for me on my request on this list. I needed it then, now I am trying to
> programm myself so I tried to implement that idea in C#.

------------------------------

Message: 3
Date: Sun,  1 Jun 2008 21:31:02 -0400 (EDT)
From: pcomp_AT_hunter.cuny.edu
Subject: [Corpora-List] Workshop CFP (Corrected submission
	date):Psychocomputational Models of Human Language Acquisition
	(PsychoCompLA-2008)
To: Corpora_AT_uib.no

Apologies for multiple postings.

************************ 2nd Call for Abstracts ****************************

Psychocomputational Models of Human Language Acquisition (PsychoCompLA-2008)

July 23rd at CogSci 2008  - Washington, D.C.

Submission Deadline: June 15, 2008

http://www.colag.cs.hunter.cuny.edu/psychocomp/

Workshop Topic:

The workshop is devoted to psychologically-motivated computational models of
language acquisition. That is, models that are compatible with research in
psycholinguistics, developmental psychology and linguistics.

Invited Speakers:

* Rens Bod, Institute for Logic, Language and Computation, University of
Amsterdam, Netherlands
* Damir Cavar, University of Indiana, USA and Zadar University, Croatia
* Gary Marcus, New York University, USA
* Jeffery Lidz, University of Maryland, USA
* Gary Marcus, New York University, USA
* Josh Tenenbaum, Massachusetts Institute of Technology, USA

Workshop History:

This is the fourth meeting of the Psychocomputational Models of Human
Language Acquisition workshop following PsychoCompLA-2004, held in Geneva,
Switzerland as part of the 20th International Conference on Computational
Linguistics (COLING-2004), PsychoCompLA-2005 as part of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL-2005) held in
Ann Arbor, Michigan where the workshop shared a joint session with the Ninth
Conference on Computational Natural Language Learning (CoNLL-2005), and
PsychoCompLA-2007 held in Nashville, Tennessee as part of the 29th meeting
of the Cognitive Science Society (CogSci-2007).

Workshop Description:

The workshop will present research and foster discussion centered around
psychologically-motivated computational models of language acquisition, with
an emphasis on the acquisition of syntax. In recent decades there has been a
thriving research agenda that applies computational learning techniques to
emerging natural language technologies and many meetings, conferences and
workshops in which to present such research. However, there have been only a
few (but growing number of) venues in which psychocomputational models of
how humans acquire their native language(s) are the primary focus.

Psychocomputational models of language acquisition are of particular
interest in light of recent results in developmental psychology that suggest
that very young infants are adept at detecting statistical patterns in an
audible input stream. Though, how children might plausibly apply statistical
'machinery' to the task of grammar acquisition, with or without an innate
language component, remains an open and important question. One effective
line of investigation is to computationally model the acquisition process
and determine interrelationships between a model and linguistic or
psycholinguistic theory, and/or correlations between a model's performance
and data from linguistic environments that children are exposed to.

Special Theme:

Although the workshop program speaks to many facets of psychocomputational
language acquisition modeling, the theme of the workshop this year is:

* Computational resources: How much is just right, and does it matter?

The computational resources (e.g., number of calculations per input datum,
size of memory store, etc.) employed by current psychocomputational modeling
efforts vary tremendously from model to model. However, two important
questions have rarely been addressed. How well do a particular acquisition
model's resources parallel the resources employed by a human language
learner? And, how relevant (or not) is it to establish such a relationship?

Topics and Goals:

Abstracts that present research on (but not necessarily limited to) the
following topics are welcome:

* Models that address the acquisition of word-order;
* Models that combine parsing and learning;
* Formal learning-theoretic and grammar induction models that incorporate
psychologically plausible constraints;
* Comparative surveys that critique previously reported
studies;
* Models that have a cross-linguistic or bilingual perspective;
* Models that address learning bias in terms of innate
linguistic knowledge versus statistical regularity in the
input;
* Models that employ language modeling techniques from corpus linguistics;
* Models that employ techniques from machine learning;
* Models of language change and its effect on language
acquisition or vice versa;
* Models that employ statistical/probabilistic grammars;
* Computational models that can be used to evaluate existing linguistic or
developmental theories (e.g., principles & parameters, optimality theory,
construction grammar, etc.)
* Empirical models that make use of child-directed corpora such as CHILDES.

This workshop intends to bring together researchers from cognitive
psychology, computational linguistics, other computer/mathematical sciences,
linguistics and psycholinguistics working on all areas of language
acquisition. Diversity and cross-fertilization of ideas is the central goal.

Workshop Organizer:
William Gregory Sakas, City University of New York
(sakas at hunter.cuny.edu)

Workshop Co-organizer:
David Guy Brizan, City University of New York
(dbrizan at gc.cuny.edu)

Program Committee:
Rens Bod, Institute for Logic, Language and Computation, University of
Amsterdam, Netherlands
David Guy Brizan, City University of New York, USA
Damir Cavar, University of Indiana, USA and Zadar University, Croatia
Nick Chater, University College London, UK
Alex Clark, Royal Holloway University of London, UK
Rick Dale, University of Memphis, USA
Jeffery Lidz, University of Maryland, USA
Gary Marcus, New York University, USA
Lisa Pearl, University of California, Irvine, USA
William Gregory Sakas, City University of New York, USA
Josh Tenenbaum, Massachusetts Institute of Technology, USA
Charles D. Yang, University of Pennsylvania, USA

Submission details:

Authors are invited to submit abstracts of 1 page plus 1 page for data and
other supplementary materials. Abstracts should be anonymous, clearly titled
and no more than 500 words in length. Text of the abstract should fit on one
page, with a second page for examples, tables, figures, references, etc. The
following formats are accepted: PDF, PS, and MS Word. Please include a cover
sheet (as a separate attachment) containing the title of your submission,
your name, contact details and affiliation. Send your submission
electronically to

Email: Psycho.Comp_AT_hunter.cuny.edu.
      with  "PsychoCompLA-2008 Submission" somewhere in the subject line.

Publication:

The accepted abstracts will appear in the online workshop proceedings. Full
papers of accepted abstracts will be considered in Fall 2008 for inclusion
in an issue of the new Cognitive Science Society Journal - topiCS - whose
focus will be psychocomputational modeling of human language acquisition.

Submission deadline: June 15, 2008

Contact: Psycho.Comp_AT_hunter.cuny.edu
        with "PsychoCompLA-2008" somewhere in the subject line.

------------------------------

Message: 4
Date: Mon, 2 Jun 2008 10:38:53 +0200 (CEST)
From: "Stanislaw Roszkowski" <roszkowski_AT_uni.lodz.pl>
Subject: [Corpora-List] PALC 2009 First Call
To: Corpora_AT_uib.no

PALC 2009 FIRST CALL FOR PAPERS

The Department of English Language and Applied Linguistics is proud to
announce that the 7th international conference on PRACTICAL APPLICATIONS
IN LANGUAGE AND COMPUTERS (PALC 2009) will be held over 3 days, 6 to 8
April 2009 (arrival day 5 April) at the Lodz University Conference Centre
in Lodz, Poland.

For over a decade, the PALC conferences have served the international
community of corpus linguists by providing a useful forum for the exchange
of views and ideas on how corpora and computational tools can be
effectively employed to explore and advance our understanding of language.

The topics of the conference include, but are not limited to, the following:

Contrastive Studies and Language Corpora
Discourse and Language Corpora
ESP and Language Corpora
Expert, Retrieval and Analytical Systems
FLA/SLA and Language Corpora
Language Teaching Materials and Language (Learner) Corpora/ICT
Virtual Learning Environments
E-testing
Large (multilingual/multimodal) Corpora
Lexicography and Language Corpora
Cognition, Computers and Language
Computer Translation Tools
Machine Translation, Machine-aided Translation, Translation and Corpora
E-books and Corpora and Literature

Workshop sessions:

Combinatorics (patterning) in specialized discourses (organized by
Stanisław Goźdź-Roszkowski)

E-learning (organized by Jacek Waliński & Przemek Krakowian)

Exploring National Corpora (organized by Piotr Pęzik and Łukasz Dróżdż)

The official language of the conference will be English.

PLENARY SPEAKERS
The following scholars have accepted our invitation to address the
conference as plenary speakers:

Mark Davies, Brigham Young University, Provo, USA
Ken Hyland, University of London, UK
Ramesh Krishnamurthy, Aston University, Birmingham, UK
Margaret Rogers, University of Surrey, Guildford, UK
Terttu Nevalainen, University of Helsinki, Finland

ABSTRACTS
Abstracts of papers should be up to 500 words long and forwarded (by
e-mail or fax) to the organisers (see below). Deadline for submission is
31 December 2008. Presentations should last 30 minutes including
demonstrations, questions and discussion.

PROCEEDINGS
A selection of conference papers will be published by Peter Lang as
part of the Łódź Studies in Language series. Deadline for the submission
of papers is 1 July 2009.

COST
The cost of conference registration is 200 euros (170 euros on or before
15 February). This includes a conference pack, coffee breaks and
participation in sessions.

Participants can book subsidised accommodation at the Lodz University
Conference Centre complex, where the conference will be held
(http://www.csk.uni.lodz.pl/).
A conference package is available (including registration fee,
accommodation [3 nights], full board and conference dinner) at 470 euros.
Please note that all accommodation bookings at the conference centre are
handled by the organizers and the conference centre will not accept
reservations coming directly from participants.

Alternative accommodation options can be found at http://www.hotel.lodz.pl/

We have a limited number of bursaries for colleagues from low-GNP
countries (including Poland) and full-time students. Please apply by
emailing us at palc_AT_uni.lodz.pl.

PAYMENT
Payment should be by bank transfer:

BANK: PKO S.A. II O/Lodz
ACCOUNT No: 14124030281111001004347782
IBAN: PL14124030281111001004347782
SWIFT: PKO PPL PW

Alternatively, cash payment can be made on arrival. This will incur an
additional charge of 25 euros. We regret that neither cheques nor credit
cards can be handled.

IMPORTANT DATES
Abstracts due: 31 December 2008
Notification of acceptance: 31 January 2009
Early bird registration ends: 15 February 2009
Submission of conference papers: 1 July 2009

ORGANISING COMMITTEE
Prof. Barbara Lewandowska-Tomaszczyk
Dr Stanisław Goźdź-Roszkowski
Dr Przemysław Krakowian
Dr Krzysztof Kredens
Dr Jacek Waliński
Mr Łukasz Dróżdż
Ms Anna Kamińska
Mr Piotr Pęzik

Department of English Language and Applied Linguistics
University of Lodz
Email: palc_AT_uni.lodz.pl; roszkowski_AT_uni.lodz.pl
http://palc.ia.uni.lodz.pl
phone 48 42 6655220
fax    48 42 6655221

-- 
Dr Stanislaw Gozdz-Roszkowski
Assistant Professor
Department of English and Applied Linguistics
University of Lodz
Kosciuszki 65
90-514, Lodz, Poland
tel. 48 42 6655220
fax  48 42 6655221
cell phone 48 603 965 867
http://www.filolog.uni.lodz.pl/kja/Roszkowski.htm
http://www.linkedin.com/in/stanislawgozdzroszkowski

----------------------------------------------------------------------
Send Corpora mailing list submissions to
	corpora <at> uib.no

To subscribe or unsubscribe via the World Wide Web, visit
	http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
	corpora-request <at> uib.no

You can reach the person managing the list at
	corpora-owner <at> uib.no

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."

_______________________________________________
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora

End of Corpora Digest, Vol 12, Issue 2
**************************************



True Friend | 3 Jun 2008 02:42

Re: Cleaning text to take word frequency



Thanks for your kind replies.
I am learning C# and have no prior programming experience. I created this application to practice what I had learnt. The Perl script I presented probably uses a hash table and is obviously a fine piece of code; I just tried to implement the same idea in another way.
I will try again after changing the punctuation-cleaning method and using dictionaries/hash tables instead. Regular expressions are a good idea for cleaning text (I still have to learn regex before I can apply them).
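
As a rough illustration of the two ideas mentioned above (regex-based cleaning and a hash table of counts), here is a minimal sketch in Python, one of the languages suggested in this thread. The function names, the `\w+` token pattern and the lower-casing step are assumptions made for this example; they are not taken from the Perl script or the C# program. The same structure should carry over to a C# `Dictionary<string, int>`.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def word_frequencies(text):
    """Count every word in the text once, using a hash table (Counter)."""
    # Assumption: counting is case-insensitive, so the text is lower-cased.
    return Counter(WORD_RE.findall(text.lower()))

def lookup(wordlist, freqs):
    """Pair each word in the word list with its count (0 if it never occurs)."""
    return {word: freqs[word.lower()] for word in wordlist}

sample = "The cat sat; the dog, the bird."
freqs = word_frequencies(sample)
result = lookup(["the", "cat", "fish"], freqs)
print(result)  # {'the': 3, 'cat': 1, 'fish': 0}
```

Because the text is tokenized and counted only once, looking up each word from the word list is a single hash-table access rather than a pass over the whole file.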
Thanks for your valuable suggestions about Perl and Python; I'll try to learn these languages as soon as possible after completing the basic concepts of C#.
Regards



--
Muhammad Shakir Aziz محمد شاکر عزیز
True Friend | 3 Jun 2008 09:59

Re: Cleaning text to take word frequency

Thanks for your message. The Perl script was not written by me; it was written by a subscriber of this list at my request. It is obviously better to use hashtables, and C# does have hashtables as well. I used arrays just to practice what I've learnt. Your valuable suggestions about delimiters and changing the code will help me make it better. Next time I'll use hashtables, apply a better approach to methods, and remove the duplicate methods.
Thanks for your kind suggestions.
Regards

On Tue, Jun 3, 2008 at 11:28 AM, jeremy ellman <jeremyellman <at> gmail.com> wrote:
Hi,

The two implementations are quite different. In Perl you are using hashes, and in C# you are using arrays. C# has hashtables and most other features that Perl has, including iterating over hashtables.

Split and regular expressions are identical between C# and Perl (leastwise, I've never found a difference), but I do notice that you are using different delimiters, as you declare them twice. I suggest that you make delims an instance variable and declare it once. Then you can test your application with a small paragraph-length text to understand what the two programs do differently.
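
The effect of inconsistent delimiters can be seen on a one-line test text. The following is a minimal Python sketch, not the original C# code; the delimiter set, the helper names and the test sentence are all assumptions chosen for illustration. It declares the delimiters once and then shows how a shorter delimiter list lowers the count for the same word.

```python
# Assumption: an illustrative delimiter set, declared once and shared by every file.
DELIMS = ".,;:!?\"()"

def tokenize(text, delims=DELIMS):
    """Turn each delimiter into a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in delims})
    return text.translate(table).split()

def count_word(word, tokens):
    """Count exact matches of the word among the tokens."""
    return sum(1 for tok in tokens if tok == word)

paragraph = "corpus, corpus; corpus: (corpus) corpus."

# Full delimiter set: every occurrence is cleaned and counted.
full = count_word("corpus", tokenize(paragraph))
# Shorter set: "corpus:", "(corpus)" and "corpus." keep their punctuation.
partial = count_word("corpus", tokenize(paragraph, delims=",;"))

print(full, partial)  # 5 2
```

With the shorter list, tokens such as `corpus:` no longer match `corpus` exactly, which is one plausible way for a program to report lower frequencies than a reference script.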

It is also a bad idea to use static methods in C# unless you need to (and you don't).

Incidentally, it is much faster to write Corpus applications in Perl, although C# apps are more robust.

Jeremy







--
Muhammad Shakir Aziz محمد شاکر عزیز
YEPES, GR | 2 Jun 2008 15:23

Quijote

Dear all,

I have been searching the web for two translations of Cervantes's masterpiece into German (plain text to
download), specifically the translation by Ludwig Tieck (1799/1801) and the translation by Anton Maria
Rothbauer (1964). I have asked the publishers already and have also looked at the Gutenberg website,
where I have found the original and its translation into English by Thomas Shelton.

Any clues about where to find these two translations of El Quijote into German would be much appreciated.

Kind regards, Guadalupe Ruiz Yepes

--
Dr Guadalupe Ruiz Yepes
Post-Doc Researcher
School of Languages and Social Sciences
Aston University
Birmingham, UK B4 7ET
Tel: +44 (0)121 204 3809


