Daniel O'Donnell | 4 Apr 07:53 2006

Feature structures in a production environment

I have always been intrigued by feature structures. In SGML/pre-XSLT
days, it was never clear to me how they might be used. In XSLT terms I
have an inkling I might be able to use them in some productive-to-humans
way.

Has anybody used feature structures in a production environment? Can
they let us see some examples?

-d
-- 
Daniel Paul O'Donnell, PhD
Associate Professor and Chair of English
Director, Digital Medievalist Project
(http://www.digitalmedievalist.org/)
University of Lethbridge
Lethbridge AB T1K 3M4
Canada

Vox +1 (403) 329-2378
Fax +1 (403) 382-7191
Cell +1 (403) 393-2539

: <at> wiglaf/ubuntu

Peter Boot | 4 Apr 11:23 2006

Modifying element description in Roma

Hi all,

Sorry to bring this up again, but I need some help. I'm trying to change
element descriptions using Roma, which for some reason doesn't work for
me. At Sebastian's end, the system seems to work as it should. I may be
missing something, but what? Is there some kind soul out there willing to
go through the following scenario and report his/her results, perhaps
off-list? It shouldn't take more than a few minutes.

Thanks in advance,
Peter Boot

1. Go to http://www.tei-c.org.uk/Roma/
result: server redirects to http://tei.oucs.ox.ac.uk/Roma/

2. leave 'create new cust' on and press submit
result: shows 'Set your parameters' window

3. click 'modules' tab
result: shows 'modules' window with four selected modules at right hand side

4. click 'core' at right side of window
result: shows 'Change module' window

5. press 'abbr'
result: shows 'Change Element' window

6. change original description ('contains an abbreviation of any sort')
into 'CHANGED!', press 'submit query'
result: back to 'change module' window,

Martin Mueller | 7 Apr 16:57 2006

Northwestern University announces release of WordHoard

Academic Technologies and the Library at Northwestern University are happy to announce the release of WordHoard at http://wordhoard.northwestern.edu.

Named after an Old English phrase for the verbal treasure unlocked by a wise speaker, WordHoard is an application for the close reading and scholarly analysis of deeply tagged literary texts. It applies to highly canonical literary texts the insights and techniques of corpus linguistics, that is to say, the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In the WordHoard environment, such texts are tagged by morphological, lexical, prosodic, and narratological criteria. They are mediated through a digital page or user interface that lets scholarly but non-technical users explore the greatly increased query potential of textual data kept in such a form.

The development of WordHoard has been supported by a generous grant from the Andrew W. Mellon Foundation.  The current release includes the remains of Early Greek epic in Greek and translation, all of Chaucer and Shakespeare, and Spenser’s Faerie Queene. The texts have been tagged by morphosyntactic, lexical, prosodic, and narratological criteria. The English texts have been tagged according to a common scheme that enables users to compare Chaucer with Spenser or Shakespeare from a variety of perspectives.

WordHoard may be seen as a textbase with an unusually flexible set of concordance features. Much attention has been paid to a user interface that allows for the side-by-side display of arbitrarily chosen passages in the same field of vision. Concordance searches may quickly be grouped and regrouped by various criteria, including speaker gender or prosodic status in the case of Shakespeare.

Every word occurrence in the texts is a link that can be activated to display in a GetInfo window all the information the text may be said to know about all forms of the word in that location. This is very useful for texts that have much orthographic or morphological variety, such as Spenser or Chaucer, not to speak of Homer or Hesiod: for any given word in the text the reader is a second away from a table that shows all the spellings of all the forms of that word sorted by frequency, thus giving an immediate overview of actual usage.

WordHoard includes a statistical engine that supports a variety of procedures common in Natural Language Processing. For example, users can look for words that are disproportionately common or rare in Shakespeare’s comedies when compared with the tragedies or all of Shakespeare. The current release includes precompiled work sets for analysis. In later releases, users will be able to configure sets for their own purposes.
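
One measure often used in corpus linguistics for this kind of comparison is the log-likelihood ratio (G2). The announcement does not say which statistics WordHoard actually computes, so the Python below is only a sketch under that assumption; the function name and the counts in the usage note are hypothetical.

```python
import math

def log_likelihood(count_a, count_b, size_a, size_b):
    """Two-term log-likelihood score (G2) for one word in two word sets.

    count_a, count_b: occurrences of the word in set A and set B.
    size_a, size_b:   total number of words in set A and set B.
    """
    total = count_a + count_b
    # Expected counts if the word were used at the same rate in both sets.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero count contributes nothing (the limit of x * log(x) is 0).
    if count_a > 0:
        score += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        score += count_b * math.log(count_b / expected_b)
    return 2 * score
```

A call such as log_likelihood(30, 10, 1000, 1000) compares a word seen 30 times in one set of 1000 words with 10 times in another set of the same size. A score near zero means the word is used at about the same rate in both sets; the larger the score, the stronger the disparity. Whether the word is over- or under-represented has to be read off the relative frequencies themselves.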

WordHoard also includes an annotation module. In the current release, this module supports the display of the Iliad scholia as true textual marginalia. Later releases will support user annotation not only of particular locations in a text but of words wherever they occur. A prototype of WordHoard with user generated annotation is in operation at Northwestern, but it will require additional security features before it can be released.

WordHoard is a Java Web Start/Swing application.  It requires a broadband connection and will not work over a modem.  Many operations in WordHoard involve extensive shuttling between the client and the server. WordHoard will therefore generally be quite a bit faster in on-campus environments, where information moves at the same speed in both directions, than in off-campus environments where download speeds are between five and times as fast as upload speeds. General network traffic and the complexity of queries or size of result sets also are important variables.  We will be very interested in getting feedback from users about how the application works in different environments. WordHoard has a Send Error Report in its File Menu. This was designed to point out errors in the tagging, but it can be used just as effectively for general comments. You may also send email to martinmueller <at> northwestern.edu.

Wendell Piez | 7 Apr 20:51 2006

Re: tag spoken sections in fiction corpora

Martin,

You posted this two weeks ago or more....

At 04:19 PM 3/20/2006, you wrote:
>I have been thinking about retroactively tagging spoken sections in
>various fiction corpora, and I wonder whether anybody has advice on
>the utility or feasibility of such a project.

No advice in particular, but interest in general.

>As for feasibility, it's certainly going to be a tedious business.
>You have to look at files one by one and figure out whether through a
>combination of authorial pointers (she said) and typographical
>devices (quotation marks, dashes, etc) you could get good enough
>results (whatever 'good enough' means in that context). And you'd have
>to keep your fingers crossed that a script that works for one work or
>author will with little labor do other texts well enough. Does
>anybody have experience with that kind of work?

I should think that each work would be unique, to say nothing of 
differences in reconciling the available representations of the texts 
you started with. Each edition having its own peculiar history, etc. 
Heuristic analytic tools and methods might cross over, but would also 
have to be tuned and adapted for particular cases.

>As for utility, it is a reasonable assumption that narrative and
>speech will differ significantly in just about every text.

I agree.

>I learned this with Homer, where narrative and speech seem on the 
>surface quite
>continuous.

Because in epic (as in tragedy, a related form) plots are often moved 
forward, or recounted, through dialogue. Yet there are interesting 
folds in narrative-temporal logic, as when whole segments are 
encapsulated in framed narratives. Also, there is a correspondence / 
alignment between narrative and the appearance of set pieces that 
could be traced.

>There was a study some years ago that claimed to distinguish between 
>the authors of the Iliad and Odyssey on the basis of the 
>distribution of common words. But what that study measured was 
>mainly the fact that characters talk more in the Odyssey.

That's fascinating. I wonder about similar data like the lengths and 
distribution of speeches, etc.

One thing I've often thought about is graphical representations of 
differences between instances of a genre with respect to these sorts of things.

>Are there stylometric or thematic analyses for which scholars would
>like to have tagged fiction corpora where narrative and speech are
>tagged with sufficient accuracy? By sufficient accuracy I mean a
>level that would allow a scholar interested in a particular smaller
>set of works to bring them up to snuff himself over the course of a
>long weekend.

I would love to know about such examples myself.

For what it's worth, it is largely the attraction of this sort of 
problem that tells me we really want decent approaches to the 
"overlap problem".

Best regards,
Wendell

======================================================================
Wendell Piez                            mailto:wapiez <at> mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

Kenneth M Price | 10 Apr 04:41 2006

Nebraska Digital Workshop

First Annual Nebraska Digital Workshop

Center for Digital Research in the Humanities (CDRH)
University of Nebraska-Lincoln
September 22-23, 2006

The Center for Digital Research in the Humanities (CDRH) at the University
of Nebraska-Lincoln will host the first annual Nebraska Digital Workshop on
September 22-23, 2006 and seeks proposals for digital presentations by
pre-tenure faculty, postdoctoral fellows, and advanced graduate students
working in digital humanities.

Workshop Goals

The goal of the Workshop is to enable the best early-career scholars in the
field of digital humanities, including, but not limited to, English,
History, and Modern Languages, to present their work in a forum where it
can be critically evaluated, improved, and showcased.

Under the auspices of the CDRH research faculty and staff (a group that
includes CDRH Co-Directors Kenneth M. Price and Katherine L. Walter, Brett
Barney, Andrew Jewell, Brian Pytlik Zillig, Douglas Seefeldt, William G.
Thomas, III, and Judellen Thornton-Järinge), the Nebraska Digital Workshop
will offer opportunities to discuss the potential of humanities computing,
present examples of successful projects created at the CDRH, offer a new
tools workshop, share strategies for developing administrative and
institutional support for digital humanities scholarship at the applicants'
home institutions, and share external funding and grant-writing tips. The
Workshop ultimately endeavors to foster a network of digital scholars who
will come together across disciplinary boundaries at the workshop, and who
in the future will advance humanities computing and help define the state
of digital scholarship. For information on the Center for Digital Research
in the Humanities and faculty/staff biographies, see http://cdrh.unl.edu.

The Workshop will supplement its roster by bringing two nationally
recognized senior scholars in digital humanities to Lincoln to participate
and work with the scholars whose work is selected for presentation. This
year, the Workshop coincides with the Department of History's third annual
Pauley Symposium on the topic "History in the Digital Age," a gathering of
top digital historians that will include: Abdul Alkalimat, University of
Toledo; Edward L. Ayers, University of Virginia; Peter Bol, Harvard
University; Alan Liu, University of California, Santa Barbara; John Lutz,
University of Victoria; Patrick Manning, Northeastern University; Mary Beth
Norton, Cornell University; Jan Reiff, University of California, Los
Angeles; Roy Rosenzweig, George Mason University; and Robert Schwartz, Mt.
Holyoke College. Two of these digital humanists will participate in the
Workshop.

Travel, Lodging and Honoraria

The CDRH will pay for travel and lodging expenses and scholars will receive
an honorarium for presenting their work at the Nebraska Digital Workshop.
Workshop participants will also be invited to all of the Pauley Symposium
"History in the Digital Age" events.

Selection Criteria

Applicants are encouraged to submit a three-page narrative abstract for an
approximately 30-minute presentation of their digital project along with
files of, or links to, any digital elements, electronic text, analytical
tools, or multimedia visualizations already created. Applicants who are
earlier in the production phase of their digital project may also submit
descriptive text that explains their plans for such digital materials.

Selection criteria include: the significance of the project in primary
disciplinary field, elements of technical innovation, theoretical and
methodological sophistication, and creativity of approach to the subject.

To Apply

Applicants are asked to send a proposed workshop abstract, curriculum
vitae, and a representative sample of digital work via a URL or disk to
William G. Thomas, III, Chair, Nebraska Digital Workshop Committee, via
email attachment at wgt <at> unl.edu or via surface mail at 615 Oldfather Hall,
UNL, Lincoln NE 68588-0327.

Deadline

The deadline for applications is May 1, 2006.

Young, John T | 11 Apr 12:11 2006

Re: tag spoken sections in fiction corpora

I'm not entirely sure if this is germane to the discussion, but I'm wondering how tagging of this sort would deal with works such as Wuthering Heights, Frankenstein and The Handmaid's Tale, in which virtually the whole work is ostensibly reported speech (or, in the case of The Handmaid's Tale, recorded speech), but is obviously far too well crafted to be anything of the sort.
 
John
John Young
The Newton Project
Imperial College London


Wendell Piez | 11 Apr 16:28 2006

Re: tag spoken sections in fiction corpora

John,

At 06:11 AM 4/11/2006, you wrote:
>I'm not entirely sure if this is germane to the discussion, but I'm 
>wondering how tagging of this sort would deal with works such as 
>Wuthering Heights, Frankenstein and The Handmaid's Tale, in which 
>virtually the whole work is ostensibly reported speech (or, in the 
>case of The Handmaid's Tale, recorded speech), but is obviously far 
>too well crafted to be anything of the sort.

Yes, exactly. The nesting in Frankenstein, for example, goes four 
deep, as the sea captain writes to his sister a narrative recounted 
to him by Frankenstein, who in the midst of his telling, quotes the 
monster (in five chapters), who in turn relates a story containing 
direct quotes (some of them a bit longish). To say nothing of the 
citations from Coleridge, Shelley and Volney.

Chaucer also comes to mind. More recently, there are such interesting 
tours-de-force as Calvino's "If On a Winter's Night a Traveller" and 
Nabokov's "Pale Fire". The latter, indeed, might be marked up (at 
least to start) as a TEI "critical edition". Then we might want to 
annotate the annotations.

Cheers,
Wendell

Branko Collin | 11 Apr 17:36 2006

Re: tag spoken sections in fiction corpora

On 11 Apr 2006 at 10:28, Wendell Piez wrote:
> At 06:11 AM 4/11/2006, you wrote:

> >I'm not entirely sure if this is germane to the discussion, but I'm 
> >wondering how tagging of this sort would deal with works such as 
> >Wuthering Heights, Frankenstein and The Handmaid's Tale, in which 
> >virtually the whole work is ostensibly reported speech (or, in the 
> >case of The Handmaid's Tale, recorded speech), but is obviously far 
> >too well crafted to be anything of the sort.
> 
> Yes, exactly. The nesting in Frankenstein, for example, goes four 
> deep, as the sea captain writes to his sister a narrative recounted 
> to him by Frankenstein, who in the midst of his telling, quotes the 
> monster (in five chapters), who in turn relates a story containing 
> direct quotes (some of them a bit longish). To say nothing of the 
> citations from Coleridge, Shelley and Volney.

The other day, sci-fi author John Scalzi wrote in his blog: 

"The mail today brought me [...] my author copies of the Science 
Fiction Book Club edition of The Ghost Brigades. It looks good, but 
there are a few subtle differences between it and the Tor version of 
the hardcover. For the collectors, here are the major differences: 
The Tor dust jacket has raised letters, while the SFBC version 
doesn't; the SFBC has the author picture from Old Man's War while the 
Tor version has a new picture, and the SFBC version of the book has a 
black cover while the Tor version is blue. Also, the SFBC version has 
a previously deleted paragraph at the end of the final chapter in 
which John Perry wakes up and entire of the book was just a dream, 
the end. Yeah, don't know how that got through."

"I'm kidding about that last paragraph, of course. Or am 
I????!111?!??!"

(<http://www.scalzi.com/whatever/004128.html>)

Talk about critical editions!

(The moral of the story: you cannot trust an author to stick to your 
carefully crafted tagging scheme.)
-- 
branko collin
collin <at> xs4all.nl

Martin Mueller | 11 Apr 18:38 2006

Re: tag spoken sections in fiction corpora

My crude mind was  after much simpler things in my original  
question.  There is a large number of pre-twentieth century novels  
where the distinction between speech and narrative is pretty obvious.  
There is an indeterminate subset of these novels--but probably not a  
trivial number--where this distinction can be inferred fairly  
accurately from typographical layout (though the scripts to extract  
the distinctions would have to be adjusted from author to author or  
publisher to publisher).

You can then retroactively tag speech in those novels, and this would  
give you a 'spoken' corpus as opposed to a 'narrative' corpus. Is  
that of likely interest to anybody, and is there any reason to  
believe that the corpus resulting from these various constraints  
would be 'good enough' for some or many inquiries?
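
A minimal sketch of the kind of script meant here, in Python. It assumes that speech is marked by paired double quotation marks (straight or curly) within a single passage and that everything else is narrative. The function name is hypothetical, and dashes, single quotation marks, or speeches that run across paragraphs would all need separate handling for a given author or publisher.

```python
import re

# Assumption: speech sits between paired double quotation marks,
# either straight ("...") or curly (“...”). Nothing else is recognised.
SPEECH = re.compile(r'"([^"]*)"|“([^”]*)”')

def split_speech_narrative(text):
    """Return (speech_segments, narrative_segments) for one passage."""
    speech, narrative = [], []
    pos = 0
    for match in SPEECH.finditer(text):
        # Everything between the previous quotation and this one is narrative.
        before = text[pos:match.start()].strip()
        if before:
            narrative.append(before)
        # Exactly one of the two groups matched, depending on the quote style.
        quoted = match.group(1) if match.group(1) is not None else match.group(2)
        speech.append(quoted.strip())
        pos = match.end()
    # Whatever follows the last quotation is narrative too.
    tail = text[pos:].strip()
    if tail:
        narrative.append(tail)
    return speech, narrative
```

Run over a passage such as '"Come in," she said. "It is late."', this puts the two quoted stretches in the speech list and the attribution "she said." in the narrative list. An unbalanced quotation mark would silently shift everything after it, which is one reason the output would need spot checks before being called 'good enough'.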

My hunch is that there isn't much interest in this--which is itself a
useful thing to know.

Wendell Piez | 11 Apr 20:49 2006

Re: tag spoken sections in fiction corpora

Dear Martin,

At 12:38 PM 4/11/2006, you wrote:
>My crude mind was  after much simpler things in my original
>question.

Ah....

>   There is a large number of pre-twentieth century novels
>where the distinction between speech and narrative is pretty obvious.
>There is an indeterminate subset of these novels--but probably not a
>trivial number--where this distinction can be inferred fairly
>accurately from typographical layout (though the scripts to extract
>the distinctions would have to be adjusted from author to author or
>publisher to publisher).
>
>You can then retroactively tag speech in those novels, and this would
>give you a 'spoken' corpus as opposed to a 'narrative' corpus. Is
>that of likely interest to anybody, and is there any reason to
>believe that the corpus resulting from these various constraints
>would be 'good enough' for some or many inquiries?

Hm, I guess it would depend on the inquiries.

The first thing I'd find to be of interest would be which works would 
fall into this set, and which works would not. And how "fuzzy" the 
set would be. How would one detect whether a work was in the set? 
What makes the distinction between speech and narrative consistent 
and obvious? Presumably the presence of some markers (say, for 
dialogue) and the absence of others (say, for indirect discourse). 
Even then, one would have to be on the watch for false hits. Some 
kinds of narrative might have conventional markers for dialog, and 
yet pose similar problems for narrative subjectivity as those that 
did not. (I'm thinking of Charlotte Bronte's works, or George 
Eliot's, as possible boundary cases.)

Ease of auto-tagging (which is to say, level of consistency and 
explicitness of "tagging" by layout, typography, and narrative 
convention) might be an interesting marker of some sort of genre ... 
or it might not. It would be particularly interesting to see where 
such works would cluster. But this is meta-analysis, and has nothing 
to do with what you could learn, or not, from such tagging once you had it.

As to the latter, I think this would depend on the genre(s) of works, 
their commonalities (or not) over and above this property of 
consistency, and what kinds of inquiries you might pose relative to 
these properties.

>My hunch is that there isn't much interest in this--which is itself a
>useful thing to know

But there's a difference between the level of interest going in, and 
the potential for discoveries that could be interesting. :-)

Best regards,
Wendell
