Alpo Honkapohja | 30 Jun 13:43 2015

Problem with tagging quotations and linguistic annotation

Dear TEI-list, 

I work for a corpus project called Medieval Latin from Anglo-Saxon Sources at the University of Zurich.

The corpus is based on editions, which typically identify quoted passages in footnotes and set the quoted passage off with either quotation marks or italics. A high priority in compiling our corpus is to keep the quotations separate from the running text, so that someone wanting to carry out a corpus analysis of Anglicisms in the Latin of Byrhtferth of Ramsey will not end up with long stretches of text quoted straight from the Vulgate Bible or the Venerable Bede in their results. 

[…] et de profundis clara uoce huius seculi proclamare, quoadusque benediceret illum pius Christus, ‘qui fecit celum et terram’. [footnote in the edition: quotation from Bible, Ps 123 (124), 8]. 

For the time being, we have been using the following encoding: 

[…] et de profundis clara uoce huius seculi proclamare, quoadusque benediceret illum pius Christus, <cit><quote>qui fecit celum et terram</quote><ref>quotation from Bible, Ps 123 (124), 8</ref></cit>.

However, I have recently been adding <s> tags for sentences, for reasons of citation and as a point of compatibility with the Toronto Dictionary of Old English corpus, which encodes everything as sentences. Since the resource is intended for linguistic research, the plan is eventually to add <w> tags for individual words and POS information produced by an automatic parser/tagger. 

This leads to the problem that <cit>, <quote> or <q> tags are not allowed inside the tags used for linguistic segment categories (whereas <s> tags are allowed inside quotations), so the following is not valid: 

<s>[…] et de profundis clara uoce huius seculi proclamare, quoadusque benediceret illum pius Christus, <cit><quote>qui fecit celum et terram</quote><ref>quotation from Bible, Ps 123 (124), 8</ref></cit>.</s> 

As an interim solution, I have been adding <note> tags around the <cit> tags, but this strikes me as clumsy, and creates occasional nesting problems like this:

Bede said he wanted: ‘to leave the monastery. It was just too hot in the summer.’

** <s>Bede said he wanted: <quote>‘to leave the monastery.</s> <s>It was just too hot in the summer.’</s></quote> 
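
The nesting problem can be checked mechanically. What follows is only a minimal Python sketch using the standard library's xml.etree.ElementTree; the wrapper element <p> and the function name is_well_formed are assumptions made for illustration. It tests well-formedness only, not validity against a TEI schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a <p> element."""
    try:
        ET.fromstring(f"<p>{fragment}</p>")
        return True
    except ET.ParseError:
        return False


# Overlapping version: <quote> opens inside the first <s> but closes after the second.
overlapping = (
    "<s>Bede said he wanted: <quote>to leave the monastery.</s> "
    "<s>It was just too hot in the summer.</s></quote>"
)

# Properly nested version, with both <s> elements inside the <quote>.
nested = (
    "Bede said he wanted: <quote><s>to leave the monastery.</s> "
    "<s>It was just too hot in the summer.</s></quote>"
)

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```

The second fragment parses, but it leaves "Bede said he wanted:" outside any <s>, which is exactly the trade-off described above.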

I would be looking for a solution which would:

- clearly keep the quoted material separate so that anything tagged as quotation can be left out in a corpus search,

- not interfere with the tags used for linguistic annotation, which are of major importance (words inside quotes will be POS-tagged as well),

- be sufficiently ‘one size fits all’ that the same tags could be used for quotations of various lengths, from one word to several sentences to entire paragraphs. We’ve got 300000+ words and hundreds if not thousands of quotations.
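
To make the first requirement concrete, here is a rough Python sketch of the kind of corpus filter it implies. It rests on one assumption that is not a decision taken in this message: that every <w> inside a quotation carries a marker attribute, here a hypothetical ana="#quoted". The sample is un-namespaced for brevity, and the function name is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical word-level encoding: quoted words carry ana="#quoted".
SAMPLE = """<p>
  <s><w>benediceret</w> <w>illum</w> <w>pius</w> <w>Christus</w>
     <w ana="#quoted">qui</w> <w ana="#quoted">fecit</w>
     <w ana="#quoted">celum</w> <w ana="#quoted">et</w>
     <w ana="#quoted">terram</w></s>
</p>"""


def running_text_words(xml_text: str, marker: str = "#quoted") -> list:
    """Return the text of every <w> that is not marked as quoted material."""
    root = ET.fromstring(xml_text)
    return [
        w.text
        for w in root.iter("w")
        # ana may hold several space-separated pointers, so split before testing.
        if marker not in (w.get("ana") or "").split()
    ]


print(running_text_words(SAMPLE))
```

Because the marker sits on each word rather than on a wrapping element, the same filter works whether a quotation is one word long or spans several sentences.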

Thanks in advance!

Best Wishes,

Alpo Honkapohja, post-doc

University of Zurich

B Tommie Usdin | 29 Jun 20:08 2015

[ANN] Symposium on Cultural Heritage Markup adds Short Talks


With the addition of six short talks, the program for the symposium on Cultural Heritage Markup is now

The short talks are: 

 To those who startle at innovation...
 Laura Randall, NCBI/NLM/NIH

 EAGLE and EUROPEANA: architecture problems for aggregation and harmonization
 Pietro Maria Liuzzo, Ruprecht-Karls-Universität Heidelberg

 Encoding western and non-western names for Ancient Syriac authors
 Nathan P. Gibson, Vanderbilt Divinity School 
 Winona Salesky, Independent Consultant for

 Duplicitous Diabolos: Parallel witness encoding in quantitative studies of Coptic manuscripts
 Amir Zeldes, Georgetown University

 Encoding document and text in the Shelley-Godwin Archive
 Raffaele Viglianti, MITH - University of Maryland

 Divide and conquer: can we handle complex markup simply?
 Robin La Fontaine, DeltaXML

Complete program:
Inquiries: info <at>

The symposium on Cultural Heritage Markup precedes Balisage: The Markup Conference 2015. 
Balisage is the XML Geek-fest; the annual gathering of people who design markup and markup-based
applications; who develop XML specifications, standards, and tools; the people who read and write
books about publishing technologies in general and XML in particular; and super-users of XML and related
technologies. You can read about the Balisage 2015 conference at 

** Members of the TEI and employees of TEI members are eligible for 
** discount registration at 
** Balisage 2015 and the pre-conference symposium on Cultural Heritage Markup 

Balisage: The Markup Conference 2015          mailto:info <at>
August 11-14, 2015                   
Preconference Symposium: August 10, 2015               +1 301 315 9631

Syd Bauman | 29 Jun 04:39 2015

Schematron 1.x vs ISO Schematron in Guidelines

 <lg type="spoof" sample="initial">
   <l>Schematron one-point-four, what shall we do?</l>
   <l>Is anyone left who wants to use you?</l>
     Information about the original can be found at
     And yes, that's a somewhat questionable use of @sample.

executive summary
--------- -------
If you use <constraintSpec> in your customization (i.e. ODD) file,
please let us know which category you fit into:

 A. I depend on Schematron 1.x, and either cannot or don't want to
    change to ISO Schematron. Please don't remove support for 1.x.

 B. I don't use Schematron 1.x. Do whatever you want.

 C. I use Schematron 1.x, but have been looking for an excuse to
    change to ISO Schematron. Please remove 1.x.

 D. Why are you even asking about a language that has been outdated
    for 7 years? Just remove 1.x and go on with your lives.

[If you don't want to post to the list, feel free to reply to me

The TEI Guidelines support and use the capability to express formal
constraints above and beyond those expressed in the TEI RELAX NG
schema produced by customization.[1] An example is the constraint
that a @spanTo attribute should point forward in the document, not
backwards. These extra-RELAX NG constraints may currently be
expressed in one of four ways:[2]
 * "schematron":    in the original Schematron language [3]
 * "isoschematron": in the ISO Schematron language [4]
 * "xsl":           in XSLT
 * "private":       in anything else, including non-XML languages
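
To illustrate the kind of constraint meant here, the following is a plain Python stand-in for the @spanTo rule. It is not Schematron and not how the TEI tooling implements the check; the function name and the sample document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<body>
  <delSpan spanTo="#a2"/>
  <anchor xml:id="a2"/>
  <anchor xml:id="a1"/>
  <addSpan spanTo="#a1"/>
</body>"""


def backward_span_targets(xml_text: str) -> list:
    """Return spanTo values that are unresolved or point at or before their own element."""
    elements = list(ET.fromstring(xml_text).iter())  # document order

    # Record the document-order position of every element that has an xml:id.
    position = {}
    for pos, el in enumerate(elements):
        xml_id = el.get(XML_ID)
        if xml_id is not None:
            position[xml_id] = pos

    problems = []
    for pos, el in enumerate(elements):
        target = el.get("spanTo")
        if target is None:
            continue
        target_id = target.lstrip("#")
        # A valid target must exist and come later in the document.
        if target_id not in position or position[target_id] <= pos:
            problems.append(target)
    return problems


print(backward_span_targets(SAMPLE))
```

In the sample, the <delSpan> points forward to a2 and passes, while the <addSpan> points back to a1 and is reported.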

For years, the TEI Guidelines themselves have used nothing other than
"isoschematron", which superseded Schematron 1.6 in 2007.
The TEI Customization processing provided by the TEI (e.g. in Roma,
Byzantium, or the stylesheets built into oXygen) supports
"schematron" and "isoschematron", but not "xsl" (and obviously not
"private"). However, support for "isoschematron" is stronger than
support for "schematron".

The TEI Technical Council is seriously considering dropping support
for the original Schematron language.[5] I am writing to find out if
this would be a significant hardship on anyone in the community.

For users, dropping support for the original Schematron language would
not necessarily mean that they have to change to the ISO Schematron
language. But it does mean they would have to change from the
original Schematron language if and when they wish to process their
ODD file against future versions of the Guidelines.

For the TEI Council, dropping support for the original Schematron
language means less time working on something that not only do we
not use but, as far as we know, no one else uses either. Thus more time
to work on other issues. (And there are a *lot* of other issues. :-)

If we do remove support for the original Schematron language, the
general method would be something like the following.

 1) The value "schematron" would be deprecated -- the documentation
    would indicate that it should be avoided, and those who use it
    would get a warning on validation.

 2) The Council would no longer put significant time into work on
    improving the processing of Schematron 1.x. However, the
    processes would continue to work, and significant bugs might still
    be addressed.

 3) ~1-2 years after steps (1) and (2) start, the value "schematron"
    would be completely removed. The documentation would no longer
    mention that value, and those who use it would get an error on
    validation.

 4) The code that processes the original Schematron language would
    (eventually) be removed.
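
For anyone unsure which category they fall into, a quick scan of the ODD file may help. The Python sketch below assumes that the value is carried on the scheme attribute of <constraintSpec> and that each specification has an ident attribute; the sample ODD fragment and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented ODD fragment with one old-style and one ISO Schematron constraint.
SAMPLE_ODD = """<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="demo">
  <elementSpec ident="div" mode="change">
    <constraintSpec ident="old-rule" scheme="schematron"><constraint/></constraintSpec>
    <constraintSpec ident="new-rule" scheme="isoschematron"><constraint/></constraintSpec>
  </elementSpec>
</schemaSpec>"""


def old_schematron_specs(odd_text: str) -> list:
    """Return the ident of every <constraintSpec> whose scheme is "schematron"."""
    root = ET.fromstring(odd_text)
    return [
        spec.get("ident")
        for spec in root.iter(f"{{{TEI_NS}}}constraintSpec")
        if spec.get("scheme") == "schematron"
    ]


print(old_schematron_specs(SAMPLE_ODD))
```

An empty list would suggest category B; anything else lists the constraints that would trigger the warning in step (1) and the error in step (3).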

[1] Or the DTD or the W3C Schema; but since the constraints expressed
    in these languages are (currently) a subset of those expressed in
    RELAX NG (although not necessarily a proper subset), it is not
    necessary to mention them explicitly. But I just did. :-)
[2] See
[3] Schematron versions 1.3, 1.4, 1.5, or 1.6 to be precise. See,
    for details on the latest version, but note that even this latest
    version has been superseded by ISO Schematron.
[4] <ref target="">
      ISO/IEC 19757 - DSDL Document Schema Definition Language - 
      Part 3: Rule-based validation - Schematron


 Syd Bauman, EMT-Paramedic
 Senior XML Programmer/Analyst
 Northeastern University Women Writers Project
 s.bauman <at> or
 Syd_Bauman <at>

Martin de la Iglesia | 25 Jun 11:36 2015


Dear list,

I am confused about the definition of att.dimensions ("provides 
attributes for describing the size of physical objects") in conjunction 
with its members: while att.dimensions makes sense for elements like 
<space>, <gap> or <width>, I wonder what the "size of physical objects" 
might be in the case of elements such as <age>, <date> or <sex>?

This issue has been touched upon at 
<> already, but I 
wonder how all these elements ended up as members of att.dimensions in 
the first place. I suspect one wanted to have @precision? Or is there 
anyone actually using things like <death quantity="9"> at all? Why not 
simply remove these elements from att.dimensions again?


Martin de la Iglesia
Metadata and Data Conversion

Georg-August-Universität Göttingen
Göttingen State and University Library
D-37073 Göttingen

Papendiek 14 (Historical Building, Room 1.206)
+49 551 39-14070 (Tel.)
+49 551 39-3468 (Fax)

mdelaig <at>

Easterly, Joseph | 24 Jun 22:19 2015

oXygen Content Completion plugin for personography


Have you (or has anyone you know of) developed a plugin for oXygen that allows editors to search
through external XML files (or an XML database) from oXygen's autocomplete / content completion interface?

One use for such a plugin would be personography work with primary source materials: at the point
in the TEI editing process where you add the @ref attribute to a <persName> tag, a
searchable pop-up list would appear, populated with entries from your personography file.
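
The lookup behind such a pop-up list can be sketched independently of oXygen. The Python below is not an oXygen plugin and uses no oXygen API; it only shows the search step, assuming a personography made of <person xml:id="..."> entries that each contain a <persName>. The sample entries and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented personography entries.
PERSONOGRAPHY = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="bede"><persName>Bede</persName></person>
  <person xml:id="byrhtferth"><persName>Byrhtferth of Ramsey</persName></person>
  <person xml:id="benedict"><persName>Benedict Biscop</persName></person>
</listPerson>"""


def suggest_refs(personography_xml: str, typed: str) -> list:
    """Return (ref value, display name) pairs whose name contains the typed text."""
    root = ET.fromstring(personography_xml)
    needle = typed.casefold()
    suggestions = []
    for person in root.iter(f"{{{TEI_NS}}}person"):
        name_el = person.find(f"{{{TEI_NS}}}persName")
        # Flatten any nested name parts into one display string.
        name = "".join(name_el.itertext()).strip() if name_el is not None else ""
        if needle in name.casefold():
            suggestions.append((f"#{person.get(XML_ID)}", name))
    return suggestions


print(suggest_refs(PERSONOGRAPHY, "be"))
```

Each pair gives the value that would go into @ref and the label an editor would see in the list.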

Such a plugin wouldn’t necessarily modify oXygen’s content completion functionality, but I imagine
it would somewhat mimic its behavior.

many thanks,

Joe Easterly
Digital Humanities Librarian
River Campus Libraries, University of Rochester

B Tommie Usdin | 22 Jun 17:41 2015

[ANN] Balisage Program adds Late-breaking News


When the regular (peer-reviewed) part of the Balisage 2015 program was scheduled, a few slots were
reserved for presentation of "Late breaking" material.  These presentations have now been selected and
added to the program:
 - Tomos Hillman on "XSLT Pipelines in XSLT"
 - David A. Lee on "Multi Markup Language - When one markup language 
      just isn’t enough"
 - Chris Maloney, Alf Eaton, & Jeff Beck on "A client-side JATS4R validator using Saxon-CE"
 - David RR Webber on "The Semantics of self-assembling user apps"
 - Joseph Wicentowski & Wolfgang Meier on "Publishing TEI documents 
      with TEI Simple: a case study at the U.S. Department of 
      State’s Office of the Historian"

The program already included case studies from journal publishing, regulatory compliance systems, and
large-scale document systems; formatting XML for print and browser-based print formatting;
visualizing XML structures and documents. Technical papers cover such topics as: MathML; XSLT; use of
XML in government and the humanities; XQuery; design of authoring systems; uses of markup that vary from
poetry to spreadsheets to cyber justice; and hyperdocument link management. The conference will be
preceded by a one-day symposium on Cultural Heritage Markup.

Balisage is the XML Geek-fest; the annual gathering of people who design markup and markup-based
applications; who develop XML specifications, standards, and tools; the people who read and write
books about publishing technologies in general and XML in particular; and super-users of XML and related
technologies. You can read about the Balisage 2015 conference at 

** Members of the TEI are eligible for discount registration at 
** Balisage 2015 and the pre-conference symposium on Cultural Heritage Markup 

Inquiries: info <at>

Balisage: The Markup Conference 2015          mailto:info <at>
August 11-14, 2015                   
Preconference Symposium: August 10, 2015               +1 301 315 9631

DCMI Announce | 21 Jun 23:40 2015

DCMI Webinar: "OpenAIRE Guidelines: Promoting Repositories Interoperability and Supporting Open Access Funder Mandates"

*********** Please excuse the cross postings ***********

OpenAIRE Guidelines: Promoting Repositories Interoperability and Supporting Open Access Funder Mandates
DCMI/ASIST Joint Webinar

:: Time: 10:00am EDT (World Clock: 14:00 UTC)
:: Presenters: Pedro Antonio Príncipe & Jochen Schirrwagen
:: Date: Wednesday, 1 July 2015


The OpenAIRE Guidelines for Data Source Managers provide recommendations and best practices for encoding of bibliographic information in OAI metadata. The Guidelines have adopted established standards for different classes of content providers: (1) Dublin Core for textual publications in institutional and thematic repositories; (2) DataCite Metadata Kernel for research data repositories; and (3) CERIF-XML for Current Research Information Systems.

The aim of these Guidelines is to improve the interoperability of bibliographic information exchange between repositories, e-journals, CRIS and research infrastructures. They are a means to help content providers comply with funders' Open Access policies, e.g. the European Commission Open Access mandate in Horizon2020, and to standardize the syntax and semantics of funder/project information, open access status, and links between publications and datasets. The presenters will provide an overview of the Guidelines, implementation support in major platforms, and tools for validation.


Pedro Príncipe is an information specialist at University of Minho Documentation Services (Portugal) in the Open Access Projects Office. He has worked since 2010 in the OpenAIRE projects and infrastructure, in support, helpdesk and dissemination activities. He is a member of the OpenAIRE guidelines team and co-author of the OpenAIRE guidelines for data source managers.

Jochen Schirrwagen is a research fellow at Bielefeld University Library, Germany. He has worked since 2008 in the knowledge infrastructure projects DRIVER and OpenAIRE in the fields of metadata management, aggregation and contextualization. He is co-author of the OpenAIRE guidelines for data source managers and coordinates their further development.

For more information and to register, visit

Birnbaum, David J | 21 Jun 19:52 2015

CollateX collation workshop at DH2015

There are still openings for additional participants in the CollateX collation workshop to be held on Monday, 29 June 2015 from 9:30 through 4:30 as part of the ADHO DH2015: Global Digital Humanities conference at the University of Western Sydney. The workshop will teach participants how to use the open-source CollateX collation tool to compare witnesses of a text automatically, in a way that can be used to produce critical textual editions and other types of comparative documents. Participants will learn how to prepare source materials in any written script for collation, how to perform automated collation using CollateX, and how to inspect and modify the results. To register for the workshop please follow the link at The workshop web site (still under development) is accessible at, and includes instructions for downloading and installing CollateX prior to the workshop. 

Eric Lease Morgan | 20 Jun 19:44 2015

eebo-tcp workset browser

I have put on GitHub a thing I call the EEBO-TCP Workset Browser. [1] From the README file:

  The EEBO-TCP Workset Browser is a suite of software designed to support
  "distant reading" against the corpus called the Early English Books
  Online - Text Creation Partnership corpus. Using the Browser it is
  possible to: 1) search a "catalog" of the corpus's metadata, 2) create a
  list of identifiers representing a subset of content for study, 3) feed
  the identifiers to a set of files which will mirror the content locally,
  index it, and do some rudimentary analysis outputting a set of HTML
  files, structured data, and graphs. The reader is then expected to
  examine the output more "closely" (all puns intended) using their
  favorite Web browser, text editor, spreadsheet, database, or statistical
  application. The purpose and functionality of this suite is very similar
  to the purpose and functionality of HathiTrust Research Center Workset

[1] EEBO-TCP Workset Browser -

Eric Lease Morgan, Librarian
University of Notre Dame

Benjamin Kiessling | 19 Jun 22:23 2015

Encoding text coordinates


I'm working on OCR of Latin and Greek texts and looking for a more
flexible alternative to the common hOCR format. As our results
eventually get converted to TEI/EpiDoc anyway (and OCR itself could be
described as an epigraphic process) it would be fortunate
if information like bounding boxes for lines, words, and graphemes,
recognition confidences, and script detection could be adequately
represented using already defined TEI primitives. In addition,
representing the output of multiple OCR engines including different
segmentations (word boundaries, columns, ...) would be desirable.
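
For reference, this is the information hOCR currently carries per element, sketched as a small Python parser for the title property of an hOCR word. Only the bbox and x_wconf properties are handled; the function name and the output keys are my own choices for illustration, and nothing here prescribes a TEI representation.

```python
def parse_hocr_title(title: str) -> dict:
    """Extract the bounding box and word confidence from an hOCR title string."""
    props = {}
    # hOCR separates properties with semicolons: "bbox x0 y0 x1 y1; x_wconf 93".
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "bbox" and len(values) == 4:
            # Upper-left and lower-right corners in pixel coordinates.
            props["bbox"] = tuple(int(v) for v in values)
        elif key == "x_wconf" and len(values) == 1:
            props["confidence"] = float(values[0])
    return props


print(parse_hocr_title("bbox 36 92 580 122; x_wconf 93"))
```

Any TEI encoding would need to hold at least these two pieces of information for each line, word, or grapheme, and to do so once per OCR engine.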

I've had a look at the P5 guidelines but couldn't find any
elements/attributes that could be utilized for these purposes without
some extremely creative coercion. So I'm looking for input on how to
achieve a non-contrived encoding of these features.

All Best,

James Cummings | 19 Jun 16:00 2015

Last chance to book! Digital Humanities at Oxford Summer School 2015

Please Forward!
It is your last chance to book for the Digital Humanities at 
Oxford Summer School 2015!
Booking closes on 29 June, and some workshops will be sold out 
before then!

Can't make it to the DHOxSS 2015? Sign up to our announcement 
mailing list for 2016 at

Digital Humanities at Oxford Summer School
20 - 24 July 2015

Scholarship -- Application -- Community

Do you work in the Humanities or support people who do?

Are you interested in how the digital can help your research?

Come and learn from experts with participants from around the
world, from every field and career stage, to develop your
knowledge and acquire new skills

Immerse yourself for a week in one of our 8 workshop strands, and
widen your horizons through the keynote and additional sessions

- An Introduction to Digital Humanities
- Crowdsourcing for Academic, Library and Museum Environments
- Digital Approaches in Medieval and Renaissance Studies
- Digital Musicology
- From Text to Tech
- Humanities Data: Curation, Analysis, Access, and Reuse
- Leveraging the Text Encoding Initiative
- Linked Data for the Humanities

Keynote Speakers:
- Jane Winters, Institute of Historical Research, University of 
- James Loxley, University of Edinburgh

Additional Lectures:
Supplement your chosen workshop with a choice from 9 additional
morning sessions covering a variety of Digital Humanities topics.

Evening Events:
Join us for events every evening, including a research poster and
drinks reception, a guided walking tour of Oxford, the annual TORCH
Digital Humanities lecture, and a dinner at Exeter College.

For more information see:

Directors of DHOxSS,
James Cummings
Pip Willcox


Dr James Cummings,James.Cummings <at>
Academic IT Services, University of Oxford