Donald Moses | 1 Dec 16:35 2008

stylesheet to transform ABBYY XML output to TEI?

Hello:

Is anyone working on, or has anyone completed, a stylesheet that will transform the XML output from ABBYY to TEI?

Thanks,

Donald

Syd Bauman | 1 Dec 18:56 2008

Re: stylesheet to transform ABBYY XML output to TEI?

> Is anyone working on, or has anyone completed, a stylesheet that will
> transform the XML output from ABBYY to TEI?

The short answer from me is "no". But IIRC, ABBYY is a suite of
Windows-only software tools that do all sorts of things. E.g., I
think one component does PDF -> MS Word, and the suite includes OCR.
So I suspect there may be lots of possible kinds of ABBYY
XML output. Can you post (or if it's long, point us to) an example of
the kind of stuff you want to convert?
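For the plain OCR case, the shape of such a conversion can be sketched as follows. This is only an illustration under stated assumptions, not a known ABBYY-to-TEI stylesheet: it assumes the input nests page, par and line elements (the namespace `urn:example:ocr` and the sample document are hypothetical stand-ins), and it uses Python's standard library rather than XSLT. Other ABBYY components may well produce different XML.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical OCR output; element names and namespace are assumptions.
SAMPLE = """<document xmlns="urn:example:ocr">
  <page><block blockType="Text"><text>
    <par><line><formatting>Hello</formatting></line>
         <line><formatting>world</formatting></line></par>
  </text></block></page>
</document>"""


def local(tag):
    """Return an element name without its namespace, so the sketch
    does not depend on a particular schema version."""
    return tag.rsplit("}", 1)[-1]


def ocr_to_tei_body(ocr_xml):
    """Map page -> <pb/>, par -> <p>, and each line break -> <lb/>."""
    body = ET.Element(TEI + "body")
    pages = [e for e in ET.fromstring(ocr_xml).iter() if local(e.tag) == "page"]
    for page_no, page in enumerate(pages, start=1):
        ET.SubElement(body, TEI + "pb", n=str(page_no))
        for par in (e for e in page.iter() if local(e.tag) == "par"):
            p = ET.SubElement(body, TEI + "p")
            lines = ["".join(line.itertext()).strip()
                     for line in par.iter() if local(line.tag) == "line"]
            for i, text in enumerate(lines):
                if i == 0:
                    p.text = text          # first line is the paragraph's text
                else:
                    lb = ET.SubElement(p, TEI + "lb")
                    lb.tail = text         # later lines follow an <lb/>
    return body


body = ocr_to_tei_body(SAMPLE)
print(ET.tostring(body, encoding="unicode"))
```

Matching on local names keeps the sketch independent of the input namespace; a real conversion would need to be checked against an actual sample of the output in question.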

Hugh Cayless | 1 Dec 19:19 2008

Re: stylesheet to transform ABBYY XML output to TEI?

I've heard tell of such a beast.  I think the venue was the 2007  
Chicago Colloquium in Timothy Cole's presentation
(http://dhcs.northwestern.edu/abstracts/ab9.html).
Looking into this is on my agenda, but I haven't had time to
really explore it yet, so I'd be interested to know if you find  
anything.

Best,
Hugh

/**
  * Hugh A. Cayless, Ph.D
  * Head, Research & Development Group
  * Carolina Digital Library and Archives
  * UNC Chapel Hill
  * hcayless <at> email.unc.edu
  */

On Dec 1, 2008, at 10:35 AM, Donald Moses wrote:

> Hello:
> Is anyone working on, or has anyone completed, a stylesheet that will
> transform the XML output from ABBYY to TEI?
> Thanks,
> Donald

Piotr Bański | 2 Dec 11:07 2008

header<->text references for a collection of text samples

Dear All,

I'd love to hear your advice on a problem that I have just been faced
with and that I practically need to have solved by yesterday. It
concerns the relationship between the header and the text in a corpus
file that is an aggregate of multiple (up to around 1000) press
samples coming from a single press title. I am also somewhat uneasy with
regard to the header itself.

The general problem is this: because the <text> contains a series of
<div> elements coming from various sections and various editions of a
single daily, I want to put the corresponding metadata into a series of
statements in the header. Additionally, I want to have bidirectional
links between each <div> and its description.

The texts are taken from another corpus with TEI headers, so I figured
that what I need is, in the <fileDesc>:

<sourceDesc>
  <biblFull>   <!-- containing a near copy of the original header -->
  </biblFull>  <!-- of the newspaper -->

  <list>
    <item xml:id="header-34">
       <ptr target="#div-34"/>
        <bibl>
           <title>[section title]</title>
           <date></date><extent></extent>
        </bibl>
     </item>
  </list>
</sourceDesc>

and in the <text> below, a series of <div>s:

<body>
  ...
  <div xml:id="div-34" decls="#header-34">...</div>
  ...
</body>
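With up to around 1000 samples, the two directions of linking are easy to get out of step. A minimal sketch of a consistency check follows, assuming each <item> holds exactly one <ptr> and each <div> carries a single pointer in @decls; the function name `check_links` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><list>
    <item xml:id="header-34"><ptr target="#div-34"/><bibl/></item>
  </list></sourceDesc></fileDesc></teiHeader>
  <text><body><div xml:id="div-34" decls="#header-34"/></body></text>
</TEI>"""


def frag(ref):
    """Turn a same-document pointer such as '#div-34' into 'div-34'."""
    return ref[1:] if ref and ref.startswith("#") else (ref or "")


def check_links(root):
    """Return a list of messages for links that are not mirrored."""
    # div id -> id of the header item named in @decls
    forward = {d.get(XML_ID): frag(d.get("decls"))
               for d in root.iter(TEI + "div") if d.get(XML_ID)}
    # item id -> id of the div named in its <ptr target="...">
    backward = {}
    for item in root.iter(TEI + "item"):
        ptr = item.find(TEI + "ptr")
        if item.get(XML_ID) and ptr is not None:
            backward[item.get(XML_ID)] = frag(ptr.get("target"))

    problems = []
    for div_id, item_id in forward.items():
        if backward.get(item_id) != div_id:
            problems.append(f"div {div_id}: no item '{item_id}' points back")
    for item_id, div_id in backward.items():
        if forward.get(div_id) != item_id:
            problems.append(f"item {item_id}: div '{div_id}' does not declare it")
    return problems


print(check_links(ET.fromstring(SAMPLE)))
```

An empty list means every <div> and its header <item> point at each other; a mismatch is reported once from each side that notices it.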

My questions are:
1. is the header arrangement of <biblFull>+<list> sensible?

2. is the way I set up the header<->text references acceptable, or have
I abused something? In particular,
2.1. is my use of @decls on the <div> ok?
2.2. is the <ptr> the right way of pointing from the header into the text?

Thanks for your time,

  Piotr

Piotr Banski | 2 Dec 12:30 2008

Re: header<->text references for a collection of text samples

Laurent Romary points out to me a certain vagueness in my original
query, so let me add that the problem occurs in the following context:

text.xml:

<teiCorpus xmlns:xi="http://www.w3.org/2001/XInclude"
           xmlns="http://www.tei-c.org/ns/1.0">
  <xi:include href="corpusHeader.xml"/> <!-- main corpus header -->
  <TEI>
    <xi:include href="teiHeader.xml"/> <!-- this is the header I asked about -->
    <text xml:lang="pl">
      <body>

     (and now a series of <div>s follows, each containing a sample of
       ca. 100 words.)

Because the <div>s are so small, I didn't want to use multiple
<teiCorpus> elements for them, which (apparently) creates the need for
two aligned series of data chunks: metadata in the header and the
corresponding text samples in the <div> elements. Or am I steering away
from best practice here?
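Because the header lives in a separate file, pointers between header and text only become same-document references once the XIncludes are resolved. A minimal sketch of that step follows, using Python's standard `xml.etree.ElementInclude`; the in-memory `FILES` table, the `n` attributes, and the shortened document are hypothetical stand-ins for the real files.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical stand-ins for the two header files on disk.
FILES = {
    "corpusHeader.xml": '<teiHeader xmlns="http://www.tei-c.org/ns/1.0" n="corpus"/>',
    "teiHeader.xml": '<teiHeader xmlns="http://www.tei-c.org/ns/1.0" n="text"/>',
}

TEXT_XML = """<teiCorpus xmlns:xi="http://www.w3.org/2001/XInclude"
           xmlns="http://www.tei-c.org/ns/1.0">
  <xi:include href="corpusHeader.xml"/>
  <TEI>
    <xi:include href="teiHeader.xml"/>
    <text xml:lang="pl"><body><div/></body></text>
  </TEI>
</teiCorpus>"""


def loader(href, parse, encoding=None):
    """Return the parsed element for an included XML resource."""
    if parse != "xml":
        raise ValueError(f"only XML includes are handled here: {href}")
    return ET.fromstring(FILES[href])


def resolve(xml_text):
    """Parse the document and replace each xi:include in place."""
    root = ET.fromstring(xml_text)
    ElementInclude.include(root, loader=loader)
    return root


root = resolve(TEXT_XML)
print(len(list(root.iter(XI_INCLUDE))))
```

With real files, the default loader can be used instead of the custom one; any check of the header<->text links would then run on the resolved tree.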

Thanks, Laurent :-)

  Piotr


Kevin Hawkins | 2 Dec 17:23 2008

review of TEI Tite, a specification for encoding vendors

As you may have heard mentioned in London or at a previous TEI annual 
meeting, there is a customization of TEI called TEI Tite, which is 
designed for use by keyboarding vendors -- that is, commercial firms 
which perform large-scale encoding, often by transcribing and encoding 
from page images of the source document.  It is hoped that a simple 
encoding schema, which not only minimizes keystrokes but also prescribes 
particular elements for particular textual features (rather than 
offering options), will lower the going rates for outside digitization, 
possibly through aggregating content from multiple institutions to 
receive a discount.  TEI Tite is not designed as a final format for 
encoded text: on the contrary, texts received in TEI Tite would likely 
be converted not only to canonical TEI but also might have additional 
tagging added to them.

The ODD, DTD, RELAX NG, and XML Schema have been available on the 
TEI customizations page for a while, and recently a link to the HTML 
version of the documentation was added:

http://www.tei-c.org/Guidelines/Customization/index.xml

At its meeting in London, the SIG on Libraries decided to conduct a 
review of the TEI Tite schema to ensure that it would suit the needs of 
those interested in outsourcing some or all of their encoding.  On 
behalf of the SIG, I am writing to ask that those who are interested in 
outsourcing part or all of their text encoding review the TEI Tite 
documentation and schema for specific changes that would be needed in order 
for your project to adopt TEI Tite for use with vendors.  Background on 
your particular use cases is helpful.

Please either:

a) send comments both to Kevin Hawkins ( 
kevin.s.hawkins <at> ultraslavonic.info ) and to Michelle Dalmau ( 
mdalmau <at> indiana.edu ), or

b) post comments and add to the discussion at 
http://www.tei-c.org/wiki/index.php/Review_of_Tite_Scheme

by January 15, 2009.

Michelle and I will compile the comments, share with the SIG on 
Libraries, and arrange for a conference call to discuss the results, 
toward offering specific recommendations to the TEI Council on future 
revision of TEI Tite.  (Those submitting comments will not be required 
to participate in the conference call but are encouraged to join the 
SIG's email list if they are not already a member and to participate 
when the time comes.)

Please let me know if you have any questions.

Kevin
Co-Convenor of the TEI SIG on Libraries ( 
http://www.tei-c.org/Activities/SIG/Libraries/ )

Gabriel Bodard | 2 Dec 18:49 2008

Re: Layout attributes (columns, etc.)

Peter Boot wrote:
> Could you expand a little on this? How would the @columns and @faces
> attributes interact?
> It seems to me that the faces of a 3d object are like the pages of a
> manuscript, and that columns are subdivisions of pages, or of faces, for
> that matter. Put another way: isn't it the face that has a layout,
> rather than the layout consisting of multiple faces? Or am I
> misunderstanding you completely?

I'm not sure I fully understand your objection, Peter. Limiting <layout> 
to layout on a single (flat) surface seems a little bit 
manuscript-centric. The faces on a 3-D object are not perfectly 
analogous to the pages in a codex (even a folio has two faces, right?). 
In some cases we might treat each face as a separate text instance, but 
in others the relationship of text to face (as text to column) is more 
complex.

I can certainly see some ambiguity in the question of how the @columns 
and @faces attributes relate to one another: are they columns per face, 
or columns total? What if each face has a different number of columns? 
What if lines run across faces, making fewer columns than faces on the 
entire object?

Elli, I wonder if you can give some help with how you would envisage 
using this attribute with USEp materials?

G

-- 
Dr Gabriel BODARD
(Epigrapher & Digital Classicist)

Centre for Computing in the Humanities
King's College London
26-29 Drury Lane
London WC2B 5RL
Email: gabriel.bodard <at> kcl.ac.uk
Tel: +44 (0)20 7848 1388
Fax: +44 (0)20 7848 2980

http://www.digitalclassicist.org/
http://www.currentepigraphy.org/

Peter Boot | 2 Dec 20:30 2008

Re: Layout attributes (columns, etc.)

Gabriel Bodard wrote:
> 
> I'm not sure I fully understand your objection, Peter. Limiting <layout> 
> to layout on a single (flat) surface seems a little bit 
> manuscript-centric. The faces on a 3-D object are not perfectly 
> analogous to the pages in a codex (even a folio has two faces, right?). 
> In some cases we might treat each face as a separate text instance, but 
> in others the relationship of text to face (as text to column) is more 
> complex.

I don't really have objections, it's just that I didn't quite understand 
how you proposed using this. You don't need a layout to record the fact 
that a text continues on another face, I would think. No more so than 
for a text continuing on a next folio (side). But a physical line that 
runs across faces would create a different situation.

Peter

Martin Holmes | 2 Dec 22:08 2008

Roma broken?

Hi folks,

Is Roma broken at the moment? I can get to the first page, but when I 
try to upload an ODD file, nothing happens. Once when I did get it 
uploaded, when I tried to generate schemas of various kinds, I got empty 
files, after a long wait.

Cheers,
Martin
-- 
Martin Holmes
University of Victoria Humanities Computing and Media Centre
(mholmes <at> uvic.ca)
Half-Baked Software, Inc.
(mholmes <at> halfbakedsoftware.com)
martin <at> mholmes.com

Laurent Romary | 2 Dec 22:16 2008

Re: Roma broken?

We have had a problem for some time. I tend to use http://tei.oucs.ox.ac.uk/Roma/
when things get worse, but making the official web site more stable
should be treated as a priority for action.
Laurent

On 2 Dec 08, at 22:08, Martin Holmes wrote:

> Hi folks,
>
> Is Roma broken at the moment? I can get to the first page, but when  
> I try to upload an ODD file, nothing happens. Once when I did get it  
> uploaded, when I tried to generate schemas of various kinds, I got  
> empty files, after a long wait.
>
> Cheers,
> Martin
> -- 
> Martin Holmes
> University of Victoria Humanities Computing and Media Centre
> (mholmes <at> uvic.ca)
> Half-Baked Software, Inc.
> (mholmes <at> halfbakedsoftware.com)
> martin <at> mholmes.com

