Al Magary | 1 Sep 05:38 2003

Re: Converting from any HTML to TEI

I introduced myself a couple months ago as a newbie to TEI and
have been trying valiantly to follow the traffic, particularly
this thread as it seems learnable.  Now, in Conal Tuohy's post
(below), he describes a process that begins in MS Word--which,
shall we say, is EZ to understand--and end with a properly TEI
encoded (?) document.  That certainly looks like a useful
lesson--but would it be possible for someone to step through
such a process here and explain what is happening with each
step?

And I just do not understand your last sentence.

My project is a new edition of Hall's Chronicle (1550), a
700,000-word chronicle covering 1399-1547.  There are
practically no commercial opportunities, so I expect that it
will be a self-financed web edition--so, of course, I will have
to do the markup myself.

I understand that TEI-L is devoted to solving problems at the
expert level, but we newbies could sure use some occasional
instruction, sometimes in words of one syllable or less!

Thanks,
Al Magary

----- Original Message -----
From: "Conal Tuohy" <Conal.Tuohy <at> VUW.AC.NZ>
To: <TEI-L <at> LISTSERV.BROWN.EDU>
Sent: Sunday, August 31, 2003 3:51 PM
Subject: Re: Converting from any HTML to TEI

Al Magary | 1 Sep 07:46 2003

Re: Converting from any HTML to TEI

Now *that* is a great post, very friendly and explanatory, and I
thank Conal Tuohy (in distant NZ; I'm laboring away on Labor Day
Weekend in the US) and the TEI-L list for this learning
opportunity.

Cheers,
Al Magary

----- Original Message -----
From: "Conal Tuohy" <Conal.Tuohy <at> vuw.ac.nz>
To: "Al Magary" <al <at> magary.com>; <TEI-L <at> LISTSERV.BROWN.EDU>
Sent: Sunday, August 31, 2003 9:38 PM
Subject: RE: Converting from any HTML to TEI

Al Magary wrote:

> I introduced myself a couple months ago as a newbie to TEI and
> have been trying valiantly to follow the traffic, particularly
> this thread as it seems learnable.  Now, in Conal Tuohy's post
> (below), he describes a process that begins in MS Word--which,
> shall we say, is EZ to understand--and end with a properly TEI
> encoded (?) document.  That certainly looks like a useful
> lesson--but would it be possible for someone to step through
> such a process here and explain what is happening with each
> step?

OK

Step 1: Open the document in MS-Word and save as HTML.

Conal Tuohy | 1 Sep 06:38 2003

Re: Converting from any HTML to TEI

Al Magary wrote:

> I introduced myself a couple months ago as a newbie to TEI and
> have been trying valiantly to follow the traffic, particularly
> this thread as it seems learnable.  Now, in Conal Tuohy's post
> (below), he describes a process that begins in MS Word--which,
> shall we say, is EZ to understand--and end with a properly TEI
> encoded (?) document.  That certainly looks like a useful
> lesson--but would it be possible for someone to step through
> such a process here and explain what is happening with each
> step?

OK

Step 1: Open the document in MS-Word and save as HTML.

Step 2: Convert this document to XHTML (e.g. using JTidy)

Step 3: Run an XSL transformation ("style sheet") to convert this Word-HTML document into more generic
HTML. This deserves more explanation:

Word saves "Heading 1" paragraphs as <h1> elements, and so on, and there are other HTML elements you can
create by applying styles in Word. For instance, if you format a para in Word with the style "HTML Code" then
Word will save this as <code>...</code>. Word has this mapping built in. But there are other mappings you
would have to implement yourself, such as the <abbr> (abbreviation) element, which Word knows nothing
about. You could format an abbreviation with a style you define yourself (in Word) called
"abbreviation", and when Word saves this it will write the following:
<span class=abbreviation>..</span>. So this step would transform such span elements into standard HTML
<abbr> elements.
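
The renaming described in Step 3 can be sketched outside XSLT as well. The following is only an illustration in Python, not the stylesheet used in the pipeline above; it assumes the input has already been through Step 2, so it is well-formed XHTML with quoted attributes. The function name spans_to_abbr and the sample sentence are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def spans_to_abbr(root, class_name="abbreviation"):
    """Rename <span class="abbreviation"> elements to <abbr>, in place."""
    span_tags = ("span", "{%s}span" % XHTML_NS)  # with or without the XHTML namespace
    for el in root.iter():
        if el.tag in span_tags and el.get("class") == class_name:
            # Keep any namespace prefix, swap only the local name.
            el.tag = el.tag[: -len("span")] + "abbr"
            del el.attrib["class"]
    return root


# Hypothetical sample: what a custom "abbreviation" style might look like after Tidy.
sample = '<p>Published by the <span class="abbreviation">TEI</span> consortium.</p>'
root = spans_to_abbr(ET.fromstring(sample))
result = ET.tostring(root, encoding="unicode")
print(result)  # <p>Published by the <abbr>TEI</abbr> consortium.</p>
```

An XSLT template matching the same span elements would do the equivalent job inside a Cocoon pipeline; the Python version is only meant to make the mapping concrete.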

Joel Reungoat | 1 Sep 12:16 2003

Re: Converting from any HTML to TEI

Hi Conal,
This is exactly the idea I had for converting my Word documents into standardized XML.
So far I have only written a few small draft XSLT stylesheets, just to see whether the complete
pipeline could work under Cocoon.
So, if you already have a complete working sample XSL pipeline, I would be
interested in having a look at it. Could you attach it to an email?

Concerning the conversion process, I also have 2 questions:
1- Are you using the Tidy integration which is already packaged in Cocoon,
or are you running JTidy outside of the Cocoon process? Which specific parameters
are you using for Tidy?

2- I want to extract some "properties" of the document which will
constitute some of the metadata of the final XML. Have you implemented
something to extract these "properties"?  I am using Word 2000,
which inserts the properties in an HTML comment as follows:
<!--[if gte mso 9]><xml>
  <o:DocumentProperties>
   <o:Author>KHAFAJI</o:Author>
   <o:Template>Opinions.dot</o:Template>
   <o:LastAuthor>JReungoat</o:LastAuthor>
   <o:Revision>2</o:Revision>
.....
   <o:Pages>27</o:Pages>
   <o:Words>11897</o:Words>
   <o:Characters>67814</o:Characters>
   <o:Version>9.3821</o:Version>
  </o:DocumentProperties>
</xml><![endif]-->
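
One possible way to pull these values out, sketched in Python rather than in the Cocoon/XSLT setting discussed in this thread: read Word's raw HTML output as text and match the o:DocumentProperties block with regular expressions. This assumes the block looks like the fragment above and is still present in the file being read; the function name word_properties is hypothetical.

```python
import re


def word_properties(html):
    """Return {property name: text} from Word's o:DocumentProperties block."""
    block = re.search(
        r"<o:DocumentProperties>(.*?)</o:DocumentProperties>", html, re.S
    )
    if block is None:
        return {}  # no properties block in this file
    # Each property is a simple <o:Name>value</o:Name> pair.
    return dict(re.findall(r"<o:(\w+)>(.*?)</o:\1>", block.group(1), re.S))


# Shortened copy of the fragment quoted above.
sample = """<!--[if gte mso 9]><xml>
  <o:DocumentProperties>
   <o:Author>KHAFAJI</o:Author>
   <o:Template>Opinions.dot</o:Template>
   <o:Pages>27</o:Pages>
  </o:DocumentProperties>
</xml><![endif]-->"""
props = word_properties(sample)
print(props)
```

The resulting dictionary could then be written into the metadata of the final XML by whatever step builds the header.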

Joel Reungoat | 1 Sep 12:22 2003

Re: Converting from any HTML to TEI

Hello Chuck,

As I previously answered somebody else (Conal) on the mailing list, I'm
working on automatic XSLT conversion under the wonderful Cocoon
environment.
But I'm also interested in other solutions, and so in your Word macro.
Could you send it by email, please? Thanks.

I will also have a look at the filters in Open Office (thanks to
Sebastien and Michael for the information), even if that means using
yet another piece of software.

Regards.

>Approved-By: Syd_Bauman <at> BROWN.EDU
>X-Mailer: Mew version 3.1 on Emacs 21.2 / Mule 5.0 (SAKAKI)
>X-Virus-Scanned: by AMaViS
>X-Abuse-Complaints: abuse <at> gol.com
>Date:         Sat, 30 Aug 2003 00:46:42 +0900
>Reply-To: Charles Muller <acmuller <at> GOL.COM>
>Sender: "TEI (Text Encoding Initiative) public discussion list"
><TEI-L <at> listserv.brown.edu>
>From: Charles Muller <acmuller <at> GOL.COM>
>Subject:      Re: Converting from any HTML to TEI
>Comments: To: jreungoat <at> RENNES.JOUVE.FR
>To: TEI-L <at> listserv.brown.edu
>
>Joel wrote
>
> > Could you explain what kind of product or converter software you are using

Eric Frigot | 1 Sep 10:13 2003

Re: Fwd: Converting from any HTML to TEI

On Fri, 29 Aug 2003 11:12:18 +0200, Joel Reungoat
<jreungoat <at> RENNES.JOUVE.FR> wrote:

>Hello Eric,
>
>Could you explain what kind of product or converter software you are using
>to convert Word documents to XML respecting TEI ?

Yes, it is very simple. Have a look at this application:
Majix, at http://www.tetrasix.com/
It is written in Java (you can get the source) and converts any
Word RTF document into any XML format (you need a DTD). You
can drive the process from a Java application. If you need any other
information, ask me.

>Which version of Word documents do you use ?

I have not tested many documents, or complicated ones, but it works with any
version of Word document.

>Is there anybody who has also experiences in such convertion ?
>
>Thanks.
>Joel.

Eric.

Eric Frigot | 1 Sep 10:07 2003

Re: Converting from any HTML to TEI

Conal wrote :
> Hi Eric
>
> I have been doing this for a while: I use MS-Word and save the document
> as HTML. This HTML is as ugly as sin, of course. Then I use JTidy to

Can you do it without human interaction? I mean, saving an HTML document in
Word as a Word document.

> convert it to XHTML, and then various XSLT transformations to produce

I also use JTidy, but i get problems using XSLT, it missed me some of them.

> TEI. First the MS-Word-flavoured HTML is converted to a more standard
> HTML and from there to a simple TEI. The whole conversion process is
> hosted inside Cocoon2 running as a Servlet inside Tomcat. I'm happy to
> share the XSL transforms if you like.

Yes, if you can; I think you have my email: eric.frigot <at> voila.fr
Your process seems strange :
1- Converting any HTML to HTML-word (.html or .htm extension ?)
2- Converting this HTML-word to XHTML using JTidy
3- Using XSLT to transform this XHTML in TEI.

I cannot understand why you first convert the HTML with word ?

>
> Cheers
>
> Con

Michael Beddow | 1 Sep 15:15 2003

Re: Converting from any HTML to TEI

Al Magary wrote:

> I introduced myself a couple months ago as a newbie to TEI and
> have been trying valiantly to follow the traffic, particularly
> this thread as it seems learnable.  Now, in Conal Tuohy's post
> (below), he describes a process that begins in MS Word--which,
> shall we say, is EZ to understand--and end with a properly TEI
> encoded (?) document.

This is indeed useful as a first step if you *have* to start from a Word
document. But it begs the question as to whether, in your particular case as
outlined in your initial postings, you really should be starting from a Word
document, since you are creating an edition ab initio.

You originally wrote about making your initial transcription in a
three-column Word document. Line numbers in l.h. column, diplomatic
transcription in centre, annotations in r.h. column. I took this to mean you
were for the time being simply using the PC as an electronic notebook, with
the actual encoding to follow once you had your transcription and
annotations prepared, which is a common and sensible practice.

Your best next step would be to create your encoded version in a
Windows-compatible dedicated XML editor, cutting and pasting the material
from your Word table as you go. None of the conversion routines so far
described would generate useful XML from your three-column Word working
document without pretty arduous customisation, and a lot of the markup your
editorial aims require is not of the type that can readily be pre-encoded
via named Word styles. To try to encode in Word in your circumstances would
add a big layer of additional complications and difficulties that would soon
outweigh the initial pain of learning a different style of editing software.

Michael Beddow | 1 Sep 17:11 2003

Re: Converting from any HTML to TEI

Eric Frigot wrote:
>
> I also use JTidy, but i get problems using XSLT, it missed me some of
> them.

XSLT processors are among the most closely-scrutinised bits of software in
the wild. If an XSLT transform doesn't work as expected, this is highly
likely to be because of a problem in the XSLT code, not a limitation of the
language or the processor. My first guess would be a namespaces problem: in
particular the handling of default namespaces in XSLT 1.0 has a number of
gotchas. But XSL-L is the place to go...
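
To make the default-namespace gotcha concrete: once Tidy has produced XHTML, every element sits in the namespace http://www.w3.org/1999/xhtml, and in XSLT 1.0 an unprefixed name such as h1 in a match pattern refers to an element in no namespace, so it silently matches nothing. Python's standard ElementTree behaves the same way with unprefixed names, which allows a small runnable illustration (the sample document and the prefix "h" are arbitrary choices for the sketch):

```python
import xml.etree.ElementTree as ET

# Minimal XHTML of the kind Tidy emits: note the default namespace on <html>.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><h1>Title</h1></body></html>"
)
root = ET.fromstring(xhtml)

# Unprefixed name: looks for an h1 in *no* namespace, so nothing is found.
unqualified = root.findall(".//h1")

# Prefixed name bound to the XHTML namespace: finds the heading.
ns = {"h": "http://www.w3.org/1999/xhtml"}
qualified = root.findall(".//h:h1", ns)

print(len(unqualified), len(qualified))  # 0 1
```

In a stylesheet the corresponding fix is to declare a prefix for the XHTML namespace on xsl:stylesheet and write match patterns such as h:h1 instead of h1.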

> Your process seems strange :
> 1- Converting any HTML to HTML-word (.html or .htm extension ?)
> 2- Converting this HTML-word to XHTML using JTidy
> 3- Using XSLT to transform this XHTML in TEI.
>
> I cannot understand why you first convert the HTML with word ?
>

Some of us do just this, because "HTML" can mean and be almost anything,
given the notorious laxity of the way browsers parse. Whereas the HTML that
Word outputs, although verbose, invalid, and generally wild and wacky, has a
certain method in its madness, especially after it's been squeezed through
Dave Raggett's digital mangle, and that allows us to get our XSLT-honed claws
on it.

But that aside, I think it was your original post that assumed HTML as the
starting point. Those of us who suggested you might go via Word or OO were
just trying to address the question you posed as we initially understood it.

David J Birnbaum | 1 Sep 16:39 2003

Attributes vs. elements

Dear TEI-L,

For a discussion of the different syntactic properties of elements and
attributes within the context of a specific project, see:

http://clover.slavic.pitt.edu/~djb/sgml/extreme2000/birn0505.html

Short version: I used elements for authoring to obtain better control over
ordering and repetition and then converted to attributes for publishing to
achieve TEI conformance (where attributes are necessary for interchange
purposes). The lesson I learned was that the best DTD for authoring and
the best DTD for publishing (in an interchange context) may differ.
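
As a rough sketch of the authoring-to-publishing step: XML attributes are unordered and cannot repeat on an element, whereas child elements can be ordered and repeated under DTD control, so one can author with elements and fold them into attributes afterwards. The Python below is only an illustration; the element names (item, n, type, text) and the function children_to_attributes are invented for the example and are not taken from the project described at the URL above.

```python
import xml.etree.ElementTree as ET


def children_to_attributes(el, names):
    """Move the text of the named child elements onto el as attributes."""
    for child in list(el):  # copy the list, since children are removed below
        if child.tag in names:
            el.set(child.tag, (child.text or "").strip())
            el.remove(child)
    return el


# Hypothetical authoring form: metadata carried by child elements.
authoring = ET.fromstring(
    "<item><n>3</n><type>gloss</type><text>Some text</text></item>"
)
converted = children_to_attributes(authoring, {"n", "type"})
print(ET.tostring(converted, encoding="unicode"))
```

The publishing form then carries n and type as attributes of item, while the text child is left untouched.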

Cheers,

David
________

Professor David J. Birnbaum
Department of Slavic Languages and Literatures
1417 Cathedral of Learning
University of Pittsburgh
Pittsburgh, PA 15260 USA
Voice: 1 412 624 5712
Fax: 1 412 624 9714
Email: djbpitt+ <at> pitt.edu
URL: http://clover.slavic.pitt.edu/~djb/

