Markus Hoenicka | 3 Mar 21:14 2003

ANN: RefDB 0.9.2

Hi all,

This is an announcement of the latest version of RefDB, sent
approximately once a year so as not to clutter the mailing list.

RefDB is a reference manager and bibliography tool for markup
languages. RefDB uses a client/server model to allow easy sharing of
reference data between researchers but it works just fine on a single
workstation. Data are stored in a SQL database server. MySQL and
PostgreSQL are supported as external database servers, but there's a
SQLite-based embedded SQL engine available too. DocBook SGML and XML
as well as TEI XML are supported out of the box, but other document
types can be added as needed. Bibliography and citation styles can be
supplied as XML documents according to publishers'
specifications. Document processing is Makefile-based, so e.g. "make
pdf" is essentially all it takes to turn your TEI document containing
RefDB-compatible citations into a PDF document with formatted citations
and a bibliography.

The RefDB homepage (http://refdb.sourceforge.net) has all the details,
an extensive manual (> 200 pages), a short tutorial targeted at new
users, and a few example documents.

The RefDB sources, along with a starter kit containing the RefDB
sources plus the sources of all required libraries and Perl modules,
are available at http://sourceforge.net/projects/refdb/.

regards,
Markus


Charles Faulhaber | 4 Mar 01:08 2003

Treatment of hyphenation

Is there a standard way to treat hyphenation such that hyphenated words at
the ends of lines can be treated for computational purposes as the
equivalent of non-hyphenated words within a line?

Many thanks,

Charles Faulhaber       The Bancroft Library    UC Berkeley, CA 94720-6000
(510) 642-3782          FAX (510) 642-7589    cfaulhab <at> library.berkeley.edu

Jesper Overgaard Nielsen | 4 Mar 10:59 2003

Re: Treatment of hyphenation

On Tue, 2003-03-04 at 01:08, Charles Faulhaber wrote:
> Is there a standard way to treat hyphenation such that hyphenated words at
> the ends of lines can be treated for computational purposes as the
> equivalent of non-hyphenated words within a line?

If you want to preserve information on linebreaks and hyphenation, I
would suggest encoding a linebreak with <lb/>, but only preserve the
hyphen if it is a "hard hyphen", e.g.:

        "Is there a stan<lb/>dard way to treat hyphenation....."
but
        "... as the equivalent of non-<lb/>hyphenated words...."

If you make sure to include a space before the <lb/> if it occurs
between two words, like

        "... within a <lb/>line?"

it would be easy to write a stylesheet (XSLT) to add the hyphens later,
if needed for presentation.

I think the same applies to page breaks and the <pb> element.
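
A minimal sketch of the <lb/> convention described above, written in
Python rather than XSLT and working on plain strings instead of a parsed
TEI document. The function names are hypothetical. It assumes every line
break is encoded exactly as <lb/>, with a space before it only when it
falls between two words.

```python
import re

LB = re.compile(r"<lb\s*/>")

def text_for_tokenizing(text):
    # Under the convention, deleting every <lb/> rejoins broken words
    # and leaves hard hyphens and inter-word spaces untouched.
    return LB.sub("", text)

def text_for_display(text):
    # A break directly after a letter or digit is a soft break: add a hyphen.
    out = re.sub(r"(?<=\w)<lb\s*/>", "-\n", text)
    # Any remaining break follows a space or a hard hyphen: newline only.
    return re.sub(r"\s*<lb\s*/>", "\n", out)
```

With this, "stan<lb/>dard" tokenizes as "standard" and displays with a
hyphen at the line end, while "non-<lb/>hyphenated" keeps its single
hard hyphen in both views.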

--
Jesper Overgaard Nielsen                       email: slajon <at> hum.au.dk
Slavic Department                                   Phone: 89 42 64 79
Institute of History and Area Studies                 Fax: 89 42 64 65
University of Aarhus, Jens Chr. Skous Vej 5, DK-8000 Aarhus C, Denmark


Burnard Towers | 4 Mar 11:23 2003

Re: Treatment of hyphenation

Several ways have been used:

1. Use an entity reference such as &rehy; (record-end-hyphen). Tokenizing
software can be instructed to expand this as a null string, while formatting
software renders it as a hyphen and line break.

2. Use the rend attribute of <lb/> to record that the word is hyphenated
across the line break, e.g. <lb rend="hyphenated"/>

3. Use some ad hoc inline code such as +- rather than -

4. Use the appropriate Unicode character (U+00AD, SOFT HYPHEN)

Of these, only the last can properly be said to be standard.
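
As a rough illustration of option 4, the Python sketch below treats
U+00AD differently for tokenizing and for formatting. It assumes the
transcription places the soft hyphen at the line break, optionally
followed by whitespace; the function names are hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"
# A soft hyphen plus any whitespace (such as the newline) that follows it.
SOFT_BREAK = re.compile(SOFT_HYPHEN + r"\s*")

def tokenize(text):
    # Tokenizing view: the soft hyphen and the break vanish, rejoining the word.
    return SOFT_BREAK.sub("", text).split()

def render(text):
    # Formatting view: show a visible hyphen followed by a line break.
    return SOFT_BREAK.sub("-\n", text)
```

Note that this says nothing about whether a given end-of-line hyphen
really is discretionary; it only processes the decision once made.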

At the risk of being tiresome, I'd just like to point out that deciding
whether or not an end-of-line hyphen can actually be regarded as
discretionary may be a nontrivial exercise. Since I've just used
"nontrivial" (rather than "non-trivial"), you might assume that this
spelling was characteristic of my idiolect (or ortholect anyway). But
supposing it was at the end of a line "non-<lb/>trivial"?

Lou

> -----Original Message-----
> From: TEI (Text Encoding Initiative) public discussion list
> [mailto:TEI-L <at> LISTSERV.BROWN.EDU]On Behalf Of Charles Faulhaber
> Sent: 04 March 2003 00:09
> To: TEI-L <at> LISTSERV.BROWN.EDU
> Subject: Treatment of hyphenation

Greg Murray | 4 Mar 13:07 2003

Re: Treatment of hyphenation

Not sure there's a "standard," but if you preserve both forms using

  <reg orig="ex-ample">example</reg>

or its converse

  <orig reg="example">ex-ample</orig>

you may be able to have your cake and eat it too.
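
A minimal Python sketch of how both readings could be pulled from such
markup with the standard library's ElementTree. It assumes un-namespaced
elements and a hypothetical helper name, reading(); a real TEI document
would need namespace handling.

```python
import xml.etree.ElementTree as ET

def reading(elem, original=False):
    """Return the text of elem in its regularized or original form."""
    parts = [elem.text or ""]
    for child in elem:
        swap = None
        if child.tag == "reg" and original:
            swap = child.get("orig")   # <reg orig="...">: original is in the attribute
        elif child.tag == "orig" and not original:
            swap = child.get("reg")    # <orig reg="...">: regularized is in the attribute
        # Fall back to the element's own content when no attribute applies.
        parts.append(swap if swap is not None else reading(child, original))
        parts.append(child.tail or "")
    return "".join(parts)
```

Calling reading(p) gives the form suited to searching and tokenizing;
reading(p, original=True) gives the form as printed.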

Greg Murray
XML/Text Programmer
Digital Library Production Services
University of Virginia Library
murray <at> virginia.edu

On Mon, 3 Mar 2003, Charles Faulhaber wrote:

> Is there a standard way to treat hyphenation such that hyphenated words at
> the ends of lines can be treated for computational purposes as the
> equivalent of non-hyphenated words within a line?
>
> Many thanks,
>
> Charles Faulhaber       The Bancroft Library    UC Berkeley, CA 94720-6000
> (510) 642-3782          FAX (510) 642-7589    cfaulhab <at> library.berkeley.edu
>

Chris Emery | 4 Mar 14:39 2003

Re: Treatment of hyphenation

As hyphenation is an accident of typesetting, why would one want to capture
it? It has no textual significance. Just puzzled.

Best
C

Lou Burnard | 4 Mar 15:51 2003

Re: Treatment of hyphenation

As do catchwords, hyphenation sometimes captures interesting linguistic
information about morphology. In older printed texts, it may be of value
for that reason. I suppose.

-----Original Message-----
From: TEI (Text Encoding Initiative) public discussion list
[mailto:TEI-L <at> LISTSERV.BROWN.EDU] On Behalf Of Chris Emery
Sent: 04 March 2003 13:39
To: TEI-L <at> LISTSERV.BROWN.EDU
Subject: Re: Treatment of hyphenation

As hyphenation is an accident of typesetting, why would one want to
capture it? It has no textual significance. Just puzzled.

Best
C

BODARD Gabriel | 4 Mar 15:44 2003

Re: Treatment of hyphenation

This is pretty much the system we have used to encode physical line
breaks on a stone inscription if they happen to be mid-word. It is
possible that some XML and/or XSL environments will have trouble telling
whether or not there is a space before the <lb/> tag, so we use an
attribute <lb type="worddiv"/> to indicate that the line break is
word-dividing, so that the letters to either side can be treated as a
single lexical word, a hyphen can be introduced if desired, etc.
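
A minimal Python sketch of how the attribute can drive word building,
using the standard library's ElementTree. The function names are
hypothetical, elements are assumed to be un-namespaced, and a plain
<lb/> is assumed to separate words.

```python
import xml.etree.ElementTree as ET

def lexical_text(elem):
    """Flatten elem to text, joining words split by <lb type="worddiv"/>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "lb":
            # A word-dividing break contributes nothing; any other break
            # is treated as a word separator regardless of nearby spaces.
            if child.get("type") != "worddiv":
                parts.append(" ")
        else:
            parts.append(lexical_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def lexical_words(xml_string):
    return lexical_text(ET.fromstring(xml_string)).split()
```

Because the decision rests on the attribute, the result does not depend
on whether whitespace around the tag survived processing.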

HTH,

Gabriel.

On Tue, 4 Mar 2003, Jesper Overgaard Nielsen wrote:

> If you want to preserve information on linebreaks and hyphenation, I
> would suggest encoding a linebreak with <lb/>, but only preserve the
> hyphen if it is a "hard hyphen", e.g.:
>
>         "Is there a stan<lb/>dard way to treat hyphenation....."
> but
>         "... as the equivalent of non-<lb/>hyphenated words...."
>
> If you make sure to include a space before the <lb/> if it occurs
> between two words, like
>
>         "... within a <lb/>line?"
>
> it would be easy to write a stylesheet (XSLT) to add the hyphens later,
> if needed for presentation.


Syd Bauman | 4 Mar 15:04 2003

Re: Treatment of hyphenation

All Lou's suggestions should work (as should Jesper Overgaard
Nielsen's and Greg Murray's); however, I feel compelled to improve on
one method, deprecate another, and join Lou in warning that
differentiating between the two kinds of hyphens can be quite
difficult.

For Lou's solution #1, there is an ISO standard entity name, 'shy'
for "soft hyphen"[1]. In the XML world this is always[2] mapped to
U+00AD, the Unicode code-point for soft hyphen. I would say that this
could be properly said to be a standard (if not the standard)
solution. However, I understand it is likely to lose that status in
the future, as there is much of the XML world out there that doesn't
handle general entities well if at all.

I *really* dislike Lou's #3, an ad-hoc code. (I bet he does, too :-)

Differentiating soft from hard hyphens can be quite the trick. One
(automated) method that has trickled through my brain (the brainchild
of linguist Jacque Russom, if I recall correctly) but I have never
coded is as follows.

* Initially encode the file with "-" for all hyphens and "<lb/>" for
  all line-breaks.

* Read the file in and build a word-list of all words that do *not*
  have "-<lb/>" in 'em (should use clever TEI parsing software that
  knows how to build words from TEI-encoded text, e.g. to extract
  "duck" from "du<sic corr='ck'>kc</sic>"). Note that you might
  want to create the word-list from the single file you are operating
  on, from the corpus of all your texts, or (as we at the WWP would
[...]

Peter Boot | 4 Mar 18:54 2003

Re: Treatment of hyphenation

 > For Lou's solution #1, there is an ISO standard entity name, 'shy'
 > for "soft hyphen"[1]. In the XML world this is always[2] mapped to
 > U+00AD, the Unicode code-point for soft hyphen.

I have often wondered about this. I always understood &shy; to have the
meaning which Lou and Syd attribute to it, and I set out using it to
record the hyphen at end-of-line word-breaks.

But then, the HTML 4.01 specification says (at
http://www.w3.org/TR/html401/struct/text.html#h-9.3.3):

<quote>
In HTML, there are two types of hyphens: the plain hyphen and the soft
hyphen. The plain hyphen should be interpreted by a user agent as just
another character. The soft hyphen tells the user agent where a line
break can occur.

Those browsers that interpret soft hyphens must observe the following
semantics: If a line is broken at a soft hyphen, a hyphen character must
be displayed at the end of the first line. If a line is not broken at a
soft hyphen, the user agent must not display a hyphen character. For
operations such as searching and sorting, the soft hyphen should always
be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or
&#x2D;). The soft hyphen is represented by the character entity
reference &shy; (&#173; or &#xAD;)
</quote>
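
To make the quoted rule concrete, here is a small Python sketch using
the standard library's html module. It shows that the three references
named in the quotation resolve to the same character, and how a search
key might ignore it. The helper name search_key is hypothetical.

```python
import html

SOFT_HYPHEN = "\u00ad"

# The named and numeric references all resolve to U+00AD.
RESOLVED = [html.unescape(ref) for ref in ("&shy;", "&#173;", "&#xAD;")]

def search_key(text):
    # Resolve character references, then drop soft hyphens so that
    # searching and sorting ignore them; plain hyphens are kept.
    return html.unescape(text).replace(SOFT_HYPHEN, "")
```

Here search_key("non&shy;trivial") and search_key("nontrivial") give the
same key, while "non-trivial" with a plain hyphen stays distinct.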

There is a difference between using a soft hyphen to record where this
[...]

