Paul F. Schaffner <pfs-listmail <at> UMICH.EDU>
2011-01-05 23:52:20 GMT
In the interest (and at the risk) of 'exposing' our practice
with regard to trivial yet vexing issues of text-capture,
I reproduce the following internal message from this morning,
for the benefit of those who like to chew on such things early
in the new year.
Has anyone dealt with similar issues differently?
I've been thinking a little about number-separators, and how most
practically and consistently to capture them with minimal offense
to Unicode, existing practice, and utility.
The immediate inspiration here was noticing a few books
that use the high comma (aka the closing curvy quote mark or apostrophe)
as a thousands separator, as they do nowadays in Switzerland and
perhaps elsewhere, e.g. 1'234,56, and one book that used the same
apostrophe as thousands separator, miles/furlongs separator, and
decimal separator indifferently. In the texts as they've been keyed
this has been captured sometimes as the straight apostrophe/single quote
('), and sometimes as the minutes/prime sign. And probably as other
things as well that I have failed to notice.
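For anyone wanting to check their own files for the "other things" I may have missed, here is a minimal sketch (not something we actually run). It tallies every punctuation or symbol character found sandwiched between two digits and reports it by code point. The function name tally_separators and the sample string are invented for illustration.

```python
import re
import unicodedata
from collections import Counter

# One character that is neither a word character nor whitespace,
# with a digit immediately before it and a digit immediately after it.
SEPARATOR = re.compile(r"(?<=\d)([^\w\s])(?=\d)")

def tally_separators(text):
    """Count each character found between two digits, labelled by code point and name."""
    counts = Counter()
    for match in SEPARATOR.finditer(text):
        ch = match.group(1)
        label = "U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNNAMED"))
        counts[label] += 1
    return counts

# Invented sample: the same number keyed with a straight apostrophe,
# a right single quotation mark, and a prime sign.
sample = "1'234,56 and 1\u2019234,56 and 1\u2032234,56"
for label, n in tally_separators(sample).most_common():
    print(n, label)
```

Run over a real corpus, a tally like this would show at a glance whether one printed mark has been keyed three or four different ways.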
Unicode itself is relatively silent on the subject, though what it
does say is as usual slightly inconsistent.
About *decimal* separators, it says (I think) that when the full stop
is used as a decimal separator, it should be captured as full stop
(period), and that when a comma serves as decimal separator it should
be captured as a comma, but that when the mid-dot serves as a decimal
separator, it should be regarded as a glyph variant of full stop.
I.e., there is no character 'decimal separator': this is simply a
use of the ordinary punctuation marks (except for middot, which in
this case, but not others, is regarded as a glyph variant of full stop).
It says nothing about what to do when other characters are used
in this role, but I infer that they too should be regarded as special
uses of existing characters--assuming that the characters exist
in their own right.
About *thousands* separators, it says nothing.
About other *non-decimal* separators e.g. shillings/pence or
miles'furlongs or volume:page it says (I believe) nothing, except
that the ordinary virgule (SOLIDUS or slash) should be used
to capture the ordinary virgule-like shillings/pence separator ("3/6").
Since in our case there is no likelihood of these separators being
processed in any mathematical way (i.e., the strings are unlikely
ever to be parsed as number values), but there is some likelihood
of someone's wanting to search for odd uses of punctuation, I think
we should in general:
(1) reserve the prime sign for its intended purpose: minutes (of arc or
of the hour) and feet (as a unit of measure), i.e. use it only when
it marks one of those specific units of measure, not when it acts
as a generic separator. And likewise for any other
semantically-weighted characters that might serve in such a role.
(2) adopt the Unicode recommendations with respect to decimal periods
and commas--i.e. capture them as ordinary periods and commas.
(3) adopt the Unicode recommendation with respect to the shilling/pence
virgule--i.e. capture it as an ordinary virgule (/).
(4) ignore the Unicode recommendation with respect to the decimal
middot--i.e. capture this as a middot (U+00B7), not as a full stop.
(5) follow the spirit of Unicode usage with respect to other lightly
loaded characters used as separators--i.e. capture them as ordinary
punctuation characters, preferring a more formally precise
option if there are several, e.g. if the right-single-quote
or left-single-quote is used as a decimal-, or furlongs-, or
thousands-separator, capture it as ’ (U+2019) etc., regardless
of which function it is serving.
(6) if a novel glyph is used or a heavily-weighted character is misused
in the separator role, invent a new PUA character. We have already
done this in the case of the American (?) elongated s-shaped
shilling separator (in order to avoid confusing it with
ordinary tall-s, which we are apt to convert to round-s); of
the "L"-shaped decimal separator; and (?perhaps simply a glyph variant
of the "L") of the decimal separator that resembles a mirror-image
comma or small subscripted letter "c". Only in this instance (6)
will we indicate the semantics of the character, i.e. as separator,
and maybe not always then.
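The six rules above can be sketched as a small lookup, purely as an illustration of one reading of them and not as a description of any tool we use. Everything named here is hypothetical: the function capture, the role labels, and above all the PUA code point U+E000, which is a placeholder and not one of the characters we have actually assigned. Mapping the straight apostrophe to U+2019 is my reading of "preferring a more formally precise option", and sending a prime used as a generic separator to the PUA is my reading of rules (1) and (6) together.

```python
PRIME = "\u2032"
PUA_PLACEHOLDER = "\uE000"  # hypothetical stand-in for a project-assigned PUA character

# Rule 1: the only roles in which the prime sign is keyed as a prime.
PRIME_UNITS = {"arc-minutes", "minutes", "feet"}

# Rules 2-5: ordinary punctuation is keyed as itself, whatever role it plays.
ORDINARY = {".", ",", "/", "\u00B7", "\u2018", "\u2019"}

# Rule 5 (assumed reading): prefer the more precise character over the typewriter one.
MORE_PRECISE = {"'": "\u2019"}

def capture(glyph, role):
    """Return the character to key for a printed mark serving in the given role."""
    glyph = MORE_PRECISE.get(glyph, glyph)
    if glyph == PRIME:
        # Rule 1 for real units; otherwise a weighted character misused (rule 6).
        return PRIME if role in PRIME_UNITS else PUA_PLACEHOLDER
    if glyph in ORDINARY:
        return glyph
    # Rule 6: novel glyphs, e.g. a long-s-shaped shilling mark.
    return PUA_PLACEHOLDER

examples = [("\u2019", "thousands"), ("\u00B7", "decimal"),
            ("\u2032", "feet"), ("\u017F", "shillings")]
for glyph, role in examples:
    print("U+%04X as %s -> U+%04X" % (ord(glyph), role, ord(capture(glyph, role))))
```

Note that the role matters only for the prime sign; for everything else the shape of the mark alone decides, which is the point of rules (2) through (5).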
Some examples at http://www.lib.umich.edu/tcp/docs/dox/separators.html
Paul Schaffner | PFSchaffner <at> umich.edu | http://www.umich.edu/~pfs/
316-C Hatcher Library N, Univ. of Michigan, Ann Arbor MI 48109-1190