FYI: upcoming changes to (old)nroff
Thorsten Glaser <tg@...>
2007-01-09 23:51:58 GMT
Hi people,
I'd like to point you to /usr/bin/nrcon. This is a shell script
which basically does the equivalent of
> nroff -B -Tcol "$@" | col
so that manual pages can double-space (and the -B is to switch
french spacing on).
If you also give a parameter (-8 or -e), it does something
more like
> cat "$@:files" | nr8pre [-7] | nroff -B -Tcol "$@:options" | nr8post | col
where -7 is passed when -e is given.
nr8pre does something like this: it reads the input byte by
byte. If a byte is larger than max (255 with -7, 127
otherwise), it is escaped. If the input contains an escape
of the form \N'x', where x is 1-3 decimal digits, it is
escaped as if the character x had been read, no matter
whether x is below max or not. nr8post just reverts the
action. The "col" driver of nroff has been extended by three
escapes and three output characters corresponding to these:
\(88 = width 1, signals to nr8post: start sequence
\(80 = width 0, signals to nr8post: bit is set to 0
\(81 = width 0, signals to nr8post: bit is set to 1
Characters are escaped \(88\(8x\(8x\(8x\(8x\(8x\(8x\(8x\(8x,
where 'x' is the bit of that position.
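A rough sketch of that round trip, in Python rather than in the real tools. The function names mirror nr8pre/nr8post but the code is only an illustration. Assumptions not stated above: bits go most significant first, \N'x' with x above 255 is left alone, and the postprocessor sees the escapes as literal text (the real one reads the driver's output characters, and nroff sits in between).

```python
import re

START, BIT0, BIT1 = r"\(88", r"\(80", r"\(81"

def escape_octet(value):
    # One start marker, then eight bit markers.
    bits = format(value, "08b")  # assumption: most significant bit first
    return START + "".join(BIT1 if b == "1" else BIT0 for b in bits)

# \N'x' with x being 1-3 decimal digits
_NUM = re.compile(rb"\\N'([0-9]{1,3})'")

def nr8pre(data, max_plain=127):
    # max_plain=127 models the default, max_plain=255 models -7.
    out = []
    i = 0
    while i < len(data):
        m = _NUM.match(data, i)
        if m and int(m.group(1)) <= 255:  # assumption: ignore x > 255
            out.append(escape_octet(int(m.group(1))))
            i = m.end()
        elif data[i] > max_plain:
            out.append(escape_octet(data[i]))
            i += 1
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)

_ESC = re.compile(r"\\\(88((?:\\\(8[01]){8})")

def nr8post(text):
    # Turn each start marker plus eight bit markers back into one octet.
    def decode(m):
        bits = m.group(1).replace(r"\(8", "")
        return chr(int(bits, 2))
    return _ESC.sub(decode, text).encode("latin-1")
```

With these definitions, nr8post(nr8pre(b"na\xefve")) gives back b"na\xefve", and a \N'233' in the input comes out as the single octet 0xE9.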
This works well, even for EUC-JP manual pages (with -8),
and \N'xxx' is a GNU groff extension which thus works
as well, but DBCS can't be bolded like x\by\b (where x
and y are the two octets that define a character), so
we've got to post-process _again_ with col -b to strip any
formatting.
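For reference, what col -b does to overstrikes can be sketched roughly as follows: only the last character written to each column survives. This is a simplified model, not col's actual code. It handles a single line, treats every character as one column wide, and ignores col's other control characters; strip_overstrikes is a made-up name.

```python
def strip_overstrikes(line):
    # Keep only the last character written to each column, treating
    # "\b" as "move one column left" (never past column 0).
    cells = []
    col = 0
    for ch in line:
        if ch == "\b":
            col = max(col - 1, 0)
        else:
            if col < len(cells):
                cells[col] = ch   # overwrite what was there
            else:
                cells.append(ch)
            col += 1
    return "".join(cells)
```

So bold "N\bNA\bA" and underlined "_\bN_\bA" both collapse to plain "NA".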
----
I want to change it to support UTF-8 (well, CESU-8) as
well as to perform better, so we can afford to always
pre- and postprocess. (No, it's not feasible to hack
this into nroff, trust me on this one.)
Using pipes, fork'n'exec, etc. of course.
Characters will then be escaped like this:
n*\(88 + m*\(8x, where n is the wcwidth(3) of the
character, and m is 16 for a valid utf-8 input character
and 8 for a random input octet (so this is 8-bit trans-
parent, as needed for EUC-JP, although I'd rather have
users pipe through iconv -f euc-jp -t utf-8 first).
I wonder how to extend \N'xxx' (x in decimal) to Unicode
(BMP, to be exact) codepoints.
Interestingly, while researching this… Bruno Haible has
written a pre-/postprocessor for GNU groff which does
quite the same as ours. It uses \N'xxxxx' (x in decimal)
notation for Unicode codepoints. Well, no reason we
cannot use this too. I think we actually could, already,
if we supported wide chars. It doesn't emit special
output chars though, but uses special \h'-1' directives,
*roff standard, to move backwards, and encodes them as
<Uxxxx> (where the second > is \h'-1' for width-1 chars).
I don't think we're going to adopt this.
It doesn't do hex codepoints though.
I could do \N'Uxxxx' which is sort of compatible, or
tell users to use decimals to stay GNU groff compatible.
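To make the two options concrete, here is a small Python sketch of a parser that accepts both the decimal form and the hex form. The \N'Uxxxx' syntax is only the idea floated above and exists nowhere yet; parse_n_escape is a made-up name, and limiting the result to the BMP is my reading of "BMP, to be exact".

```python
import re

# \N'ddddd' (1-5 decimal digits) or the hypothetical \N'Uxxxx' (4 hex digits)
_N_ESCAPE = re.compile(r"\\N'(?:([0-9]{1,5})|U([0-9A-Fa-f]{4}))'")

def parse_n_escape(text):
    # Return the codepoint named by the escape, or None if 'text' is
    # not such an escape or names something outside the BMP.
    m = _N_ESCAPE.fullmatch(text)
    if m is None:
        return None
    if m.group(1) is not None:
        codepoint = int(m.group(1))        # decimal form
    else:
        codepoint = int(m.group(2), 16)    # hypothetical hex form
    return codepoint if codepoint <= 0xFFFF else None
```

Both parse_n_escape(r"\N'233'") and parse_n_escape(r"\N'U00E9'") yield 233, so the hex form would be a pure addition on top of the decimal one.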
Another todo: add all 1- and 2-octet name chars from
groff_chars(7) to nroff (if 7bit) or the p*processor.
(Possibly, they could be loaded as strings, because I
want the \(xy form, we don't support the \[xy] form.)
I could even map long names (\[xyz] form) in the pre-
processor, though.
Let's see what can be done.
bye,
//mirabile
--
"Using Lynx is like wearing a really good pair of shades: cuts out
the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
-- Henry Nelson, March 1999