Doug Ewell | 1 Sep 01:25

Re: extlang

Randy Presuhn <randy underscore presuhn at mindspring dot com> wrote:

> Fallback has probably been over-sold.

I think at least two things have been oversold.  Fallback is definitely 
one of them.  As I've said so many times, after all the effort and delay 
we went through to generate RFC 4647, after everything we've created 
that might require RFC 3066-style matching and validation engines to be 
modified or rewritten, we are still hooked on the idea that 
remove-from-right truncation is of paramount importance, and that a 
mechanism like extlang that might exhibit unwanted results in some 
RFR-truncation scenarios is thereby fatally flawed.
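
To make the remove-from-right (RFR) concern concrete, here is a minimal
Python sketch of lookup-style fallback in the spirit of RFC 4647.  The
names fallback_chain and lookup are made up for illustration and are not
taken from any RFC or library, and "zh-yue" stands in for the
extlang-style form under discussion.

```python
def fallback_chain(tag):
    """Yield the tag, then each remove-from-right truncation of it."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # Lookup also drops a trailing single-character subtag (an
        # extension or private-use singleton) left at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(language_range, available, default=None):
    """Return the first available tag reached by RFR truncation."""
    by_lowercase = {t.lower(): t for t in available}
    for candidate in fallback_chain(language_range.lower()):
        if candidate in by_lowercase:
            return by_lowercase[candidate]
    return default


if __name__ == "__main__":
    print(list(fallback_chain("zh-yue-Hant-HK")))
    print(lookup("zh-yue", ["zh", "en"]))
    print(lookup("yue", ["zh", "en"]))
```

With this sketch, a request for "zh-yue" falls back to content tagged
"zh", while a request for a bare "yue" finds nothing unless the matcher
knows about macrolanguages.  That difference is the behavior the thread
is arguing about.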

The other thing that has been oversold is the identification of a 
macrolanguage with one of its encompassed languages.  From reading this 
thread, you would think people are making a conscious decision to tag 
Mandarin Chinese, and only Mandarin Chinese, as "zh", and tag other 
flavors of Chinese as "zh-yue" or "zh-hakka" or whatever.  Furthermore, 
you would think people are planning to make the same decision with 
respect to Arabic, explicitly equating "ar" with Standard Arabic and not 
with any other flavor, as soon as RFC 4646bis comes out with its 30-odd 
new Arabic subtags.

I don't think this sort of conscious decision is what drives the current 
tagging scenario.  Instead, consider the following:

1.  Most (not all) of the content that is currently tagged with BCP 47 
language tags tends to be written, not spoken, sung, etc.  This will not 
necessarily be the case forever, but it is likely the case today.

Marion Gunn | 3 Sep 14:11

Re: Re: extlang

On 31 Aug 2007, at 17:26, Doug Ewell wrote:

> ...
> "Micro-" is an obvious opposite of "macro-".  This sort of word- 
> forming process is common among technical-minded people, including  
> much of the LTRU membership (I make up words every day)...

I hope you have now seen it demonstrated (in recent mails) that said  
tendency to "make up words every day" only worsens a problem for which  
the obvious cure is a neat Terms and Definitions section.

On 31 Aug 2007, at 18:26, Doug Ewell wrote:

> ...
> For my part, I understand that "extlang" is not the official name  
> for this type of subtag, and didn't mean to propose defining it  
> formally if you were talking to me.  We use it all the time because  
> it's easier to type 7 letters than 24.

Yet more reason to define such informal terms as you make up.  
After that, you may even propose making them formal, if you so wish.

> ...
> One thing the document does not need is to be longer.  If we do add  
> such a section, we should attempt to strip out the now-redundant  
> inline definitions that exist throughout the document...

Exactly.
mg

Doug Ewell | 4 Sep 00:18

Re: New draft posted to Web

On Saturday, August 18, 2007 10:27, I wrote:

>> Please, if you want the working group to comment, submit the draft to 
>> internet-drafts at ietf.org
>
> Done.

I've written back to internet-drafts to inquire why my draft still 
hasn't been published more than two weeks after being submitted.  I'm 
hoping it's just about ready for WGLC, except of course for the 
extlang/macrolanguage question.

--
Doug Ewell · Fullerton, California, USA · RFC 4645 · UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

John Cowan | 4 Sep 05:46

Re: Re: extlang

Addison Phillips scripsit:

> Thus, if I were to support extlang, it would be based solely on John 
> Cowan's argument that we need extlang to prevent a "retagging crisis" 
> for languages formerly enclosed by a macrolanguage.

Indeed, I think this is the most important point.  RFCs come and go,
but tagging decisions ought not to have to be revisited.  We have already
obsoleted some IETF or semi-IETF tags in favor of ISO ones, but there's
no getting away from that.

> 2. Randy suggested (a long while ago now) that cherry-picking from the 
> macrolanguage list would be a bad idea. Yet tag stability provisions 
> prevent us from taking the list wholesale.

Three macrolanguages with seven encompassed languages, all accounted
for by the same pre-existing principle (stability), hardly counts
as cherry-picking.

> And I have some concern that the macrolanguage "collections" (if you'll
> pardon this inaccurate term) have yet to be thoroughly tested. They
> may not be stable or suitable in the short-to-medium term. This is
> not a critique of ISO 639-3's work here. It is merely a note of caution.

I'm not sure what counts as testing in this case.  Can you explain further?

> If I were to support doing extlangs, I would consider each 
> macrolanguage separately, as a one-time-event, and, again, solely as a 
> compatibility item.

GerardM | 4 Sep 13:20

Re: Re: extlang

> Exactly.  People will continue to use "zh", but those who want to be more
> explicit can use "zh-cmn", just as people who want to be explicit about
> script can use "zh-Hans".  With extlangs, there is a soft choice between
> "zh" and "zh-cmn" rather than a hard choice between "zh" and "cmn".

Hoi,
From my perspective the system is how it is; the codes are mainly to be
used by computers, and it can look as horrible as it does. I do not care
what people do, because that would mean that we accept that they will
continue to have 85% of the Internet not tagged, and most of what is
tagged tagged incorrectly.

My question is about Min Nan, which according to the proposal would be
zh-nan. This language has been written in both the Chinese script and
the Latin script for several centuries. According to my information it
was already written in the Latin script in the 16th century in the
Philippines. The question is this: zh implies one of the Chinese
scripts. Should the code therefore be zh-nan-Latn? Should there not be a
code for the particular orthography as well, and can a code for the
orthography replace the Latn indicator?

NB: this is a practical question, as the Wikipedia in Min-Nan is written
in the Latin script.

Thanks,
     Gerard

Frank Ellermann | 4 Sep 14:24

Re: Wrapping up the UTF-8 debate

Randy Presuhn wrote:

>   2)  The registry file itself currently uses something which is similar
>       to an NCR.  Are we willing to change the registry format to
>          a) use actual NCRs for non-ASCII code points, making conversion
>             to XML even more trivial than it already is, while still
>             giving some fallback to folks inspecting the data for errors
>             or looking at it through ASCII windows
[...]

Hi, the NCRs in the registry are "real" NCRs, using the hex. format
recommended by the W3C "charmod".
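
For readers who have not looked at the file, a hexadecimal NCR has the
form &#xHHHH;.  The following Python sketch shows one way to convert
between such NCRs and the characters they stand for.  The function names
are invented for this illustration, and the sample string is only an
example value, not a quotation from the registry.

```python
import re


def to_hex_ncr(text):
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch))
        for ch in text
    )


def from_hex_ncr(text):
    """Replace every hexadecimal NCR with the character it denotes."""
    return re.sub(
        r"&#x([0-9A-Fa-f]+);",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


if __name__ == "__main__":
    sample = "Norwegian Bokm\u00e5l"      # U+00E5 is a-ring
    escaped = to_hex_ncr(sample)
    print(escaped)
    print(from_hex_ncr(escaped) == sample)
```

Because the escaped form is pure ASCII, it survives "ASCII windows"
unchanged and can be dropped into XML without further conversion.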

Frank

Frank Ellermann | 4 Sep 14:43

Re: New draft posted to Web

Doug Ewell wrote:

> I'm hoping it's just about ready for WGLC, except of course for the 
> extlang/macrolanguage question.

Hi, I've updated my awk script for the "macrolanguage" business:

http://xyzzy.webhop.info/home/ltru/ltru2xml.awk (version 0.7)
http://xyzzy.webhop.info/home/ltru/ltru2xml.dtd (modified DTD)
http://xyzzy.webhop.info/home/ltru/4645bis2.xml (based on -02)

http://xyzzy.webhop.info/home/ltru/4645bis.awk (script to extract
the proto registry from your drafts).

All still using ASCII and text/xml, not UTF-8 and application/xml.

IMHO the "Macrolanguage" field is rather pointless for extlang-s,
it's always the same as the "Prefix" with two exceptions:

1 - sign language extlangs (Prefix: sgn) have no Macrolanguage
2 - in theory, extlang ccc in aa-bbb-ccc would get Prefix: aa-bbb
     and Macrolanguage: bbb.
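
The rule described above is simple enough to sketch in a few lines of
Python.  This is only an illustration of the observation, not part of
the awk script; the function name implied_macrolanguage is made up, and
aa, bbb and ccc are the same placeholder subtags used above.

```python
def implied_macrolanguage(prefix):
    """Return the Macrolanguage implied by an extlang record's Prefix.

    The last subtag of the Prefix is the macrolanguage, except that
    extlangs under "sgn" (sign languages) have none.
    """
    last_subtag = prefix.split("-")[-1].lower()
    if last_subtag == "sgn":
        return None
    return last_subtag


if __name__ == "__main__":
    for prefix in ("zh", "sgn", "aa-bbb"):
        print(prefix, "->", implied_macrolanguage(prefix))
```

If the field can always be derived this way, storing it separately in
every extlang record adds nothing that the Prefix does not already say.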

Frank

Doug Ewell | 4 Sep 15:58

Re: extlang

GerardM <gerard dot meijssen at gmail dot com> wrote:

> My question is about Min Nan, according to what is proposed zh-nan. 
> This language has been written in both the Chinese script and the 
> Latin script for several centuries. According to my information it was 
> already written in the Latin script in the 16th century in the 
> Philippines. The question, zh implies one of the Chinese scripts. 
> Should the code therefore be zh-nan-Latn and, should there not be a 
> code for the particular orthography as well and can a code for the 
> orthography replace the Latn indicator.
>
> NB this is a practical question as the Wikipedia in Min-Nan is written 
> in the Latin script.

My opinion, and mine alone, is that you should be able to use "zh-nan" 
and not worry about the script subtag.  Not every tagging scenario 
requires explicit specification, or even implication, of the script, 
just as not every tagging scenario requires identification of the 
region.

Specifying the script is important when it is necessary to contrast it 
with other scripts.  For example, you might write "zh-nan-Latn" to 
distinguish it from "zh-nan-Hani" if it were necessary to distinguish 
between the two.  The fact that the Min-Nan Wikipedia text is written 
in the Latin script does not necessarily mean this is necessary.

This is one of the things I dislike about Suppress-Script: it tends to 
imply that the script subtag is always important, and is simply being 
omitted in this tag for brevity or compatibility.

If we get rid of extended language subtags, of course, then this is a 
moot point: "nan" can be given a Suppress-Script of "Latn", or can be 
left alone, and neither choice would imply anything about Han.

I see no reason why the Min-Nan *orthography* needs its own subtag, 
unless (again) it needs to be distinguished from Min-Nan written in some 
other Latin orthography.  Thousands of languages use the Latin script 
with their own orthography.  There's nothing unique about Min-Nan in 
this regard.

--
Doug Ewell · Fullerton, California, USA · RFC 4645 · UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages 

John Cowan | 4 Sep 16:39

Re: Re: extlang

Doug Ewell scripsit:

> This is one of the things I dislike about Suppress-Script: it tends to 
> imply that the script subtag is always important, and is simply being 
> omitted in this tag for brevity or compatibility.

Well, we are stuck with it, and as currently written "zh-nan" *means*
"zh-nan-Hani", alas.  So if we go with extlangs, we need to add language
allowing, and giving the interpretation of, the Suppress-Script: header
when attached to an extlang subtag, and explaining that extlangs
don't inherit Suppress-Script: from their parents.

> If we get rid of extended language subtags, of course, then this is a 
> moot point: "nan" can be given a Suppress-Script of "Latn", or can be 
> left alone, and neither choice would imply anything about Han.

Indeed.

-- 
There is / One art                      John Cowan <cowan <at> ccil.org>
No more / No less                       http://www.ccil.org/~cowan
To do / All things
With art- / Lessness                     -- Piet Hein

Doug Ewell | 4 Sep 17:11

Re: Re: extlang

GerardM wrote:

> The confusing thing for me now is that zh implies a script.

I don't necessarily agree with you and John on this, but I'll leave this 
point alone.

> It is the one reason why it is reasonable to include Min Nan as 
> being part of Chinese. Without it the whole argument of zh including 
> all the other languages falls flat on its face for me. So I do not 
> understand how Min Nan in another script can be accepted without 
> identifying the script either through the orthography or through 
> identifying the script.

*IF* we implement extended language subtags -- I don't feel comfortable 
using the term "extlang" any more, since it's not the formal term and 
doesn't appear in any official Terms and Definitions section -- then we 
are implementing them strictly on the basis of macrolanguage assignments 
made by ISO 639-3/RA.  If their decision is that Min Nan "is a" Chinese, 
then that is that; we don't second-guess them and toss out the ones we 
don't agree with.

The only exceptions are (1) when a primary language subtag already 
exists for the language in question, which is *exactly* analogous (IMHO) 
to using alpha-2 ISO 639-1 codes in preference to alpha-3 ISO 639-2 
codes when they are available, and (2) sign languages, which are 
fundamentally different and not really handled in a consistent manner 
between ISO 639-2 and -3.

John Cowan <cowan at ccil dot org> wrote:

>> This is one of the things I dislike about Suppress-Script: it tends 
>> to imply that the script subtag is always important, and is simply 
>> being omitted in this tag for brevity or compatibility.
>
> Well, we are stuck with it, and as currently written "zh-nan" *means* 
> "zh-nan-Hani", alas.  So if we go with extlangs, we need to add 
> language allowing, and giving the interpretation of, the 
> Suppress-Script: header when attached to an extlang subtag, and 
> explaining that extlangs don't inherit Suppress-Script: from their 
> parents.

I prefer "don't necessarily inherit."  We would certainly want extended 
language subtags to inherit Suppress-Script from their Prefix *by 
default*, so that all the Arabics, all the Quechuas, etc. would do the 
Right Thing.  But we would allow this to be overridden at the second 
level.
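
A rough Python sketch of "inherit by default, override at the second
level" might look like this.  Everything here is an assumption made for
illustration: the record layout, the function name, and the sample
values (including a Suppress-Script of Hani on "zh", which follows
John's reading above rather than any registry entry).

```python
def effective_suppress_script(subtag, records):
    """Resolve Suppress-Script, inheriting from the Prefix by default.

    A record with no "suppress_script" key inherits from its Prefix.
    A record with the key set to None overrides the inherited value
    with "no script is suppressed".
    """
    record = records[subtag]
    while True:
        if "suppress_script" in record:
            return record["suppress_script"]
        prefix = record.get("prefix")
        if prefix is None:
            return None
        record = records[prefix.split("-")[-1]]


# Illustrative records only, not real registry data.
RECORDS = {
    "ar": {"suppress_script": "Arab"},
    "arz": {"prefix": "ar"},                          # inherits Arab
    "zh": {"suppress_script": "Hani"},                # hypothetical
    "nan": {"prefix": "zh", "suppress_script": None}, # override
}

if __name__ == "__main__":
    for subtag in ("arz", "zh", "nan"):
        print(subtag, "->", effective_suppress_script(subtag, RECORDS))
```

Under these assumptions the Arabic extlang picks up Arab without
needing its own field, while the Min Nan record opts out, so "zh-nan"
by itself would imply nothing about Han.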

--
Doug Ewell · Fullerton, California, USA · RFC 4645 · UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages
