Avram Lyon | 12 Feb 23:24 2011

Fwd: Tagging transliterations from a specific script

Dear IETF-Languages,

I have a set of data available in several forms: Tatar, written in the
Arabic script (tt-Arab); Tatar, written in the Cyrillic script
(tt-Cyrl); transliteration of that same text into Latin script. The
original text is in tt-Arab, so the transliteration (since it follows
ALA-LC 1997) should certainly be tagged tt-alalc97. That tag, however,
is precisely what we'd use for a transliteration using the ALA-LC
system from tt-Cyrl as well. Thus, there's no way to distinguish
between the two very different representations of the same text (i.e.,
the ALA-LC system is defined for Arabic scripts and for Cyrillic
scripts, but the systems lead to very different representations).

The real-world case where this arises is in the multilingual version
of Zotero, the bibliographic data management software. There, we're
allowing the entry of alternate representations of key fields using
any valid language tag, which has been great so far. But now we can't
represent this distinction; it would be something like
*tt-Arab-alalc97, but subtags aren't supposed to override one another,
just refine each other.

I think it might be appropriate to introduce a variant subtag for
Tatar in the Arabic script, which was used until the introduction of
Janalif in 1927-1928 (tt-Latn, tt-baku1926), but I'd be glad to hear
other options for distinguishing these data.

Regards,

Avram

CE Whitehead | 14 Feb 00:22 2011

Re: Tagging transliterations from a specific script

Hi, Avram:
Avram Lyon ajlyon at ucla.edu
Sat Feb 12 23:24:44 CET 2011

> Dear IETF-Languages,
> I have a set of data available in several forms: Tatar, written in the
> Arabic script (tt-Arab); Tatar, written in the Cyrillic script
> (tt-Cyrl); transliteration of that same text into Latin script. The
> original text is in tt-Arab, so the transliteration (since it follows
> ALA-LC 1997) should certainly be tagged tt-alalc97. That tag, however,
> is precisely what we'd use for a transliteration using the ALA-LC
> system from tt-Cyrl as well. Thus, there's no way to distinguish
> between the two very different representations of the same text (i.e.,
> the ALA-LC system is defined for Arabic scripts and for Cyrillic
> scripts, but the systems lead to very different representations).
> The real-world case where this arises is in the multilingual version
> of Zotero, the bibliographic data management software. There, we're
> allowing the entry of alternate representations of key fields using
> any valid language tag, which has been great so far. But now we can't
> represent this distinction; it would be something like
> *tt-Arab-alalc97, but subtags aren't supposed to override one another,
> just refine each other.
If it's for a transliteration into Latin script then how would you tag it tt-Arab . . . ?
(I'm sorry to ask a dumb question.)
> I think it might be appropriate to introduce a variant subtag for
> Tatar in the Arabic script, which was used until the introduction of
You mean a variant to indicate the Romanization of Tatar that was originally written in the Arabic script . . . ?
> Janalif in 1927-1928 (tt-Latn, tt-baku1926), but I'd be glad to hear
> other options for distinguishing these data.
> Regards,
> Avram
 
 
I'm not sure that I completely understand the request (my apologies). 
Another option is to use metadata and certainly perhaps the text date would provide a clue as to the original script (if that's what you are asking for:  a way to distinguish the original script).
However I personally have no objection to having two variants indicating two distinct ala-lc romanizations,
but I hope we will hear from a few others regarding this matter (I am not the expert in ala-lc romanizations).
 
In any case
[alalc97] is not just for Tatar, is it? (let me know if it is)
 
See:
http://www.iana.org/assignments/lang-subtags-templates/alalc97
 
 
So would other Romanizations from Arabic script (from other languages) fit into your scheme? 
 
Best,
 
--C. E. Whitehead
cewcathar <at> hotmail.com
_______________________________________________
Ietf-languages mailing list
Ietf-languages <at> alvestrand.no
http://www.alvestrand.no/mailman/listinfo/ietf-languages
Avram Lyon | 14 Feb 08:45 2011

Tagging transliterations from a specific script

[Re-sending, since I mistakenly sent this only to CE Whitehead last time.]

2011/2/14 CE Whitehead <cewcathar <at> hotmail.com>:
> If it's for a transliteration into Latin script then how would you tag it
> tt-Arab . . . ?
> (I'm sorry to ask a dumb question.)

That's my point, really. It can't be tt-Arab-alalc97, because that
would make no sense. But it does matter that this is the ALA-LC
romanization from Tatar in the Arabic script, and not from Tatar in
the modern Cyrillic script.

> I'm not sure that I completely understand the request (my apologies).
> Another option is to use metadata and certainly perhaps the text date would
> provide a clue as to the original script (if that's what you are asking
> for:  a way to distinguish the original script).
> However I personally have no objection to having two variants indicating two
> distinct ala-lc romanizations,
> but I hope we will hear from a few others regarding this matter (I am not
> the expert in ala-lc romanizations).

I know the original script, and indeed it's pretty obvious from
looking at the romanized text which source script was used for the
romanization. But I am looking for a way to tag it, since the tagged
text has to be used by bibliographic software that is supposed to
choose the form of the text that specific citation style guides
require.
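
To make the selection problem concrete, here is a minimal Python sketch of tag-keyed lookup. The field layout, the pick_form helper, and the placeholder strings are all hypothetical; this is not Zotero's actual data model. It shows that with only "tt-alalc97" available there is a single slot for the romanized form, whichever source script it came from.

```python
def pick_form(forms, preferred_tags):
    """Return (tag, text) for the first preferred tag present in forms.

    Tags are compared case-insensitively, since language tags are
    not case-sensitive. Returns None when nothing matches.
    """
    by_tag = {tag.lower(): (tag, text) for tag, text in forms.items()}
    for wanted in preferred_tags:
        hit = by_tag.get(wanted.lower())
        if hit is not None:
            return hit
    return None


# Hypothetical alternate representations of one bibliographic field.
forms = {
    "tt-Arab": "<original text in Arabic script>",
    "tt-Cyrl": "<same text in Cyrillic script>",
    # Only one key exists for the romanization, so the Arabic-based and
    # Cyrillic-based ALA-LC forms cannot both be stored or told apart.
    "tt-alalc97": "<ALA-LC romanization, source script unknown>",
}

# A citation style that wants the romanized form, else the Cyrillic one.
print(pick_form(forms, ["tt-alalc97", "tt-Cyrl"]))
```

A style guide that specifically requires the romanization from the Arabic-script original has nothing more precise to ask for than "tt-alalc97".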

> In any case
> [alalc97] is not just for Tatar, is it? (let me know if it is)
[..]
> So would other Romanizations from Arabic script (from other languages) fit
> into your scheme?

As the person who requested alalc97, I understand of course that it is
not just for Tatar. This general issue of distinguishing ALA-LC
romanizations from various scripts of the same language does indeed
affect other languages. It certainly matters for Azerbaijani, Bashkir,
Uzbek, and other Turkic languages that had a similar history of using
an Arabic script before the introduction of Janalif (baku1926). It
also matters for Turkish and Ottoman Turkish, but in that case the
latter is represented by "ota", so ota-alalc97 and tr-alalc97 are
distinct already.

I'd suggest a tag for the Turkic languages affected by the
introduction of Janalif, before the introduction of the same, but I
don't want to cause the same justifiable concern that was raised about
my proposed "pre1917" tag on this list last fall. Also, such a tag
would really just represent a script, so in most cases it would be
equivalent to, e.g., tt-Arab, az-Arab. It only really is needed, then,
when the actual script is not Arabic, so tt-Latn-ARABIC (not a real,
or legal, subtag). So tt-Arab and tt-ARABIC are completely identical.

If a subtag for the pre-Latin and pre-Cyrillic forms of these various
Turkic languages is deemed appropriate, I'll look into what diversity
there is in the Arabic scripts, so I can craft a defensible and strong
proposal for a new subtag. My understanding is that there were
multiple types of Arabic script used for these languages, so we may be
able to justify tags on the grounds of orthographic reforms and
shifts.

Again, thanks for your advice as I try to work this out.

Sincerely,

Avram Lyon
Phillips, Addison | 14 Feb 17:22 2011

RE: Tagging transliterations from a specific script

> 
> I'd suggest a tag for the Turkic languages affected by the
> introduction of Janalif, before the introduction of the same, but I
> don't want to cause the same justifiable concern that was raised
> about
> my proposed "pre1917" tag on this list last fall. Also, such a tag
> would really just represent a script, so in most cases it would be
> equivalent to, e.g., tt-Arab, az-Arab. It only really is needed,
> then,
> when the actual script is not Arabic, so tt-Latn-ARABIC (not a real,
> or legal, subtag). So tt-Arab and tt-ARABIC are completely
> identical.

If I understand the problem correctly, you want to distinguish between "tt-alalc97" when transliterated
from the Arabic script vs. the Cyrillic script. This suggests to me that you want a subordinate subtag
(following alalc97) rather than trying to repurpose some unrelated but already defined subtag value. 

For example, you might consider registering a few subtags such as the following:

      Type:             variant
      Subtag:           sArab      (this would actually be lowercase in the registry)
      Description:      transliteration from the Arabic script
      Prefix:           tt-alalc97 (etc.....)
      Comments:         transliterated document's source script was Arabic; a document tagged
          with this subtag will be in the Latin script. Differences in transliteration
          occur depending on the source script.

Alternatively, it might be time to consider a transliteration extension to forestall increasingly
baroque subtag collections. Extensions allow for any subtag between 2 and 8 characters and can define
their own rules for legal usage. For example, if 't' were assigned to an extension for transliteration, it
might then define subtags to allow a tag like:

  "tt-alalc97-t-arab" // Tatar transliterated from the Arabic script

Writing an extension turns out not to be very hard. The main problem would be deciding what to put in it (which
might be an intractable problem).
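
As a rough illustration of how such a tag decomposes, the following Python sketch separates the main subtags from extension subtags. It is simplified on purpose: it assumes no private-use ("x") section and no grandfathered tags, the split_tag name is made up for this example, and the 't' singleton is the hypothetical extension discussed here, not an assigned one.

```python
def split_tag(tag):
    """Split a language tag into (main subtags, {singleton: [subtags]}).

    Simplified sketch: ignores private-use and grandfathered tags.
    """
    main = []
    extensions = {}
    current = None  # singleton of the extension currently being read
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1 and subtag != "x":
            # A one-character subtag opens a new extension.
            current = subtag
            extensions[current] = []
        elif current is None:
            main.append(subtag)
        else:
            # Extension subtags must be 2 to 8 ASCII alphanumerics.
            ok = 2 <= len(subtag) <= 8 and subtag.isascii() and subtag.isalnum()
            if not ok:
                raise ValueError(f"bad extension subtag: {subtag!r}")
            extensions[current].append(subtag)
    return main, extensions


main, extensions = split_tag("tt-alalc97-t-arab")
print(main)        # ['tt', 'alalc97']
print(extensions)  # {'t': ['arab']}
```

The point of the split is that everything after the singleton is interpreted by the extension's own rules, so "arab" there would not be read as a script subtag of the tag itself.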

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.
CE Whitehead | 14 Feb 21:34 2011

Tagging transliterations from a specific script

Hi.  [sarab] is o.k. (but I don't think it would be listed as [sArab] in the registry, since the variant subtag is technically all lower-case).  I personally want to suggest [transarb] -- if that makes any sense.

Phillips, Addison addison at lab126.com
Mon Feb 14 17:22:21 CET 2011
 
>>
>> I'd suggest a tag for the Turkic languages affected by the
>> introduction of Janalif, before the introduction of the same, but I
>> don't want to cause the same justifiable concern that was raised
>> about
>> my proposed "pre1917" tag on this list last fall. Also, such a tag
>> would really just represent a script, so in most cases it would be
>> equivalent to, e.g., tt-Arab, az-Arab. It only really is needed,
>> then,
>> when the actual script is not Arabic, so tt-Latn-ARABIC (not a real,
>> or legal, subtag). So tt-Arab and tt-ARABIC are completely
>> identical.
> If I understand the problem correctly, you want to distinguish between "tt-alalc97" when
> transliterated from the Arabic script vs. the Cyrillic script. This suggests to me that you want a
> subordinate subtag (following alalc97) rather than trying to repurpose some unrelated but already
> defined subtag value.
Thanks for restating this.  
This is what I understand too.  (I assume this is what Avram Lyon means.)
So ideally [alalc97] would be registered as the prefix.
However, I suppose we cannot have *-alalc97 registered as the prefix.
And, if we do not register a prefix with this subtag, it seems we cannot do so later, according to RFC 5646:
"If a record includes no ’Prefix’ field, a ’Prefix’ field MUST NOT be
added to the record at a later date. Otherwise, changes (additions,
deletions, or modifications) to the set of ’Prefix’ fields MAY be
registered, as long as they strictly widen the range of language tags  . . ."
(If I recollect correctly, you all decided to continue not allowing wildcards, so the only option is to list all possible prefixes, or else to include information about the ordering of this variant after [alalc97] in a comment.)
> For example, you might consider registering a few subtags such as the following:
>      Type:             variant
>       Subtag:           sArab      (this would actually be lowercase in the registry)
One small comment:  I don't think you can use an upper-case A in the variant subtag, can you?
My preferences for the name are: [tranarab] or [fromarab] or [transarb] or something similar; the one option I do not like is [arabic], which I find to be a confusing name.
(This is just my personal preference.  Like I said before I am not the expert.)
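
For what it is worth, the candidate names above can be checked against the variant shape in the RFC 5646 grammar (5 to 8 alphanumerics, or 4 characters when the first is a digit). The Python sketch below tests shape only; it says nothing about whether a subtag is registered or would be accepted, and normalize_variant is a made-up helper name.

```python
import re

# Variant shape per the RFC 5646 grammar, applied to the lowercased form:
# 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
VARIANT_RE = re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")


def normalize_variant(candidate):
    """Return the lowercase form if the shape is legal for a variant, else None."""
    lowered = candidate.lower()
    return lowered if VARIANT_RE.fullmatch(lowered) else None


for name in ["sArab", "transarb", "tranarab", "fromarab", "arabic", "arab"]:
    print(name, "->", normalize_variant(name))
```

All of the proposed names have a legal shape once lowercased; a bare four-letter "arab" does not, since that shape is reserved for script subtags.
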
Best,
 
--C. E. Whitehead
cewcathar <at> hotmail.com

>      Description:      transliteration from the Arabic script
>      Prefix:           tt-alalc97 (etc.....)
>      Comments:         transliterated document's source script was Arabic; a document tagged
>      with this subtag will be in the Latin script. Differences in transliteration
>      occur depending on the source script.
> Alternatively, it might be time to consider a transliteration extension to forestall increasingly
> baroque subtag collections. Extensions allow for any subtag between 2 and 8 characters and can define
> their own rules for legal usage. For example, if 't' were assigned to an extension for transliteration, it
> might then define subtags to allow a tag like:
>  "tt-alalc97-t-arab" // Tatar transliterated from the Arabic script
> Writing an extension turns out not to be very hard. The main problem would be deciding what to put
> in it (which might be an intractable problem).
> Addison
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N, IETF IRI WGs)
> Internationalization is not a feature.
> It is an architecture.
 
Michael Everson | 3 Mar 18:52 2011

Applying for a Neo subtag

I'll be publishing a translation of Alice's Adventures in Wonderland in Neo (Los Aventuros de Alis in
Marvoland) -- http://en.wikipedia.org/wiki/Neo_(constructed_language) -- later this year or
sometime next, and Neo hasn't got an ISO 639 language tag. I filed a request today with the ISO 639-3
authority for the tag "neu". Failing that we would need a subtag to "art" here... I guess neo1961 would do. 

Editions of Alice in Scots, Ulster Scots, Sussex, and Appalachian are also in progress. For Scots and
Ulster Scots we have "sco" and "sco-ulster" respectively. I could propose "sussex" for Sussex. For
Appalachian...?

Michael Everson * http://www.evertype.com/
