Mark Davis | 5 Apr 02:20

FYI: Unicode 5.1 Released


---------- Forwarded message ----------
From: Rick McGowan <rick <at> unicode.org>
Date: Fri, Apr 4, 2008 at 3:54 PM
Subject: Unicode 5.1 Released
To: unicode <at> unicode.org


The Unicode Consortium is pleased to announce the release of Unicode 5.1.
This release contains over 100,000 characters, and provides significant
additions and improvements that extend text processing for software
worldwide. Some of the key features are: increased security in data
exchange, significant character additions for Indic and South East Asian
scripts, expanded identifier specifications for Indic and Arabic scripts,
improvements in the processing of Tamil and other Indic scripts,
linebreaking conformance relaxation for HTML and other protocols,
strengthened normalization stability, new case pair stability,
plus others given below.

The Version 5.1.0 data files and documentation are final and posted on the
Unicode site. In addition to updated existing files, implementers will
find new test data files (for example, for linebreaking) and new XML data
files that encapsulate all of the Unicode character properties. For
details, see the page for Unicode 5.1.0 at
http://www.unicode.org/versions/Unicode5.1.0/.

A major feature of Unicode 5.1.0 is the enabling of ideographic variation
sequences. These sequences allow standardized representation of glyphic
variants needed for Japanese, Chinese, and Korean text. The first
registered collection, from Adobe Systems, is now available at
http://www.unicode.org/ivd/.

Unicode 5.1 contains significant changes to properties and behavioral
specifications. Several important property definitions were extended,
improving linebreaking for Polish and Portuguese hyphenation. The Unicode
Text Segmentation Algorithms, covering sentences, words, and characters,
were greatly enhanced to improve the processing of Tamil and other Indic
languages. The Unicode Normalization Algorithm now defines stabilized
strings and provides guidelines for buffering. Standardized named sequences
are added for Lithuanian, and provisional named sequences for Tamil.

Unicode 5.1.0 adds 1,624 newly encoded characters. These additions include
characters required for Malayalam and Myanmar and important individual
characters such as Latin capital sharp s for German. Version 5.1 extends
support for languages in Africa, India, Indonesia, Myanmar, and Vietnam,
with the addition of the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra,
Sundanese, and Vai scripts. Scholarly support includes important editorial
punctuation marks, as well as the Carian, Lycian, and Lydian scripts, and
the Phaistos disc symbols. Other new symbol sets include dominoes, Mahjong,
dictionary punctuation marks, and math additions. This latest version of
the Unicode Standard has exactly the same character assignments as ISO/IEC
10646:2003 plus Amendments 1 through 4.

The Unicode Collation Algorithm (UCA), the core standard for sorting all
text, is also being updated at the same time (see
http://www.unicode.org/reports/tr10/). The major changes in UCA include
coverage of all Unicode 5.1 characters, tightened conformance for canonical
equivalence, clearer definitions of internationalized search and matching,
specifications of parameters for customizing collation, and definitions of
collation folding. There are also important clarifications on the use of
contractions (such as "ch" in Slovak) in collation.

The next version of the Unicode locale project (CLDR) is also being
prepared on the basis of Unicode 5.1, and is now open for public data
submission (see http://www.unicode.org/cldr/).





--
Mark
Mark Davis | 10 Apr 21:52

Last open item

I went through all the pending comments to see which didn't have objections raised on the list. The only open substantive issue I think we have left is about whether to change the following lines:

The field 'Preferred-Value' MUST NOT be modified once created in the registry. The field MAY be added to records according to the rules in Section 3.3 (Maintenance of the Registry).

(note: we don't have a similar phrase for Deprecated)

Values in the fields 'Type', 'Subtag', 'Tag', 'Added', 'Deprecated' and 'Preferred-Value' MUST NOT be changed and are guaranteed to be stable over time.

Here is my reasoning. We can't guarantee complete stability over time. The canonical form changes whenever a Deprecated or Preferred-Value field is added, or when the target of a Preferred-Value is itself deprecated. Currently, if a subtag is deprecated in a source standard in favor of another, nothing special needs to be done unless:
  1. there is a collision - the same subtag with a different meaning (like CS), OR
  2. the preferred subtag is an older deprecated code (e.g. BU => MM => BU), OR
  3. the target of the Preferred-Value itself becomes deprecated (we want to resolve it so that users don't have to).
In the first 2 cases, we are forced to use a "special" code, e.g. change MM to 104. That's perfectly fine in the case of collisions (#1). However, it is completely unnecessary in the case of older deprecated codes. So my proposal is to drop the first of the lines above, and change the second to:

Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be changed and are guaranteed to be stable over time. Values in the 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag or subtag in the Preferred-Value field is itself deprecated, or if a deprecation is reversed in one of the source standards.
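
The proposed wording can be read as a small decision procedure. The sketch below only illustrates that reading; the function name and its boolean parameters are invented for this example and are not part of the draft.

```python
# Fields that stay immutable under the proposed wording.
ALWAYS_STABLE = {"Type", "Subtag", "Tag", "Added"}
# Fields that may change only under the two stated exceptions.
CONDITIONALLY_STABLE = {"Deprecated", "Preferred-Value"}


def change_allowed(field, target_deprecated=False, deprecation_reversed=False):
    """Return True if the proposed text would permit changing `field`.

    target_deprecated: the tag or subtag named in Preferred-Value has
        itself been deprecated.
    deprecation_reversed: a source standard has reversed a deprecation.
    Both flags are hypothetical inputs used only for this illustration.
    """
    if field in ALWAYS_STABLE:
        return False
    if field in CONDITIONALLY_STABLE:
        return target_deprecated or deprecation_reversed
    # Any other field is outside the sentence being proposed here.
    raise ValueError(f"rule does not cover field {field!r}")
```

With no exception in play, change_allowed("Preferred-Value") is False, so the default remains stability; the two flags are the only ways to unlock a change.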

--
Mark
Mark Davis | 11 Apr 01:32

Re: Last open item

Ok, not completely the last. Addison and I are trying to wrap this all up, and ran across one other.

Before forwarding a new registration to IANA, the Language Subtag Reviewer MUST ensure that values in the 'Subtag' field match case according to the description in Section 3.1.
=>

Before forwarding a new registration to IANA, the Language Subtag Reviewer MUST ensure that values in the 'Subtag' field match case according to the description in Section 3.1, and that all other requirements of this specification are followed.

Rationale. It has been implicit in all of our discussions and in the document that the LSR must ensure that all of the policies in the document are followed, not just casing. However, it isn't really explicit in the document, and should be. This is a minimal change to that end (although other wording would be possible).
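
For illustration, the casing part of that check could look like the sketch below. It assumes the usual registry convention (script subtags in title case, region subtags in uppercase, everything else in lowercase); Section 3.1 is the authority, and the function names are made up for this example. The "all other requirements" part of the new wording is not modeled.

```python
def expected_case(subtag_type, subtag):
    """Return `subtag` in the case assumed for its record type."""
    if subtag_type == "script":
        return subtag.capitalize()   # title case, e.g. "Latn"
    if subtag_type == "region":
        return subtag.upper()        # uppercase, e.g. "MM"; digits are unchanged
    return subtag.lower()            # language, variant, and other types


def case_matches(subtag_type, subtag):
    """True if the 'Subtag' value already has the expected case."""
    return subtag == expected_case(subtag_type, subtag)
```

A reviewer-side script could run case_matches over each new record before the registration is forwarded.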

Mark

On Thu, Apr 10, 2008 at 12:52 PM, Mark Davis <mark.davis <at> icu-project.org> wrote:
> I went through all the pending comments to see which didn't have objections raised on the list. The only open substantive issue I think we have left is about whether to change the following lines:
>
> The field 'Preferred-Value' MUST NOT be modified once created in the registry. The field MAY be added to records according to the rules in Section 3.3 (Maintenance of the Registry).
>
> (note: we don't have a similar phrase for Deprecated)
>
> Values in the fields 'Type', 'Subtag', 'Tag', 'Added', 'Deprecated' and 'Preferred-Value' MUST NOT be changed and are guaranteed to be stable over time.
>
> Here is my reasoning. We can't guarantee complete stability over time. Whenever Deprecated or Preferred-Value are added, it changes the canonical form; or if the target of a Preferred-Value is itself deprecated. Currently if a subtag is deprecated in a source standard in favor of another, nothing special needs to be done unless:
>   1. there is a collision - the same subtag with a different meaning (like CS), OR
>   2. the preferred subtag is an older deprecated code (eg BU => MM => BU), OR
>   3. the target of the Preferred-Value itself becomes deprecated (we want to resolve it so that users don't have to).
> In the first 2 cases, we are forced to use a "special" code, eg change MM to 104. That's perfectly fine in the case of collisions (#1). However, it is completely unnecessary in the case of older deprecated codes. So my proposal is to change the above lines to drop the first one, and change the second to:
>
> Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be changed and are guaranteed to be stable over time. Values in the 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag or subtag in the Preferred-Value field is itself deprecated, or if a deprecation is reversed in one of the source standards.
>
> --
> Mark
Martin Duerst | 11 Apr 09:07

Re: Last open item

If Doug (and Michael) are fine with this change, I'm fine
with it, too.

Regards,   Martin.

At 08:32 08/04/11, Mark Davis wrote:
>Ok, not completely the last. Addison and I are trying to wrap this all up, and ran across one other.
>
>Before forwarding a new registration to IANA, the Language Subtag Reviewer MUST ensure that values in the 'Subtag' field match case according to the description in Section 3.1.
>=>
>
>Before forwarding a new registration to IANA, the Language Subtag Reviewer MUST ensure that values in the 'Subtag' field match case according to the description in Section 3.1, and that all other requirements of this specification are followed.
>
>Rationale. It has been implicit in all of our discussions and in the document that the LSR must follow all of the policies in the document are followed, not just casing. However, it isn't really explicit in the document, and should be. This is a minimal change to that end (although other wording would be possible).
>
>Mark
>
>On Thu, Apr 10, 2008 at 12:52 PM, Mark Davis <mark.davis <at> icu-project.org> wrote:
>>I went through all the pending comments to see which didn't have objections raised on the list. The only open substantive issue I think we have left is about whether to change the following lines:
>>
>>The field 'Preferred-Value' MUST NOT be modified once created in the registry. The field MAY be added to records according to the rules in Section 3.3 (Maintenance of the Registry) <http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-13.html#maintreg>.
>>
>>(note: we don't have a similar phrase for Deprecated)
>>
>>Values in the fields 'Type', 'Subtag', 'Tag', 'Added', 'Deprecated' and 'Preferred-Value' MUST NOT be changed and are guaranteed to be stable over time.
>>
>>Here is my reasoning. We can't guarantee complete stability over time. Whenever Deprecated or Preferred-Value are added, it changes the canonical form; or if the target of a Preferred-Value is itself deprecated. Currently if a subtag is deprecated in a source standard in favor of another, nothing special needs to be done unless:
>>    * there is a collision - the same subtag with a different meaning (like CS), OR
>>    * the preferred subtag is an older deprecated code (eg BU => MM => BU), OR
>>    * the target of the Preferred-Value itself becomes deprecated (we want to resolve it so that users don't have to).
>>In the first 2 cases, we are forced to use a "special" code, eg change MM to 104. That's perfectly fine in the case of collisions (#1). However, it is completely unnecessary in the case of older deprecated codes. So my proposal is to change the above lines to drop the first one, and change the second to:
>>
>>Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be changed and are guaranteed to be stable over time. Values in the 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag or subtag in the Preferred-Value field is itself deprecated, or if a deprecation is reversed in one of the source standards.
>>
>>-- 
>>Mark 
>
>
>
>-- 
>Mark 
>_______________________________________________
>Ltru mailing list
>Ltru <at> ietf.org
>https://www.ietf.org/mailman/listinfo/ltru

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

John Cowan | 11 Apr 15:24

Re: Last open item

Mark Davis scripsit:

> Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be
> changed and are guaranteed to be stable over time. Values in the
> 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag or
> subtag in the Preferred-Value field is itself deprecated, or if a
> deprecation is reversed in one of the source standards.

+1

-- 
I now introduce Professor Smullyan,             John Cowan
who will prove to you that either               cowan <at> ccil.org
he doesn't exist or you don't exist,            http://www.ccil.org/~cowan
but you won't know which.                               --Melvin Fitting
Doug Ewell | 12 Apr 18:53

Re: Last open item

Mark Davis <mark dot davis at icu dash project dot org> wrote:

> So my proposal is to change the above lines to drop the first one, and 
> change the second to:
>
> Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be 
> changed and are guaranteed to be stable over time. Values in the 
> 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag 
> or subtag in the Preferred-Value field is itself deprecated, or if a 
> deprecation is reversed in one of the source standards.

I no longer have a strong opinion on this.  I know that at one point it 
was agreed that we should never, ever change Deprecated and 
Preferred-Value fields, in the interest of stability.  If that is no 
longer a concern, or if we were wrong about the amount of stability thus 
provided, then this change is fine.

My usual concern is that, if we want to change one of our own decisions 
from the past, we should understand why we came to that decision in the 
first place.  That criterion seems to be fulfilled here.

--
Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

_______________________________________________
Ltru mailing list
Ltru <at> ietf.org
https://www.ietf.org/mailman/listinfo/ltru
Doug Ewell | 12 Apr 18:55

Re: Last open item

Mark Davis <mark dot davis at icu dash project dot org> wrote:

> Before forwarding a new registration to IANA, the Language Subtag 
> Reviewer MUST ensure that values in the 'Subtag' field match case 
> according to the description in Section 3.1, and that all other 
> requirements of this specification are followed.
>
> Rationale. It has been implicit in all of our discussions and in the 
> document that the LSR must follow all of the policies in the document 
> [are followed], *not just casing*. However, it isn't really explicit 
> in the document, and should be. This is a minimal change to that end 
> (although other wording would be possible).

Agreed.  This change is short and harmless.

--
Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Randy Presuhn | 13 Apr 23:06

Re: Last open item

Hi -

As a technical contributor...

> From: "Mark Davis" <mark.davis <at> icu-project.org>
> To: "LTRU Working Group" <ltru <at> ietf.org>
> Sent: Thursday, April 10, 2008 1:52 PM
> Subject: [Ltru] Last open item
>
> I went through all the pending comments to see which didn't have objections
> raised on the list. The only open substantive issue I think we have left is
> about whether to change the following lines:
> 
> The field 'Preferred-Value' MUST NOT be modified once created in the
> registry. The field MAY be added to records according to the rules in
> Section 3.3 (Maintenance of the
> Registry)<http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-13.html#maintreg>
> .
> 
> *(note: we don't have a similar phrase for Deprecated)*
> 
> Values in the fields 'Type', 'Subtag', 'Tag', 'Added', 'Deprecated' and
> 'Preferred-Value' MUST NOT be changed and are guaranteed to be stable over
> time.
> 
> Here is my reasoning. We can't guarantee complete stability over time.
> Whenever Deprecated or Preferred-Value are added, it changes the canonical
> form; or if the target of a Preferred-Value is itself deprecated. Currently
> if a subtag is deprecated in a source standard in favor of another, nothing
> special needs to be done unless:
> 
>    1. there is a collision - the same subtag with a different meaning
>    (like CS), OR
>    2. the preferred subtag is an older deprecated code (eg BU => MM =>
>    BU), OR
>    3. the target of the Preferred-Value itself becomes deprecated (we
>    want to resolve it so that users don't have to).
> 
> In the first 2 cases, we are forced to use a "special" code, eg change MM to
> 104. That's perfectly fine in the case of collisions (#1). However, it is
> completely unnecessary in the case of older deprecated codes. So my proposal
> is to change the above lines to drop the first one, and change the second
> to:
> 
> Values in the fields 'Type', 'Subtag', 'Tag', and 'Added' MUST NOT be
> changed and are guaranteed to be stable over time. Values in the
> 'Deprecated' and 'Preferred-Value' MUST NOT be changed unless the tag or
> subtag in the Preferred-Value field is itself deprecated, or if a
> deprecation is reversed in one of the source standards.
> 
> -- 
> Mark

I object to making this change.  Case (1) is already prohibited in sections
3.4(10) and 3.6 of RFC 4646.  Case (2) is already covered by 3.4(9) - but 
perhaps we should emphasize the optionality implied by the "MAY" in that
clause, because generally, in the interest of stability, one would not want
to put in a new preferred value.  Case (3) is a non-problem.  The "deprecated"
in the record only refers to the status in the source standard, not the
validity of the string as a language tag, as should be very clear from the
first sentence of 3.4(0)

Adopting this change would add instability without providing any value,
and the cases cited as motivations are already covered by the RFC 4646
rules.

Randy

Mark Davis | 14 Apr 17:15

Re: Last open item



On Sun, Apr 13, 2008 at 2:06 PM, Randy Presuhn <randy_presuhn <at> mindspring.com> wrote:
> Hi -
>
> As a technical contributor...


> I object to making this change.  Case (1) is already prohibited in sections
> 3.4(10) and 3.6 of RFC 4646.

Understood. I was summing up the position, and did not request any change in #1. Note that I am talking about two different, interrelated fields: Deprecated, and Preferred-Value.
 
> Case (2) is already covered by 3.4(9) - but
> perhaps we should emphasize the optionality implied by the "MAY" in that
> clause, because generally, in the interest of stability, one would not want
> to put in a new preferred value.

Case 2 is not covered by 3.4(9), and the fix that I am recommending would change current text.

3.4(9) is: "Codes assigned by ISO 639-1 that do not conflict with existing two-letter primary language subtags and which have no corresponding three-letter primary defined in the registry are entered into the IANA registry as new records of type 'language'." I'm not sure how it applies to the cases I'm discussing, and it certainly doesn't apply to region codes.

Let me start again. I'll apologize for being a bit long-winded, but I think it's important to have some specific examples. The situation I discuss in #2 is where there is a case like:

Initial State

Subtag: X
Preferred-Value: Y
Deprecated: 1989-01-01
...
Subtag: Y

X and Y could be 'in' and 'id', OR 'BU' and 'MM', OR some script tag (no instances yet). Note that both X and Y are valid, both with the same meaning, and Y is preferred.

The change.

ISO restores X to being the correct tag, and deprecates Y.

Our response.

There are three courses of action we could take. B2 is in the current text (although not clearly -- see below).

(A) Revert to old tag

Subtag: X
[removed old Preferred-Value and Deprecated]
...
Subtag: Y
Preferred-Value: X
Deprecated: 2010-01-01

(B1) Make new tag (and fix recursion)

Subtag: X
Preferred-Value: Z
Deprecated: 1989-01-01
...
Subtag: Y
Preferred-Value: Z
Deprecated: 2010-01-01
...
Subtag: Z

Z will be of the form '666' for a region or 'zzzzz' for a language subtag.

(B2) Make new tag (require users to do recursion)

Subtag: X
Preferred-Value: Y
Deprecated: 1989-01-01
...
Subtag: Y
Preferred-Value: Z
Deprecated: 2010-01-01
...
Subtag: Z

Z will be of the form '666' for a region or 'zzzzz' for a language subtag.

Note that in all of these cases both X and Y are still valid, both with the same meaning.

If we do B1 or B2, we needlessly deviate from ISO, and are forced into using a new, non-ISO code for something that is perfectly reasonable. Why is this a good idea?

We still have stability in *all* of these cases. All codes before and after any of these options are valid. The only difference is in the preferred form. In the case of A, we are closer to ISO; for B we pointlessly depart from ISO. And in case B2, we have the added disadvantage that if users don't do the recursion correctly, they will end up with incorrect preferred values.
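
The point that validity is unchanged and only the preferred form moves can be checked mechanically. In the sketch below each registry state is reduced to a mapping from subtag to Preferred-Value (None when the record has no Preferred-Value); X, Y and Z are the placeholders used above, and the dictionary layout and function name are assumptions made for this illustration only.

```python
# Registry states from the example, reduced to subtag -> Preferred-Value.
INITIAL = {"X": "Y", "Y": None}
OPTIONS = {
    "A":  {"X": None, "Y": "X"},
    "B1": {"X": "Z", "Y": "Z", "Z": None},
    "B2": {"X": "Y", "Y": "Z", "Z": None},
}


def canonical_subtags(registry):
    """Subtags with no Preferred-Value, i.e. already in preferred form."""
    return {subtag for subtag, pv in registry.items() if pv is None}


for name, registry in OPTIONS.items():
    # X and Y stay registered (valid) under every option.
    still_valid = {"X", "Y"} <= registry.keys()
    print(name, "X and Y valid:", still_valid,
          "canonical:", sorted(canonical_subtags(registry)))
```

In the initial state the only canonical subtag is Y; under A it is X, and under B1 and B2 it is the new Z. Y stops being canonical in every option, so the options differ only in which subtag takes its place.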

Moreover, the current text is not clear with regard to the difference between B1 and B2.
  • "If a tag or subtag has a 'Preferred-Value' field in its registry entry, then the value of that field SHOULD be used to form the language tag in preference to the tag or subtag in which the preferred value appears." (4.1(3))
  • "Subtags of type 'Region' that have a Preferred-Value mapping in the IANA registry (see Section 3.1 (Format of the IANA Language Subtag Registry)) MUST be replaced with their mapped value. Note: In rare cases, the mapped value will also have a Preferred-Value." (4.4(1))
4.1(3) doesn't mention the possibility of recursion. If the user doesn't realize that the registry doesn't already resolve the PV chains, he'll happily substitute Y for X and think he's done, not realizing that he has to go look at Y to see that it in turn should be replaced by Z.
4.4(1) does mention it, but could be clearer that the user must recurse.
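
To make the recursion concrete, here is a rough sketch of what a user has to do under B2, and of what resolving the chains in the registry (as in B1) would amount to. The mapping layout and the names resolve and flatten are hypothetical, chosen for this example; nothing here is taken from the draft.

```python
def resolve(registry, subtag):
    """Follow Preferred-Value links until a subtag without one is reached."""
    seen = {subtag}
    while registry.get(subtag) is not None:
        subtag = registry[subtag]
        if subtag in seen:
            # Guard against a loop such as X -> Y -> X.
            raise ValueError(f"Preferred-Value cycle at {subtag!r}")
        seen.add(subtag)
    return subtag


def flatten(registry):
    """Return a copy in which every Preferred-Value is the end of its chain."""
    return {
        subtag: None if pv is None else resolve(registry, subtag)
        for subtag, pv in registry.items()
    }


# The B2 state: X points at Y, which in turn points at Z.
B2 = {"X": "Y", "Y": "Z", "Z": None}

single_step = B2["X"]      # what a user gets by substituting only once
fully_resolved = resolve(B2, "X")
flattened = flatten(B2)    # the same records with chains pre-resolved
```

Here single_step is "Y" while fully_resolved is "Z", which is exactly the mistake described above. After flatten, a single lookup is enough for every subtag, so the user never needs to recurse.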
 
> Case (3) is a non-problem.  The "deprecated"
> in the record only refers to the status in the source standard, not the
> validity of the string as a language tag, as should be very clear from the
> first sentence of 3.4(0)

I am talking about BOTH the Deprecated status and the Preferred-Value status, in all of my original message if you look back at it. These are distinct, yet intertwined.



> Adopting this change would add instability without providing any value,
> and the cases cited as motivations are already covered by the RFC 4646
> rules.

I disagree. Can you go into more detail -- with examples -- of how you think approach (A) would add instability with regard to codes?

Because both codes X and Y are valid before and after any of these alternatives, the only instability that I think you might be referring to is the canonical form. But the canonical form is changing in *all* of these cases: A, B1, and B2; Y is no longer the canonical form in any of them. I don't see why B2 is any "more" stable than A, not in any way that matters.

As to the recursion, although it wouldn't apply in this case if we allow A, it does in other cases. And resolving the recursion for the user in the registry just prevents needless mistakes.
 


> Randy





--
Mark
John Cowan | 14 Apr 17:47

Re: Last open item

Mark Davis asks rhetorically:

> If we do B1 or B2, we needlessly deviate from ISO, and are forced into using
> an new, non-ISO code for something that is perfectly reasonable. Why is this
> a good idea?

I agree: it is not a good idea.  We should restore the original tag as the
preferred value in such cases.

> As to the recursion, although it wouldn't apply in this case if we allow A,
> it does in other cases. And resolving the recursion for the user in the
> registry just prevents needless mistakes.

+1.  The only arguments I can see against doing the recursion for the user
are:

1) Too much risk of error; but we have plenty of vigilant eyes in ietf-languages.

2) The ability to reconstruct the registry as of a given point in time is lost;
but Doug Ewell has argued that this ability does not exist anyhow.

-- 
John Cowan  cowan <at> ccil.org   http://ccil.org/~cowan
Consider the matter of Analytic Philosophy.  Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett.  By their works shall ye know them.  However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works.  Indeed, Bummett does not exist.  It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.
