Chuck Soper | 2 Sep 2004 09:39

Localization Facilitation

Over a month ago Mark Davis posted an email that mentioned issues 
related to translations of time zone IDs provided by the common 
locale data repository project (CLDR). I'd like to find out about
the status of this localization and to find a way to keep informed of
the progress.

The initial email discussion was somewhat lengthy and I'm sure that I 
did not digest all the points mentioned.

The beginning of the CLDR proposal explains the current mechanism for 
localizing Olson Time Zone Identifiers (Olson TZIDs). And in the tz 
code directory, the rules for naming Olson TZIDs are in the Theory 
file. Many Olson TZIDs simply do not localize well. I believe that 
the Olson TZIDs were not designed with localization in mind.

I suggest a zoneMeta.tab file (very similar to zone.tab) be created 
for the purpose of facilitating localization. Such a file could
include the following columns:

  - ISO 3166-1
  - latitude/longitude (from zone.tab)
  - Olson TZID
  - city or place name
  - sub-division (i.e. state, province, prefecture, or kingdom)
  - time zone name, general (e.g. Eastern Time)
  - time zone name, specific if it exists
            (e.g. Eastern Standard Time - Indiana - most locations)
  - time zone name, std (e.g. Eastern Standard Time)
  - time zone name, dst (e.g. Eastern Daylight Saving Time)
  - historic (yes/no)

Paul Eggert | 3 Sep 2004 07:05

Re: Localization Facilitation

Chuck Soper <chucks <at> lmi.net> writes:

> Many Olson TZIDs simply do not localize well. I believe that the
> Olson TZIDs were not designed with localization in mind.

I'm not sure what you mean here.  TZIDs are merely identifiers for
regions of the globe.

> I suggest a zoneMeta.tab file (very similar to zone.tab) be created
> for the purpose of facilitating localization. Such a file could
> include the following columns:
>
>   - ISO 3166-1
>   - latitude/longitude (from zone.tab)
>   - Olson TZID
>   - city or place name

This is all in zone.tab now, for English only.  The city or place name
should be localized of course.

>   - sub-division (i.e. state, province, prefecture, or kingdom)

Often the TZID subdivisions are ad hoc; this reflects the ad hoc
nature of time zone and DST political decisions.  For the best TZ
subdivision data I know of, please see Gwillim Law's Statoids
<http://www.statoids.com/statoids.html>.  Another source is Oscar van
Vlijmen's Time zone boundaries for multizone countries
<http://home-4.tiscali.nl/~t876506/Multizones.html>.

>   - time zone name, general (e.g. Eastern Time)


Proposed strftime.c and localtime.c changes for long years

Attached are changes (based on those circulated on the list earlier by Paul
Eggert) to strftime.c and localtime.c; these changes allow for the
possibility that adding an int year to TM_YEAR_BASE might overflow an int,
so the relevant math is done with longs. There are also a couple of cosmetic
changes. Note that no changes to the way strftime handles two-digit year
output requests are being (deliberately) made at this point.
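
As a rough illustration of the overflow in question (a sketch only:
full_year is a hypothetical helper, not a function in localtime.c or
strftime.c), the year offset can be widened to long before
TM_YEAR_BASE is added:

```c
#define TM_YEAR_BASE	1900	/* same value the tz code uses */

/*
** Hypothetical helper: return the calendar year for a struct tm
** style tm_year, doing the addition in long rather than int.
** This helps only where long is wider than int; if
** INT_MAX == LONG_MAX the sum can still overflow.
*/
long
full_year(int tm_year)
{
	return tm_year + (long) TM_YEAR_BASE;
}
```

With a 32-bit int, tm_year + TM_YEAR_BASE computed in int overflows
for any tm_year above INT_MAX - 1900; the long version does not
overflow when long is 64 bits.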

I hope to update the ftp bundle to incorporate these changes in about a
week; I'd welcome any feedback before then.

					--ado

------- localtime.c -------
*** /tmp/geta17327	Tue Sep  7 08:12:24 2004
--- /tmp/getb17327	Tue Sep  7 08:12:24 2004
***************
*** 5,11 ****

  #ifndef lint
  #ifndef NOID
! static char	elsieid[] = "@(#)localtime.c	7.78";
  #endif /* !defined NOID */
  #endif /* !defined lint */

--- 5,11 ----

  #ifndef lint
  #ifndef NOID
! static char	elsieid[] = "@(#)localtime.c	7.79";
  #endif /* !defined NOID */

Paul Eggert | 7 Sep 2004 20:43

Re: Proposed localtime.c changes for long years

Thanks for looking into that.  I'm responding to the proposed
localtime.c changes; a separate email will address the strftime.c
changes.

"Olson, Arthur David (NIH/NCI)" <olsona <at> dc37a.nci.nih.gov> writes:

> extern char *   asctime_r();

This could be a problem if asctime_r is a macro in time.h.  Also, it
should probably be moved to private.h, since it's a header fixup (like
errno) that is useful in other modules.  (Also, being an old-timer I
have some qualms about declaring externs inside blocks, due to bugs in
older compilers.)

> ! #define tm_year	USE_Y_NOT_YOURTM_TM_YEAR
>...
> + #undef tm_year

I'd remove this trick.  Strictly speaking, the C Standard doesn't
allow it: it says that identifiers in standard headers are reserved
for use as macro names.  More practically, I worry that some <time.h>
somewhere might #define tm_year to something else, and the
#define/#undef trick won't work there.  The #define/#undef isn't
needed for the code to run, so I'd leave it out.

! 	** Turn y into an actual year number for now.
  	** It is converted back to an offset from TM_YEAR_BASE later.

This hack has bothered me for a while, due to the possibility of
overflow when adding TM_YEAR_BASE.  It's a minor point, but how about

Paul Eggert | 7 Sep 2004 22:03

Re: Proposed strftime.c changes for long years

The proposed strftime.c code for the %C conversion uses _lconv, but
the converted value can't possibly fall outside the range
INT_MIN..INT_MAX so it might be a bit more efficient to use _conv.

The %C conversion expression "(t->tm_year + (long) TM_YEAR_BASE) /
100" returns the wrong result in some cases due to overflow, e.g., if
t->tm_year == INT_MAX and INT_MAX==LONG_MAX.
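
One way to sidestep that overflow, sketched here under assumptions
(century_of is a hypothetical name, and this is not necessarily what
the patch below does), is to split both operands into hundreds and
remainders before adding, so that no intermediate value exceeds the
range of int:

```c
#define TM_YEAR_BASE	1900	/* value used by the tz code */
#define DIVISOR		100

/*
** Hypothetical helper: compute (tm_year + TM_YEAR_BASE) / 100,
** truncating toward zero, without ever forming the full sum.
** Assumes C99 semantics for / and % on negative operands.
*/
int
century_of(int tm_year)
{
	int	lead;
	int	trail;

	lead = tm_year / DIVISOR + TM_YEAR_BASE / DIVISOR;
	trail = tm_year % DIVISOR + TM_YEAR_BASE % DIVISOR;
	lead += trail / DIVISOR;
	trail %= DIVISOR;
	/* Make lead and trail agree in sign, as one truncating division would. */
	if (trail < 0 && lead > 0)
		--lead;
	else if (lead < 0 && trail > 0)
		++lead;
	return lead;
}
```

For tm_year == 99 this yields 19, and for tm_year == -1 (the year
1899) it yields 18, matching the truncated quotient of the full year.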

It looks like you're leaning towards having %y generate year%100 even
when the year is negative, but I'd like to present a new argument for
generating year mod 100 instead, on the grounds of preventing buffer
overruns in older code.  Most programmers expect strftime %y to
generate exactly two digits, in the range 00-99 as the C Standard
explicitly requires.  Generating strings like "-1" or (especially)
"-99" might cause errors (e.g., buffer overruns) in older code.

The situation for %g is similar.

%C is less straightforward, since here the C Standard requires the
impossible (it says the output range must be 00-99).  However, I'd
argue that it's least surprising if %C is consistent with %y, i.e., if
%C is (year - (year mod 100)) / 100.
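
To make the distinction between year%100 and year mod 100 concrete,
here is a minimal sketch (the function names are hypothetical and do
not come from strftime.c).  In C99, -1 % 100 is -1, whereas the
mathematical -1 mod 100 is 99:

```c
/*
** Hypothetical helper: mathematical (non-negative) year mod 100,
** i.e. the value proposed above for %y.  Always in the range 0..99.
*/
int
year_mod_100(long year)
{
	int	r;

	r = (int) (year % 100);
	return (r < 0) ? r + 100 : r;
}

/*
** Hypothetical helper: the matching value for %C,
** (year - (year mod 100)) / 100.  The subtraction could overflow
** for years within 99 of LONG_MIN; that case is ignored here.
*/
long
century_consistent(long year)
{
	return (year - year_mod_100(year)) / 100;
}
```

Under this definition the year -1 gives 99 for the two-digit year and
-1 for the century, and 100 * century + two-digit year always
reproduces the original year.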

The following proposed patch addresses the above issues.

===================================================================
RCS file: RCS/strftime.c,v
retrieving revision 2001.4.1.1
retrieving revision 2001.4.0.2
diff -pu -r2001.4.1.1 -r2001.4.0.2

Chuck Soper | 9 Sep 2004 12:03

Re: Localization Facilitation

Thank you for responding to my long email and I'm sorry for my 
delayed response.

I think that my original post went too far towards addressing 
localization issues. I will scale back that suggestion and try to 
explain my rationale.

When an operating system organization, a software developer, or
perhaps the common locale data repository project (CLDR) needs to
display time zone location names (based on zone info files) in
English or localized to another language, zone.tab and the other
files in tzdata are generally the starting point. Many TZIDs do not
(nor should they) express the city and state/province for the TZID. I
know of at least 48 such TZIDs.

I suggest that two new columns be added to zone.tab, location (city) 
and sub-division (state/province). The sub-division would be for the 
city not the entire time zone region. For example, Illinois for 
America/Chicago. For many or most TZIDs the sub-division field may be 
blank. I'm fairly sure that most or all of the information to do this 
is already in the tzdata files.

I believe that the task of assembling English readable tz location 
names that correspond with TZIDs for the purpose of display (English 
or localized) has been done many times. Each time this task is done 
it would be reasonable to expect slightly different results.

Having these two new columns, location and sub-division, would add 
consistency and improve maintenance of tz location names. Also, 
having the columns would prevent the task from continually being 


RE: Proposed strftime.c changes for long years

Based on Paul Eggert's feedback, below find an updated set of changes to
avoid problems with overflow when TM_YEAR_BASE is added to tm_year.
I eliminated the tm_year #define that was designed to catch coding problems;
I moved the asctime_r define to private.h (and it is now
conditionalized); I changed strftime's handling of the 'C' format to use
_conv rather than _lconv; and I added a note to strftime about the need to
figure out what to do when a format asks for the last two digits of a year
(or the century of a year) and the year is negative (or less than 100).

				--ado

------- private.h -------
*** /tmp/geta29972	Thu Sep  9 11:48:56 2004
--- /tmp/getb29972	Thu Sep  9 11:48:56 2004
***************
*** 21,27 ****

  #ifndef lint
  #ifndef NOID
! static char	privatehid[] = "@(#)private.h	7.53";
  #endif /* !defined NOID */
  #endif /* !defined lint */

--- 21,27 ----

  #ifndef lint
  #ifndef NOID
! static char	privatehid[] = "@(#)private.h	7.54";
  #endif /* !defined NOID */
  #endif /* !defined lint */

Paul Eggert | 9 Sep 2004 22:33

Re: Localization Facilitation

Chuck Soper <chucks <at> lmi.net> writes:

> I suggest that two new columns be added to zone.tab, location (city)
> and sub-division (state/province). The sub-division would be for the
> city not the entire time zone region. For example, Illinois for
> America/Chicago.

But America/Chicago identifies a fairly large chunk of the United
States, including Iowa, Missouri, most of (but not all of) Kansas,
etc.

The main idea behind the "America/Chicago" and the current
latitude/longitude is to identify a single point in the region, a
point that will continue to be identified if the region splits (an
event that occurs from time to time).  The latitude/longitude is a
quite-inadequate substitute for what is really needed (namely, the
entire region boundary), but it's the best we've got right now.  I
worry that adding a column with data like "Illinois" would be a step
in the wrong direction, and would cause more confusion than it would
cure, since "Illinois" is an attribute of Chicago, and is not a direct
attribute of the America/Chicago TZID.

What we really need are the region boundaries (ideally hooked up to
GPS :-), or some data that will let us derive the region boundaries
from other databases.  The current "comments" column is an informal
attempt in that direction, and I'd rather focus our efforts there.
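
For readers unfamiliar with the file under discussion, the sketch
below reads one zone.tab line into its four tab-separated columns:
country code, latitude/longitude, TZID, and the optional comments.
The struct and function names are hypothetical, the buffer sizes are
arbitrary assumptions, and the sample line in the note that follows is
illustrative rather than quoted from any particular tzdata release.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one zone.tab row. */
struct zone_tab_entry {
	char	country[3];	/* ISO 3166 two-letter code */
	char	coords[16];	/* e.g. +DDMMSS-DDDMMSS */
	char	tzid[64];	/* e.g. America/Chicago */
	char	comments[128];	/* optional free text; may be empty */
};

/*
** Hypothetical helper: parse one line of zone.tab.
** Returns 1 if at least the first three columns were read,
** 0 for comment lines and for lines that do not match.
*/
int
parse_zone_tab_line(const char *line, struct zone_tab_entry *e)
{
	int	n;

	memset(e, 0, sizeof *e);
	if (line[0] == '#')
		return 0;
	/* White space in the format matches the tab separators. */
	n = sscanf(line, "%2s %15s %63s %127[^\n]",
		e->country, e->coords, e->tzid, e->comments);
	return n >= 3;
}
```

Given a line such as "US<tab>+415100-0873900<tab>America/Chicago<tab>Central Time",
the entry's tzid is "America/Chicago" and its comments field is
"Central Time".  Nothing in the row says "Illinois", and nothing
describes the region boundary.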

Paul Eggert | 9 Sep 2004 23:56

Re: Proposed strftime.c changes for long years

"Olson, Arthur David (NIH/NCI)" <olsona <at> dc37a.nci.nih.gov> writes:

> + ** XXX To do: figure out correct (as distinct from standard-mandated)
> + ** output for "two digits of year" and "century" formats when
> + ** the year is negative or less than 100. --ado, 2004-09-09

Hmm, what's the problem for years less than 100?  %C is clearly
zero-origin, since it reports 19 for 1999 and 20 for 2000 (which isn't
the same as "19th century" or "20th century").  So even though the
year 50 is 1st-century, its %C should be 00.

Maybe there's some doubt about the proper value for the year 0 (i.e.,
the year 1 B.C. according to the Venerable Bede's system) but I don't
see any doubt about years 1 through 99.
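
The zero-origin behaviour can be checked against whatever strftime the
local C library provides.  This is only a sketch: century_field is a
hypothetical wrapper, and what a given library prints for years below
100 or below 0 is exactly the open question, so only the 1999 and
2000 cases are stated here.

```c
#include <stddef.h>
#include <time.h>

/*
** Hypothetical wrapper: format the %C field for a calendar year
** into buf.  Returns the number of characters written, as
** strftime does (0 if the result does not fit).
*/
size_t
century_field(int year, char *buf, size_t size)
{
	struct tm	tm = {0};

	tm.tm_year = year - 1900;	/* tm_year counts from 1900 */
	tm.tm_mday = 1;			/* keep the struct plausible */
	return strftime(buf, size, "%C", &tm);
}
```

Calling century_field(1999, buf, sizeof buf) leaves "19" in buf, and
the year 2000 gives "20".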

One other point that might be worth mentioning somewhere: this whole
set of patches is motivated by the common case where time_t and long
are 64 bits and int is 32 bits, but it doesn't suffice for some other
cases, e.g., int==long==32 bits and time_t==64 bits (a case that is
not allowed by C89 but is allowed by C99).  Personally I'm inclined to
not worry about these weird cases unless a real portability problem
arises, just as we currently don't worry about the case where
time_t==float (which the standards have always allowed).

Garrett Wollman | 10 Sep 2004 00:12

Re: Proposed strftime.c changes for long years

<<On Thu, 09 Sep 2004 14:56:44 -0700, Paul Eggert <eggert <at> CS.UCLA.EDU> said:

> are 64 bits and int is 32 bits, but it doesn't suffice for some other
> cases, e.g., int==long==32 bits and time_t==64 bits (a case that is
> not allowed by C89 but is allowed by C99).

We've talked a bit about doing this for FreeBSD/i386 (so that all
architectures would use the same width time_t), but such a transition
would be so difficult that I doubt it would happen.

-GAWollman

