Rich Felker | 1 Apr 01:10 2007

Re: perl unicode support

On Sat, Mar 31, 2007 at 06:56:05PM -0400, Daniel B. wrote:
> > > > Normally, you should not have to ever convert strings between
> > > > encodings.
> > >
> > > Then how do you process, say, a multi-part MIME body that has parts
> > > in different character encodings?
> > 
> > Excellent example. Email is absolutely something that you can work
> > with on a byte-by-byte basis and have no need for considering
> > characters. 
> 
> What operations are you excluding when you say "work with?"  You're
> being quite non-specific.  Maybe that's part of the cause of our
> arguing.

Indeed, that would be good to clarify.

> Certainly searching for a given character string across multiple
> MIME parts requires handling different encodings for different parts.

Not if it was all converted at load-time.

Rich

Rich Felker | 1 Apr 01:05 2007

Re: perl unicode support

On Sat, Mar 31, 2007 at 06:36:06PM -0400, Daniel B. wrote:
> ???????? wrote:
>  
> 
> > > > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
> > > > instill faith...
> > >
> > > Maybe you should think more clearly.  I didn't write my mailer, so the
> > > quality of its behavior doesn't reflect my knowledge.
> > 
> > > it does reflect your lack of interest in getting your email utf-8 compatible.
> 
> How the hell do you think you know what it reflects?  (Have you ever
> considered it might have something to do with bookmark management?)

Just because you insist on using an ancient, horribly broken,
proprietary web browser to manage your bookmarks doesn't mean you have
to use it for email too... especially when it breaks email so badly.
In any case, I think it reflects priorities, and it also indicates that
you're using backwards software, which goes along with discussing
the UTF-8 issue as if we were living in 1997 instead of 2007.

All of this is stuff you're entitled to do if you like, and it's not
really my business to tell you what you should be using. But it
does reframe the discussion.

Rich

Daniel B. | 1 Apr 01:44 2007

Re: perl unicode support

Rich Felker wrote:
> 

> > > Other similar problem: I open a file in a text editor and it contains
> > > illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
> >
> > Again, you seem to be dealing with special cases.
> 
> Again, software which does not handle corner cases correctly is crap.

Why are you confusing "special-case" with "corner case"?

I never said that software shouldn't handle corner cases such as illegal
UTF-8 sequences.

I meant that an editor that handles illegal UTF-8 sequences other than
by simply rejecting the edit request is a bit of a special case compared
to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or 
reject any illegal sequences.  The software isn't failing to handle the 
corner case; it is handling it--by explicitly rejecting it.

> > If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
> > would you expect a UTF-8 text editor to work on it?
> 
> I expect my text editor to be able to edit any file without corrupting
> it. 

Okay, then it's not a UTF-8-only text editor.

Rich Felker | 1 Apr 07:33 2007

Re: perl unicode support

On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > Again, software which does not handle corner cases correctly is crap.
> 
> Why are you confusing "special-case" with "corner case"?
> 
> I never said that software shouldn't handle corner cases such as illegal
> UTF-8 sequences.
> 
> I meant that an editor that handles illegal UTF-8 sequences other than
> by simply rejecting the edit request is a bit of a special case compared
> to general-purpose software, say an XML processor, for which some
> specification requires (or recommends?) that the processor ignore or 
> reject any illegal sequences.  The software isn't failing to handle the 
> corner case; it is handling it--by explicitly rejecting it.

It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it
contains invalid sequences.

2. The editor either silently clobbers them, or presents some sort of
warning (which, as a newbie, I will skip past as quickly as I can) and
then clobbers them.

3. I save the file, and suddenly I’ve irreversibly destroyed huge
amounts of data.
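
The three steps above can be reproduced in a few lines. This is only a
sketch: the file contents are made up, and Python's errors="replace"
decoding stands in for whatever lossy loading a real editor might do.

```python
# Hypothetical file contents: valid UTF-8 with one stray 0xFF byte in it.
original = "naïve ".encode("utf-8") + b"\xff" + " text\n".encode("utf-8")

# Step 2: a lossy load silently turns the invalid byte into U+FFFD.
loaded = original.decode("utf-8", errors="replace")

# Step 3: saving writes U+FFFD back out as three bytes; the 0xFF is gone
# and nothing in the saved file says what it used to be.
saved = loaded.encode("utf-8")

print(saved == original)  # False
print(b"\xff" in saved)   # False
```

The user never edited anything, yet the file on disk has changed.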

It’s simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can

Ben Wiley Sittler | 1 Apr 09:00 2007

Re: perl unicode support

please before embarking on such a path think about what happens when
someone else happens to use an actual character in the PUA which
collides with your escape. better to use something invalid to
represent something invalid. markus kuhn said it best, see e.g. here:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html

and specifically, "option d", "Emit a malformed UTF-16 sequence for
every byte in a malformed UTF-8 sequence", basically each invalid
input 0xnn byte is mapped to the unpaired surrogate 0xDCnn (which are
all in the range 0xDC80 ... 0xDCFF). on output, the reverse is done
(unpaired surrogates from that range are mapped to the corresponding
bytes.)

the particular scheme described there has a name ("utf-8b") and
several implementations, and is widely applicable to situations
involving mixed utf-8 and binary data where the binary needs to be
preserved while also treating the utf-8 parts with Unicode or UCS
semantics.
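
A minimal sketch of that byte-to-surrogate mapping, using Python's built-in
"surrogateescape" error handler, which maps each undecodable byte 0xnn to
U+DCnn and reverses the mapping on output. This is not one of the utf-8b
implementations referred to above, and the sample bytes are invented for
illustration.

```python
# Invented input: valid UTF-8 (including a euro sign) plus two invalid bytes.
data = b"ok \xe2\x82\xac then junk \x80\xfe end"

# Decode: each invalid byte 0xnn becomes the unpaired surrogate U+DCnn.
text = data.decode("utf-8", errors="surrogateescape")
escapes = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escapes)  # ['0xdc80', '0xdcfe']

# The valid parts carry normal Unicode semantics and can be edited as text.
edited = text.replace("junk", "JUNK")

# Encode: the surrogates are mapped back to the original bytes.
print(text.encode("utf-8", errors="surrogateescape") == data)  # True
print(edited.encode("utf-8", errors="surrogateescape"))
```

An unedited round trip is byte-for-byte identical, and an edit to the valid
text leaves the invalid bytes where they were.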

-ben

On 3/31/07, Rich Felker <dalias <at> aerifal.cx> wrote:
> On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> > Rich Felker wrote:
> > > Again, software which does not handle corner cases correctly is crap.
> >
> > Why are you confusing "special-case" with "corner case"?
> >
> > I never said that software shouldn't handle corner cases such as illegal

Egmont Koblinger | 2 Apr 13:27 2007

Re: Perl Unicode support

On Fri, Mar 30, 2007 at 02:04:14PM -0400, Rich Felker wrote:

Hi,

> As you’ll see, the only arguments with which you can portably call
> setlocale are NULL, "", "C", "POSIX", and perhaps also a string
> previously returned by setlocale.

You can portably _call_ setlocale() with any argument, as long as you check
its return value and properly handle the case where it fails to fulfill your
request. The arguments you listed are probably those for which you can always
assume setlocale() to succeed. In the other cases you can still give it a
chance and see whether it succeeds.
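
A sketch of that "try it and check the result" pattern, written with Python's
locale module rather than C. There a refused request raises locale.Error
instead of returning NULL. The helper name and the second locale name are
made up for illustration, and whether an unusual name is accepted depends on
the platform's C library.

```python
import locale

def try_setlocale(name):
    """Return the locale string on success, or None if the request was refused."""
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return None

# "C" is one of the names that can be assumed to work everywhere.
print(try_setlocale("C"))

# A made-up name: the caller must be prepared for either outcome.
result = try_setlocale("xx_XX.no-such-charset")
if result is None:
    print("request refused; staying in the previous locale")
else:
    print("request accepted:", result)
```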

> I’m interested only in portable applications, not “GNU/Linux
> applications”.

Our goals differ. Since I'm developing a Linux distro, I'm only interested
in developing GNU/Linux applications. We don't have the resources to check
the portability of our applications, nor do we want to make our job harder by
working with only a subset of the available functions and re-implementing
what's already implemented in glibc. I don't think newer features get
implemented in glibc only to make it bigger; I think they are there for
developers to use when appropriate. They might not be appropriate
for a portable application, but they usually are appropriate for our goals.

> > the documentation of newlocale() and uselocale() and *_l() functions
> 
> These are nonstandard extensions and are a horrible mistake in design
> direction. Having the character encoding even be selectable at runtime

Daniel B. | 5 Apr 03:45 2007

Re: perl unicode support

Rich Felker wrote:
> 
> On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> > Rich Felker wrote:
> > > Again, software which does not handle corner cases correctly is crap.
> >
> > Why are you confusing "special-case" with "corner case"?
> >
> > I never said that software shouldn't handle corner cases such as illegal
> > UTF-8 sequences.
> >
> > I meant that an editor that handles illegal UTF-8 sequences other than
> > by simply rejecting the edit request is a bit of a special case compared
> > to general-purpose software, say an XML processor, for which some
> > specification requires (or recommends?) that the processor ignore or
> > reject any illegal sequences.  The software isn't failing to handle the
> > corner case; it is handling it--by explicitly rejecting it.
> 
> It is a corner case!

We seem to be having a communication problem, but I don't quite see
what the cause is.

I agree that it is a corner case.  However, what you wrote seems to
indicate that you think I don't or wouldn't.

(I was arguing that handling the corner case by doing something other
than simply rejecting the illegal UTF-8 sequences was a bit of a 
special case, just like, say, handling ill-formed XML is not something
a general XML processor (parser) has to do (it rejects it) but _is_ 

Daniel B. | 5 Apr 03:50 2007

Re: perl unicode support

Egmont Koblinger wrote:
>...
> 
> What do you mean by a browser's default encoding? Is it the encoding to be
> assumed for pages lacking charset specification? 

Isn't that defined by the HTTP or HTML specification? 

Daniel
--

-- 
Daniel Barclay
dsb <at> smart.net

Daniel B. | 5 Apr 04:05 2007

Re: perl unicode support

Rich Felker wrote:

> > Certainly searching for a given character string across multiple
> > MIME parts requires handling different encodings for different parts.
> 
> Not if it was all converted at load-time.

Huh?  (Converting at load time doesn't avoid the need to handle 
different encodings for different parts.)

I think I see part of our communication problem.  

(It seems to me that) you've read more into what I wrote than I actually
wrote, or have thought I'm arguing different points than I have been.

Above, assuming I recall correctly, I was responding to some earlier 
claim about just setting a single platform-level encoding and processing
everything according to that encoding (presenting the MIME multipart case 
as a counterexample--that you have to handle multiple encodings 
(regardless of whether you convert at load time or at search time)).  
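
To make the multipart case concrete, here is a small sketch using Python's
standard email package. The message is invented for illustration: two text
parts, one labelled iso-8859-1 and one utf-8. Converting at load time still
means decoding each part according to its own charset label; only after that
can a single character-level search cover both parts.

```python
from email import message_from_bytes
from email.policy import default

# Invented two-part message; the same word is encoded differently in each part.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="sep"\r\n'
    b"\r\n"
    b"--sep\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9 au lait\r\n"
    b"--sep\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"un autre caf\xc3\xa9\r\n"
    b"--sep--\r\n"
)

msg = message_from_bytes(raw, policy=default)

# "Load time": decode every text part using the charset it declares.
texts = [part.get_content() for part in msg.walk()
         if part.get_content_maintype() == "text"]

# One character-level search now covers both parts.
hits = [i for i, text in enumerate(texts) if "café" in text]
print(hits)  # [0, 1]

# A byte-level search for the UTF-8 form finds only one of the two parts.
print(raw.count("café".encode("utf-8")))  # 1
```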

Daniel
--

-- 
Daniel Barclay
dsb <at> smart.net

Daniel B. | 5 Apr 04:21 2007

Re: Perl Unicode support

Fredrik Jervfors wrote:
> 
...
> >
> > X writes a homepage in French, using either latin1 or utf8 encoding (but
> > mentions this encoding properly), and of course he uses all the french
> > letters, including e.g. è (e with grave accent).
> >
> > Y is sitting in Poland for example, using a system configured to use a
> > latin2 locale by default. Latin2 lacks e with grave accent. Y visits the
> > homepage of X with some popular graphical web browser.
> >
> > What should happen?
> >
> > Rich says that his browser must (or should?) think in latin2 and hence
> > drop the è letters, maybe replace them with unaccented e or question
> > marks or similar.
> >
> > I say that his browser must show è correctly, it doesn't matter what its
> > locale is.
> 
> That depends on the configuration of the browser.
> 
> The browser should by default (programmer's choice really) think in the
> encoding X used, since it's tagged with that encoding information.

In what sense do you mean "think in encoding X"?  (Are you talking about
internal browser operations like displaying text, or external operations
like saving files?)
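
What "thinking in latin2" would cost in the quoted scenario can be shown in a
few lines. This is a sketch only: the page text is hypothetical, and encoding
with errors="replace" stands in for whatever substitution a browser might make.

```python
page_text = "très"  # hypothetical page content, held as Unicode characters

# Latin-2 (ISO-8859-2) has no e with grave accent, so strict conversion fails.
try:
    page_text.encode("iso-8859-2")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False

# Forcing the text through the locale's charset loses the letter.
print(page_text.encode("iso-8859-2", errors="replace"))  # b'tr?s'

# Latin-1 and UTF-8, the two encodings X might have used, both carry it.
print(page_text.encode("iso-8859-1"))  # b'tr\xe8s'
print(page_text.encode("utf-8"))       # b'tr\xc3\xa8s'
```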
