Re: detect an email with japanese characters
Robert Bonomi <bonomi <at> mail.r-bonomi.com>
2012-05-22 02:44:45 GMT
> From procmail-bounces <at> lists.RWTH-Aachen.de Mon May 21 20:34:42 2012
> Subject: Re: detect an email with japanese characters
> From: LuKreme <kremels <at> kreme.com>
> Date: Mon, 21 May 2012 19:31:42 -0600
> To: "procmail <at> lists.RWTH-Aachen.de" <procmail <at> lists.RWTH-Aachen.de>
>
> On May 21, 2012, at 17:25, Robert Bonomi <bonomi <at> mail.r-bonomi.com> wrote:
>
> > Note: you cannot 'safely' drop 'anything' with such a glyph in it
> > since Microsoft products routinely use several 3-byte glyphs --
> > things like 'smartquotes', dashes, etc. (*snarl*)
>
> Oh, it's not just MSFT, there are many high byte characters in UTF-8 that
> are perfectly usable and proper. The days of 7-bit email are long behind
> us, and that's a good thing.
In 'western' usage, it is exceedingly rare to -need- anything beyond the
so-called C0 through C3 glyph sets (roughly 256 'printable' symbols).
Microsoft is well known for its egregious MISUSE of UTF-8 multi-byte
glyphs. *Especially* in documents that are identified as using something
_other_ than UTF-8. One simply cannot 'trust' MS products to get the
'content-type' right. Their products are notorious for, say, _declaring_
a document as 'iso-8859-1' or 'Windows-1251', but including in that
document a handful of UTF-8 3-byte sequences from the '0xe2', '0xe7',
and '0xef' ranges.
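
To make the mismatch described above concrete, here is a rough Python sketch
(not a procmail recipe) that looks for 3-byte UTF-8 sequences in a body whose
declared charset is a single-byte one. The function name, the charset list,
and the sample bytes are illustrative assumptions, not anything taken from
this thread. The check is only a heuristic: the same byte triples can, in
principle, be legitimate text in a single-byte charset.

```python
import re

# A 3-byte UTF-8 sequence: lead byte 0xE0-0xEF plus two continuation bytes
# 0x80-0xBF. Overlong and surrogate forms are filtered by the decode below.
UTF8_3BYTE = re.compile(rb"[\xe0-\xef][\x80-\xbf][\x80-\xbf]")

# Hypothetical list of single-byte charsets to check against (assumption).
SINGLE_BYTE_CHARSETS = {"iso-8859-1", "windows-1251", "windows-1252"}


def find_stray_utf8(body, declared_charset):
    """Return the characters encoded as 3-byte UTF-8 in a body that was
    declared as a single-byte charset; empty list otherwise."""
    if declared_charset.lower() not in SINGLE_BYTE_CHARSETS:
        return []
    found = []
    for match in UTF8_3BYTE.finditer(body):
        try:
            found.append(match.group().decode("utf-8"))
        except UnicodeDecodeError:
            # Matched the byte pattern but is not valid UTF-8; skip it.
            pass
    return found


# Illustrative body: 'smart quotes' and a dash as UTF-8, plus one genuine
# ISO-8859-1 byte (0xE9, e-acute) that must not be reported.
sample = b"He said \xe2\x80\x9chello\xe2\x80\x9d \xe2\x80\x94 caf\xe9"
stray = find_stray_utf8(sample, "ISO-8859-1")
```

With the sample above, `stray` holds the left and right double quotation
marks and the dash (U+201C, U+201D, U+2014), while the lone 0xE9 byte is
left alone because it is not followed by two continuation bytes.
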
For processing arbitrary e-mail from a Microsoft product, one has to
essentially throw away the declared charset, parse out the 'valid'
ASCII/ISO-8859/WINDOWS-125x/UTF-8 glyphs that one can recognize, and