Konstantin | 21 May 2012 20:59

detect an email with japanese characters


Hi,

How is it possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?

The Content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
necessarily contain .jp

Regards,
Konstantin.
Alan Clifford | 21 May 2012 22:32

Re: detect an email with japanese characters

Konstantin wrote (at 14:59 (-0400) on Monday, 21st May, 2012):

>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese 
> characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is 
> apparently invalid, and may not necessarily contain .jp
>

I took a copy of this some time ago.  It might be useful

http://clifford.ac/chinese.html

The mentioned files are in chinese.zip

Alan
Robert Bonomi | 21 May 2012 23:52

Re: detect an email with japanese characters

> From procmail-bounces <at> lists.RWTH-Aachen.de  Mon May 21 14:01:49 2012
> Date: Mon, 21 May 2012 14:59:41 -0400
> From: Konstantin <klk206 <at> panix.com>
> To: procmail <at> lists.RWTH-Aachen.de
> Subject: detect an email with japanese characters
>
>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
> necessarily contain .jp
>
> Regards,
> Konstantin.
>
> ____________________________________________________________
> procmail mailing list   Procmail homepage: http://www.procmail.org/
> procmail <at> lists.RWTH-Aachen.de
> http://mailman.rwth-aachen.de/mailman/listinfo/procmail
>
Robert Bonomi | 22 May 2012 01:25

Re: detect an email with japanese characters


 Konstantin <klk206 <at> panix.com> wrote:
>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese
> characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is apparently
> invalid, and may not necessarily contain .jp

What I do is:

  a) specify a list of charsets that I understand:

     OK_CHARSET=(ASCII|DISPAY|ISO-8859-[12]|WINDOWS-125[012]|utf-8|utf8)

  b) filter anything that (1) specifies charset, and (2) does -not- have
     one of those charsets:

     :0 H
     * ^(From|To|Subject): *\=\?\/.*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD

     :0 H
     * ^Content-Type:.*charset\/.*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD
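
For anyone who would rather do this check in a script called from procmail, here is a minimal Python sketch of the same allow-list idea. It is not the recipe above: the set `OK_CHARSETS` and the function names are illustrative assumptions, and the set only approximates the OK_CHARSET list. As with the recipes, a message declared as utf-8 passes this check regardless of the script it is written in.

```python
import re
from email import message_from_string

# Illustrative allow-list, loosely mirroring the OK_CHARSET idea (assumption).
OK_CHARSETS = {
    "ascii", "us-ascii", "iso-8859-1", "iso-8859-2",
    "windows-1250", "windows-1251", "windows-1252", "utf-8", "utf8",
}

# An RFC 2047 encoded word has the shape =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[bBqQ]\?[^?]*\?=")


def declared_charsets(raw_message):
    """Collect every charset named in From/To/Subject or in Content-Type."""
    msg = message_from_string(raw_message)
    found = set()
    for name in ("From", "To", "Subject"):
        for value in msg.get_all(name, []):
            for match in ENCODED_WORD.finditer(str(value)):
                found.add(match.group(1).lower())
    for part in msg.walk():
        charset = part.get_param("charset")
        if isinstance(charset, str):
            found.add(charset.lower())
    return found


def has_unwanted_charset(raw_message):
    """True if the message declares a charset outside the allow-list."""
    return bool(declared_charsets(raw_message) - OK_CHARSETS)
```

A message that declares no charset at all yields an empty set and is therefore let through, which matches condition (1) of the recipe description.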

LuKreme | 22 May 2012 03:31

Re: detect an email with japanese characters

On May 21, 2012, at 17:25, Robert Bonomi <bonomi <at> mail.r-bonomi.com> wrote:

> Note: you cannot 'safely' drop 'anything' with such a glyph in it
>     since Microsoft products routinely use several 3-byte glyphs --
>     things like 'smartquotes', dashes, etc.   (*snarl*)

Oh, it's not just MSFT, there are many high byte characters in UTF-8 that are perfectly usable and proper.
The days of 7-bit email are long behind us, and that's a good thing.
Robert Bonomi | 22 May 2012 04:44

Re: detect an email with japanese characters

> From procmail-bounces <at> lists.RWTH-Aachen.de  Mon May 21 20:34:42 2012
> Subject: Re: detect an email with japanese characters
> From: LuKreme <kremels <at> kreme.com>
> Date: Mon, 21 May 2012 19:31:42 -0600
> To: "procmail <at> lists.RWTH-Aachen.de" <procmail <at> lists.RWTH-Aachen.de>
>
> On May 21, 2012, at 17:25, Robert Bonomi <bonomi <at> mail.r-bonomi.com> wrote:
>
> > Note: you cannot 'safely' drop 'anything' with such a glyph in it
> >     since Microsoft products routinely use several 3-byte glyphs --
> >     things like 'smartquotes', dashes, etc.   (*snarl*)
>
> Oh, it's not just MSFT, there are many high byte characters in UTF-8 that
> are perfectly usable and proper. The days of 7-bit email are long behind
> us, and that's a good thing.

In 'western' usage, it is exceedingly rare to -need- anything beyond the 
so-called C0 through C3 glyph sets (roughly 256 'printable' symbols).

Microsoft is well known for its egregious MISUSE of UTF-8 multi-byte 
glyphs.  *Especially* in documents that are identified as using something 
_other_ than UTF-8.  One simply cannot 'trust' MS products to get the 
'content-type' right.  Their products are notorious for, say, _declaring_
a document as 'iso-8859-1' or 'Windows-1251', but including in that 
document a handful of UTF-8 3-byte sequences from the '0xe2', '0xe7', 
and '0xef' ranges.  
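
To make the byte-level point concrete, here is a small Python sketch (the helper name `utf8_hex` is made up for illustration). It shows that smart quotes and Japanese kana are all 3-byte UTF-8 sequences, so "is 3 bytes long" cannot separate them; the lead byte and the code point can.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


samples = {
    "left double quotation mark (U+201C)": "\u201c",
    "right single quotation mark (U+2019)": "\u2019",
    "hiragana A (U+3042)": "\u3042",
    "katakana KA (U+30AB)": "\u30ab",
    "CJK ideograph (U+65E5)": "\u65e5",
}

for label, ch in samples.items():
    # Quotes start with e2; kana start with e3; this ideograph starts with e6.
    print(f"{label}: {utf8_hex(ch)}")
```

The two quotation marks encode as `e2 80 9c` and `e2 80 99`, while the kana encode as `e3 81 82` and `e3 82 ab`, so a filter keyed on the whole `0xe2` range would catch ordinary typographic punctuation too.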

For processing arbitrary e-mail from a Microsoft product, one has to
essentially throw away the declared charset, parse out the 'valid'
ASCII/ISO-8859/WINDOWS-125x/UTF-8 glyphs that one can recognize, and
(Continue reading)

Konstantin | 23 May 2012 04:33

Re: detect an email with japanese characters


Great! Thank you all for suggestions.

Konstantin.

On 5/21/2012 4:32 PM, Alan Clifford wrote:
> Konstantin wrote (at 14:59 (-0400) on Monday, 21st May, 2012):
> 
>>
>> Hi,
>>
>> How it is possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?
>>
>> The content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
>> necessarily contain .jp
>>
> 
> I took a copy of this some time ago. It might be useful
> 
> http://clifford.ac/chinese.html
> 
> The mentioned files are in chinese.zip
> 
> 
> Alan
> 

Re: detect an email with japanese characters

At 19:33 2012-05-22, Konstantin wrote:

>Great! Thank you all for suggestions.

One late arrival:

Check out "furrin.rc" at:

<http://www.professional.org/procmail/spam.html>

That groups various character sets and checks for hibit characters, 
etc.  There are a number of links to references and pertinent RFCs as well.

I wrote it quite a few years ago (last time that was even altered was 
9 years ago), and it's been quite effective for me.

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.
LuKreme | 24 May 2012 00:19

Re: detect an email with japanese characters

PSE-L <at> mail.professional.org (Professional Software Engineering) spake on Tuesday 22-May-2012 @ 21:58:42
> At 19:33 2012-05-22, Konstantin wrote:
> 
>> Great! Thank you all for suggestions.
> 
> One late arrival:
> 
> Check out "furrin.rc" at:
> 
> <http://www.professional.org/procmail/spam.html>
> 
> That groups various character sets and checks for hibit characters, etc.  There are a number of links to
> references and pertinent RFCs as well.
> 
> I wrote it quite a few years ago (last time that was even altered was 9 years ago), and it's been quite
> effective for me.

furrin.rc does a pretty decent job, but if there are no hibit characters in the subject, as there often
aren't, then it fails on utf-8 encoded foreign spam. OTOH, body checks for hibit characters seem a rather high price to pay.
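
If the cost of a body check is acceptable, one way to do it outside procmail is to decode the text parts and count code points in the Japanese blocks. This is only a sketch under assumptions: the ranges cover hiragana, katakana and the CJK Unified Ideographs block (which Chinese text also uses), and the 0.2 threshold and all function names are arbitrary illustrative choices, not anything furrin.rc does.

```python
from email import message_from_bytes

# Hiragana, katakana, CJK Unified Ideographs (the last is shared with Chinese).
JAPANESE_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))


def japanese_ratio(text):
    """Fraction of non-whitespace characters that fall in JAPANESE_RANGES."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in JAPANESE_RANGES)
    )
    return hits / len(chars)


def body_text(raw):
    """Decode and join every text/* part of a raw message given as bytes."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with replacement.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def looks_japanese(raw, threshold=0.2):
    """True if at least `threshold` of the body characters look Japanese."""
    return japanese_ratio(body_text(raw)) >= threshold
```

Because the check runs on decoded text, it does not care whether the message arrived as utf-8, base64 or quoted-printable, which is exactly the case the header-only tests miss.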

--

-- 
Against stupidity the gods themselves contend in vain.
Jerry K | 24 May 2012 03:46

PARP ::was::::Re: procmail abandonded?


Has anyone looked at/or is using PARP as a procmail replacement?

http://adamspiers.org/computing/parp/

In my brief review, it seems to have several pluses and minuses.

Thanks for any comments,

Jerry

On 10/14/10 12:29 PM, Ed Blackman wrote:
> On Thu, Oct 14, 2010 at 08:52:26AM -0500, cbarnard wrote:
>> It works, so why mess with it?  It does what it needs, no more
>> development is needed...
> 
> If procmail was still processing the mail stream of 10 years ago, I'd
> agree.  But increasingly I'm seeing headers that procmail doesn't handle
> well without external help, especially RFC2047 encoded strings.  A user
> sees "Subject: test message" in their mail reader, and creates a
> "^Subject: test message" recipe, but doesn't understand why it doesn't
> match.  That's because it was sent as =?ISO-8859-1?Q?test=20message?= or
> =?UTF-8?B?dGVzdCBtZXNzYWdlCg==?= or even =?US-ASCII?Q?test=20message?=. 
> Mail readers decode those strings for display, procmail can't without
> assistance.
> 
> Shelling out to perl or whatever for the decode is just kludgy when it's
> a handful of messages per week.  But I'm seeing an uptick in emails that
> have RFC2047-encoded headers when none of the characters actually
> required encoding, suggesting to me that some tools are encoding by
(Continue reading)
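
As an illustration of the "external help" the quoted message refers to, here is a minimal Python sketch that decodes RFC 2047 encoded header values with the standard library. The wrapper name `decode_rfc2047` is made up here; `decode_header` and `make_header` are the real `email.header` functions.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Decode an RFC 2047 encoded header value into a plain string."""
    return str(make_header(decode_header(value)))


examples = [
    "=?ISO-8859-1?Q?test=20message?=",
    "=?UTF-8?B?dGVzdCBtZXNzYWdlCg==?=",
    "=?US-ASCII?Q?test=20message?=",
]

for raw in examples:
    # The base64 example carries a trailing newline in its payload,
    # so strip before comparing or matching.
    print(repr(decode_rfc2047(raw).strip()))
```

All three strings from the quoted message decode to "test message" once stripped, so a recipe matching the decoded subject would see the same text the mail reader displays.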

