Charles Lindsey | 1 Nov 2000 11:37

Re: Mail-Copies-To.00

In <Pine.LNX.4.10.10010311102080.2196-100000 <at> spock.peak.org> John Stanley <stanley <at> peak.org> writes:

>I don't see that in the definition of the From: header, and like I said,
>even if it was there, it's defining mail actions in a news standard. I
>thought we couldn't do that, according to what people say here.

Well we do do that in our document, so it had better be all right.

6.1.  Reply-To

   The Reply-To header specifies a reply address(es) to be used for
   personal replies for the author(s) of the article when this is
   different from the author's address(es) given in the From header...

I think that makes it pretty clear that both From and Reply-To are provided
for the purpose of facilitating replies by email. There is lots more in
the draft that implies the same.

>It's only when a few people decide they must send mail that any DNS
>lookups are done, and the load is usually less (for an immediately
>rejected invalid host name like "aol.com.nospam"). In this case, I will
>agree there are unnecessary lookups, but that's not because the address is
>munged, it is because people think they have to send email. If they don't
>want to eat the cost of the DNS lookups, then how in God's name can they
>pay for actually sending an email to someone?

Exactly so. Moreover, my understanding is that failed DNS lookups do not
get cached, and so put an extra load on the DNS system. No, this won't
bring the internet to its knees, but it can still fairly be described as
an "interoperability" issue.

Henry Spencer | 1 Nov 2000 14:39

Re: Mail-Copies-To.00

On Wed, 1 Nov 2000, Charles Lindsey wrote:
> ...Moreover, my understanding is that failed DNS lookups do not
> get cached, and so put an extra load on the DNS system...

Such "negative caching" was part of the design of DNS from the start,
although it is technically optional and not all implementations do it.

                                                          Henry Spencer
                                                       henry <at> spsystems.net

Bjoern Hoehrmann | 2 Nov 2000 07:08

Recode articles to UTF-8

Hi,

I want to recode Usenet articles to UTF-8. My problem is how I should treat
the header. RFC 2047 defines a syntax like '=?ISO-8859-1?Q?J=FCnger?=' so that
headers contain US-ASCII characters only (and I wrote a Perl script to recode
this to UTF-8, see [1]), but this encoding is not always used. In de.ALL I
often see headers like:

  Subject: Frühstück

sent as raw ISO-8859-1. A worse case would mix the two, like

  Subject: Fr=?ISO-8859-1?Q?=FC?=stück

Now I could first encode a header field to MIME encoded-words and then decode
those encoded-words to UTF-8. The problem is that I would have to assume
ISO-8859-1 as the charset, which conflicts with [2], which says that I have to
assume UTF-8.

Is anything wrong so far?

OK, if I want to be compatible with a new news article format standard I have
to consider both cases, UTF-8 and ISO-8859-1, in the article header. Is there a
way to determine whether ISO-8859-1 or UTF-8 is used in a certain string? If
there is no way, wouldn't the change to UTF-8 break compatibility? Or is it
illegal to use ISO-8859-1 characters in article headers (I thought so)?
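
For the RFC 2047 case described above, here is a minimal Python sketch (it is
not the Perl script referenced in [1]) that uses the standard library's
email.header.decode_header. The function name decode_rfc2047 and the choice of
ISO-8859-1 as the fallback for unlabelled byte chunks are assumptions made for
illustration only.

```python
from email.header import decode_header


def decode_rfc2047(value: str, fallback: str = "iso-8859-1") -> str:
    """Decode RFC 2047 encoded-words in a header value to a Python str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-words carry their own charset label; unlabelled byte
            # chunks get the (assumed) fallback charset.
            parts.append(chunk.decode(charset or fallback, errors="replace"))
        else:
            # A header without any encoded-words comes back as a plain str.
            parts.append(chunk)
    return "".join(parts)


subject = decode_rfc2047("=?ISO-8859-1?Q?J=FCnger?=")
utf8_bytes = subject.encode("utf-8")  # recoded form of the header text
```

Raw 8-bit octets that arrive without any charset label are not handled here;
that is the guessing problem discussed in the rest of the thread.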

[1] news:3a0ae2f3.15874526 <at> news.bjoern.hoehrmann.de
[2] http://www.ietf.org/internet-drafts/draft-ietf-usefor-article-03.txt
--
Björn Höhrmann ^ mailto:bjoern <at> hoehrmann.de ^ http://www.bjoernsworld.de

Clive D.W. Feather | 2 Nov 2000 15:10

Re: Recode articles to UTF-8

Bjoern Hoehrmann said:
> ISO-8859-1 in the article header. Is there a
> way to determine whether ISO-8859-1 or UTF-8 is used in a certain string?

Only heuristically. There exist strings that are both well-formed
ISO-8859-1 strings and well-formed UTF-8 strings (e.g. 0xC3 0xA2 is either
A-tilde followed by cent sign, or it's UTF-8 for a-circumflex).
But, basically:

* If all octets are 0x00 to 0x7F, it doesn't matter.

* ISO-8859-1 shouldn't contain octets 0x80 to 0x9F, so if you see those
  it's probably UTF-8.

* UTF-8 should only contain octets with the top bit set in one of the
  patterns:
  - one octet 0xC2 to 0xDF, then one octet 0x80 to 0xBF
  - one octet 0xE0 to 0xEF, then two octets 0x80 to 0xBF
  - one octet 0xF0 to 0xF7, then three octets 0x80 to 0xBF
  - one octet 0xF8 to 0xFB, then four octets 0x80 to 0xBF
  - one octet 0xFC to 0xFD, then five octets 0x80 to 0xBF
  so if you see any other case then it's probably ISO-8859-1.

A simple heuristic would be to count the number of times each of the latter
two rules is broken, and choose the encoding with fewer violations. If neither
rule is broken, you're in trouble.
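
A minimal Python sketch of that counting heuristic, following the octet ranges
exactly as listed above (including the five- and six-octet forms). The function
names and the "ambiguous" return value are assumptions chosen for this
illustration, not part of any standard.

```python
def count_latin1_violations(data: bytes) -> int:
    """Count octets in 0x80-0x9F, which ISO-8859-1 text should not contain."""
    return sum(1 for octet in data if 0x80 <= octet <= 0x9F)


def continuation_count(lead: int) -> int:
    """Continuation octets expected after a lead octet, per the list above."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    if 0xF8 <= lead <= 0xFB:
        return 4
    if 0xFC <= lead <= 0xFD:
        return 5
    return 0  # not a valid lead octet


def count_utf8_violations(data: bytes) -> int:
    """Count high-bit octets that do not fit any of the listed patterns."""
    violations = 0
    i = 0
    while i < len(data):
        octet = data[i]
        if octet < 0x80:
            i += 1
            continue
        need = continuation_count(octet)
        tail = data[i + 1 : i + 1 + need]
        if need and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            i += 1 + need  # well-formed sequence, skip over it
        else:
            violations += 1
            i += 1
    return violations


def guess_charset(data: bytes) -> str:
    """Pick the encoding whose rule is broken less often."""
    latin1 = count_latin1_violations(data)
    utf8 = count_utf8_violations(data)
    if utf8 < latin1:
        return "utf-8"
    if latin1 < utf8:
        return "iso-8859-1"
    return "ambiguous"  # pure ASCII, or the "in trouble" case
```

Note that "Frühstück" encoded as UTF-8 breaks neither rule (0xC3 0xBC is also
acceptable ISO-8859-1), so this sketch reports it as "ambiguous"; a tie-break
policy is still needed for such strings.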

-- 
Clive D.W. Feather  | Work:  <clive <at> demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive <at> davros.org>  | Fax:  +44 20 8371 1037

Clive D.W. Feather | 2 Nov 2000 17:22

Re: Mail-Copies-To.00

John Stanley said:
>> And 
>> munging it has visible evil effects on the network (like causing extra 
>> unnecessary DNS lookups). 
> 
> First of all, "lookups" are not visible to anyone not running a
> network monitor when it happens, so "visible" is a specious claim.
[...]

A couple of years ago I got the operator of a root name server to run some
stats for me.

* The network connection to the server was not capable of feeding in
  requests fast enough to overload the CPU.

* Approximately 4% of requests were for real TLDs and 0.7% for obviously
  bogus ones like .nospam.

* Something like 95% of the traffic was looking for "NT-SERVER" and
  "WORKGROUP".

-- 
Clive D.W. Feather  | Work:  <clive <at> demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive <at> davros.org>  | Fax:  +44 20 8371 1037
Demon Internet      | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc            |                            | Mobile: +44 7973 377646 

John Stanley | 2 Nov 2000 18:36

Re: Mail-Copies-To.00


Charles Lindsey (chl <at> clw.cs.man.ac.uk):

Quoting the Reply-To header definition...

> I think that makes it pretty clear that both From and Reply-To are provided
> for the purpose of facilitating replies by email. 

Facilitating, hmmm. Keep that thought...

> Moreover, my understanding is that failed DNS lookups do not 
> get cached, and so put an extra load on the DNS system. 

The only way that failure to cache a failed lookup puts any "extra" load
on the "system" is if someone tries to look up the same address multiple
times. "Gee, it bounced when I sent email to foo <at> aol.comnospam, let's try
it again to see if the address is good now. Nope, let's try it again...
Gee, I wonder if the hundredth time will work. Nope, let's try 101..."

The only way to claim that a cached lookup saves significant resources (and
thus that a non-cached failure expends them) is if a significant number of
people are trying to send email to the same address through the same system.
This is rarely, if ever, the case. For it to be significant, it would take a
quantity of email that would fill the recipient's mailbox and start
resulting in bounces anyway.
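
To make the argument concrete, here is a toy Python model (not real DNS code)
of a resolver with and without negative caching. ToyResolver, its zone
dictionary, and count_upstream are hypothetical names invented for this
sketch; TTLs and everything else about real resolvers are ignored.

```python
class ToyResolver:
    """Toy caching resolver; `zone` stands in for the authoritative servers."""

    def __init__(self, zone: dict, negative_caching: bool):
        self.zone = zone
        self.negative_caching = negative_caching
        self.cache = {}
        self.upstream_queries = 0

    def lookup(self, name: str):
        if name in self.cache:
            return self.cache[name]  # answered locally, no upstream load
        self.upstream_queries += 1
        address = self.zone.get(name)  # None models "no such host"
        if address is not None or self.negative_caching:
            self.cache[name] = address  # failures are kept only if enabled
        return address


def count_upstream(negative_caching: bool, attempts: int) -> int:
    """Upstream queries caused by repeated lookups of one bogus name."""
    resolver = ToyResolver({"aol.com": "192.0.2.1"}, negative_caching)
    for _ in range(attempts):
        resolver.lookup("aol.com.nospam")
    return resolver.upstream_queries
```

In this model a single lookup of a bogus name costs one upstream query either
way; the two configurations only diverge when the same name is looked up
repeatedly through the same resolver.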

> No, this won't 
> bring the internet to its knees, but it can still fairly be described as 
> an "interoperability" issue. 


Charles Lindsey | 2 Nov 2000 16:37

Re: Recode articles to UTF-8

In <004701c04493$9114e1c0$09cbb43e <at> de> "Bjoern Hoehrmann" <derhoermi <at> gmx.net> writes:

>OK, if I want to be compatible with a new news article format standard I have
>to consider both cases, UTF-8 and ISO-8859-1, in the article header. Is there a
>way to determine whether ISO-8859-1 or UTF-8 is used in a certain string? If
>there is no way, wouldn't the change to UTF-8 break compatibility? Or is it
>illegal to use ISO-8859-1 characters in article headers (I thought so)?

The draft of the new standard says that all headers are to be assumed to
be in UTF-8, so essentially you just go ahead and do it. You MAY use the
RFC 2047 nonsense if you want, but it will be deprecated.

Currently, of course, no one implements the new draft, and present practice
requires strict ASCII (or else the use of RFC 2047). It would, of course,
also be OK to use Fr=?UTF-8?Q?..., though I doubt many systems would
recognise it.

Use of raw ISO-8859-1 is WRONG in either the present or the future standard
(though that has not stopped people doing it, as you say).

If you want UTF-8 in the body, then you have to say so in the charset
parameter of the Content-Type. Again, few systems will currently
understand it. Content-Transfer-Encoding should always be 8bit.
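
As an illustration of labelling a UTF-8 body this way, here is a minimal sketch
using Python's standard email package. It builds a generic MIME message rather
than a complete news article, and the body text is just sample data.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Declare the body charset in Content-Type and send the octets unencoded.
msg.set_content("Fr\u00fchst\u00fcck\n", charset="utf-8", cte="8bit")

body_charset = msg.get_content_charset()
transfer_encoding = msg["Content-Transfer-Encoding"]
```

After set_content, the message carries a Content-Type header with a charset
parameter and a Content-Transfer-Encoding header, which is what a receiving
reader would need in order to interpret the body octets.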

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl <at> clw.cs.man.ac.uk  Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5


Terje Bless | 2 Nov 2000 23:27

Re: Mail-Copies-To.00

On 02.11.00 at 09:36, John Stanley <stanley <at> peak.org> wrote:

>The Internet wouldn't even notice it. But it is nice that you've admitted
>that "facilitating" is the definition de-jure for "interoperability".

I think it's a question of "loose" vs. "strict" interpretation of the term
"interoperability". We sometimes choose to use the term loosely and so to
mean "what's convenient for users" or "what causes the least amount of
grief" when properly we should always use it to mean "something that causes
significant harm to the infrastructure or operation of the Internet". The
trouble is that the latter interpretation is a sliding window depending on
each person's view of what constitutes "The Internet".

This is the same trouble we have with defining what's for GNKSA and what's
fair game for us and probably several other places as well. I think we
either need to get better at "agreeing to disagree" or get someone who can
cut to the chase and say "This is how it's gonna be. Deal with it!".
Isn't this latter what we have a Chair for, BTW? :-)

BTW, DNS lookups are a silly argument not because they cause only negligible
overhead for the net, but because spurious DNS lookups aren't the problem
we are trying to solve when it is brought up. "Interoperability" with
humans is, and they aren't impacted by the DNS lookups to any significant
degree.

-- 
"I don't want to learn to manage my anger; I want to FRANCHISE it!"

                          -- Kevin Martin <brasscannon <at> bigfoot.com>


Erland Sommarskog | 3 Nov 2000 10:38

Re: Recode articles to UTF-8

From: "Clive D.W. Feather" <clive <at> demon.net> writes:
> * ISO-8859-1 shouldn't contain octets 0x80 to 0x9F, so if you see those
>   it's probably UTF-8.

Alas, it might be CP1252, that is, Windows Latin-1, which is 8859-1
with printable characters added in the forbidden range.

> * UTF-8 should only contain octets with the top bit set in one of the
>   patterns:
>   - one octet 0xC2 to 0xDF, then one octet 0x80 to 0xBF
>   - one octet 0xE0 to 0xEF, then two octets 0x80 to 0xBF
>   - one octet 0xF0 to 0xF7, then three octets 0x80 to 0xBF
>   - one octet 0xF8 to 0xFB, then four octets 0x80 to 0xBF
>   - one octet 0xFC to 0xFD, then five octets 0x80 to 0xBF
>   so if you see any other case then it's probably ISO-8859-1.
>

If I were to do this, I would work from the assumption that if the
string is legal UTF-8, it is UTF-8.

Then again, overlong UTF-8 sequences are not strictly outlawed as
I understand, although they are certainly frowned upon. An overlong
sequence is when you encode a character in a longer way than what
is necessary. For instance you could encode characters in the ASCII
range as multi-octet sequences. You certainly shouldn't do so
yourself, but if you are to look at texts from outer space, you never
know what you run into.
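
The "legal UTF-8 means UTF-8" policy can be sketched in Python as below. The
function name decode_with_fallback and the fallback order (CP1252, then
ISO-8859-1) are assumptions made for this illustration. Python's strict utf-8
codec rejects overlong forms such as 0xC0 0xAF, in line with the current UTF-8
definition in RFC 3629, so such input falls through to the 8-bit charsets here.

```python
def decode_with_fallback(data: bytes):
    """Return (text, codec): UTF-8 if the octets are legal UTF-8, else 8-bit."""
    for codec in ("utf-8", "cp1252"):
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue  # not valid in this codec, try the next one
    # ISO-8859-1 maps every octet to a code point, so this cannot fail.
    return data.decode("iso-8859-1"), "iso-8859-1"
```

CP1252 is tried before ISO-8859-1 because it assigns printable characters to
most of the 0x80 to 0x9F range; the few octets it leaves undefined end up in
the final ISO-8859-1 branch.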

I include subscription information for the Unicode mailing list,
where the knowledge on UTF-8 might be better than in this forum.

Paul Overell | 3 Nov 2000 10:46

Re: Recode articles to UTF-8

In article <G3EM27.3zu <at> clw.cs.man.ac.uk>, Charles Lindsey
<chl <at> clw.cs.man.ac.uk> writes
>In <004701c04493$9114e1c0$09cbb43e <at> de> "Bjoern Hoehrmann" <derhoermi <at> gmx.net> 
>writes:
>
>
>If you want UTF-8 in the body, then you have to say so in the charset
>parameter of the Content-Type. Again, few systems will currently
>understand it.

Well, Outlook Express, Netscape and Turnpike, to name but three, all
understand UTF-8 in the body - hardly "few systems".

Regards
--

-- 
Paul Overell                                             T U R N P I K E

