Perry E. Metzger | 13 Dec 2004 22:01

Re: Global congestion collapse


Sorry for not replying for a long time...

Michael Welzl <michael.welzl <at> uibk.ac.at> writes:
> Does anybody here have stories about the Internet's congestion
> collapse(s) during the 80's? Some details would be great!
[...]
> So, I wonder, what was it like? What are your experiences?
> When did folks first notice it?

I strongly remember a point in '88 or so (perhaps it was '87 -- it
probably wasn't '89) when it became impossible to move data back and
forth between Bellcore and NYNEX's research lab in White Plains over
the net because of congestion-related problems. I was working on some
collaboration with them and suddenly found myself forced to make use
of mag tapes as the only practical way to move even fairly small files
back and forth. A mailing list I ran off of one of my machines also
started having trouble moving bits through efficiently.

As I recall, the arrival of kernel patches implementing congestion
control rapidly began to reverse the situation.

The first time I saw such patches was when Phil Karn handed them to me
one day, and I swiftly added them to the kernels of my lab's
Sun-3s. The world was somewhat different back then... :)

Perry

David L. Mills | 14 Dec 2004 02:05

Re: Global congestion collapse

Perry,

Well, if your incident was during 1986-1988 and involved transit of the 
NSFnet Phase-I backbone, I'm the perp. The NSFnet routers ran my code, 
which was horribly overrun by supercomputer traffic. I found the best 
way to deal with the problem was to find the supercomputer elephants and 
shoot them. More is in a 1988 SIGCOMM Symposium paper. More recently the 
USNO and NIST time servers are being overrun with NTP traffic. See my 
recent PTTI paper at www.eecis.udel.edu/~mills/papers.html.

The NSFnet meltdown occurred primarily because the fuzzball routers used 
smart interfaces that retransmitted when either an error occurred or the 
receiver ran dry of buffers. The entire network locked up for a time 
because all the buffers in all six machines filled up with retransmit 
traffic and nothing could get in or out. As I recall, the ARPAnet also 
had a similar problem with reassembly buffers.

Dave


Perry E. Metzger | 14 Dec 2004 03:32

Re: Global congestion collapse


"David L. Mills" <mills <at> udel.edu> writes:
> Well, if your incident was during 1986-1988 and involved transit of
> the NSFnet Phase-I backbone, I'm the perp. The NSFnet routers ran my
> code, which was horribly overrun by supercomputer traffic. I found the
> best way to deal with the problem was to find the supercomputer
> elephants and shoot them. More is in a 1988 SIGCOMM Symposium
> paper. More recently the USNO and NIST time servers are being overrun
> with NTP traffic. See my recent PTTI paper at
> www.eecis.udel.edu/~mills/papers.html.
>
> The NSFnet meltdown occurred primarily because the fuzzball routers
> used smart interfaces that retransmitted when either an error occurred
> or the receiver ran dry of buffers. The entire network locked up for a
> time because all the buffers in all six machines filled up with
> retransmit traffic and nothing could get in or out. As I recall, the
> ARPAnet also had a similar problem with reassembly buffers.

Interesting. Bellcore switched from a 56k link to the IMP at Columbia
to NSFnet towards the end (latter half?) of that time, but I can't
remember if the horrible congestion was before or after our switch.

Either way, though, it was pretty shortly thereafter that I remember
getting my first replacement .o files with yummy new TCP congestion
control algorithms in them.

Perry

David L. Mills | 14 Dec 2004 05:37

Re: Global congestion collapse

Perry,

Not so fast. Steve Wolff of NSF and I had a nasty little secret we did 
not tell the NSFnet maintenance crew who could never keep a secret. I 
built in priority queueing and preemption in the fuzzball routers. The 
former wiretapped the telnet port and made it just below NTP on the 
priority scale. We put mail on the bottom just below ftp. A lot of 
telnet users stopped complaining because they thought we "fixed" the 
network.
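
[Editor's sketch, not the fuzzball code: a minimal illustration of the
priority ordering described above. Only the relative order (NTP, then
telnet, with ftp near the bottom and mail last) comes from the message.
The port numbers are the standard well-known assignments, and the
middle "everything else" class, the class name PriorityOutputQueue, and
the FIFO tie-break are assumptions. Preemption is not modelled.]

```python
import heapq
from itertools import count

# Lower number is served first. The ordering NTP > telnet > ... > ftp > mail
# follows the message; the default class in the middle is an assumption.
PRIORITY = {123: 0, 23: 1, 20: 3, 21: 3, 25: 4}
DEFAULT_PRIORITY = 2


class PriorityOutputQueue:
    """Hypothetical output queue that serves packets by port priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a class

    def enqueue(self, port, packet):
        prio = PRIORITY.get(port, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (prio, next(self._seq), packet))

    def dequeue(self):
        # Returns the highest-priority packet, or None if the queue is empty.
        return heapq.heappop(self._heap)[2] if self._heap else None
```

With this ordering a telnet keystroke queued behind a burst of mail is
still sent first, which is consistent with telnet users believing the
network had been "fixed".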

The other thing was to shoot the elephants. When a new packet arrived 
and no buffer space was available, the output queues were scanned 
looking for the biggest elephant (total byte count on all queues from 
the same IP address) and killed its biggest packet. Gunshots continued 
until either the arriving packet got shot or there was enough room to 
save it. It all worked gangbusters and the poor ftpers never found out.
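
[Editor's sketch, not the fuzzball code: one way to read the
"shoot the elephants" rule above. Packets are modelled as
(source, size) tuples and buffer space as a single byte budget; the
function name admit, the data layout, and the choice to count the
arriving packet toward its source's total are all assumptions made for
illustration.]

```python
from collections import defaultdict


def admit(queues, arriving, capacity):
    """Make room for `arriving` by dropping packets from the heaviest source.

    queues   -- list of output queues, each a list of (source, size) packets
    arriving -- the new (source, size) packet
    capacity -- total buffer space in bytes (assumed shared by all queues)

    Returns True if there is now room for the arriving packet (the caller
    enqueues it), or False if the arriving packet itself was the victim.
    """
    src_new, size_new = arriving

    def used():
        return sum(size for q in queues for _, size in q)

    while used() + size_new > capacity:
        # Total bytes per source across all queues, arrival included.
        totals = defaultdict(int)
        totals[src_new] += size_new
        for q in queues:
            for src, size in q:
                totals[src] += size
        elephant = max(totals, key=totals.get)

        # The elephant's queued packets, as (size, queue index, position).
        candidates = [
            (size, qi, pi)
            for qi, q in enumerate(queues)
            for pi, (src, size) in enumerate(q)
            if src == elephant
        ]
        # If the arrival is the elephant's biggest packet, it gets shot.
        if src_new == elephant and (
            not candidates or size_new >= max(candidates)[0]
        ):
            return False
        _, qi, pi = max(candidates)
        del queues[qi][pi]  # shoot the elephant's biggest queued packet
    return True
```

Each pass either drops one queued packet or rejects the arrival, so the
loop always terminates, matching "gunshots continued until either the
arriving packet got shot or there was enough room to save it".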

Dave


Michael Welzl | 14 Dec 2004 07:47

Re: Global congestion collapse

Folks,

Thanks a lot for answering my original question; this
discussion is getting more and more exciting  :)

Cheers,
Michael


Scott W Brim | 14 Dec 2004 13:31

Re: Global congestion collapse

On Tue, Dec 14, 2004 04:37:30AM +0000, David L. Mills allegedly wrote:
> Perry,
> 
> Not so fast. Steve Wolff of NSF and I had a nasty little secret we did 
> not tell the NSFnet maintenance crew who could never keep a secret. I 
> built in priority queueing and preemption in the fuzzball routers. The 
> former wiretapped the telnet port and made it just below NTP on the 
> priority scale. We put mail on the bottom just below ftp. A lot of 
> telnet users stopped complaining because they thought we "fixed" the 
> network.

The news leaked out pretty quickly iirc :-)

Another thing I noticed was that people adjusted their behavior.  The
congestion spread in time when it couldn't spread any other way, and
filled most of the night.

Craig Partridge | 14 Dec 2004 15:09

Re: Global congestion collapse


In message <87r7lt7gpp.fsf <at> snark.piermont.com>, "Perry E. Metzger" writes:

>Interesting. Bellcore switched from a 56k link to the IMP at Columbia
>to NSFnet towards the end (latter half?) of that time, but I can't
>remember if the horrible congestion was before or after our switch.

ARPANET had trouble too.  I remember much tuning.

>Either way, though, it was pretty shortly thereafter that I remember
>getting my first replacement .o files with yummy new TCP congestion
>control algorithms in them.

That would have been Van's TCP mods (described in the SIGCOMM '88 paper).
It was astonishing how big a difference they made.

Craig

Perry E. Metzger | 14 Dec 2004 17:18

Re: Global congestion collapse


Craig Partridge <craig <at> aland.bbn.com> writes:
>>Either way, though, it was pretty shortly thereafter that I remember
>>getting my first replacement .o files with yummy new TCP congestion
>>control algorithms in them.
>
> That would have been Van's TCP mods (described in the SIGCOMM '88 paper).

Of course. :)

> It was astonishing how big a difference they made.

Yes, though apparently (according to David Mills in the last few notes
to this list) more was going on than I was aware of at the
time. (That's not surprising -- my research work around then was
debuggers for highly parallel systems, and I was not paying much
attention to the network except as a way of getting my work done...)

Perry

Joe Touch | 15 Dec 2004 16:11

Re: Global congestion collapse


Craig Partridge wrote:
...
>>Either way, though, it was pretty shortly thereafter that I remember
>>getting my first replacement .o files with yummy new TCP congestion
>>control algorithms in them.
> 
> That would have been Van's TCP mods (described in the SIGCOMM '88 paper).
> It was astonishing how big a difference they made.

Not to downplay the utility of Van's variant, but it seems like _any_ 
congestion control would have (or may have - e.g. Dave's mods) made an 
astonishing impact.

Joe

Joe Touch | 15 Dec 2004 16:09

Re: Global congestion collapse


David L. Mills wrote:
> Perry,
> 
> Not so fast. Steve Wolff of NSF and I had a nasty little secret we did 
> not tell the NSFnet maintenance crew who could never keep a secret. I 
> built in priority queueing and preemption in the fuzzball routers. The 
> former wiretapped the telnet port and made it just below NTP on the 
> priority scale. We put mail on the bottom just below ftp. A lot of 
> telnet users stopped complaining because they thought we "fixed" the 
> network.
> 
> The other thing was to shoot the elephants. When a new packet arrived 
> and no buffer space was available, the output queues were scanned 
> looking for the biggest elephant (total byte count on all queues from 
> the same IP address) and killed its biggest packet. Gunshots continued 
> until either the arriving packet got shot or there was enough room to 
> save it. It all worked gangbusters and the poor ftpers never found out.

RED would benefit from two variants - per packet (when per-packet ops 
are the bottleneck) and per-byte weighting, though it doesn't seem to be 
described that way much. This sounds a lot like per-byte (the more 
common case now anyway), except that RED is statistical (everyone gets 
slammed, proportional to their load) and this hits each in series 
(largest user first, then next-largest when largest backs off, etc.). 
Was there ever any backlash (software oscillation or people complaining) 
from that?
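
[Editor's sketch of the per-packet versus per-byte distinction drawn
above, using the basic linear RED drop probability between the two
queue thresholds. It omits RED's averaging of the queue length and its
spacing of drops by packets since the last drop. The function name,
parameter names, and the 1500-byte default maximum packet size are
assumptions.]

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p,
                         pkt_size=None, max_pkt_size=1500):
    """Base RED drop probability for one arriving packet.

    Per-packet mode (pkt_size is None): every packet sees the same
    probability. Per-byte mode: the probability is scaled by packet size,
    so a source is penalised in proportion to the bytes it sends.
    """
    if avg_queue < min_th:
        return 0.0  # below the lower threshold nothing is dropped
    if avg_queue >= max_th:
        return 1.0  # above the upper threshold everything is dropped
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    if pkt_size is not None:
        p *= pkt_size / max_pkt_size  # byte-mode weighting
    return p
```

The contrast with the fuzzball scheme is that this probability applies
to every arriving packet, so all senders are hit in proportion to their
load, whereas the elephant rule targets only the single largest sender
at each drop.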

Joe