Re: recommendations for M3UA/SUA BEATs when running over SCTP
Jeff Morriss <jeff.morriss <at> ulticom.com>
2005-10-03 14:29:40 GMT
Hi Brian,
Brian F. G. Bidulock wrote:
> Jeff,
>
> Jeff Morriss wrote: (Fri, 30 Sep 2005 17:35:49)
>
>>Hi Brian,
>>
>>Brian F. G. Bidulock wrote:
>>
>>>Jeff,
>>>
>>>Jeff Morriss wrote: (Fri, 30 Sep 2005 14:41:35)
>>>
>>>
>>>>Hi list,
>>>>
>>>>Neither the M3UA nor the SUA RFC recommends sending BEATs when used over
>>>>SCTP. ETSI goes a bit further and precludes the use of BEATs.
>>>>
>>>>However, SCTP (in particular, the I-G) allows the association to get
>>>>"stuck" such that it will pass no data: a receiver is allowed to hold
>>>>its window closed "for an indefinite time" (new text for section 6.1 A).
>>>
>>>
>>>No, no, no. SCTP is not permitted to hold its window closed
>>>indefinitely just for the sake of it. Only while its receive buffer is
>>>indeed full and the user is not servicing it. This was always the case
>>>for SCTP. The IG merely adds the zero window probe procedure to keep
>>>the sender from stalling if the data happens to be unidirectional and
>>>SACKs are being lost from the receiver.
>>
>>RFC 2960 doesn't specifically mention that the receiver is allowed to
>>hold it closed indefinitely while the I-G specifically mentions it:
>>
>>~~
>> If the sender continues to receive new packets from the receiver
>> while doing zero window probing, the unacknowledged window probes
>> should not increment the error counter for the association or any
>> destination transport address. The reason is that the receiver MAY
>> keep its window closed for an indefinite time. Refer to
>>~~
>>
>>(more below)
>
>
> Because the application is stuck.
No, the application is fine. But (one of) the SCTP assocs it is using
is stuck (due to a bug or whatever).
>>>Also, this has nothing to do with UA BEATs. The UA BEAT message cannot
>>>arrive at the peer UA if the peer UA is not servicing its receive buffer
>>>and the buffer is full (rwnd = 0).
>>
>>Yes, that's exactly the point. If the BEATs fail that means the assoc
>>isn't carrying traffic.
>>
>>
>>>>I have encountered a number of problems in peer SCTPs which have caused
>>>>those peers to close their windows and keep them closed indefinitely.
>
>
> There is no point in sending the messages. SCTP provides a lifetime
> capability at the sender which allows the association to be aborted
> if a message ages in the send buffer beyond an interval. Use that
> instead if you simply want to abort when the receiver sticks.
>
>
>>>
>>>Not a recommended practice. Some implementations might artificially
>>>adjust rwnd. This is not correct, and has been counter-recommended on
>>>TSVWG many times. The only reason for closing rwnd is the actual
>>>filling of the receive buffer. Anything else is an incorrect indication
>>>to the peer and any resulting performance or reliability problem is the
>>>fault of the SCTP implementation artificially closing the window.
>>
>>That's all well and good, but "problems" (read: bugs) do crop up and
>>"the system" should be able to detect the problem, reset, and keep going
>>(or at least try real hard to do so).
>
>
> A similar bug in SS7 would cause problems too.
Hmmm, I'm not so sure...
MTP2 has T7 whose expiry will kill the link for excessive delay of ACK
and T6 which will kill the link due to excessive congestion. SCTP has
neither, though it does, as you say, have an optional (and barely
documented) data lifetime. (I'll have to look at how many SCTPs
implement that...)
MTP3 has (optional) periodic SLTMs in case those two MTP2 safeguards
fail. M3UA has (optional) BEATs but doesn't recommend them; neither
does it recommend using SCTP's lifetime feature.
These differences mean, to me, that M3UA (as specified) has a big hole
in it (through which someone could drive several hundred thousand call
failures).
> For SS7, validation and
> interworking testing is performed to ensure that the protocol stacks are
> not so poorly designed.
No amount of testing and validation will uncover every single bug in the
system. (And I do hope the above was a typo and you do know the
difference between a bug and a design problem.)
>>>>This leads to very serious traffic loss since M3UA will continue to
>>>>queue messages to the "stuck" association until the queues fill up and/or
>>>>overflow.
>>>
>>>
>>>The sending M3UA can always monitor its send buffer occupancy to such a
>>>peer and respond accordingly. If you look at the same ETSI spec that
>>>did away with BEATs, it also says that congestion procedures must be
>>>implemented. An M3UA SG queuing messages to such a stuck ASP would,
>>>when following ETSI, have to indicate congestion back to the SS7 network
>>>or other sending ASPs.
>>>
>>>Then queues neither fill up, nor overflow.
>>
>>True, but what if the window being closed is due to a bug in the peer
>>and that peer won't ever recover until the assoc is reset (read: is torn
>>down and gets a fresh start on life)?
>
>
> So abort it. Set a lifetime or a buffer threshold to abort. But you're
> still going to lose all buffered messages or risk duplicating them if
> you don't follow the corrid procedures.
I'll look into that. But should the RFCs be updated to have some kind
of recommendation? I'd hate for other people/implementations to run
into the same problem (isn't that part of what recommendations are for?).
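As a rough illustration of the "set a lifetime and abort" approach suggested
above, here is a minimal Python sketch. It assumes a sender-side queue that
timestamps each message; LifetimeQueue, its methods, and the 5-second default
are hypothetical and do not correspond to any real SCTP API. It only models
the policy: if the oldest unacknowledged message outlives its lifetime, the
caller aborts the assoc and gets back the messages still in transit (which is
where the corrid procedures would come in).

```python
from collections import deque


class LifetimeQueue:
    """Sender-side model of a per-message lifetime (hypothetical helper)."""

    def __init__(self, lifetime_s=5.0):
        self.lifetime_s = lifetime_s   # assumed value; an operational choice
        self.pending = deque()         # (enqueue_time, message), oldest first

    def send(self, msg, now):
        """Queue a message and remember when it was handed over."""
        self.pending.append((now, msg))

    def acked(self, count=1):
        """Drop the oldest `count` messages once the peer has acknowledged them."""
        for _ in range(min(count, len(self.pending))):
            self.pending.popleft()

    def expired(self, now):
        """True if the oldest unacknowledged message has outlived its lifetime."""
        return bool(self.pending) and (now - self.pending[0][0]) > self.lifetime_s

    def abort(self):
        """Empty the queue and return the messages that were still in transit."""
        in_transit = [msg for _, msg in self.pending]
        self.pending.clear()
        return in_transit
```

The caller would poll expired() from its own timer and, when it returns True,
abort the assoc and decide what to do with the list returned by abort().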
>>>>Using an end-to-end health check mechanism (such as M3UA or SUA BEATs)
>>>>solves this problem pretty nicely (similar to the way periodic SLTMs do
>>>>in MTP3).
>>>
>>>
>>>SLTMs do not do this. They are for detecting circuit assignment
>>>problems more than anything else. A link should never be taken
>>>down from a failure to acknowledge an SLTM due to queuing delay. I have
>>>seen cascading network failures from the failure to follow this
>>>principle.
>>
>>As per above, we're not talking about congestion nor queuing delay.
>>We're talking about a "stuck" assoc. Those BEATs/SLTMs will _never_ be
>>responded to; in MTP3 2 such failures will cause the link to be failed
>>(T1.111.7 section 2.2).
>
>
> No. Not while in service, only during activation. It is optional to
> even send SLTM while in service.
Agreed (that's why I specified that I was talking about periodic
SLTMs). But MTP3 can afford to make that optional since MTP2 has T6 and T7.
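To make the periodic health check idea concrete, here is a minimal Python
sketch of a BEAT watchdog for one assoc. The class name, its methods, and the
threshold of 2 missed BEATs (chosen only to mirror the periodic-SLTM behaviour
I described) are assumptions for illustration, not anything taken from the
M3UA or SUA RFCs.

```python
class BeatWatchdog:
    """Counts consecutive unanswered BEATs on one assoc (hypothetical helper)."""

    def __init__(self, max_missed=2):
        self.max_missed = max_missed   # assumed threshold, not from any RFC
        self.missed = 0                # consecutive BEATs with no BEAT ACK
        self.outstanding = False       # True while a BEAT awaits its BEAT ACK

    def on_beat_timer(self):
        """Call once per BEAT interval; returns True if the assoc should be reset."""
        if self.outstanding:
            self.missed += 1           # the previous BEAT was never acknowledged
        if self.missed >= self.max_missed:
            return True                # give up: the assoc is considered stuck
        self.outstanding = True        # the caller sends a new BEAT now
        return False

    def on_beat_ack(self):
        """Call when a BEAT ACK arrives; the assoc is carrying traffic end to end."""
        self.outstanding = False
        self.missed = 0
```

Because the BEAT travels through the same send and receive buffers as user
data, a closed window that never reopens shows up as missed BEAT ACKs, which
an SCTP-level heartbeat would not reveal.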
> I remember in the early days of SS7 the switch line modules used to
> nicely send and echo SLTMs even though MTP on the front-end was dead.
> Sending heartbeats does not cure anything. Proper design of the
> application does.
Sounds like that MTP was poorly designed; there's no point in having
health checks if they're responded to by the wrong layer.
>>>>However, I suspect many M3UA/SUA implementors may not implement or may
>>>>not turn on (by default) BEATs under the (false, due to the reasons
>>>>listed above) pretense that SCTP heartbeats are sufficient to ensure
>>>>the viability of the association.
>
>
> SCTP heartbeats do ensure the viability of the SCTP association. If an
> application fails, yet does so without closing the association, it is
> simply in error. I don't see the need for the end operating correctly
> to compensate for the end operating in error.
So that there are never, ever, sustained call failures? Or at least not
avoidable ones?
If SIGTRAN hopes to be as reliable as SS7, "the other side is broke,
it's their problem" is the wrong attitude to have.
(And again: the application is fine. But one assoc is stuck.)
> Nevertheless, set a lifetime on each sent message. Abort the
> association if a lifetime expires without being acknowledged by SCTP
> and consult the corrid draft for how to handle messages in transit.
>
>
>>>
>>>If you follow ETSI's congestion requirements you will not have a
>>>problem. (Note that the BEATs will not get through under those
>>>conditions anyway.)
>>
>>Except that the bug has now caused the peer to be permanently congested
>>(unless there's a failure due to excessive congestion--I haven't checked).
>
>
> Which is precisely why congestion should be signalled, so the rest
> of the network can avoid sending to the overloaded component.
But it still misses the point that the application itself is _not_
congested. It could even have a second assoc which is perfectly fine.
Congestion would tell the network to stop sending to the application
(meaning few if any calls would go through); that could actually be
worse than the current problem which, if the application had, say, 2
(loadshared) assocs would "only" lead to 50% call failure.
A timer guarding the congestion would solve all that, so that's probably
the way to go.
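A minimal Python sketch of such a guard timer follows. It assumes the sender
can tell, per assoc, whether it is currently congested (for example, the peer
window is closed or the send buffer is above a threshold); CongestionGuard,
update(), and the 30-second limit are hypothetical names and values, not part
of any specification.

```python
class CongestionGuard:
    """Guard timer on one assoc's congested state (hypothetical helper)."""

    def __init__(self, limit_s=30.0):
        self.limit_s = limit_s         # assumed value; an operational choice
        self.congested_since = None    # time congestion began, or None

    def update(self, congested, now):
        """Feed the current state; returns True when the assoc should be reset."""
        if not congested:
            self.congested_since = None    # congestion cleared: stop the timer
            return False
        if self.congested_since is None:
            self.congested_since = now     # congestion just began: start the timer
        return (now - self.congested_since) >= self.limit_s
```

Ordinary congestion that clears by itself never trips the guard; only an assoc
that stays congested for the whole interval gets reset, while any other
(loadshared) assoc keeps carrying traffic.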
>>>>What to do? Should the RFCs recommend (or even require) using BEATs,
>>>>even when run over SCTP? (This is just a recommendation change for the
>>>>RFCs but is a complete reversal for ETSI--but I guess that's not this
>>>>list's problem.)
>>>
>>>
>>>Just implement congestion control. Also if you have an alternative ASP
>>>or SGP (e.g. in a loadsharing arrangement) switch over to that. When
>>>switching traffic, use the procedures of draft-bidulock-sigtran-corrid
>>>and you will avoid message loss or duplication, even to a non-corrid
>>>aware host (but to use all the corrid procedures you must be able to
>>>send BEATs).
>>
>>I suppose that the peer being marked as congested will certainly raise
>>alarms throughout the network (which would at least make the problem
>>visible to everybody, not just the adjacent node) but in my experience,
>>there are a fair number of bugs which can be automatically (no manual
>>intervention--which may not arrive for several hours) fixed by simply
>>resetting the troubled thing. Unless there is an "excessive congestion"
>>clause, this won't help.
>
>
> So set lifetime and abort the association. But that is surely not only
> implementation dependent but is an operational consideration that has
> no place in the protocol specification.
Hmmm, I disagree. Reading the specs would lead one to believe that one
is reasonably safe in letting SCTP manage the viability of the assocs
(and M3UA's ability to send data on them). This could lead implementors
to ignore this potential problem until they run into it.
(But obviously it's not my call; I just thought people here might be
interested...)
Regards,
-Jeff