At 02:51 PM 3/23/2006, Dror Goldenberg wrote:
From: Michael Krause [mailto:krause <at> cup.hp.com]
Sent: Thursday, March 23, 2006 10:58 PM
At 04:04 PM 3/21/2006, H.K. Jerry Chu wrote:
Allison Mankin, the transport area AD who brought up some
concerns during IESG review of the connected draft regarding
simultaneous retransmissions at different layers, has suggested,
and Vivek has agreed to, the following change to section 7.1
"A Cautionary Note on IPoIB-RC".
The revised section reads like this:
The RC mode of InfiniBand guarantees in-order delivery of
packets. Every message transmitted over the RC connection is
segmented into physical MTU sized packets by the RC transport. If a
packet is lost, it is retransmitted until the complete
message is exchanged. Therefore, there is a possibility of an
upper transport layer experiencing a timeout while the RC layer is
still in the process of transferring the complete message. TCP
will view the timeout as an indicator of congestion and enter
slow-start, thereby affecting throughput drastically
[RFC2581]. Other upper layer protocols might insert their own
retransmissions into the fabric, adding to the already existing
congestion. The applicability of InfiniBand reliability is on a fabric
with short latencies (not wide area). Therefore, the RC retransmission
timer should be short compared with the starting minimum
retransmission timeout values used by the upper end-to-end transports.
In addition, because the RC mode does not have measurement-based
reliable transmission, its use over fabrics with long
latency or very dynamic latency may be a concern for congestion-
controlled traffic traversing those fabrics.
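For concreteness, the timer relationship described above can be sanity-checked numerically. Below is a minimal C sketch, assuming the IBTA Local ACK Timeout encoding (4.096 us * 2^n, a 5-bit exponent carried on the QP) and typical TCP values of the era (a roughly 200 ms minimum RTO on Linux, a 3 s initial RTO per RFC 2988); the TCP constants are illustrative, not taken from the draft:

#include <stdio.h>

int main(void)
{
    /* IB RC local ACK timeout encoding (IBTA): 4.096 us * 2^exponent,
       where the exponent is a 5-bit field on the QP (0..31). */
    for (int exp = 12; exp <= 20; exp += 4) {
        double ib_ms = 4.096e-3 * (double)(1u << exp);
        printf("IB local ACK timeout (exp=%2d): %9.3f ms\n", exp, ib_ms);
    }

    /* Typical upper-transport numbers for comparison (assumed, not
       from the draft). */
    printf("Typical Linux TCP min RTO:      %9.3f ms\n", 200.0);
    printf("RFC 2988 initial TCP RTO:       %9.3f ms\n", 3000.0);
    return 0;
}

With exponents of 16 and above, the RC timeout already exceeds a typical TCP minimum RTO, which is exactly the overlap the section warns about.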
If you have any comments/issues on the proposed change, please post
to the list before COB this Friday (3/24/06). I'd like to move the
draft past IESG review so we can wrap up the WG ASAP.
BTW, please be informed that my email address will change after
Friday. My new address is hkjerry.chu <at> gmail.com.
What about RNR, which can go into an infinite retry state? It
should also be noted that IB does not mandate timer ranges that
necessarily correspond to TCP timers. In fact, in the face of real
IB congestion, or a port / VL arbitration policy that places such traffic
in a best-effort QoS slot, it is quite possible to see delays that
would be interpreted by any ULP such as TCP as congestion. Is this
really an issue, as the applications continue to operate, albeit at a
reduced performance level?
I am not sure what the
problem in question is now.
If people are afraid of
a TCP retransmission occurring while an IB RC retransmission is taking
place, then: 1) packet drops are relatively rare in IB, and 2)
retransmission timeouts are usually fast in IB because it is a
low-latency fabric. Given the above, I'd say that this event is rare
anyway, and if and when an IB retransmission happens, there is a good
chance it will be recovered at the IB level well before TCP notices it.
Packet drops in IB as measured are relatively rare - but these
measurements are largely within a small-diameter fabric. IB over a WAN,
over which the protocol was never intended to operate, is another matter.
The case that you
mentioned, RNR-driven retransmission, is another kind of retransmission.
It has nothing to do with congestion or packet drops in the fabric; it
is driven purely by the ability of the receiver to post receive
buffers on the RQ/SRQ. Here the timeouts are application-based and, as
you wrote, can be configured to infinity. I agree that an infinite
number of retries would be a bad choice of RNR retry count. So maybe we
should recommend selecting the RNR timeout and retry count to be low as
well. I am also wondering what happens when there is a slow receiver,
e.g., one posting too slowly on the RQ/SRQ, in which case the RNR NAK
will happen very frequently and might cause the QP to go into the error
state because the RNR retries are exhausted.
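To make that recommendation concrete, here is a minimal libibverbs sketch of bringing an RC QP to RTS with bounded RNR behavior. The path/PSN arguments are placeholders a connection manager would normally supply, and the specific constants (min_rnr_timer = 12, rnr_retry = 6) are illustrative choices, not values from the draft:

#include <infiniband/verbs.h>
#include <string.h>

static int bring_up_rc_with_bounded_rnr(struct ibv_qp *qp,
                                        uint32_t remote_qpn,
                                        uint16_t remote_lid,
                                        uint32_t rq_psn,
                                        uint32_t sq_psn)
{
    struct ibv_qp_attr a;

    /* INIT -> RTR: min_rnr_timer is the 5-bit encoded delay advertised
       in RNR NAKs (12 encodes 0.64 ms in the IBTA table). */
    memset(&a, 0, sizeof(a));
    a.qp_state           = IBV_QPS_RTR;
    a.path_mtu           = IBV_MTU_2048;
    a.dest_qp_num        = remote_qpn;
    a.rq_psn             = rq_psn;
    a.max_dest_rd_atomic = 1;
    a.min_rnr_timer      = 12;
    a.ah_attr.dlid       = remote_lid;
    a.ah_attr.port_num   = 1;
    if (ibv_modify_qp(qp, &a,
                      IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS: rnr_retry = 7 means retry forever; any value in 0..6
       bounds the retries so the QP eventually transitions to the error
       state and the ULP can react, instead of stalling silently. */
    memset(&a, 0, sizeof(a));
    a.qp_state      = IBV_QPS_RTS;
    a.timeout       = 14;   /* local ACK timeout: 4.096us * 2^14 ~ 67 ms */
    a.retry_cnt     = 6;    /* transport retries, bounded */
    a.rnr_retry     = 6;    /* finite, NOT 7 (= infinite) */
    a.sq_psn        = sq_psn;
    a.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &a,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

The key knob is rnr_retry: 7 means retry forever, so any value from 0 to 6 guarantees the QP eventually errors out and the ULP gets a chance to react.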
The question raised was what impact retransmission will have on a ULP
such as TCP, which would treat timeouts as congestion events. I
raised two cases which are not easy to control at the IB protocol level
since they are admin managed: 1) RNR, which can lead to very long delays
in the connection's transmission, and 2) port / VL arbitration, which
can lead to very slow forward progress on a given QoS level.
Both of these should be covered, minimally as informative text, to
guide developers and management solutions to do the right thing.
They also need to be aware that in the case of RNR there is no single
right answer, and its usage may entail long delays while the OS does
whatever work it was that triggered the RNR in the first place.
One approach would be to
downgrade such peers to IPoIB-UD.
Given that either condition can occur at any time, you end up with split
flows, or you require, in essence, a fail-over between the peers to the
UD paradigm. Neither is optimal, to say the least.
Another option would be to
indirectly detect those cases and provide hints to the TCP stack to
slow down.
Need to treat all layer 2 technologies the same.
By indirect detection, I
am thinking along the lines of: when the send queue becomes full, we can
start dropping packets at the transmitter (the RED algorithm or something
similar). This packet drop can cause TCP to back off, activate its
congestion response, and reduce the transmission rate. This has the right
impact in my opinion.
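A minimal sketch of what such a transmitter-side RED-style drop could look like, assuming an EWMA of send-queue depth and a linear drop-probability ramp between two thresholds; all names and constants here are illustrative, not from any IB or IPoIB spec:

#include <stdbool.h>
#include <stdlib.h>

#define RED_MIN_TH   32      /* start probabilistic drops here */
#define RED_MAX_TH  128      /* drop everything above this */
#define RED_MAX_P   0.10     /* drop probability at RED_MAX_TH */
#define RED_WEIGHT  0.002    /* EWMA weight for the average depth */

static double avg_depth;     /* smoothed send-queue occupancy */

/* Called per packet with the instantaneous send-queue depth;
   returns true if the packet should be dropped before posting. */
static bool red_should_drop(unsigned cur_depth)
{
    avg_depth += RED_WEIGHT * ((double)cur_depth - avg_depth);

    if (avg_depth < RED_MIN_TH)
        return false;
    if (avg_depth >= RED_MAX_TH)
        return true;

    /* Linear ramp of drop probability between the two thresholds. */
    double p = RED_MAX_P * (avg_depth - RED_MIN_TH)
                         / (RED_MAX_TH - RED_MIN_TH);
    return (double)rand() / RAND_MAX < p;
}

Dropping at the transmitter this way gives TCP the loss signal it understands, without waiting for the fabric itself to misbehave.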
This is done for most layer 2 technologies. However, it does not
solve the problem for RNR, whose delay can be very long.
And the last thing that
you mentioned has to do with TCP retransmissions triggering when the IB
fabric is congested. IB packet delivery slows down because of congestion
in the IB fabric (the slowdown can be because of real congestion, which
backpressures the transmitters, or because of IB congestion management).
Even though there is no packet drop at the IB level and no retransmission
at the IB level, TCP still times out. Luckily, TCP performs RTT
estimation, so hopefully this will be rare too. If it happens, then the
desired effect would be to slow down the TCP requester (at the TCP level),
which is OK. The side effect will be retransmission, and the big question
is whether that retransmission is really so bad. BTW, I believe
that in this case too, dropping packets at the transmitter will ease the
problem.
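For reference, the RTT estimation mentioned above is the standard [RFC2988] estimator. A small sketch of how a sustained fabric slowdown inflates SRTT, and therefore the RTO, before the timer fires; the 50 us and 5 ms samples are made-up numbers for illustration:

#include <stdio.h>

struct rto_state { double srtt, rttvar, rto; int first; };

static void rto_sample(struct rto_state *s, double r /* seconds */)
{
    if (s->first) {                     /* first measurement */
        s->srtt   = r;
        s->rttvar = r / 2.0;
        s->first  = 0;
    } else {                            /* alpha = 1/8, beta = 1/4 */
        double dev = s->srtt > r ? s->srtt - r : r - s->srtt;
        s->rttvar = 0.75 * s->rttvar + 0.25 * dev;   /* old SRTT used */
        s->srtt   = 0.875 * s->srtt + 0.125 * r;
    }
    s->rto = s->srtt + 4.0 * s->rttvar;
    if (s->rto < 1.0) s->rto = 1.0;     /* RFC 2988 1 s minimum */
}

int main(void)
{
    struct rto_state s = { .first = 1 };
    /* 50 us fabric RTTs, then congestion stretches them to 5 ms. */
    for (int i = 0; i < 20; i++) rto_sample(&s, 50e-6);
    for (int i = 0; i < 20; i++) rto_sample(&s, 5e-3);
    printf("SRTT=%.3f ms RTO=%.3f ms\n", s.srtt * 1e3, s.rto * 1e3);
    return 0;
}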
IB can suffer congestion spreading even with its congestion management
notification protocol enabled. Worst-case spreading can
result in the fail-safe timers, which are measured in seconds, being
triggered, and that can yield packet loss. I agree that under most
loads there is a small probability of this occurring. However, it
is a case where IB drops packets, and one where it can take a
significant amount of time to realize what is going on and trigger
recovery. There should perhaps be informative text in the spec to
make sure developers understand exactly how IB operates when it comes to
its own congestion or fail-safe modes.