Robert Raszuk | 4 Jul 2010 23:58

Fwd: New Version Notification for draft-raszuk-wide-bgp-communities-00

FYI.

We plan to discuss this in next IDR WG meeting in Maastricht, but all 
comments are welcome to the list before the meeting.

Many thx,
R.

A new version of I-D, draft-raszuk-wide-bgp-communities-00.txt has been 
successfully submitted by Robert Raszuk and posted to the IETF repository.

URL:  http://tools.ietf.org/html/draft-raszuk-wide-bgp-communities-00

Filename:	 draft-raszuk-wide-bgp-communities
Revision:	 00
Title:		 Wide BGP Communities
Creation_date:	 2010-07-04
WG ID:		 Independent Submission
Number_of_pages: 25

Abstract:
Communicating various routing policies via route tagging plays an
important role in external BGP peering relations.  The most common
tool used today to attach various information about routes is
realized with the use of BGP communities.  Such information is
important for the peering AS to perform some mutually agreed actions
without the need to maintain a separate offline database for each
pair of prefix and an associated with it requested set of action
entries.


Xu Xiaohu | 5 Jul 2010 03:17

Re: New Version Notification for draft-xu-idr-best-external-loop-avoidance-00

Hi all,

A new draft on how to avoid transient forwarding loops when using the BGP
best-external approach to achieve fast convergence is available at:
http://tools.ietf.org/html/draft-xu-idr-best-external-loop-avoidance-00

Any comments are welcome.

Best wishes,
Xiaohu

-----Original Message-----
From: IETF I-D Submission Tool [mailto:idsubmission <at> ietf.org]
Sent: 2 July 2010 17:00
To: xuxh <at> huawei.com
Cc: guoerwei <at> huawei.com
Subject: New Version Notification for
draft-xu-idr-best-external-loop-avoidance-00

A new version of I-D, draft-xu-idr-best-external-loop-avoidance-00.txt has
been successfully submitted by Xiaohu Xu and posted to the IETF repository.

Filename:	 draft-xu-idr-best-external-loop-avoidance
Revision:	 00
Title:		 Avoiding Transient Loops when using BGP Best External
Creation_date:	 2010-07-02
WG ID:		 Independent Submission
Number_of_pages: 6

Abstract:

Xu Xiaohu | 5 Jul 2010 11:31

fwd: New Version Notification for draft-xu-virtual-subnet-01

Hi all,

 

This draft (http://tools.ietf.org/id/draft-xu-virtual-subnet-01.txt) describes a new data center network architecture that relies on MPLS/BGP VPN [RFC4364] technology to exchange connected-host VPN routes among PEs, so as to construct a scalable large IP subnet across the MPLS/IP backbone network of a data center. Note that these connected-host VPN routes are generated automatically by the PEs from their local ARP tables.

 

I wonder whether this is a suitable discussion topic for the RTG WG. I have also copied this email to the L3VPN mailing list, since this architecture relies heavily on MPLS/BGP VPN technology; comments from the L3VPN community are welcome too.

 

Best wishes,

Xiaohu

 

> -----Original Message-----
> From: IETF I-D Submission Tool [mailto:idsubmission <at> ietf.org]
> Sent: 5 July 2010 13:28
> To: xuxh <at> huawei.com
> Subject: New Version Notification for draft-xu-virtual-subnet-01
>
> A new version of I-D, draft-xu-virtual-subnet-01.txt has been
> successfully submitted by Xiaohu Xu and posted to the IETF repository.
>
> Filename:   draft-xu-virtual-subnet
> Revision:   01
> Title:      Virtual Subnet: A Scalable Data Center Network Architecture
> Creation_date:   2010-07-05
> WG ID:      Independent Submission
> Number_of_pages: 13
>
> Abstract:
> This document proposes a scalable data center network architecture
> which, as an alternative to the Spanning Tree Protocol Bridge network,
> uses a Layer 3 routing infrastructure to provide scalable virtual
> Layer 2 network connectivity services.
>
> The IETF Secretariat.

_______________________________________________
Idr mailing list
Idr <at> ietf.org
https://www.ietf.org/mailman/listinfo/idr
Shakir, Rob | 6 Jul 2010 09:20

Interaction of TCP Window Size and BGP Keepalive Behaviour

Hi IDR,

We have a query about the correct interpretation of RFC 4271 when an
implementation is combined with a mechanism that relieves control-plane
congestion by setting the TCP session's window size to zero.

The scenario that we are examining is one in which a BGP speaker
experiences control-plane congestion. To begin reducing the load, the
device indicates to all peers that it does not wish to receive further
packets to process until the period of congestion is alleviated. To do
this, it sets the TCP receive window (i.e. the peer's send window) to
zero. The relevant sentence in RFC 793 here is the following:

  "... when the receive window is zero no segments should be
  acceptable except ACK segments."

In this period, there is therefore a requirement for the remote BGP
speaker to send no data, including throttling the KEEPALIVEs that are
being sent on a session. In this case, it would be necessary for the
device experiencing congestion to cease to track the hold timer, and
leave the session established, despite not receiving any BGP packet
within the hold time, since this is what the underlying TCP connection
has signalled for the remote peer to do.

In an example topology:

  rtr-A                   rtr-B
  (congested c-p)         (uncongested c-p)
  send window: >0         send window: 0
  recv window: 0          recv window: >0	

In this case we expect:
 a) rtr-B does not send any BGP packet (KEEPALIVE/UPDATE/NOTIFICATION)
to rtr-A in normal operating circumstances.
 b) rtr-A does not expect any KEEPALIVE/UPDATE packets from rtr-B. The
session remains established even if no packet is received in the
holdtime.
 c) rtr-A continues to send KEEPALIVE packets to rtr-B.

This behaviour allows the control-plane congestion to be alleviated
while still providing some form of liveness detection. Should an error
occur in the underlying connection between the two peers, we would
expect rtr-B to be able to tear down the session because it does not
receive a KEEPALIVE from rtr-A within the hold time.

I believe that the above reasoning is merely a repetition of what
Section 6.5 of RFC 4271 states:

  "If a system does not receive successive KEEPALIVE, UPDATE, and/or
  NOTIFICATION messages within the period specified in the Hold Time
  field of the OPEN message, then the NOTIFICATION message with the
  Hold Timer Expired Error Code is sent and the BGP connection is
  closed."
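
As a rough illustration of the rule quoted above, here is a minimal Python sketch of a hold timer that is driven only by received messages. The class and function names are hypothetical and not taken from any implementation; the clock is injectable purely so the sketch can be exercised without waiting. The only value taken from RFC 4271 is the Hold Timer Expired error code (4).

```python
import time

HOLD_TIMER_EXPIRED = 4  # NOTIFICATION Error Code defined in RFC 4271


class HoldTimer:
    """Hypothetical helper: tracks the time since the last message from a peer."""

    def __init__(self, hold_time, clock=time.monotonic):
        self.hold_time = hold_time  # seconds, as negotiated via the OPEN messages
        self.clock = clock          # injectable so the sketch can be tested
        self.last_rx = clock()

    def message_received(self):
        # Any KEEPALIVE or UPDATE received from the peer restarts the timer.
        self.last_rx = self.clock()

    def expired(self):
        # A negotiated hold time of zero means the timer is not used at all.
        if self.hold_time == 0:
            return False
        return self.clock() - self.last_rx >= self.hold_time


def poll(timer):
    """Return the NOTIFICATION error code to send, or None to keep the session."""
    return HOLD_TIMER_EXPIRED if timer.expired() else None
```

Nothing in this sketch looks at whether the local speaker managed to send anything: under this reading of Section 6.5, expiry depends only on what has been received.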

With this in mind, we appear to be observing different behaviour, which
does not seem to comply with this standard:

1) rtr-A signals control plane congestion through means of setting the
TCP recv window to 0, but continues to send KEEPALIVEs to rtr-B
2) rtr-B does not send any KEEPALIVE/UPDATE to rtr-A and indicates this
fact through CLI output.
3) After t=HOLDTIME, rtr-B sends NOTIFICATION to rtr-A indicating that
the hold time has expired.

It would appear that the reason that rtr-B sends this NOTIFICATION is
the fact that it has not been able to send a KEEPALIVE within the hold
time. I do not believe that this is valid behaviour, since there is no
mention of such a requirement within RFC 4271.

I'd very much appreciate comments on whether this behaviour should be
expected, and on how those writing a BGP-4 implementation from the
ground up should treat this scenario.

In addition, if there are comments around the behaviour of setting the
recv window size to 0 during congestion, these would be greatly
appreciated.

Many thanks in advance for any comments.

Kind regards,
Rob

-- 
Rob Shakir                        <rob.shakir <at> cw.com>	   
IP&D Network Designer             Cable&Wireless Worldwide

This e-mail has been scanned for viruses by the Cable & Wireless Worldwide e-mail security system - powered
by MessageLabs. For more information on a proactive managed e-mail security service, visit http://www.cw.com/managed-exchange

The information contained in this e-mail is confidential and may also be subject to legal privilege. It is
intended only for the recipient(s) named above. If you are not named above as a recipient, you must not
read, copy, disclose, forward or otherwise use the information contained in this email. If you have
received this e-mail in error, please notify the sender (whose contact details are above) immediately by
reply e-mail and delete the message and any attachments without retaining any copies.

Cable and Wireless Worldwide plc 
Registered in England and Wales. Company Number 07029206
Registered office: Liberty House, 76 Hammersmith Road, London W14 8UD, England

Mitchell Erblich | 6 Jul 2010 10:21

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour


On Jul 6, 2010, at 12:20 AM, Shakir, Rob wrote:

> Hi IDR,
> 
> We currently have a query around the correct parsing of RFC 4271, when
> combining an implementation with mechanisms to avoid control-plane
> congestion via means of resetting the TCP session's window size to zero.
> 
> 
> The scenario that we are examining is one in which a BGP speaker
> experiences control-plane congestion, in order to begin to reduce the
> load, the device chooses to indicate to all peers that it does not wish
> to receive further packets to process, until the period of congestion is
> alleviated. To do this, it sets the TCP receive window (i.e. the peer's
> send window) to zero. The relevant sentence in RFC 793 here is the
> following:
> 
>  "... when the receive window is zero no segments should be
>  acceptable except ACK segments."

Within TCP, why not just incrementally add latency to TCP ACKs per flow,
thus reducing the flow?

Mitchell Erblich


Robert Raszuk | 6 Jul 2010 10:37

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour

Hi Rob,

I think you may have just found a bug in RFC 4271 section 6.5. It 
seems the text should say this:

"6.5.  Hold Timer Expired Error Handling

    Unless a BGP speaker has indicated a window size of zero towards its
    TCP peer, if a system does not receive successive KEEPALIVE, UPDATE,
    and/or NOTIFICATION messages within the period specified in the Hold
    Time field of the OPEN message, then the NOTIFICATION message with
    the Hold Timer Expired Error Code is sent and the BGP connection is
    closed."
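
The amended wording could be read as the following decision rule. This is only a sketch of the proposal, not anything RFC 4271 specifies; the function name and the `advertised_rx_window` parameter are hypothetical, and the sketch simply assumes the BGP process can learn the window its TCP stack is currently advertising, which typical sockets APIs do not guarantee.

```python
def hold_timer_fires(seconds_since_last_rx, hold_time, advertised_rx_window):
    """Hypothetical predicate for the amended wording; not part of RFC 4271."""
    if hold_time == 0:
        return False  # hold timer disabled by negotiation
    if advertised_rx_window == 0:
        return False  # proposed exception: we told the peer not to send
    return seconds_since_last_rx >= hold_time
```

With a 90-second hold time and 120 seconds of silence, the rule fires when the advertised window is non-zero and stays quiet when it is zero.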

The natural follow-up question is whether there are other cases in 
which the TCP session correctly stays up while BGP cannot communicate. 
If a zero window is the only such trigger, then we are OK with the above 
amendment (modulo the point below). If there are other such triggers, I 
think we should analyze them one by one.

It is normally taken for granted that BGP has its own keepalive 
machinery, but that machinery covers not only path liveness (which TCP 
could check too) but also the liveness of the BGP process itself.

So by sending a window of 0 towards the peer, the sender blocks itself 
from knowing the state of the peer's BGP process, which may result in 
traffic disruption (TCP/kernel up and BGP process dead).

To address this case, the alternative would be either to slow down TCP 
while still keeping bidirectional communication, as Mitchell proposed, 
or, if we really cannot accept anything from the peer, to drop the 
session with a NOTIFICATION message indicating the cause and the 
willingness to restart after a specified period of time.

Cheers,
R.


Paul Jakma | 6 Jul 2010 10:38

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour

On Tue, 6 Jul 2010, Shakir, Rob wrote:

> In an example topology:
>
>  rtr-A                   rtr-B
>  (congested c-p)         (uncongested c-p)
>  send window: >0         send window: 0
>  recv window: 0          recv window: >0
>
> In this case we expect:
> a) rtr-B does not send any BGP packet (KEEPALIVE/UPDATE/NOTIFICATION)
> to rtr-A in normal operating circumstances.

Why do you think this? Or at least, from what level are you 
considering this?

The BGP implementation on B should continue to generate BGP messages. 
The TCP implementation on B should not send them yet, though; it should 
buffer them.
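
The layering described here can be observed from user space. The following Python sketch is an approximation under stated assumptions: a receiver that never calls recv() stands in for rtr-A (its receive buffer fills and its advertised window closes), the small buffer sizes are chosen only to keep the run short, and the function name is hypothetical. The KEEPALIVE bytes follow the 19-byte BGP header layout (16-byte marker of all ones, 2-byte length, 1-byte type 4).

```python
import socket

# A BGP KEEPALIVE is just the 19-byte header: marker, length (19), type (4).
KEEPALIVE = b"\xff" * 16 + (19).to_bytes(2, "big") + bytes([4])


def fill_until_blocked(max_messages=1_000_000):
    """Send KEEPALIVEs to a peer that never reads; return (accepted, blocked)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Small buffers (an assumption of this demo) make the buffers fill quickly.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # plays rtr-A: accepts, then never reads
    sender.setblocking(False)

    accepted, blocked = 0, False
    try:
        for _ in range(max_messages):
            try:
                # send() may accept fewer than 19 bytes; a real implementation
                # would track the remainder. Ignored here for brevity.
                sender.send(KEEPALIVE)
                accepted += 1
            except BlockingIOError:
                blocked = True  # local send buffer is full as well
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return accepted, blocked
```

From the sending application's point of view, every send() succeeds until the local buffer fills, so the BGP process on B has no direct indication that its KEEPALIVEs are queued rather than delivered.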

> b) rtr-A does not expect any KEEPALIVE/UPDATE packets from rtr-B. The
> session remains established even if no packet is received in the
> holdtime.

As above, this would be surprising to me. See below.

> c) rtr-A continues to send KEEPALIVE packets to rtr-B.

> This behaviour both allows the control plane congestion to be
> alleviated, whilst still ensuring some form of liveness detection to
> occur. Should an error in the underlying connection between the two
> peers occur, we would expect that rtr-B is able to tear down the session
> due to the fact that it does not receive a KEEPALIVE from rtr-A within
> the hold time.
>
> I believe that the above reasoning is merely a repetition of what
> Section 6.5 of RFC 4271 states:
>
>  "If a system does not receive successive KEEPALIVE, UPDATE, and/or
>  NOTIFICATION messages within the period specified in the Hold Time
>  field of the OPEN message, then the NOTIFICATION message with the
>  Hold Timer Expired Error Code is sent and the BGP connection is
>  closed."

This justifies B tearing down a session if it does not receive a BGP 
KEEPALIVE. I'm not sure how it justifies that the BGP protocol must 
be able to have special insight and control into what TCP does (i.e. 
BGP being aware that TCP is throttling) - which is what would be 
required for BGP on B to stop generating messages to A, and/or for 
BGP on A to not expect BGP messages from B.

Given that TCP has its own keepalive mechanism (and there are 
standard APIs for enabling it), and given that the BGP designers 
chose to also have further keepalives at the BGP layer, it seems 
clear to me that the BGP designers intended for BGP KEEPALIVE to 
measure liveness from one BGP stack layer to the other - /not/ just the 
TCP layer. I.e. it seems clear there is an intentional layering, and 
that BGP does not intend to defer its liveness tests to TCP.
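
For reference, the standard API mentioned above is the SO_KEEPALIVE socket option. A minimal Python sketch follows; the helper name is hypothetical, and probe timing is left at the operating system's defaults because the tuning options differ between platforms.

```python
import socket


def enable_tcp_keepalive(sock):
    """Turn on TCP-level keepalive probes and report whether the option took."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

These probes are answered by the peer's TCP stack, not by its BGP process, which is the distinction the paragraph above draws.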

> With this in mind, we appear to be observing some different behaviour,
> that does not appear to comply with this standard:
>
> 1) rtr-A signals control plane congestion through means of setting the
> TCP recv window to 0, but continues to send KEEPALIVEs to rtr-B
> 2) rtr-B does not send any KEEPALIVE/UPDATE to rtr-A and indicates this
> fact through CLI output.
> 3) After t=HOLDTIME, rtr-B sends NOTIFICATION to rtr-A indicating that
> the hold time has expired.
>
> It would appear that the reason that rtr-B sends this NOTIFICATION 
> is the fact that it has not been able to send a KEEPALIVE within 
> the hold time. I do not believe that this is valid behaviour, since 
> there is no mention of such a requirement within RFC 4271.

No such requirement needs to be explicitly stated in the BGP RFC. It 
would be somewhat unconventional to have a protocol above TCP specify 
itself to depend very intimately on the current internal state of TCP 
- state which I'm not sure TCP implementations even make available to 
their users (how would you do it, using typical sockets APIs, out of 
curiosity?). I.e. it would be a layering violation.

> I'd very much appreciate comments on whether this behaviour should 
> be expected, and how those implementing a ground-up BGP-4 
> implementation should treat this scenario?

It seems expected and normal to me. :)

regards,
-- 
Paul Jakma	paul <at> jakma.org	Key ID: 64A2FF6A
Fortune:
Integrity has no need for rules.

Shakir, Rob | 6 Jul 2010 13:05

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour

Hi All,

Thanks for the responses thus far.

> -----Original Message-----
> From: Paul Jakma [mailto:paul <at> jakma.org]
> Sent: 06 July 2010 09:39
> To: Shakir, Rob
> Cc: idr <at> ietf.org
> Subject: Re: [Idr] Interaction of TCP Window Size and BGP
> Keepalive Behaviour
> 
>> In this case we expect: a) rtr-B does not send any BGP packet
>> (KEEPALIVE/UPDATE/NOTIFICATION) to rtr-A in normal operating 
>> circumstances.
> 
> Why do you think this? Or at least, from what level are you 
> considering this?
> 
> The BGP implementation on B should continue to generate BGP messages. 
> The TCP implementation on B should not send them yet though, should 
> buffer.

This would appear to me to be an internal process of rtr-B. In this
circumstance I find it useful to consider what rtr-A sees. If rtr-B is
queueing packets, rtr-A should not see them, provided the TCP stack
behaves as it should. If the window remains zero for a period
greater than the hold time, then rtr-B will not have any opportunity to
send the packet(s) before the session should be torn down, unless rtr-A
ignores the hold timer. If rtr-B does send packets then, as long as rtr-A
complies with RFC 793, they should never be received by the BGP
process, since only ACK segments are acceptable in this circumstance.

The interesting point about the above observation is that, in my
opinion, this appears to be quite harmful behaviour. Suppose there is
some transport disruption between rtr-A and rtr-B. If the packets are
generated by the BGP process but queued by the underlying TCP
implementation, what happens when rtr-B needs to send a NOTIFICATION to
indicate that the hold time has expired because rtr-A is not sending it
KEEPALIVEs? In this scenario I would expect not to see a session
teardown after t=HOLDTIME, but only when rtr-A signals a non-zero
window, at which point we observe:

1) A flood of KEEPALIVEs (the required number for the length of time for
which we saw congestion)
2) A NOTIFICATION after this, since there was transport disruption.

If this were the case, then there is an unknown period of time
(assuming that TCP does not deal with this, which is perhaps
unrealistic) during which the prefixes that we are holding in the RIB
due to this session are of unknown validity.
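
To put a rough number on the "flood" in point 1, here is a small Python sketch. It assumes a 30-second keepalive interval (one third of a 90-second hold time, the values RFC 4271 suggests as defaults) and assumes that every KEEPALIVE generated during the congestion stays queued in TCP; the function name is hypothetical.

```python
KEEPALIVE_LEN = 19  # bytes: a KEEPALIVE is the BGP header only


def queued_keepalives(congestion_seconds, keepalive_interval=30):
    """Return (message count, total bytes) queued while the window is zero."""
    count = congestion_seconds // keepalive_interval
    return count, count * KEEPALIVE_LEN
```

For five minutes of congestion, `queued_keepalives(300)` gives 10 messages totalling 190 bytes. Under these assumptions the backlog itself is small; the concern raised here is the period during which the RIB contents cannot be validated.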

>> b) rtr-A does not expect any KEEPALIVE/UPDATE packets from rtr-B. The
>> session remains established even if no packet is received in the
>> holdtime.
> 
> As above, this would be surprising to me. See below.

I believe this is required behaviour, if one is to utilise a TCP window
size of zero to indicate congestion on a BGP session. Another approach
(where some process has a requirement to change the window size to zero)
would result in the transport requiring one type of behaviour from the
peer, yet BGP requiring an incompatible behaviour. This would of course
assume some form of IPC between the TCP stack and the BGP daemon to
avoid this circumstance.

 
> This justifies B tearing down a session if it does not receive a BGP
> KEEPALIVE. I'm not sure how it justifies that the BGP protocol must be
> able to have special insight and control into what TCP does (i.e. BGP
> being aware that TCP is throttling) - which is what would be required
> for BGP on B to stop generating messages to A, and/or for BGP on A to
> not expect BGP messages from B.
>
> Given that TCP has its own keepalive mechanism (and there are standard
> APIs for enabling it), and given that the BGP designers chose to also
> have further keepalives at the BGP layer, it seems clear to me that
> the BGP designers intended for BGP KEEPALIVE to measure liveness from
> one BGP stack layer to the other - /not/ just the TCP layer. I.e. it seems
> clear there is an intentional layering, and that BGP does not intend
> to defer its liveness tests to TCP.

This would suggest that BGP using TCP to control congestion goes
against this idea. If the purpose of KEEPALIVE is to check that the BGP
daemon on the other side is alive, should we not avoid using TCP to
control the flow? After all, for as long as we do so, we have all but
required the remote peer not to send the data that would indicate that
its BGP daemon is alive.

I also feel that perhaps I did not articulate myself properly here - one
of the behaviours that I am interested in the validity of is the
inability to send a KEEPALIVE within the hold time resulting in the
session being torn down. I think that the analysis you've presented here
implies that the BGP daemon should not do this, as it has no requirement
to know the state of the TCP session via which it is being transported,
and hence to the knowledge of the BGP daemon, it has been able to
generate KEEPALIVEs successfully. The fact that they have not been
received by the peer is not known to the BGP daemon.

> No such requirement needs be explicitly stated in the BGP RFC. It 
> would be somewhat unconventional to have a protocol above TCP specify 
> itself to depend very intimately on the current internal state of TCP
> - state which I'm not sure TCP implementations even make available to 
> their users (how would you do it, using typical sockets APIs, out of 
> curiosity?). I.e it would be a layering violation.
> 
>> I'd very much appreciate comments on whether this behaviour should be
>> expected, and how those implementing a ground-up BGP-4 implementation
>> should treat this scenario?
> 
> It seems expected and normal to me. :)

Hmm, I am not sure that I understand you completely here. If rtr-A tore
down the session after t=HOLDTIME, I'd consider this "expected and
normal" (no keepalives received in hold time, therefore NOTIFICATION is
required), but, in this case, I appear to see that rtr-B sends the
NOTIFICATION because it could not send a KEEPALIVE.

Whilst I see your implementation points, my issue here is that there
seems to be an inherent problem in throttling packets by means of TCP
when there is a timer tracking packets that may be throttled in this
way. To avoid this, I would suggest adding language to RFC 4271 that
explicitly prohibits the use of underlying layers for controlling the
flow of BGP packets, and requires implementations that desire such
message-pacing behaviour to implement internal queueing of the relevant
packets.

Apologies if I've missed something obvious in the above. I'm just
unclear as to the justification for what we believe we observe, rather
than either the session remaining established, or rtr-A causing the
session to be torn down.

Many thanks in advance.

Kind regards,
Rob

-- 
Rob Shakir                        <rob.shakir <at> cw.com>	   
IP&D Network Designer             Cable&Wireless Worldwide


Paul Jakma | 6 Jul 2010 15:34

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour

On Tue, 6 Jul 2010, Shakir, Rob wrote:

> This would appear to me to be an internal process of rtr-B. In this 
> circumstance I find it useful to consider what rtr-A sees. If rtr-B 
> is queueing packets, rtr-A should not see them, providing the TCP 
> stack acts as it should be doing. If the window remains zero for a 
> period greater than the hold time, then rtr-B will not have any 
> opportunity to send the packet(s) before the session should be torn 
> down unless rtr-A ignores the hold timer.

Right.

> If rtr-B sends packets then, as long as rtr-A complies with RFC 
> 793, they should not ever be received by the BGP process, since 
> only ACK segments should be valid in this circumstance.

> The interesting point around the above observation is that, in my 
> opinion, this appears quite harmful behaviour. In the situation 
> where there is some transport disruption between rtr-A and rtr-B, 
> if the packets are generated by the BGP process, but queued by the 
> underlying TCP implementation, then what is the behaviour when it 
> comes to sending a NOTIFICATION to indicate that the hold time has 
> expired as rtr-A is not sending rtr-B KEEPALIVEs?

You mean that A is not receiving KEEPALIVE from B? A's BGP would 
tear-down, and close down at least its side of the session - sending 
a NOTIFICATION. Prior to that, A could still send messages to B just 
fine, thus B need not reach its hold timeout.

Further, if A closes its TCP connection then B potentially still can 
send its buffered packets to A, as part of shutting down its side of 
the TCP connection.

This is what I would expect anyway.

> I would expect in this scenario, that I do not see a session 
> teardown after t=HOLDTIME, but rather only when rtr-A signals a 
> non-zero window, in which case, we observe:

> 1) A flood of KEEPALIVEs (the required number for the length of time for
> which we saw congestion)
> 2) A NOTIFICATION after this, since there was transport disruption.

From who to what? You're saying that you:

- observe A shutting down the session
- but /expect/ A to stop counting down the hold-timer when its
   receive window is 0

Have I understood it right?

(Note that a BGP peer can tear-down a session and close() it, but TCP 
can still send those messages much later (and the TCP FIN to end)).

TCP and BGP are decoupled from each other though. So I don't see how 
BGP in A could ever reasonably be expected to know what TCP's receive 
window was.

> If this were the case, then there is a relatively unknown period of 
> time (assuming that TCP does not deal with this - which is perhaps 
> unrealistic), for which the prefixes that we are holding in the RIB 
> due to this session are of unknown validity.

How so, can you give a more detailed example?

> I believe this is required behaviour, if one is to utilise a TCP 
> window size of zero to indicate congestion on a BGP session.

Exactly how does a BGP implementation "utilise a TCP window size of 
0", to indicate congestion?

It almost sounds like you are reasoning about BGP as if it and TCP 
are monolithic. However, that's generally not the case, BGP and TCP 
generally would be quite decoupled protocols, operating quite 
independently - other than a few well defined sequence points.

>> clear there is an intentional layering, and that BGP does not intend
>> to defer its liveness tests to TCP.

> I also feel that perhaps I did not articulate myself properly here 
> - one of the behaviours that I am interested in the validity of is 
> the inability to send a KEEPALIVE within the hold time resulting in 
> the session being torn down. I think that the analysis you've 
> presented here implies that the BGP daemon should not do this, as 
> it has no requirement to know the state of the TCP session via 
> which it is being transported, and hence to the knowledge of the 
> BGP daemon, it has been able to generate KEEPALIVEs successfully. 
> The fact that they have not been received by the peer is not known 
> to the BGP daemon.

Right. BGP sends and reads stuff, and TCP does or does not deliver it 
within some unknown amount of time. BGP has its own keepalive and 
its own timer by which to judge whether or not, at a *BGP* level, 
communication is working sufficiently.
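
A minimal sketch of that layering, as an illustration only and not taken from any implementation (the class and method names are hypothetical, and time is passed in explicitly to keep it simple). The timer is restarted only when BGP actually reads a KEEPALIVE or UPDATE; it has no input for the TCP window or for whether our own KEEPALIVEs could be sent:

```python
class HoldTimer:
    """BGP-level liveness timer; it has no view of the TCP window."""

    def __init__(self, hold_time: float, now: float = 0.0) -> None:
        self.hold_time = hold_time  # negotiated Hold Time, in seconds
        self.last_rx = now          # time the last KEEPALIVE/UPDATE was read

    def on_message_received(self, now: float) -> None:
        # Only a KEEPALIVE or UPDATE actually read by BGP restarts the timer.
        self.last_rx = now

    def expired(self, now: float) -> bool:
        # A negotiated Hold Time of zero disables the timer altogether.
        return self.hold_time > 0 and (now - self.last_rx) >= self.hold_time


timer = HoldTimer(hold_time=90.0)
timer.on_message_received(now=60.0)   # KEEPALIVE read at t=60
print(timer.expired(now=149.0))       # False
print(timer.expired(now=150.0))       # True
```

Whatever TCP is doing underneath only shows up here indirectly, as messages that do or do not arrive in time.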

> Hmm, I am not sure that I understand you completely here. If rtr-A 
> tore down the session after t=HOLDTIME, I'd consider this "expected 
> and normal" (no keepalives received in hold time, therefore 
> NOTIFICATION is required), but, in this case, I appear to see that 
> rtr-B sends the NOTIFICATION because it could not send a KEEPALIVE.

So

> Whilst I see your implementation points - my issue here is that 
> there seems to be an inherent problem in throttling packets via 
> means of TCP, when there is a timer tracking packets that may be 
> throttled by this. In order to avoid this, I would suggest that 
> there is some language in RFC 4271 that explicitly prohibits the 
> use of underlying layers for controlling the flow of BGP packets, 
> and requires implementations that desire such message-pacing 
> behaviour to implement internal queueing of relevant packets.

I have to say, I don't see the problem. BGP and TCP are not 
synchronised to each other, and trying to do so likely is not 
reasonably achievable.

Further, even if you did achieve this (having A not count down its 
BGP hold-time so long as its receive window was 0 - i.e. exclude that 
time from counting for hold-time), what's the point of it exactly? It 
doesn't seem to achieve anything except complexity and requiring 
fairly non-standard TCP<->userspace hacks.

> Apologies if I've missed something obvious in the above. I'm just 
> unclear as to the justification for what we believe we observe, 
> rather than either the session remaining established, or rtr-A 
> causing the session to be torn down.

Hmm, then it's not 100% clear to me what exactly you see, versus what 
you expect to happen.

regards,
--

-- 
Paul Jakma	paul <at> jakma.org	Key ID: 64A2FF6A
Fortune:
Every successful person has had failures but repeated failure is no
guarantee of eventual success.

Shakir, Rob | 6 Jul 2010 16:19

Re: Interaction of TCP Window Size and BGP Keepalive Behaviour

Paul,

Thanks for your reply.

I think the problem is that I'm basing some of this argument on what we are observing in our lab, rather than on
an overview of what _should_ happen. Apologies for this.

>> The interesting point around the above observation is that, in my 
>> opinion, this appears quite harmful behaviour. In the situation where 
>> there is some transport disruption between rtr-A and rtr-B, if the 
>> packets are generated by the BGP process, but queued by the 
>> underlying TCP implementation, then what is the behaviour when it 
>> comes to sending a NOTIFICATION to indicate that the hold time has 
>> expired as rtr-A is not sending rtr-B KEEPALIVEs?
> 
> You mean that A is not receiving KEEPALIVE from B? A's BGP would 
> tear-down, and close down at least its side of the session - sending a 
> NOTIFICATION. Prior to that, A could still send messages to B just 
> fine, thus B need not reach its hold timeout.

A does not receive KEEPALIVE from B. A tolerates this and does not track hold time.

A sends KEEPALIVE to B at the required interval.

> Further, if A closes its TCP connection then B potentially still can 
> send its buffered packets to A, as part of shutting down its side of 
> the TCP connection.

Where A stops tracking the hold timer, A does not close the connection to B based on the hold timer. In 'normal
operating circumstances', I'd then expect that there is no reason for this session to be terminated.

>> I would expect in this scenario, that I do not see a session teardown 
>> after t=HOLDTIME, but rather only when rtr-A signals a non-zero 
>> window, in which case, we observe:
>> 
>> 1) A flood of KEEPALIVEs (the required number for the length of time 
>> for which we saw congestion)
>> 2) A NOTIFICATION after this, since there was transport disruption.
> 
> From who to what? You're saying that you:
> 
> - observe A shutting down the session
> - but /expect/ A to stop counting down the hold-timer when its
>    receive window is 0
> 
> Have I understood it right?
> 
> (Note that a BGP peer can tear-down a session and close() it, but TCP 
> can still send those messages much later (and the TCP FIN to end)).
> 
> TCP and BGP are decoupled from each other though. So I don't see how 
> BGP in A could ever reasonably be expected to know what TCP's receive 
> window was.

In our scenario, we observe that B logs that A has requested a 0 window size.

We also see that A stops tracking the hold timer.

Both of these observations would seem to imply that there is some communication between the status of the
TCP connection, and the status of the BGP session.

The point above is in the case that:

- A's KEEPALIVEs to B are not delivered before the hold timer expires.
- B therefore wishes to send NOTIFICATION.
- B's TCP implementation queues these packets and does not deliver them.

If TCP were to deliver packets after some time period, and when the connection parameters allow (i.e. the
window is >0), then we would expect to see these packets ingress to rtr-A only when the window is > 0.

>> If this were the case, then there is a relatively unknown period of 
>> time (assuming that TCP does not deal with this - which is perhaps 
>> unrealistic), for which the prefixes that we are holding in the RIB 
>> due to this session are of unknown validity.
> 
> How so, can you give a more detailed example?

With the above in mind:

- B's TCP has not delivered a NOTIFICATION to A.
- A is not tracking a hold time, therefore A does not have a reason to send a NOTIFICATION to B.
- B has sent a NOTIFICATION, and hence will remove all NLRI related to this session from the RIB - session has
been closed from B's point of view.
- A does not have any indication that the session has failed, and hence the prefixes remain within the RIB.

Until A either has a TCP failure/closes the TCP session, or is able to receive queued packets from B, these
prefixes will remain within the RIB, and are of unknown validity.
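
The timeline above can be sketched with a toy model. This is an illustration under stated assumptions, not what either router actually does: `ZeroWindowPipe` is a hypothetical name, the times are made up, and the model ignores window probes, retransmission and TCP timeouts. It only shows data from B sitting in B's send queue until A advertises a non-zero window:

```python
from collections import deque


class ZeroWindowPipe:
    """Toy one-way pipe from B to A: data queues while A's window is 0."""

    def __init__(self) -> None:
        self.window_open = True
        self.queued = deque()   # (time_written, message) held in B's TCP
        self.delivered = []     # (time_written, time_delivered, message)

    def send(self, t: float, msg: str) -> None:
        # BGP on B writes the message; TCP may or may not be able to send it.
        self.queued.append((t, msg))
        self._flush(t)

    def set_window(self, t: float, is_open: bool) -> None:
        # A advertises a zero (closed) or non-zero (open) receive window.
        self.window_open = is_open
        self._flush(t)

    def _flush(self, t: float) -> None:
        while self.window_open and self.queued:
            written, msg = self.queued.popleft()
            self.delivered.append((written, t, msg))


pipe = ZeroWindowPipe()
pipe.set_window(0, False)        # A advertises a zero window at t=0
pipe.send(30, "KEEPALIVE")
pipe.send(60, "KEEPALIVE")
pipe.send(90, "NOTIFICATION")    # B gives up at t=90 and drops A's routes
pipe.set_window(200, True)       # A reopens the window at t=200
for written, arrived, msg in pipe.delivered:
    print(msg, "written at", written, "delivered at", arrived)
```

With these made-up numbers, A would hold routes that B already considers gone from t=90 until t=200, and would then see the queued KEEPALIVEs followed by the NOTIFICATION in one burst.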

>> I believe this is required behaviour, if one is to utilise a TCP 
>> window size of zero to indicate congestion on a BGP session.
> 
> Exactly how does a BGP implementation "utilise a TCP window size of 
> 0", to indicate congestion?

My assumption here is probably naïve - and assumes some IPC between the TCP stack and the BGP
implementation. I assume that BGP is able to tell TCP to signal on the connection that the window should be
zero, in order to stop receiving UPDATEs. I am unaware of any mechanism within BGP that would result in a
pause in UPDATEs being delivered. The rationale for this observation is that this is what we appear to
observe in the lab.

> It almost sounds like you are reasoning about BGP as if it and TCP are 
> monolithic. However, that's generally not the case, BGP and TCP 
> generally would be quite decoupled protocols, operating quite 
> independently - other than a few well defined sequence points.

This is almost certainly the error that I am making. The problem I have with this statement is that we appear
to have implementation evidence that there is at least some signalling between the two processes in terms
of reacting to the window size during session activity.

The actual observations that we have are that:

- A does stop tracking the hold time, and this does not appear to be a problem - and this corresponds with
periods when the TCP window is set to 0.
- B stops sending KEEPALIVEs during this period (as to whether they are delayed by TCP, I am unsure).
- A does not set a window > 0 for a complete hold time.
- B terminates the session, due to not being able to send a KEEPALIVE within a hold time.

My primary concern here is that B's behaviour is invalid. There is no RFC 4271 requirement for B to terminate
the session due to not being able to transmit a KEEPALIVE.
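
To make that reading of RFC 4271 concrete, here is a heavily simplified sketch of the Established-state handling as I understand it. The function name and action strings are hypothetical, and most events and actions from the real FSM are left out. The point is only that the teardown path hangs off the hold timer, which is driven by what is received; expiry of the keepalive timer just means "send another KEEPALIVE":

```python
from typing import List


def established_actions(event: str) -> List[str]:
    """Simplified subset of Established-state actions (illustrative only)."""
    if event == "KeepaliveTimer_Expires":
        # Time to transmit: hand a KEEPALIVE to TCP and restart the timer.
        return ["send KEEPALIVE", "restart KeepaliveTimer"]
    if event in ("KeepAliveMsg", "UpdateMsg"):
        # Something was received from the peer: the peer is alive.
        return ["restart HoldTimer"]
    if event == "HoldTimer_Expires":
        # Nothing received for a full Hold Time: tear the session down.
        return ["send NOTIFICATION (Hold Timer Expired)", "drop TCP connection"]
    return []


print(established_actions("KeepaliveTimer_Expires"))
print(established_actions("HoldTimer_Expires"))
```

Nothing in this subset gives B a reason to send a NOTIFICATION because its own KEEPALIVE is stuck behind a zero window.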

I hope this is somewhat clearer!

Thanks very much for your input on this matter. If BGP did not react at all to the state of the TCP transport, I
would agree with your reasoning, and would have observed the result that I expect.

Kind regards,
Rob


