Jeff Tantsura | 24 May 01:52 2016

NomCom 2016-2017 Call for Volunteers

Dear RTGWG,

Please consider volunteering for NomCom; see below for details.

Thanks,
Jeff and Chris

---------- Forwarded message ----------
From: NomCom Chair 2016 <nomcom-chair-2016 <at> ietf.org>
Date: Thu, May 19, 2016 at 3:06 PM
Subject: NomCom 2016-2017 Call for Volunteers
To: IETF Announcement List <ietf-announce <at> ietf.org>


Subject: NomCom 2016-2017 Call for Volunteers

The IETF nomcom appoints folks to fill the open slots on the IAOC, the IAB, and
the IESG.

Ten voting members for the nomcom are selected in a verifiably random way from
a pool of volunteers. The more volunteers, the better chance we have of choosing
a random yet representative cross section of the IETF population.

The details of the operation of the nomcom can be found in RFC 7437 (BCP 10),
and RFC 3797 details the selection algorithm.
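
For illustration, a minimal Python sketch in the spirit of the RFC 3797
selection (not a compliant implementation; the RFC specifies the exact
formation of the key string from the announced randomness sources):

    import hashlib

    def select_voting_members(seed, pool, count=10):
        # Illustrative: derive the n-th random value by MD5-hashing the
        # public key string bracketed by the index, then reduce it modulo
        # the number of remaining candidates (see RFC 3797 for details).
        pool = sorted(pool)          # canonical, publicly known ordering
        chosen = []
        for n in range(count):
            digest = hashlib.md5(f"{n}{seed}{n}".encode()).digest()
            idx = int.from_bytes(digest, "big") % len(pool)
            chosen.append(pool.pop(idx))
        return chosen

Anyone holding the published seeds and the final candidate list can re-run
the computation and verify the outcome.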

Volunteers must have attended 3 of the past 5 IETF meetings.  As specified in
RFC 7437, that means three out of the five past meetings up to the time this
email announcement goes out to start the solicitation of volunteers. The five
meetings out of which you must have attended *three* are:

IETF = 91 (Honolulu)      \
       92 (Dallas)         \
       93 (Prague)          -*** ANY THREE!
       94 (Yokohama)       /
       95 (Buenos Aires)  /


If you qualify, please volunteer. Before you decide to volunteer, please
remember that anyone appointed to this Nomcom will not be considered as a
candidate for any of the positions that the 2016 - 2017 Nomcom is responsible
for filling.

Some 229 people have already volunteered by ticking the box on the IETF 95
registration form. 131 of these have been verified as eligible. I will contact
all of these shortly. Thank you for volunteering!

The people and posts whose terms end with the March 2017 IETF meeting,
and thus the positions for which this nomcom is responsible, are:

IAOC:

    Lou Berger

IAB:

    Ralph Droms
    Russ Housley
    Robert Sparks
    Andrew Sullivan
    Dave Thaler
    Suzanne Woolf

IESG:

    Jari Arkko (GEN)
    Deborah Brungard (RTG)
    Ben Campbell (ART)
    Spencer Dawkins (TSV)
    Stephen Farrell (SEC)
    Joel Jaeggli (OPS)
    Terry Manderson (INT)
    Alvaro Retana (RTG)


All appointments are for 2 years. The ART and Routing areas have 3 ADs and the
General area has 1; all other areas have 2 ADs. Thus, all areas (with the
exception of GEN) have at least one continuing AD.

The primary activity for this nomcom will begin in July 2016 and should be
completed in January 2017.   The nomcom will have regularly scheduled conference
calls to ensure progress. There will be activities to collect requirements from
the community, review candidate questionnaires, review feedback from community
members about candidates, and talk to candidates.

While being a nomcom member does require some time commitment, it is also a
very rewarding experience.

As a member of the nomcom, it is very important that you be able to attend
IETF 97 (Seoul) to conduct interviews. Being at IETF 96 (Berlin) is useful for
orientation.  Being at IETF 98 is not essential.

Please volunteer by sending me an email before 23:59 UTC June 20, 2016, as
follows:

To: nomcom-chair-2016 <at> ietf.org
Subject: Nomcom 2016-17 Volunteer

Please include the following information in the email body:

Your Full Name: __________
    // as you write it on the IETF registration form
Current Primary Affiliation:
    // Typically what goes in the Company field
    // in the IETF Registration Form
Emails: _______________
   // All email addresses used to register for the past 5 IETF meetings
   // Preferred email address first
Telephone: _______________________
    // For confirmation if selected

You should expect an email response from me within 5 business days stating
whether or not you are qualified.  If you don't receive this response, please
re-send your email with the tag "RESEND" added to the subject line.

If you are not yet sure if you would like to volunteer, please consider that
nomcom members play a very important role in shaping the leadership of the IETF.
Questions by email or voice are welcome. Volunteering for the nomcom is a great
way to contribute to the IETF!

You can find a detailed timeline on the nomcom web site at:
    https://datatracker.ietf.org/nomcom/2016/

I will be publishing a more detailed target timetable, as well as details of the
randomness seeds to be used for the RFC 3797 selection process, within the next
few weeks.

Thank you!

Lucy Lynch
llynch <at> civil-tongue.net
nomcom-chair-2016 <at> ietf.org


The IESG | 24 May 00:51 2016

Last Call: <draft-ietf-rtgwg-bgp-routing-large-dc-10.txt> (Use of BGP for routing in large-scale data centers) to Informational RFC


The IESG has received a request from the Routing Area Working Group
(rtgwg) to consider the following document:
- 'Use of BGP for routing in large-scale data centers'
  <draft-ietf-rtgwg-bgp-routing-large-dc-10.txt> as Informational RFC

The IESG plans to make a decision in the next few weeks, and solicits
final comments on this action. Please send substantive comments to the
ietf <at> ietf.org mailing list by 2016-06-06. Exceptionally, comments may be
sent to iesg <at> ietf.org instead. In either case, please retain the
beginning of the Subject line to allow automated sorting.

Abstract

   Some network operators build and operate data centers that support
   over one hundred thousand servers.  In this document, such data
   centers are referred to as "large-scale" to differentiate them from
   smaller infrastructures.  Environments of this scale have a unique
   set of network requirements with an emphasis on operational
   simplicity and network stability.  This document summarizes
   operational experience in designing and operating large-scale data
   centers using BGP as the only routing protocol.  The intent is to
   report on a proven and stable routing design that could be leveraged
   by others in the industry.

The file can be obtained via
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-bgp-routing-large-dc/

IESG discussion can be tracked via
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-bgp-routing-large-dc/ballot/

No IPR declarations have been submitted directly on this I-D.
Alia Atlas | 23 May 23:33 2016

AD review and progressing: draft-ietf-rtgwg-bgp-routing-large-dc-10

First, I would like to thank Jon, Petr, and Ariff for their work on this document.  I think it will be very useful for many who are looking at data-center designs.

Second, I'd like to thank RTGWG for adopting this draft and working to improve it.  I think it is a better document than it would have been if I had instead AD-sponsored it.

Finally, I've done my AD review and have no significant comments that require addressing.  I am forwarding this draft for IETF Last Call and it is on the telechat for June 16.  I do apologize for the delay; I am a bit backlogged on drafts.

One minor suggestion: in Section 8.3, it might be useful to refer to RFC 5837, which, if implemented, would provide a way to indicate the local interface on which an ICMP message was received while doing IP address masquerading.  I believe it would solve the issue of "hiding the address of the entry point into the device".

Thanks,
Alia


Lou Berger | 23 May 13:44 2016

Fwd: New Version Notification for draft-rtgyangdt-rtgwg-ni-model-00.txt

FYI, this is the second of the 2 models split out from
draft-rtgyangdt-rtgwg-device-model.

We look forward to hearing feedback from the WG.

-------- Forwarded Message --------
Subject: 	New Version Notification for
draft-rtgyangdt-rtgwg-ni-model-00.txt
Date: 	Tue, 17 May 2016 10:25:07 -0700
From: 	internet-drafts <at> ietf.org
To: 	Christian Hopps <chopps <at> chopps.org>, Acee Lindem <acee <at> cisco.com>,
Lou Berger <lberger <at> labn.net>, Christian Hopps <chopps <at> chopps.org>, Dean
Bogdanovic <ivandean <at> gmail.com>

A new version of I-D, draft-rtgyangdt-rtgwg-ni-model-00.txt
has been successfully submitted by Lou Berger and posted to the
IETF repository.

Name:		draft-rtgyangdt-rtgwg-ni-model
Revision:	00
Title:		Network Instance Model
Document date:	2016-05-17
Group:		Individual Submission
Pages:		15
URL:            https://www.ietf.org/internet-drafts/draft-rtgyangdt-rtgwg-ni-model-00.txt
Status:         https://datatracker.ietf.org/doc/draft-rtgyangdt-rtgwg-ni-model/
Htmlized:       https://tools.ietf.org/html/draft-rtgyangdt-rtgwg-ni-model-00

Abstract:
   This document defines a network instance module.  This module along
   with the logical network element module can be used to manage the
   logical and virtual resource representations that may be present on a
   network device.  Examples of common industry terms for logical
   resource representations are Logical Systems or Logical Routers.
   Examples of common industry terms for virtual resource
   representations are Virtual Routing and Forwarding (VRF) instances
   and Virtual Switch Instances (VSIs).

Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at tools.ietf.org.

The IETF Secretariat
Lou Berger | 23 May 13:44 2016

Fwd: New Version Notification for draft-rtgyangdt-rtgwg-lne-model-00.txt

FYI, this is the first of the 2 models split out from
draft-rtgyangdt-rtgwg-device-model.

We look forward to hearing feedback from the WG.

-------- Forwarded Message --------
Subject: 	New Version Notification for
draft-rtgyangdt-rtgwg-lne-model-00.txt
Date: 	Tue, 17 May 2016 10:24:56 -0700
From: 	internet-drafts <at> ietf.org
To: 	Christian Hopps <chopps <at> chopps.org>, Acee Lindem <acee <at> cisco.com>,
Lou Berger <lberger <at> labn.net>, Christian Hopps <chopps <at> chopps.org>, Dean
Bogdanovic <ivandean <at> gmail.com>

A new version of I-D, draft-rtgyangdt-rtgwg-lne-model-00.txt
has been successfully submitted by Lou Berger and posted to the
IETF repository.

Name:		draft-rtgyangdt-rtgwg-lne-model
Revision:	00
Title:		Logical Network Element Model
Document date:	2016-05-17
Group:		Individual Submission
Pages:		13
URL:            https://www.ietf.org/internet-drafts/draft-rtgyangdt-rtgwg-lne-model-00.txt
Status:         https://datatracker.ietf.org/doc/draft-rtgyangdt-rtgwg-lne-model/
Htmlized:       https://tools.ietf.org/html/draft-rtgyangdt-rtgwg-lne-model-00

Abstract:
   This document defines a logical network element module.  This module
   along with the network instance module can be used to manage the
   logical and virtual resource representations that may be present on a
   network device.  Examples of common industry terms for logical
   resource representations are Logical Systems or Logical Routers.
   Examples of common industry terms for virtual resource
   representations are Virtual Routing and Forwarding (VRF) instances
   and Virtual Switch Instances (VSIs).

Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at tools.ietf.org.

The IETF Secretariat
Jeff Tantsura | 3 May 05:10 2016

WGLC on draft-ietf-rtgwg-rlfa-node-protection

Dear RTGWG,

The authors of draft-ietf-rtgwg-rlfa-node-protection have told us that the
draft is ready for working group last call (WGLC).

This email starts the WGLC for draft-ietf-rtgwg-rlfa-node-protection.
The call will close on Monday, May 16.  Please indicate whether or not you
support the advancement of this draft.

Thanks,
Jeff and Chris
Jeff Tantsura | 1 May 10:18 2016

RTGWG draft minutes available

Hi RTGWG,

The draft minutes for the RTGWG meeting at IETF 95 are now available.  Please let me know if you have any comments.

Jeff & Chris

rtgwg - New Meeting Session Request for IETF 96


A new meeting session request has just been submitted by Jeff Tantsura, a Chair of the rtgwg working group.

---------------------------------------------------------
Working Group Name: Routing Area Working Group
Area Name: Routing Area
Session Requester: Jeff Tantsura

Number of Sessions: 2
Length of Session(s):  2 Hours, 2.5 Hours
Number of Attendees: 150
Conflicts to Avoid: 
 First Priority: isis mpls ospf pce idr spring
 Second Priority: bess l3sm  bier teas ccamp i2rs
 Third Priority: bfd detnet lime nvo3 pim  netmod

Special Requests:

---------------------------------------------------------
Hannes Gredler | 26 Apr 10:14 2016

WGLC on draft-ietf-rtgwg-rlfa-node-protection

hi,

I am not aware of any IPR other than what has already been disclosed:

https://datatracker.ietf.org/ipr/2334/
https://datatracker.ietf.org/ipr/2346/

rgds,

/hannes
Acee Lindem (acee | 25 Apr 19:16 2016

Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)

Hello,

I have been selected as the Routing Directorate reviewer for this draft.
The Routing Directorate seeks to review all routing or routing-related
drafts as they pass through IETF last call and IESG review, and sometimes
on special request. The purpose of the review is to provide assistance to
the Routing ADs. For more information about the Routing Directorate,
please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir

Although these comments are primarily for the use of the Routing ADs, it
would be helpful if you could consider them along with any other IETF Last
Call comments that you receive, and strive to resolve them through
discussion or by updating the draft.

Document: draft-ietf-rtgwg-bgp-routing-large-dc-09.txt
Reviewer: Acee Lindem
Review Date: 4/25/16
IETF LC End Date: Not started
Intended Status: Informational

Summary:
    This document is basically ready for publication, but has some minor
issues and nits that should be resolved prior to publication.

Comments:
    The document starts with the requirements for MSDC routing and then
provides an overview of Clos topologies and data center network
design. This overview attempts to cover a lot of material in a very
small amount of text. While not completely successful, the overview
provides a lot of good information and references. The bulk of the
document covers the usage of EBGP as the sole data center routing protocol
and other aspects of the routing design including ECMP, summarization
issues, and convergence. These sections provide a very good guide for
using EBGP in a Clos data center and an excellent discussion of the
deployment issues (based on real deployment experience).

    The technical content of the document is excellent. The readability
could be improved by breaking up some of the run-on sentences and by
applying the suggested editorial changes (see Nits below).


Major Issues:

    I have no major issues with the document.

Minor Issues:

    Section 4.2: Can an informative reference be added for Direct Server
Return (DSR)?
    Section 5.2.4 and 7.4: Define precisely what is meant by "scale-out"
topology somewhere in the document.
    Section 5.2.5: Can you add a backward reference to the discussion of
"lack of peer links inside every tier"? Also, it would be good to describe
how this would allow for summarization and under what failure conditions.
    Section 7.4: Should you add a reference to
https://www.ietf.org/id/draft-ietf-rtgwg-bgp-pic-00.txt to the penultimate
paragraph in this section?

Nits:

***************
*** 143,149 ****
     network stability so that a small group of people can effectively
     support a significantly sized network.
  
!    Experimentation and extensive testing has shown that External BGP
     (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
     these type of data center applications.  This is in contrast with
     more traditional DC designs, which may use simple tree topologies and
--- 143,149 ----
     network stability so that a small group of people can effectively
     support a significantly sized network.
  
!    Experimentation and extensive testing have shown that External BGP
     (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
     these type of data center applications.  This is in contrast with
     more traditional DC designs, which may use simple tree topologies and
***************
*** 178,191 ****
  2.1.  Bandwidth and Traffic Patterns
  
     The primary requirement when building an interconnection network for
!    large number of servers is to accommodate application bandwidth and
     latency requirements.  Until recently it was quite common to see the
     majority of traffic entering and leaving the data center, commonly
     referred to as "north-south" traffic.  Traditional "tree" topologies
     were sufficient to accommodate such flows, even with high
     oversubscription ratios between the layers of the network.  If more
     bandwidth was required, it was added by "scaling up" the network
!    elements, e.g. by upgrading the device's linecards or fabrics or
     replacing the device with one with higher port density.
  
     Today many large-scale data centers host applications generating
--- 178,191 ----
  2.1.  Bandwidth and Traffic Patterns
  
     The primary requirement when building an interconnection network for
!    a large number of servers is to accommodate application bandwidth and
     latency requirements.  Until recently it was quite common to see the
     majority of traffic entering and leaving the data center, commonly
     referred to as "north-south" traffic.  Traditional "tree" topologies
     were sufficient to accommodate such flows, even with high
     oversubscription ratios between the layers of the network.  If more
     bandwidth was required, it was added by "scaling up" the network
!    elements, e.g., by upgrading the device's linecards or fabrics or
     replacing the device with one with higher port density.
  
     Today many large-scale data centers host applications generating
***************
*** 195,201 ****
     [HADOOP], massive data replication between clusters needed by certain
     applications, or virtual machine migrations.  Scaling traditional
     tree topologies to match these bandwidth demands becomes either too
!    expensive or impossible due to physical limitations, e.g. port
     density in a switch.
  
  2.2.  CAPEX Minimization
--- 195,201 ----
     [HADOOP], massive data replication between clusters needed by certain
     applications, or virtual machine migrations.  Scaling traditional
     tree topologies to match these bandwidth demands becomes either too
!    expensive or impossible due to physical limitations, e.g., port
     density in a switch.
  
  2.2.  CAPEX Minimization
***************
*** 209,215 ****
  
     o  Unifying all network elements, preferably using the same hardware
        type or even the same device.  This allows for volume pricing on
!       bulk purchases and reduced maintenance and sparing costs.
  
     o  Driving costs down using competitive pressures, by introducing
        multiple network equipment vendors.
--- 209,215 ----
  
     o  Unifying all network elements, preferably using the same hardware
        type or even the same device.  This allows for volume pricing on
!       bulk purchases and reduced maintenance and inventory costs.
  
     o  Driving costs down using competitive pressures, by introducing
        multiple network equipment vendors.
***************
*** 234,244 ****
     minimizes software issue-related failures.
  
     An important aspect of Operational Expenditure (OPEX) minimization is
!    reducing size of failure domains in the network.  Ethernet networks
     are known to be susceptible to broadcast or unicast traffic storms
     that can have a dramatic impact on network performance and
     availability.  The use of a fully routed design significantly reduces
!    the size of the data plane failure domains - i.e. limits them to the
     lowest level in the network hierarchy.  However, such designs
     introduce the problem of distributed control plane failures.  This
     observation calls for simpler and less control plane protocols to
--- 234,244 ----
     minimizes software issue-related failures.
  
     An important aspect of Operational Expenditure (OPEX) minimization is
!    reducing the size of failure domains in the network.  Ethernet networks
     are known to be susceptible to broadcast or unicast traffic storms
     that can have a dramatic impact on network performance and
     availability.  The use of a fully routed design significantly reduces
!    the size of the data plane failure domains, i.e., limits them to the
     lowest level in the network hierarchy.  However, such designs
     introduce the problem of distributed control plane failures.  This
     observation calls for simpler and less control plane protocols to
***************
*** 253,259 ****
     performed by network devices.  Traditionally, load balancers are
     deployed as dedicated devices in the traffic forwarding path.  The
     problem arises in scaling load balancers under growing traffic
!    demand.  A preferable solution would be able to scale load balancing
     layer horizontally, by adding more of the uniform nodes and
     distributing incoming traffic across these nodes.  In situations like
     this, an ideal choice would be to use network infrastructure itself
--- 253,259 ----
     performed by network devices.  Traditionally, load balancers are
     deployed as dedicated devices in the traffic forwarding path.  The
     problem arises in scaling load balancers under growing traffic
!    demand.  A preferable solution would be able to scale the load balancing
     layer horizontally, by adding more of the uniform nodes and
     distributing incoming traffic across these nodes.  In situations like
     this, an ideal choice would be to use network infrastructure itself
***************
*** 305,311 ****
  3.1.  Traditional DC Topology
  
     In the networking industry, a common design choice for data centers
!    typically look like a (upside down) tree with redundant uplinks and
     three layers of hierarchy namely; core, aggregation/distribution and
     access layers (see Figure 1).  To accommodate bandwidth demands, each
     higher layer, from server towards DC egress or WAN, has higher port
--- 305,311 ----
  3.1.  Traditional DC Topology
  
     In the networking industry, a common design choice for data centers
!    typically look like an (upside down) tree with redundant uplinks and
     three layers of hierarchy namely; core, aggregation/distribution and
     access layers (see Figure 1).  To accommodate bandwidth demands, each
     higher layer, from server towards DC egress or WAN, has higher port
***************
*** 373,379 ****
     topology, sometimes called "fat-tree" (see, for example, [INTERCON]
     and [ALFARES2008]).  This topology features an odd number of stages
     (sometimes known as dimensions) and is commonly made of uniform
!    elements, e.g. network switches with the same port count.  Therefore,
     the choice of folded Clos topology satisfies REQ1 and facilitates
     REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
     topology (3 stages counting Tier-2 stage twice, when tracing a packet
--- 373,379 ----
     topology, sometimes called "fat-tree" (see, for example, [INTERCON]
     and [ALFARES2008]).  This topology features an odd number of stages
     (sometimes known as dimensions) and is commonly made of uniform
!    elements, e.g., network switches with the same port count.  Therefore,
     the choice of folded Clos topology satisfies REQ1 and facilitates
     REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
     topology (3 stages counting Tier-2 stage twice, when tracing a packet
***************
*** 460,466 ****
  3.2.3.  Scaling the Clos topology
  
     A Clos topology can be scaled either by increasing network element
!    port density or adding more stages, e.g. moving to a 5-stage Clos, as
     illustrated in Figure 3 below:
  
                                        Tier-1
--- 460,466 ----
  3.2.3.  Scaling the Clos topology
  
     A Clos topology can be scaled either by increasing network element
!    port density or adding more stages, e.g., moving to a 5-stage Clos, as
     illustrated in Figure 3 below:
  
                                        Tier-1
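
As an illustration of the scaling choice described in the quoted text, a
rough capacity sketch (assuming uniform k-port switches and no
oversubscription; the formulas are the usual folded-Clos/fat-tree ones,
not text from the draft):

    def clos_server_capacity(k, stages=3):
        # 3-stage folded Clos: k/2 Tier-1 devices, k Tier-2 devices, each
        # with k/2 server-facing ports -> k * k/2 servers.
        if stages == 3:
            return k * k // 2
        # 5-stage fat-tree (cf. [ALFARES2008]): k**3 / 4 servers.
        if stages == 5:
            return k ** 3 // 4
        raise ValueError("sketch covers 3- and 5-stage topologies only")

With 64-port devices this gives roughly 2048 servers at 3 stages versus
65536 at 5 stages, which is why adding stages is the usual scaling path.
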
***************
*** 523,529 ****
  3.2.4.  Managing the Size of Clos Topology Tiers
  
     If a data center network size is small, it is possible to reduce the
!    number of switches in Tier-1 or Tier-2 of Clos topology by a factor
     of two.  To understand how this could be done, take Tier-1 as an
     example.  Every Tier-2 device connects to a single group of Tier-1
     devices.  If half of the ports on each of the Tier-1 devices are not
--- 523,529 ----
  3.2.4.  Managing the Size of Clos Topology Tiers
  
     If a data center network size is small, it is possible to reduce the
!    number of switches in Tier-1 or Tier-2 of a Clos topology by a factor
     of two.  To understand how this could be done, take Tier-1 as an
     example.  Every Tier-2 device connects to a single group of Tier-1
     devices.  If half of the ports on each of the Tier-1 devices are not
***************
*** 574,580 ****
     originally defined in [IEEE8021D-1990] for loop free topology
     creation, typically utilizing variants of the traditional DC topology
     described in Section 3.1.  At the time, many DC switches either did
!    not support Layer 3 routed protocols or supported it with additional
     licensing fees, which played a part in the design choice.  Although
     many enhancements have been made through the introduction of Rapid
     Spanning Tree Protocol (RSTP) in the latest revision of
--- 574,580 ----
     originally defined in [IEEE8021D-1990] for loop free topology
     creation, typically utilizing variants of the traditional DC topology
     described in Section 3.1.  At the time, many DC switches either did
!    not support Layer 3 routing protocols or supported them with additional
     licensing fees, which played a part in the design choice.  Although
     many enhancements have been made through the introduction of Rapid
     Spanning Tree Protocol (RSTP) in the latest revision of
***************
*** 599,605 ****
     as the backup for loop prevention.  The major downsides of this
     approach are the lack of ability to scale linearly past two in most
     implementations, lack of standards based implementations, and added
!    failure domain risk of keeping state between the devices.
  
     It should be noted that building large, horizontally scalable, Layer
     2 only networks without STP is possible recently through the
--- 599,605 ----
     as the backup for loop prevention.  The major downsides of this
     approach are the lack of ability to scale linearly past two in most
     implementations, lack of standards based implementations, and added
!    the failure domain risk of syncing state between the devices.
  
     It should be noted that building large, horizontally scalable, Layer
     2 only networks without STP is possible recently through the
***************
*** 621,631 ****
     Finally, neither the base TRILL specification nor the M-LAG approach
     totally eliminate the problem of the shared broadcast domain, that is
     so detrimental to the operations of any Layer 2, Ethernet based
!    solutions.  Later TRILL extensions have been proposed to solve the
     this problem statement primarily based on the approaches outlined in
     [RFC7067], but this even further limits the number of available
!    interoperable implementations that can be used to build a fabric,
!    therefore TRILL based designs have issues meeting REQ2, REQ3, and
     REQ4.
  
  4.2.  Hybrid L2/L3 Designs
--- 621,631 ----
     Finally, neither the base TRILL specification nor the M-LAG approach
     totally eliminate the problem of the shared broadcast domain, that is
     so detrimental to the operations of any Layer 2, Ethernet based
!    solution.  Later TRILL extensions have been proposed to solve
     this problem statement primarily based on the approaches outlined in
     [RFC7067], but this even further limits the number of available
!    interoperable implementations that can be used to build a fabric.
!    Therefore, TRILL based designs have issues meeting REQ2, REQ3, and
     REQ4.
  
  4.2.  Hybrid L2/L3 Designs
***************
*** 635,641 ****
     in either the Tier-1 or Tier-2 parts of the network and dividing the
     Layer 2 domain into numerous, smaller domains.  This design has
     allowed data centers to scale up, but at the cost of complexity in
!    the network managing multiple protocols.  For the following reasons,
     operators have retained Layer 2 in either the access (Tier-3) or both
     access and aggregation (Tier-3 and Tier-2) parts of the network:
  
--- 635,641 ----
     in either the Tier-1 or Tier-2 parts of the network and dividing the
     Layer 2 domain into numerous, smaller domains.  This design has
     allowed data centers to scale up, but at the cost of complexity in
!    managing multiple network protocols.  For the following reasons,
     operators have retained Layer 2 in either the access (Tier-3) or both
     access and aggregation (Tier-3 and Tier-2) parts of the network:
  
***************
*** 644,650 ****
  
     o  Seamless mobility for virtual machines that require the
        preservation of IP addresses when a virtual machine moves to
!       different Tier-3 switch.
  
     o  Simplified IP addressing = less IP subnets are required for the
        data center.
--- 644,650 ----
  
     o  Seamless mobility for virtual machines that require the
        preservation of IP addresses when a virtual machine moves to
!       a different Tier-3 switch.
  
     o  Simplified IP addressing = less IP subnets are required for the
        data center.
***************
*** 679,686 ****
     adoption in networks where large Layer 2 adjacency and larger size
     Layer 3 subnets are not as critical compared to network scalability
     and stability.  Application providers and network operators continue
!    to also develop new solutions to meet some of the requirements that
!    previously have driven large Layer 2 domains by using various overlay
     or tunneling techniques.
  
  5.  Routing Protocol Selection and Design
--- 679,686 ----
     adoption in networks where large Layer 2 adjacency and larger size
     Layer 3 subnets are not as critical compared to network scalability
     and stability.  Application providers and network operators continue
!    to develop new solutions to meet some of the requirements that
!    previously had driven large Layer 2 domains using various overlay
     or tunneling techniques.
  
  5.  Routing Protocol Selection and Design
***************
*** 700,706 ****
     design.
  
     Although EBGP is the protocol used for almost all inter-domain
!    routing on the Internet and has wide support from both vendor and
     service provider communities, it is not generally deployed as the
     primary routing protocol within the data center for a number of
     reasons (some of which are interrelated):
--- 700,706 ----
     design.
  
     Although EBGP is the protocol used for almost all inter-domain
!    routing in the Internet and has wide support from both vendor and
     service provider communities, it is not generally deployed as the
     primary routing protocol within the data center for a number of
     reasons (some of which are interrelated):
***************
*** 741,754 ****
        state IGPs.  Since every BGP router calculates and propagates only
        the best-path selected, a network failure is masked as soon as the
        BGP speaker finds an alternate path, which exists when highly
!       symmetric topologies, such as Clos, are coupled with EBGP only
        design.  In contrast, the event propagation scope of a link-state
        IGP is an entire area, regardless of the failure type.  In this
        way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
        that all widely deployed link-state IGPs feature periodic
!       refreshes of routing information, even if this rarely causes
!       impact to modern router control planes, while BGP does not expire
!       routing state.
  
     o  BGP supports third-party (recursively resolved) next-hops.  This
        allows for manipulating multipath to be non-ECMP based or
--- 741,754 ----
        state IGPs.  Since every BGP router calculates and propagates only
        the best-path selected, a network failure is masked as soon as the
        BGP speaker finds an alternate path, which exists when highly
!       symmetric topologies, such as Clos, are coupled with an EBGP only
        design.  In contrast, the event propagation scope of a link-state
        IGP is an entire area, regardless of the failure type.  In this
        way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
        that all widely deployed link-state IGPs feature periodic
!       refreshes of routing information while BGP does not expire
!       routing state, although this rarely impacts modern router control
!       planes.
  
     o  BGP supports third-party (recursively resolved) next-hops.  This
        allows for manipulating multipath to be non-ECMP based or
***************
*** 765,775 ****
        controlled and complex unwanted paths will be ignored.  See
        Section 5.2 for an example of a working ASN allocation scheme.  In
        a link-state IGP accomplishing the same goal would require multi-
!       (instance/topology/processes) support, typically not available in
        all DC devices and quite complex to configure and troubleshoot.
        Using a traditional single flooding domain, which most DC designs
        utilize, under certain failure conditions may pick up unwanted
!       lengthy paths, e.g. traversing multiple Tier-2 devices.
  
     o  EBGP configuration that is implemented with minimal routing policy
        is easier to troubleshoot for network reachability issues.  In
--- 765,775 ----
        controlled and complex unwanted paths will be ignored.  See
        Section 5.2 for an example of a working ASN allocation scheme.  In
        a link-state IGP accomplishing the same goal would require multi-
!       (instance/topology/process) support, typically not available in
        all DC devices and quite complex to configure and troubleshoot.
        Using a traditional single flooding domain, which most DC designs
        utilize, under certain failure conditions may pick up unwanted
!       lengthy paths, e.g., traversing multiple Tier-2 devices.
  
     o  EBGP configuration that is implemented with minimal routing policy
        is easier to troubleshoot for network reachability issues.  In
***************
*** 806,812 ****
        loopback sessions are used even in the case of multiple links
        between the same pair of nodes.
  
!    o  Private Use ASNs from the range 64512-65534 are used so as to
        avoid ASN conflicts.
  
     o  A single ASN is allocated to all of the Clos topology's Tier-1
--- 806,812 ----
        loopback sessions are used even in the case of multiple links
        between the same pair of nodes.
  
!    o  Private Use ASNs from the range 64512-65534 are used to
        avoid ASN conflicts.
  
     o  A single ASN is allocated to all of the Clos topology's Tier-1
***************
*** 815,821 ****
     o  A unique ASN is allocated to each set of Tier-2 devices in the
        same cluster.
  
!    o  A unique ASN is allocated to every Tier-3 device (e.g.  ToR) in
        this topology.
  
  
--- 815,821 ----
     o  A unique ASN is allocated to each set of Tier-2 devices in the
        same cluster.
  
!    o  A unique ASN is allocated to every Tier-3 device (e.g.,  ToR) in
        this topology.
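
The quoted allocation scheme is easy to mechanize. A hypothetical sketch
(the numbering policy is illustrative, not taken from the draft):

    def allocate_private_asns(num_clusters, tors_per_cluster, base=64512):
        # One shared ASN for all Tier-1 devices, one ASN per Tier-2
        # cluster, and a unique ASN per Tier-3 (ToR) device, all drawn
        # from the two-octet Private Use range 64512-65534 [RFC6996].
        plan = {"tier1": base}
        plan["tier2"] = {c: base + 1 + c for c in range(num_clusters)}
        first_tor = base + 1 + num_clusters
        plan["tier3"] = {(c, t): first_tor + c * tors_per_cluster + t
                         for c in range(num_clusters)
                         for t in range(tors_per_cluster)}
        if first_tor + num_clusters * tors_per_cluster - 1 > 65534:
            raise ValueError("1023 two-octet Private Use ASNs exhausted")
        return plan

The 1023-ASN pool runs out quickly at scale, which is what motivates the
ASN reuse and Four-Octet ASN discussion elsewhere in the document.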
  
  
***************
*** 903,922 ****
  
     Another solution to this problem would be using Four-Octet ASNs
     ([RFC6793]), where there are additional Private Use ASNs available,
!    see [IANA.AS].  Use of Four-Octet ASNs put additional protocol
!    complexity in the BGP implementation so should be considered against
     the complexity of re-use when considering REQ3 and REQ4.  Perhaps
     more importantly, they are not yet supported by all BGP
     implementations, which may limit vendor selection of DC equipment.
!    When supported, ensure that implementations in use are able to remove
!    the Private Use ASNs if required for external connectivity
!    (Section 5.2.4).
  
  5.2.3.  Prefix Advertisement
  
     A Clos topology features a large number of point-to-point links and
     associated prefixes.  Advertising all of these routes into BGP may
!    create FIB overload conditions in the network devices.  Advertising
     these links also puts additional path computation stress on the BGP
     control plane for little benefit.  There are two possible solutions:
  
--- 903,922 ----
  
     Another solution to this problem would be using Four-Octet ASNs
     ([RFC6793]), where there are additional Private Use ASNs available,
!    see [IANA.AS].  Use of Four-Octet ASNs puts additional protocol
!    complexity in the BGP implementation and should be balanced against
     the complexity of re-use when considering REQ3 and REQ4.  Perhaps
     more importantly, they are not yet supported by all BGP
     implementations, which may limit vendor selection of DC equipment.
!    When supported, ensure that deployed implementations are able to remove
!    the Private Use ASNs when external connectivity to these ASes is
!    required (Section 5.2.4).
  
  5.2.3.  Prefix Advertisement
  
     A Clos topology features a large number of point-to-point links and
     associated prefixes.  Advertising all of these routes into BGP may
!    create FIB overload in the network devices.  Advertising
     these links also puts additional path computation stress on the BGP
     control plane for little benefit.  There are two possible solutions:
  
***************
*** 925,951 ****
        device, distant networks will automatically be reachable via the
        advertising EBGP peer and do not require reachability to these
        prefixes.  However, this may complicate operations or monitoring:
!       e.g. using the popular "traceroute" tool will display IP addresses
        that are not reachable.
  
     o  Advertise point-to-point links, but summarize them on every
        device.  This requires an address allocation scheme such as
        allocating a consecutive block of IP addresses per Tier-1 and
        Tier-2 device to be used for point-to-point interface addressing
!       to the lower layers (Tier-2 uplinks will be numbered out of Tier-1
!       addressing and so forth).
  
     Server subnets on Tier-3 devices must be announced into BGP without
     using route summarization on Tier-2 and Tier-1 devices.  Summarizing
     subnets in a Clos topology results in route black-holing under a
!    single link failure (e.g. between Tier-2 and Tier-3 devices) and
     hence must be avoided.  The use of peer links within the same tier to
     resolve the black-holing problem by providing "bypass paths" is
     undesirable due to O(N^2) complexity of the peering mesh and waste of
     ports on the devices.  An alternative to the full-mesh of peer-links
!    would be using a simpler bypass topology, e.g. a "ring" as described
     in [FB4POST], but such a topology adds extra hops and has very
!    limited bisection bandwidth, in addition requiring special tweaks to
  
  
  
--- 925,951 ----
        device, distant networks will automatically be reachable via the
        advertising EBGP peer and do not require reachability to these
        prefixes.  However, this may complicate operations or monitoring:
!       e.g., using the popular "traceroute" tool will display IP addresses
        that are not reachable.
  
     o  Advertise point-to-point links, but summarize them on every
        device.  This requires an address allocation scheme such as
        allocating a consecutive block of IP addresses per Tier-1 and
        Tier-2 device to be used for point-to-point interface addressing
!       to the lower layers (Tier-2 uplink addresses will be allocated
!       from Tier-1 address blocks and so forth).
  
     Server subnets on Tier-3 devices must be announced into BGP without
     using route summarization on Tier-2 and Tier-1 devices.  Summarizing
     subnets in a Clos topology results in route black-holing under a
!    single link failure (e.g., between Tier-2 and Tier-3 devices) and
     hence must be avoided.  The use of peer links within the same tier to
     resolve the black-holing problem by providing "bypass paths" is
     undesirable due to O(N^2) complexity of the peering mesh and waste of
     ports on the devices.  An alternative to the full-mesh of peer-links
!    would be using a simpler bypass topology, e.g., a "ring" as described
     in [FB4POST], but such a topology adds extra hops and has very
!    limited bisectional bandwidth.  Additionally, special tweaks are required to
  
  
  
***************
*** 956,963 ****
  
     make BGP routing work - such as possibly splitting every device into
     an ASN on its own.  Later in this document, Section 8.2 introduces a
!    less intrusive method for performing a limited form route
!    summarization in Clos networks and discusses it's associated trade-
     offs.
  
  5.2.4.  External Connectivity
--- 956,963 ----
  
     make BGP routing work - such as possibly splitting every device into
     an ASN on its own.  Later in this document, Section 8.2 introduces a
!    less intrusive method for performing a limited form of route
!    summarization in Clos networks and discusses its associated trade-
     offs.
  
  5.2.4.  External Connectivity
***************
*** 972,985 ****
     document.  These devices have to perform a few special functions:
  
     o  Hide network topology information when advertising paths to WAN
!       routers, i.e. remove Private Use ASNs [RFC6996] from the AS_PATH
        attribute.  This is typically done to avoid ASN number collisions
        between different data centers and also to provide a uniform
        AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
        prefixes originated in the topology.  An implementation specific
        BGP feature typically called "Remove Private AS" is commonly used
        to accomplish this.  Depending on implementation, the feature
!       should strip a contiguous sequence of Private Use ASNs found in
        AS_PATH attribute prior to advertising the path to a neighbor.
        This assumes that all ASNs used for intra data center numbering
        are from the Private Use ranges.  The process for stripping the
--- 972,985 ----
     document.  These devices have to perform a few special functions:
  
     o  Hide network topology information when advertising paths to WAN
!       routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH
        attribute.  This is typically done to avoid ASN number collisions
        between different data centers and also to provide a uniform
        AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
        prefixes originated in the topology.  An implementation specific
        BGP feature typically called "Remove Private AS" is commonly used
        to accomplish this.  Depending on implementation, the feature
!       should strip a contiguous sequence of Private Use ASNs found in an
        AS_PATH attribute prior to advertising the path to a neighbor.
        This assumes that all ASNs used for intra data center numbering
        are from the Private Use ranges.  The process for stripping the
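
A minimal sketch of the "Remove Private AS" behavior described above
(implementations differ, e.g., in handling AS_SETs or paths that mix
public and private ASNs; this strips only a leading contiguous run):

    PRIVATE_2_OCTET = range(64512, 65535)   # 64512-65534, per RFC 6996

    def remove_private_as(as_path):
        # Drop the leading contiguous sequence of Private Use ASNs,
        # stopping at the first public ASN, before advertising upstream.
        i = 0
        while i < len(as_path) and as_path[i] in PRIVATE_2_OCTET:
            i += 1
        return as_path[i:]

    # e.g., remove_private_as([65401, 65201, 64512]) -> []
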
***************
*** 998,1005 ****
        to the WAN Routers upstream, to provide resistance to a single-
        link failure causing the black-holing of traffic.  To prevent
        black-holing in the situation when all of the EBGP sessions to the
!       WAN routers fail simultaneously on a given device it is more
!       desirable to take the "relaying" approach rather than introducing
        the default route via complicated conditional route origination
        schemes provided by some implementations [CONDITIONALROUTE].
  
--- 998,1005 ----
        to the WAN Routers upstream, to provide resistance to a single-
        link failure causing the black-holing of traffic.  To prevent
        black-holing in the situation when all of the EBGP sessions to the
!       WAN routers fail simultaneously on a given device, it is more
!       desirable to readvertise the default route rather than originating
        the default route via complicated conditional route origination
        schemes provided by some implementations [CONDITIONALROUTE].
  
***************
*** 1017,1023 ****
     prefixes originated from within the data center in a fully routed
     network design.  For example, a network with 2000 Tier-3 devices will
     have at least 2000 servers subnets advertised into BGP, along with
!    the infrastructure or other prefixes.  However, as discussed before,
     the proposed network design does not allow for route summarization
     due to the lack of peer links inside every tier.
  
--- 1017,1023 ----
     prefixes originated from within the data center in a fully routed
     network design.  For example, a network with 2000 Tier-3 devices will
     have at least 2000 servers subnets advertised into BGP, along with
!    the infrastructure and link prefixes.  However, as discussed before,
     the proposed network design does not allow for route summarization
     due to the lack of peer links inside every tier.
  
***************
*** 1028,1037 ****
     o  Interconnect the Border Routers using a full-mesh of physical
        links or using any other "peer-mesh" topology, such as ring or
        hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
!       exchange network reachability information - e.g. by adding a mesh
        of IBGP sessions.  The interconnecting peer links need to be
        appropriately sized for traffic that will be present in the case
!       of a device or link failure underneath the Border Routers.
  
     o  Tier-1 devices may have additional physical links provisioned
        toward the Border Routers (which are Tier-2 devices from the
--- 1028,1037 ----
     o  Interconnect the Border Routers using a full-mesh of physical
        links or using any other "peer-mesh" topology, such as ring or
        hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
!       exchange network reachability information, e.g., by adding a mesh
        of IBGP sessions.  The interconnecting peer links need to be
        appropriately sized for traffic that will be present in the case
!       of a device or link failure in the mesh connecting the Border Routers.
  
     o  Tier-1 devices may have additional physical links provisioned
        toward the Border Routers (which are Tier-2 devices from the
***************
*** 1043,1049 ****
        device compared with the other devices in the Clos.  This also
        reduces the number of ports available to "regular" Tier-2 switches
        and hence the number of clusters that could be interconnected via
!       Tier-1 layer.
  
     If any of the above options are implemented, it is possible to
     perform route summarization at the Border Routers toward the WAN
--- 1043,1049 ----
        device compared with the other devices in the Clos.  This also
        reduces the number of ports available to "regular" Tier-2 switches
        and hence the number of clusters that could be interconnected via
!       the Tier-1 layer.
  
     If any of the above options are implemented, it is possible to
     perform route summarization at the Border Routers toward the WAN
***************
*** 1071,1079 ****
     ECMP is the fundamental load sharing mechanism used by a Clos
     topology.  Effectively, every lower-tier device will use all of its
     directly attached upper-tier devices to load share traffic destined
!    to the same IP prefix.  Number of ECMP paths between any two Tier-3
     devices in Clos topology equals to the number of the devices in the
!    middle stage (Tier-1).  For example, Figure 5 illustrates the
     topology where Tier-3 device A has four paths to reach servers X and
     Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
     respectively.
--- 1071,1079 ----
     ECMP is the fundamental load sharing mechanism used by a Clos
     topology.  Effectively, every lower-tier device will use all of its
     directly attached upper-tier devices to load share traffic destined
!    to the same IP prefix.  The number of ECMP paths between any two Tier-3
     devices in Clos topology equals to the number of the devices in the
!    middle stage (Tier-1).  For example, Figure 5 illustrates a
     topology where Tier-3 device A has four paths to reach servers X and
     Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
     respectively.
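
To make the quoted path-count arithmetic (and the fan-out sizing discussed
next) concrete, a tiny illustrative sketch:

    def clos_ecmp_paths(num_tier1):
        # Between two Tier-3 devices in different clusters, the number of
        # equal-cost paths equals the number of middle-stage (Tier-1)
        # devices: 4 in the Figure 5 example.
        return num_tier1

    def required_ecmp_fanout(ports_per_device):
        # With half of a device's ports facing each direction, the widest
        # ECMP group it must program is half its port count (64 -> 32).
        return ports_per_device // 2
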
***************
*** 1105,1116 ****
  
     The ECMP requirement implies that the BGP implementation must support
     multipath fan-out for up to the maximum number of devices directly
!    attached at any point in the topology in upstream or downstream
     direction.  Normally, this number does not exceed half of the ports
     found on a device in the topology.  For example, an ECMP fan-out of
     32 would be required when building a Clos network using 64-port
     devices.  The Border Routers may need to have wider fan-out to be
!    able to connect to multitude of Tier-1 devices if route summarization
     at Border Router level is implemented as described in Section 5.2.5.
     If a device's hardware does not support wider ECMP, logical link-
     grouping (link-aggregation at layer 2) could be used to provide
--- 1105,1116 ----
  
     The ECMP requirement implies that the BGP implementation must support
     multipath fan-out for up to the maximum number of devices directly
!    attached at any point in the topology in the upstream or downstream
     direction.  Normally, this number does not exceed half of the ports
     found on a device in the topology.  For example, an ECMP fan-out of
     32 would be required when building a Clos network using 64-port
     devices.  The Border Routers may need to have wider fan-out to be
!    able to connect to a multitude of Tier-1 devices if route summarization
     at Border Router level is implemented as described in Section 5.2.5.
     If a device's hardware does not support wider ECMP, logical link-
     grouping (link-aggregation at layer 2) could be used to provide
***************
*** 1122,1131 ****
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
  
  
!    "hierarchical" ECMP (Layer 3 ECMP followed by Layer 2 ECMP) to
     compensate for fan-out limitations.  Such approach, however,
     increases the risk of flow polarization, as less entropy will be
!    available to the second stage of ECMP.
  
     Most BGP implementations declare paths to be equal from an ECMP
     perspective if they match up to and including step (e) in
--- 1122,1131 ----
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
  
  
!    "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to
     compensate for fan-out limitations.  Such approach, however,
     increases the risk of flow polarization, as less entropy will be
!    available at the second stage of ECMP.
  
     Most BGP implementations declare paths to be equal from an ECMP
     perspective if they match up to and including step (e) in
***************
*** 1148,1154 ****
     perspective of other devices, such a prefix would have BGP paths with
     different AS_PATH attribute values, while having the same AS_PATH
     attribute lengths.  Therefore, BGP implementations must support load
!    sharing over above-mentioned paths.  This feature is sometimes known
     as "multipath relax" or "multipath multiple-as" and effectively
     allows for ECMP to be done across different neighboring ASNs if all
     other attributes are equal as already described in the previous
--- 1148,1154 ----
     perspective of other devices, such a prefix would have BGP paths with
     different AS_PATH attribute values, while having the same AS_PATH
     attribute lengths.  Therefore, BGP implementations must support load
!    sharing over the above-mentioned paths.  This feature is sometimes known
     as "multipath relax" or "multipath multiple-as" and effectively
     allows for ECMP to be done across different neighboring ASNs if all
     other attributes are equal as already described in the previous
***************
*** 1182,1199 ****
  
     It is often desirable to have the hashing function used for ECMP to
     be consistent (see [CONS-HASH]), to minimize the impact on flow to
!    next-hop affinity changes when a next-hop is added or removed to ECMP
     group.  This could be used if the network device is used as a load
     balancer, mapping flows toward multiple destinations - in this case,
!    losing or adding a destination will not have detrimental effect of
     currently established flows.  One particular recommendation on
     implementing consistent hashing is provided in [RFC2992], though
     other implementations are possible.  This functionality could be
     naturally combined with weighted ECMP, with the impact of the next-
     hop changes being proportional to the weight of the given next-hop.
     The downside of consistent hashing is increased load on hardware
!    resource utilization, as typically more space is required to
!    implement a consistent-hashing region.
  
  7.  Routing Convergence Properties
  
--- 1182,1199 ----
  
     It is often desirable to have the hashing function used for ECMP to
     be consistent (see [CONS-HASH]), to minimize the impact on flow to
!    next-hop affinity changes when a next-hop is added or removed to an ECMP
     group.  This could be used if the network device is used as a load
     balancer, mapping flows toward multiple destinations - in this case,
!    losing or adding a destination will not have a detrimental effect on
     currently established flows.  One particular recommendation on
     implementing consistent hashing is provided in [RFC2992], though
     other implementations are possible.  This functionality could be
     naturally combined with weighted ECMP, with the impact of the next-
     hop changes being proportional to the weight of the given next-hop.
     The downside of consistent hashing is increased load on hardware
!    resource utilization, as typically more resources (e.g., TCAM space)
!    are required to implement a consistent-hashing function.
  
  7.  Routing Convergence Properties
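
(On the consistent-hashing point above: a sketch of one consistent
scheme, rendezvous or highest-random-weight hashing. [RFC2992] analyzes a
different method, hash-threshold, and hardware implementations vary.)

    import hashlib

    def pick_next_hop(flow_key, next_hops):
        # Each flow scores every next-hop and picks the highest score;
        # removing one next-hop only remaps the flows that were pinned to
        # it, preserving flow-to-next-hop affinity for the rest.
        def score(nh):
            h = hashlib.md5((flow_key + "|" + nh).encode()).digest()
            return int.from_bytes(h, "big")
        return max(next_hops, key=score)
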
  
***************
*** 1209,1224 ****
     driven mechanism to obtain updates on IGP state changes.  The
     proposed routing design does not use an IGP, so the remaining
     mechanisms that could be used for fault detection are BGP keep-alive
!    process (or any other type of keep-alive mechanism) and link-failure
     triggers.
  
     Relying solely on BGP keep-alive packets may result in high
!    convergence delays, in the order of multiple seconds (on many BGP
     implementations the minimum configurable BGP hold timer value is
     three seconds).  However, many BGP implementations can shut down
     local EBGP peering sessions in response to the "link down" event for
     the outgoing interface used for BGP peering.  This feature is
!    sometimes called as "fast fallover".  Since links in modern data
     centers are predominantly point-to-point fiber connections, a
     physical interface failure is often detected in milliseconds and
     subsequently triggers a BGP re-convergence.
--- 1209,1224 ----
     driven mechanism to obtain updates on IGP state changes.  The
     proposed routing design does not use an IGP, so the remaining
     mechanisms that could be used for fault detection are BGP keep-alive
!    time-out (or any other type of keep-alive mechanism) and link-failure
     triggers.
  
     Relying solely on BGP keep-alive packets may result in high
!    convergence delays, on the order of multiple seconds (on many BGP
     implementations the minimum configurable BGP hold timer value is
     three seconds).  However, many BGP implementations can shut down
     local EBGP peering sessions in response to the "link down" event for
     the outgoing interface used for BGP peering.  This feature is
!    sometimes called "fast fallover".  Since links in modern data
     centers are predominantly point-to-point fiber connections, a
     physical interface failure is often detected in milliseconds and
     subsequently triggers a BGP re-convergence.
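
The difference in detection time is easy to quantify; a trivial sketch
(timer values are examples only, not defaults of any implementation):

    def detection_time_ms(hold_time_s=3.0, link_down_ms=None):
        # Without a lower-layer trigger, the worst case is a full BGP
        # hold-time expiry; with "fast fallover" (or BFD) the session
        # is torn down as soon as the failure is signalled.
        return link_down_ms if link_down_ms is not None else hold_time_s * 1000.0

    print(detection_time_ms())                 # 3000.0 ms, keepalives only
    print(detection_time_ms(link_down_ms=10))  # 10 ms with fast fallover
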
***************
*** 1236,1242 ****
  
     Alternatively, some platforms may support Bidirectional Forwarding
     Detection (BFD) [RFC5880] to allow for sub-second failure detection
!    and fault signaling to the BGP process.  However, use of either of
     these presents additional requirements to vendor software and
     possibly hardware, and may contradict REQ1.  Until recently with
     [RFC7130], BFD also did not allow detection of a single member link
--- 1236,1242 ----
  
     Alternatively, some platforms may support Bidirectional Forwarding
     Detection (BFD) [RFC5880] to allow for sub-second failure detection
!    and fault signaling to the BGP process.  However, the use of either of
     these presents additional requirements to vendor software and
     possibly hardware, and may contradict REQ1.  Until recently with
     [RFC7130], BFD also did not allow detection of a single member link
***************
*** 1245,1251 ****
  
  7.2.  Event Propagation Timing
  
!    In the proposed design the impact of BGP Minimum Route Advertisement
     Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
     considered.  Per the standard it is required for BGP implementations
     to space out consecutive BGP UPDATE messages by at least MRAI
--- 1245,1251 ----
  
  7.2.  Event Propagation Timing
  
!    In the proposed design the impact of the BGP Minimum Route Advertisement
     Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
     considered.  Per the standard it is required for BGP implementations
     to space out consecutive BGP UPDATE messages by at least MRAI
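
As a rough upper bound, each EBGP hop can add up to one full MRAI interval
if its timer for the affected prefix is running; an illustrative sketch
(three hops corresponds to a Tier-2 -> Tier-1 -> Tier-2 -> Tier-3
withdrawal path, and 30 seconds is the EBGP value suggested by [RFC4271]):

    def worst_case_propagation_s(hops, mrai_s):
        # Upper bound: every hop has just sent an UPDATE for this
        # prefix and must wait out a full MRAI interval.
        return hops * mrai_s

    for mrai_s in (0, 5, 30):
        print(f"MRAI={mrai_s:>2}s -> up to "
              f"{worst_case_propagation_s(3, mrai_s)}s added delay")
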
***************
*** 1258,1270 ****
     In a Clos topology each EBGP speaker typically has either one path
     (Tier-2 devices don't accept paths from other Tier-2 in the same
     cluster due to same ASN) or N paths for the same prefix, where N is a
!    significantly large number, e.g.  N=32 (the ECMP fan-out to the next
     Tier).  Therefore, if a link fails to another device from which a
!    path is received there is either no backup path at all (e.g. from
     perspective of a Tier-2 switch losing link to a Tier-3 device), or
!    the backup is readily available in BGP Loc-RIB (e.g. from perspective
     of a Tier-2 device losing link to a Tier-1 switch).  In the former
!    case, the BGP withdrawal announcement will propagate un-delayed and
     trigger re-convergence on affected devices.  In the latter case, the
     best-path will be re-evaluated and the local ECMP group corresponding
     to the new next-hop set changed.  If the BGP path was the best-path
--- 1258,1270 ----
     In a Clos topology each EBGP speaker typically has either one path
     (Tier-2 devices don't accept paths from other Tier-2 in the same
     cluster due to same ASN) or N paths for the same prefix, where N is a
!    significantly large number, e.g., N=32 (the ECMP fan-out to the next
     Tier).  Therefore, if a link fails to another device from which a
!    path is received, there is either no backup path at all (e.g., from the
     perspective of a Tier-2 switch losing link to a Tier-3 device), or
!    the backup is readily available in the BGP Loc-RIB (e.g., from the perspective
     of a Tier-2 device losing link to a Tier-1 switch).  In the former
!    case, the BGP withdrawal announcement will propagate without delay and
     trigger re-convergence on affected devices.  In the latter case, the
     best-path will be re-evaluated and the local ECMP group corresponding
     to the new next-hop set changed.  If the BGP path was the best-path
***************
*** 1279,1285 ****
     situation when a link between Tier-3 and Tier-2 device fails, the
     Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
     devices, withdrawing the affected prefixes.  The Tier-1 devices, in
!    turn, will relay those messages to all downstream Tier-2 devices
     (except for the originator).  Tier-2 devices other than the one
     originating the UPDATE should then wait for ALL upstream Tier-1
  
--- 1279,1285 ----
     situation when a link between Tier-3 and Tier-2 device fails, the
     Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
     devices, withdrawing the affected prefixes.  The Tier-1 devices, in
!    turn, will relay these messages to all downstream Tier-2 devices
     (except for the originator).  Tier-2 devices other than the one
     originating the UPDATE should then wait for ALL upstream Tier-1
  
***************
*** 1307,1313 ****
     features that vendors include to reduce the control plane impact of
     rapidly flapping prefixes.  However, due to issues described with
     false positives in these implementations especially under such
!    "dispersion" events, it is not recommended to turn this feature on in
     this design.  More background and issues with "route flap dampening"
     and possible implementation changes that could affect this are well
     described in [RFC7196].
--- 1307,1313 ----
     features that vendors include to reduce the control plane impact of
     rapidly flapping prefixes.  However, due to issues described with
     false positives in these implementations especially under such
!    "dispersion" events, it is not recommended to enable this feature in
     this design.  More background and issues with "route flap dampening"
     and possible implementation changes that could affect this are well
     described in [RFC7196].
***************
*** 1316,1324 ****
  
     A network is declared to converge in response to a failure once all
     devices within the failure impact scope are notified of the event and
!    have re-calculated their RIB's and consequently updated their FIB's.
     Larger failure impact scope typically means slower convergence since
!    more devices have to be notified, and additionally results in a less
     stable network.  In this section we describe BGP's advantages over
     link-state routing protocols in reducing failure impact scope for a
     Clos topology.
--- 1316,1324 ----
  
     A network is declared to converge in response to a failure once all
     devices within the failure impact scope are notified of the event and
!    have re-calculated their RIBs and consequently updated their FIBs.
     Larger failure impact scope typically means slower convergence since
!    more devices have to be notified, and results in a less
     stable network.  In this section we describe BGP's advantages over
     link-state routing protocols in reducing failure impact scope for a
     Clos topology.
***************
*** 1327,1335 ****
     the best path from the point of view of the local router is sent to
     neighbors.  As such, some failures are masked if the local node can
     immediately find a backup path and does not have to send any updates
!    further.  Notice that in the worst case ALL devices in a data center
     topology have to either withdraw a prefix completely or update the
!    ECMP groups in the FIB.  However, many failures will not result in
     such a wide impact.  There are two main failure types where impact
     scope is reduced:
  
--- 1327,1335 ----
     the best path from the point of view of the local router is sent to
     neighbors.  As such, some failures are masked if the local node can
     immediately find a backup path and does not have to send any updates
!    further.  Notice that in the worst case, all devices in a data center
     topology have to either withdraw a prefix completely or update the
!    ECMP groups in their FIBs.  However, many failures will not result in
     such a wide impact.  There are two main failure types where impact
     scope is reduced:
  
***************
*** 1357,1367 ****
  
     o  Failure of a Tier-1 device: In this case, all Tier-2 devices
        directly attached to the failed node will have to update their
!       ECMP groups for all IP prefixes from non-local cluster.  The
        Tier-3 devices are once again not involved in the re-convergence
        process, but may receive "implicit withdraws" as described above.
  
!    Even though in case of such failures multiple IP prefixes will have
     to be reprogrammed in the FIB, it is worth noting that ALL of these
     prefixes share a single ECMP group on Tier-2 device.  Therefore, in
     the case of implementations with a hierarchical FIB, only a single
--- 1357,1367 ----
  
     o  Failure of a Tier-1 device: In this case, all Tier-2 devices
        directly attached to the failed node will have to update their
!       ECMP groups for all IP prefixes from a non-local cluster.  The
        Tier-3 devices are once again not involved in the re-convergence
        process, but may receive "implicit withdraws" as described above.
  
!    Even though in the case of such failures multiple IP prefixes will have
     to be reprogrammed in the FIB, it is worth noting that ALL of these
     prefixes share a single ECMP group on Tier-2 device.  Therefore, in
     the case of implementations with a hierarchical FIB, only a single
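
The benefit of a hierarchical FIB can be shown with a toy Python model in
which prefixes reference a shared ECMP group object instead of each
carrying its own copy of the next-hop list (names and sizes illustrative):

    # Prefixes share one ECMP group object.
    ecmp_group = {"members": ["tier1-a", "tier1-b", "tier1-c", "tier1-d"]}
    fib = {f"10.{i}.0.0/16": ecmp_group for i in range(200)}

    # Tier-1 device "tier1-d" fails: one in-place group update repairs
    # the forwarding state for all 200 prefixes at once.
    ecmp_group["members"].remove("tier1-d")
    assert all(len(e["members"]) == 3 for e in fib.values())
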
***************
*** 1375,1381 ****
     possible with the proposed design, since using this technique may
     create routing black-holes as mentioned previously.  Therefore, the
     worst control plane failure impact scope is the network as a whole,
!    for instance in a case of a link failure between Tier-2 and Tier-3
     devices.  The number of impacted prefixes in this case would be much
     less than in the case of a failure in the upper layers of a Clos
     network topology.  The property of having such large failure scope is
--- 1375,1381 ----
     possible with the proposed design, since using this technique may
     create routing black-holes as mentioned previously.  Therefore, the
     worst control plane failure impact scope is the network as a whole,
!    for instance in the case of a link failure between Tier-2 and Tier-3
     devices.  The number of impacted prefixes in this case would be much
     less than in the case of a failure in the upper layers of a Clos
     network topology.  The property of having such large failure scope is
***************
*** 1384,1397 ****
  
  7.5.  Routing Micro-Loops
  
!    When a downstream device, e.g.  Tier-2 device, loses all paths for a
     prefix, it normally has the default route pointing toward the
     upstream device, in this case the Tier-1 device.  As a result, it is
!    possible to get in the situation when Tier-2 switch loses a prefix,
!    but Tier-1 switch still has the path pointing to the Tier-2 device,
!    which results in transient micro-loop, since Tier-1 switch will keep
     passing packets to the affected prefix back to Tier-2 device, and
!    Tier-2 will bounce it back again using the default route.  This
     micro-loop will last for the duration of time it takes the upstream
     device to fully update its forwarding tables.
  
--- 1384,1397 ----
  
  7.5.  Routing Micro-Loops
  
!    When a downstream device, e.g., a Tier-2 device, loses all paths for a
     prefix, it normally has the default route pointing toward the
     upstream device, in this case the Tier-1 device.  As a result, it is
!    possible to get in the situation where a Tier-2 switch loses a prefix,
!    but a Tier-1 switch still has the path pointing to the Tier-2 device,
!    which results in a transient micro-loop, since the Tier-1 switch will keep
     passing packets to the affected prefix back to Tier-2 device, and
!    the Tier-2 device will bounce it back again using the default route.  This
     micro-loop will last for the duration of time it takes the upstream
     device to fully update its forwarding tables.
  
***************
*** 1402,1408 ****
  
  
!    To minimize impact of the micro-loops, Tier-2 and Tier-1 switches can
     be configured with static "discard" or "null" routes that will be
     more specific than the default route for prefixes missing during
     network convergence.  For Tier-2 switches, the discard route should
--- 1402,1408 ----
  
  
!    To minimize the impact of such micro-loops, Tier-2 and Tier-1 switches can
     be configured with static "discard" or "null" routes that will be
     more specific than the default route for prefixes missing during
     network convergence.  For Tier-2 switches, the discard route should
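
A toy longest-prefix-match lookup in Python shows how a more-specific
discard route absorbs traffic for a prefix that disappears during
convergence, instead of letting the default route bounce it back
(addresses and actions are illustrative):

    import ipaddress

    def lookup(fib, dst):
        # Longest-prefix-match over a dict of prefix -> action.
        addr = ipaddress.ip_address(dst)
        matches = [p for p in fib if addr in ipaddress.ip_network(p)]
        return fib[max(matches,
                       key=lambda p: ipaddress.ip_network(p).prefixlen)]

    fib = {
        "0.0.0.0/0":    "forward via default (toward Tier-1)",
        "10.4.0.0/16":  "discard",   # covers this cluster's prefixes
        "10.4.7.0/24":  "forward toward Tier-3 switch",
    }

    print(lookup(fib, "10.4.7.10"))  # specific route wins
    del fib["10.4.7.0/24"]           # prefix lost during convergence
    print(lookup(fib, "10.4.7.10"))  # "discard" -- no bouncing on default
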
***************
*** 1417,1423 ****
  
  8.1.  Third-party Route Injection
  
!    BGP allows for a "third-party", i.e. directly attached, BGP speaker
     to inject routes anywhere in the network topology, meeting REQ5.
     This can be achieved by peering via a multihop BGP session with some
     or even all devices in the topology.  Furthermore, BGP diverse path
--- 1417,1423 ----
  
  8.1.  Third-party Route Injection
  
!    BGP allows for a "third-party", i.e., directly attached, BGP speaker
     to inject routes anywhere in the network topology, meeting REQ5.
     This can be achieved by peering via a multihop BGP session with some
     or even all devices in the topology.  Furthermore, BGP diverse path
***************
*** 1427,1433 ****
     implementation.  Unfortunately, in many implementations ADD-PATH has
     been found to only support IBGP properly due to the use cases it was
     originally optimized for, which limits the "third-party" peering to
!    IBGP only, if the feature is used.
  
     To implement route injection in the proposed design, a third-party
     BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
--- 1427,1433 ----
     implementation.  Unfortunately, in many implementations ADD-PATH has
     been found to only support IBGP properly due to the use cases it was
     originally optimized for, which limits the "third-party" peering to
!    IBGP only.
  
     To implement route injection in the proposed design, a third-party
     BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
***************
*** 1442,1453 ****
     As mentioned previously, route summarization is not possible within
     the proposed Clos topology since it makes the network susceptible to
     route black-holing under single link failures.  The main problem is
!    the limited number of redundant paths between network elements, e.g.
     there is only a single path between any pair of Tier-1 and Tier-3
     devices.  However, some operators may find route aggregation
     desirable to improve control plane stability.
  
!    If planning on using any technique to summarize within the topology
     modeling of the routing behavior and potential for black-holing
     should be done not only for single or multiple link failures, but
  
--- 1442,1453 ----
     As mentioned previously, route summarization is not possible within
     the proposed Clos topology since it makes the network susceptible to
     route black-holing under single link failures.  The main problem is
!    the limited number of redundant paths between network elements, e.g.,
     there is only a single path between any pair of Tier-1 and Tier-3
     devices.  However, some operators may find route aggregation
     desirable to improve control plane stability.
  
!    If any technique to summarize within the topology is planned,
     modeling of the routing behavior and potential for black-holing
     should be done not only for single or multiple link failures, but
  
***************
*** 1458,1468 ****
  
  
!    also fiber pathway failures or optical domain failures if the
     topology extends beyond a physical location.  Simple modeling can be
     done by checking the reachability on devices doing summarization
     under the condition of a link or pathway failure between a set of
!    devices in every tier as well as to the WAN routers if external
     connectivity is present.
  
     Route summarization would be possible with a small modification to
--- 1458,1468 ----
  
  
!    also for fiber pathway failures or optical domain failures when the
     topology extends beyond a physical location.  Simple modeling can be
     done by checking the reachability on devices doing summarization
     under the condition of a link or pathway failure between a set of
!    devices in every tier as well as to the WAN routers when external
     connectivity is present.
  
     Route summarization would be possible with a small modification to
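
Such modeling can be as simple as a graph reachability check under
enumerated link failures; a toy Python sketch (topology and device names
invented for illustration):

    from itertools import combinations

    def reachable(adj, src, dst, failed=frozenset()):
        # Simple graph search that skips failed (undirected) links.
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nbr in adj[node]:
                if frozenset((node, nbr)) not in failed and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        return False

    adj = {"t1-a": ["t2-a", "t2-b"], "t1-b": ["t2-a", "t2-b"],
           "t2-a": ["t1-a", "t1-b", "t3"], "t2-b": ["t1-a", "t1-b", "t3"],
           "t3":   ["t2-a", "t2-b"]}
    links = [frozenset(p) for p in [("t1-a", "t2-a"), ("t1-a", "t2-b"),
                                    ("t1-b", "t2-a"), ("t1-b", "t2-b"),
                                    ("t2-a", "t3"), ("t2-b", "t3")]]

    for l1, l2 in combinations(links, 2):
        if not reachable(adj, "t1-a", "t3", failed={l1, l2}):
            print("black-hole risk:", sorted(l1), "+", sorted(l2))
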
***************
*** 1519,1544 ****
     cluster from Tier-2 devices since each of them has only a single path
     down to this prefix.  It would require dual-homed servers to
     accomplish that.  Also note that this design is only resilient to
!    single link failure.  It is possible for a double link failure to
     isolate a Tier-2 device from all paths toward a specific Tier-3
     device, thus causing a routing black-hole.
  
!    A result of the proposed topology modification would be reduction of
     Tier-1 devices' port capacity.  This limits the maximum number of
     attached Tier-2 devices and therefore will limit the maximum DC
     network size.  A larger network would require different Tier-1
     devices that have higher port density to implement this change.
  
     Another problem is traffic re-balancing under link failures.  Since
!    three are two paths from Tier-1 to Tier-3, a failure of the link
     between Tier-1 and Tier-2 switch would result in all traffic that was
     taking the failed link to switch to the remaining path.  This will
!    result in doubling of link utilization on the remaining link.
  
  8.2.2.  Simple Virtual Aggregation
  
     A completely different approach to route summarization is possible,
!    provided that the main goal is to reduce the FIB pressure, while
     allowing the control plane to disseminate full routing information.
     Firstly, it could be easily noted that in many cases multiple
     prefixes, some of which are less specific, share the same set of the
--- 1519,1544 ----
     cluster from Tier-2 devices since each of them has only a single path
     down to this prefix.  It would require dual-homed servers to
     accomplish that.  Also note that this design is only resilient to
!    single link failures.  It is possible for a double link failure to
     isolate a Tier-2 device from all paths toward a specific Tier-3
     device, thus causing a routing black-hole.
  
!    A result of the proposed topology modification would be a reduction of
     Tier-1 devices' port capacity.  This limits the maximum number of
     attached Tier-2 devices and therefore will limit the maximum DC
     network size.  A larger network would require different Tier-1
     devices that have higher port density to implement this change.
  
     Another problem is traffic re-balancing under link failures.  Since
!    there are two paths from Tier-1 to Tier-3, a failure of the link
     between Tier-1 and Tier-2 switch would result in all traffic that was
     taking the failed link to switch to the remaining path.  This will
!    result in doubling the utilization of the remaining link.
  
  8.2.2.  Simple Virtual Aggregation
  
     A completely different approach to route summarization is possible,
!    provided that the main goal is to reduce the FIB size, while
     allowing the control plane to disseminate full routing information.
     Firstly, it could be easily noted that in many cases multiple
     prefixes, some of which are less specific, share the same set of the
***************
*** 1550,1563 ****
     [RFC6769] and only install the least specific route in the FIB,
     ignoring more specific routes if they share the same next-hop set.
     For example, under normal network conditions, only the default route
!    need to be programmed into FIB.
  
     Furthermore, if the Tier-2 devices are configured with summary
!    prefixes covering all of their attached Tier-3 device's prefixes the
     same logic could be applied in Tier-1 devices as well, and, by
     induction to Tier-2/Tier-3 switches in different clusters.  These
     summary routes should still allow for more specific prefixes to leak
!    to Tier-1 devices, to enable for detection of mismatches in the next-
     hop sets if a particular link fails, changing the next-hop set for a
     specific prefix.
  
--- 1550,1563 ----
     [RFC6769] and only install the least specific route in the FIB,
     ignoring more specific routes if they share the same next-hop set.
     For example, under normal network conditions, only the default route
!    needs to be programmed into the FIB.
  
     Furthermore, if the Tier-2 devices are configured with summary
!    prefixes covering all of their attached Tier-3 devices' prefixes, the
     same logic could be applied in Tier-1 devices as well, and, by
     induction to Tier-2/Tier-3 switches in different clusters.  These
     summary routes should still allow for more specific prefixes to leak
!    to Tier-1 devices, to enable detection of mismatches in the next-
     hop sets if a particular link fails, changing the next-hop set for a
     specific prefix.
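
The suppression logic can be prototyped in a few lines of Python (a
simplification of what [RFC6769] specifies; prefixes and next-hop sets are
illustrative):

    import ipaddress

    def compress_fib(rib):
        # FIB suppression in the spirit of [RFC6769]: skip a more
        # specific route whose next-hop set matches the one already
        # installed for its closest covering prefix.
        fib = {}
        for prefix in sorted(rib,
                             key=lambda p: ipaddress.ip_network(p).prefixlen):
            net = ipaddress.ip_network(prefix)
            covers = [c for c in fib
                      if net != ipaddress.ip_network(c)
                      and net.subnet_of(ipaddress.ip_network(c))]
            if covers:
                closest = max(covers,
                              key=lambda c: ipaddress.ip_network(c).prefixlen)
                if fib[closest] == rib[prefix]:
                    continue   # covered: rely on the less specific route
            fib[prefix] = rib[prefix]
        return fib

    rib = {"0.0.0.0/0":   {"t1-a", "t1-b"},
           "10.4.0.0/16": {"t1-a", "t1-b"},   # same set: suppressed
           "10.5.0.0/16": {"t1-a"}}           # differs: installed
    print(sorted(compress_fib(rib)))  # ['0.0.0.0/0', '10.5.0.0/16']
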
  
***************
*** 1571,1584 ****
  
  
     Re-stating once again, this technique does not reduce the amount of
!    control plane state (i.e.  BGP UPDATEs/BGP LocRIB sizing), but only
!    allows for more efficient FIB utilization, by spotting more specific
!    prefixes that share their next-hops with less specifics.
  
  8.3.  ICMP Unreachable Message Masquerading
  
     This section discusses some operational aspects of not advertising
!    point-to-point link subnets into BGP, as previously outlined as an
     option in Section 5.2.3.  The operational impact of this decision
     could be seen when using the well-known "traceroute" tool.
     Specifically, IP addresses displayed by the tool will be the link's
--- 1571,1585 ----
  
  
     Re-stating once again, this technique does not reduce the amount of
!    control plane state (i.e., BGP UPDATEs/BGP Loc-RIB size), but only
!    allows for more efficient FIB utilization, by detecting more specific
!    prefixes that share their next-hop set with a subsuming less specific
!    prefix.
  
  8.3.  ICMP Unreachable Message Masquerading
  
     This section discusses some operational aspects of not advertising
!    point-to-point link subnets into BGP, as previously identified as an
     option in Section 5.2.3.  The operational impact of this decision
     could be seen when using the well-known "traceroute" tool.
     Specifically, IP addresses displayed by the tool will be the link's
***************
*** 1587,1605 ****
     complicated.
  
     One way to overcome this limitation is by using the DNS subsystem to
!    create the "reverse" entries for the IP addresses of the same device
!    pointing to the same name.  The connectivity then can be made by
!    resolving this name to the "primary" IP address of the devices, e.g.
     its Loopback interface, which is always advertised into BGP.
     However, this creates a dependency on the DNS subsystem, which may be
     unavailable during an outage.
  
     Another option is to make the network device perform IP address
     masquerading, that is rewriting the source IP addresses of the
!    appropriate ICMP messages sent off of the device with the "primary"
     IP address of the device.  Specifically, the ICMP Destination
     Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
!    Exceeded (type 11) code 0, which are involved in proper working of
     the "traceroute" tool.  With this modification, the "traceroute"
     probes sent to the devices will always be sent back with the
     "primary" IP address as the source, allowing the operator to discover
--- 1588,1606 ----
     complicated.
  
     One way to overcome this limitation is by using the DNS subsystem to
!    create the "reverse" entries for these point-to-point IP addresses
!    pointing to the same name as the loopback address.  Connectivity can
!    then be established by resolving this name to the "primary" IP
!    address of the device, e.g.,
     its Loopback interface, which is always advertised into BGP.
     However, this creates a dependency on the DNS subsystem, which may be
     unavailable during an outage.
  
     Another option is to make the network device perform IP address
     masquerading, that is rewriting the source IP addresses of the
!    appropriate ICMP messages sent by the device with the "primary"
     IP address of the device.  Specifically, the ICMP Destination
     Unreachable Message (type 3) code 3 (port unreachable) and ICMP Time
!    Exceeded (type 11) code 0, which are required for correct operation of
     the "traceroute" tool.  With this modification, the "traceroute"
     probes sent to the devices will always be sent back with the
     "primary" IP address as the source, allowing the operator to discover

Thanks,
Acee

_______________________________________________
rtgwg mailing list
rtgwg <at> ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg
Jeff Tantsura | 21 Apr 09:42 2016

WGLC on draft-ietf-rtgwg-rlfa-node-protection

Dear RTGWG,
 
The authors of draft-ietf-rtgwg-rlfa-node-protection have told us that the
draft is ready for working group last call (WGLC).
 
Before we do the WGLC we want to do an IPR poll on the document.
 
This mail starts that IPR poll.
 
Are you aware of any IPR that applies to draft-ietf-rtgwg-rlfa-node-protection?
 
If so, has this IPR been disclosed in compliance with IETF IPR rules
(see RFCs 3979, 4879, 3669, and 5378 for more details)?
 
Currently there are two IPR disclosures on draft-psarkar-rtgwg-rlfa-node-protection
(which was the pre-working group version of draft-ietf-rtgwg-rlfa-node-protection):

If you are listed as a document author or contributor please respond to
this email regardless of whether or not you are aware of any relevant
IPR. *The response needs to be sent to the RTGWG mailing list.* The
document will not advance to the next stage until a response has been
received from each author and contributor.
 
If you are on the RTGWG email list but are not listed as an author or
contributor, then please explicitly respond only if you are aware of any
IPR that has not yet been disclosed in conformance with IETF rules.
 
Thanks, 
Jeff and Chris
_______________________________________________
rtgwg mailing list
rtgwg <at> ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg
