I have done my usual AD review of this draft before progressing it.
Thanks for the hard work on this!
I have a number of different comments on the draft, as given below.
None of them are sufficient for me to be concerned about starting the
IETF Last Call, but please address them as soon as possible.
Assuming an updated draft will appear and a quiet IETF Last Call, I expect
this to be on the IESG telechat in early Jan.
1) In Sec 2, 3rd paragraph, in the sentence:
"The single node in both S's P-space and E's Q-space is C; thus node C is selected as the repair tunnel's end-point."
it should be "S's extended P-space"
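The correction in 1) can be checked with a minimal sketch. It assumes a six-node ring S-E-D-C-B-A-S with symmetric unit costs; the topology, the cost-based inequalities, and all function names are assumptions made for illustration, not text from the draft. On this ring S's plain P-space and E's Q-space do not intersect; only the extended P-space yields C.

```python
import heapq

# Hypothetical topology: ring S-E-D-C-B-A-S, symmetric unit costs.
RING = {
    "S": {"A": 1, "E": 1},
    "E": {"S": 1, "D": 1},
    "D": {"E": 1, "C": 1},
    "C": {"D": 1, "B": 1},
    "B": {"C": 1, "A": 1},
    "A": {"B": 1, "S": 1},
}

def shortest_costs(graph, src):
    """Dijkstra: cost of the shortest path from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def p_space(graph, root, s, e):
    """Nodes root reaches with no shortest path (ECMP included) over link s-e."""
    d_root = shortest_costs(graph, root)
    d_e = shortest_costs(graph, e)
    # Strict inequality: an equal-cost path over s-e disqualifies the node.
    return {y for y in graph
            if y != s and d_root[y] < d_root[s] + graph[s][e] + d_e[y]}

def extended_p_space(graph, s, e):
    """Union of S's P-space and the P-spaces of S's neighbors other than E."""
    nodes = p_space(graph, s, s, e)
    for n in graph[s]:
        if n != e:
            nodes |= p_space(graph, n, s, e)
    return nodes

def q_space(graph, s, e):
    """Nodes that reach e with no shortest path over link s-e (symmetric costs assumed)."""
    d_s = shortest_costs(graph, s)
    d_e = shortest_costs(graph, e)
    return {y for y in graph
            if y not in (s, e) and d_e[y] < d_s[y] + graph[s][e]}

print(p_space(RING, "S", "S", "E") & q_space(RING, "S", "E"))      # set()
print(extended_p_space(RING, "S", "E") & q_space(RING, "S", "E"))  # {'C'}
```

C sits on an ECMP from S (via A-B-C and via E-D-C), so it falls outside S's P-space but inside neighbor A's.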
2) In Sec 2, it says: "The non-failure traffic distribution is not disrupted by the provision of such a tunnel since it is only used for repair traffic and MUST NOT be used for normal traffic."
This is obviously correct and good - but I think it would be very useful to clarify that OAM traffic to test the rLFA may transit the tunnel at any time. Otherwise, the MUST NOT could cause some confusion - depending on how one thinks about "normal traffic".
3) In Sec 3: I can't parse "Examples of worse failures are node failures (see Section 6 ), and through the failure of a shared risk link group (SRLG), the through the independent concurrent failure of multiple links, and these are out of scope for this specification."
I think you mean "Examples of worse failures are node failures (see Section 6), the failure of a shared risk link group (SRLG), and the independent concurrent failures of multiple links; protecting against such worse failures is out of scope for this specification." I would add in the failure of broadcast interfaces and NBMA interfaces for completeness, even though that was mentioned in Sec 2.
4) In Sec 4.2: " Provided both these requirements are met, packets forwarded over the repair tunnel will reach their destination and will not loop." Please change to:
"will not loop after the single link failure". Of course, looping can happen if a worse failure than protected against occurs - as with LFA. This could also be mitigated by requiring that the PQ node is downstream of the PLR, as is mentioned in Sec 4.2.2.
5) In Sec 188.8.131.52: "This may be calculated by computing an SPT at each of S's neighbors (excluding E) and excising the subtree reached via the path N->S->E."
As described here, a node Y that is reached via N->S->A would be considered to be in S's extended P-space. I realize that one would assume that Y would be in S's P-space anyway and thus it is safe to not care about this edge case. However, the ECMP considerations make it more complex so please at a minimum add in the same caveat as in Sec 184.108.40.206 "(including those routers which are members of an ECMP that includes link S-E)" suitably modified. In the cost-based version in Compute_Extended_P_Space, this is handled by ignoring any potential node from N whose shortest path goes back through S. It'd be nice if the two methods were consistent.
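The difference between the two rules in 5) can be illustrated with a minimal sketch. The topology, the function names, and the exact form of both inequalities are assumptions made for illustration: the "loose" rule excises only what neighbor N reaches over link S-E, while the "strict" rule ignores anything N reaches on a shortest path back through S.

```python
import heapq

# Hypothetical topology: Y sits behind A, so N's shortest path to Y is N->S->A->Y.
TOPO = {
    "S": {"N": 1, "A": 1, "E": 1},
    "N": {"S": 1},
    "A": {"S": 1, "Y": 1},
    "Y": {"A": 1, "E": 5},
    "E": {"S": 1, "Y": 5},
}

def shortest_costs(graph, src):
    """Dijkstra: cost of the shortest path from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def from_neighbor(graph, n, s, e, strict):
    """Nodes that neighbor n contributes to S's extended P-space under either rule."""
    d_n = shortest_costs(graph, n)
    d_s = shortest_costs(graph, s)
    d_e = shortest_costs(graph, e)
    if strict:
        # Ignore y if any shortest path from n goes back through s.
        return {y for y in graph if y != s and d_n[y] < d_n[s] + d_s[y]}
    # Excise only what n reaches over link s-e.
    return {y for y in graph
            if y != s and d_n[y] < d_n[s] + graph[s][e] + d_e[y]}

loose = from_neighbor(TOPO, "N", "S", "E", strict=False)
strict = from_neighbor(TOPO, "N", "S", "E", strict=True)
print(sorted(loose))   # ['A', 'N', 'Y']
print(sorted(strict))  # ['N']
```

In this topology A and Y are already in S's own P-space, so the final extended P-space is the same under both rules; the sketch only shows where the per-neighbor results diverge.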
6) In Sec 4.2.2: "As described in [RFC5286], always selecting a PQ node that is downstream with respect to the repairing node, prevents the formation of loops when the failure is worse than expected." Could you clarify that the PQ node is downstream with respect to the repairing node and the destination - rather than the proxy destination E? I'm fairly certain that the latter wouldn't work (but don't have an example topology created). If you disagree, let me know and I'll work on creating one. This is the constraint that is expressed in Apply_Downstream_Constraint().
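The two readings of "downstream" in 6) can be compared with a minimal sketch. It assumes a six-node ring with symmetric unit costs and takes the check to be D(PQ, target) < D(S, target); the topology, that form of the check, and the function names are assumptions for illustration. It does not show which reading loops, only that the two readings can give different answers for the same PQ node.

```python
import heapq

# Hypothetical topology: ring S-E-D-C-B-A-S, symmetric unit costs.
RING = {
    "S": {"A": 1, "E": 1},
    "E": {"S": 1, "D": 1},
    "D": {"E": 1, "C": 1},
    "C": {"D": 1, "B": 1},
    "B": {"C": 1, "A": 1},
    "A": {"B": 1, "S": 1},
}

def shortest_costs(graph, src):
    """Dijkstra: cost of the shortest path from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def is_downstream(graph, s, pq, target):
    """True if pq is strictly closer to target than the repairing node s is."""
    return shortest_costs(graph, pq)[target] < shortest_costs(graph, s)[target]

# PQ node C checked against the real destination D, then against the proxy E.
print(is_downstream(RING, "S", "C", "D"))  # True
print(is_downstream(RING, "S", "C", "E"))  # False
```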
7) In Sec 4.3: "The reader is referred to [I-D.psarkar-rtgwg-rlfa-node-protection] for further information on the use of RLFA for node repairs." Can you add "and broadcast or NBMA link repairs"? Do you feel that is accurate?
8) In Sec 6: s/"When the failure is a node failure rather than a link failure"/"When the failure is a node failure rather than a point-to-point link failure"
9) In Sec 6: "Alternatively one might choose to assume that the probability of a node failure and microloops forming is sufficiently rare that the case can be ignored." Can you please change "microloops" to "microloops forming due to use of alternates"? We know that in cases where an rLFA is necessary, the neighbor isn't loop-free, so regular microloops due to reconvergence will form.
10) In Sec 7: "In the absence of a protocol to learn the preferred IP address for targeted LDP, an LSR should attempt a targeted LDP session with the Router ID [RFC2328] [RFC5305] [RFC5340], unless it is configured otherwise." Can you please add in some text for how this would work for IPv6? I believe that there are current drafts discussing carrying Routable IP addresses (e.g. http://datatracker.ietf.org/doc/draft-ietf-ospf-routable-ip-address/ ). We know that there is interest in having IPv6 only networks with MPLS - so it'd be good not to create new gaps.
11) In Sec 8.4: "In an MPLS network, this is achieved without any scaleability impact, as the tunnels to the PQ nodes are always present as a property of an LDP-based deployment." The targeted LDP sessions don't have a scaleability impact? That the repair tunnels don't need to be specifically created as new tunnels, I agree with - but this statement is overselling. Please make the technical point more clearly.
12) In Sec 9: I feel like this is a good place to at least mention the issues with microloops from reconvergence. Since reconvergence after rLFA is going to result in a local microloop (depending on timing), at least a reference to https://tools.ietf.org/html/draft-litkowski-rtgwg-uloop-delay-03 with a recommendation to consider it is important. Otherwise, the rLFA repair happens and then traffic microloops and is lost. The fact that these local microloops occur with real impact much more often with rLFA (or any advanced FRR technique) is an important management consideration.
13) In Sec 12: Saying "To prevent their use as an attack vector the repair tunnel endpoints SHOULD be assigned from a set of addresses that are not reachable from outside the routing domain." is basically empty words unless there is more behind the Sec 7 default of using Router IDs. Can you find a reference that talks about a BCP for Router IDs not being reachable addresses outside the routing domain? Can you describe how to use the IGP extensions?
a) In Sec 220.127.116.11: "The exclusion of routers reachable via an ECMP that includes S-E prevents the forwarding subsystem attempting to a repair endpoint via the failed link S-E."
s/attempting to a repair/from attempting to use a repair
b) In Sec 10: "We propose "Remote LFA" as a natural second step." This is going to be an RFC, so rather than "propose", try "specify".