Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt> (Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices) to Informational RFC
2014-03-04 05:01:00 GMT
Reviews of draft-ietf-bmwg-sip-bench-term-08 and draft-ietf-bmwg-sip-bench-meth-08
**G: Summary: These drafts are not ready for publication as RFCs.
***Response, G: Summary: We have edited the documents in light of the comments provided by Robert and other reviewers, in light of experience running the tests in a lab environment, and with the collaboration of a vendor of a commercial product. We changed the titles of the documents to reflect their scope more accurately; we reduced the number of benchmarks and the number of tests. We reduced the number of distinct test architectures to two and moved the illustrations of the two architectures to the Methodology document for ease of use. Details on these and other changes are inline below.
**G: Item 1: First, some of the text in these documents shows signs of being old, and the
working group may have been staring at them so long that they've become hard to
see. The terminology document says "The issue of overload in SIP networks is
currently a topic of discussion in the SIPPING WG." (SIPPING was closed in
2009). The methodology document suggests a "flooding" rate that is orders of
magnitude below what simple devices achieve at the moment. That these survived
working group last call indicates a different type of WG review may be needed
to groom other bugs out of the documents.
***Response, G: Item 1:
We removed comments and tests related to flooding from the documents.
**G: Item 2: Who is asking for these benchmarks, and are they (still) participating in the
group? The measurements defined here are very simplistic and will provide
limited insight into the relative performance of two elements in a real
deployment. The documents should be clear about their limitations, and it would
be good to know that the community asking for these benchmarks is getting tools
that will actually be useful to them. The crux of these two documents is in the
last paragraph of the introduction to the methodology doc: "Finally, the
overall value of these tests is to serve as a comparison function between
multiple SIP implementations". The documents punt on providing any comparison
guidance, but even if we assume someone can figure that out, do these
benchmarks provide something actually useful for inputs?
***Response, G: Item 2:
Yes, they are valuable to the community.
1. A major SBC vendor used these documents, and the paid services of two students, to perform the tests described therein and learn the values of the benchmarks, which were subsequently published for external release.
2. Regarding the measurements being simplistic: they were intentionally designed to be simplistic, because the goal of the BMWG is not to reproduce real-world traffic in the lab. To quote the BMWG charter: "To better distinguish the BMWG from other measurement initiatives in the IETF, the scope of the BMWG is limited to the characterization of implementations of various internetworking technologies using controlled stimuli in a laboratory environment." Said differently, the BMWG does not attempt to produce benchmarks for live, operational networks.
3. Regarding the documents not providing any comparison guidance: again, that is intentional. The documents were designed such that testing two different implementations will result in two different reports that can then be compared by operations personnel. It is not the job of the document itself to provide comparison guidance. The metrics generated by the methods in these documents define a frontier beyond which "there be dragons."
4. In summary, we believe that these documents are useful and they have been used by vendors in the community.
**G: Item 3: It would be good to explain how these documents relate to RFC6076.
***Response, G: Item 3:
The authors have been in contact for several years and agreed that there is little overlap. RFC 6076 relates to the end-to-end performance of a service on a network. These drafts, on the other hand, refer to lab tests of a device.
**G: Item 4: The terminology tries to refine the definition of session, but the definition
provided, "The combination of signaling and media messages and processes that
support a SIP-based service" doesn't answer what's in one session vs another.
Trying to generically define session has been hard and several working groups
have struggled with it (see INSIPID for a current version of that
conversation). This document doesn't _need_ a generic definition of session -
it only needs to define the set of messages that it is measuring. It would be
much clearer to say "for the purposes of this document, a session is the set
of SIP messages associated with an INVITE initiated dialog and any Associated
Media, or a series of related SIP MESSAGE requests". (And looking at the
benchmarks, you aren't leveraging related MESSAGE requests - they all appear to
be completely independent). Introducing the concepts of Invite-initiated
sessions and non-invite-initiated sessions doesn't actually help define the
metrics. When you get to the metrics, you can speak concretely in terms of a
series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a short
introduction relating these to "Session Attempts" for folks with PSTN
backgrounds, will be clearer.
To be clear, I strongly suggest a fundamental restructuring of the document to
describe the benchmarks in terms of dialogs and transactions, and remove the IS
and NS concepts completely.
***Response, G: Item 4: Re-definition of a session:
We believe that the 3D depiction of the session is useful. As we state in the document, the definition is for the purpose of this document only. The reason we created it was to be able to refer to all the different cases: an INVITE-initiated session with media, in which case all three of the components are non-null; an INVITE-initiated session without media, in which case the media and control components are null and only the Sig component is non-null; and a non-INVITE-initiated session, such as REGISTER or MESSAGE, in which case, again, the only non-null component is the Sig component. We will, in the next revision of the document, refer to the diagram and its nomenclature in our descriptions of the metrics and the test cases. Each test case describes the set of SIP messages and the order in which they should be sent. For this reason we do not need to define a session as this or that set of SIP requests.
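To make the three cases concrete, here is a minimal sketch (Python; the class and field names are ours for illustration, not taken from the drafts) of the session as a three-component vector in which each component may be null:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Session:
        # The three components of the session vector; None is a null component.
        sig: Optional[str]      # signaling, e.g. "INVITE" or "REGISTER"
        media: Optional[str]    # media stream, e.g. "RTP/G.711"
        control: Optional[str]  # media control, e.g. "RTCP"

    # INVITE-initiated session with media: all three components non-null.
    invite_with_media = Session(sig="INVITE", media="RTP/G.711", control="RTCP")

    # INVITE-initiated session without media: only Sig is non-null.
    invite_no_media = Session(sig="INVITE", media=None, control=None)

    # Non-INVITE-initiated session (e.g. REGISTER): only Sig is non-null.
    register_session = Session(sig="REGISTER", media=None, control=None)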
**G: Item 5: The INVITE related tests assume no provisional responses, leaving out the
effect on a device's memory when the state machines it is maintaining transition
to the proceeding state. Further, by not including provisionals, and building
the tests to search for Timer B firing, the tests ensure there will be multiple
retransmissions of the INVITE (when using UDP) that the device being tested has
to handle. The traffic an element has to handle and likely the memory it will
consume will be very different with even a single 100 Trying, which is the more
usual case in deployed networks. The document should be clear _why_ it chose
the test model it did and left out metrics that took having a provisional
response into account. Similarly, you are leaving out the delayed-offer INVITE
transactions used by 3pcc and it should be more obvious that you are doing so.
Likewise, the media oriented tests take a very basic approach to simulating
media. It should be explicitly stated that you are simulating the effects of a
codec like G.711 and that you are assuming an element would only be forwarding
packets and has to do no transcoding work. It's not clear from the documents
whether the EA is generating actual media or dummy packets. If it's actual
media, the test parameters that assume constant sized packets at a constant
rate will not work well for video (and I suspect endpoints, like B2BUAs, will
terminate your call early if you send them garbage).
The sections on a series of INVITEs is fairly clear that you mean each of them
to have different dialog identifiers. I don't see any discussion of varying
the To: URI. If you don't, what's going to keep a gateway or B2BUA from
rejecting all but the first with something like Busy? Similarly, I'm not
finding where you talk about how many AoRs you are registering against in the
registration tests. I think, as written, someone could write this where all the
registers affected only one AoR.
***Response, G: Item 5:
Why not define all the metrics in terms of dialogs and transactions?
These documents describe black-box testing; the evidence of the existence of the transactions is that the session was set up. In the case of a REGISTER request, for example, we see the 200 OK to the REGISTER and know there was a successful session.
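As an illustration of this black-box view (a sketch under our own naming assumptions, using the success criterion from the response to T, 3.1.10 below, where only a 2xx final response counts as success):

    def session_attempt_succeeded(final_status: int) -> bool:
        # Black-box criterion: the EA sees only the final response.
        # Only a 2xx final response counts as success; 3xx and other
        # final responses are failures (see the response to T, 3.1.10).
        return 200 <= final_status < 300

    # A REGISTER answered with a 200 OK is a successful NS session.
    assert session_attempt_succeeded(200)
    # A 302 redirect to the same REGISTER is not counted as a success.
    assert not session_attempt_succeeded(302)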
**G: Item 6: Stress Testing:
The methodology document calls Stress Testing out of scope, but the very nature
of the Benchmarking algorithm is a stress test. You are iteratively pushing to
see at what point something fails, _exactly_ by finding the rate of attempted
sessions per second that the thing under test would consider too high.
*** Response, G: Item 6:
These are benchmark tests, designed to find the highest rate at which the system can handle session attempts with no failures of the application itself. The tests stop at the point where a single application error is observed. Stress testing would continue to run, with an ever-increasing number of errors at the application layer, at ever higher rates, until such time as the platform upon which the application runs fails catastrophically, for example by rebooting, or by stopping operation entirely and failing to reboot.
- - - - - - - - - - - - - - - - - - - - - -
Now to specific issues in document order, starting with the terminology
document (nits are separate and at the end):
** T (for Terminology document): The title and abstract are misleading - this is
not general benchmarking for SIP performance. You have a narrow set of
tests, gathering metrics on a small subset of the protocol machinery.
Please (as RFC 6076 did) look for a title that matches the scope of the
document. For instance, someone testing a SIP Events server would be ill-served
with the benchmarks defined here.
*** Response: T: The documents have been renamed as follows:
Methodology for Benchmarking Session Initiation Protocol (SIP) Devices: Basic session setup and registration
Terminology for Benchmarking Session Initiation Protocol (SIP) Devices: Basic session setup and registration
** T, section 1: RFC5393 should be a normative reference. You probably also need
to pull in RFCs 4320 and 6026 in general - they affect the state machines you are exercising.
***Response, T, section 1: Agreed. We have pulled in RFC 5393, RFC 4320, and RFC 6026.
** T, 3.1.1: As noted above, this definition of session is not useful. It
doesn't provide any distinction between two different sessions. I strongly
disagree that SIP reserves "session" to describe services analogous to
telephone calls on a switched network - please provide a reference. SIP INVITE
transactions can pend forever - it is only the limited subset of the use of
the transactions (where you don't use a provisional response) that keeps this
communication "brief". In the normal case, an INVITE and its final response can
be separated by an arbitrary amount of time. Instead of trying to tweak this
text, I suggest replacing all of it with simpler, more direct descriptions of
the sequence of messages you are using for the benchmarks you are defining.
***Response, T, 3.1.1: Same as the response to G: Item 4 above (re-definition of a session).
**T, 3.1.1: How is this vector notion (and graph) useful for this document? I
don't see that it's actually used anywhere in the documents. Similarly, the
arrays don't appear to be actually used (though you reference them from some
definitions) - What would be lost from the document if you simply removed all of them?
***Response, T, 3.1.1: It is not necessary to refer to the diagram after the initial explanation. We do in fact refer to the components of the session in the methodology document.
- - - - - - -
**T, 3.1.5, Discussion, last sentence: Why is it important to say "For UA-type
of network devices such as gateways, it is expected that the UA will be driven
into overload based on the volume of media streams it is processing." It's not
clear that's true for all such devices. How is saying anything here useful?
***Response: T, 3.1.5: We do not consider gateways anymore, so we have removed this from T,3.1.5.
**T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session
Attempt. Why not just say INVITE? You aren't actually measuring "session
attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.
***Response: T, 3.1.6: Agreed. The definition was modified to say, "A SIP INVITE or
REGISTER request sent by the EA that has not received a final response."
**T, 3.1.7: It needs to be explicit that these benchmarks are not accounting
for/allowing early dialogs.
***Response: T, 3.1.7: Agreed. We added a sentence to that effect.
**T, 3.1.8: The words "early media" appear here for the first time. Given the
way the benchmarks are defined, does it make sense to discuss early media in
these documents at all (beyond noting you do not account for it)? If so,
there needs to be much more clarity. (By the way, this Discussion will be
much easier to write in terms of dialogs).
***Response: T, 3.1.8: We now refer to early pre-call media, following what RFC 3261 does in Section 20.11 when it first talks about early media.
**T, 3.1.9, Discussion point 2: What does "the media session is established"
mean? If you leave this written as a generic definition, then is this when an
MSRP connection has been made? If you simplify it to the simple media model
currently in the document, does it mean an RTP packet has been sent? Or does it
have to be received? For the purposes of the benchmarks defined here, it
doesn't seem to matter, so why have this as part of the discussion anyway?
***Response: T, 3.1.9: We did not find that phrase in T 3.1.9, but we did find a SUBSCRIBE given as an example of a NS session and changed that to a REGISTER.
**T, 3.1.9, Definition: A series of CANCELs meets this definition.
***Response: We have clarified the fact that we only consider the REGISTER request
as an NS. The CANCELs are out of scope.
**T, 3.1.10 Discussion: This doesn't talk about 3xx responses, and they aren't
covered elsewhere in the document.
***Response: T, 3.1.10 Discussion: The 3xx has been added to the list as well. Only the 2xx is considered to be a success.
**T, 3.1.11 Discussion: Isn't the MUST in this section methodology? Why is it in
this document and not -meth-?
***Response: T, 3.1.11 Discussion: T3.1.11 was removed from version (-09).
**T, 3.1.11 Discussion, next to last sentence: "measured by the number of
distinct Call-IDs" means you are not supporting forking, or you would not count
answers from more than one leg of the fork as different sessions, like you
should. Or are you intending that there would never be an answer from more than
one leg of a fork? If so, the documents need to be clearer about the
methodology and what's actually being measured.
***Response: T, 3.1.11 Discussion: T3.1.11 was removed from version (-09).
**T, 3.2.2 Definition: There's something wrong with this definition. For
example, proxies do not create sessions (or dialogs). Did you mean "forwards"?
***Response: T, 3.2.2 Definition: Wording was changed to, "Device in the test topology that facilitates the creation of sessions between EAs."
**T, 3.2.2 Discussion: This is definition by enumeration since it uses a MUST,
and is exclusive of any future things that might sit in the middle. If that's
what you want, make this the definition. The MAY seems contradictory unless you
are saying a B2BUA or SBC is just a specialized User Agent Server. If so,
please say it that way.
***Response: T, 3.2.2 Discussion: The text now reads as follows: "The DUT is an
RFC3261-compatible network intermediary such as ..."
**T, 3.2.3: This seems out of place or under-explored. You don't appear to
actually _use_ this definition in the documents. You declare these things in
scope, but the only consequence is the line in this section about the not
lowering performance benchmarks when present. Consider making that part of the
methodology of a benchmark and removing this section. If you think it's
essential, please revisit the definition - you may want to generalize it into
_anything_ that sits on the path and may affect SIP processing times
(otherwise, what's special about this either being SIP Aware, or being a firewall?)
***Response: T, 3.2.3: References to firewalls both stateful and otherwise have been removed.
**T, 3.2.5 Definition: This definition just obfuscates things. Point to 3261's
definition instead. How is TCP a measurement unit? Does the general terminology
template include "enumeration" as a type? Do you really want to limit this
enumeration to the set of currently defined transports? Will you never run
these benchmarks for SIP over websockets?
***Response: T, 3.2.5 Definition: The set of transports now includes websockets.
**T, 3.3.2 Discussion: Again, there needs to be clarity about what it means to
"create" a media session. This description differentiates attempt vs success,
so what is it exactly that makes a media session attempt successful? When you
say number of media sessions, do you mean the number of m-lines or the total
number of INVITEs that have SDP with m-lines?
***Response: T, 3.3.2 Discussion: This term was removed.
** T, 3.3.3: This would be much clearer written in terms of transactions and dialogs
(you are already diving into transaction state machine details). This is a
place where the document needs to point out that it is not providing benchmarks
relevant to environments where provisionals are allowed to happen and INVITE
transactions are allowed to pend.
***Response: T, 3.3.3: This is about whether or not the attempt to set up a call has succeeded: how we define success and failure, and how long you wait before you declare a failure. This section defines a parameter, measured in units of time, that represents the amount of time the EA client will wait for a response from the EA server, after the elapse of which the EA will declare a failure to establish a call. Remember, this is lab testing, not end-to-end testing. We are not concerned with whether or not the call is ever set up after some errors have occurred. We are testing to failure. The failure to establish the session before X seconds have passed is a failure within the context of this test.
The edited version reads as follows:
3.3.3. Establishment Threshold Time
Configuration of the EA that represents the amount of time that an EA client will wait for a response from the EA server before declaring a Session Attempt Failure.
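A minimal sketch of how this parameter would govern the EA client (the names, the illustrative threshold value, and the polling approach are our assumptions; the drafts do not prescribe an implementation):

    import time

    ESTABLISHMENT_THRESHOLD_TIME = 5.0  # seconds; illustrative value only

    def attempt_session(send_request, poll_final_response):
        # Send one request, then wait up to the Establishment Threshold
        # Time for a final response from the EA server.
        send_request()
        deadline = time.monotonic() + ESTABLISHMENT_THRESHOLD_TIME
        while time.monotonic() < deadline:
            response = poll_final_response()  # assumed non-blocking poll
            if response is not None:
                return response               # answered within the threshold
            time.sleep(0.01)
        return None  # no final response in time: a Session Attempt Failure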
**T, 3.3.4: How does this model (A single session duration separate from the
media session hold time) produce useful benchmarks? Are you using it to allow
media to go beyond the termination of a call? If not, then you have media only
for the first part of a call? What real world thing does this reflect?
Alternatively, what part of the device or system being benchmarked does this
provide insight to?
***Response: T, 3.3.4: The term "Media Session Hold Time" was removed.
**T, 3.3.5: The document needs to be honest about the limits of this simple
model of media. It doesn't account for codecs that do not have constant packet
sizes. The benchmarks that use the model don't capture the differences based on
content of the media being sent - a B2BUA or gateway may behave
differently if it is transcoding or doing content processing (such as DTMF
detection) than it will if it is just shoveling packets without looking at them.
***Response: T, 3.3.5: The following changes were made to the definition:
Definition: Configuration on the EA for a fixed number of frames or samples to be sent in each RTP packet of the media session.
Discussion: For a single benchmark test, media sessions use a defined number of samples or frames per RTP packet. If two SBCs, for example, used the same codec but one put more frames into the RTP packets than the other, this might cause variation in performance benchmark measurements.
Measurement Units: An integer number of frames or samples, depending on whether hybrid or sample-based codecs are used, respectively.
In addition, a new parameter, "Codec Type", was added as follows:
Definition: The name of the codec used to generate the media session.
Discussion: For a single benchmark test, all sessions use the same size packet for media streams. The size of packets can cause variation in performance benchmark measurements.
Measurement Units: An alphanumeric name assigned to uniquely identify the codec.
In addition, this parameter was added to the Test Setup Report in M5.1.
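As a worked example of how Codec Type and the frames-or-samples-per-packet parameter interact (the function name is ours; the G.711 figures of 8000 one-byte samples per second are standard), the RTP payload size follows directly from the two parameters:

    def rtp_payload_bytes(samples_per_packet: int, bytes_per_sample: int = 1) -> int:
        # Payload size for a sample-based codec such as G.711
        # (8000 samples/s, 1 byte per sample).
        return samples_per_packet * bytes_per_sample

    # G.711 with 20 ms of audio per packet: 0.020 s * 8000 samples/s = 160
    # samples, i.e. a 160-byte RTP payload. Packing 40 ms (320 samples) into
    # each packet halves the packet rate but doubles the payload, which is
    # the variation in load the Discussion above refers to.
    assert rtp_payload_bytes(160) == 160
    assert rtp_payload_bytes(320) == 320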
**T, 3.3.6: Again, the model here is that any two media packets present the same
load to the thing under test. That's not true for transcoding, mixing, or
analysis (such as for dtmf detection). It's not clear that if you have two
streams, each stream has its own "constant rate". You call out having one audio
and one video stream - how do you configure different rates for them?
***Response: **T, 3.3.6: This definition has been deleted.
**T, 3.3.7: This document points to the methodology document for indicating
whether streams are bi-directional or uni-directional. I can't find where the
methodology document talks about this (the string 'direction' does not
occur in that document).
***Response: T, 3.3.7: This definition has been deleted.
**T, 3.3.8: This text is old - it was probably written pre-RFC5393. If you fork,
loop detection is not optional. This, and the methodology document should be
updated to take that into account.
***Response: T, 3.3.8: This text has been removed. It relates to loop detection, which is no longer considered in version 09.
**T, 3.3.9: Clarify if more than one leg of a fork can be answered successfully
and update 3.1.11 accordingly. Talk about how this affects the success
benchmarks (how will the other legs getting failure responses affect the counts?).
***Response: T, 3.3.9: This text has been removed. It relates to forking, which is no longer considered in version 09.
**T, 3.3.9, Measurement units: There is confusion here. The unit is probably
"endpoints". This section talks about two things, that, and type of forking.
How is "type of forking" a unit, and are these templates supposed to allow more
than one unit for a term?
***Response: T, 3.3.9: This text has been removed. It relates to forking, which is no longer considered in version 09.
**T, 3.4.2, Definition: It's not clear what "successfully completed" means. Did
you mean "successfully established"? This is a place where speaking in terms of
dialogs and transactions rather than sessions will be much clearer.
***Response: T, 3.4.2, Definition: The SER was re-defined as follows:
3.4.1. Session Establishment Rate
Definition: The maximum value of the Session Attempt Rate that the DUT can handle for an extended, pre-defined period with zero failures.
Discussion: This benchmark is obtained with zero failures, i.e., 100% of the sessions attempted by the Emulated Agent are successfully completed by the DUT. The session attempt rate provisioned on the EA is raised and lowered as described in the algorithm in the accompanying methodology document, until a traffic load at the given attempt rate, sustained over the period of time identified by T in the algorithm, completes without any failed session attempts. Sessions may be IS or NS or a mix of both, as defined in the particular test.
Measurement Units: sessions per second (sps)
See Also: Session Attempt Rate
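For illustration, a binary-search sketch of the raise-and-lower procedure (the normative algorithm lives in the methodology document; the names, bounds, and search strategy here are our assumptions):

    def session_establishment_rate(run_test, low=0.0, high=1000.0, tol=1.0):
        # Find the highest attempt rate (sessions per second) at which a
        # run of duration T completes with zero failed session attempts.
        # run_test(rate) -> True if the run at that rate had zero failures.
        best = 0.0
        while high - low > tol:
            rate = (low + high) / 2.0
            if run_test(rate):      # zero failures: try a higher rate
                best = rate
                low = rate
            else:                   # any failure: lower the rate
                high = rate
        return best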
**T, 3.4.3, This benchmark metric is underdefined. I'll focus on that in the
context of the methodology document (where the docs come closer to defining it).
This definition includes a variable T but doesn't explain it - you have to read
the methodology to know what T is all about. You might just say "for the
duration of the test" or whatever is actually correct.
***Response: T, 3.4.3: This was a reference to Session Capacity, a concept that has been removed from version 09.
**T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity". Why?
The argument you give in the next sentence just says the media session hold
time has to be at least as long as the session duration. If they were equal,
and finite, the test result does not change. What's the utility of the infinity requirement?
***Response: T, 3.4.3, Discussion: This was a reference to Session Capacity, a concept that has been removed from version 09.
**T, 3.4.4: "until it stops responding". Any non-200 response is still a
response, and if something sends a 503 or 4xx with a retry-after (which is
likely when it's truly saturating) you've hit the condition you are trying to
find. The notion that the Overload Capacity is measurable by not getting any
responses at all is questionable. This discussion has a lot of methodology in
it - why isn't that (only) in the methodology document?
***Response: T, 3.4.4: This related to Session Overload Capacity, a concept that has been removed from version 09.
**T, 3.4.5: A normal, fully correct system that challenged requests and
performed flawlessly would have a 0.5 Session Establishment Performance score.
Is that what you intended? The SHOULD in this section looks like methodology.
Why is this a SHOULD and not a MUST (the document should be clearer about why
sessions remaining established is important). Or wait - is this what Note 2 in
section 5.1 of the methodology document (which talks about reporting formats)
is supposed to change? If so, that needs to be moved to the actual methodology
and made _much_ clearer.
***Response: T, 3.4.5: This section related to Session Establishment Performance, a concept that has been removed from version 09.
**T, 3.4.6: You talk of the first non-INVITE in an NS. How are you
distinguishing subsequent non-INVITES in this NS from requests in some other
NS? Are you using dialog identifiers or something else? Why do you expect that
to matter (why is the notion of a sequence of related non-INVITEs useful from a
benchmarking perspective - there isn't state kept in intermediaries because of
them - what will make this metric distinguishable from a metric that just
focuses on the transactions?)
***Response: T, 3.4.6: This section related to Session Attempt Delay, a concept that was removed from version 09.
**T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or
some other end-to-end non-INVITE? I suspect it's because you are wanting to
focus on a simple non-INVITE transaction (which is why you are leaving out
SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear
that's why you chose it. You should also talk about whether the payload of all
of the MESSAGE requests are the same size and whether that size is a parameter
to the benchmark. (You'll likely get very different behavior from a MESSAGE
with a large payload than from one with a small one.)
***Response: T, 3.4.7: This section related to the IM Rate. We removed IM from the scope of these documents in version 09, due to the fact that there are many ways to deliver such services, and specifying one or the other to be tested would not be useful.
**T, 3.4.7: The definition says "messages completed" but the discussion talks
about "definition of success". Does success mean an IM transaction completed
successfully? If so, the definition of success for a UAC has a problem. As
written, it describes a binary outcome for the whole test, not how to determine
the success of an individual transaction - how do you get from what it
describes to a rate?
***Response: T, 3.4.7: IM is outside the scope of the documents in version 09.
**T, Appendix A: The document should better motivate why this is here.
Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are
silent on them. The discussion says you are _selecting_ a Session Attempts
Arrival Rate distribution. It would be clearer to say you are selecting the
distribution of messages sent from the EA. It's not clear how this particular
metric will benefit from different sending distributions.
***Response: T, Appendix A: Appendix A has been removed.
- - - - - - - - - - - - - - - - -