From: Talpey, Thomas [mailto:Thomas.Talpey <at> netapp.com]
Sent: Monday, March 01, 2004 8:54 PM
To: Jim Pinkerton; rddp <at> ietf.org
Subject: Comments on draft-ietf-rddp-security-01
I have reviewed the recent RDDP Security draft (http://www.ietf.org/internet-drafts/draft-ietf-rddp-security-01.txt) and have some comments in advance of tomorrow's working group meeting.
I'd like to open by complimenting the authors on an excellent second revision. This is a difficult document, but an important one.
I don't have many high-level comments. The draft seems well put together.
One style thing to mention, though: there are many, many references to "later" or to other sections, especially in the earlier ones. I suggest direct section references.
[<jim>] The current draft has one reference to “later”. This is now fixed.
Apostrophes and em-dashes seem to use a non-ASCII encoding, which appears strangely with certain text tools. Suggest conventional ASCII.
[<jim>] Ouch. Fixed 4 quotes, 10 dashes.
How specifically do we define "Partial Mutual Trust", which appears relatively early? One way the reader may take it is as an authentication guarantee for both the local application and the remote peer. But who decides what counts as "partial", for instance? Is the motivation primarily to clarify the use of shared resources, such as the SRQ?
[<jim>] The definition is in the introduction in the -02 version of the draft. I have refined the introductory text a bit in the upcoming -03 draft, though. The main issue is that the intro implies that any time you share resources, you had better do so only if there is Partial Mutual Trust. That is not true: many types of resources can be shared safely, and these are fully examined in the existing text. However, there are some conditions under which sharing a specific type of resource has unmitigated attacks; in those cases the resource should be shared only if a condition of Partial Mutual Trust exists.
In the introduction, it seems critical to describe the scope of possible damage if the trust is violated. I believe it is very different depending on the resource that is the subject of the trust. System-level damage should be described differently from damage that affects only the application. This is not a very specific recommendation, I know.
I think there are differences between the trust model as described here and the privileged/nonprivileged application distinction (at the end of the Introduction). It might be worthwhile to bring these out. For example, the trust model seems primarily oriented toward wire-based attacks, while the privileged/nonprivileged distinction is a local matter. What interaction do these have?
[<jim>] I’d appreciate specific recommendations on what to change here.
Sect 4, paragraph 1, last sentence: "should be preserved when under attack". Just delete "when under attack" and say "should be preserved". Maybe even "must".
[<jim>] done.
4.2.1 refers to a FIFO list of buffers, which may not be FIFO in the case of SRQs. Perhaps just "list of buffers" here, but also see my comment on 4.2.10.
[<jim>] Even a Shared-RQ is first-in, first-out, i.e. buffers are taken off the queue in the order they were put on. When the actual completion occurs is an entirely separate issue. That said, I don’t think the FIFO statement actually adds anything, so I will remove it.
4.2.2: buffer exposed to "the Internet"? This is vague.
[<jim>] Replaced with “network”.
4.2.2, a general comment: this refers to the RI, while I think it means the Privileged Resource Manager? Overall, I think the "RI" term is loosely used. For example, it's defined only in Figure 1, and about 2/3 of the time it seems to mean the PRM instead of the RNIC's programmatic interface. In fact, in 7.5.5 it's used to mean "Remote Invalidation"!
[<jim>] Good catch. I had abbreviated Remote Invalidate to make it fit in the table at the end of the document, but that table is gone, so it is no longer an issue. For clarity, I have removed all references to “RI” and instead used either RNIC Interface, RNIC (some of the references were not to an interface, but to the entire RNIC), or Remote Invalidate.
4.2.3 describes page translation tables. Is it worth bringing out the possibility that they are multi-level and therefore possibly shared?
[<jim>] I’d prefer not. There are a ton of different implementation techniques within the RNIC. Going into the gory details could be endless, and I don’t think it would be terribly useful. The document already assumes the Page Translation Tables can be shared, so I have added a note in this section to that effect. Section 4.2.9 states:
“The exact resource allocation algorithm for the Page Translation Table is outside the scope of this specification. It may be allocated for a specific Data Buffer, or be allocated as a pooled resource to be consumed by potentially multiple Data Buffers, or be managed in some other way. This paper attempts to abstract implementation dependent issues, and focus on higher level security issues such as resource starvation and sharing of resources between Streams.”
4.2.4: I suggest merging in Mike Krause's comment/observation that the STag does not in itself provide protection, and in fact should provide efficiency. This is important to understanding that protection comes from other facilities.
[<jim>] I’d appreciate specific text suggestions here.
4.2.5: mention that Completion Queues are of bounded size (like 4.2.6), and so the size can be attacked. A CQ attack is, in fact, much more likely than an AEQ attack.
[<jim>] So? Likely or not, they are attackable. Is there a suggested text change?
4.2.8: clarify that the RRQ is inbound.
[<jim>] ??? What is the RRQ?
After (or in) 4.2.8, the outbound RRQ should be added as well. It's not easily attacked, but this queue has interesting flow-control interactions with sends. If the queue overflows, head-of-line blocking could be encountered, which is very different from "normal" operation.
[<jim>] ?? Again, what is the RRQ? Are you talking about the Send Queue? If so, the discussion you outlined above has no security issues, so why go into it? This isn’t a document focused on the application developer...
4.2.8.2: it's very important to mention that RDMA Read exposes a buffer for RDMA Write as well, and to describe the possible protection.
[<jim>] <TBD>
4.2.8.2, last paragraph, has a vague use of the phrase "data transmission or reception". For example, RDMA Writes don't yield a completion at the data sink, contrary to what the sentence says.
[<jim>] <TBD>
4.2.9 says that an STag which has only local scope "may not be visible from the wire" (paragraph starting "The next issue"). This "may" must be a "must" for the statement to be true! In fact, if it's a "may", then a security hole is open.
[<jim>] I’ve rewritten the sentence to be clearer.
“The next issue is how an STag name is associated with a Data Buffer. For the case of an Untagged Data Buffer, there is no wire visible mapping between an STag and the Data Buffer. Note that there may, in fact, be an STag which represents the buffer. However, because the STag by definition is not visible on the wire, this is a local host specific issue which should be analyzed in the context of local host implementation specific security analysis, and thus is outside the scope of this paper.”
4.2.10: "send" and "receive" operations. These do not mean "Send", so they are somewhat confusing.
[<jim>] To me this is a nit. I’ve finessed the text a little to try to make things clearer, though. We were trying to avoid the keywords Send Queue and Receive Queue, since they are reserved words in other specifications. I’ve given up on avoiding Receive Queue and Send Queue, though; I think using them makes things a bit clearer (I also defined them in the appropriate place).
4.2.10: FIFO order? Perhaps "sequential", in the face of SRQ. The paragraph goes on to describe this, but not very clearly. I suggest not using "FIFO" in any case.
[<jim>] Done.
Section 5, first paragraph, describes when/if the "Local peer" is untrusted. I noticed here that "local peer" isn't defined, and in fact I think it means the local application or ULP? I think it would be worth a pass through the document to be sure all uses of the term are consistent. And, of course, defining the term.
[<jim>] TBD
Section 6: it's not clear to me what an attacker with send-only capabilities is.
[<jim>] On re-reading this, it really doesn’t convey the intent at all. Here’s the proposed rewording:
“An attacker’s capabilities delimit the types of attacks that attacker is able to launch. RDMAP and DDP require that the initial LLP Stream (and connection) be set up prior to transferring RDMAP/DDP Messages. Attackers with send-only capabilities must first guess the current LLP Stream parameters (e.g. TCP sequence number) before they can attack RNIC resources. Attackers with both send and receive capabilities have presumably set up a valid LLP Stream, and thus have a wider ability to attack RNIC resources.”
Is this any clearer? If not, can you propose something?
Section 7, paragraph 2: I disagree that connection setup in stream mode has no new attacks. This is where ULPs may exchange information. If the information is bad, significant resource issues can clearly result. It's true, however, that such exchanges are outside the scope of the document; I just think the document should not state that there are no new attacks.
[<jim>] I added this point to the paragraph.
7.1.1 should perhaps introduce the PD more directly. The PD is described in draft-hilland, which is not normative.
[<jim>] Draft-hilland isn’t even informative; it’s not a working group draft. Thus it seems inappropriate to bring it into this context. We’ve intentionally tried to abstract above a verbs implementation so as not to get bogged down on whether the security analysis implies that we’ve also signed up for verbs being the “one true way”. That said, the intent was that the security analysis completely cover all wire-visible security issues embodied in verbs.
7.1.1 goes on to say that a QP "should" be associated with a unique PD, or with a shared one. I find this discussion unclear. I'd also suggest incorporating Caitlin's "remote PD" concept, which helps to define the PD and STag scope relative to a remote peer.
[<jim>] Sorry, I don’t understand your issue. To me the text is clear. Please suggest specific text.
7.1.2 mentions "allocating STags in an unpredictable way". I want to point out that it is also possible for the application to "advertise" STags in such a way, if it manages its own pool. In fact, this can be much more efficient, since STag allocation is not appropriate for the performance path.
[<jim>] Changed the text to say “Allocating and/or advertising STag numbers in an unpredictable way.”
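A minimal sketch of the application-managed approach in Tom's 7.1.2 comment, assuming the application has already registered a fixed set of buffers (off the performance path) and holds their STag values as plain integers. The `StagPool` class and its methods are hypothetical and belong to no RNIC interface or to the draft; the sketch only randomizes which pre-registered STag is advertised next, using Python's `secrets` module, and assumes the STag values themselves were allocated unpredictably in the first place.

```python
import secrets


class StagPool:
    """Hypothetical application-managed pool of pre-registered STags."""

    def __init__(self, registered_stags):
        # STag values are assumed to come from an earlier registration step.
        self._free = list(registered_stags)
        self._in_use = set()

    def advertise(self):
        """Pick a free STag uniformly at random, so a remote peer cannot
        predict the next advertised value from earlier advertisements."""
        if not self._free:
            raise RuntimeError("no free STags in pool")
        index = secrets.randbelow(len(self._free))
        # Swap-remove keeps removal O(1); the order of _free carries no meaning.
        self._free[index], self._free[-1] = self._free[-1], self._free[index]
        stag = self._free.pop()
        self._in_use.add(stag)
        return stag

    def release(self, stag):
        """Return an STag to the pool. The caller is assumed to have already
        invalidated it, so stale remote access is impossible before reuse."""
        self._in_use.remove(stag)
        self._free.append(stag)
```

The per-operation cost is a random index and a list pop, which is the efficiency argument in the comment: the expensive registration happens once, and only the advertisement order varies.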
7.2.3: the MITM attack discussion first says "the only countermeasure" (paragraph 1), then says "the best countermeasure" (paragraph 2). Eh?
[<jim>] I don’t see the issue. Paragraph 1 states two ways. Paragraph 2 states that the best way is the first way (i.e. IPSec, rather than local link security).
I think the summary of 7.2.3 is that MITM attacks should be protected against in the usual way by LLP facilities, not in DDP/RDMAP. However, it may be worth mentioning that the MPA layer injects a CRC, making such an attack nontrivial (though no more secure).
[<jim>] I don’t understand the point.
In 7.4.2, to me, the most usual exposure of a buffer for RDMA Read is to provide access to NON-stale data, i.e. an actual transfer such as a client "writing" data by advertising it for pull. Thus the zeroing-memory comment applies only in certain (rare, IMO) cases. The text mentions this only generally, by saying "combination of read and write", and could be clearer. Maybe give a specific example?
[<jim>] I don’t understand the ambiguity. I walk through a specific example (i.e. “the Remote Peer may be able to examine the contents of the buffer before they are initialized with the correct data.”).
7.4.5: a great place to mention the (more difficult) flip side, RDMA Write into an RDMA Read buffer. This is a definite issue; see my 4.2.8.2 comment.
[<jim>] TBD
7.4.6: suggest using the term "aliasing" to help describe the multiple-STag situation.
[<jim>] I like it.
7.4.9: "eaves dropping" is one word. Worth referring to the MITM discussion here as well?
[<jim>] done.
7.5.1: define "scarce"? Why wouldn't all resources be under the control of the PRM?
[<jim>] This seems like a nit. The dictionary definition is used. I did change the text to say “... scarce (i.e. bounded)...”. Does this scratch your itch?
7.5.2.1: mention that if the stream is torn down, this additionally has the benefit that RNIC resources are released. For example, it cuts off a misbehaving peer drawing from a Shared RQ.
[<jim>] done.
7.5.2.3: after "The local Peer can protect itself", the first bullet mentions resize. Doesn't it really just mean "size"? Also give a hint that a safe size is >= the sum of all queues, etc., as described immediately following.
[<jim>] I meant resize. I’ve reworded it to make this clearer.
“Size the CQ to the appropriate level, as specified below (note that if the CQ currently exists, and it needs to be resized, resizing the CQ can fail, so the CQ resize should be done before sizing the Send Queue and Receive Queue on the Stream), OR”
7.5.2.3: some of the math examples are confusing. "S-RQ" looks like a minus sign but isn't? Also, MaxPosted*OnEach*Q is not clear.
[<jim>] Changed the name “S-RQ” to “SRQ”. Added formal definitions of each variable.
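To make the "safe size is >= sum of all queues" rule from the 7.5.2.3 exchange concrete, here is a minimal sketch under a simplifying assumption: every Work Request posted to a Send Queue, Receive Queue, or SRQ that reports to the CQ produces at most one completion, and each completion occupies a CQ entry until the consumer reaps it. The function names are hypothetical, and the exact variable definitions in the -03 draft may differ from this simplification.

```python
def min_cq_depth(send_queue_depths, receive_queue_depths, srq_depths=()):
    """Smallest CQ depth that cannot overflow under the stated assumption.

    send_queue_depths / receive_queue_depths: one entry per Stream whose
    Send Queue / Receive Queue completes to this CQ (leave out Receive
    Queues that draw from an SRQ). srq_depths: one entry per SRQ feeding
    this CQ.
    """
    return sum(send_queue_depths) + sum(receive_queue_depths) + sum(srq_depths)


def cq_is_safe(cq_depth, send_queue_depths, receive_queue_depths, srq_depths=()):
    """True if cq_depth covers every completion that could be outstanding."""
    return cq_depth >= min_cq_depth(send_queue_depths, receive_queue_depths, srq_depths)


# Two Streams, each with a 16-entry Send Queue and a 32-entry Receive Queue.
print(min_cq_depth([16, 16], [32, 32]))             # 96
# Same Send Queues, but receives drawn from one shared 64-entry SRQ.
print(min_cq_depth([16, 16], [], srq_depths=[64]))  # 96
print(cq_is_safe(95, [16, 16], [32, 32]))           # False
```

Because resizing an existing CQ can fail, the ordering in Jim's rewording matters: compute the new minimum and grow the CQ first, and only then enlarge or add Send/Receive Queues, so a failed resize never leaves queues that can outrun the CQ.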
7.5.2.4 describes the situation "If the issue is a bug in the Remote Peer's implementation, and not a malicious attack...". So tell me, how does the local peer know?
[<jim>] Hopefully you’re joking (??). Each of the attacks has a specific countermeasure identified, with appropriate normative statements. So the local peer can differentiate, and it doesn’t matter.
7.5.5 has a funny title: "RI an STag Shared on Multiple Streams". Here "RI" has a very different meaning: Remote Invalidate. Even when spelled out, the title is awkward.
[<jim>] Done.
8.1: it's not clear how to protect against replay attacks at the RDDP "session" level. (BTW, session is an undefined term.) Is replay protection purely a function of the transport? This doesn't sound right. By the way, this is only the second mention of "replay" in the draft.
[<jim>] There are four references to replay in the doc. Replay is a generic term, but in this draft it specifically means packet-sequence replay. Within IPSec, there are specific features defined to detect replays.
On session, I agree that it is confusing. I’ve swept the document to replace “session” with “Stream” where appropriate. In some cases (e.g. IPSec session keys), session is still appropriate, as it also is in “Stream and/or session level authentication”. I think that with the reduced scope of the term “session” we don’t have to define it.
I note that 8.1.3 mentions NFS and its provision of security, presumably referring to RPC's GSSAPI support. This is a good topic to discuss in the NFSv4 working group, but I'm not sure what needs to be mentioned here. What do the authors have in mind?
[<jim>] We went over this during today’s meeting, so I won’t go into it here.
I note that section 8.2 is entitled "Recommendations" and describes itself as being geared toward iSCSI requirements. But it has many MUSTs, and the section seems more relevant to an iSCSI-over-RDMA document, or perhaps to a specific RDDP/IPSec layering, than to a general RDMA one. Is a second document appropriate?
[<jim>] Section 8.2 has been completely rewritten to just cross-reference the IP Storage security draft.
I have not yet reviewed the appendices.
Again, good job!
Tom.