internet-drafts | 23 Jan 18:12 2015

I-D Action: draft-ietf-nfsv4-multi-domain-fs-reqs-01.txt


A New Internet-Draft is available from the on-line Internet-Drafts directories.
 This draft is a work item of the Network File System Version 4 Working Group of the IETF.

        Title           : Multiple NFSv4 Domain Namespace Deployment Guidelines
        Authors         : William A. (Andy) Adamson
                          Nicolas Williams
	Filename        : draft-ietf-nfsv4-multi-domain-fs-reqs-01.txt
	Pages           : 13
	Date            : 2015-01-23

Abstract:
   This document describes administrative constraints to the deployment
   of the NFSv4 protocols required for the construction of an NFSv4 file
   system namespace supporting the use of multiple NFSv4 domains and
   utilizing multi-domain capable file systems.  Also described are
   administrative constraints to name resolution and security services
   appropriate to such a system.  Such a namespace is a suitable way to
   enable a Federated File System supporting the use of multiple NFSv4
   domains.

The IETF datatracker status page for this draft is:
https://datatracker.ietf.org/doc/draft-ietf-nfsv4-multi-domain-fs-reqs/

There's also a htmlized version available at:
http://tools.ietf.org/html/draft-ietf-nfsv4-multi-domain-fs-reqs-01

A diff from the previous version is available at:
http://www.ietf.org/rfcdiff?url2=draft-ietf-nfsv4-multi-domain-fs-reqs-01


Chuck Lever | 22 Jan 19:11 2015

Marshaling and receiving RDMA_NOMSG calls

I’m working on adding RDMA_NOMSG support to the Linux client and server
implementations. After some internal discussion, I’ve found that I’m
not quite clear on what exactly is allowed when sending an RPC call
marshaled as an RDMA_NOMSG type message.

Let’s have a look at a symlink creation request over NFSv4. The
compound looks like this:

 { PUTFH, CREATE(NF4LNK) }

When the symlink pathname is large, the transport marshals this
compound as an RDMA_NOMSG type request because it is larger than the
inline threshold. (Note that NFSv2 SYMLINK also has content trailing
the symlink pathname, and thus may also be marshaled as an RDMA_NOMSG
type request).

In my prototype, the read list contains three chunk elements: one for
what would have been the RPC header, one for the pathname, and one for
trailing parts of the compound. The client expects the server to pull
all three elements via separate RDMA READs. (Not terribly efficient,
but this eliminates the need for the sender to allocate a large buffer
and copy the whole compound into it).

The rc_position field in each element is set to zero. As I understand
it, that makes the s/g elements part of the same chunk/RPC argument.
The read list therefore contains one chunk with three s/g elements.
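
To make that layout concrete, here is a minimal C sketch (not the Linux client's code) of the read list being described; the fields follow the generic read-segment XDR of RFC 5666 (position, handle, length, offset), and the handles, lengths, and addresses are invented placeholders:

#include <stdint.h>
#include <stdio.h>

struct read_segment {
    uint32_t position;  /* offset into the RPC call's XDR stream */
    uint32_t handle;    /* STag/R_key of a registered memory region */
    uint32_t length;    /* bytes the server should RDMA READ */
    uint64_t offset;    /* remote address of the registered region */
};

int main(void)
{
    struct read_segment read_list[] = {
        /* what would have been the inline RPC header and PUTFH */
        { .position = 0, .handle = 0x1001, .length =  200, .offset = 0x10000 },
        /* the (large) symlink pathname for CREATE(NF4LNK) */
        { .position = 0, .handle = 0x1002, .length = 4096, .offset = 0x20000 },
        /* the remainder of the compound after the pathname */
        { .position = 0, .handle = 0x1003, .length =   64, .offset = 0x30000 },
    };
    unsigned int i;

    /*
     * All three segments carry position 0, so on the reading above they
     * belong to a single chunk: the receiver pulls them in order and
     * concatenates them to reconstruct the entire RDMA_NOMSG call.
     */
    for (i = 0; i < sizeof(read_list) / sizeof(read_list[0]); i++)
        printf("segment %u: position=%u length=%u bytes\n", i,
               (unsigned)read_list[i].position, (unsigned)read_list[i].length);
    return 0;
}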

The Solaris server rejects this request. It appears to expect that
each s/g element should have an rc_position field that is the exact
offset of the s/g element in the XDR stream.

internet-drafts | 21 Jan 19:53 2015

I-D Action: draft-ietf-nfsv4-minorversion2-dot-x-30.txt


A New Internet-Draft is available from the on-line Internet-Drafts directories.
 This draft is a work item of the Network File System Version 4 Working Group of the IETF.

        Title           : NFSv4 Minor Version 2 Protocol External Data Representation Standard (XDR) Description
        Author          : Thomas Haynes
	Filename        : draft-ietf-nfsv4-minorversion2-dot-x-30.txt
	Pages           : 80
	Date            : 2015-01-21

Abstract:
   This Internet-Draft provides the XDR description for NFSv4 minor
   version two.

The IETF datatracker status page for this draft is:
https://datatracker.ietf.org/doc/draft-ietf-nfsv4-minorversion2-dot-x/

There's also a htmlized version available at:
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-dot-x-30

A diff from the previous version is available at:
http://www.ietf.org/rfcdiff?url2=draft-ietf-nfsv4-minorversion2-dot-x-30

Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at tools.ietf.org.

Internet-Drafts are also available by anonymous FTP at:
ftp://ftp.ietf.org/internet-drafts/


Tom Haynes | 21 Jan 19:47 2015

Re: Review of draft-ietf-nfsv4-flex-files-04



Hi Dave,

Thanks for the review. I’ve given some comments back, but I will be generating a new version of the document.

Tom

On Jan 17, 2015, at 5:05 AM, David Noveck <davenoveck <at> gmail.com> wrote:


Noteworthy Multi-section Clarity Issues:

Confusion due to use of the term "NFSv4"

The document never uses the term NFSv4.0.  As a result it is unclear when it says "NFSv4" whether NFSv4.0 or all NFSv4 protocols is meant.

Okay, I’ll work on this.


Loose coupling and the putative absence of a control protocol:

I found the treatment of "loose coupling" confusing.  "Loose coupling"  is described as a situation in which no control protocol is present.

On the other hand, the "control protocol" is defined as "a set of requirements", which I assume applies to all mapping types and both tight and loose coupling.

(BTW - see my response to your comments about Section 2 and take this with a grain of salt. I just wanted to preserve my responses.)

I concede you are technically correct.

However, given the requirements:

   control protocol:  is a set of requirements for the communication of
      information on layouts, stateids, file metadata, and file data
      between the metadata server and the storage devices (see
      [pNFSLayouts]).

the difference between a loosely and tightly coupled system is that
the loosely coupled one cannot meet the requirements for the
communication of layouts and stateids. Both file metadata and
file data can be exchanged.


  Any particular control protocol for file-based access would also meet the requirements specified in Section 2, in either their tight-coupling or loose-coupling variants.

Also, in section 2, it says:
With a loose coupling, the only control protocol might be a version of NFS.
Versions of NFS are not purpose-built control protocols but the document says they are (or might be) control protocols, and they presumably, as used by the MDS, meet the general requirements for control protocols and the specific requirements for the loose-coupling variant of the flexible files mapping type.

See above


Also, if you have a particular version of NFS (e.g. NFSv3) in mind as the control protocol to be used in loose coupling, this should be stated.

The choice of which NFS version is used is dependent on the union of the set of variants available on the metadata server and storage device.




Comments by Section:

Abstract:

Too much of this is devoted to explaining pNFS, which is not the focus of this document. I suggest that the Abstract only refer briefly to the fact that this document defines a pNFS mapping type. It seems to me that the abstract is trying to be a short introduction, which, in the nature of things, it cannot be.

This is pretty much a boilerplate abstract in use for both RFC 5663 and 5664.



While the current introduction mentions loose coupling as the only distinctive feature of the flex files layout ("to allow the use of storage devices which need not be tightly coupled to the metadata server"), I don't think this is quite accurate.  It seems to me that the mirroring stuff is important enough to rate a mention in the abstract. 

Agreed

Also, while you are using the term "tightly coupled" in line with your later definition, I don't think you can assume that in the abstract.  Perhaps you want to say something like "to allow the use of storage devices in a fashion such that they require only  a quite limited degree of interaction with the metadata server, using already existing protocols”)

Okay

I’ll take a stab at rewriting the abstract with your points in mind.


Section 1:

In the first paragraph:
  • Given that RFC3530bis has been approved for publication it may make sense to switch the v4.0 reference to refer to the newer document.

No clue as to why I have it this way.


  • I think you should add a clause allowing use of future minor versions.

A general failing of RFC 5661 in this regard. But yes.


Given your desire to standardize on "data storage device", you may (or may not) want to do something about:

There are different Layout Types for different storage systems and methods of arranging data on storage devices. This document defines the Flexible File Layout Type used with file-based data servers (emphases added).

The second paragraph also is afflicted by confusion/uncertainty regarding how to describe issues relating to coupling strength. I think it should be made clearer that:
  • The "requirements" embodied by a control protocol (since a control protocol is a set of requirements) apply to both tight and loose coupling and all pNFS mapping types.
  • The global stateid model (which should be explained a bit further) applies only to the files mapping type and the tight-coupling variant of this mapping type, has a more extensive set of requirements, and presumably a more complicated control protocol, with the latter not being specified in this document.

okay

Section 1.1:

Later (in section 1.2), you try to standardize on "data storage device" as opposed to "data server". It would be worthwhile adjusting the definitions of these terms to match that intention.

I note you define "object layout type" but never use the term. Would it be better to remove this?


Done



you define 'mirror' as follows:

is a copy of a file. While mirroring can be used for backing up a file, the copies can be distributed such that each remote site has a locally cached copy. Note that if one copy of the mirror is updated, then all copies must be updated.
I'm wondering about the use of the term "cached" here. I suspect that this is actually a mirrored copy and the word "cached" will confuse people.

Agreed, I’ll reword.



Also I'm afraid about putting this so early in the document. I'm worried that people's reaction might go:
  • I can have a local copy for each of my sites, allowing low-latency access :-)
  • Whenever I write to a layout, I (i.e. my client) might have to do a whole bunch of high-latency writes to remote sites before returning any layout. :-(

A perfectly valid reaction. I had it the first time I was introduced to client-side mirroring. :-)


In a more extended treatment of mirroring (somewhere other than the definition section) there would be room to explain:
  • that the MDS might only specify one or two mirrors to someone getting a write layout.
  • that in that case, the MDS might take on the job of propagating updates to the other N-1 or N-2 mirrors.
  • that the number-of-mirrors hint is going to have some role in letting clients say "don't bury me under a pile of mirrors"
Ack, I’ll expand

A related concern is how the occurrence of multiple mirrors relates to the issues regarding including "less optimal" mirrors discussed in section 4.2.  If one of the mirrors is local it would seem that any other would have to be less optimal, creating a problem if one wants all written data to go to at least two locations.

The optimalality (which is not a word, I know) is supposed to apply for the read-only workload.





Section 1.2:


I think the basic purpose of this section is to say that you are using the term "storage device" instead of "data server". I think you want to say that there is essentially no difference and you have to pick one term. If you try to explain the supposed differences between these two very similar concepts, you wind up in a swamp. Look out for alligators.

Or crocodiles!


Section 2:

The second paragraph is not really relevant to the topic of this section. It is part of a general introduction to pNFS, as it applies to this mapping type.


Yes and it can also be inferred from Section 1.1.

The reason I want to call it out is that I find it amazing how many people trip over this concept when introduced to it.



As to the first paragraph, I see the basic problem as connected to the multiple use of the phrase "control protocol". Also, I'm troubled by the "MUST be defined" in the last sentence. Am I wrong to suppose that the document is in fact laying down the law, not to implementers (which is what the RFC2119 terms are for) but to itself? As a way of checking whether I'm understanding the issues here, what would you say about the following as a replacement?

The coupling of the metadata server with the storage devices can be either tight or loose. In a tight coupling, there is a control protocol, a purpose-built protocol (not specified here) present to manage security, LAYOUTCOMMITs, etc. In the case of loose coupling, the control protocol's functionality is more limited, and a version of NFS might serve the purpose. Given the functionality available in such an arrangement, semantics with regard to managing security, state, and locking models may be different from those in the case of tight coupling and are described below.


Great, I like “purpose-built”!

I can work with this and folding it back into your earlier concerns.



Section 2.1

With regard to the first sentence, you need to state somewhere (section 2?) your general policy regarding the inheritance of semantics from the file mapping type. I presume that in the tight coupling case you basically inherit the semantics of the file mapping type.

With regard to the second sentence, you need to address the case in which commit is not needed, e.g. all the WRITEs were synchronous.

In the third sentence, I suggest deleting the "I.e.,”

OK


Section 2.2:

It isn't clear to me what is intended by the first paragraph. It seems to me that you need to choose between specifying the requirements for mds-ds interaction or specifying how they are to interact. As of now, the text is somewhere between those two. For example, saying "SETATTR" makes it sound like you are describing a wire protocol. I suspect you don't intend to, and if that's the case, you need to indicate what the real requirements are.

This has changed, but I’ll try and fold this back in to the new text.


Also, to get back to the issue of storage-device-vs.-data-server, generally storage devices don't have a "filesystem exports level".

The last sentence of the first paragraph (the one after the semicolon) is confusing. It says "hence it provides a level of security equivalent to NFSv3."  The (minimal) level of security this provides derives from the use of AUTH_SYS.  It has nothing to do with NFSv3.  Using NFSv4 or NFSv2 with AUTH_SYS would be exactly the same.

I see what you mean.



In the second paragraph you stray pretty far into essentially describing a particular implementation. Also, the MDS only needs to propagate to the DS the information needed to determine whether READs or WRITEs are valid. Also, I'm not sure what "go through the same authorization process" means. Do you mean that the client is authorized using the same criteria that would apply to authorizing READs and WRITEs issued through the MDS? Perhaps it would make sense to reference section 13.12 of RFC5661 here, either to incorporate its requirements or to say how those of the flexible files layout are different.


Again, I’ll keep this in mind as I rewrite.



Section 2.3:

All the other sub-sections of this section are organized around the distinction between tight and loose coupling, while in this sub-section this distinction is not mentioned at all. Instead this is organized by data access protocol, which I found confusing. I wound up with the following conclusions. Are these right?
  • That for NFSv3 only the loose coupling model is supported.

Yes

  • That the second paragraph describes the tight coupling model for NFSv4.0. I had originally thought it described the loose coupling model, but from section 5.1 I concluded that it described the tight coupling model and that the loose coupling model for v4.0 was also supported.

Yes


  • That the third and fourth paragraphs describe the loose and tight coupling models respectively for v4.1
Yes



There are a number of locking-related matters that need to be dealt with somewhere, if not necessarily here:
  • delegation recall/revocation.
  • lease renewal and expiration.
  • reboots of MDS, clients, storage devices.

Ack

Section 4.1:

At the end of the second paragraph, suggest replacing "using NFSv4" by "using the specified minor version of NFSv4".

Agreed


Suggest a paragraph after the current third paragraph to deal with optional features. For example,

In the case in which a minor version contains an OPTIONAL feature that the client might use to communicate with
the storage device, the client can check for the presence of support when the GETDEVICEINFO returns. If the
absence of support for the feature causes the device not to be acceptable, then this needs to be communicated to
the MDS by doing a LAYOUTRETURN.


ok

Section 4.2:

Section 5.1:

The seventh paragraph is confusing

For tight coupling, ffds_stateid provides the stateid to be used by the client to access the file. For loose coupling and a NFSv4 storage device, the client may use an anonymous stateid to perform I/O on the storage device as there is no use for the metadata server stateid (no control protocol). In such a scenario, the server MUST set the ffds_stateid to be zero.
The problem is with the phrase "may use an anonymous stateid" (emphasis added). The implication (or maybe it's an implicature) is that it might use some other stateid. Given that ffds_stateid MUST be zero, it's hard to imagine what other stateid the client might use.

ok


Is the following replacement correct? If so, I feel it is clearer.
The field ffds_stateid provides the stateid to be used by the client to access the file. For tight coupling this stateid is established using the control protocol. When loose coupling is in effect, ffds_stateid must be set by the MDS to zero. As a result, client access to the storage device uses the anonymous stateid for NFSv4 protocols.

yes


Section 5.2:

Suggest replacing the second sentence by the following:

Some reasons for this are accommodating clustered servers which share the same filehandle and allowing for multiple read-only copies of the file on the same storage device.

ok



Section 9.1.1:

I think you are missing a "<>" after ffie_errors.

Ah crap, that is also in the NFSv4.2 document!



Sections 9.2.1-9.2.2:

Wondering about use of nfstime4 here since:
  • It uses up 12 bytes of space.
  • Has a range of hundreds of billions of years which I hope is overkill.

That DS on Wolf 359 tends to timeout rather frequently.

  • Is not convenient for the kinds of calculations you'll be doing with this sort of data.
If any code has been written to use this, it isn't worth changing at this point.  However, if it hasn't, the following formats should be considered:
  • uint32 microseconds has reasonable resolution and covers durations up to about 50 days.

You mean milliseconds here, right?

  • uint64 nanoseconds has adequate resolution for the foreseeable future and covers durations up to hundreds of years.
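
(For reference, the arithmetic behind these two bullets and the milliseconds question above works out as the back-of-the-envelope C snippet below prints: a 32-bit count of milliseconds covers about 50 days, a 32-bit count of microseconds only about 72 minutes, and a 64-bit count of nanoseconds several hundred years.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double u32 = (double)UINT32_MAX;
    double u64 = (double)UINT64_MAX;

    printf("uint32 microseconds: about %.0f minutes\n", u32 / 1e6 / 60.0);
    printf("uint32 milliseconds: about %.0f days\n", u32 / 1e3 / 86400.0);
    printf("uint64 nanoseconds:  about %.0f years\n", u64 / 1e9 / 86400.0 / 365.25);
    return 0;
}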





Suggest replacing the last sentence by "The only parameter is optional so the client need not specify values for a parameter it does not care about".

You need to have more info about mirror hints somewhere, and this is a likely place.  Some issues to address:
  • Why a client might choose a particular ffmc_mirrors value
  • Why the server might choose to give a different mirror count than hinted by the client
  • What options a client has if a layout's  mirror count is not acceptable to him

ack


Section 14:

Not sure what you mean by the statement "at the least the server MAY revoke client layouts and/or device address mappings".  If that's what the server may do at the least, what might it do at the most?  For example, might it corrupt any file that the client has touched?  Or do you mean that the client MUST at least do these things?

OK, “least” and “MAY” do not go with  the “MUST” of 5661:

   Regardless of the
   reason, once a client's layout has been revoked, the pNFS server MUST
   prevent the client from sending I/O for the affected file from and to
   all data servers; in other words, it MUST fence the client from the
   affected file on the data servers.

I’ll reword.


Section 15:

I think you're going to have problems with the IESG with the section as written. I suggest rewriting it.

It was in my plan.


Thomas Haynes | 20 Jan 00:45 2015

[PATCH] Go from credentials to user/group

From: Tom Haynes <loghyr <at> primarydata.com>

Not sure why this was not sent out ...

Signed-off-by: Tom Haynes <loghyr <at> primarydata.com>
---
 flexfiles_middle_coupling.xml | 16 ++++++++-----
 flexfiles_middle_layout.xml   | 54 ++++++++-----------------------------------
 2 files changed, 20 insertions(+), 50 deletions(-)

diff --git a/flexfiles_middle_coupling.xml b/flexfiles_middle_coupling.xml
index 12fd0c1..89f5244 100644
--- a/flexfiles_middle_coupling.xml
+++ b/flexfiles_middle_coupling.xml
@@ -42,12 +42,16 @@
       With loosely coupled storage devices, the metadata server
       uses synthetic uids and gids for the data file, where the uid
       owner of the data file is allowed read/write access and the
-      gid owner is allowed read only access.  As part of the layout,
-      the client is provided with the rpc credentials to be used
-      (see ffds_auth in  <xref target='ff_layout4' />) to access the
-      data file.  Fencing off clients is achieved by using SETATTR
-      by the server to change the uid and/or gid owners of the data
-      file to implicitly revoke the outstanding rpc credentials.
+      gid owner is allowed read only access.  As part of the layout
+      (see ffds_user and ffds_group in  <xref target='ff_layout4' />),
+      the client is provided with the user and group to be used in
+      the RPC credentials needed to access the data file.  Fencing
+      off clients is achieved by using SETATTR by the server to
+      change the uid and/or gid owners of the data file to implicitly
+      revoke the outstanding rpc credentials.
+    </t>
+
+    <t>
       Note: it is recommended to implement common access control
       methods at the storage device filesystem exports level to
       allow only the metadata server root (super user) access to
diff --git a/flexfiles_middle_layout.xml b/flexfiles_middle_layout.xml
index b7d5b2c..511b6b2 100644
--- a/flexfiles_middle_layout.xml
+++ b/flexfiles_middle_layout.xml
@@ -70,7 +70,8 @@
 ///     uint32_t                ffds_efficiency;
 ///     stateid4                ffds_stateid;
 ///     nfs_fh4                 ffds_fh_vers&lt;&gt;;
-///     opaque_auth             ffds_auth;
+///     fattr4_owner            ffds_user;
+///     fattr4_owner_group      ffds_group;
 /// };
 ///
       </artwork>
@@ -190,52 +191,17 @@
     </t>

     <t>
-      For loosely coupled storage devices, ffds_auth provides the
-      RPC credentials to be used by the client to access the data
-      files.  For tightly coupled storage devices, the server SHOULD
-      use the AUTH_NONE flavor and a zero length opaque body to
-      minimize the returned structure length.  I.e., if
-      ffdv_tightly_coupled (see <xref target='ff_device_addr4' />)
-      is set, then the client MUST ignore ffds_auth in this case.
+      For loosely coupled storage devices, ffds_user and ffds_group
+      provide the synthetic user and group to be used in the RPC
+      credentials that the client presents to the storage device
+      to access the data files.  For tightly coupled storage devices,
+      the user and group on the storage device will be the same as
+      on the metadata server.  I.e., if ffdv_tightly_coupled (see
+      <xref target='ff_device_addr4' />) is set, then the client
+      MUST ignore both ffds_user and ffds_group.
     </t>

     <t>
-      Note that opaque_auth is defined in Section 8.2 of
-      <xref target='RFC5531' /> and is for reference:
-    </t>
-
-    <t>
-      &lt;CODE BEGINS&gt;
-    </t>
-
-    <figure>
-      <artwork>
-    enum auth_flavor {
-       AUTH_NONE       = 0,
-       AUTH_SYS        = 1,
-       AUTH_SHORT      = 2,
-       AUTH_DH         = 3,
-       RPCSEC_GSS      = 6
-       /* and more to be defined */
-    };
-      </artwork>
-    </figure>
-
-    <figure>
-      <artwork>
-    struct opaque_auth {
-        auth_flavor  flavor;
-        opaque       body&lt;400&gt;;
-    };
-      </artwork>
-    </figure>
-
-    <t>
-      &lt;CODE ENDS&gt;
-    </t>
-
-
-    <t>
       ffds_efficiency describes the metadata server's evaluation
       as to the effectiveness of each mirror. Note that this is per
       layout and not per device as the metric may change due to
--

-- 
1.9.3
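
To illustrate the fencing model the patch text above describes, here is a conceptual C sketch; it is not part of the patch or of any implementation. A local chown() stands in for the SETATTR the metadata server would send to the storage device, and the synthetic ids and backing path are invented:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Synthetic ids handed out in the layout (ffds_user/ffds_group). */
static const uid_t RW_SYNTHETIC_UID = 40001;   /* read/write owner */
static const gid_t RO_SYNTHETIC_GID = 40001;   /* read-only group  */

/* Fresh ids that no outstanding layout refers to. */
static const uid_t FENCED_UID = 40002;
static const gid_t FENCED_GID = 40002;

static int grant_layout(const char *data_file)
{
    /* Make the data file match the synthetic ids given in the layout. */
    return chown(data_file, RW_SYNTHETIC_UID, RO_SYNTHETIC_GID);
}

static int fence_clients(const char *data_file)
{
    /*
     * Changing the owners implicitly revokes every credential built from
     * the old synthetic uid/gid: later READs and WRITEs from fenced
     * clients fail the storage device's permission checks.
     */
    return chown(data_file, FENCED_UID, FENCED_GID);
}

int main(void)
{
    const char *data_file = "/export/ds0/fileA";   /* hypothetical path */

    if (grant_layout(data_file) != 0 || fence_clients(data_file) != 0)
        perror("chown");
    return 0;
}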


Thomas Haynes | 19 Jan 22:31 2015

[PATCH 0/4] Do not use a RPC cred for fencing users

From: Tom Haynes <loghyr <at> primarydata.com>

This patch set changes the RPC credential used for fencing into an
fattr4_owner and fattr4_owner_group.

Tom Haynes (4):
  Define the protocols to the storage device, not the metadata server
  Fix the RPC xref
  Expound more on the generation of synthetic ids
  Explain format of user and group strings


David Noveck | 17 Jan 14:05 2015

Review of draft-ietf-nfsv4-flex-files-04


Noteworthy Multi-section Clarity Issues:

Confusion due to use of the term "NFSv4"

The document never uses the term NFSv4.0.  As a result it is unclear when it says "NFSv4" whether NFSv4.0 or all NFSv4 protocols is meant.

Loose coupling and the putative absence of a control protocol:

I found the treatment of "loose coupling" confusing.  "Loose coupling"  is described as a situation in which no control protocol is present.

On the other hand, the "control protocol" is defined as "a set of requirements", which I assume applies to all mapping types and both tight and loose coupling.  Any particular control protocol for file-based access would also meet the requirements specified in Section 2, in either their tight-coupling or loose-coupling variants.

Also, in section 2, it says:
With a loose coupling, the only control protocol might be a version of NFS.
Versions of NFS are not purpose-built control protocols but the document says they are (or might be) control protocols, and they presumably, as used by the MDS, meet the general requirements for control protocols and the specific requirements for the loose-coupling variant of the flexible files mapping type. Also, if you have a particular version of NFS (e.g. NFSv3) in mind as the control protocol to be used in loose coupling, this should be stated.

Comments by Section:

Abstract:

Too much of this is devoted to explaining pNFS, which is not the focus of this document. I suggest that the Abstract only refer briefly to the fact that this document defines a pNFS mapping type. It seems to me that the abstract is trying to be a short introduction, which, in the nature of things, it cannot be.

While the current introduction mentions loose coupling as the only distinctive feature of the flex files layout ("to allow the use of storage devices which need not be tightly coupled to the metadata server"), I don't think this is quite accurate.  It seems to me that the mirroring stuff is important enough to rate a mention in the abstract.  Also, while you are using the term "tightly coupled" in line with your later definition, I don't think you can assume that in the abstract.  Perhaps you want to say something like "to allow the use of storage devices in a fashion such that they require only  a quite limited degree of interaction with the metadata server, using already existing protocols")

Section 1:

In the first paragraph:
  • Given that RFC3530bis has been approved for publication it may make sense to switch the v4.0 reference to refer to the newer document.
  • I think you should add a clause allowing use of future minor versions.
Given your desire to standardize on "data storage device", you may (or may not) want to do something about:

There are different Layout Types for different storage systems and methods of arranging data on storage devices. This document defines the Flexible File Layout Type used with file-based data servers (emphases added).

The second paragraph also is afflicted by confusion/uncertainty regarding how to describe issues relating to coupling strength. I think it should be made clearer that:
  • The "requirements" embodied by a control protocol (since a control protocol is a set of requirements) apply to both tight and loose coupling and all pNFS mapping types.
  • The global stateid model (which should be explained a bit further) applies only to the files mapping type and the tight-coupling variant of this mapping type, has a more extensive set of requirements, and presumably a more complicated control protocol, with the latter not being specified in this document.
Section 1.1:

Later (in section 1.2), you try to standardize on "data storage device" as opposed to "data server". It would be worthwhile adjusting the definitions of these terms to match that intention.

I note you define "object layout type" but never use the term. Would it be better to remove this?

you define 'mirror' as follows:

is a copy of a file. While mirroring can be used for backing up a file, the copies can be distributed such that each remote site has a locally cached copy. Note that if one copy of the mirror is updated, then all copies must be updated.
I'm wondering about the use of the term "cached" here. I suspect that this is actually a mirrored copy and the word "cached" will confuse people.

Also I'm afraid about putting this so early in the document. I'm worried that people's reaction might go:
  • I can have a local copy for each of my sites, allowing low-latency access :-)
  • Whenever I write to a layout, I (i.e. my client) might have to do a whole bunch of high-latency writes to remote sites before returning any layout. :-(
In a more extended treatment of mirroring (somewhere other than the definition section) there would be room to explain:
  • that the MDS might only specify one or two mirrors to someone getting a write layout.
  • that in that case, the MDS might take on the job of propagating updates to the other N-1 or N-2 mirrors.
  • that the number-of-mirrors hint is going to have some role in letting clients say "don't bury me under a pile of mirrors"
A related concern is how the occurrence of multiple mirrors relates to the issues regarding including "less optimal" mirrors discussed in section 4.2.  If one of the mirrors is local it would seem that any other would have to be less optimal, creating a problem if one wants all written data to go to at least two locations.

Section 1.2:

You say:

We defined a data server as a pNFS server, which implies that it can utilize the NFSv4.1 protocol to communicate with the client.

However in section 1.1 it contradicts that, saying:

Note that while the metadata server is strictly accessed over the NFSv4.1 protocol, depending on the Layout Type, the data server could be accessed via any protocol that meets the pNFS requirements.
I think the basic purpose of this section is to say that you are using the term "storage device" instead of "data server". I think you want to say that there is essentially no difference and you have to pick one term. If you try to explain the supposed differences between these two very similar concepts, you wind up in a swamp. Look out for alligators.

Section 2:

The second paragraph is not really relevant to the topic of this section. It is part of a general introduction to pNFS, as it applies to this mapping type.

As to the first paragraph, I see the basic problem as connected to the multiple use of the phrase "control protocol". Also, I'm troubled by the "MUST be defined" in the last sentence. Am I wrong to suppose that the document is in fact laying down the law, not to implementers (which is what the RFC2119 terms are for) but to itself? As a way of checking whether I'm understanding the issues here, what would you say about the following as a replacement?

The coupling of the metadata server with the storage devices can be either tight or loose. In a tight coupling, there is a control protocol, a purpose-built protocol (not specified here) present to manage security, LAYOUTCOMMITs, etc. In the case of loose coupling, the control protocol's functionality is more limited, and a version of NFS might serve the purpose. Given the functionality available in such an arrangement, semantics with regard to managing security, state, and locking models may be different from those in the case of tight coupling and are described below.

Section 2.1

With regard to the first sentence, you need to state somewhere (section 2?) your general policy regarding the inheritance of semantics from the file mapping type. I presume that in the tight coupling case you basically inherit the semantics of the file mapping type.

With regard to the second sentence, you need to address the case in which commit is not needed, e.g. all the WRITEs were synchronous.

In the third sentence, I suggest deleting the "I.e.,"

Section 2.2:

It isn't clear to me what is intended by the first paragraph. It seems to me that you need to choose between specifying the requirements for mds-ds interaction or specifying how they are to interact. As of now, the text is somewhere between those two. For example, saying "SETATTR" makes it sound like you are describing a wire protocol. I suspect you don't intend to, and if that's the case, you need to indicate what the real requirements are.

Also, to get back to the issue of storage-device-vs.-data-server, generally storage devices don't have a "filesystem exports level".

The last sentence of the first paragraph (the one after the semicolon) is confusing. It says "hence it provides a level of security equivalent to NFSv3."  The (minimal) level of security this provides derives from the use of AUTH_SYS.  It has nothing to do with NFSv3.  Using NFSv4 or NFSv2 with AUTH_SYS would be exactly the same.

In the second paragraph you stray pretty far into essentially describing a particular implementation. Also, the MDS only needs to propagate to the DS the information needed to determine whether READs or WRITEs are valid. Also, I'm not sure what "go through the same authorization process" means. Do you mean that the client is authorized using the same criteria that would apply to authorizing READs and WRITEs issued through the MDS? Perhaps it would make sense to reference section 13.12 of RFC5661 here, either to incorporate its requirements or to say how those of the flexible files layout are different.

Section 2.3:

All the other sub-sections of this section are organized around the distinction between tight and loose coupling, while in this sub-section this distinction is not mentioned at all. Instead this is organized by data access protocol, which I found confusing. I wound up with the following conclusions. Are these right?
  • That for NFSv3 only the loose coupling model is supported.
  • That the second paragraph describes the tight coupling model for NFSv4.0. I had originally thought it described the loose coupling model, but from section 5.1 I concluded that it described the tight coupling model and that the loose coupling model for v4.0 was also supported.
  • That the third and fourth paragraphs describe the loose and tight coupling models respectively for v4.1
Suggest replacing the first paragraph by:

Locking-related operations are always sent by the clients to the metadata server. These operations include:
  • operations relating to opens such as OPEN, OPEN_DOWNGRADE, and CLOSE.
  • operations relating to byte-range locks such as LOCK, LOCKU, and LOCKT
  • operations relating to delegation management, such as DELEGRETURN.
  • miscellaneous operations relating to stateid management such as TEST_STATEID and FREE_STATEID.
In the second paragraph, suggest replacing "NFSv4" by "NFSv4.0".

In the second paragraph, it says "in response to the state-changing operation" which isn't altogether consistent with the first paragraph lead-in. In particular it isn't clear:
  • How byte-range locks are dealt with. I presume mandatory byte-range locks are not supported but if so, the text should say that.
  • How changes in open state (e.g. open upgrades and downgrades) are dealt with if a GETLAYOUT does not occur at the right time to propagate the new stateid to the client.
With regard to the fourth paragraph, it says that "NFSv4.1 clustered storage devices ... use a back-end control protocol as described in [RFC5661]".  Unfortunately it isn't really described in RFC5661.  How about the following as a replacement for this paragraph:

NFSv4.1 clustered storage devices that do identify themselves with the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are dealt with using the tight coupling model. This includes implementation of a global stateid model as defined in [RFC5661]. In this case the MDS and storage devices co-operate using a back-end control protocol adequate to meet the requirements of the file mapping type defined in [RFC5661], which apply to the tight-coupling case of the flexible file mapping type as well.
There are a number of locking-related matters that need to be dealt with somewhere, if not necessarily here:
  • delegation recall/revocation.
  • lease renewal and expiration.
  • reboots of MDS, clients, storage devices.
Section 4.1:

At the end of the second paragraph, suggest replacing "using NFSv4" by "using the specified minor version of NFSv4".

Suggest a paragraph after the current third paragraph to deal with optional features. For example,

In the case in which a minor version contains an OPTIONAL feature that the client might use to communicate with
the storage device, the client can check for the presence of support when the GETDEVICEINFO returns. If the
absence of support for the feature causes the device not to be acceptable, then this needs to be communicated to
the MDS by doing a LAYOUTRETURN.

Section 4.2:

Suggest adding the following after all existing paragraphs.

In the case in which the two storage devices do not have the same server owner major ID, the two devices will be accessed over different sessions when access is over NFSv4.1 or a later minor version. This is true whether clientid trunking is present or not. Given the way in which state is managed using the flexible files layout, there is no need for the client to treat state for the separate sessions as shared based on clientid. This is true whether tight or loose coupling is in effect.
  • in the case of tight coupling all state management happens under the aegis of the clientid associated with the MDS.
  • in the case of loose coupling, there is no state management in that the clients simply use the stateids already established on the storage devices by the MDS.

Thus, the client may, in this case, accommodate NFSv4.1 multipathing without significant client-based support for clientid trunking. When there are multiple sessions, the only difference between a situation in which there are multiple clientids and one in which there is a single clientid is with regard to lease renewal.

Given the IESG's current attitude regarding RFC2119 terms (that the upper-case keywords need to be clearly called for), perhaps it makes sense to rewrite the following material from the third paragraph:

If some network addresses are less optimal paths to the data than others, then the MDS SHOULD NOT include those network addresses in ffda_netaddrs. If less optimal network addresses exist to provide failover, the RECOMMENDED method to offer the addresses is to provide them in a replacement device-ID-to-device-address mapping, or a replacement device ID

and instead write something like the following.

If some network addresses are less optimal paths to the data than others, then the MDS should not include those network addresses in ffda_netaddrs. If less optimal network addresses exist to provide failover, the appropriate method to offer the addresses is to provide them in a replacement device-ID-to-device-address mapping, or a replacement device ID

Section 5.1:

The seventh paragraph is confusing

For tight coupling, ffds_stateid provides the stateid to be used by the client to access the file. For loose coupling and a NFSv4 storage device, the client may use an anonymous stateid to perform I/O on the storage device as there is no use for the metadata server stateid (no control protocol). In such a scenario, the server MUST set the ffds_stateid to be zero.
The problem is with the phrase "may use an anonymous stateid" (emphasis added). The implication (or maybe it's an implicature) is that it might use some other stateid. Given that ffds_stateid MUST be zero, it's hard to imagine what other stateid the client might use.

Is the following replacement correct? If so, I feel it is clearer.
The field ffds_stateid provides the stateid to be used by the client to access the file. For tight coupling this stateid is established using the control protocol. When loose coupling is in effect, ffds_stateid must be set by the MDS to zero. As a result, client access to the storage device uses the anonymous stateid for NFSv4 protocols.
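
As a minimal sketch of that rule (not the Linux client's code), assuming a stateid4-like structure of a 32-bit seqid plus 12 opaque bytes, a client might pick the stateid for storage-device I/O like this; the helper names are invented for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct stateid4 {
    uint32_t seqid;
    uint8_t  other[12];
};

/* The anonymous (all-zero) stateid used for loosely coupled access. */
static const struct stateid4 anonymous_stateid;

static bool stateid_is_zero(const struct stateid4 *sid)
{
    static const struct stateid4 zero;

    return memcmp(sid, &zero, sizeof(zero)) == 0;
}

/*
 * Pick the stateid for READ/WRITE to the storage device: the layout's
 * ffds_stateid when the MDS filled it in (tight coupling), otherwise
 * the anonymous stateid (loose coupling, ffds_stateid set to zero).
 */
static const struct stateid4 *
io_stateid(const struct stateid4 *ffds_stateid, bool tightly_coupled)
{
    if (tightly_coupled && !stateid_is_zero(ffds_stateid))
        return ffds_stateid;
    return &anonymous_stateid;
}

int main(void)
{
    struct stateid4 from_layout = { .seqid = 1, .other = { 0xab } };

    printf("tight: seqid=%u\n", (unsigned)io_stateid(&from_layout, true)->seqid);
    printf("loose: seqid=%u\n", (unsigned)io_stateid(&anonymous_stateid, false)->seqid);
    return 0;
}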

Section 5.2:

Suggest replacing the second sentence by the following:

Some reasons for this are accommodating clustered servers which share the same filehandle and allowing for multiple read-only copies of the file on the same storage device.

Section 5.3:

You would need to say something about the case of feature mismatch, as opposed to version mismatch, to match my proposed changes for section 4.1.

Not sure if this fits in this section but there are a number of other sorts of mismatch that you might want to address:
  • mismatch as to an acceptable value for ffdv_tightly_coupled (or individualized coupling characteristics flags).
  • possible rejection due to mirror count.
Section 8.2:

I think you need a final paragraph to deal with cases in which the MDS is unable to determine that the appropriate updates have been done. For example, revocation of layouts or client reboot.

Section 9:

Suggest revising the first paragraph to read as follows:

When the LAYOUT4_RET_REC_FILE case of the LAYOUT_RETURN operation is being used, the lrf_body field of the layoutreturn_file4 is used to convey layout-type specific information to the server. Some relevant definitions from [RFC5661] are as follows.

You need to give some guidance regarding use (or non-use) of the FSID and ALL cases of LAYOUT_RETURN.

Section 9.1.1:

I think you are missing a "<>" after ffie_errors.

Sections 9.2.1-9.2.2:

Wondering about use of nfstime4 here since:
  • It uses up 12 bytes of space.
  • Has a range of hundreds of billions of years which I hope is overkill.
  • Is not convenient for the kinds of calculations you'll be doing with this sort of data.
If any code has been written to use this, it isn't worth changing at this point.  However, if it hasn't, the following formats should be considered:
  • uint32 microseconds has reasonable resolution and covers durations up to about 50 days.
  • uint64 nanoseconds has adequate resolution for the foreseeable future and covers durations up to hundreds of years.
Section 9.2.2:

In the first sentence, suggest replacing "differentiates" by "specifies".

In the second sentence suggest replacing "respectively for both read and write operations" by "for read and write operations respectively".

Section 9.2.3:

In the third sentence of the penultimate paragraph, suggest replacing "it is infeasable" by "it may be infeasible"

In the last sentence of the penultimate paragraph, suggest deleting the word "contiguous".

In the last sentence of the final paragraph, suggest replacing "between" by "among."

Section 12.1:

Suggest replacing the last sentence by "The only parameter is optional so the client need not specify values for a parameter it does not care about".

You need to have more info about mirror hints somewhere, and this is a likely place.  Some issues to address:
  • Why a client might choose a particular ffmc_mirrors value
  • Why the server might choose to give a different mirror count than hinted by the client
  • What options a client has if a layout's  mirror count is not acceptable to him

Section 14:

Not sure what you mean by the statement "at the least the server MAY revoke client layouts and/or device address mappings".  If that's what the server may do at the least, what might it do at the most?  For example, might it corrupt any file that the client has touched?  Or do you mean that the client MUST at least do these things?

Section 15:

I think you're going to have problems with the IESG with the section as written. I suggest rewriting it.

Some suggestions to consider for a replacement are listed below.
  • Start with the tight coupling model. Don't mention Kerberos specifically, but say this mapping type has all the security properties of NFSv4 and the pNFS file mapping type.
  • With regard to the loose coupling variant:
    • Stress that it is intended for, and SHOULD only be used in, appropriate environments, i.e. those characterized by physically isolated networks or VPNs with comparable security characteristics.
    • Mention that this should be compared to the block mapping type, and that it has equivalent security against malicious clients and better protection against data corruption due to client mistakes.
    • If you are not going to implement or even specify something in an area, either don't mention it or simply say it is not part of this specification. Thinking out loud about why you are not doing something is not a good idea in something that is to be an RFC candidate. In particular, if you are doing revocation by changing ownership in order to accommodate NFSv3, the chances of people implementing NFSv3 servers with GSSRPCV3 support is near zero. It is better to be honest and say that this is simply a temporary expedient to support existing data servers and that kerberization of this revocation model is not going to happen.
Section 16:

You should also mention the stuff associated with CB_RECALL_ANY.

Thomas Haynes | 15 Jan 19:09 2015

[PATCH] Explain opaque_auth

From: Tom Haynes <loghyr <at> primarydata.com>

As per the email discussion, here is the reference for opaque_auth.

Signed-off-by: Tom Haynes <loghyr <at> primarydata.com>
---
 flexfiles_back_references.xml | 14 ++++++++++++++
 flexfiles_middle_layout.xml   | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/flexfiles_back_references.xml b/flexfiles_back_references.xml
index 1cb0a57..19992c6 100644
--- a/flexfiles_back_references.xml
+++ b/flexfiles_back_references.xml
@@ -88,6 +88,20 @@
       target="ftp://ftp.isi.edu/in-notes/rfc3530.txt"/>
   </reference>

+    <reference anchor='RFC5531'>
+      <front>
+  <title abbrev='Remote Procedure Call Protocol Version 2'>RPC:
+  Remote Procedure Call Protocol Specification Version 2</title>
+  <author initials='R.' surname='Thurlow' fullname='R. Thurlow'>
+    <organization>Sun Microsystems, Inc.</organization>
+  </author>
+  <date year='2009' month='May' />
+      </front>
+
+      <seriesInfo name='RFC' value='5531' />
+      <format type='TXT' target='ftp://ftp.isi.edu/in-notes/rfc5531.txt' />
+    </reference>
+
   <reference anchor='RFC5661'>
     <front>
       <title>Network File System (NFS) Version 4 Minor Version 1 Protocol</title>
diff --git a/flexfiles_middle_layout.xml b/flexfiles_middle_layout.xml
index 590f180..b7d5b2c 100644
--- a/flexfiles_middle_layout.xml
+++ b/flexfiles_middle_layout.xml
@@ -200,6 +200,42 @@
     </t>

     <t>
+      Note that opaque_auth is defined in Section 8.2 of
+      <xref target='RFC5531' /> and is for reference:
+    </t>
+
+    <t>
+      &lt;CODE BEGINS&gt;
+    </t>
+
+    <figure>
+      <artwork>
+    enum auth_flavor {
+       AUTH_NONE       = 0,
+       AUTH_SYS        = 1,
+       AUTH_SHORT      = 2,
+       AUTH_DH         = 3,
+       RPCSEC_GSS      = 6
+       /* and more to be defined */
+    };
+      </artwork>
+    </figure>
+
+    <figure>
+      <artwork>
+    struct opaque_auth {
+        auth_flavor  flavor;
+        opaque       body&lt;400&gt;;
+    };
+      </artwork>
+    </figure>
+
+    <t>
+      &lt;CODE ENDS&gt;
+    </t>
+
+
+    <t>
       ffds_efficiency describes the metadata server's evaluation
       as to the effectiveness of each mirror. Note that this is per
       layout and not per device as the metric may change due to
--

-- 
1.9.3


David Noveck | 7 Jan 10:58 2015

Fwd: RPC/RDMA read chunk round-up when inline content follows


---------- Forwarded message ----------
From: David Noveck <davenoveck <at> gmail.com>
Date: Tue, Jan 6, 2015 at 2:56 PM
Subject: Re: [nfsv4] RPC/RDMA read chunk round-up when inline content follows
To: "J. Bruce Fields" <bfields <at> fieldses.org>


> So if I understand correctly: fixing the servers will make them fail all
> writes from current clients,

If it hurts when you do that, don't do that.  It seems you can make servers
accept old and new client behavior, even if the old behavior is a BAD IDEA.

One way to map this into RFC2119 language is to say the servers SHOULD reject
older-form requests and MUST accept newer-form requests.


> and fixing the clients will make the writes fail against all current servers.

Not sure how to state this in RFC2119 language, but I'd argue that one should not fix the
clients until servers are fixed to accept newer-form requests.


> Isn't it a little late for that kind of change?  

I think it is.

> Or are there really so
> few NFS/RDMA users that we can afford to completely break backwards
> compatibility?

I guess it depends on whether you are one of those users.

The approach described in draft-ietf-nfsv4-versioning is to make server support for old-form
and new-form requests into separate features that the client can test for, allowing clients
to be fixed to adapt to server behavior in parallel with making servers which support both 
behaviors.

Not only is that quite heavyweight for this problem, but this draft is a ways away from 
IESG approval and publishing.


On Tue, Jan 6, 2015 at 1:58 PM, J. Bruce Fields <bfields <at> fieldses.org> wrote:
On Mon, Jan 05, 2015 at 06:50:44PM +0000, Tom Talpey wrote:
> I hope this additional reply doesn't make the thread too convoluted.
>
> > -----Original Message-----
> > From: nfsv4 [mailto:nfsv4-bounces <at> ietf.org] On Behalf Of Chuck Lever
> > Sent: Thursday, January 1, 2015 1:51 PM
> > To: David Noveck
> > Cc: Chunli Zhang; nfsv4 list (nfsv4 <at> ietf.org); Karen; Dai Ngo
> > Subject: Re: [nfsv4] RPC/RDMA read chunk round-up when inline content
> > follows
> >
> > Hi Dave-
> >
> > Sorry for the length. There's a lot to tease apart.
> >
> >
> > On Jan 1, 2015, at 7:20 AM, David Noveck <davenoveck <at> gmail.com> wrote:
> >
> > > > It turns out that Linux is the only NFS client that supports RDMA
> > > > and adds an operation (GETATTR) after WRITE in an NFSv4 compound.
> > >
> > > Presumably it sends the GETATTR as a separate chunk.  I suppose that
> > > could be a performance issue.
> >
> > Or efficiency: extra work has to be done to register that 16 byte piece of the
> > compound RPC and send it in a separate RDMA READ. That's extra transport
> > bytes on the wire, and an extra round trip (albeit a very fast one).
>
> It's my opinion this is a bug, in both the client and the server here.
>
> The intention of an RDMA Chunk is that it encodes all or part of an RPC
> argument, RPC result, or an entire RPC message. The fact that the client
> added a compounded GETATTR to the write data's Read Chunk is wrong.
> The fact that the server parsed the additional data after the write payload
> as an RPC request is also wrong. Keeping these segments separate is critical
> to the RFC5667 specification of the NFS binding to RPC/RDMA.

So if I understand correctly: fixing the servers will make them fail all
writes from current clients, and fixing the clients will make their
writes fail against all current servers.

Isn't it a little late for that kind of change?  Or are there really so
few NFS/RDMA users that we can afford to completely break backwards
compatibility?

--b.

>
> That said, the text in RFC5667 may be unclear. I'd suggest that it needs to
> be revisited anyway, to reflect implementation experience as well as the
> new behaviors of NFSv4.1, v4.2 etc. The document is five years old, and it
> reflects work done prior to that. In particular, while it was published at the
> same time as RFC5661 (NFSv4.1), it actually doesn't make any specific
> requirements for that minor version beyond mentioning the callback
> channel. That, plus pNFS, layouts, CREATE, OPENATTR, ACLs, etc.
>
> >
> > > > Linux sends the part of the compound before the opaque file data via
> > > > RDMA SEND. The opaque file data is put into the read list. Then the
> > > > GETATTR operation is put in the read list after the file data.
> > >
> > > > Existing NFS/RDMA servers are able to receive and process this
> > > > request correctly.
> > >
> > > The interesting question is whether they would work if one switched
> > > from the chunks-only-at-the-end approach to the alternative
> > > chunks-only-for-WRITE approach.
> >
> > The current Linux NFS server and the Solaris 11 update 2 NFS server do not
> > handle a request with additional inline content. There is a minor exception,
> > but let's put that aside for the moment.
> >
> > Recall that the Linux NFS client sends { PUTFH, WRITE, GETATTR } as its
> > NFSv4 WRITE compound. I've constructed a Linux client that sends the
> > GETATTR as additional inline content rather than at the end of the read list.
> >
> > The Linux server accepts the PUTFH and WRITE, but then returns 10044
> > (OP_ILLEGAL) for the third op, because it finds nonsense when it looks in the
> > XDR stream for the arguments of the third op. (I do have a fix for this).
> >
> > I can make the converse point: sending an RPC request where only a middle
> > argument is placed in a chunk list is clearly allowed by RFC 5666, and is not
> > forbidden by RFC 5667. Having "position" fields for each chunk in the read list
> > means a server MUST be prepared to accept inline content following a read
> > list
>
> I'm not sure I agree with "MUST", but I do agree that there is sufficient
> (redundant) information in the encoding of the request for the server to
> perform such an action.
>
> >
> > Therefore current NFS/RDMA server implementations are broken. If we
> > agree that sending the final GETATTR in the read list is allowed, then servers
> > should accept that in addition to NFSv4 compounds where the final GETATTR
> > is sent as additional inline content.
> >
> > > > First question: does the Linux client comply with RFCs 5666 and 5667
> > > > when it sends an NFSv4 WRITE compound in this way, or should it be
> > > > changed?
> > >
> > > It clearly complies with RFC5666 which grants the sender a lot of
> > > freedom for the sender as to how it chooses to sends individual data
> > elements.
>
> While freedom is the intention of RFC5666, it is (was) the intention of RFC5667
> to narrowly define those freedoms for the NFS protocol family. There is a
> requirement in RFC5666 that each upper layer provide such a binding, with
> the appropriate rules to ensure interoperability.
>
> > >
> > > > The intention of RFC 5667 appears to be that the GETATTR belongs
> > > > inline, following the front of the compound.
> > >
> > > I think the basic idea is that RDMA is for large data elements, such
> > > as data to be written, but I don't think it comes out and forbids you
> > > from sending other things.
> >
> > This text from section 4 seems to disallow the use of a read list for anything
> > but opaque file data or symlink pathnames:
> >
> > "Similarly, a single RDMA Read list entry MAY be posted by the client  to
> > supply the opaque file data for a WRITE request or the pathname  for a
> > SYMLINK request.  The server MUST ignore any Read list for  other NFS
> > procedures, as well as additional Read list entries beyond  the first in the list."
>
> Note however that RFC5667 section 4 is explicitly about NFS versions 2 and 3,
> and the "MUST ignore" statement is purely about simplifying the nfsdirect
> layering to align with the (simple) requirements of NFS v2 and v3, transfer-wise.
>
> Section 5 in turn is silent on the issue of SYMLINK, which of course is implemented
> in NFSv4.x as the CREATE procedure. That's clearly an omission.
>
> I agree with many of the points you make below, but in the absence of clear
> normative statements in RFC5667, I think they simply reinforce the need
> for updating the nfsdirect specification.
>
> Tom.
>
>
>
> >
> > "The server MUST ignore" is a clear statement that placing other arguments
> > in a read list will prevent interoperation, at least for
> > NFSv2 and NFSv3 Direct.
> >
> > The "MAY" refers to an alternate mechanism of sending large RPC
> > requests: RDMA_NOMSG, where the entire RPC message is encapsulated in
> > a read list at position 0.
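
To illustrate the distinction being drawn here: RDMA_MSG and RDMA_NOMSG are two of the message types defined by RFC 5666. The enum values below are taken from that RFC; the threshold test is only a sketch of the sender-side decision, and the helper name and its parameters are assumptions rather than anything from an existing implementation.

    #include <stddef.h>

    /* RPC-over-RDMA message types (RFC 5666). */
    enum rpcrdma_proc {
            RDMA_MSG   = 0,   /* RPC header and inline arguments follow */
            RDMA_NOMSG = 1,   /* entire RPC message is in the chunk lists */
            RDMA_MSGP  = 2,
            RDMA_DONE  = 3,
            RDMA_ERROR = 4
    };

    /* Sketch: a sender whose RPC call header plus inline arguments will not
     * fit within the receiver's posted buffer size sends the call as
     * RDMA_NOMSG, conveying the whole RPC message as a read chunk at
     * position 0. */
    enum rpcrdma_proc
    choose_proc(size_t inline_bytes, size_t inline_threshold)
    {
            return inline_bytes <= inline_threshold ? RDMA_MSG : RDMA_NOMSG;
    }
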
> >
> > > The only basis on which you might base a case that this violates
> > > RFC5667 would be based on the following text from section 2:
> > >
> > > Large chunks of data,
> > > such as the file data of an NFS WRITE request, MAY be referenced by an
> > > RDMA Read list and be moved efficiently and directly placed by an RDMA
> > > Read operation initiated by the server.
> > >
> > > One could argue that this somehow implies you "MAY NOT" transfer other
> > > sorts of request data using RDMA read chunks.  I don't read it that way.
> >
> > > > Only the file data
> > > > payload belongs in the read list, if NFSv4 Direct is to be
> > > > consistent with NFSv2 and NFSv3.
> > >
> > > I don't think NFSv4 can be consistent with previous NFS protocols
> > > because it is different in having COMPOUND.
> >
> > I don't agree that NFSv4 cannot be consistent with legacy NFS. Perhaps it is
> > not stated clearly in RFC 5667 section 5 how to make NFSv4 equivalent to
> > NFSv2/3 Direct, but observe that:
> >
> > 1. If you consider only WRITE operations, sending only the opaque file
> >    data payload is allowed in all three cases.
> >
> > 2. If you consider NFSv2 SYMLINK, there is an argument (sattr)
> >    following the link pathname argument. RFC 5667 says only the link
> >    pathname may be sent in a read list. This is a clear requirement
> >    that the middle argument is conveyed via a read list, and the
> >    remaining arguments inline. The case where both the pathname and
> >    the sattr argument are sent in a read list is explicitly not
> >    allowed.
> >
> > The legacy-consistent way of sending an NFSv4 WRITE compound would be
> > to put only the opaque file data in a read list, no matter what else appears in
> > the NFSv4 COMPOUND request with the WRITE operation. (Small WRITEs of
> > course are always sent inline).
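
A sketch of that legacy-consistent layout for the { PUTFH, WRITE, GETATTR } compound discussed in this thread appears below. Every type and helper named here is a hypothetical placeholder rather than an actual client routine; the only point being illustrated is where the read chunk is registered relative to the inline encoding.

    /* Sketch only: all types and helpers are hypothetical placeholders. */
    struct xdr_stream;
    struct read_list;

    extern unsigned int xdr_stream_offset(struct xdr_stream *xdr);
    extern void encode_putfh(struct xdr_stream *xdr, const void *fh);
    extern void encode_write_header(struct xdr_stream *xdr,
            unsigned long long file_offset, unsigned int count);
    extern void encode_getattr(struct xdr_stream *xdr,
            const unsigned int *bitmap);
    extern void add_read_chunk(struct read_list *rl, unsigned int position,
            const void *data, unsigned int len);

    /* Marshal { PUTFH, WRITE, GETATTR } so that only the opaque WRITE
     * payload moves via the read list; the operations before it and the
     * trailing GETATTR stay inline. */
    void
    marshal_write_compound(struct xdr_stream *xdr, struct read_list *rl,
            const void *fh, unsigned long long file_offset,
            const void *data, unsigned int len, const unsigned int *bitmap)
    {
            encode_putfh(xdr, fh);
            encode_write_header(xdr, file_offset, len);

            /* The opaque data logically belongs at this XDR position; hand
             * it to the transport as a read chunk rather than copying it
             * into the inline buffer. */
            add_read_chunk(rl, xdr_stream_offset(xdr), data, len);

            /* Inline encoding simply continues; the receiver decodes the
             * GETATTR at the next XDR position after the rounded-up chunk. */
            encode_getattr(xdr, bitmap);
    }
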
> >
> > > Your choice is either to be consistent with NFSv3:
> > >   * In having chunks only at the end of the request.
> > >   * In having chunks only for WRITE data.
> >
> > RFC 5667 section 5 has two sentences discussing the use of read lists with
> > NFSv4 COMPOUND requests:
> >
> > "The situation is similar for RDMA Read lists sent by the client and  applies to
> > the NFSv4.0 WRITE and SYMLINK procedures as for v3.
> >  Additionally, inline segments too large to fit in posted buffers MAY  be
> > transferred in special "RDMA_NOMSG" messages."
> >
> > Although it does suggest strongly that NFSv4 WRITE should be consistent
> > with NFSv3 WRITE, at first glance it doesn't help us decide which of your two
> > bullets is the correct interpretation.
> >
> > "The situation is similar" I believe references earlier text in section 5, not to
> > the text in section 4 that discusses WRITE/SYMLINK. Earlier, section 5 goes as
> > far as to say:
> >
> > "The Write list MUST be considered only for the COMPOUND procedure.
> >  This procedure returns results from a sequence of operations.  Only  the
> > opaque file data from an NFS READ operation and the pathname from  a
> > READLINK operation MUST utilize entries from the Write list."
> >
> > The "Only . . . MUST" construction in the third sentence is slippery.
> >
> > It can be read as allowing other operations and arguments to use a write list,
> > and as requiring READ and READLINK to use a write list in all cases, including
> > short payloads. That reading, while literal, seems contradictory with the rest
> > of the document.
> >
> > Or it can be read as allowing a write list only for READ and READLINK
> > operations in NFSv4 compounds. That reading I believe is consistent with the
> > rest of the document and earlier versions of the NFS Direct protocol, and
> > would also apply to WRITE in the read list case.
> >
> > Also worth mentioning is that NFSv4 does not have a separate SYMLINK
> > operation. Thus the explicit mention of an NFSv4 SYMLINK operation in
> > section 5 is incorrect. It could be replaced with CREATE(NF4LNK). Or using a
> > read list during symlink creation could simply be abandoned for NFSv4, in
> > favor of sending large CREATE requests via RDMA_NOMSG.
> >
> > In sum, a very particular reading of section 5 suggests that only WRITE
> > payloads in NFSv4 COMPOUND requests can be sent in a read list, but there
> > seems to be some wiggle room. It would help me, as an implementer, if RFC
> > 5667 section 5 could be clarified.
> >
> > > > Now suppose the Linux client is changed to place the GETATTR
> > > > operation inline, following the front of the compound.
> > >
> > > And assuming the existing servers are OK with that?
> >
> > > > When the opaque file data requires round-up because its length is
> > > > not divisible by four, should there be a pad?
> > >
> > > No.
> > >
> > > > If so, where does it belong?
> > >
> > > > To put it another way: Since inline data does not have a "position"
> > > > field, should the receiver assume that inline content following the
> > > > read list begins right at the end of the rounded-up chunk?
> > >
> > > No.
> > >
> > > > or should
> > > > the receiver assume the inline content begins at the next XDR
> > > > position after the end of the chunk?
> > >
> > > Yes.
> > >
> > > > If the former, that would require inserting a zero pad before the
> > > > inline content. But that pad would cause the following inline
> > > > content to be sent unaligned, since a zero pad is always shorter
> > > > than 4 bytes long.
> > >
> > > > By implication, then, the receiver MUST conclude that round-up is
> > > > present in the case when inline data remains to be decoded (ie, the
> > > > following inline content always begins at the next XDR position).
> > > > The sender MUST NOT send a pad inline in this case. Is this correct?
> > >
> > > Yes.
> >
> > Thanks for confirming my interpretation, and for helping me to sharpen my
> > understanding of these RFCs.
> >
> > > > I've read RFC 5666 section 3.7. Paragraph 4 seems most relevant, but
> > > > it doesn't discuss the case where there is additional inline content
> > > > following a chunk list.
> > >
> > > I think the following is relevant:
> > >
> > > Because this position will not match (else roundup would not have
> > > occurred), the receiver decoding will fall back to inspecting the
> > > remaining inline portion.
> > >
> > > This may not be clear enough.  Maybe a clarifying editorial errata
> > > would be justified.
> >
> > Let me know if you'd like me to file one.
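
The round-up interpretation confirmed above boils down to the following arithmetic. This is only a sketch; the helper names are invented for illustration and are not taken from any implementation.

    #include <stdint.h>

    /* Opaque XDR data always occupies a multiple of four bytes in the
     * logical XDR stream, even when the payload length does not. */
    uint32_t
    xdr_padded_length(uint32_t len)
    {
            return (len + 3u) & ~(uint32_t)3;
    }

    /* Inline content that follows a read chunk is decoded at the next XDR
     * position after the rounded-up chunk.  No pad bytes are placed in the
     * inline buffer; the round-up exists only in the logical XDR stream. */
    uint32_t
    next_inline_position(uint32_t chunk_position, uint32_t payload_len)
    {
            return chunk_position + xdr_padded_length(payload_len);
    }

For example, an 8193-byte payload rounds up to 8196 bytes, so the receiver resumes inline decoding 8196 bytes past the chunk's XDR position even though no pad bytes appear on the wire.
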
> >
> > > On Tue, Dec 30, 2014 at 12:53 PM, Chuck Lever <chuck.lever <at> oracle.com>
> > wrote:
> > > Hi-
> > >
> > > It turns out that Linux is the only NFS client that supports RDMA and
> > > adds an operation (GETATTR) after WRITE in an NFSv4 compound.
> > >
> > > Linux sends the part of the compound before the opaque file data via
> > > RDMA SEND. The opaque file data is put into the read list. Then the
> > > GETATTR operation is put in the read list after the file data.
> > >
> > > Existing NFS/RDMA servers are able to receive and process this request
> > > correctly.
> > >
> > > First question: does the Linux client comply with RFCs 5666 and 5667
> > > when it sends an NFSv4 WRITE compound in this way, or should it be
> > > changed?
> > >
> > > The intention of RFC 5667 appears to be that the GETATTR belongs
> > > inline, following the front of the compound. Only the file data
> > > payload belongs in the read list, if NFSv4 Direct is to be consistent
> > > with NFSv2 and NFSv3.
> > >
> > > Now suppose the Linux client is changed to place the GETATTR operation
> > > inline, following the front of the compound.
> > >
> > > When the opaque file data requires round-up because its length is not
> > > divisible by four, should there be a pad? If so, where does it belong?
> > >
> > > To put it another way: Since inline data does not have a "position"
> > > field, should the receiver assume that inline content following the
> > > read list begins right at the end of the rounded-up chunk? or should
> > > the receiver assume the inline content begins at the next XDR position
> > > after the end of the chunk?
> > >
> > > If the former, that would require inserting a zero pad before the
> > > inline content. But that pad would cause the following inline content
> > > to be sent unaligned, since a zero pad is always shorter than 4 bytes
> > > long.
> > >
> > > By implication, then, the receiver MUST conclude that round-up is
> > > present in the case when inline data remains to be decoded (ie, the
> > > following inline content always begins at the next XDR position). The
> > > sender MUST NOT send a pad inline in this case. Is this correct?
> > >
> > > I've read RFC 5666 section 3.7. Paragraph 4 seems most relevant, but
> > > it doesn't discuss the case where there is additional inline content
> > > following a chunk list.
> > >
> > > Thanks for reading!
> > >
> > > --
> > > Chuck Lever
> > >
> >
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
> >
> >
> >
>


_______________________________________________
nfsv4 mailing list
nfsv4 <at> ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4
David Noveck | 7 Jan 10:50 2015

Re: my (partial) review of draft-ietf-nfsv4-multi-domain-fs-reqs-00

> How about this for a document summary:



This document describes administrative constraints to the deployment of the NFSv4 protocols required for the construction of an NFSv4 file system name space supporting the use of multiple NFSv4 domains. Also described are administrative constraints to name resolution and security services appropriate to such a system. Such a name space is a suitable way to enable a Federated File System supporting the use of multiple NFSv4 domains.


It looks fine to me but I wonder if you might want to find some way to mention multi-domain-capable fs's.  As you indicate, that's a critical element of what
you're telling people by writing this document.

> I have not come up with an appropriate title change - but I do agree “requirements” should be removed from the title.

That leaves "Multiple NFSv4 Domain File System".  As a way of avoiding the word "requirements", how about "Essential Elements for Multi-domain NFSv4 Namespaces"?

On Tue, Jan 6, 2015 at 8:33 PM, David Noveck <davenoveck <at> gmail.com> wrote:
But it does use the key words ‘MUST’, ‘MUST NOT’, and ‘REQUIRED’ that describe what must be in place in order to deploy a multiple NFSv4 domain file system namespace.

Actually, it doesn't use those key words.

On Tue, Jan 6, 2015 at 1:21 PM, Adamson, Andy <William.Adamson <at> netapp.com> wrote:

> On Dec 30, 2014, at 8:49 PM, David Noveck <davenoveck <at> gmail.com> wrote:

Hi Dave

Thanks for your review.

>
> I have gone through this document and think that it is good work that needs to be published, but there are  a few matters that the working group needs to discuss before a more extensive review is useful. So this review is limited to the first sixteen lines of the draft (intended-status, title, abstract).  Given that I've found so many things that I don't like in such a short space, my claim that I like this document might seem kind of strange.  However, my big issue is that most of this preliminary material doesn't really match the document that we have before us.  One or the other has to change and I would much prefer that the preliminary material be changed to match the document body rather than the reverse.
>
> BTW, I have no issues with lines 1-2,  4-6,  and 9-11, so actually I have only eight lines to review :-)
>
> So let's start at line 3, "intended status":
>
> I don't see how this document can be standards-track.  It does not define a new feature and it is not marked as updating RFC5661 or RFC3530bis.

True. But it does use the key words ‘MUST’, ‘MUST NOT’, and ‘REQUIRED’ that describe what must be in place in order to deploy a multiple NFSv4 domain file system namespace. AFAICS these key words are not part of an informational RFC.

Neither RFC5661 nor RFC3530bis mention multiple NFSv4 domain deployments.

>
>
> I understand that this document is intended to satisfy the charter item that refers to "constraints and clarifications to the current NFSv4.0 and NFSv4.1 protocols"  but the current document isn't written to do that.

I disagree.

>  It is true that the handling of users and groups in RFC5661 and RFC3530bis could use some clarification but I think that is best done in a document solely devoted to that cleanup.

I don’t think any clarification of handling users and groups is needed for stand-alone NFSv4 domains.

> My view regarding these matters:
>       • Those sections are written assuming that if the client and server agree on how they handle name and domain everything will work out, which is true, but it doesn't even begin to specify how clients and servers might come to agreement on that.

There are many ways to run stand-alone NFSv4 domains - three of which are described in draft-ietf-nfsv4-multi-domain-fs-reqs-00. "how clients and servers might come to agreement on that” is up to the administrator of the stand-alone domain. No problem there that I see.

>       • Although the problems are most pressing in the case of multi-domain deployments,

Not ‘are more pressing’ - but ‘prevent deployment’

>  there are problems with single-domain deployments as well.


>  Consider the case in which the server is set up to do the appropriate uid-name mapping and the client isn't.  In this case, the client will receive name strings it is not prepared to decode.

Which is fine if the administrator wants it that way - the file system name space still works. There is no need to force a stand-alone NFSv4 deployment to do anything. All the pieces are there to be used in any way the administrator desires. As per all of NFS, plenty of rope …..

>       • To really clarify this, you would have to be prepared to rewrite the relevant sections and I don't think anyone is up for that right now.

Not needed for stand-alone deployments. They already work.

>       • If you did go down that path, you would face the issue of required protocol extensions, if not to allow client and server to negotiate appropriate handling, at least to allow client and server to decide whether they match.  I don't think that could be done until we have an NFSv4 versioning RFC.
>
> This document specifies how one might deploy/administer an NFSv4 namespace that involves multiple domains,

The document specifies what is REQUIRED in order to deploy/administer an NFSv4 namespace that involves multiple domains - not how one “might” deploy/administer such a namespace. In other words, you cannot run a multiple NFSv4 domain namespace without the “MUST, MUST NOT, and REQUIRED” portions of draft-ietf-nfsv4-multi-domain-fs-reqs-00.

> and is helpful in that regard.  It makes sense to me as an informational RFC.
>
> Regarding charter issues, I think we need to focus on clearly saying what needs to be said and what we think people need to understand and leave it to Spencer, Beepy, and Martin to address any purported charter issues.
>
> Now Let's go to line 7, the title:
>
> A more descriptive title would be "Creation of Multi-domain NFSv4 Namespaces."  The words "Requirements for" could be prepended.  They don't add anything but they might be helpful in avoiding having to change the file name, which might lead to document management difficulties.

I agree that the title can be changed to not include the word ‘requirements’.
>
>
>
> We now move from the "Document Title Chainsaw Massacre" portion of our program to the sequel "Document Chainsaw Massacre II, the Abstract".
>
> "This document describes constraints"
>
> The heart of the problem as I see it is that it seems that the abstract was written to match the charter item, rather than summarizing  the document in question.

Actually, the other way around.

>
>
> It is not clear what "constraints to the protocols" means, and we should clearly say what is done, which is to specify constraints to the deployment/administration of the protocols.

I agree. “constraints to the protocols” should be changed to "constraints to the deployment/administration of the protocols”

>
> Any such constraints are a minor part of the document

No, the constraints are the whole point of the document. What can work in a stand-alone deployment will in some cases _not_ work in a multi-domain deployment!

> and should not be mentioned so prominently (with the more major item relegated to "as well as").
>
> "to the NFSv4.0 and NFSv4.1 protocols"
>
> This is really about all NFSv4 protocols.

OK.

>
> "as well as the use of multi-domain capable file systems"
>
> This is really the heart of the matter.  As I understand it, the line of demarcation between this document and the subsequent one is that this one is built assuming the use of multi-domain-capable file systems.

Correct. You cannot deploy a read/write multiple NFSv4 domain namespace that exports non-multi-domain-capable file systems.

>
> "name resolution services, and security services required"
>
> Yup.
>
> "to fully enable a multiple NFSv4 domain file system,"
>
> Not clear what "fully enable" means.

The (not described) idea here is that a fully enabled multiple NFSv4 domain file system namespace is read/write capable for all users. This means that users from remote domains can create files, change ACLs, etc., on files on servers not in their NFSv4 domain, which requires said servers to export multi-domain capable file systems, have unique NFSv4 domain names, etc.

One could export a single-domain capable file system as read-only in a multiple NFSv4 domain file system name space: this is not “fully enabled”.
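
As background to this point, the identities involved are the NFSv4 owner and owner_group strings, which take the form "name@domain"; some deployments instead pass stringified numeric IDs, typically together with AUTH_SYS. The classification helper below is a sketch for illustration only and is not taken from any implementation.

    #include <ctype.h>
    #include <string.h>

    enum owner_form {
            OWNER_NAME_AT_DOMAIN,   /* "user@domain" or "group@domain" */
            OWNER_NUMERIC_STRING,   /* stringified UID/GID, e.g. "1000" */
            OWNER_UNRECOGNIZED
    };

    /* Classify an NFSv4 owner/owner_group attribute string.  In a multiple
     * NFSv4 domain namespace only the first form is usable everywhere, and
     * only when the domain portion can be resolved by the receiver. */
    enum owner_form
    classify_owner_string(const char *s)
    {
            const char *p;

            if (strchr(s, '@'))
                    return OWNER_NAME_AT_DOMAIN;
            if (*s == '\0')
                    return OWNER_UNRECOGNIZED;
            for (p = s; *p != '\0'; p++)
                    if (!isdigit((unsigned char)*p))
                            return OWNER_UNRECOGNIZED;
            return OWNER_NUMERIC_STRING;
    }
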


>
> Also, we are supporting a namespace rather than a filesystem.
>
> such as a multiple NFSv4 domain Federated File System.
>
> Call me "old-fashioned", but I have trouble parsing phrases like that which have lots of nouns used as adjectives.

:)

>
> So how about the following as a summary of the document:
>
> This document describes an approach to the construction of a file system name space supporting use of the NFSv4 protocols and allowing use of multiple NFSv4 domains.  This approach involves use of multi-domain capable file systems.  Also described are name resolution services and security services as well as administrative constraints appropriate to such a system.  Such a namespace is a suitable way to enable a Federated File System supporting the use of multiple NFSv4 domains.
>

Dave, could you please describe an "approach to deploying a multi-domain NFSv4 file system name space" that:

1) Doesn’t require exporting multi-domain capable file systems
2) Allows NFSv4 domain name collisions
3) Allows stringified UID/GID
4) Allows AUTH_SYS

or any combination of the above? In other words, this is not 'an approach'.

How about this for a document summary:

This document describes administrative constraints to the deployment of the NFSv4 protocols required for the construction of an NFSv4 file system name space supporting the use of multiple NFSv4 domains. Also described are administrative constraints to name resolution and security services appropriate to such a system. Such a name space is a suitable way to enable a Federated File System supporting the use of multiple NFSv4 domains.

I have not come up with an appropriate title change - but I do agree “requirements” should be removed from the title.

—>Andy

>



_______________________________________________
nfsv4 mailing list
nfsv4 <at> ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4
Spencer Shepler | 6 Jan 00:14 2015

NFSv4 WG Last Call for pNFS Flexible File Layout - closes on Jan 20th

 

Hi.  I am announcing the start of a Last Call for the Parallel NFS (pNFS) Flexible File Layout (draft-ietf-nfsv4-flex-files-04.txt) I-D.

 

Please take time and review this document and provide feedback.  I believe this document is of particular interest to the WG and there has been productive discussion in the past on the content.

 

The Last Call will end on January 20th.

 

Spencer

 

_______________________________________________
nfsv4 mailing list
nfsv4 <at> ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4
