Chuck Lever | 6 Oct 20:33 2015

Fwd: New Version Notification for draft-cel-nfsv4-rfc5666-implementation-experience-00.txt

This is a new personal I-D that I hope will serve as a home for
the WG’s recommended changes to RFC 5666. Nothing is written
in stone; please review and comment.

> Begin forwarded message:
> From: internet-drafts <at>
> Subject: New Version Notification for draft-cel-nfsv4-rfc5666-implementation-experience-00.txt
> Date: October 6, 2015 at 2:27:32 PM EDT
> To: "Charles Lever" <chuck.lever <at>>, "Chuck Lever" <chuck.lever <at>>
> A new version of I-D, draft-cel-nfsv4-rfc5666-implementation-experience-00.txt
> has been successfully submitted by Charles Lever and posted to the
> IETF repository.
> Name:		draft-cel-nfsv4-rfc5666-implementation-experience
> Revision:	00
> Title:		RPC-over-RDMA Version One Implementation Experience
> Document date:	2015-10-06
> Group:		Individual Submission
> Pages:		38
> URL:  
> Status:
> Htmlized:
> Abstract:
>   Experiences and challenges implementing the RPC-over-RDMA Version One
>   protocol are described.

Black, David | 5 Oct 21:38 2015

Re: DLB review of: draft-ietf-nfsv4-scsi-layout-02.txt

Hi Christoph,

David> A few comments inline ...

> -----Original Message-----
> From: Christoph Hellwig [mailto:hch <at>]
> Sent: Monday, October 05, 2015 3:20 AM
> To: Black, David
> Cc: nfsv4 <at>
> Subject: Re: [nfsv4] DLB review of: draft-ietf-nfsv4-scsi-layout-02.txt
> Hi Dave,
> thanks for the feedback.  I've pushed out the updates to
> but I won't push out a new
> draft before processing at least some of the updates from Dave Noveck.
> I've taken almost all of your suggestions, but a few comments below:

David> That sounds great, thanks.

> > I'd expand this to explain what sorts of names the three types of
> > designators allow, capitalize must (-> MUST) and observe that this
> > requirement forbids a number of other types of SCSI names, including
> > those based on T10 vendor IDs (designator type 1).
> I've also added T10 vendor IDs to the list.  I'm not sure why I excluded them;
> the intention was to exclude only the relative identifiers, which are not
> useful for an NFS client to identify the device.

(Continue reading)

Frank Filz | 30 Sep 21:02 2015

Another question, this one for RFC 7530

It's not clear whether a client is allowed to send a LOCK request with
new_lock_owner true when the client has previously done so for the same lock
owner and file (and thus the server has an active stateid for that
owner/file pair). If it is allowed, should the server re-use the same
stateid, or discard the old stateid and issue a new one?
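
To make the two options concrete, here is a toy model in C (every name
below is hypothetical; nothing here is prescribed by RFC 7530):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of the question above; all names are hypothetical. */
struct stateid { uint32_t seqid; uint64_t other; };

static struct stateid current;      /* stateid for one (owner, file) pair */
static bool have_stateid;
static uint64_t next_other = 1;

/* A LOCK request with new_lock_owner == true arrives for this pair. */
static struct stateid lock_new_owner(bool reuse_existing)
{
    if (!have_stateid) {                        /* the unambiguous case */
        current = (struct stateid){ 1, next_other++ };
        have_stateid = true;
    } else if (reuse_existing) {
        current.seqid++;                        /* option 1: re-use stateid */
    } else {
        current = (struct stateid){ 1, next_other++ };
                                                /* option 2: fresh stateid */
    }
    return current;
}

int main(void)
{
    struct stateid a = lock_new_owner(false);
    struct stateid b = lock_new_owner(false);   /* same owner, same file */
    printf("server issued a new stateid: %s\n",
           a.other != b.other ? "yes" : "no");
    return 0;
}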




Frank Filz | 30 Sep 19:54 2015

Question about EXCLUSIVE4_1

Here are some statements from RFC 5661 that don't quite add up:

Attribute 75: suppattr_exclcreat

The bit vector that would set all REQUIRED and RECOMMENDED attributes
that are supported by the EXCLUSIVE4_1 method of file creation via
the OPEN operation. The scope of this attribute applies to all
objects with a matching fsid.


In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1.
Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1
case, but because the server may use attributes of the target object
to store the verifier, the set of allowable attributes may be fewer
than the set of attributes SETATTR allows. The allowable attributes
for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat
(Section 5.8.1.14) attribute. If the client attempts to set in
cva_attrs an attribute that is not in suppattr_exclcreat, the server
MUST return NFS4ERR_INVAL. The response field, attrset, indicates
both which attributes the server set from cva_attrs and which
attributes the server used to store the verifier. As described in
Section 18.16.4, the client can compare cva_attrs.attrmask with
attrset to determine which attributes were used to store the verifier.
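
That MUST reduces to a bitmask test. A minimal sketch (hypothetical
names; real attribute masks are variable-length bitmap4 arrays,
collapsed here to a single 64-bit word for illustration):

#include <stdint.h>

#define NFS4_OK        0
#define NFS4ERR_INVAL 22    /* value from RFC 5661 */

/* Reject an EXCLUSIVE4_1 OPEN whose cva_attrs requests any attribute
 * that is not in suppattr_exclcreat. */
static int check_exclusive4_1_attrs(uint64_t cva_attrs_mask,
                                    uint64_t suppattr_exclcreat)
{
    if (cva_attrs_mask & ~suppattr_exclcreat)
        return NFS4ERR_INVAL;
    return NFS4_OK;
}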



internet-drafts | 26 Sep 00:47 2015

I-D Action: draft-ietf-nfsv4-rpcrdma-bidirection-01.txt

A New Internet-Draft is available from the on-line Internet-Drafts directories.
 This draft is a work item of the Network File System Version 4 Working Group of the IETF.

        Title           : Size-Limited Bi-directional Remote Procedure Call On Remote Direct Memory Access Transports
        Author          : Charles Lever
        Filename        : draft-ietf-nfsv4-rpcrdma-bidirection-01.txt
        Pages           : 15
        Date            : 2015-09-25

   Recent minor versions of NFSv4 work best when ONC RPC transports can
   send ONC RPC transactions in both directions.  This document
   describes conventions that enable RPC-over-RDMA Version One transport
   endpoints to interoperate when operation in both directions is
   necessary.

The IETF datatracker status page for this draft is:

There's also a htmlized version available at:

A diff from the previous version is available at:

Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at

Internet-Drafts are also available by anonymous FTP at:

David Noveck | 25 Sep 21:30 2015

(old) comments regarding draft-ietf-nfsv4-scsi-01

I tried to review section 3 and was quite confused by it.  I realize that it closely follows RFC5663 but that didn't help dispel my confusion.   I think we have a problem in that those who have implemented pNFS block have never clearly explained exactly how it is supposed to work.  The basic problems that I saw in section 3 were:
  • A lack of clarity about whether security (on a per-principal basis) or locking (on a per-owner basis) was being enforced.
  • A lack of clarity about the respective roles of the MDS and client in making sure these constraints were enforced.
  • A lot of material that needs to be clearly explained before the Security Considerations section is instead left to be hurriedly explained there.
As an example of the kind of thing I'm talking about, consider the following:

Similarly, SCSI storage devices are unable to validate NFS Access Control Lists (ACLs) and file open modes, so the client must enforce the policies before sending a READ or WRITE request to the storage device.
Which has the following issues:
  • It's true that storage devices can't deal with ACLs, but NFSv4 clients aren't supposed to either. They are supposed to depend on the server to do this when using ACCESS and OPEN.
  • It isn't very clear what specific "policies" it is referring to.
  • If you are supposed to check this stuff before issuing the IO, what about the window between the check and the IO?
So here's a possible replacement to stimulate discussion.  It consists of a new section (with some subsections) and  a new proposed (very-much-shortened) Security Considerations section.  I'm not sure this accurately describes current implementations but I think it is clear and would work if implemented.  Please let me know if you disagree or have any other comments.

I've built this possible replacement based on Tom's intuition, as stated in draft-ietf-pnfs-types, that it is the job of the client and data storage device to enforce NFSv4 semantics.  Although Tom focuses only on the authorization issues, we need to extend that to deal with locking restrictions as well.  The new section is built around that idea and structures the responsibilities as follows:
  • the responsibility for making decisions as to what access is semantically valid is the server's.
  • the basic responsibility for enforcing those decisions is the pNFS clients'. 
  • When situations make it impossible for the client to properly enforce those decisions, the server may fence a client as a last resort in order to make sure that NFSv4 semantics are respected.
Note that in this version I've eliminated the material about virtualized block and what it might be able to do.  This is not because I think it should not be discussed.  It is just that I think we need to focus on clearly describing the simple case first.  Later I'll have further comments about virtualized block.

2. Enforcing NFSv4 Semantics

The functionality provided by SCSI Persistent Reservations makes it possible for the MDS to control access by individual client machines to specific LUs. Individual client machines may be allowed to read or write certain block devices, or prevented from doing so. Finer-grained access control methods are not generally available.
For this reason, certain responsibilities for enforcing NFSv4 semantics, including security and locking, are delegated to pNFS clients when SCSI layouts are being used. The metadata server's role is limited to granting layouts appropriately, and the pNFS clients have to be trusted to perform only the accesses allowed by the layout extents they currently hold (e.g., not to access storage for files on which no layout extent is held). In general, the server will not be able to prevent a client that holds a layout for a file from accessing parts of the physical disk not covered by the layout. Similarly, the server will not be able to prevent a client from accessing blocks covered by a layout that it has already returned. The pNFS client must respect the layout model for this mapping type in order to respect NFSv4 semantics.
Furthermore, there is no way for the storage device to determine the specific NFSv4 entity (principal, openowner, lockowner) on whose behalf an IO operation is being done. This fact may limit the functionality to be supported and may require the pNFS client to enforce server policies other than those describable by layouts.
In cases in which layouts previously granted become invalid, the server has the option of recalling them. In situations in which communication difficulties prevent this from happening, layouts may be revoked by the server. This revocation is accompanied by changes in persistent reservations that have the effect of preventing SCSI access to the LUs in question by the client.
2.1 Use of Open Stateids
The effective implementation of these NFSv4 semantic constraints is complicated by the fact that the two types of functionality to be enforced apply to actors of different granularities:
  • Security constraints are enforced for particular principals.
  • Locking constraints are enforced for particular owners (openowners and lockowners).
Fundamental to enforcing both of these sorts of constraints is the principle that a pNFS client must not issue a SCSI IO operation unless it possesses both:
  • A valid open stateid for the file in question that allows IO of the type being performed, associated with the openowner and principal on whose behalf the IO is to be done.
  • A valid layout stateid for the file in question that covers the byte range on which the IO is to be done and that allows IO of that type to be done.
As a result, if the equivalent of IO with an anonymous or write-bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI layout type. The client MAY attempt such IO using READs and WRITEs that do not use pNFS and are directed to the MDS.
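
As a sketch of the client-side gate set out in the bullets above (a toy
model; every type, field, and name below is illustrative, not from any
implementation):

#include <stdbool.h>
#include <stdint.h>

/* Toy model: does this client hold the state needed to issue a SCSI
 * IO directly, or must it fall back to READ/WRITE through the MDS? */
struct open_state   { bool valid; bool allow_read, allow_write; };
struct layout_state { bool valid; uint64_t off, len;
                      bool allow_read, allow_write; };
struct io_req       { uint64_t off, len; bool is_write; };

static bool may_issue_scsi_io(const struct open_state *os,
                              const struct layout_state *ls,
                              const struct io_req *io)
{
    /* 1. A valid open stateid that allows IO of this type, held on
     *    behalf of the openowner/principal doing the IO. */
    if (!os->valid || (io->is_write ? !os->allow_write : !os->allow_read))
        return false;

    /* 2. A valid layout stateid covering the byte range and allowing
     *    IO of this type. */
    if (!ls->valid ||
        io->off < ls->off || io->off + io->len > ls->off + ls->len ||
        (io->is_write ? !ls->allow_write : !ls->allow_read))
        return false;

    return true;    /* otherwise fall back to READ/WRITE through the MDS */
}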

When open stateids are revoked, due to lease expiration or any form of administrative revocation, the server MUST recall all layouts that allow IO to be done on any of the files for which the open revocation happens. When those layouts are not successfully returned, the client in question MUST be prevented from accessing any SCSI device covered by these layouts until ...

2.2 Enforcing Security Restrictions

The restriction noted above provides adequate enforcement of the appropriate security restrictions when the principal issuing the IO is the same as the one that opened the file. The server is responsible for checking that the IO mode requested by the OPEN is allowed for the principal doing the OPEN. If the correct sort of IO is done on behalf of the same principal, the security restriction is thereby enforced.

If IO is to be done on behalf of a principal different from the one that opened the file, the client SHOULD send the IO to the metadata server to be performed there, rather than issuing it directly to the storage device.

2.3 Enforcing Locking Restrictions

Mandatory enforcement of whole-file locking by means of share reservations is provided when the pNFS client obeys the requirement set forth in Section 2.1 above. Since performing IO requires a valid open stateid, an IO that violates an existing share reservation would only be possible when the server allows conflicting open stateids to exist.

The nature of the SCSI layout type is such that implementation/enforcement of mandatory byte-range locks is very difficult. Given that layouts are granted to clients rather than to owners, the pNFS client is in no position to successfully arbitrate among multiple lockowners on the same client. Suppose lockowner A is doing a write and, while that IO is pending, lockowner B requests a mandatory byte-range lock for a range potentially overlapping the pending IO. In such a situation, the lock request cannot be granted while the IO is pending. In a non-pNFS environment, the server would have to wait for the pending IO to complete before granting the mandatory byte-range lock. In the pNFS environment, the server does not issue the IO and is thus in no position to wait for its completion. The server may recall such layouts, but in doing so it has no way of distinguishing those being used by lockowners A and B, making it difficult to allow B to perform IO while forbidding A from doing so. Given this fact, the MDS needs to successfully recall all layouts that overlap the range being locked before returning a successful response to the LOCK request. While the lock is in effect, the server SHOULD respond to requests for layouts which overlap a currently locked area with NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic, a server MAY do this for all layout requests on the file in question as long as there are any byte-range locks in effect.
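
A toy model of the MDS-side ordering described above (the error-code
values are from RFC 5661; everything else is illustrative, with one
lock and one layout per file for brevity):

#include <stdbool.h>
#include <stdint.h>

enum { NFS4_OK = 0, NFS4ERR_DELAY = 10008,
       NFS4ERR_LAYOUTUNAVAILABLE = 10059 };   /* values from RFC 5661 */

struct range   { uint64_t off, len; bool set; };
struct fs_file { struct range lock, layout; };

static bool overlaps(const struct range *r, uint64_t off, uint64_t len)
{
    return r->set && off < r->off + r->len && r->off < off + len;
}

static int mds_lock(struct fs_file *f, uint64_t off, uint64_t len)
{
    if (overlaps(&f->layout, off, len)) {
        f->layout.set = false;       /* stand-in for CB_LAYOUTRECALL and
                                      * waiting for the LAYOUTRETURN */
        return NFS4ERR_DELAY;        /* client retries the LOCK */
    }
    f->lock = (struct range){ off, len, true };
    return NFS4_OK;
}

static int mds_layoutget(struct fs_file *f, uint64_t off, uint64_t len)
{
    if (overlaps(&f->lock, off, len))
        return NFS4ERR_LAYOUTUNAVAILABLE;     /* as proposed above */
    f->layout = (struct range){ off, len, true };
    return NFS4_OK;
}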

Given these issues, it may be difficult for servers that support mandatory byte-range locks to also support SCSI layouts. Such servers can support advisory byte-range locks instead. The NFSv4 protocol currently has no way of determining whether byte-range lock support on a particular file system will be mandatory or advisory, except by trying an operation that would conflict if mandatory locking is in effect. Therefore, to avoid confusion, servers SHOULD NOT switch between mandatory and advisory byte-range locking based on whether any SCSI layouts have been obtained or whether a client that has obtained a SCSI layout has requested a byte-range lock.

3. Security Considerations

Access to SCSI storage devices is logically at a lower layer of the I/O stack than NFSv4, and hence NFSv4 security is not directly applicable to protocols that access such storage directly. Depending on the protocol, some of the security mechanisms provided by NFSv4 (e.g., encryption, cryptographic integrity) may not be available or may be provided via different means. At one extreme, pNFS with SCSI layouts can be used with storage access protocols (e.g., parallel SCSI) that provide essentially no security functionality. At another extreme, pNFS may be used with storage protocols such as iSCSI that can provide significant security functionality. It is the responsibility of those administering and deploying pNFS with a SCSI storage access protocol to ensure that appropriate protection is provided to that protocol (physical security is a common means for protocols not based on IP). In environments where the security requirements for the storage protocol cannot be met, pNFS SCSI layouts SHOULD NOT be used.

When security is available for a storage protocol, it is generally at a different granularity and with a different notion of identity than NFSv4 (e.g., NFSv4 controls access by principals to files, iSCSI controls initiator access to volumes). The responsibility for enforcing appropriate correspondences between these layers is shared by the metadata server and the pNFS client as described in Section 2.1. As with the issues in the first paragraph of this section, in environments where the security requirements are such that considerable client-side participation in enforcing security restrictions is not acceptable, pNFS SCSI layouts SHOULD NOT be used.

To return to the subject of virtualized block for a bit, let me first state my conclusions.
  • Virtualized block is a good thing to implement. This is particularly so now given the difficulties with RPC-RDMA (spec and implementation). It would be great to leverage SCSI's ability to do DMA and use it in a file context.
  • The primary advantage of virtualized block is that it enables much simpler layouts for the client to deal with; a client can receive a simple layout for the entire file, simplifying its layout management task. In addition, it should not have to deal with separate read and write layouts to support snapshots/clones.
  • It would be helpful if the client knew it had a simpler layout management task. This would require either new information in the device descriptions or a separate layout type for virtualized block.
  • An important advantage of virtualized block is in the area of safety (as opposed to security). With virtualized block, a client-side arithmetic error cannot wind up destroying a critical piece of fs metadata. Any arithmetic error is most likely to affect the file in question, making the issues pretty much like those in doing pNFS file.
  • Given that SCSI knows only about clients (and not principals), I don't see any way that we can totally avoid trusting the clients to some degree.
  • It might be possible to tell clients that layouts to particular devices may only be used by particular principals (or principals in a particular group). That would mean that enforcement, while still part of the client, could be done in a smaller set of places. This would also require a new layout type.
Black, David | 23 Sep 01:18 2015

DLB review of: draft-ietf-nfsv4-scsi-layout-02.txt


In Prague, I promised to review this draft, so here are some comments.
In general the draft is well-written, so this review focuses on the
SCSI aspects that are new to this draft.

One of the consequences of making this a self-contained draft is that
a section should be added to explain/summarize the differences between
the SCSI layout and the block-volume layout (which is independent of
SCSI).  The draft abstract and intro should contain some high-level
points of comparison/difference.

Also, somewhere early in the draft, it should be stated that SCSI
persistent reservations MUST be supported by all entities (clients,
servers, storage systems) that support pNFS SCSI layouts.

Section 2.3.1 - Nit - say that the Device Identification VPD page is
VPD page 0x83, which is retrieved via the INQUIRY command with the EVPD
bit set to one.  This may seem obvious, but I would not assume that
every reader of this document will speak fluent SCSI ;-).
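
For readers who don't speak fluent SCSI, the fetch being described is
this CDB (a sketch per SPC-4; the helper name is made up):

#include <stdint.h>
#include <string.h>

/* Build a 6-byte INQUIRY CDB that retrieves the Device Identification
 * VPD page (0x83). */
static void build_inquiry_vpd83(uint8_t cdb[6], uint16_t alloc_len)
{
    memset(cdb, 0, 6);
    cdb[0] = 0x12;                        /* INQUIRY opcode */
    cdb[1] = 0x01;                        /* EVPD bit set to one */
    cdb[2] = 0x83;                        /* page code: Device Identification */
    cdb[3] = (uint8_t)(alloc_len >> 8);   /* ALLOCATION LENGTH (MSB) */
    cdb[4] = (uint8_t)(alloc_len & 0xff); /* ALLOCATION LENGTH (LSB) */
    /* cdb[5] is CONTROL, left zero */
}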

   2.  The "DESIGNATOR TYPE" must be set to one of three values
       explicitly listed in the "pnfs_scsi_designator_type"

I'd expand this to explain what sorts of names the three types of
designators allow, capitalize must (-> MUST) and observe that this
requirement forbids a number of other types of SCSI names, including
those based on T10 vendor IDs (designator type 1).

   and note that ASCII may be used
   as the code set for UTF-8 text that contains only ASCII characters.

Important nit: "only ASCII characters" -> "only printable ASCII characters".
Every UTF-8 character is also an 8-bit "ASCII character" - don't go there.

   NFS clients thus MUST iterate the descriptors until a match for
   "sbv_code_set", "sbv_designator_type" and "sbv_designator" is found,
   or until the end of VPD page.

"iterate the descriptors until a match for ... is found, or until the end
of the VPD page." -> "check all the descriptors for a possible match to ... ."
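
A sketch of that check-all-descriptors loop against the SPC-4 layout of
VPD page 0x83 (the function and parameter names are made up; the code
set, designator type, and designator values would come from the sbv_*
fields):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool find_designator(const uint8_t *vpd, size_t vpd_len,
                            uint8_t code_set, uint8_t desig_type,
                            const uint8_t *desig, size_t desig_len)
{
    /* Bytes 2-3 hold the length of the descriptor list; descriptors
     * start at byte 4. */
    size_t page_len = (((size_t)vpd[2] << 8) | vpd[3]) + 4;
    size_t off = 4;

    if (page_len > vpd_len)
        page_len = vpd_len;

    while (off + 4 <= page_len) {
        uint8_t cs  = vpd[off] & 0x0f;      /* CODE SET */
        uint8_t dt  = vpd[off + 1] & 0x0f;  /* DESIGNATOR TYPE */
        uint8_t len = vpd[off + 3];         /* DESIGNATOR LENGTH */

        if (off + 4 + len > page_len)
            break;
        if (cs == code_set && dt == desig_type &&
            len == desig_len &&
            memcmp(vpd + off + 4, desig, len) == 0)
            return true;                    /* match found */
        off += 4 + len;                     /* next descriptor */
    }
    return false;                           /* no match on this LU */
}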

Section - how does the MDS pass reservation key values to clients?
The answer is sbv_pr_key, which should be mentioned here.

   If using a single key per client, the MDS needs to be
   aware of the per-client fencing granularity.

I think that's true no matter what, but at Logical Unit granularity in
all cases.  The quoted text seems to imply that a single fencing action
(PREEMPT AND ABORT) could affect multiple logical units, which I don't
believe is correct.

I'd also add text to suggest that a single key per client is a simpler
approach than a key per [client, LU] pair, in part because it
simplifies identifying which clients have reservations when looking
at reservation state on the storage systems.


   In case of a non-responding client the MDS MUST fence the client by

This is subtle, but I don't think "MUST" is appropriate here because that
depends on MDS determination of "non-responding" which is implementation-
specific.  I'd suggest "MUST fence" -> "fences" in the current text.
Alternatively, it'd be ok to say that fencing MUST be performed by issuing
a "PERSISTENT RESERVE OUT" command ... - my concern is basing the "MUST"
on MDS determination that a client is non-responsive.

   Note that the client can
   distinguish I/O errors due to fencing from other errors based on the

For clarity: "status" -> "SCSI status (part of the SCSI sense data
returned for a command that encountered an I/O error)"
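
For reference, commands from a fenced (preempted) initiator complete
with a RESERVATION CONFLICT status; a minimal sketch of the check (the
0x18 value is from SAM, the function name is made up):

#include <stdbool.h>
#include <stdint.h>

#define SAM_STAT_RESERVATION_CONFLICT 0x18

/* Fencing via PREEMPT AND ABORT shows up at the fenced client as
 * RESERVATION CONFLICT status on its subsequent commands. */
static bool io_error_is_fencing(uint8_t scsi_status)
{
    return scsi_status == SAM_STAT_RESERVATION_CONFLICT;
}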


   A client that detects I/O errors on the storage devices MUST commit
   through the MDS, return all outstanding layouts for the device,
   forget the device ID and unregister the reservation key.  Future
   GETDEVICEINFO calls may refer to the storage device again, in which
   case a new registration will be performed.

Several nits in the first line:

   A client that detects I/O errors on the storage devices MUST commit
->
   A client that detects RESERVATION CONFLICT I/O errors on a storage
   device MUST commit all layouts that use the storage device

Also change last sentence to active voice:

   in which case a new registration will be performed.
->
   in which case the client will perform a new registration based
   on the key provided (via sbv_pr_key) at that time.

Section 2.8 should also discuss client interaction with volatile write
caches.  If the storage system contains a volatile cache (e.g., V_SUP
is set to one in the Extended Inquiry VPD page [page 0x86]), setting
the FUA bit to one in write commands is a very good idea.
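
A sketch of that suggestion (bit and byte positions per SPC-4 and
SBC-3; the function names are made up and buffer handling is
illustrative):

#include <stdbool.h>
#include <stdint.h>

/* V_SUP is bit 0 of byte 6 in the Extended INQUIRY Data VPD page
 * (0x86); if set, the device may have a volatile write cache. */
static bool volatile_cache_supported(const uint8_t *vpd86)
{
    return (vpd86[6] & 0x01) != 0;
}

/* Build a WRITE(10) CDB, optionally with the FUA bit set. */
static void build_write10(uint8_t cdb[10], uint32_t lba,
                          uint16_t blocks, bool fua)
{
    cdb[0] = 0x2a;                     /* WRITE(10) opcode */
    cdb[1] = fua ? 0x08 : 0x00;        /* FUA bit */
    cdb[2] = (uint8_t)(lba >> 24);
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[6] = 0;                        /* group number */
    cdb[7] = (uint8_t)(blocks >> 8);   /* TRANSFER LENGTH (MSB) */
    cdb[8] = (uint8_t)blocks;          /* TRANSFER LENGTH (LSB) */
    cdb[9] = 0;                        /* CONTROL */
}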

Section 3 - Security considerations

   At one extreme, pNFS with SCSI
   layouts can be used with storage access protocols (e.g., parallel
   SCSI) that provide essentially no security functionality.  At the
   other extreme, pNFS may be used with storage protocols such as iSCSI
   that can provide significant security functionality.

References should be cited for both parallel SCSI and iSCSI.


> -----Original Message-----
> From: nfsv4 [mailto:nfsv4-bounces <at>] On Behalf Of Christoph Hellwig
> Sent: Saturday, August 15, 2015 4:48 AM
> To: internet-drafts <at>
> Cc: nfsv4 <at>; i-d-announce <at>
> Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-scsi-layout-02.txt
> This contains the review feedback from Tom, and two minor XDR tweaks
> based on implementation experience.
> I'd like to start a WG last call on this draft soon.
> _______________________________________________
> nfsv4 mailing list
> nfsv4 <at>


David Noveck | 20 Sep 14:54 2015

Status of draft-ietf-nfsv4-rfc3530-migration-update

I notice that this is still listed in the Datatracker with the WG state "In WG Last Call" even though last call ended two weeks ago.

I note the definition of this state includes the following information:

A WG I-D in this state should remain "In WG Last Call" until the WG Chair moves it to another state. The WG Chair may configure the Datatracker to send an e-mail after a specified period of time to remind or 'nudge' the Chair to conclude the WGLC and to determine the next state for the document.

Although I think the datatracker idea is a good one for you to consider, the main purpose of this mail is to give the chair(s) a friendly reminder of the need to determine the next state for this document.
Spencer Shepler | 18 Sep 08:12 2015

NFSv4 WG will NOT meet at IETF 94 in Yokohama


Hi.  Based on the feedback for agenda items, we will NOT be meeting in Yokohama.


Note that IETF 95 will be held in Buenos Aires, Argentina.





Mora, Jorge | 15 Sep 21:50 2015

review of draft-ietf-nfsv4-minorversion2-39.txt

Minor corrections:

*** In the sixth paragraph of section:  Finishing or Stopping a Secure Inter-Server Copy

   stored on the destination server.  The destination server will then
   delete the <"copy_to_auth", user id, source list, nounce, nounce MIC,
   context handle, handle version> privilege and the associated
   "copy_confirm_auth" RPCSEC_GSSv3 handle.  The client MUST destroy

What is "nounce"?

What are the "<" and ">"? Is the list enclosed by "<" and ">" related to "privilege"?

*** Spelling error "uiniquely" in the second paragraph of section:  Inter-Server Copy via ONC RPC without RPCSEC_GSS

   used to uiniquely identify the destination server to the source

*** Spelling error "delallocates" in the 13th paragraph of section:

7.  Space Reservation

   DEALLOCATE  This operation delallocates the blocks backing a region

*** Missing IO_ADVISE from Table 2 in section:

11.2.  New Operations and Their Valid Errors

*** Missing IO_ADVISE from the table "Operations" in section:

Also, this table and the next table "Callback Operations" don't have a
table number.  These should be "Table 5" and "Table 6" respectively, and
all other table numbers must be adjusted.

*** Spelling error "destinaton" in the sixth paragraph of section:

   know which netloc4 network location the destinaton might use to

*** Duplicate word "the the" in the 7th paragraph of section:

   di_length returned MAY be for the entire hole.  If the the owner has

*** Spelling error "o" and "wlll" in the 8th paragraph of section:

   multiple o the clone block size wlll be zeroed.

*** Duplicate word "see see" in the 9th paragraph of section:

   The CLONE operation is atomic in that other operations may not see     <--
   see any intermediate states between the state of the two files before


Mora, Jorge | 15 Sep 21:44 2015

review of draft-ietf-nfsv4-minorversion2-dot-x-39.txt

Just a minor spelling error:

*** Duplicate word "Society Society" on the copyright notice:

   ///  * - Neither the name of Internet Society Society, IETF or IETF
