So here's a possible replacement to stimulate discussion. It consists of a new section (with some subsections) and a new proposed (very-much-shortened) Security Considerations section. I'm not sure this accurately describes current implementations but I think it is clear and would work if implemented. Please let me know if you disagree or have any other comments.
I've built this possible replacement based on Tom's intuition, as stated in draft-ietf-pnfs-types, that it is the job of the client and data storage device to enforce NFSv4 semantics. Although Tom focuses on the authorization issues only, we need to clearly extend that to deal with locking restrictions as well. The new section is built around that idea and structures the responsibilities as follows:
Note that in this version I've eliminated the material about virtualized block and what it might be able to do. This is not because I think it should not be discussed. It is just that I think we need to focus on clearly describing the simple case first. Later I'll have further comments about virtualized block.
2. Enforcing NFSv4 Semantics
The functionality provided by SCSI Persistent Reservations makes it possible for the MDS to control access by individual client machines to specific LUs. Individual client machines may be allowed to or prevented from reading or writing to certain block devices. Finer-grained access control methods are not generally available.
For this reason, certain responsibilities for enforcing NFSv4 semantics, including security and locking, are delegated to pNFS clients when SCSI layouts are being used. The metadata server's role is only to grant layouts appropriately, and the pNFS clients have to be trusted to perform only those accesses allowed by the layout extents they currently hold (e.g., they must not access storage for files on which no layout extent is held). In general, the server will not be able to prevent a client that holds a layout for a file from accessing parts of the physical disk not covered by the layout. Similarly, the server will not be able to prevent a client from accessing blocks covered by a layout that it has already returned. The pNFS client must respect the layout model for this mapping type in order to appropriately respect NFSv4 semantics.
Furthermore, there is no way for the storage to determine the specific NFSv4 entity (principal, openowner, lockowner) on whose behalf the IO operation is being done. This fact may limit the functionality to be supported and require the pNFS client to implement server policies other than those describable by layouts.
In cases in which layouts previously granted become invalid, the server has the option of recalling them. In situations in which communication difficulties prevent this from happening, layouts may be revoked by the server. This revocation is accompanied by changes in persistent reservation which have the effect of preventing SCSI access to the LUs in question by the client.
2.1 Use of Open Stateids
The effective implementation of these NFSv4 semantic constraints is complicated by the different granularities of the actors for the different types of the functionality to be enforced:
- To enforce security constraints for particular principals.
- To enforce locking constraints for particular owners (openowners and lockowners).
Fundamental to enforcing both of these sorts of constraints is the principle that a pNFS client must not issue a SCSI IO operation unless it possesses both:
- A valid open stateid for the file in question that allows IO of the type in question and that is associated with the openowner and principal on whose behalf the IO is to be done.
- A valid layout stateid for the file in question that covers the byte range on which the IO is to be done and that allows IO of that type to be done.
As a result, if the equivalent of IO with an anonymous or write-bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI layout type. The client MAY attempt such IO using READs and WRITEs that do not use pNFS and are directed to the MDS.
When open stateids are revoked, due to lease expiration or any form of administrative revocation, the server MUST recall all layouts that allow IO to be done on any of the files for which open revocation happens. When there is a failure to successfully return those layouts, the client in question MUST be prevented from accessing any SCSI device covered by these layouts until ...
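To make the two-stateid rule concrete, here is a minimal sketch of the check a client might run before choosing an IO path. All names (OpenState, LayoutExtent, may_use_scsi, choose_io_path) are hypothetical and exist only for illustration; they are not taken from any implementation, and real clients track considerably more state.

```python
from dataclasses import dataclass
from enum import Enum


class IoMode(Enum):
    READ = 1
    WRITE = 2


@dataclass(frozen=True)
class OpenState:
    """Hypothetical client-side record of a valid open stateid."""
    principal: str
    openowner: str
    allows: frozenset  # IoMode values permitted by the OPEN


@dataclass(frozen=True)
class LayoutExtent:
    """Hypothetical record of a held layout extent: [offset, offset + length)."""
    offset: int
    length: int
    allows: frozenset  # IoMode values permitted by the layout


def layout_covers(extents, mode, offset, length):
    """True if the extents allowing `mode` cover the whole byte range."""
    pos, end = offset, offset + length
    usable = sorted((e for e in extents if mode in e.allows),
                    key=lambda e: e.offset)
    for ext in usable:
        if ext.offset > pos:
            break  # gap in coverage before the range is complete
        pos = max(pos, ext.offset + ext.length)
        if pos >= end:
            return True
    return pos >= end


def may_use_scsi(open_state, extents, principal, openowner, mode, offset, length):
    """Both conditions must hold before any SCSI IO is issued."""
    if open_state is None:
        return False  # equivalent of an anonymous or bypass stateid
    if (open_state.principal, open_state.openowner) != (principal, openowner):
        return False  # open stateid belongs to someone else
    if mode not in open_state.allows:
        return False
    return layout_covers(extents, mode, offset, length)


def choose_io_path(open_state, extents, principal, openowner, mode, offset, length):
    """Return 'scsi' only when allowed; otherwise fall back to the MDS."""
    ok = may_use_scsi(open_state, extents, principal, openowner,
                      mode, offset, length)
    return "scsi" if ok else "mds"
```

Falling back to "mds" here only means the client MAY try ordinary READs and WRITEs through the MDS; whether those succeed is still up to the server's own checks.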
2.2 Enforcing Security Restrictions
The restriction noted above provides adequate enforcement of the appropriate security restrictions when the principal issuing the IO is the same as the one that opened the file. The server is responsible for checking that the IO mode requested by the open is allowed for the principal doing the OPEN. If the correct sort of IO is done on behalf of the same principal, then the security restriction is thereby enforced.
If IO is done by a principal different from the one that opened the file, the client SHOULD send the IO to be performed by the metadata server rather than doing it directly to the storage device.
2.3 Enforcing Locking Restrictions
Mandatory enforcement of whole-file locking by means of share reservations is provided when the pNFS client obeys the requirement set forth in Section 2.1 above. Since performing IO requires a valid open stateid, an IO that violates an existing share reservation would only be possible when the server allows conflicting open stateids to exist.
The nature of the SCSI layout type is such that implementation and enforcement of mandatory byte-range locks is very difficult. Given that layouts are granted to clients rather than owners, the pNFS client is in no position to successfully arbitrate among multiple lockowners on the same client. Suppose lockowner A is doing a write and, while the IO is pending, lockowner B requests a mandatory byte-range lock for a byte range potentially overlapping the pending IO. In such a situation, the lock request cannot be granted while the IO is pending. In a non-pNFS environment, the server would have to wait for pending IO before granting the mandatory byte-range lock. In the pNFS environment the server does not issue the IO and is thus in no position to wait for its completion. The server may recall such layouts but, in doing so, it has no way of distinguishing those being used by lockowners A and B, making it difficult to allow B to perform IO while forbidding A from doing so. Given this fact, the MDS needs to successfully recall all layouts that overlap the range being locked before returning a successful response to the LOCK request. While the lock is in effect, the server SHOULD respond to requests for layouts which overlap a currently locked area with NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic, a server MAY do this for all layout requests on the file in question as long as there are any byte-range locks in effect.
Given these difficulties, it may be difficult for servers supporting mandatory byte-range locks to also support SCSI layouts. Servers can support advisory byte-range locks instead. The NFSv4 protocol currently has no way of determining whether byte-range lock support on a particular file system will be mandatory or advisory, except by trying an operation which would conflict if mandatory locking is in effect. Therefore, to avoid confusion, servers SHOULD NOT switch between mandatory and advisory byte-range locking based on whether any SCSI layouts have been obtained or whether a client that has obtained a SCSI layout has requested a byte-range lock.
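The server-side behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions: FileLockState, handle_lock, handle_layoutget, and the recall_layout callback are hypothetical names, conflict checking between lockowners is omitted, and what the server does when a recall fails is deliberately left open (the sketch simply declines to grant the lock).

```python
class FileLockState:
    """Hypothetical per-file MDS bookkeeping."""

    def __init__(self):
        self.locks = []    # (lockowner, offset, length)
        self.layouts = []  # (client_id, offset, length)


def overlaps(a_off, a_len, b_off, b_len):
    """True if half-open ranges [a_off, a_off+a_len) and [b_off, b_off+b_len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def handle_lock(state, lockowner, offset, length, recall_layout):
    """Grant a mandatory byte-range lock only after every overlapping
    layout has been successfully recalled.

    recall_layout(client_id, offset, length) is an assumed callback that
    returns True once the client has returned the layout.
    """
    remaining = []
    all_recalled = True
    for client_id, l_off, l_len in state.layouts:
        if overlaps(offset, length, l_off, l_len):
            if recall_layout(client_id, l_off, l_len):
                continue  # layout returned, drop it from the held set
            all_recalled = False
        remaining.append((client_id, l_off, l_len))
    state.layouts = remaining
    if not all_recalled:
        return False  # cannot return success for LOCK yet
    state.locks.append((lockowner, offset, length))
    return True


def handle_layoutget(state, client_id, offset, length,
                     refuse_all_while_locked=False):
    """Refuse layouts that overlap a locked range; optionally refuse all
    layouts on the file while any byte-range lock is in effect."""
    if refuse_all_while_locked and state.locks:
        return "NFS4ERR_LAYOUTUNAVAILABLE"
    for _owner, k_off, k_len in state.locks:
        if overlaps(offset, length, k_off, k_len):
            return "NFS4ERR_LAYOUTUNAVAILABLE"
    state.layouts.append((client_id, offset, length))
    return "NFS4_OK"
```

Note that the recall is per client, not per lockowner, which is exactly why the server cannot let B keep doing direct IO while stopping A when both are on the same client.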
3. Security Considerations
Access to SCSI storage devices is logically at a lower layer of the I/O stack than NFSv4, and hence NFSv4 security is not directly applicable to protocols that access such storage directly. Depending on the protocol, some of the security mechanisms provided by NFSv4 (e.g., encryption, cryptographic integrity) may not be available or may be provided via different means. At one extreme, pNFS with SCSI layouts can be used with storage access protocols (e.g., parallel SCSI) that provide essentially no security functionality. At another extreme, pNFS may be used with storage protocols such as iSCSI that can provide significant security functionality. It is the responsibility of those administering and deploying pNFS with a SCSI storage access protocol to ensure that appropriate protection is provided to that protocol (physical security is a common means for protocols not based on IP). In environments where the security requirements for the storage protocol cannot be met, pNFS SCSI layouts SHOULD NOT be used.
When security is available for a storage protocol, it is generally at a different granularity and with a different notion of identity than NFSv4 (e.g., NFSv4 controls access by principals to files, iSCSI controls initiator access to volumes). The responsibility for enforcing appropriate correspondences between these layers is shared by the metadata server and the pNFS client as described in Section 2.1. As with the issues in the first paragraph of this section, in environments where the security requirements are such that considerable client-side participation in enforcing security restrictions is not acceptable, pNFS SCSI layouts SHOULD NOT be used.
To return to the subject of virtualized block for a bit, let me first state my conclusions.
- Virtualized block is a good thing to implement. This is particularly so now given the difficulties with RPC-RDMA (spec and implementation). It would be great to leverage SCSI's ability to do DMA and use it in a file context.
- The primary advantage of virtualized block is that it enables much simpler layouts for the client to deal with; a client can receive a simple layout for the entire file simplifying his layout management task. In addition he should not have to deal with separate read and write layouts for supporting snapshots/clones.
- It would be helpful if the client knew he had a simpler layout management task. This would require either new information in the device descriptions or a separate layout type for virtualized block.
- An important advantage of virtualized block is in the area of safety (as opposed to security). With virtualized block, a client-side arithmetic error cannot wind up destroying a critical piece of fs metadata. Any arithmetic error is most likely to affect the file in question, making the issues pretty much like those in doing pNFS file.
- Given that SCSI knows only about clients (and not principals), I don't see any way that we can totally avoid trusting the clients to some degree.
- It might be possible to tell clients that layouts to particular devices may only be used by particular principals (or principals in a particular group). That would mean that enforcement, while still part of the client, could be done in a smaller set of places. This would also require a new layout type.