review of draft-ietf-nfsv4-scsi-layout-04
2015-11-24 10:29:15 GMT
- 2.1: Background and Architecture
2.4.1: Layout Requests and Extent Lists
- RFC5663 presented a quite dubious explanation for REQUIRING a very strict concurrency policy.
- draft-ietf-nfsv4-scsi-layout-04 presented a different, even more dubious, explanation for requiring a somewhat-less-strict concurrency policy, but left the strict policy REQUIREMENT in place.
- There is a possible justification for an even-less-strict concurrency policy in 2.4.2: Layout Commits. Unfortunately, the concurrency policy, while needed, doesn't fix the basic problem in 2.4.2: Layout Commits.
- Given how long this concurrency policy has been in the spec, we may want to keep it in effect for a while, until we are sure there are not other problems lurking that have been (partially or totally) covered up by this concurrency policy. Nevertheless, I can't see keeping this as REQUIRED.
- It isn't clear what the relationship is between the "SHOULD" and the lower-case "is acceptable". It seems to say you don't have to bother about the "SHOULD" under certain circumstances, but if so, it's not clear why the "SHOULD" is there.
- Whether something is a "legacy" file system has no clear definition.
- It seems to me that this goes back to a lack of clarity in the previous sentences. If a file system is using 1K fragments, then that is its effective allocation granularity, and that should be in layout_blksize, even though the file system implementer might say that the block size is 8K.
BTW, one could help the situation with regard to 2.5: Extents are permissions, by ending the second paragraph just before the text "and writable extents". That eliminates the problem cited above, which seems OK, as the restriction in question never had a clear justification. I think we should consider removing it (and allowing the DS to write a single physical block on its own, without using read-modify-write).
It starts out with "logical regions of the file". That's OK.
- Then you get to "physical locations on a volume". In the next sentence it is a "location on the logical volume". In the sentence after that, it is a "logical offset".
- Eventually this is translated into an "offset" into an LU (i.e. a logical unit). That offset is the thing that most people would consider "physical", since it is sent over SCSI.
The basic problem here is that "readily available" has no clear definition (which I think is what is meant here), so there is no point in talking about it in a specification which needs to provide information to allow interoperability. If the meaning here is "the server can do whatever it chooses and the client has to live with it", then that is what should be said.

If loga_minlength is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Readily available is subjective, and depends on the layout type and the pNFS server implementation.
Note that RFC5663 compounds the difficulty by saying "For block layout servers, readily available SHOULD be interpreted". Given the RFC2119 definition of "SHOULD", it is not clear how a server or client could be aware of the consequences of interpreting this phrase differently than indicated.

If loga_minlength is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has readily available. As this term has no generally accepted definition, it is up to the layout type specification to specify any appropriate REQUIREMENTS or RECOMMENDATIONS.
According to [RFC5661], if the minimum requested size, loga_minlength, is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Given the lack of a clear definition of this phrase, in the context of the SCSI layout type, when loga_minlength is zero, the metadata server SHOULD:
- when processing requests for readable layouts, return all such extents, even if some extents are in the PNFS_SCSI_NONE_DATA state.
- when processing requests for writable layouts, return those extents which can be granted in the PNFS_SCSI_READ_WRITE_DATA state.
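The proposed SHOULD behavior above could be sketched as a server-side selection helper. The extent state names are from the draft; the data structures and the function itself are purely illustrative, not part of any specification:

```python
# Illustrative sketch of the proposed loga_minlength == 0 behavior.
# Extent states follow the draft; everything else is hypothetical.

READ_WRITE_DATA = "PNFS_SCSI_READ_WRITE_DATA"
READ_DATA = "PNFS_SCSI_READ_DATA"
NONE_DATA = "PNFS_SCSI_NONE_DATA"

def select_extents(extents, iomode_write):
    """Choose extents to return when loga_minlength is zero.

    extents: list of (offset, length, state) tuples already known
    to the metadata server for the requested range.
    """
    if iomode_write:
        # Writable request: return only extents that can be granted
        # in the PNFS_SCSI_READ_WRITE_DATA state.
        return [e for e in extents if e[2] == READ_WRITE_DATA]
    # Readable request: return everything, even NONE_DATA extents.
    return list(extents)

extents = [(0, 4096, READ_DATA), (4096, 4096, NONE_DATA),
           (8192, 4096, READ_WRITE_DATA)]
assert len(select_extents(extents, iomode_write=False)) == 3
assert select_extents(extents, iomode_write=True) == [(8192, 4096, READ_WRITE_DATA)]
```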
- It doesn't seem to me to be right to predicate a "MUST" on the server's state of belief.
- The reference to "belief" means we have to deal with the issue in which the client's belief is incorrect. As this is written, there is a possibility that a client which is wrong in thinking it is extending the file might corrupt the file in doing the zeroing that this section tells it to do.
Since the block in question is in state PNFS_SCSI_INVALID_DATA, byte ranges not written should be filled with zeros. This applies even if it appears that the area being written is beyond what the client believes to be the end of file.
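The zero-filling rule above could be sketched as follows. The block size and buffer handling are illustrative assumptions, not taken from the draft:

```python
BLOCK_SIZE = 4096  # illustrative layout block size, not from the draft

def fill_invalid_data_block(block_offset, data, data_offset):
    """Build the full block written for a PNFS_SCSI_INVALID_DATA block.

    Byte ranges of the block not covered by the client's write are
    filled with zeros, before and/or after the modified area, even if
    the write appears to extend past the client's idea of end-of-file.
    """
    start = data_offset - block_offset  # offset of data within the block
    assert 0 <= start and start + len(data) <= BLOCK_SIZE
    return b"\0" * start + data + b"\0" * (BLOCK_SIZE - start - len(data))

blk = fill_invalid_data_block(0, b"hello", 100)
assert len(blk) == BLOCK_SIZE
assert blk[100:105] == b"hello"
assert blk[:100] == b"\0" * 100
```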
A LAYOUTRETURN operation represents an explicit release of resources by the client. This may be done in response to a CB_LAYOUTRECALL or before any recall, in order to avoid a future CB_LAYOUTRECALL. When the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return type, then the layoutreturn_file4 data structure specifies the region of the file layout that is no longer needed by the client. The client may return disjoint regions of the file by using multiple LAYOUTRETURN operations within a single COMPOUND operation.

The LAYOUTRETURN operation is done without any SCSI layout specific data. The opaque "lrf_body" field of the "layoutreturn_file4" data structure MUST have length zero.

2.4.3a: Layout Revocation
- Since what is being discussed is essentially fencing, this section needs to reference the fencing section.
- The phrase "or a delegation being recalled, or the client failing to return a layout in a timely manner" suggests that revocation can happen due to a recall alone, or when a client simply does not return a layout which has not been recalled. This needs to be fixed.
- The reference to lease expiration suggests that revocation happens upon lease expiration without any recall attempt. Although this is not reflected in my proposed text, it would be better/cleaner if we let expiration trigger recall and did not revoke immediately.
Layouts may be unilaterally revoked by the server, due to the client's lease time expiring or the client failing to return a layout which has been recalled in a timely manner. For the SCSI layout type this is accomplished by fencing off the client from access to storage, as described in Section 2.4.8. When this is done, it is necessary that all I/Os issued by the fenced-off client be rejected by the storage. This includes any in-flight I/Os that the client issued before the layout was revoked.

Note that the granularity of this operation can only be at the host/LU level. Thus, if one of a client's layouts is unilaterally revoked by the server, it will effectively render useless *all* of the client's layouts for files located on the storage units comprising the logical volume. This may render useless the client's layouts for files in other file systems. See the fencing section (Section 2.4.8) for a discussion of recovery from fencing.
These statements are true, but they are also true of NFSv4 file semantics.

Block/volume class storage devices are not required to perform read and write operations atomically. Overlapping concurrent read and write operations to the same data may cause the read to return a mixture of before-write and after-write data.
This doesn't seem to make sense since, even if the storage devices did or could provide atomicity, striping would undo it. An I/O split between two stripes would not be atomic, even if each constituent I/O were atomic.
Overlapping write operations can be worse, as the result could be a mixture of data from the two write operations; data corruption can occur if the underlying storage is striped and the operations complete in different orders on different stripes. When there are multiple clients who wish to access the same data, a pNFS server can avoid these conflicts by implementing a concurrency control policy of single writer XOR multiple readers. This policy MUST be implemented when storage devices do not provide atomicity for concurrent read/write and write/write operations to the same data.
- Atomicity of reads and writes is not required by NFSv4 or provided by block devices. If such atomicity is required for pNFS, there needs to be a reason stated as to why, which I can't find.
- If you are writing overlapping areas at the same time and you consider the result "data corruption", then that corruption is due to the application doing something that doesn't make sense.
- If applications are indeed doing this, I can't see how atomicity would make the situation any better.
- NFSv4 has existing concurrency control mechanisms, such as share reservations and delegations that can prevent multiple clients from interfering with one another.
- I don't see the value of pNFS SCSI clients doing such read-modify-write cycles, especially when they can do I/O that is not block-aligned via the MDS. I always thought that was the intention of RFC5663, but I don't know what people actually implemented. In any case, given that such operations are very infrequent, I don't think it is right to suggest that pNFS clients should be doing such things. The fact that it would require the MDS to effect a concurrency control policy strengthens the argument that this is something that should be done via the MDS.
- If you allowed writes to physical blocks that were not layout-block-aligned (See 2.1: Background and Architecture), you could do this safely.
- SCSI COMPARE AND WRITE could allow you to safely effect a write within a physical block.
- Even if you allowed read-modify-write to do non-block-aligned IO, the MDS would only have to prevent multiple writers. One writer plus multiple readers would be OK.
SCSI storage devices do not provide byte granularity access and can only perform read and write operations atomically on a block granularity, and thus would require read-modify-write cycles to write data smaller than the block size or which is otherwise not block-aligned. Given the possibility that multiple clients performing such read-modify-write cycles might interfere, clients should avoid such updates and instead direct non-block-aligned writes to the MDS, which can perform them safely. The only exceptions are:
- When the SCSI operation COMPARE AND WRITE is available, allowing the client to do read-modify-write sequences safely.
- When the client has a write delegation for the file in question.
- When the client has an open denying write.
Note that the above does not apply to partial writes of blocks within areas in the PNFS_SCSI_INVALID_DATA state (see Section 2.4.2). In that case a read-modify-write is not done and the area to be written is supplemented by zeros, either before or after the area modified, or both. Unlike the case above, this is safe because the server will only give a single client a layout for a block in this state (see Section 2.4.5).
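The write-path decision in the proposed text above (non-block-aligned writes go to the MDS unless one of the exceptions applies, and PNFS_SCSI_INVALID_DATA areas are always safe) could be sketched like this; the function and its parameters are hypothetical:

```python
def direct_write_ok(aligned, compare_and_write, write_delegation,
                    open_denies_write, invalid_data_area):
    """Decide whether the client may write to storage directly.

    Mirrors the proposed rule: non-block-aligned writes go to the
    MDS unless one of the listed exceptions applies; partial writes
    into PNFS_SCSI_INVALID_DATA areas are always safe because only
    one client holds a layout for such a block.
    """
    if aligned or invalid_data_area:
        return True
    # Exceptions that make a read-modify-write cycle safe.
    return compare_and_write or write_delegation or open_denies_write

# Aligned writes and INVALID_DATA areas go direct; otherwise an
# exception is needed, else the write should be sent to the MDS.
assert direct_write_ok(True, False, False, False, False)
assert direct_write_ok(False, False, False, False, True)
assert direct_write_ok(False, True, False, False, False)
assert not direct_write_ok(False, False, False, False, False)
```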
Block/volume class storage devices are not required to perform read and write operations atomically. Overlapping operations to the same data may thus cause a read to return a mixture of before-write and after-write data. Additionally, further issues can arise if overlapping writes are done, the underlying storage is striped, and the operations complete in different orders on different stripes. When multiple clients have access to the same file at the same time, atomicity is not guaranteed by NFSv4, and the same applies when the SCSI layout type is being used. Client-side caching can exacerbate the negative consequences of this situation. Share reservations and delegations can be used as ways of preventing destructive consequences arising from this fact.
Servers MAY implement a concurrency policy preventing simultaneous write access by multiple clients to the same block, or preventing write access and read access to the same block from being granted to different clients. In the case of blocks in state PNFS_SCSI_INVALID_DATA, the server MUST prevent multiple clients from being given a layout for the same block, since multiple clients' partial-block updates might cause data to be lost (see Section 2.4.4a). Also, if the MDS is doing a partial-block write, it must not grant a write layout for that block, or must recall one previously given out, before reading the block in question.

The existing third paragraph needs to be moved to the end, since conflict with existing locks is the basic reason that layouts will be recalled or not granted. In addition, the first sentence of that paragraph needs to be rewritten as follows:

If a client makes a layout request that conflicts with an existing delegation, layout, or share reservation, the request will be rejected with the error NFS4ERR_LAYOUTTRYLATER.
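The MAY/MUST policy described above amounts to a per-block single-writer XOR multiple-readers check. A minimal sketch, with hypothetical data structures:

```python
def may_grant(existing_layouts, client_id, want_write):
    """Single-writer XOR multiple-readers check for one block.

    existing_layouts: dict mapping client id -> "read" or "write"
    for layouts already granted on the block.  Illustrative only.
    """
    others = {c: m for c, m in existing_layouts.items() if c != client_id}
    if want_write:
        return not others                       # a writer excludes everyone else
    return "write" not in others.values()       # readers exclude only a writer

assert may_grant({}, "A", want_write=True)                  # first writer OK
assert may_grant({"A": "read"}, "B", want_write=False)      # many readers OK
assert not may_grant({"A": "read"}, "B", want_write=True)   # writer vs reader
assert not may_grant({"A": "write"}, "B", want_write=False) # reader vs writer
```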
- It doesn't cover what is reported by supported_attrs. It probably should say the server must not support this attribute on a file system only supporting pNFS using the SCSI layout.
- Should mention that layout hints are primarily specified, not by SETATTR, but in the attributes specified with OPEN.
This section needs to be reworked in light of the fact that the MDS is going to be enforcing NFSv4 semantics, and that conflicts with different NFSv4 locks may have different retry strategies.
Here's a possible replacement:
The server may respond to LAYOUTGET with a variety of error statuses. These errors can convey transient conditions or more long-lived conditions that are not expected to be resolved soon.
- The error NFS4ERR_RECALLCONFLICT indicates that the server has recently issued a CB_LAYOUTRECALL to the requesting client, making it necessary for the client to respond to the recall before processing the layout request. A client can wait for that recall to be received and processed, or it can retry as for NFS4ERR_TRYLATER, as described below.
- The error NFS4ERR_TRYLATER is used to indicate that the server cannot immediately grant the layout to the client. This may be due to constraints on writable sharing of blocks by multiple clients or to a conflict with a recallable lock (e.g. a delegation). In either case, a reasonable approach for the client is to wait several milliseconds and retry the request. The client SHOULD track the number of retries, and if forward progress is not made, the client should abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the server.
- The error NFS4ERR_LAYOUTUNAVAILABLE may be returned by the server if layouts are not supported for the requested file or its containing file system. The server may also return this error code if the server is in the process of migrating the file from secondary storage, there is a conflicting lock that would prevent the layout from being granted, or for any other reason that causes the server to be unable to supply the layout. As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the client should abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the server.
It is expected that a client will not cache the file's layout-unavailable state forever. In particular, when the file is closed or opened by the client, issuing a new LAYOUTGET is appropriate.
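The retry/fallback behavior described for these errors could be sketched as follows. Plain exception messages stand in for the NFS4ERR_* statuses; the function names, retry count, and delay are illustrative assumptions, not part of any specification:

```python
import time

def get_layout_with_retry(layoutget, io_via_mds, max_retries=10, delay_ms=5):
    """Retry LAYOUTGET on transient errors, else fall back to the MDS.

    layoutget() returns a layout or raises an exception whose message
    matches one of the NFS4ERR_* statuses discussed above.  All of
    the plumbing here is illustrative.
    """
    for _ in range(max_retries):
        try:
            return layoutget()
        except Exception as e:
            status = str(e)
            if status in ("NFS4ERR_TRYLATER", "NFS4ERR_RECALLCONFLICT"):
                time.sleep(delay_ms / 1000.0)   # brief wait, then retry
                continue
            if status == "NFS4ERR_LAYOUTUNAVAILABLE":
                break                           # no point in retrying
            raise
    # Forward progress was not made: do READs and WRITEs via the server.
    return io_via_mds()

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Exception("NFS4ERR_TRYLATER")
    return "layout"

def unavailable():
    raise Exception("NFS4ERR_LAYOUTUNAVAILABLE")

assert get_layout_with_retry(flaky, lambda: "mds") == "layout"
assert get_layout_with_retry(unavailable, lambda: "mds") == "mds"
```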
- Need to replace "implemeting" by "implementing"
- Replace "indicated" by "indicates"
_______________________________________________
nfsv4 mailing list
nfsv4 <at> ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4