Benny,

The only point I was trying to make is that the spec should refer only
to what the server should and should not do, and mutual exclusion of
locks and outstanding layouts with conflicting ranges is probably the
simplest way to state that. As for enforcement, for objects it can be
done at whatever granularity you want, even on individual objects. A
range credential is still being contemplated by the working group
(prompted by requests from at least two companies, Panasas being one of
them, AFAIR).

For SCSI block storage enforcement is probably impractical, and the same
holds for file unless file follows the ObS model in this case (as David
Black's note appears to indicate).

Julo

Benny Halevy <bhalevy <at> panasas.com>
07/07/2005 09:57

To: Julian Satran/Haifa/IBM <at> IBMIL, nfsv4 <at> ietf.org
cc:
Subject: Re: [nfsv4] Another pNFS issue

On July 7 2005 Julian Satran wrote:
>
>
> nfsv4-bounces <at> ietf.org wrote on 06/07/2005 16:54:21:
>
> > On July 5 2005 Dave Noveck wrote:
> > >>Another issue that I brought up at the end of our CITI pNFS meeting
> > >>and we ran out of time to talk about is the interaction of the layout
> > >>retrieval and locks.
> > >
> > >>The question is: When using mandatory locking, should a server return
> > >>a layout for a byte-range to a client that conflicts with a byte-range
> > >>lock owned by another client?
> > >
> > >
> > > I think it is important to recognize when discussing this issue that
> > > mandatory byte-range locking in v4.0 is optional. It is possible to
> > > make optional features mandatory in a minor version, so we could make
> > > mandatory locking mandatory in v4.1, but I haven't heard of a proposal
> > > to do that.
> > >
> > >
> > >>I would like to say that layouts and locks are independent, BUT,
> > >>certain layout types, e.g., block based, cannot support mandatory
> > >>locks without this interaction.
> > >
> > >
> > > Correct, but I would say that this interaction is an
> > > implementation-specific one. Because it cannot do anything else, the
> > > block-based layout type, if mandatory locking is implemented, must have
> > > this interaction, but that doesn't mean that other forms of layout need
> > > to implement the same interaction.
> >
> > I agree, the server must recall the layout if letting the
> > client use it may conflict with any guarantee the server is
> > making to other clients. This may be true for mandatory
> > locks, share reservations, and (data) delegations.
> >
> > >
> > >
> > >>I'm assuming with our support for minor versioning
> > >>rules, it would not be ok to break mandatory locking in the protocol.
> > >>Example:
> > >>1. Client1 locks the range 10-100 in file a.
> > >>2. Client2 requests the layout for bytes 0-20 in file a.
> > >
> > >
> > >>Cases:
> > >>NFSv4 file- and Object-based layouts:
> > >>a) The server could propagate the lock information to the data
> > >>servers to ensure correctness.
> > >
> > > It may. Garth's current draft talks about state propagation. I don't
> > > think it says anything specific to mandatory byte-range locking though.
> >
> > Agreed. This might work for pnfs over files.
> >
> > >
> > >
> > >>b) Another option would be that the server recalls all outstanding
> > >>layouts that conflict with the lock before granting the mandatory
> > >>byte-range lock. All future layout requests that conflict with the
> > >>lock are rejected.
> > >
> > > If it doesn't do a), it has to do b), if it supports mandatory
> > > byte-range locking at all.
> >
> > Agreed, this should be the default behavior.
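
As an illustration of option b), here is a minimal sketch of a metadata server that recalls conflicting layouts before granting a mandatory byte-range lock and then rejects later layout requests that conflict with it. Everything named here (MetadataServer, recall, the inclusive (first, last) byte tuples) is hypothetical and not taken from the draft, and the recall is assumed to complete synchronously, which a real server could not rely on.

```python
class MetadataServer:
    """Toy model of option b); not the draft's protocol machinery."""

    def __init__(self, recall):
        # recall(client, byte_range) is a hypothetical callback standing in
        # for a layout recall that the client honours before we continue.
        self.recall = recall
        self.layouts = []  # list of (client, (first_byte, last_byte))
        self.locks = []    # list of (client, (first_byte, last_byte))

    @staticmethod
    def overlaps(a, b):
        # Inclusive byte ranges overlap when neither ends before the other starts.
        return a[0] <= b[1] and b[0] <= a[1]

    def lock(self, client, byte_range):
        # Recall every other client's conflicting layout, then grant the lock.
        kept = []
        for holder, held in self.layouts:
            if holder != client and self.overlaps(held, byte_range):
                self.recall(holder, held)
            else:
                kept.append((holder, held))
        self.layouts = kept
        self.locks.append((client, byte_range))

    def layoutget(self, client, byte_range):
        # Reject a layout that conflicts with another client's lock.
        for holder, held in self.locks:
            if holder != client and self.overlaps(held, byte_range):
                return False
        self.layouts.append((client, byte_range))
        return True
```

With the example from the thread, a layout for bytes 0-20 already held by Client2 would be recalled when Client1 locks 10-100, and a repeated request for 0-20 would be refused while a request for 0-9 would still succeed.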
> >
> > >
> > >
> > >>c) Another option is that we state that it is ok for pNFS supported
> > >>file systems to not support mandatory locks.
> > >
> > > As far as byte-range locks, it is OK not to support mandatory locks. I
> > > think the problem that you raise can also occur for share reservations,
> > > which are not optional. If I open a file DENY=ALL, then I have to
> > > either propagate the fact there is now a restriction on the IO's that
> > > can be done by other clients or recall the layouts and not give out any
> > > new ones. Note that since it is a restriction, Garth's latest document
> > > says that for the file-based layout this propagation must happen
> > > immediately, and if I can't do that, I cannot give out layouts that
> > > make it impossible to implement the protocol.
> > >
> > >
> > >>Block-based and other layouts with the inability to propagate lock
> > >>information to the servers:
> > >>a) Options b) or c) from above are the only options that I can see.
> >
> > I see another possible option: send a callback to all
> > clients holding layouts when the first mandatory lock
> > request has been received asking the layout holding clients
> > to lock (at the metadata server) every byte range they are
> > accessing while it is being accessed over whatever storage
> > access protocol they are using. This will provide the same
> > locking semantics of I/Os that go through the server, just
> > without moving any data through. When all locks are
> > released (possibly after some delay for hysteresis) the
> > server can send a callback to all layout holding clients to
> > return to normal operation mode.
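
The client side of this callback scheme might look roughly like the sketch below. It is only an illustration under assumptions: LayoutClient, on_mode_callback, io_range and the mds_lock/mds_unlock functions are hypothetical names, and nothing here enforces the behaviour; as discussed in the thread, the client is trusted to follow it.

```python
from contextlib import contextmanager


class LayoutClient:
    """Client holding a layout; all names here are hypothetical."""

    def __init__(self, mds_lock, mds_unlock):
        # mds_lock/mds_unlock(offset, length) stand in for byte-range lock
        # and unlock requests sent to the metadata server.
        self.mds_lock = mds_lock
        self.mds_unlock = mds_unlock
        self.locked_mode = False

    def on_mode_callback(self, locked_mode):
        # Server callback: True when the first mandatory lock appears,
        # False when all locks are gone and normal operation resumes.
        self.locked_mode = locked_mode

    @contextmanager
    def io_range(self, offset, length):
        # Remember the mode at entry so the unlock always matches the lock,
        # even if a callback flips the mode while the I/O is in flight.
        took_lock = self.locked_mode
        if took_lock:
            self.mds_lock(offset, length)
        try:
            yield  # direct I/O to storage happens inside the with-block
        finally:
            if took_lock:
                self.mds_unlock(offset, length)
```

In normal mode io_range adds no traffic to the metadata server; in locked mode each direct I/O is bracketed by a lock and unlock of the same range, so no data moves through the server.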
> >
>
> That would be a rule hard to enforce.

Are you going to enforce layout_return by fencing clients
off of storage? I'm not sure this can be done efficiently
for block storage and I'm pretty sure we're not going to do
that for objects (e.g. by changing the capability keys on
all objects in the layout) after layout return, for
performance reasons.

The bottom line is that we trust the client not to use the
layout after layout return rather than enforce it. Trusting
it to issue lock requests before doing I/O is at the same
level IMHO.

> I think that a simple rule stating that when mandatory locks are used a
> server must not give out layouts that conflict with the range locks nor
> give out locks that conflict with layouts available to clients would be
> enough for correctness and leave the implementer some choices (depending
> on how much state it maintains etc.).

I'm OK with going this way initially to cut down complexity
unless users expect scalable performance when using
mandatory byte-range locks to synchronize access to shared
files. Since applications like MPI-IO or some databases
have their own synchronization and distributed cache
coherency mechanisms I don't think this will be a
requirement in the short term.
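
A minimal sketch of the simple rule, assuming the server keeps per-file tables of granted locks and layouts keyed by client. ByteRange, FileState and the may_grant_* helpers are hypothetical names; how a server resolves a refusal (recalling layouts, recalling locks, or just rejecting the request) is deliberately left out, since that is the implementer's choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRange:
    offset: int
    length: int

    @property
    def end(self):
        # Exclusive end offset of the range.
        return self.offset + self.length


def ranges_conflict(a, b):
    # Two half-open ranges conflict when each starts before the other ends.
    return a.offset < b.end and b.offset < a.end


class FileState:
    """Per-file state on the server; hypothetical structure."""

    def __init__(self):
        self.locks = {}    # client -> list of ByteRange (mandatory locks)
        self.layouts = {}  # client -> list of ByteRange (outstanding layouts)

    @staticmethod
    def _conflicts(table, client, rng):
        # Only ranges held by *other* clients count as conflicts.
        return any(other != client and ranges_conflict(rng, held)
                   for other, ranges in table.items() for held in ranges)

    def may_grant_layout(self, client, rng):
        return not self._conflicts(self.locks, client, rng)

    def may_grant_lock(self, client, rng):
        return not self._conflicts(self.layouts, client, rng)
```

Reading the earlier example as inclusive byte numbers, Client1's lock on 10-100 is ByteRange(10, 91) and Client2's request for 0-20 is ByteRange(0, 21); the two overlap on bytes 10-20, so may_grant_layout refuses the request.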
>
> Some examples might then clarify choices (like recalling layouts,
> recalling locks etc.)
>
> > >>Which one do we want to support?
> > >
> > > If we are talking about share reservations as well as mandatory
> > > byte-range locks, then c) is not an option, leaving b) as the only
> > > choice for block-based layouts.
> >
> > This is true also for T10 OSD based storage which does not
> > support byte range locking.
> >
> > >
> > >
> > >>Even if you think it should be underlying file system dependent, I
> > >>think we should have a statement in the draft.
> > >
> > >
> > > What kind of a statement are you looking for? As it is, I think you
> > > have clearly laid out the case that when block-based layouts are used,
> > > you must have the interaction you described, while there is plenty of
> > > discussion for the file-based layouts of the rules for state
> > > propagation, when you have the option to do that.
> > >
> > > Is there some specific text that you'd like added that would make this
> > > all clearer? I think one of the more difficult issues that we have is
> > > distinguishing, as we explain how things work, what is a specific
> > > protocol obligation and what are the typical implementation choices
> > > that servers will make. We have to have some of the latter or else the
> > > document will just not be understandable by anybody who hasn't been
> > > involved in all this from the start. The problem is that it is very
> > > easy to assume those typical choices are the only ones that you can
> > > make. As we go forward, this issue will come up again and again.
> > >
> > > -----Original Message-----
> > > From: Dean Hildebrand [mailto:dhildebz <at> eecs.umich.edu]
> > > Sent: Tuesday, July 05, 2005 11:26 AM
> > > To: nfsv4 <at> ietf.org
> > > Subject: [nfsv4] Another pNFS issue
> > >
> > >
> > > Another issue that I brought up at the end of our CITI pNFS meeting
> > > and we ran out of time to talk about is the interaction of the layout
> > > retrieval and locks.
> > >
> > > The question is: When using mandatory locking, should a server return
> > > a layout for a byte-range to a client that conflicts with a byte-range
> > > lock owned by another client?
> > >
> > > I would like to say that layouts and locks are independent, BUT,
> > > certain layout types, e.g., block based, cannot support mandatory
> > > locks without this interaction. I'm assuming with our support for
> > > minor versioning rules, it would not be ok to break mandatory locking
> > > in the protocol.
> > > Example:
> > > 1. Client1 locks the range 10-100 in file a.
> > > 2. Client2 requests the layout for bytes 0-20 in file a.
> > >
> > > Cases:
> > > NFSv4 file- and Object-based layouts:
> > > a) The server could propagate the lock information to the data
> > > servers to ensure correctness.
> > > b) Another option would be that the server recalls all outstanding
> > > layouts that conflict with the lock before granting the mandatory
> > > byte-range lock. All future layout requests that conflict with the
> > > lock are rejected.
> > > c) Another option is that we state that it is ok for pNFS supported
> > > file systems to not support mandatory locks.
> > >
> > > Block-based and other layouts with the inability to propagate lock
> > > information to the servers:
> > > a) Options b) or c) from above are the only options that I can see.
> > > Which one do we want to support?
> > >
> > > Even if you think it should be underlying file system dependent, I
> > > think we should have a statement in the draft.
> > >
> > > Dean
> > >
> > >
> > >
> > > On Mon, 20 Jun 2005, Garth Goodson wrote:
> > >
> > >
> > >>There are a number of open issues and changes that were suggested for
> > >>the pnfs draft ID (draft-welch-pnfs-ops-02.txt).
> > >>
> > >>Please post any feedback or opinions (or feel free to add more).
> > >>
> > >>Thanks,
> > >>-Garth
> > >>
> > >>
> > >>3.1.1 Device IDs
> > >>
> > >>Proposal: device ids are valid only while a server is up, remappings
> > >>while server is up must use a different device id. (I prefer this)
> > >>
> > >>Alternative: device ids are attached to leases and may be timed-out
> > >>(probably also need to be recallable or invalidated)
> > >>
> > >>3.1.2 Aggregation Schemes
> > >>
> > >>Proposal: aggregation scheme (e.g., striping is part of opaque
> > >>layout-type defined structure)
> > >>
> > >>Alternative: make a general striping aggregation that is outside of
> > >>the opaque structure; push striping up a level... (it references
> > >>layout-type opaque specific structures which contain the devices)
> > >>
> > >>3.2.2 Operation sequencing
> > >>
> > >>Issue: Race condition exists between LAYOUTGET/LAYOUTRECALL passing
> > >>each other on wire. Sessions does not solve this since they are on
> > >>different channels. May require a seq-id to be returned in layout ops.
> > >>
> > >>3.3.1 Identifying layouts
> > >>
> > >>Proposal: layouts are identified by <ClientID, FH, offset, length>
> > >>
> > >>Alternative: may need to distinguish between read vs. read/write
> > >>layout types, e.g., block/obj layouts may have separate read vs.
> > >>read/write layouts and will want to handle them separately. Propose
> > >>adding iomode to layout identification (and allowing recall to specify
> > >>the specific mode or any mode).
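
The proposed identification with iomode added could be modelled as in the sketch below. The field names, the IoMode values and the recall_matches helper are assumptions for illustration, not definitions from the draft; passing iomode=None stands for "any mode".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IoMode(Enum):
    # Hypothetical encoding of the two modes discussed above.
    READ = 1
    RW = 2


@dataclass(frozen=True)
class LayoutId:
    """<ClientID, FH, offset, length> extended with iomode."""
    client_id: int
    fh: bytes
    offset: int
    length: int
    iomode: IoMode


def recall_matches(layout, fh, offset, length, iomode: Optional[IoMode] = None):
    # A recall applies to a layout on the same file whose byte range
    # overlaps the recalled range; iomode=None means "any mode".
    same_file = layout.fh == fh
    overlap = (layout.offset < offset + length
               and offset < layout.offset + layout.length)
    mode_ok = iomode is None or layout.iomode == iomode
    return same_file and overlap and mode_ok
```

Because the dataclass is frozen, a LayoutId is hashable and can serve directly as a dictionary key, so read and read/write layouts over the same range are tracked as distinct entries.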
> > >>
> > >>3.3.3 Copy-on-write
> > >>
> > >>Same as discussion for 3.3.1
> > >>
> > >>3.4 Recalling a LAYOUT
> > >>
> > >>Addition: Add a recall to recall all layouts pertaining to a specific
> > >>fsid.
> > >>
> > >>Issue: Long callback recalls (if client has dirty data that needs to
> > >>be flushed). Mostly wording on how long client should have to write
> > >>data while holding the layout. It can always revert to sending the
> > >>data through the metadata server if I/Os to data servers fail (due to
> > >>server revoking layout).
> > >>
> > >>3.5 Committing a layout
> > >>
> > >>Addition: add mtime/atime hints to LAYOUTCOMMIT (server can use them
> > >>or not based on current state -- e.g., server should not allow time to
> > >>go backwards and should not set the mtime if the mtime is already
> > >>higher than that being set). SETATTR used to specify an exact time and
> > >>is constrained by regular V4 semantics. The main difference between
> > >>the time in LAYOUTCOMMIT and SETATTR is that the times set by SETATTR
> > >>are mandatory vs. hints.
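
One way a server might treat the mtime hint, as a sketch only: the hint is applied when it moves time forward and ignored otherwise. The function name and the use of plain numeric timestamps are assumptions, and a server would also be free to ignore the hint entirely.

```python
def apply_mtime_hint(current_mtime, hint):
    """Return the mtime to store after a LAYOUTCOMMIT carrying an optional hint.

    Hypothetical helper: timestamps are plain numbers (e.g. seconds).
    """
    if hint is None:
        # No hint supplied: leave the stored mtime alone.
        return current_mtime
    # Never let time go backwards: keep the stored value if it is already higher.
    return max(current_mtime, hint)
```

By contrast, a SETATTR with an exact time would simply store the supplied value, subject to the regular V4 rules.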
> > >>
> > >>3.5.1 LAYOUTCOMMIT and EOF
> > >>
> > >>Change: instead of specifying newEOF with a flag (depending on whether
> > >>client thinks it is setting a new EOF), client will specify the last
> > >>byte to which it wrote.
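
Under this change the server, not the client, decides whether the file grew. A minimal sketch of that decision, assuming the last-byte value is a zero-based offset and that the server only ever extends the file on commit (both are assumptions, and new_file_size is a hypothetical name):

```python
def new_file_size(current_size, last_byte_written):
    """Size the server would record after a LAYOUTCOMMIT.

    last_byte_written is assumed to be the zero-based offset of the last
    byte the client wrote, or None if it wrote nothing.
    """
    if last_byte_written is None:
        return current_size
    # A write ending at offset N implies a size of at least N + 1.
    return max(current_size, last_byte_written + 1)
```

The client no longer needs to guess whether it is setting a new EOF; two clients committing writes to different regions both report what they wrote and the larger extent wins.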
> > >>
> > >>5.1 File Striping and Data Access
> > >>
> > >>Change: simplify striping layout -- have enum for SPARSE vs. DENSE
> > >>layout instead of skip and start offset
> > >>
> > >>Issue: think about what error gets returned if a client performs a
> > >>non(READ/WRITE/PUTFH/COMMIT) at a data server; issue: this may be a
> > >>problem if a regular nfsv4 data server is used as it has no way to
> > >>differentiate accesses.
> > >>
> > >>5.2 Global Stateids
> > >>
> > >>Issue: does not provide for unmodified NFSv4 data servers. More
> > >>thinking must be done if unmodified V4 servers are to be used as data
> > >>servers. See section: 5.7 for discussion.
> > >>
> > >>5.3 I/O mode
> > >>
> > >>Change: Don't restrict I/O mode to be RW. It may be useful to have
> > >>read-only replicas to which a client can be directed if the iomode is
> > >>READ.
> > >>
> > >>5.4.1 Lock State Propagation
> > >>
> > >>Issue: can seq ids in stateids be ignored on the data servers (what
> > >>about if sessions are used?)
> > >>
> > >>5.4.3 Access State Propagation
> > >>
> > >>Issue: NFSv4 spec says that READs/WRITEs do not require same principal
> > >>as OPEN. This opens a security hole, but some implementations depend
> > >>on it. pNFS should probably go with the spec on this and not change
> > >>the semantics.
> > >>
> > >>6.3 pnfs_devaddr4
> > >>
> > >>Change: switch on layouttype4 instead of devaddrtypes4
> > >>
> > >>Change?: add disk signature to list of types
> > >>
> > >>7 pnfs File Attributes
> > >>
> > >>Add: PREFERRED_ALIGNMENT, PREFERRED_BLOCKSIZE as FSID level attributes
> > >>
> > >>9.2 LAYOUTCOMMIT
> > >>
> > >>Add: mtime/atime time attribute hints to args
> > >>
> > >>Change: neweof in result as per Object Internet Draft. Basically,
> > >>have a specific structure for each layout type that is the
> > >>layoutcommit layout (vs. the layout received by layout get). This
> > >>allows extra opaque data to be sent on layoutcommit.
> > >>
> > >>General:
> > >>
> > >>IANA - think about whether additional layouttypes go through IANA or
> > >>specification process.
> > >>
> > >>_______________________________________________
> > >>nfsv4 mailing list
> > >>nfsv4 <at> ietf.org
> > >>https://www1.ietf.org/mailman/listinfo/nfsv4
> > >>