Boris Protopopov | 1 Jul 03:25 2013

Re: BP rewrite via snapshots

Hi, Matt, Jim,

Yes, this seems quite similar to what we looked at previously at Nexenta, which we informally called "intra-pool send/recv". We thought it was a bit simplistic and not quite as efficient as desired, because it moves all the data and all the metadata instead of only the potentially much smaller subset that actually needs to move.

We are currently working on a white paper that lays out our approach to BP rewrite. We do intend to implement our solution, and we will make it open source :) We would appreciate any help and advice from the community.

Our analysis indicates that (transparently) moving data that belongs to active filesystems but does not belong to snapshots is not a difficult problem. Such data is by definition referenced by a single tree, so there is no issue with having to track down many references or with transactionally updating multiple adjacent trees in the lineage.

With the approach described in this thread, there still seems to be an issue with blocks that are referenced by the active/mutable trees but that belong to (were born in) earlier snapshots. It would seem that those blocks could be copied with incremental send/recv, but they could not be deleted after copying (because they are still referenced elsewhere).

So these blocks would have to be dealt with explicitly, e.g. by "divorcing" them from the earlier snapshots -
resulting in additional copies of data that used to be "inherited" within the snapshot lineage, which is undesirable.
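
To make the "inherited" case concrete: a block whose birth txg is at or before the most recent snapshot's txg is still referenced by that snapshot, so a mover may copy it but must not free the old copy; this is the same comparison ZFS already uses when deciding whether a freed block goes onto a deadlist rather than being freed outright. A minimal sketch (the helper name is hypothetical):

/*
 * Hypothetical helper, illustrative only: a block born at or before the
 * dataset's most recent snapshot txg is shared with ("inherited from")
 * that snapshot, so the old copy cannot simply be freed after it is moved.
 */
static boolean_t
block_inherited_from_snapshot(uint64_t blk_birth_txg, uint64_t prev_snap_txg)
{
        return (blk_birth_txg <= prev_snap_txg);
}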

Our current focus is on improving the efficiency of the overall approach, and that is where things become challenging :) but still tractable.

Having been buried for a while with the "ongoing activities", I have not been able to focus on the BP rewrite paper recently :) but I promise a decent white paper, with an analysis of several approaches (including the one discussed in this thread), in the next couple of weeks or so :)

Best regards, Boris.

Typos courtesy of my iPhone

On Jun 28, 2013, at 11:41 PM, "Matthew Ahrens" <mahrens <at> delphix.com> wrote:

I think this is a decent start, but there are a vast number of complications with what you propose.  If you want to change the way snapshots are linked together (with the ds_{prev,next}_snap_obj fields), then you will need to consider all the ways that these interact.  E.g. when I delete a snapshot, its deadlist is merged with the next snapshot's deadlist.  You will need to look at the dsl_dataset_phys_t, dsl_dir_phys_t, and the deadlists, and understand how swapping out the snapshot could affect each of those fields.

If you want to do something quick and dirty, without really understanding how the DSL works, and are willing to accept several constraints in the process (e.g. some operations are not allowed while the transition is in progress; there must be plenty of free space), I would advise doing something similar to:

for each filesystem: 
   for each snapshot:
      zfs send -i @prevsnap pool/fs@snap | zfs recv pool/newfs
      zfs destroy pool/fs@prevsnap
   zfs snapshot pool/fs@lastsnap
   zfs send -i @prevsnap pool/fs@lastsnap | zfs recv pool/newfs
   zfs destroy -R pool/fs

Starting from this, you can then add support for clones, restarting after a reboot, etc.

--matt


On Fri, Jun 28, 2013 at 7:03 PM, Jim Klimov <jimklimov <at> cos.ru> wrote:
On 2013-06-29 00:11, Matthew Ahrens wrote:
On Fri, Jun 28, 2013 at 2:48 PM, Jim Klimov <jimklimov <at> cos.ru> wrote:

    While pondering about different ideas buzzing around :) ,
    I thought of BP rewrite, again. And I felt I saw a solution.

Cool, it's always great to see people thinking about how ZFS could be
improved.  Is this something you're planning to implement?

Unfortunately, I am unlikely to implement anything of this scale due
to life happening. So I do the next best thing - try to brainstorm ;)


    Why not add a capability to reallocate an existing snapshot,
    with its logically fixed starting and ending points, and
    then, when the send_receive-like rewrite process is complete,
    just atomically replace the references and (defer-)release
    the old instance of the snapshot - just like any COW update
    of an existing block in some file?


Can you explain this in more detail?  What "references" exactly will you
be replacing?  Some fields in the dsl_dataset_t, or the blkptr_t?

I believe, the proper ones to replace would be DSL fields.
At least, according to the ZFS OnDisk Format PDF (page 30)
there is a sort of linked list with the active dataset
referencing the newest snapshot, that one referencing the
second newest and so on. The basic idea in my proposal
is to replace items in these chains one by one as these
snapshots are reallocated according to the user's new
requirements; the logical contents (specific userdata
blocks located at specific logical offsets in the files
or volumes) of the old and new "instance" of the snapshot
should be identical, so I think this substitution should
not corrupt the visible userdata when the old instance of
the snapshot is unlinked and ultimately removed.
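
As a very rough sketch of that substitution (a hypothetical helper; real code would also have to fix up deadlists, clone origins, ds_prev_snap_txg and the rest, as Matt points out above):

/*
 * Hypothetical sketch only: splice a rewritten snapshot dataset into the
 * prev/next chain in place of the old one, within a single syncing txg.
 * Deadlists, clone origins, used/unique accounting etc. are deliberately
 * ignored here; they are the hard part.
 */
static void
snapshot_splice_links(dsl_dataset_phys_t *prev, dsl_dataset_phys_t *next,
    uint64_t old_snap_obj, uint64_t new_snap_obj)
{
        if (prev != NULL && prev->ds_next_snap_obj == old_snap_obj)
                prev->ds_next_snap_obj = new_snap_obj;
        if (next != NULL && next->ds_prev_snap_obj == old_snap_obj)
                next->ds_prev_snap_obj = new_snap_obj;
}

Here "next" stands for the next newer snapshot or the active dataset, whichever holds the ds_prev_snap_obj reference to the instance being replaced.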

Indeed, this is not so much a "BlockPointer Rewrite" as a
"Snapshot Rewrite", but hopefully it would have mostly the
same effect as people (and myself) expect from a BPRewrite.

Of course, for some tasks, such as evicting data from some
TLVDEV which we want to remove, this approach is overkill
(it would suffice to relocate only certain blocks, according
to the device number in their DVAs, for this example).
However, lacking *any* BPRewrite to do this "properly", having
a suboptimal tool now is better than having nothing for the
foreseeable future ;)
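
(As an aside, the per-block test for that narrower eviction case really is trivial -- a sketch using the existing DVA accessor macros from sys/spa.h; the hard part is finding and rewriting the references, not testing a block:)

/* Sketch: does this block pointer place any copy on the vdev being removed? */
static boolean_t
bp_on_evicted_vdev(const blkptr_t *bp, uint64_t evicted_vdev)
{
        int d;

        for (d = 0; d < BP_GET_NDVAS(bp); d++) {
                if (DVA_GET_VDEV(&bp->blk_dva[d]) == evicted_vdev)
                        return (B_TRUE);
        }
        return (B_FALSE);
}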



> What do you do with the blkptr_t's that you've send|recv'd but still
> exist in subsequent snapshots, the filesystem, or deadlists?

I must confess I don't understand this question completely,
which probably means that after all I didn't fully grasp all
the under-the-hood concepts of ZFS :(

Still, the way I see it, ZFS stores data in the leaves of a big tree,
with branches being the metadata, including the uberblocks and
root, the directories of datasets and of their components such as
the snapshots, and ultimately a root dnode in each dataset which
contains trees of metadata for each file's blocks stored in the
blkptr_t structures.

Of course, pieces of metadata internal to the snapshot (references
to newly allocated userdata) are recreated anew, as for any newly
made write operation.

I think your question implies that other (probably newer) snapshots
and the active dataset may also contain references to the same DVAs
that the older snapshot contained (i.e. due to some normal update
operations which replaced some metadata and retained other items).

If so, this would indeed be a complication, though likely not a fatal
one, if we can work around the problem by walking and similarly rewriting
the metadata as needed in the newer dependent datasets (snapshots, the
active dataset, clones...). At least for the newer snapshots, such
replacement could also be atomic (prepare new metadata, commit its
inclusion into the metadata tree), but for live data this might need a
more careful approach... unless this operation can be presented to the
system as indistinguishable from a normal file update, where the file's
logical blocks simply get new on-disk locations and metadata after TXG #X.

I hope this wandering in the woods does not sound too stupid, and that
the proposal is not stillborn from the start? ;)


//Jim Klimov



Boris Protopopov | 1 Jul 03:43 2013

Re: BP rewrite, some thoughts

Hi, Nico,

That is an interesting list of opinions. 

I think there are a few uses of BP rewrite that are quite compelling in the enterprise storage setting - e.g.
storage tiering and re-balancing. 

I also think that the central issue here is whether or not we believe that cow and immutable trees are so
valuable that it is not acceptable to overwrite anything in place (which would be done when leaving the
forwarding info if I understand correctly)

If we do not believe so :) then bp rewrite becomes simpler, yet other things may become quite complex to manage

Boris

Typos courtesy of my iPhone

On Jun 29, 2013, at 1:03 AM, "Nico Williams" <nico <at> cryptonector.com> wrote:

> ZFS properties most relevant to BP rewrite:
> 
> - There is no way to enumerate or even count references to any given
> block without first enumerating all paths to all blocks in the pool.
> 
> - Snapshots (and clones from them) cause these
> non-enumerable/countable references.
> 
> - Snapshots/clones are enumerable and countable without enumerating
> all paths to all blocks in a pool.
> 
> - It is possible to determine when no references to a block remain.
> 
> - There is a maximum number of references any one block in a pool
> might have: pools are finite, after all.
> 
>   Create a pool with one dataset and one small file in it, then
> create snapshots with abandon until the pool is full.  Clearly we can
> determine the maximum number of references any one block can have.
> Moreover, the number of extant references a live block can have is
> further bounded by the number of snapshots and clones that capture the
> block's birth txg.
> 
> Reasons not to bother with BP rewrite / reasons to want it:
> 
> - No need to evacuate devices: just use pools with enough redundancy
> that replacing and/or adding vdevs is sufficient for all operational
> needs.
> 
>   Yes, this is expensive, and many uses of ZFS fall outside the
> enterprise, and we should want all appropriate uses of ZFS to
> increase.
> 
>   In particular the ability to correct mistakes like adding a stripe
> instead of a mirror would be extremely useful to... those who make
> such mistakes.
> 
> - No need to defragment: just add storage.
> 
>   This too is expensive.
> 
> - No need to change encryption keys: just use strong crypto, with one
> key per-object and super-encrypt keys.  There's no need to rotate keys
> for revocation purposes: encryption in ZFS is a technique to protect
> against theft of storage devices, not against subversion of the host,
> nor against theft of a running host and its storage.
> 
>   Re-encryption is useful for algorithm agility -- one should want it
> to be possible, even if one should rarely or never need it.
> 
> Possible BP re-write design starting points:
> 
> - Change ZFS so that block references are enumerable and countable.
> (Not likely to work out well.)
> or
> - Use dedup for the BP rewrite process (see earlier discussions).
> or
> - Optimize dedup for BP rewrite by leaving forwarding information
> where the old BP data was (see earlier discussions and below).
> 
> The last requires enough free storage space to swing it, since one
> cannot know the number of references a block might have (just the
> maximum) and any old block cannot be considered free until the last
> possible reference to it has been re-written -- this limits the
> usefulness of this technique for defragmentation.  That is: a lot of
> free space is needed for defragmentation (but not more than 50% free
> space, plus overhead).  There may be a speed vs. space trade-off for
> defragmentation: the limited (and computable) maximum number of
> references that a block can have might allow a defragmentation process
> to stop defragmenting after re-writing so many blocks, then it can
> work on just finding (and re-writing) extant references to blocks
> re-written so far -- very slow, almost certainly way too slow.  The
> same applies to re-encryption.  But for vdev evacuation this technique
> is just fine since the free storage in the pool without the evacuated
> device must be sufficient, and swing storage combined with vdev
> evacuation BP re-write makes the other uses (defrag and
> re-encryption) feasible with only transient costs (the swing storage
> and the CPU and I/O to do the re-writes).
> 
> Some have argued that overwriting old versions of re-written blocks
> with forwarding information is bad because it makes it impossible to
> distinguish rewrites from corruption.  That is wrong though: when a
> block's checksum fails to match the blkptr_t's expected checksum, ZFS
> can a) attempt to interpret the bad block as a forwarding, and b)
> check metadata written by the BP re-writer (written in the forwarding
> and in a log) to verify that the data found in the block really is a
> forwarding.  E.g., the re-writer could take the block's DVA and store
> it along with the forwarding information, a nonce assigned to each
> snapshot/clone re-written (or in the process of being re-written), a
> checksum of all of that, and the old block location, and then also
> write the snapshot/clone names and nonces to a log.  To verify that a
> block contains a forwarding, just validate the checksum in the
> forwarding and then check a) that the DVA in the forwarding matches
> the DVA from the blkptr_t, b) that the nonce found in the forwarding
> matches the expected value from the BP re-write log for the named
> snapshot/clone, and c) that the birth txg of the block is captured in
> the snapshot/clone referenced in the forwarding information.
> Additional checks could be done to further verify that a block
> legitimately contains forwarding data (e.g., a Bloom filter of
> re-written block metadata).
> 
> IMO:
> 
> - The fact that the most pressing uses of BP re-write are for
> non-enterprise uses means there's little likelihood of an
> implementation being funded.
> 
> - The ability to BP re-write would be like dedup: a great showcase, a
> selling point, but not really all that valuable.  See preceding point.
> 
> - Only the last of the BP re-write designs discussed above is worth
> bothering with.
> 
> Nico
> --
> 
> 

Nico Williams | 1 Jul 09:05 2013

Re: BP rewrite, some thoughts

On Sun, Jun 30, 2013 at 8:43 PM, Boris Protopopov
<Boris.Protopopov <at> nexenta.com> wrote:
> That is an interesting list of opinions.
>
> I think there are a few uses of BP rewrite that are quite compelling in the enterprise storage setting - e.g.
storage tiering and re-balancing.

Ah, sure, good point.  My list of reasons for BP rewrite was not
intended to be exhaustive.  Feel free to add to it.

> I also think that the central issue here is whether or not we believe that cow and immutable trees are so
valuable that it is not acceptable to overwrite anything in place (which would be done when leaving the
forwarding info if I understand correctly)

The whole point of checksums is to detect corruption (and help correct
it).  Here we'd be "corrupting" a block to leave breadcrumbs as to
where to find the new version.  There can be partial updates (when
crashing), so BP re-write would have to leave behind an intent log of
re-writes, and would have to write new blocks, log, commit, then write
the forwardings.
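
A rough illustration of what such a forwarding record might carry; none of these names exist in ZFS (only dva_t and zio_cksum_t are existing types), so this is purely a sketch of the scheme described above:

/*
 * Purely illustrative -- not an existing or proposed on-disk format.
 * Written over the old block location only after the new copy and the
 * intent-log entry have been committed.
 */
typedef struct bprw_forward {
        uint64_t        bf_magic;         /* constant identifying a forwarding */
        dva_t           bf_old_dva;       /* DVA this record was written over */
        dva_t           bf_new_dva;       /* where the relocated data now lives */
        uint64_t        bf_rewrite_nonce; /* per-snapshot/clone rewrite nonce */
        zio_cksum_t     bf_cksum;         /* checksum over the fields above */
} bprw_forward_t;

Validation would then check bf_cksum, compare bf_old_dva against the DVA in the blkptr_t being followed, and match bf_rewrite_nonce against the rewrite log, along the lines sketched earlier in the thread.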

Also, I forgot to mention (though we've talked about it earlier), that
it'd help greatly if the DVAs in blkptr_ts were not hashed -- that way
blkptr_ts of blocks that contain blkptr_ts don't change when the
blkptr_ts in them get relocated.  But this is not critical -- not
having this means only that the BP re-writer has to do more work to
make it possible to handle forwarding information.

> If we do not believe so :) then bp rewrite becomes simpler, yet other things may become quite complex to manage

I do believe in the COW principle, and I don't think this scheme
breaks it; it's the ability to treat the overwriting of a live block as
"corruption, then attempt recovery by interpreting and validating it as
forwarding data" that makes it OK.

Nico
--

Jim Klimov | 1 Jul 11:34 2013

Re: BP rewrite via snapshots

On 2013-07-01 03:25, Boris Protopopov wrote:
> We are currently working on a white paper that lays out our approach
> to BP rewrite. We do intend to implement our solution, and we will
> make it open source :) We would appreciate any help and advice from
> the community.

Well, it is certainly good to hear that more people are still thinking
about this, and that possibly even more than thinking is going on ;)

> Our analysis indicates that (transparently) moving data that belongs to
> active filesystems but does not belong to snapshots is not a difficult
> problem. Such data by definition is referenced by a single tree, so
> there are no issue with having to track down many references and to
> transactionally update multiple adjacent trees in the lineage.

After reading the interesting and informative points and thoughts in this
thread, I wonder if the statement above is not too naive: namely, if
deduplication is used (though it is easily detected by a bit in blkptr_t),
it is possible that there is more than one reference to the block
in the DDT, even in an active ("live", as I mis-termed it earlier) dataset.

So while for unique blocks, and for deduped blocks with count==1, there
is indeed little work (though the DDT block pointer must be updated in
the latter case), for deduped data there is likely about as much work as
with snapshots and clones - even more, because instead of looking for
matches in some limited sub-tree (a family with one origin, or similar),
you'd have to search the whole pool. Bummer ;-\

I wonder how unbearably costly it would be to build and maintain,
even if just temporarily, a (possibly DDT-like) map of "who references
this user-data block, and where this reference is located"?..
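
Just to make the idea concrete, a (purely hypothetical) entry in such a
reverse map might look like the following; nothing of the sort exists in
ZFS today, and the question of its cost stands:

/*
 * Hypothetical back-reference entry, keyed by the referenced block's DVA;
 * illustrative only.  One entry per (dataset, object, blkid) holding a
 * block pointer to the user-data block in question.
 */
typedef struct backref_entry {
        dva_t           be_dva;      /* the user-data block being referenced */
        uint64_t        be_dsobj;    /* dataset (head/snapshot/clone) object */
        uint64_t        be_object;   /* object number within that dataset */
        uint64_t        be_blkid;    /* block id of the referencing blkptr */
} backref_entry_t;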

//Jim

Boris Protopopov | 1 Jul 18:42 2013

RE: BP rewrite, some thoughts

You see, Nico, 

checksums are great of course, but immutable block trees are somewhat orthogonal to that. The latter always
give you a consistent on-disk state, which turns out to be invaluable in many cases. Checksums cannot
quite do that.

Also, I believe it is important to be able to know that the block trees are self-consistent based on the
hashes, which can only be done if we do hash the DVAs (and can therefore assert that we point to exactly the
right blocks, as opposed to blocks with merely the same "contents", for instance).

Boris. 

-----Original Message-----
From: Nico Williams [mailto:nico <at> cryptonector.com] 
Sent: Monday, July 01, 2013 12:05 AM
To: zfs <at> lists.illumos.org
Subject: Re: [zfs] BP rewrite, some thoughts

On Sun, Jun 30, 2013 at 8:43 PM, Boris Protopopov <Boris.Protopopov <at> nexenta.com> wrote:
> That is an interesting list of opinions.
>
> I think there are a few uses of BP rewrite that are quite compelling in the enterprise storage setting - e.g.
storage tiering and re-balancing.

Ah, sure, good point.  My list of reasons for BP rewrite was not intended to be exhaustive.  Feel free to add to it.

> I also think that the central issue here is whether or not we believe 
> that cow and immutable trees are so valuable that it is not acceptable 
> to overwrite anything in place (which would be done when leaving the 
> forwarding info if I understand correctly)

The whole point of checksums is to detect corruption (and help correct it).  Here we'd be "corrupting" a
block to leave breadcrumbs as to where to find the new version.  There can be partial updates (when
crashing), so BP re-write would have to leave behind an intent log of re-writes, and would have to write new
blocks, log, commit, then write the forwardings.

Also, I forgot to mention (though we've talked about it earlier), that it'd help greatly if the DVAs in
blkptr_ts were not hashed -- that way blkptr_ts of blocks that contain blkptr_ts don't change when the
blkptr_ts in them get relocated.  But this is not critical -- not having this means only that the BP
re-writer has to do more work to make it possible to handle forwarding information.

> If we do not believe so :) then bp rewrite becomes simpler, yet other 
> things may become quite complex to manage

I do believe in the COW principle, and I don't think this scheme breaks it; it's the ability to treat the
overwriting of a live block as "corruption, then attempt recovery by interpreting and validating it as
forwarding data" that makes it OK.

Nico
--


Boris Protopopov | 1 Jul 18:52 2013

RE: BP rewrite via snapshots

Hello, Jim.

I do not believe there is an issue with dedup, because all of those references are explicitly kept track of by
the filesystem. So, the right things will happen "automagically" when you COW those blocks during the move.
There is no need to treat dedup-ed blocks differently, as far as correctness is concerned.
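
Conceptually (the helper names below are hypothetical; only BP_GET_DEDUP() is a real macro), the mover's free path for an old copy would look like this: the normal free machinery decrements the DDT refcount, and the physical block is released only once nothing references it any more.

/*
 * Conceptual sketch only -- the helpers are hypothetical, not illumos APIs.
 * COWing a deduped block during the move decrements its DDT refcount via
 * the normal free path; the old copy is freed only when no reference remains.
 */
static void
move_release_old_copy(blkptr_t *bp)
{
        if (BP_GET_DEDUP(bp)) {
                if (ddt_decref(bp) > 0)         /* hypothetical helper */
                        return;                 /* other references remain */
        }
        release_old_space(bp);                  /* hypothetical helper */
}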

Boris.

-----Original Message-----
From: Jim Klimov [mailto:jimklimov <at> cos.ru] 
Sent: Monday, July 01, 2013 2:35 AM
To: zfs <at> lists.illumos.org
Cc: Boris Protopopov
Subject: Re: [zfs] BP rewrite via snapshots

On 2013-07-01 03:25, Boris Protopopov wrote:
> We are currently working on a white paper that lays out our approach 
> to BP rewrite. We do intend to implement our solution, and we will 
> make it open source :) We would appreciate any help and advice from 
> the community.

Well, it is certainly good to hear that more people are still thinking about this, and that possibly even
more than thinking is going on ;)

> Our analysis indicates that (transparently) moving data that belongs 
> to active filesystems but does not belong to snapshots is not a 
> difficult problem. Such data by definition is referenced by a single 
> tree, so there are no issue with having to track down many references 
> and to transactionally update multiple adjacent trees in the lineage.

After reading the interesting and informative points and thoughts in this thread, I wonder if the statement
above is not too naive: namely, if deduplication is used (though it is easily detected by a bit in blkptr_t),
it is possible that there is more than one reference to the block in the DDT, even in an active ("live", as I
mis-termed it earlier) dataset.

So while for unique blocks, and for deduped blocks with count==1, there is indeed little work (though the DDT
block pointer must be updated in the latter case), for deduped data there is likely about as much work as with
snapshots and clones - even more, because instead of looking for matches in some limited sub-tree (a family
with one origin, or similar), you'd have to search the whole pool. Bummer ;-\

I wonder how unbearably costly it would be to build and maintain, even if just temporarily, a (possibly
DDT-like) map of "who references this user-data block, and where this reference is located"?..

//Jim

Richard Yao | 1 Jul 19:11 2013

RFC: zpool import -t for temporary pool names

Doing `zpool import pool newpool` to deal with namespace collisions has
the side effect of changing the pool name. This is a pain when a
production virtual machine is rendered unbootable and the root pool name
is the same in both the host and the guest. Recovery requires either
rolling back the zvol backing the VM or booting a LiveCD inside a VM in
order to do rollback of the guest pool. Neither option to be ideal.

I wrote a small patch that implements an option -t to zpool import that
makes the newpool name non-persistent to address that. It depends on a
one-line patch to remove zpool import -r that I wrote when I noticed
that option was unimplemented.

https://github.com/ryao/zfs/commit/3c8731fcdb87ebd83fb64595791d8b8721124135
https://github.com/ryao/zfs/commit/d8a9bf242f6921ca837cc3f0e7072d964a43fec9

I plan to file both a ZFSOnLinux pull request and an Illumos issue for
this. However, I remember that the patch (that someone else wrote) to
permit users to manually override ashift at vdev creation was unpopular
on the Illumos ZFS mailing list. With that in mind, I would like
comments before I request that ZFSOnLinux make another change to the
zpool command that is not present in Illumos.

On a related note, we need `zpool export pool newpool` to allow people
to fix pools that were accidentally imported without option -t. In
theory, that could handle the case that the temporary pool names handle
entirely. However, a pool whose name had to be changed because of a
namespace collision would be left with a different name should a crash
occur before the export. Using a temporary pool name would avoid it
entirely. That makes me consider the two changes that I am proposing to
be complementary. I have not written the `zpool export pool newpool`
patch yet, but I welcome comments on the idea.

Richard Yao | 1 Jul 20:17 2013

Re: RFC: zpool import -t for temporary pool names

On 07/01/2013 01:11 PM, Richard Yao wrote:
> Doing `zpool import pool newpool` to deal with namespace collisions has
> the side effect of changing the pool name. This is a pain when a
> production virtual machine is rendered unbootable and the root pool name
> is the same in both the host and the guest. Recovery requires either
> rolling back the zvol backing the VM or booting a LiveCD inside a VM in
> order to do rollback of the guest pool. Neither option to be ideal.

That should have been "Neither option is ideal". It was originally, "I
consider neither option to be ideal", but I made a last minute change
before sending that email.

> I wrote a small patch that implements an option -t to zpool import that
> makes the newpool name non-persistent to address that. It depends on a
> one-line patch to remove zpool import -r that I wrote when I noticed
> that option was unimplemented.
>
> https://github.com/ryao/zfs/commit/3c8731fcdb87ebd83fb64595791d8b8721124135
> https://github.com/ryao/zfs/commit/d8a9bf242f6921ca837cc3f0e7072d964a43fec9

Someone in IRC pointed out to me that I should be using VERIFY0 instead
of ASSERT0. I just pushed a revised commit to address that:

https://github.com/ryao/zfs/commit/c31d6af22c93126a60fb06c6693e01547eafd7f0

> I plan to file both a ZFSOnLinux pull request and an Illumos issue for
> this. However, I remember that the patch (that someone else wrote) to
> permit users to manually override ashift at vdev creation was unpopular
> on the Illumos ZFS mailing list. With that in mind, I would like
> comments before I request that ZFSOnLinux make another change to the
> zpool command that is not present in Illumos.
>
> On a related note, we need `zpool export pool newpool` to allow people
> to fix pools that were accidentally imported without option -t. In
> theory, that could handle the case that the temporary pool names handle
> entirely. However, a pool whose name had to be changed because of a
> namespace collision would be left with a different name should a crash
> occur before the export. Using a temporary pool name would avoid it
> entirely. That makes me consider the two changes that I am proposing to
> be complementary. I have not written the `zpool export pool newpool`
> patch yet, but I welcome comments on the idea.
>

Justin T. Gibbs | 1 Jul 22:05 2013

Re: 3752 want more verifiable dbuf user eviction

Sorry for the long delay in returning to this thread.  It seems we can only get "moments of focus" on upstreaming around here.  I'm having one now, and we'll see how long it lasts. :-)

On Jun 20, 2013, at 10:37 AM, Matthew Ahrens <mahrens <at> delphix.com> wrote:

On Wed, Jun 19, 2013 at 10:13 PM, Justin T. Gibbs <gibbs <at> scsiguy.com> wrote:

I don't follow this.  container_of() doesn't care what the type of the pointer is; it casts it to uintptr_t, so it would work just fine with either a dmu_buf_user_t* or a void*.

The "container_of()" implementations I'm familiar with (Linux and FreeBSD) perform an assignment without cast to a local (to the macro) pointer of the correct field type in order to catch human error.

Ah, I see, the "__GNUC_PREREQ__(3, 1)" case has that additional type safety.  Illumos doesn't have much GCC-specific code, having adopted GCC only recently.  I'll see about getting the special GCC version of container_of() into illumos as well.
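
For reference, the type-checked flavor looks roughly like this (modeled on the Linux/FreeBSD macros; the exact illumos spelling may end up different):

/*
 * GCC-flavored container_of(): the __typeof__ temporary makes the compiler
 * warn if 'ptr' does not point to the same type as the named member, which
 * is the extra type safety being discussed.  Uses a GNU statement expression.
 */
#include <stddef.h>     /* offsetof */
#include <stdint.h>     /* uintptr_t */

#define container_of(ptr, type, member) ({                              \
        const __typeof__(((type *)0)->member) *__mptr = (ptr);         \
        (type *)((uintptr_t)__mptr - offsetof(type, member));          \
})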

Just so I have this straight, let me see if we are on the same page about the type safety.

The current code does:
dsl_dataset_t *ds = dmu_buf_get_user(dbuf);
This is type-unsafe, because this code doesn't know at compile time that dmu_buf_get_user() actually returns a dsl_dataset_t*.  We have to look at some other code to convince ourselves that whenever this code is run, we will get a dsl_dataset_t*, and not e.g. a dsl_dir_t*.

the code under review does:
dsl_dataset_t *ds = (dsl_dataset_t *)dmu_buf_get_user(dbuf);
Do we agree that does not add any type safety, but does add more casts, which is undesirable, all else being equal?

I think you're proposing we change this to:

dsl_dataset_t *ds = container_of(dmu_buf_get_user(dbuf), dsl_dataset_t, db_evict);

I wasn't proposing that we use container_of().  I was only pointing out that the return type proposed in the patch would make sense if we decided to lift the restriction that the dmu_buf_user_t  be the first member of its containing object and thus had to use container_of().

I've already switched our code base back to returning "void *" for these APIs and removed all but one cast.  The remaining cast is required for C++ compatibility because it appears in a header file.

Do you have any more direction on what to do about inline functions in headers?

This case still has the exact same type-unsafety as the current code, because we still don't know at compile time that the returned dmu_buf_user_t* is embedded in a dsl_dataset_t*.  It could be embedded in a dsl_dir_t*.

What we know at compile time is that dmu_buf_get_user() returns a dmu_buf_user_t*, and that we are using the right code to convert from that to a dsl_dataset_t*.  But we don't know if that is a valid conversion or not.  This is the same as the current code, where we are correctly converting from the value returned by dmu_buf_get_user() to a dsl_dataset_t (with a simple assignment), but we don't know at compile time if that conversion is valid.

So this seems like a reasonable solution to the problem of how to deal with an embedded data structure.  Another would be that used by list_t and avl_tree_t: having the embedder keep track of the offset & do the translation. But I don't see how this will catch more errors at compile time (i.e. have more type safety) than the current code.
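
(For comparison, the offset-tracking approach amounts to something like the following -- paraphrased, not verbatim illumos list_t code:)

/*
 * Paraphrase of the list_t/avl_tree_t technique: the container records, at
 * creation time, where the embedded node lives inside each object, then
 * translates node pointers back to object pointers arithmetically.
 */
typedef struct xlist {
        size_t  xl_offset;      /* offsetof(object type, embedded node) */
        /* ... head node and the rest of the container state ... */
} xlist_t;

#define xlist_object(l, node)                                           \
        ((void *)((char *)(node) - (l)->xl_offset))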

I find the "recorded offset" thing really annoying during debugging sessions and would never advocate doing it elsewhere. :-)  This is one case where macros (or templates in C++) are better than functions for both type safety and space efficiency.  See the 4.4BSD queue macros: http://www.freebsd.org/cgi/man.cgi?query=queue&apropos=0&sektion=0&manpath=FreeBSD+9.1-RELEASE&arch=default&format=html


The webrev really contains two different changes: the use of queued eviction to avoid LORs, and the removal of db_user_data_ptr_ptr.  If it helps in the discussion or RTI process, we can split them into separate issues and webrevs.

Good point.  Let's discuss the db_user_data_ptr_ptr first since we're already down in those details.

I agree that there are a number of ways to deal with db_user_data_ptr_ptr, none of which are that pretty.  How about a wildly different idea:  we make it so that the db_data (and thus ds_phys, zap_phys, etc) will never change if the caller has saved it (indicated by having an evict func).  Then the dataset code would simply set ds_phys = db->db_data once when the dataset_t was created (similarly for the dsl_dir, zap, and other DMU consumers).

I believe that the only case where we actually change *db_user_data_ptr_ptr today is in the ZPL, for ZAP objects and SAs.  Today, we change the db_data when we dirty a dbuf which has older dirty records, so that we keep the txg-specific version of the data to be written out.  This never happens for MOS data, because it is only modified in syncing context.

To avoid changing db_data, we would always make a copy of the db_data for dbuf_sync_leaf() to use.  This would happen either in dbuf_sync_leaf() or in dbuf_dirty(), depending on which was called first.  If dbuf_sync_leaf() is called first, this would be an additional copy that we are not doing today.  Given that we would only copy a relatively few, small pieces of data, the performance impact shouldn't be too bad -- but we should definitely measure it.

So the code today looks like:

dbuf_dirty(txg=N)
 - db_data and the dirty record share a buffer

dbuf_dirty(txg=N+1)
 - if dirty record still present, and buffer still shared:
    - make a copy of buffer
    - set dbuf's db_data to new buffer
    - store to *db_user_data_ptr_ptr

dbuf_sync_leaf(txg=N)
 - use dirty record's dr_data

My proposed change is:

dbuf_dirty(txg=N)
 - db_data and the dirty record share a buffer

dbuf_dirty(txg=N+1)
 - if dirty record still present, and buffer still shared:
   - make a copy of buffer
   - if there is an evict func:
     - set dirty record's dr_data to new buffer
   - else:
     - set dbuf's db_data to new buffer

dbuf_sync_leaf(txg=N)
 - if buffer still shared with db_data, and there is an evict func:
    - make a copy, set dirty record's dr_data to new buffer

In my opinion, this is just another solution to a problem that doesn't need to exist.  Why not just engineer it away along with the additional code, copies, and complexity? :-)

In your other email, you seem to be okay using macro accessors instead of the union+macro scheme.  Just tell me what you want the macros called and I will happily do the work to generate an updated patch.

A few more minor things I ran across:

sa_handle_get_from_db(): Nice code cleanup, but I think we'll panic if we ever execute line 1388 (bzero-ing a NULL pointer).

The "handle = kmem_cache_alloc(sa_cache, KM_SLEEP);" line seems to have been dropped during the merge to Illumos.  Will fix.

various structs have members called "db_evict".  These should use the same prefix as the other members of each struct.  E.g. dsl_dataset_t's db_evict should be ds_evict.  dsl_dir_t's should be dd_evict.  zap_t's should be zap_evict. Etc.

I saw this too and have already addressed these style issues in our local tree.   We're auditing the rest of our changes for similar issues.  Right now we are re-synching our product trees with recent Illumos and will post updated webrevs once the merge passes our regression tests.

--
Justin

Deirdre Straughan | 1 Jul 23:52 2013

ZFS Internals Training from Joyent

Max Bruning will be teaching this popular course August 13-15 at Joyent HQ in San Francisco. Early bird pricing ends July 12th. 


Details and registration here: http://zfsinternals.eventbrite.com/



--


best regards,
Deirdré Straughan
Head of Educational Programs, Joyent
