Daniel Phillips | 26 Jul 2008 00:13

Re: Comparison to Hammer fs design

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> 
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> :     Pedro.
> 
>     Those are interesting comments.   I think I found Daniel's email address
>     so I am adding him to the To:   Dan, feel free to post this on your Tux
>     groups if you want.

How about a cross-post?

>     I did consider multiple-parentage...  that is the ability to have a
>     writable snapshot that 'forks' the filesystem history.  It would be
>     an ultra cool feature to have but I couldn't fit it into the B-Tree
>     model I was using.  Explicit snapshotting would be needed to make it
>     work, and the snapshot id would have to be made part of the B-Tree key,
>     which is fine.  HAMMER is based around implicit snapshots (being able
>     to do an as-of historical lookup without having explicitly snapshotted
>     bits of the history).
(Continue reading)

Daniel Phillips | 26 Jul 2008 00:54

Re: Comparison to Hammer fs design

(resent after subscribing to the the kernel <at> crater)

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> 
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> :     Pedro.
> 
>     Those are interesting comments.   I think I found Daniel's email address
>     so I am adding him to the To:   Dan, feel free to post this on your Tux
>     groups if you want.

How about a cross post?

>     I did consider multiple-parentage...  that is the ability to have a
>     writable snapshot that 'forks' the filesystem history.  It would be
>     an ultra cool feature to have but I couldn't fit it into the B-Tree
>     model I was using.  Explicit snapshotting would be needed to make it
>     work, and the snapshot id would have to be made part of the B-Tree key,
>     which is fine.  HAMMER is based around implicit snapshots (being able
(Continue reading)

Matthew Dillon | 26 Jul 2008 04:02

Re: Comparison to Hammer fs design

:>     so I am adding him to the To:   Dan, feel free to post this on your Tux
:>     groups if you want.
:
:How about a cross-post?

    I don't think it will work, only subscribers can post to the DFly groups,
    but we'll muddle through it :-)  I will include the whole of the previous
    posting so the DFly groups see the whole thing, if you continue to get
    bounces.

    I believe I have successfully added you as an 'alias address' to the
    DragonFly kernel list so you shouldn't get bounced if you Cc it now.

:Yes, that is the main difference indeed, essentially "log everything" vs
:"commit" style versioning.  The main similarity is the lifespan oriented
:version control at the btree leaves.

    Reading this and a little more that you describe later let me make
    sure I understand the forward-logging methodology you are using.
    You would have multiple individually-tracked transactions in
    progress due to parallelism in operations initiated by userland and each
    would be considered committed when the forward-log logs the completion
    of that particular operation?

    If the forward log entries are not (all) cached in-memory that would mean
    that accesses to the filesystem would have to be run against the log
    first (scanning backwards), and then through to the B-Tree?  You
    would solve the need for having an atomic commit ('flush groups' in
    HAMMER), but it sounds like the algorithmic complexity would be
    very high for accessing the log.
(Continue reading)

Daniel Phillips | 27 Jul 2008 13:51

Re: Comparison to Hammer fs design

linSubscribed now, everything should be OK.

On Friday 25 July 2008 19:02, Matthew Dillon wrote:
> :Yes, that is the main difference indeed, essentially "log everything" vs
> :"commit" style versioning.  The main similarity is the lifespan oriented
> :version control at the btree leaves.
> 
>     Reading this and a little more that you describe later let me make
>     sure I understand the forward-logging methodology you are using.
>     You would have multiple individually-tracked transactions in
>     progress due to parallelism in operations initiated by userland and each
>     would be considered committed when the forward-log logs the completion
>     of that particular operation?

Yes.  Writes tend to be highly parallel in Linux because they are
mainly driven by the VMM attempting to clean cache dirtied by active
writers, who generally do not wait for syncing.  So this will work
really well for buffered IO, which is most of what goes on in Linux.
I have not thought much about how well this works for O_SYNC or
O_DIRECT from a single process.  I might have to do it slightly
differently to avoid performance artifacts there, for example, guess
where the next few direct writes are going to land based on where the
most recent ones did and commit a block that says "the next few commit
blocks will be found here, and here, and here...".

When a forward commit block is actually written it contains a sequence
number and a hash of its transaction in order to know whether the
commit block write ever completed.  This introduces a risk that data
overwritten by the commit block might contain the same hash and same
sequence number in the same position, causing corruption on replay.
(Continue reading)

timothy norman huber | 27 Jul 2008 23:10
Picon

forward logging and NVRAM

Daniel,

Several weeks ago, during a discussion over coffee, you mentioned a rather intriguing optimization- by writing the start of the forward log chain to NVRAM rather than disk, you avoid a disk seek and write transfer for each forward log chain.  i believe there was another motivation that had to do with avoiding traversing the entire volume in case of a crash.  Was this design feature to overcome lame fsck performance?  

T


Timothy Huber
Strategic Account Development

cell 310 795.6599

181 Metro Drive, Suite 400 
San Jose, CA 95110 


_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 27 Jul 2008 23:29

Re: forward logging and NVRAM

Hi Timothy,

On Sunday 27 July 2008 14:10, timothy norman huber wrote:
> Daniel,
> 
> Several weeks ago, during a discussion over coffee, you mentioned a  
> rather intriguing optimization- by writing the start of the forward  
> log chain to NVRAM rather than disk, you avoid a disk seek and write  
> transfer for each forward log chain.  i believe there was another  
> motivation that had to do with avoiding traversing the entire volume  
> in case of a crash.  Was this design feature to overcome lame fsck  
> performance?

The same motivation, actually.  Writing the start of a forward log
chain to nvram instead of to some known location on disk means that I
do not have to worry about doing unnatural things to optimize such
loads as O_SYNC writing, where each transaction has to complete
before the next one can begin, which means that the forward log chain
never gets more than one element log, requiring a seek to the known
location for every transaction.  Not worse than journalling to be
sure, and usually better, but not as fast as avoiding a seek and disk
write per transaction entirely.

So if somebody could give me 16 bytes of NVRAM or so per volume I
would be most pleased and could use that to generate some nice O_SYNC
benchmarks :-)

By the way, I wonder why your posts do not show up in the Mailman
(Pipermail) archive, while mine and Matt's do?

Daniel
Daniel Phillips | 28 Jul 2008 07:46

Two kinds of atomic commit

Here I describe two slightly different methods that Tux3 will use to
implement atomic commit of data and metadata.  Both methods combine 
logical and physical forward logging to perform atomic commits
efficiently.  The discussion below assumes we are updating a btree leaf,
for example to make room for more inode data or to add pointers to a
file index.  The same techniques apply equally well to all structures
in the filesystem.

1) The Update a Clone Method

Instead of directly modifying a disk block corresponding to a btree
leaf, we allocate a new block for the leaf and copy the contents of
the original block to the new block, only in the buffer cache (no copy
is performed on disk).  We can now alter the new block at will and
flush it out to disk without affecting the on-disk btree at all.  But
have not yet linked the new block into the btree.  We could accomplish
that by performing a similar clone update recursively up to the root of
the tree, which creates a new tree root.  The whole chain of new blocks
would then be flushed out to disk and a pointer to the new root stored
at some predictable location so it can be found later.  This is the
"phase tree" method that I invented for Tux2, and is commonly called
"copy on write" these days.  It could also be called the "sue me" method
because Netapp likes to sue those such as Sun who implement it.

Fortunately, there is a better way to do it that I invented recently and
which I have never heard of anyone using before.  We modify only the
leaf node of the btree by cloning and record the pointer to the clone
only in the cached btree index block, not on disk.  To be able to
reconstruct the cached version of the index node after a crash, we log a
logical change record to the disk that says "write this leaf into that
btree index node".

We make whatever changes we need to the clone of the leaf node, then
construct a two block transaction consisting of the clone and a commit
block.  The commit block points at the new leaf node and also carries
the logical update record that describes how to modify the parent index
node recorded on disk to point at the new leaf.  If the commit block can
be allocated immediately adjacent to the clone then we can construct a
single bio to transfer the clone and the commit block to disk in one
operation.  Otherwise we construct two bios.  If there is some previous
commit transaction that has been staged but not yet committed, then we
modify its commit block to point at the location where our new commit
block will be stored.  Otherwise we have to seek to some known location
on the disk to store the location of the new commit.

To be able to tell after a crash whether the commit block and its
associated data made it to disk, we store a sequence number to
distinguish this commit block from a possible similar commit that might
have landed accidentally at the same location earlier, and store a hash
of the two blocks in the commit block.

Now the transaction is ready to go out to disk.  But if the disk is busy
with previous traffic (we hope it is) then we just stage the transaction
for commit, and commit the predecessor transaction instead.  This gives
a chance for any following transaction to extend the forward log chain
instead of having to start a new chain, which would need an extra seek
and transfer.

The original version of the leaf node on disk is not yet freeable,
because it may be needed after a crash (together with the logical update
record) to reconstruct the cached btree node image.  Only after we "roll
up" the logical update log entry by atomically updating the parent node
may we free the original leaf block and the commit block that recorded
the logical update.  Such a rollup happen periodically, otherwise we
would end up with an unreasonably long chain of logical updates
needing to be replayed on remount to reconstruct the btree image in
cache.

2) The Update in Place Method

An alternate method of atomically updating our btree leaf node is to
write it to disk twice: the first time along with a commit block that
says we intend to overwrite the leaf node with the data part of the
transaction, and the second write does the actual overwrite.  If we
crash before completing the second write, replay will load the image
of the btree block from the commit transaction into cache and we are
ready to continue where we left off.

Like the clone method, the update in place method constructs a
transaction, tries to link it into an existing forward log chain, or
falls back to writing the commit block location to a known location if
it has to, then stages the transaction for commit, committing the
predecessor transaction in the forward chain if there is one.  The
overwrite transfer can be submitted as soon as the commit transaction
has completed, which might consist of one or two bio transfers depending
on whether it was possible to allocate the transaction contiguously or
not.

It is possible that the updated leaf block may be needed as the base for
applying logical updates to reconstruct the cached version of the leaf
as a consequence of some future transaction, but the future transaction
must be able to rely on a known state of the leaf if it wants to log
logical changes against it.  Therefore the future transaction may have
to wait on the update in place transaction to complete as well, to
guarantee that the desired leaf state can be reconstructed.

Any future update in place of the same block has to wait for the
overwrite transfer to complete before submitting its own overwrite
transfer, otherwise the second overwrite may race ahead of the first
creating a stale disk block.

The whole update in place transaction can be freed as soon as the
overwrite transfer completes.

Comparison of the methods

Both methods are very efficient ways to achieve atomic commit.  The
update in place method has the advantage of not perturbing the original
disk layout, possibly reducing fragmentation.  The update clone method
has the advantage of one less write operation.  Tux3 will provide a
mount option to specify which method to use.

A related technique is atomic append.  A whole range of data blocks can
be written out along with a commit block in a single bio transfer, or
a small number of bio transfers in one transaction, along with a logical
record in the commit block to say which file the blocks are to be
appended, in the form of a logical update in the commit block which will
eventually result in the addition of one or more extents to the file
index, update of the size and mtime, and removal of the new data blocks
from the free map.  Thus, a file write of considerable size can be
atomically committed with a single bio transfer.

Regards,

Daniel
Daniel Phillips | 28 Jul 2008 08:39

Re: Comparison to Hammer fs design

On Sunday 27 July 2008 14:31, Matthew Dillon wrote:
> :A versioned extent:
> :
> :   struct extent { unsigned version:10, count:6, block:24; };
> :
> :Directory format to index the extent within a leaf:
> :
> :   struct entry { unsigned loglo:24, offset:8 };
> :   struct group { unsigned loghi:24, count:8 };
> :
> :Leaf layout (abridged):
> :
> :   struct leaf { unsigned groups; struct group[]; struct entry[]; struct extent[] };
> :...
> :See, I am really obsessive when it comes to saving bits.  None of these
> :compression hacks costs a lot of code or has a big chance of hiding a
> :bug, because the extra boundary conditions are strictly local.
>      
>     Man, those are insanely small structures.  Watch out for the cpu
>     and in-kernel-memory tracking overhead.

And there is a mistake, it should be:

-   struct extent { unsigned version:10, count:6, block:24; };
+   struct extent { unsigned version:10, count:6, block:48; };

The compression effort was about finding a way to represent 48 bits of
logical address for each extent and to avoid repeating logical addresses
for successive extents starting at the same logical address, using 32
bits instead of the 64 bits currently used by the ddsnap leaf format,
for a 33% overall improvement in leaf capacity.

Regards,

Daniel
Daniel Phillips | 28 Jul 2008 09:18

File index leaf format

In an earlier post I was seen fretting about how to reduce the 
relatively expensive linear search in a PHTree directory file dirent 
block to O(log(dirents)), while keeping the compactness and physical 
stability of the good old Ext dirent format.  It did not take too long 
to find one.

   struct tux3_dirent { unsigned inum:48, type:8, len:8; char name[]; };

Versus an Ext2 dirent, the two byte record size is gone and the inum
is increased from 32 bits to 48.  The fields are big-endian and the 
structure is byte-aligned, and therefore needs to be declared with 
attribute packed so that gcc will generate code that does not throw 
exceptions on alignment-challenged architectures when trying to pick up 
the 48 bit inum field.

A simple directory grows down from the top of the dirent block towards
the dirents:

   be_16 entries[];

Each big endian 16 bit entry is the offset of a dirent from the base of 
the block.  The entries are in lexically sorted order to support 
binsearch.  To maintain physical stability for telldir cookies, an 
entry is never moved once created.  Too bad about directory compaction, 
that is an offline operation.

To search for space within the dirent we make a copy of the directory 
and sort it.  In the base of the directory we record the size of the 
biggest contiguous free space in the dirent block in order to avoid 
most fruitless searches and the associated sort.  There is also a 
highwater mark to optimize the common case of dirent blocks that just 
fill up from bottom to top, with free space seldom created in the 
middle, and to know how far down the directory can grow before it
meets the dirents.

To delete an entry, just delete it from the directory and reduce the
count of entries.

The block header looks like:

   struct tux3_dirent_block { char magic[4]; unsigned count, high; };

The inum of the directory file may also be stored in the header to help
with fsck.

I am not in a big hurry to implement this, because the good old Ext2
dirent block format will serve quite nicely for now, and I can just cut 
and paste the code to implement that.  But when dirop speed rises to 
the top of this list of things that people are benchmarking, this 
improvement is ready and will measureably increase the speed of creates 
and random deletes.  Also, Tux3 is supposed to have 48 bit inode 
numbers, so the directory format has to support that.

Regards,

Daniel
Matthew Dillon | 28 Jul 2008 18:58

Re: Two kinds of atomic commit

:1) The Update a Clone Method
:
:2) The Update in Place Method
:
:Daniel

    Hmm.  Those seem fairly complex.  How would you deal with incidental
    operations on the B-Tree such as doing a split?  A single insert
    could wind up requiring a split of every internal node on the way
    down to the leaf.  I guess you could clone the blocks, but there are
    a ton of parent and child linkages that have to be changed when you
    split a node.  It sounds pretty expensive.

    Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
    Incidental meaning those operations (such as splits) which lead up to
    an insertion or deletion but do not actually perform the insertion
    or deletion.   On crash recovery you would undo partially completed
    B-Tree operations and then REDO the related logical operations.

    Having to allocate new blocks for B-Tree deletions could create a big
    issue when the filesystem becomes full or near-full.

					-Matt
					Matthew Dillon 
					<dillon <at> backplane.com>

Gmane