Daniel Phillips | 2 Mar 03:33

Design note: Btree index block life cycle

Hi Matt,

This note examines specifics of the Tux3 btree index block life cycle, including caching, redirection,
flushing and reconstruction during log replay.  This addresses some points that you raised during our
earlier discussion, for which my response at that time could fairly be described as hand waving.  One such
point was potential complexity compared to working with a more "uniform" btree design like the one you
have adopted for Hammer.

Hirofumi is CCed because he has bravely volunteered to complete the implementation of this part of the Tux3
atomic commit mechanism for userspace and kernel, thus providing me with the necessary motivation to
state the details precisely.

I think that we can see at this point that the complexity required to implement the Tux3 atomic commit
strategy is well bounded, especially as it is now mostly implemented and can be seen not to amount to a great
deal of code.  Mind you, some of it has not yet seen the light of day.  For example, I still have not addressed
all the details of allocation bitmap replay, which will require an additional, logical replay pass not
described here.  Oh well, that will be another note, and hopefully all of this will be up and running fairly soon.

By the way, congratulation on your apparent success with Hammer.  I hope to foment a movement to port it to
Linux, as it would seem to address a niche that is not well covered in Penguin land and is unlikely to be in the
forseeable future, by my project or any other I know of, which is to say: cluster replication and
continuous fine grained delta logging for high capacity servers.

Natural metadata position

For file index blocks, the "natural" position is near the file's inode table block, and near the beginning
of the file data blocks.  Roughly speaking, we would like to place index blocks on the same track as both the
associated inode table block and the first few file data blocks, so that the disk drive will pick these up
together into its track cache when it first seeks to that track.  Similarly, we want each inode table index
block to lie near the beginning of the group of inode table leaf blocks it references.  This is the ideal
(Continue reading)

Daniel Phillips | 2 Mar 03:37

Design note: Btree index block life cycle

(sent again with word wrapping)

Hi Matt,

This note examines specifics of the Tux3 btree index block life cycle, including caching, redirection,
flushing and reconstruction during log replay.  This addresses some points that you raised during our
earlier discussion, for which my response at that time could fairly be described as hand waving.  One such
point was potential complexity compared to working with a more "uniform" btree design like the one you
have adopted for Hammer.

Hirofumi is CCed because he has bravely volunteered to complete the implementation of this part of the Tux3
atomic commit mechanism for userspace and kernel, thus providing me with the necessary motivation to
state the details precisely.

I think that we can see at this point that the complexity required to implement the Tux3 atomic commit
strategy is well bounded, especially as it is now mostly implemented and can be seen not to amount to a great
deal of code.  Mind you, some of it has not yet seen the light of day.  For example, I still have not addressed
all the details of allocation bitmap replay, which will require an additional, logical replay pass not
described here.  Oh well, that will be another note, and hopefully all of this will be up and running fairly soon.

By the way, congratulation on your apparent success with Hammer.  I hope to foment a movement to port it to
Linux, as it would seem to address a niche that is not well covered in Penguin land and is unlikely to be in the
forseeable future, by my project or any other I know of, which is to say: cluster replication and
continuous fine grained delta logging for high capacity servers.

Natural metadata position

For file index blocks, the "natural" position is near the file's inode table block, and near the beginning
of the file data blocks.  Roughly speaking, we would like to place index blocks on the same track as both the
associated inode table block and the first few file data blocks, so that the disk drive will pick these up
(Continue reading)

Daniel Phillips | 3 Mar 01:48

Patch: Preliminary attempt at nospace handing

Hi,

I was not really expecting to see people testing out of space handling in Tux3 at this early stage, but Marcin
apparently failed to get that memo, so he went ahead and demonstrated directory corruption by hitting
nospace in his mass file creation test.  To avoid further embarrassment of this nature, I put together a
preliminary patch to fail gracefully in the file create instead of just stumbling forward and running out
of disk blocks deep in metadata update.

The classic way of dealing with this messy issue is, at the beginning of each atomic filesystem change one
overestimates the worst case number of blocks that will be required to store the change, including all
associated metadata blocks, and reserve that many "credits" against the remaining free blocks on the
filesystem.  If the remaining free blocks count lies below some safety margin depending on the user
privilege and our own confidence that the reservation code is correct, we bail out of the change with
ENOSPC, which bubbles up to the application.  If remaining free blocks are sufficient, on the other hand,
we press on with the change, counting the number of new blocks actually allocated.  When the change is done
(but not necessarily recorded on disk) any unused credit, that is, reservation in excess of actual blocks
consumed, is returned to the credit pool.

Well, I was too lazy to implement all the accounting code needed to keep track of actual allocations, and in
some cases the actual allocation does not occur inside the change anyway, but is deferred for a later flush
operation.  We already have this situation with delayed write allocation and we will soon have more of that
with delayed bitmap block flushing.  So rather than do that big, fragile hack right now, I tried a more
creative approach.  I make a gross overestimate of number of blocks required to store the transaction as
described above, but do not attempt to return unused credits to the credit pool.  Instead, I just let Tux3
"hit the wall" and detect insufficient unreserved free blocks while there still is plenty of actual free
space on the volume.  At this point, I force a flush to disk and reset credits to zero, then repeat the
reservation check against actual volume free space, now that all earlier reservations have been
translated into a concrete on-disk representation.  If this second check fails, it is a real out of space
condition and ENOSPC is returned to the application.

(Continue reading)

Daniel Phillips | 3 Mar 02:35

Re: Patch: Preliminary attempt at nospace handing

OK, maybe this revised patch will allow you to delete things after
reaching a volume full condition.  Not tested...

Daniel
Attachment (hack.nospace.patch): text/x-diff, 13 KiB
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 3 Mar 03:55

Re: Patch: Preliminary attempt at nospace handing

...and by the way, calling freeze_bdev from deep inside a filesystem
vfs method is unlikely ever to work reliably without deadlocking.  But
at least it demonstrates the principle when not pushed too hard.  We
will develop some other, more robust flushing model for real use.

Regards,

Daniel
Daniel Phillips | 4 Mar 08:00

Re: Patch: Preliminary attempt at nospace handing

Hi Marcin,

This patch against the latest mercurial repo attempts to resolve
yesterday's deadlock.  "It works for me".

Daniel
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 4 Mar 12:33

Re: Patch: Preliminary attempt at nospace handing

Here is a better patch that doesn't deadlock (the last one did) and
uses the more proper generic_sync_sb_inodes instead of freeze_bdev.

I used the make-many-files.c program that Marcin dug up somewhere on
the net to test this, making the partition 4 GB, which is a little
too small to hold all the files.  This exercises the nospace handling
nicely.

Note: one thing that freeze_bdev does that generic_sync_sb_inodes does
not is prevent new writes during the flush.  We will have to think
about how this is to be handled.

Daniel
Attachment (nospace.patch): text/x-diff, 8 KiB
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 5 Mar 02:31

Re: Patch: Preliminary attempt at nospace handing

Fixed the kernel header include problem introduced by the previous
version of the patch, so user code now builds properly.

Regards,

Daniel
Attachment (nospace.patch): text/x-diff, 9 KiB
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 5 Mar 03:20

Re: Patch: Preliminary attempt at nospace handing

Another small adjustment to make rm -r work without failing on nospace.

Regards,

Daniel
Attachment (nospace.patch): text/x-diff, 9 KiB
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 5 Mar 11:15

Test code: make-many-files.c

Here is the test code Marcin found on the web.  It makes a three level
deep tree of directories containing 405,000 files, with 15,000 files
per directory.  Make the directory tree, and then rm -r.  Repeat.  Bad
things happen, especially if the partition is about 4GB, which is 1GB
less than the space required to complete the run.

After a successful rm -r (it doesn't always succeed) then umount takes
an unreasonably long time, which looks like an interaction between VFS
and Tux3 where inode table blocks are getting re-dirtied and written
out over and over again.  This may have something to so with an
explicit test in generic_flush_sb_inodes for whether it is the blockdev
being flushed, which no long functions for us since we no longer use
the blockdev, but a regular inode for our metadata blocks.  The
correct solution is probably to write our one inode flusher along the
lines of generic_flush_sb_inodes, but doing exactly what we need it to
do.

Regards,

Daniel
Attachment (make-many-files.c): text/x-csrc, 777 bytes
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3

Gmane