Daniel Phillips | 2 Feb 07:36

Rough draft of atomic commit for userspace

It compiles, it has not been tried.  There are probably mistakes and
omissions.

There are two new lists in the superblock:

  sb->pinned ... dirty btree nodes and bitmap blocks written per rollup
  sb->commit ... dirty blocks written per delta

This code implements a periodic rollup flush, which mainly moves blocks
from the pinned list to the commit list, and a delta flush, which
writes out the dirty metadata, then writes out the log blocks.
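
As a rough illustration of the two-list flow described above (not the code in the patch), here is a minimal sketch.  The types struct block and struct sb_lists and the helper list_push() are hypothetical stand-ins, and "writing" a block is reduced to setting a flag:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real Tux3 structures are richer. */
struct block {
	struct block *next;
	int written;            /* set once the block has been "written" */
};

struct sb_lists {
	struct block *pinned;   /* dirty btree nodes and bitmap blocks, held until rollup */
	struct block *commit;   /* dirty blocks written out with each delta */
};

/* Push one block onto the front of a singly linked list. */
static void list_push(struct block **head, struct block *b)
{
	assert(head != NULL && b != NULL);
	b->next = *head;
	*head = b;
}

/* Rollup: move every pinned block to the commit list; returns the count moved. */
static unsigned rollup_flush(struct sb_lists *sb)
{
	unsigned moved = 0;
	while (sb->pinned) {
		struct block *b = sb->pinned;
		sb->pinned = b->next;
		list_push(&sb->commit, b);
		moved++;
	}
	return moved;
}

/* Delta: "write" every block on the commit list and empty it; returns the count written. */
static unsigned delta_flush(struct sb_lists *sb)
{
	unsigned written = 0;
	while (sb->commit) {
		struct block *b = sb->commit;
		sb->commit = b->next;
		b->next = NULL;
		b->written = 1;     /* placeholder for real device I/O */
		written++;
	}
	/* A real implementation would write the log blocks here, after the metadata. */
	return written;
}
```

In this sketch a rollup does no I/O of its own; it only makes the pinned blocks eligible for the next delta flush.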

diff -r 4d5b59994ae7 user/buffer.c
--- a/user/buffer.c	Wed Jan 28 21:19:05 2009 -0800
+++ b/user/buffer.c	Sun Feb 01 22:33:25 2009 -0800
@@ -299,7 +299,7 @@ struct buffer_head *blockread(map_t *map
 	return buffer;
 }

-int blockdirty(struct buffer_head *buffer, unsigned newdelta)
+int blockdirty(struct buffer_head *buffer, unsigned newdelta, struct list_head *forked)
 {
 	unsigned oldstate = buffer->state;
 	assert(oldstate < BUFFER_STATES);
@@ -316,7 +316,7 @@ int blockdirty(struct buffer_head *buffe
 		buffer->data = clone->data;
 		clone->data = data;
 		clone->index = buffer->index;
-		set_buffer_state(clone, oldstate);
+		set_buffer_state_list(clone, oldstate, forked);

Lars Segerlund | 5 Feb 07:41

I found this on filesystem pointer corruption.

I don't know how relevant it is for tux3, but I thought I would post
it; if applicable, perhaps it should be taken into account (or kept in
mind for later).

http://www.cs.wisc.edu/wind/Publications/pointer-dsn08.pdf

 / regards, Lars Segerlund
Daniel Phillips | 6 Feb 09:51

Design note: Metablocks

The Tux3 disk superblock is now 96 bytes in size including magic number.  
Do we have any plans for the other 4000 bytes?  Yes: we will fill the 
bulk of the disksuper with pointers to "metablocks".  A metablock is 
like a superblock, except it is stored at some arbitrary place on the 
volume and it only contains data that may vary per delta commit.  Any 
fields that are set just once at mkfs time will remain in the normal 
disk superblock.

Metablocks are reserved in the allocation bitmap and distributed roughly 
evenly across the entire volume.  (We may not actually represent these 
allocations in otherwise empty bitmap blocks, to avoid creating 
hundreds of bitmap blocks at mkfs time.)

The notion of metablocks addresses three issues:

  * When a pointer to the beginning of the log chain must be stored,
    it can always be stored in a fairly nearby location.  In other
    words, if we have 500 metablocks, the maximum time required to
    seek to the closest one is 1/500th of the average seek time for
    a single spindle.

  * Atomic update: overwriting the superblock risks damaging the
    filesystem if the write is interrupted when partially completed.
    We avoid that by never choosing the last written metablock as
    the location for the next.

  * We avoid constantly overwriting the superblock, which might be
    beneficial for some flash devices.

The main benefit is avoiding seeking on delta commit.
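
The selection rule implied by the first two points might be sketched as follows.  This is only an illustration under stated assumptions: the function name choose_metablock() is hypothetical, and metablock i is assumed to sit at block i * stride, which is a simplification of "distributed roughly evenly":

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the metablock nearest to `target`, skipping `last` (the index of the
 * most recently written metablock, or -1 if there is none), so that an
 * interrupted write can never damage the only valid copy.
 * Assumption: metablock i is located at block i * stride.
 */
static int choose_metablock(uint64_t target, uint64_t stride, int count, int last)
{
	assert(stride > 0 && count > 1);
	int best = -1;
	uint64_t best_dist = UINT64_MAX;
	for (int i = 0; i < count; i++) {
		if (i == last)
			continue;       /* never overwrite the last written metablock */
		uint64_t pos = (uint64_t)i * stride;
		uint64_t dist = pos > target ? pos - target : target - pos;
		if (dist < best_dist) {
			best_dist = dist;
			best = i;
		}
	}
	return best;
}
```

A linear scan is used only for clarity; with evenly spaced metablocks the nearest index could be computed directly from target / stride.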

Daniel Phillips | 8 Feb 11:27

Atomic commit status update

Logging and replay now function in the test_commit unit test.  Most of 
the atomic commit mechanism is now compiled into both the userspace and 
kernel code, though it is disabled to avoid breaking basic filesystem 
functionality during the prototyping period.

To run the prototype unit test:

  make UCFLAGS=-DATOMIC commit && ./commit testdev

The unit test also runs without defining ATOMIC, but then btree blocks 
are not redirected, so no log transactions and no log blocks are 
generated.

This test creates a new filesystem, runs some namespace transactions on 
it, evicts the pinned dirty btree metadata and attempts to reconstruct 
it via replay.  Reconstruction is not fully functional yet; however, 
the log blocks are (re)loaded and replayed successfully.  This creates 
a mountable filesystem in testdev.

Regards,

Daniel
OGAWA Hirofumi | 12 Feb 04:51

new blockdirty() for userspace

Hi,

This is a new blockdirty() strategy for userspace.

To give the backend a stable buffer, this forks whole buffers instead of
only their data.  This makes the backend simpler, because it no longer
needs to care about block forks while it is flushing buffers.

To do this, blockdirty() now returns the forked buffer.  [The error check
is a FIXME for now; it will be fixed along with the kernel change, because
the kernel blockdirty() will return -EAGAIN as a special case.]
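
A minimal sketch of the fork-and-return idea, for illustration only.  The struct buf type, the tiny BLOCK_SIZE and the name blockdirty_fork() are hypothetical; this is not the struct buffer_head or the blockdirty() from the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK_SIZE = 64 };       /* tiny block size, for illustration only */

/* Hypothetical, simplified buffer; not the real struct buffer_head. */
struct buf {
	unsigned delta;             /* delta in which the buffer was dirtied */
	int dirty;
	unsigned char data[BLOCK_SIZE];
};

/*
 * Mark a buffer dirty in `newdelta`.  If it is already dirty in a different
 * (older) delta, leave the old buffer untouched for the backend and return a
 * forked copy for the frontend to modify.  Returns NULL if allocation fails.
 */
static struct buf *blockdirty_fork(struct buf *buffer, unsigned newdelta)
{
	assert(buffer != NULL);
	if (!buffer->dirty || buffer->delta == newdelta) {
		buffer->dirty = 1;
		buffer->delta = newdelta;
		return buffer;          /* no fork needed */
	}
	struct buf *clone = malloc(sizeof *clone);
	if (!clone)
		return NULL;
	memcpy(clone->data, buffer->data, BLOCK_SIZE);
	clone->dirty = 1;
	clone->delta = newdelta;
	return clone;               /* caller continues with the fork */
}
```

Because the caller always continues with the returned buffer, the old buffer stays stable for as long as the backend needs to flush it.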

Userspace is simple, so more explanation will be done with kernel
blockdirty().

    static-http://userweb.kernel.org/~hirofumi/tux3/

Please review, and please pull if it's ok.
-- 
OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
Daniel Phillips | 15 Feb 23:44

Challenge: Make Tux3 work well with flash disks

Hi all,

Please see this well written analysis of performance loss as a 
new-generation Intel flash disk "ages":

   http://www.pcper.com/article.php?aid=669
   "Long-term performance analysis of Intel Mainstream SSDs"

Though I have not really analyzed the issues completely at this time, I 
have the feeling Intel made a slight mistake in the way they combine 
writes.  I think that what they do is this: they have a "current" flash 
block, which starts fully erased, then each write transfer is appended 
until it is full.  So writes are combined in write order, which is a 
lot like the deduplication plan the Pune Institute students are 
pursuing.  The bucket idea is likely to have advantages and drawbacks 
similar to Intel's SSD write strategy.

The problem in both cases is the effect of rewrites, which cause data to 
be relocated away from its original position, leaving holes at the 
original position.  This may not be as big a problem with deduplication 
if the target application is mainly archive, but it is a serious and 
visible problem with a flash device that intends to act like a disk 
drive.

What happens is, when Intel's disk fills and ages, the best candidate 
block for erasing will have a high percentage of valid data on it, 
which has to be copied to a new location.  The performance of the disk 
under a steady write load will thus drop to a fraction of the erase 
speed, because a portion of the space recovered by erasing has to be 
used to store valid data relocated from candidate erase blocks.
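
The arithmetic behind that claim can be sketched with an idealized model.  It assumes, and this is my simplification rather than anything measured, that the only cost is the space consumed by relocated data; the time spent copying is ignored.  The function name effective_write_rate() is hypothetical:

```c
#include <assert.h>

/*
 * Net rate at which new data can be accepted, given the raw erase rate and
 * the fraction of each erased block that held valid data and had to be
 * relocated.  Units are whatever erase_rate uses (for example MB/s).
 * Idealized: copy time and controller overhead are ignored.
 */
static double effective_write_rate(double erase_rate, double valid_fraction)
{
	assert(valid_fraction >= 0.0 && valid_fraction <= 1.0);
	return erase_rate * (1.0 - valid_fraction);
}
```

Under this model, if the best erase candidate is still 75% valid, only a quarter of the erase rate remains available for new writes.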

Daniel Phillips | 19 Feb 07:06

Tux3 Report: Tux3 boots up as root

Yesterday at 17:59 Japan Standard Time, Hirofumi Ogawa booted Linux to a 
Tux3 root filesystem for the first time in recorded history.  This 
notable feat was repeated by me today, without any help from a separate 
boot partition.

I toasted this auspicious occasion with a tall glass of Sake in honor of 
Hirofumi, who has done the vast majority of the kernel port, and lately 
has taken on a good chunk of the main design work as well.

Tux3 will be formally presented at Scale 7X, in Los Angeles this Sunday:

   http://scale7x.socallinuxexpo.org/conference-info/schedules

Thanks to the recent hard work, Tux3 will be able to attend this event 
in person as root on my trusty Shuttle.

Tux3 is still not ready to store real data though.  It still does not 
have crash recovery, although work is proceeding well in that direction.  
Fsck would be nice too.  And of course we want versioning, the reason 
Tux3 came to be a project in the first place.  There are a few other 
issues as well.  Our plan remains the same:

   1) Get atomic commit and recovery working

   2) Present Tux3 for review

   3) Implement versioning during the review cycle

This way, we can get a few more eyeballs on the design as some of the 
formative elements solidify.  And ideally a few more hands to help pull 

Marcin Pohl | 21 Feb 16:57

Tux3 kernel panic dump

64-bit Arch Linux system on an Intel Q6600 with 6 GB of RAM.
This dump happened while untarring a kernel.

The picture didn't quite capture the full screen, so if you need more,
please let me know.

--Marcin
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3

Re: Tux3 kernel panic dump

2009/2/21 Marcin Pohl <marcinpohl <at> gmail.com>:
> 64bit archlinux system on intel q6600 with 6gb of ram.
> this dump happened during untarring a kernel

I do not see any tux3/fs-specific calls in the trace.
To me it looks like a block-layer-related panic during bootup.
Yes/no?

Maybe Daniel can answer that better than me.

Thanks,
    --Pradeep
>
> the picture didnt quite capture the full screen, so if you need more,
> please let me know
>
> --Marcin

-- 
Pradeep Singh Rautela
http://eagain.wordpress.com
http://emptydomain.googlepages.com

OGAWA Hirofumi | 22 Feb 19:21

kernel/ileaf.c fixes and cleanup

Hi,

I've found several bugs in kernel/ileaf.c.  Even if all inodes in an
ileaf are removed, we don't remove the ileaf itself, at least for now.
So we have to handle an empty ileaf, but some places do not handle that
case correctly.

And several cleanups.

	static-http://userweb.kernel.org/~hirofumi/tux3/

I have not reviewed these myself yet; they are posted for review.  So
please don't pull them yet.  However, with this patchset, the fsstress
test runs for much longer before failing.

Thanks.
-- 
OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
