Daniel Phillips | 1 Oct 2008 13:35

Tux3 Report: Extolling the Extensive Virtues of Extents

After a couple of weeks of painstakingly slow progress, Tux3 now is 
close enough to having extents that I feel it is time to sit back, 
limber up the proverbial knuckle joints for some heavy typing, and 
launch yet another Tux3 design missive into the wild blue yonder.

What is all this noise about extents and why is everybody doing it?  And 
what is an extent anyway?

Traditional filesystems record a pointer to each individual disk block, 
whereas modern extent-based filesystems record pointers to runs of disk 
blocks, each such extent having a starting address and a count of data 
blocks that store all or part of some file.  A file may consist of many 
extents, and may be sparse, which requires that logical addresses also 
be stored, in order to know the position of each extent within a file.  
An extent can therefore be thought of as a triple {logical address, 
physical address, block count} or alternatively, as a pair {physical 
address, block count} indexed by a logical address.  Tux3 uses the 
latter approach while Ext4 uses the former.

Extents themselves are simple things: just add a count to a block 
pointer and you have an extent, which is just what we did in Tux3.  The 
extent count is packed into 6 unused bits of a 64 bit physical block 
pointer.  "For free" you could say.

On the other hand, algorithms to handle extents are far from simple.  It 
is amazing how much complexity jumped out of the woodwork when that 
innocent little count field appeared.  For one thing, extents 
constitute variable sized logical objects so you can't store them in a 
radix tree as with the classic direct/indirect/double/triple scheme 
inherited by Ext2 from UFS.  A considerably more complex b-tree scheme 
(Continue reading)

Daniel Phillips | 1 Oct 2008 23:07

Thinking about losing the dleaf free space vars

Currently we have:

	unsigned leaf_free(BTREE, vleaf *leaf)
	{
		return to_dleaf(leaf)->used - to_dleaf(leaf)->free;
	}

and the same result is produced by:

	int leaf_free2(BTREE, void *vleaf)
	{
		struct dleaf *leaf = vleaf;
		struct group *gdict = (void *)leaf + btree->sb->blocksize, *gstop = gdict - leaf->groups;
		struct entry *edict = (void *)gstop, *entry = edict;
		struct extent *extents = leaf->table;
		for (struct group *group = gdict - 1; group >= gstop; group--)
			extents += (entry -= group->count)->limit;
		return (void *)entry - (void *)extents;
	}

Which is pretty efficient.  I wonder whether it is worth maintaining
the free/used variables in the dleaf at all?  There are only two places
where we actually need to know: in filemap.c, to decide whether we need
to split the leaf before inserting new extents, and in btree.c to
decide whether we can merge two leaves.  In both cases we have just
traversed the leaf contents anyway, and done most of the work of
computing the leaf free space.

Against this, we have a considerable amount of code for maintaining the
free and used variables in the leaf during various operations.  I have
(Continue reading)

Shapor Naghibzadeh | 1 Oct 2008 23:53
Picon
Gravatar

Re: Thinking about losing the dleaf free space vars

I think at this point the tracking used/free explicitly will make
detecting bugs easier.  Perhaps it should be left for now, and removed
later as an optimization.

Shapor

On Wed, Oct 1, 2008 at 2:07 PM, Daniel Phillips <phillips <at> phunq.net> wrote:
> Currently we have:
>
>        unsigned leaf_free(BTREE, vleaf *leaf)
>        {
>                return to_dleaf(leaf)->used - to_dleaf(leaf)->free;
>        }
>
> and the same result is produced by:
>
>        int leaf_free2(BTREE, void *vleaf)
>        {
>                struct dleaf *leaf = vleaf;
>                struct group *gdict = (void *)leaf + btree->sb->blocksize, *gstop = gdict - leaf->groups;
>                struct entry *edict = (void *)gstop, *entry = edict;
>                struct extent *extents = leaf->table;
>                for (struct group *group = gdict - 1; group >= gstop; group--)
>                        extents += (entry -= group->count)->limit;
>                return (void *)entry - (void *)extents;
>        }
>
> Which is pretty efficient.  I wonder whether it is worth maintaining
> the free/used variables in the dleaf at all?  There are only two places
> where we actually need to know: in filemap.c, to decide whether we need
(Continue reading)

Daniel Phillips | 2 Oct 2008 23:59

Tux3 gets a high speed atom smasher

(repost to our list)

The Large Hadron Collider no longer has a monopoly on smashing atoms, 
now that Tux3 has weighed in with a bit of its own atom smashing.  
Probably the coolest idea that has seen the light of day so far in Tux3 
is the concept of xattr atoms: numeric tokens that stand for xattr 
names.  In contrast to pretty much every other filesystem out there, 
Tux3 looks up an xattr in two stages:

  1) Look up the attribute name in the atom table
  2) Search for the atom in the inode's xattr cache

The sticky issue here is that there is no limit on the number of 
different xattr names that are possible, which means that there is no 
limit on the size of the atom table.  The fact that the number of 
distinct names used in practice tends to be very small is, 
unfortunately, insufficient justification to ignore the fact that 
nothing prevents the atom table from growing without bound.  Unless 
preventative measures are taken, any unprivileged user could carry out 
a denial of service attack simply by creating and deleting random 
xattrs on a file until the entire disk volume is filled with unused 
xattr names.

Addressing this problem required a lot of soul searching and wild design 
banter.  Finally we convinced ourselves that the only way to avoid the 
alligators in that marsh would be to implement full blown persistent 
reference counting for xattr names.  A reverse mapping table would also 
be necessary in order to be able to implement listxattr efficiently.  
Though these things sound scary and complex, it ended up being 
implemented in a really small amount of nicely efficient code.
(Continue reading)

Daniel Phillips | 3 Oct 2008 18:03

Extent support has landed

Tux3 now has extent support, for now and evermore.  I flipped over the 
#defines to use extents for more than just unit testing.  The tux3 user 
command now runs with extents and so do both of the Fuse versions.  I 
did not test any of these!  Somebody, kindly check and see what breaks.

I did test extents very lightly using the inode.c unit test, which 
successfully writes and reads back "hello world" and somewhat more 
exhaustively using the filemap.c unit tests, but this is still very 
green code.  I expect a number of bugs to turn up.

Developing the extent support was by no means a cut and paste job.  On 
the contrary it was nearly three weeks of grinding slow work.  This is 
new territory, quite unlike any filesystem code I have written before, 
and there was precious little guidance out there on the net.  The 
combinatorics are fairly horrifying as I touched on in an earlier post.  
There will be a longish post coming soon on the extent machinery and 
the new api I created for processing and editing extents, which worked 
out pretty well.  For now I will briefly describe how the filemap 
operations are structured.

Both buffer flush and buffer read are handled by the same function, 
filemap_extent_io, because much of the code is exactly the same.  No 
sense in letting thr common parts drift apart and end up making us fix 
bugs twice.

The extent io code is driven by the block-at-a-time bread and bwrite 
interfaces, which is perverse but it is much the way things work in 
kernel at the moment.  Naturally, we would really like big reads and 
writes to drive the extent interface directly, but if we choose to go 
that route we will basically get stuck with the task of re-implementing 
(Continue reading)

Daniel Phillips | 4 Oct 2008 01:41

Re: Extent support has landed

On Friday 03 October 2008 09:03, I wrote:
> Once we have the extent, we probe into the btree, 64 blocks below the 
> beginning of the extent, which ensures that no existing extents that 
> may overlap the io extent are missed...

...and I should mention, this constitutes a small efficiency that can be 
easily corrected.  Each btree leaf pointer simply needs to have a bit 
reserved to indicate that some extents may overlap into the next btree 
leaf, and we check that bit during the probe to determine whether to 
seek directly to the target extent address, or to the predecessor node.

This is the strategy used by HTree, and is one of the unique elements of 
HTree that distinquishes it from a standard BTree.  This "continuation" 
bit is to be implmented in the generic btree code so that all the Tux3 
btree flavors can take advantage of it, and therefore will be a step in 
the direction of implementing HTree and PHTree for tux3.

Regards,

Daniel
Daniel Phillips | 4 Oct 2008 07:55

Tux3 Report: It's What Next time once again

Today was one of those skate-in-the-dark kind of days here in Paradise, 
but whereas the Tux3 cabal quite enjoys the occasional dimly lit 
runabout in the concrete jungle, we make it a point not to keep you, 
loyal readers, in the dark about upcoming Tux3 development directions.

The past three week's work made something of a liar of me, as we 
completely ignored the task of integrating versioned pointers that I 
had talked about earlier.  Instead, the Cabal decided that we would be 
better served by putting the effort into extents, in the interest of 
benchmarking well right from the start.

In fact, Tux3 will never actually have versioned pointers, but rather 
versioned extents, a much more ambitious proposition and messy enough 
that we now plan to defer that coding until some time after the initial 
kernel port, which is now expected to land with atomic commit and 
extended attributes but without versioning.  Another way of putting it: 
I promised to cut corners by leaving something for later, and that 
something is going to be versioning.  Tux3 will thus first arrive in 
kernel as more or less a functional equivalent of Ext3 (with bigger 
limits and shiny atom thingies) instead of something more exotic.

Atomic commit is a biggish project in its own right.  As with xattr 
atoms and certain aspects of the extent strategy, we forge into 
unexplored territory here with some new ideas on how to handle this 
tricky task robustly and efficiently.  The concept of forward logging, 
which is just one part of the planned mechanism, was described in the 
original Tux3 announcement:

   http://lkml.org/lkml/2008/7/23/257

(Continue reading)

Philip Pokorny | 6 Oct 2008 08:33
Favicon

Encoding of extent information

I was wondering how you were encoding the length into "unused" bits of the extent pointers.

I've seen you use high-order bits elsewhere in your design, so I assume you took 6 bits from the top of the pointer thinking that 2^58 should be big enough for a block pointer.

I wonder if you have ever heard of this alternate encoding for extents.  It depends on several assumptions:

1. The minimum extent size is 2x the addressable unit size.  In the case of a block device with 512 byte sectors, that would be a 1k minimum extent (2 sectors).

2. Extents can only be of size 2^n

3. An extent of size 2^n is aligned on a 2^n addressable unit boundry.

For a filesystem, these would seem to be reasonable assumptions.

For any given number N (except 0) that describes an extent, the first unit in the extent is given by the expression "N & (N-1)", the last unit is "N | (N-1)" and the width of the extent is "(N ^ (N-1))+1"

Forcing extents to be aligned on 2^n boundaries of the underlying block device should improve I/O to composite devices like RAID arrays (which have 2^n "chunk" sizes).

This scheme imposes no limit on the size of an extent and with no sacrifice of high order bits.  Further, using this scheme low-order bits can be used for other purposes at the cost of a larger minimum extent width.  Current filesystems are generally limited to 4k block sizes which would allow for 2 low-order bits for "special purposes" which would be masked off before the above expressions.

To illustrate:

N = 200   is a 16 block extent from 192 to 207

it can be devided into two 8 block extents: N=196 (192-199) and N=204 (200-207)
or into four 4 block extents: N=194 (192-195), N=198 (196-199), N=202 (200-203), N=206 (204-207)
or finally into eight 2 block extents:  N= 193, 195, 197, 199, 201, 203, 205, 207

Just a thought...

--
Philip Pokorny, RHCE
Technical Director - Penguin Computing
Voice: 415-370-0835  Toll free: 888-PENGUIN
www.penguincomputing.com

_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Pranith Kumar | 8 Oct 2008 08:27
Picon

Errors with latest build

Hey all,

I did a make tests with the latest build and got the following errors, been trying to make head and tail of it.
May be someone can point what's wrong to me :) or may be Ill beat them to it :-P

bobby <at> lappy:~/proj/tux3/user/test$ make tests
gcc -std=gnu99 -Wall -g -D_FILE_OFFSET_BITS=64 vfs.o dleaf.c -o dleaf
valgrind --error-exitcode=200 --leak-check=full ./dleaf
==5364== Memcheck, a memory error detector.
==5364== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==5364== Using LibVEX rev 1804, a library for dynamic binary translation.
==5364== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==5364== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework.
==5364== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==5364== For more details, rerun with: -v
==5364==
--- leaf test ---
dwalk_probe: probe for 0x3000055
dwalk_probe: extent = 0, exstop = 0
==5364== Invalid read of size 4
==5364==    at 0x804C342: main (dleaf.c:641)
==5364==  Address 0x4184428 is 0 bytes after a block of size 1,024 alloc'd
==5364==    at 0x4022AB8: malloc (vg_replace_malloc.c:207)
==5364==    by 0x804A0F6: leaf_create (dleaf.c:102)
==5364==    by 0x804C209: main (dleaf.c:632)
==5364==
==5364== Invalid read of size 4
==5364==    at 0x804C356: main (dleaf.c:642)
==5364==  Address 0x4184428 is 0 bytes after a block of size 1,024 alloc'd
==5364==    at 0x4022AB8: malloc (vg_replace_malloc.c:207)
==5364==    by 0x804A0F6: leaf_create (dleaf.c:102)
==5364==    by 0x804C209: main (dleaf.c:632)
==5364==
==5364== Invalid read of size 1
==5364==    at 0x804B375: dwalk_pack (dleaf.c:419)
==5364==    by 0x804C407: main (dleaf.c:645)
==5364==  Address 0x4184428 is 0 bytes after a block of size 1,024 alloc'd
==5364==    at 0x4022AB8: malloc (vg_replace_malloc.c:207)
==5364==    by 0x804A0F6: leaf_create (dleaf.c:102)
==5364==    by 0x804C209: main (dleaf.c:632)
==5364==
==5364== Invalid read of size 1
==5364==    at 0x804B387: dwalk_pack (dleaf.c:419)
==5364==    by 0x804C407: main (dleaf.c:645)
==5364==  Address 0x4184428 is 0 bytes after a block of size 1,024 alloc'd
==5364==    at 0x4022AB8: malloc (vg_replace_malloc.c:207)
==5364==    by 0x804A0F6: leaf_create (dleaf.c:102)
==5364==    by 0x804C209: main (dleaf.c:632)
group -1/0 at entry -1/0
dwalk_pack: add entry 0x3001001
dwalk_pack: add group 0
dwalk_pack: add extent 0
group 1/1 at entry 2/1
dwalk_pack: add entry 0x3001002
dwalk_pack: add extent 1
group 1/1 at entry 4/2
dwalk_pack: add entry 0x3001003
dwalk_pack: add extent 2
group 1/1 at entry 6/3
dwalk_pack: add entry 0x3001004
dwalk_pack: add extent 3
group 1/1 at entry 8/4
dwalk_pack: add entry 0x3001005
dwalk_pack: add extent 4
group 1/1 at entry 10/5
dwalk_pack: add entry 0x3001006
dwalk_pack: add extent 5
1 entry groups:
==5364==
==5364== Conditional jump or move depends on uninitialised value(s)
==5364==    at 0x804A3E0: dleaf_dump (dleaf.c:157)
==5364==    by 0x804C65C: main (dleaf.c:653)
==5364==
==5364== Conditional jump or move depends on uninitialised value(s)
==5364==    at 0x804A472: dleaf_dump (dleaf.c:159)
==5364==    by 0x804C65C: main (dleaf.c:653)
==5364==
==5364== Use of uninitialised value of size 4
==5364==    at 0x804A408: dleaf_dump (dleaf.c:160)
==5364==    by 0x804C65C: main (dleaf.c:653)
  0/6: 3001001 => 1/1; 3001002 => 2/1; 3001003 => 3/1; 3001004 => 4/1; 3001005 => 5/1; 3001006 => 6/1;
==5364==
==5364== Conditional jump or move depends on uninitialised value(s)
==5364==    at 0x804B7D0: dleaf_check (dleaf.c:469)
==5364==    by 0x804C674: main (dleaf.c:654)
==5364==
==5364== Conditional jump or move depends on uninitialised value(s)
==5364==    at 0x804B805: dleaf_check (dleaf.c:472)
==5364==    by 0x804C674: main (dleaf.c:654)
==5364==
==5364== ERROR SUMMARY: 29 errors from 9 contexts (suppressed: 11 from 1)
==5364== malloc/free: in use at exit: 1,024 bytes in 1 blocks.
==5364== malloc/free: 1 allocs, 0 frees, 1,024 bytes allocated.
==5364== For counts of detected errors, rerun with: -v
==5364== searching for pointers to 1 not-freed blocks.
==5364== checked 60,220 bytes.
==5364==
==5364==
==5364== 1,024 bytes in 1 blocks are definitely lost in loss record 1 of 1
==5364==    at 0x4022AB8: malloc (vg_replace_malloc.c:207)
==5364==    by 0x804A0F6: leaf_create (dleaf.c:102)
==5364==    by 0x804C209: main (dleaf.c:632)
==5364==
==5364== LEAK SUMMARY:
==5364==    definitely lost: 1,024 bytes in 1 blocks.
==5364==      possibly lost: 0 bytes in 0 blocks.
==5364==    still reachable: 0 bytes in 0 blocks.
==5364==         suppressed: 0 bytes in 0 blocks.
make: *** [dleaftest] Error 200
bobby <at> lappy:~/proj/tux3/user/test$

Thanks,
--
Pranith.
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 8 Oct 2008 08:31

Re: Encoding of extent information

On Sunday 05 October 2008 23:33, Philip Pokorny wrote:
> I was wondering how you were encoding the length into "unused" bits of the extent pointers.
> 
> I've seen you use high-order bits elsewhere in your design, so I assume you took 6 bits from the top of the
pointer thinking that 2^58 should be big enough for a block pointer.
> 
> I wonder if you have ever heard of this alternate encoding for extents.  It depends on several assumptions:
> 
> 1. The minimum extent size is 2x the addressable unit size.  In the case of a block device with 512 byte
sectors, that would be a 1k minimum extent (2 sectors).
> 
> 2. Extents can only be of size 2^n
> 
> 3. An extent of size 2^n is aligned on a 2^n addressable unit boundry.
> 
> For a filesystem, these would seem to be reasonable assumptions.
> 
> For any given number N (except 0) that describes an extent, the first unit in the extent is given by the
expression "N & (N-1)", the last unit is "N | (N-1)" and the width of the extent is "(N ^ (N-1))+1"
> 
> Forcing extents to be aligned on 2^n boundaries of the underlying block device should improve I/O to
composite devices like RAID arrays (which have 2^n "chunk" sizes).
> 
> This scheme imposes no limit on the size of an extent and with no sacrifice of high order bits.  Further,
using this scheme low-order bits can be used for other purposes at the cost of a larger minimum extent
width.  Current filesystems are generally limited to 4k block sizes which would allow for 2 low-order bits
for "special purposes" which would be masked off before the above expressions.
> 
> To illustrate:
> 
> N = 200   is a 16 block extent from 192 to 207
> 
> it can be devided into two 8 block extents: N=196 (192-199) and N=204 (200-207)
> or into four 4 block extents: N=194 (192-195), N=198 (196-199), N=202 (200-203), N=206 (204-207)
> or finally into eight 2 block extents:  N= 193, 195, 197, 199, 201, 203, 205, 207
> 
> Just a thought...

Wow, that is diabolical, it is certainly worth thinking about.  There
may be other benefits of the alignment restriction, such as helping
control fragmentation.   But does it not require more extents to
represent odd sized files at the short end of the scale?  Which is not
necessarily a problem - we might find it useful to allocate a little
more extent space than actually needed anyway.  It needs a quite a bit
more analysis.  The scheme does save bits all right.  Very interesting.

Out of curiosity, where did you run into this?

Regards,

Daniel

Gmane