Daniel Phillips | 1 Aug 06:58

leaf_split has landed

Another messy little function for the file index btree set of btree
leaf functions has landed, leaf_split:

   http://tux3.org/tux3?f=0fed5998fbec;file=test/leaf.c

This appears to work.  It was able to split the sample leaf at any
entry position, include zero (leaving an empty leaf in the original
btree block) or num_entries (leaving an empty leaf in the split block).

This was not a whole lot easier to write than leaf_insert.  In fact,
this whole two level design makes the coding quite hard, but the
benefit is clearly worth the effort I think.  The functions so far are
very fast, and the structure is really, really compact.  Which will
help offset the fact that tux3 will often be loading a lot of version
data that it doesn't actually need in order to get at a particular
version, including the current version.  I do not think this will be a
problem in practice because heavily versioned files are actually pretty
rare, and cache effects most likely will not be noticeable until up in
the several hundred version range.  At that level, probably every
versioning filesystem is going to exhibit some cache effects, not just
Tux3.

And we have a long term plan for keeping the cache footprint of
versioned extents to a manageable level, even for thousands of
versions, described in the original lkml posting.

Another possibility is to keep around a "digest" metadata btree that
contains only the pointers for a particular version, and make that be
consistent somehow with the "full" weave-style metadata.  Since this
would be maintained only for the current version, total metadata still
(Continue reading)

Daniel Phillips | 2 Aug 02:46

Development roadmap

Here is an initial list of development tasks goals, with the more
immediate tasks in higher resolution.  I will be updating this as
things proceed.  First I will just try to list all the code bits
that need writing, organized into initial prototype, then the march
to full feature set.

This is a very rough draft, to be refined as we go.

Allocation

 - Initially use bitmap code lifted from ddsnap

Generic Btree

 - Generalize ddsnap btree operations to use tables of leaf and index
   node methods

Btree leaf methods

 - 11 leaf methods or so to implement for each of five different kinds
   of btree

 - Btree variants:
    - inode table
    - file index
    - free extents
    - atime table
    - directory

 - Leaf methods:
(Continue reading)

Daniel Phillips | 2 Aug 02:56

Tux3 wikipedia article

Whoever thinks that Tux3 should have a wikipedia entry can go here and
make arguments about why the entry kindly created by Stefan Hostetler
should not be deleted.  Please check out the guidelines re arguments
against deletion:

   http://en.wikipedia.org/wiki/Tux3
   http://en.wikipedia.org/wiki/Wikipedia:Deletion_policy
   http://en.wikipedia.org/wiki/Wikipedia:Arguments_to_avoid_in_deletion_discussions

I think Tux3 deserves a wikipedia entry because:

 - The Tux3 project is well underway.  It has lots of interest from
   both potential users and developers.

 - Functional code for parts of the implementation is publicly
   available.

 - There is a fair amount of credible commentary on the design.

 - The pointer versioning technology of Tux3 is real, implemented,
   tested and robust.

 - There is a substantive body of code and documentation supporting
   the design approaches being taken.

 - ddsnap which is released and near beta can be considered a working
   prototype of portions of Tux3

Regards,

(Continue reading)

Daniel Phillips | 2 Aug 03:04

Tux3 and virtualization

Tux3 and server virtualization will go very well together.  You create
one root volume image for all the servers, then take a snapshot of it
for each server instance.  Each server can then make changes to its
version of the root filesystem, and each server can furthermore make
snapshots of its snapshot for the purpose of replication or backup.

A virtual environment like UML can let the virtual server directly
mount the Tux3 filesystem on the host, while virtual environments
without such capability would have to loopback mount a single file
of the Tux3 host filesystem, which would be the root volume image.

Note that ddsnap can already do this.  The likelihood is that Tux3
will be able to do it more efficiently, at least until similar
algorithms are backported into ddsnap.

Regards,

Daniel
Daniel Phillips | 2 Aug 03:22

leaf_merge lands

The leaf_merge operation turned out quite short and readable compared
to leaf_split.  I wonder why that is.  It was only sightly easier to
write though.  Lot of scrambled eggs for data when the slightest thing
was out of place.  For a while I was seeing great accurate merges when
I had not even finished implementing the whole algorithm, then to my
surprise, when the whole algorithm was implemented, suddenly everything
was corrupt.  The reason: my test consisted of splitting one leaf into
two, then merging back into the same leaf.  Some original data from
before the split was still hanging around in original leaf, and due to
a bug, the data being copied back from the merged leaf was heading way
off into free space.  Not landing where it was supposed to at all.
Leaving the original, correct data in the right place.  Then naturally
there was a bug in the final step of the algorithm, creating the odd
effect of an incomplete program producing correct results that became
incorrect when completed.

I valground it this time.  Picked up a bug.  Moral of the story: always
valgrind.  Never do not valgrind.

Now for the last of the file index leaf operations: delete.  This is
different from the others in that it is more of a process than a single
operation.  Delete sweeps through an entire leaf running some algorithm
to decide if each extent should be kept, deleted or altered.  I want to
avoid embedding the version delete algorithm, which is complex, inside
the equally complex delete algorithm, which would invert the layering
I have in mind where the versioning code is at a higher level than the
leaf management, create a tie between two layers, and result in a big
ugly blob of a special purpose hack.  So instead, I think I will set up
the delete as a process that first initializes a "path" containing the
current state of a leaf traversal including pointing at the beginning
(Continue reading)

Antonio Vargas | 2 Aug 12:45
Picon
Gravatar

On avoiding deadlock when allocating space (and handling quota)


Hi Daniel, your first mention about managing to do deadlock-free quotas reminded me of the method I used on my userspace prototype of tux2.

When implementing the physical structure of the metatree, the metaroot replicas contained the first 16 or 32 inodes which were reserved for internal use. So, using that scheme, inode 1 had the free bitmap as the metatree, inode 2 had a list of blocks ready to use and inode 3 had as datablocks the rest of the inode table.

So, during mkfs time not only the blocks used for metadata were marked as used on the free bitmap, but also some blocks not yet used for actual data would be marked as used in the bitmap but would get added to inode 2 as a list of blocks ready to be allocated without having to touch the free bitmaps.

Then, when starting an operation which could potentially need many new blocks due to COW of the metadata trees, I would first of all ask for a upper bound of blocks to get reserved. These blocks would get reserved for this operation and be taken out of the ready-to-alloc list on memory, and any COW operations later needed would simply grab a block from that operation-private list in memory.

When an operation finished, any blocks which had been reserved would be returned to the ready-to-allocate list, and when a phase finished it would first initiate an internal-only operation to replenish that ready-to-allocate list from the bitmaps, using the previously existing ready-to-allocate blocks as a way to avoid needing to modify the free bitmap while it was being updated to get more blocks out of it.

Of course this means that the maximum length of a phase would be capped by the number of ready-to-alloc blocks we always kept already reserved on the free bitmap. But it also meant that deciding when to write a phase to disk was easy: when the number of in-memory ready-to-allocate blocks were reaching a number small enough to be unable to guarantee a worst case COW process on the bitmap tree.

I think these same methods could be adapted to work with extents for tux3, and can enable a deadlock free implementation of many operations without having to do much explicit work in each of the operations.

--
Greetz, Antonio Vargas Gonzalez aka winden of rgba^ntw^bg

http://winden.wordpress.com/
windenntw <at> gmail.com

Every day, every year
you have to work
you have to study
you have to scene.
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Shapor Naghibzadeh | 2 Aug 15:15
Picon
Gravatar

Re: Another modest code drop

On Wed, Jul 30, 2008 at 9:46 PM, Daniel Phillips <phillips <at> phunq.net> wrote:
> The split.c code is now generally better, and actually runs the test
> case correctly, which the previous version did not.
>
> Turning the groups vector upside down like the entries vector made
> things considerably more regular.  The leaf_insert function is now
> pretty readable, which is something of an achievement given all the
> corner case in it.
>
>   http://tux3.org/tux3?f=35d77cb24c1f;file=test/leaf.c
>
> One neat thing about this insert is, there is only one memmove on
> the fast path, and that one usually moves zero bytes.  Otherwise as
> can be seen, the fast path is very fast.

From leaf.c:

32  * The top level index is a table of groups of entries all having the same
33  * high 24 bits of logical address which is only stored once, along with the
34  * 8 bit count of entries in the group.  Since there can be more than 256
35  * entries at the same logical address, there could be more than one group
36  * with the same logical address (not handled yet, which limits the number of
37  * different versions in the leaf to 256).

I might be missing something here, but it looks like its worse than
that.  Since you don't handle more than one group with the same loghi,
if a group has more than 255 entries, the count simply loops and you
start corrupting the leaf.  The attached test illustrates the issue.

Shapor
Attachment (test.patch): application/octet-stream, 638 bytes
_______________________________________________
Tux3 mailing list
Tux3 <at> tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Daniel Phillips | 2 Aug 22:23

Re: Another modest code drop

On Saturday 02 August 2008 06:15, Shapor Naghibzadeh wrote:
> From leaf.c:
> 
> 32  * The top level index is a table of groups of entries all having the same
> 33  * high 24 bits of logical address which is only stored once, along with the
> 34  * 8 bit count of entries in the group.  Since there can be more than 256
> 35  * entries at the same logical address, there could be more than one group
> 36  * with the same logical address (not handled yet, which limits the number of
> 37  * different versions in the leaf to 256).
> 
> I might be missing something here, but it looks like its worse than
> that.  Since you don't handle more than one group with the same loghi,
> if a group has more than 255 entries, the count simply loops and you
> start corrupting the leaf.  The attached test illustrates the issue.

You did not miss anything, I did.  I just can't be quite that lazy, and
in the process we will remove that 255 version limitation.  Here is a
partial fix which is just to start a new group not only when loghi is
different, but when the groups count hits 255.  This additional split
criterion fails to preserve the partitioning of logical addresses
between the two groups so that all the members of one group are below
a given address and all the members of the other are above.  This can
be corrected by partitioning the members at group split time for the
overflow case.

-	if (group == grbase || loghi < group->loghi) {
+	if (group == grbase || loghi < group->loghi || group->count == 255) {

Anybody want to take a run at that?  It is admittedly pretty tricky
code to work on, but this is not harder than the stuff that already
works.

Merge can also violate the groups count limit, so an additional check
is needed there too.  But this case is simpler because no there is no
"unpartitioning" to do.

Regards,

Daniel
Shapor Naghibzadeh | 3 Aug 00:27
Picon
Gravatar

[PATCH] Check for leaf full

# HG changeset patch
# User shapor <at> yzf.shapor.com
# Date 1217715726 25200
# Node ID 1586642d110c4a0f8d6919bad6fdd7d51eaf70a2
# Parent  c23a601a39db5c6081b848ef398882e9faebcad2
Check for leaf full.

For now we test for worst case.  We may fit one less extent in the leaf
(worst case) to simplify by not checking for new entry creation in
advance.

diff -r c23a601a39db -r 1586642d110c test/leaf.c
--- a/test/leaf.c	Sat Aug 02 15:16:41 2008 -0700
+++ b/test/leaf.c	Sat Aug 02 15:22:06 2008 -0700
@@ -200,6 +200,8 @@ int leaf_insert(struct leaf *leaf, block
 	unsigned loglo = target & 0xffffff, loghi = target >> 24;
 	void *used = leaf->used + (void *)leaf;
 	// need room for one extent + maybe one group + maybe one entry
+	if (leaf_free(leaf) < sizeof(struct extent) + sizeof(struct group) + sizeof(struct entry) )
+		return -1;

 	/* find group position */
 	struct group *group;
Shapor Naghibzadeh | 3 Aug 08:44
Picon
Gravatar

[PATCH] Fix entry table overflow (by splitting groups)

# HG changeset patch
# User shapor <at> yzf.shapor.com
# Date 1217745661 25200
# Node ID 2c679288390fb11949fa30625be33b257db9ba02
# Parent  e0d569213df6ba1ed4e8d2029b8de596f71403ca
Fix entry table overflow (by splitting groups).
Add ENOSPC check for leaf.

diff -r e0d569213df6 -r 2c679288390f test/fleaf.c
--- a/test/fleaf.c	Sat Aug 02 13:52:11 2008 -0700
+++ b/test/fleaf.c	Sat Aug 02 23:41:01 2008 -0700
@@ -199,18 +199,20 @@ int leaf_insert(struct leaf *leaf, block
 	unsigned loglo = target & 0xffffff, loghi = target >> 24;
 	void *used = leaf->used + (void *)leaf;
 	// need room for one extent + maybe one group + maybe one entry
+	if (leaf_free(leaf) < sizeof(struct extent) + sizeof(struct group) + sizeof(struct entry))
+		return -1;

 	/* find group position */
 	struct group *group;
 	for (group = groups; group > grbase; group--) {
-		if (loghi <= group->loghi)
+		if (loghi < group->loghi || loghi == group->loghi && (group - 1 <= grbase || loghi != (group - 1)->loghi ||
loglo <= (entries - group->count)->loglo))
 			break;
 		extents += (entries - group->count)->limit;
 		entries -= group->count;
 	}

-	/* insert new group if no match  */
-	if (group == grbase || loghi < group->loghi || group->count == 255) {
+	/* insert new group if no match or split is required */
+	if (group == grbase || loghi < group->loghi || (entries - group->count)->limit == 255) {
 		printf("new group at %i\n", group - grbase);
 		memmove(used - sizeof(*group), used, (void *)(group + 1) - used);
 		*group = (struct group){ .loghi = loghi, .count = 0 };
@@ -218,6 +220,16 @@ int leaf_insert(struct leaf *leaf, block
 		grbase--;
 		entries--;
 		leaf->groups++;
+	}
+
+	/* split the group if its out of space */
+	if ((entries - (group - 1)->count)->limit == 255) {
+		group->count = (group - 1)->count / 2;
+		struct entry *ep = entries - 1 - group->count;
+		int remove = (ep + 1)->limit;
+		while (ep > entries - 1 - (group - 1)->count)
+			(ep--)->limit -= remove;
+		(group - 1)->count -= group->count;
 	}

 	/* find entry position */
@@ -367,9 +379,12 @@ int main(int argc, char *argv[])
 	printf("--- leaf test ---\n");
 	unsigned hi = 1 << 24, hi2 = 3 * hi;
 	unsigned targets[] = { 0x11, 0x33, 0x22, hi2 + 0x44, hi2 + 0x55, hi2 + 0x44, hi + 0x33, hi + 0x44, hi + 0x99 },
next = 0;
-	for (int i = 0; i < 260;i++)
-		leaf_insert(leaf, i << 13, (struct extent){ .block = i });
-	leaf_dump(leaf);
+	for (int j = 0; j < 2; j++) {
+		for (int i = 0; i < 130; i++)
+			for (int k = 0; k < 3; k++)
+				leaf_insert(leaf, (i * 2 + j) << 13, (struct extent){ .block = i, .version = k });
+		leaf_dump(leaf);
+	}
 	leaf_insert(leaf, targets[next++], (struct extent){ .block = 0x111 });
 	leaf_insert(leaf, targets[next++], (struct extent){ .block = 0x222 });
 	leaf_insert(leaf, targets[next++], (struct extent){ .block = 0x333 });

Gmane