Maciej Piechotka | 12 May 2013 21:21
Picon

Tux3 - mmap and snapshotting

Hi,

I've become interested in testing tux3 but I have a few questions before I try 
it out:

1. In todo list there was mentioned mmap support. What's happening currently 
on mmap call (does it fail or is it just using for example slower emulation by 
kernel using other entry points)?

2. Is tux3 lvm snapshot safe (i.e. I can create a snapshot while fs is 
mounted)? I understend that tux3 native snapshotting will be added in future 
but some form of snapshotting is required by my current backup strategy

3. Have the benchmarks been done using sdd or traditional hdd? How does tux3 
handles fragmentation if it was sdd?

Best regards
Raymond Jennings | 12 May 2013 19:50
Picon

Updated pre-merge checklist

When will tux3 be ready to merge to the kernel?

Also, what else needs to b edone?

Those latest performance numbers have me drooling and I can't wait to migrate.
_______________________________________________
Tux3 mailing list
Tux3 <at> phunq.net
http://phunq.net/mailman/listinfo/tux3
Liu Yuan | 8 May 2013 10:45
Picon

[PATCH] kernel: fix a compile error

From: Liu Yuan <tailai.ly <at> taobao.com>

make -C /lib/modules/`uname -r`/build/ M=`pwd` CONFIG_TUX3=m modules
make[1]: Entering directory `/home/yliu/linux-src'
  CC [M]  /home/yliu/tux3/user/kernel/dir.o
/home/yliu/tux3/user/kernel/dir.c: In function ‘tux_create_entry’:
/home/yliu/tux3/user/kernel/dir.c:148:4: error: ‘rec_len’ undeclared (first use in this function)
/home/yliu/tux3/user/kernel/dir.c:148:4: note: each undeclared identifier is reported only once for
each function it appears in
/home/yliu/tux3/user/kernel/dir.c:130:52: error: unused variable ‘orec_len’ [-Werror=unused-variable]
cc1: all warnings being treated as errors
...

Signed-off-by: Liu Yuan <tailai.ly <at> taobao.com>
---
 user/kernel/dir.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/user/kernel/dir.c b/user/kernel/dir.c
index 93485ba..ee8c413 100644
--- a/user/kernel/dir.c
+++ b/user/kernel/dir.c
 <at>  <at>  -127,7 +127,7  <at>  <at>  loff_t tux_create_entry(struct inode *dir, const char *name, unsigned len,
 	struct sb *sb = tux_sb(dir->i_sb);
 	tux_dirent *entry;
 	struct buffer_head *buffer, *clone;
-	unsigned reclen = TUX_REC_LEN(len), rec_len, name_len, offset;
+	unsigned reclen = TUX_REC_LEN(len), name_len = 0, rec_len, offset;
 	unsigned blocksize = sb->blocksize;
 	block_t block, blocks = *size >> sb->blockbits;
 	void *olddata;
--

-- 
1.7.9.5

_______________________________________________
Tux3 mailing list
Tux3 <at> phunq.net
http://phunq.net/mailman/listinfo/tux3
Daniel Phillips | 8 May 2013 01:24
Picon

Tux3 Report: Faster than tmpfs, what?

When something sounds to good to be true, it usually is. But not always. Today
Hirofumi posted some nigh on unbelievable dbench results that show Tux3
beating tmpfs. To put this in perspective, we normally regard tmpfs as
unbeatable because it is just a thin shim between the standard VFS mechanisms
that every filesystem must use, and the swap device. Our usual definition of
successful optimization is that we end up somewhere between Ext4 and Tmpfs,
or in other words, faster than Ext4. This time we got an excellent surprise.

The benchmark:

    dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)

Configuration:

    KVM with two CPUs and 4 GB memory running on a Sandy Bridge four core host
    at 3.4 GHz with 8 GB of memory. Spinning disk. (Disk drive details
to follow.)

Summary of results:

    tmpfs: Throughput 1489.00 MB/sec max_latency=1.758 ms
    tux3:  Throughput 1546.81 MB/sec max_latency=12.950 ms
    ext4:  Throughput 1017.84 MB/sec max_latency=1441.585 ms

Tux3 edged out Tmpfs and stomped Ext4 righteously. What is going on?
Simple: Tux3 has a frontend/backend design that runs on two CPUs. This
allows handing off some of the work of unlink and delete to the kernel tux3d,
which runs asynchronously from the dbench task. All Tux3 needs to do in the
dbench context is set a flag in the deleted inode and add it to a dirty
list. The remaining work like truncating page cache pages is handled by the
backend tux3d. The effect is easily visible in the dbench details below
(See the Unlink and Deltree lines).

It is hard to overstate how pleased we are with these results. Particularly
after our first dbench tests a couple of days ago were embarrassing: more than
five times slower than Ext4. The issue turned out to be inefficient inode
allocation. Hirofumi changed the horribly slow itable btree search to a
simple "allocate the next inode number" counter, and shazam! The slowpoke
became a superstar. Now, this comes with a caveat: the code  that produces
this benchmark currently relies on this benchmark-specific hack to speed up
inode number allocation. However, we are pretty sure that our production inode
allocation algorithm will have insignificant additional overhead versus this
temporary hack. If only because "allocate the next inode number" is nearly
always the best strategy.

With directory indexing now considered a solved problem, the only big
issue we feel needs to be addressed before offering Tux3 for merge is
allocation. For now we use the same overly simplistic strategy to allocate
both disk blocks and inode numbers, which is trivially easy to defeat to
generate horrible benchmark numbers on spinning disk. So the next round
of work, which I hope will only take a few weeks, consists of improving
these allocators to at least a somewhat respectable level.

For inode number allocation, I have proposed a strategy that looks a lot
like Ext2/3/4 inode bitmaps. Tux3's twist is that these bitmaps are just
volatile cache objects, never transferred to disk. According to me, the
overhead of allocating from these bitmaps will hardly affect today's
benchmark numbers at all, but that remains to be proven.

Detailed dbench results:

tux3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX    1477980     0.003    12.944
    Close        1085650     0.001     0.307
    Rename         62579     0.006     0.288
    Unlink        298496     0.002     0.345
    Deltree           38     0.083     0.157
    Mkdir             19     0.001     0.002
    Qpathinfo    1339597     0.002     0.468
    Qfileinfo     234761     0.000     0.231
    Qfsinfo       245654     0.001     0.259
    Sfileinfo     120379     0.001     0.342
    Find          517948     0.005     0.352
    WriteX        736964     0.007     0.520
    ReadX        2316653     0.002     0.499
    LockX           4812     0.002     0.207
    UnlockX         4812     0.001     0.221
    Throughput 1546.81 MB/sec  1 clients  1 procs  max_latency=12.950 ms

tmpfs:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX    1423080     0.004     1.155
    Close        1045354     0.001     0.578
    Rename         60260     0.007     0.470
    Unlink        287392     0.004     0.607
    Deltree           36     0.651     1.352
    Mkdir             18     0.001     0.002
    Qpathinfo    1289893     0.002     0.575
    Qfileinfo     226045     0.000     0.346
    Qfsinfo       236518     0.001     0.383
    Sfileinfo     115924     0.001     0.405
    Find          498705     0.007     0.614
    WriteX        709522     0.005     0.679
    ReadX        2230794     0.002     1.271
    LockX           4634     0.002     0.021
    UnlockX         4634     0.001     0.324
    Throughput 1489 MB/sec  1 clients  1 procs  max_latency=1.758 ms

ext4:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX     988446     0.005    29.226
    Close         726028     0.001     0.247
    Rename         41857     0.011     0.238
    Unlink        199651     0.022  1441.552
    Deltree           24     1.517     3.358
    Mkdir             12     0.002     0.002
    Qpathinfo     895940     0.003    15.849
    Qfileinfo     156970     0.001     0.429
    Qfsinfo       164303     0.001     0.210
    Sfileinfo      80501     0.002     1.037
    Find          346400     0.010     2.885
    WriteX        492615     0.009    13.676
    ReadX        1549654     0.002     0.808
    LockX           3220     0.002     0.015
    UnlockX         3220     0.001     0.010
    Throughput 1017.84 MB/sec  1 clients  1 procs  max_latency=1441.585 ms

Apologies for the formatting. I will get back to a real mailer soon.

Regards,

Daniel
Lars Segerlund | 22 Mar 2013 15:47
Picon

mainlining ....

 Quick question, what requirements do we have for mainlining ?

 Perhaps a rough tasklist or goal would be nice to have, I can think
this as a way to motivate developers, much more than being mainline
itself.

 / regards, Lars Segerlund.
Raymond Jennings | 19 Mar 2013 09:52
Picon

kernel merge

What I've heard so far about tux3 is very promising.

When can consideration be given to merging it into the mainline linux kernel?

For starters it's a great way to increase the testing base, and I'm
actually confident enough in it to start using it on my desktop.
Daniel Phillips | 14 Mar 2013 02:16
Picon

Tux3 Report: Birds of a Feather flocked Together

Today's Tux3 report is a little shorter than usual, not because there is less to report but because we are very busy preparing for the next Tux3 report, expected to be long and hopefully interesting.

The first annual Tux3 BOF at FAST was held month in San Jose. Here is the group photo:

      http://phunq.net/files/tux3.bof.fast.2013.jpg

Some of the usual suspects may be easily recognized.[1] Hirofumi attended remotely from Tokyo via an audio link kindly provided by Shapor (on the left).

Though no notes were kept, the gist of the meeting can be summed up easily:

   1) Purpose: What niche does Tux3 expect to fill?
 
   2) Progress: What works now and what remains to be done?
 
   3) Time frame: How many person-years of development work remain?
 
Purpose: With its compact footprint and even more compact feature set, we agree that the natural niches for Tux3 are personal workstations and portable devices. While Tux3 is designed to scale to enterprise levels, such scalability has not yet been proven, some features required for enterprise deployment are not yet available, and a record of reliability has not yet been established. Therefore, the enterprise segment would be better served for the time being by such established projects as Ext4. This is not a serious issue given that the personal workstation and portable device segment already amounts to some hundreds of millions of potential users.

Progress: Tux3 has come far enough that there now can be no turning back. According to our initial measurements, we expect to raise the bar in terms of performance while at the same time introducing a stronger filesystem consistency model. The list of outstanding work items to bring Tux3 to a state suitable for testing by early adopters is quite short.

Timeframe: The question was raised of how many man years of development would be needed to bring Tux3 to a usable state. I suggested eight, which was immediately questioned on the basis that Ext4 and its predecessors easily account for several hundred man years of development. My answer was accordingly clarified to be the number of years that our small team can reasonably afford to dedicate to the project in order to bring it to the point where further manpower accretion tends to become exponential. This hockey stick effect is a necessary event in the lifetime of any filesystem project that will actually be deployed. It was generally thought that merging the code base would help bring that moment closer.

Beer: The beer, provided by tux3.org, was excellent. Several instances may be observed actively deployed in the photo. Next year, we hope and expect that our beer will be kindly provided for us by some wealthier organization.

So long for now. Next week I will report in some detail on our recent efforts to develop a next-generation directory index for Tux3.

[1] The gentleman operating the camera itself did not wish to be recognized at this time.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 <at> phunq.net
http://phunq.net/mailman/listinfo/tux3
Daniel Phillips | 28 Jan 2013 06:55
Picon

Tux3 Report: Initial fsck has landed

Initial Tux3 fsck has landed

Things are moving right along in Tux3 land. Encouraged by our great initial
benchmarks for in-cache workloads, we are now busy working through our to-do
list to develop Tux3 the rest of the way into a functional filesystem that a
sufficiently brave person could actually mount.

At the top of the to-do list is "fsck". Because really, fsck has to rank as
one of the top features of any filesystem you would actually want to use.
Ext4 rules the world largely on the strength of e2fsck. Not just fsck, but
certainly that is a large part of it. Accordingly, we have set our sights on
creating an e2fsck-quality fsck in due course.

Today, I am happy to be able to say that a first draft of a functional Tux3
fsck has already landed:

    https://github.com/OGAWAHirofumi/tux3/blob/master/user/tux3_fsck.c

Note how short it is. That is because Tux3 fsck uses a "walker" framework
shared by a number of other features. It will soon also use our suite of
metadata format checking methods that were developed years ago (and still
continue to be improved).

The Tux3 walker framework (another great hack by Hirofumi, likewise the
initial fsck) is interesting in that it evolved from tux3graph, Hirofumi's
graphical filesystem structure dumper. And before that, it came from our btree
traversing framework, which came from ddsnap, which came from HTree, which
came from Tux2. Whew. Nearly a 15 year history for that code when you trace
it all out.

Anyway, the walker is really sweet. You give it a few specialized methods and
poof, you have an fsck. So far, we just check physical referential integrity:
each block is either free or is referenced by exactly one pointer in the
filesystem tree, possibly as part of a data extent. This check is done with
the help of a "shadow bitmap". As we walk the tree, we mark off all referenced
blocks in the shadow bitmap, complaining if already marked. At the end of
that, the shadow file should be identical to the allocation bitmap inode. And
more often than not, it is.

Cases where we actually get differences are now mostly during hacking, though
of course we do need to be checking a lot more volumes under different loads
to have a lot of confidence about that. As a development tool, even this very
simple fsck is a wonderful thing.

Tux3 fsck is certainly not going to stay simple. Here is roughly where we are
going with it next:

    http://phunq.net/pipermail/tux3/2013-January/001976.html
    "Fsck Revisited"

To recap, next on the list is checking referential integrity of the directory
namespace, a somewhat more involved problem than physical structure, but not
really hard. After that, the main difference between this and a real fsck
will be repair. Which is a big topic, but it is already underway. First simple
repairs, then tricky repairs.

Compared to Ext2/3/4, Tux3 has a big disadvantage in terms of fsck: it does
not confine inode table blocks to fixed regions of the volume. Tux3 may store
any metadata block anywhere, and tends to stir things around to new locations
during normal operation. To overcome this disadvantage, we have the concept of
uptags:

    http://phunq.net/pipermail/tux3/2013-January/001973.html
    "What are uptags?"

With uptags we should be able to fall back to a full scan of a damaged volume
and get a pretty good idea of which blocks are actually lost metadata blocks,
and to which filesystem objects they might belong.

Free form metadata has another disadvantage: we can't just slurp it up from
disk in huge, efficient reads. Instead we tend to mix inode table blocks,
directory entry blocks, data blocks and index blocks all together in one big
soup so that related blocks live close together. This is supposed to be great
for read performance on spinning media, and should also help control write
multiplication on solid state devices, but it is most probably going to suck
for fsck performance on spinning disk, due to seeking.

So what are we going to do about that? Well, first we want to verify that
there is actually an issue, as proved by slow fsck. We already suspect that
there is, but some of the layout optimization work we have underway might go
some distance to fixing it. After optimizing layout, we will probably still
have some work to do to get at least close to e2fsck performance. Maybe we can
come up with some smart cache preload strategy or something like that.

The real problem is, Moore's Law just does not work for spinning disks. Nobody
really wants their disk spinning faster than 72000 rpm, or they don't want to
pay for it. But density goes up as the square of feature size. So media
transfer rate goes up linearly while disk size goes up quadratically. Today,
it takes a couple of hours to read each terabyte of disk. Fsck is normally
faster than that, because it only reads a portion of the disk, but over time,
it breaks in the same way. The bottom line is, full fsck just isn't a viable
thing to do on your system as a standard, periodic procedure. There is really
not a lot of choice but to move on to incremental and online fsck.

It is quite possible that Tux3 will get to incremental and online fsck before
Ext4 does. (There you go, Ted, that is a challenge.) There is no question that
this is something that every viable, modern filesystem must do, and no,
scrubbing does not cut the mustard. We need to be able to detect errors on the
filesystem, perhaps due to blocks going bad, or heaven forbid, bugs, then
report them to the user and *fix* them on command without taking the volume
offline. If that seems hard, it is. But it simply has to be done.

So that is the Tux3 Report for today. As usual, the welcome mat is out for
developers at oftc.net #tux3. Or hop on over and join our mailing list:

    http://phunq.net/cgi-bin/mailman/listinfo/tux3

We are open to donations of various kinds, particularly of your own awesome
developer power. We have an increasing need for testers. Expect to see a
nice simple recipe for KVM testing soon. Developing kernel code in userspace
is a normal thing in the Tux3 world. It's great. If you haven't tried it yet,
you should.

Thank you for reading, and see you on #tux3.

Regards,

Daniel
Daniel Phillips | 27 Jan 2013 09:21
Picon

Fsck Revisited

The story so far

Our prototype fsck uses a filesystem tree walking framework that was actually
actually evolved from Hirofumi's "tux3graph" graphical filesystem structure
generator. At this point, the most interesting thing our shiny new fsck does
is to verify bitmap allocation integrity:

  Verify unique block references and consistent allocation maps

   * Each block marked as used is referenced exactly once by the filesystem
     tree (possibly as part of an extent) and each block not referenced is
     marked free.

The algorithm for this is simple. Given a "shadow" bitmap file the same size
as the allocation bitmap file, initialized to zero, walk the entire
filesystem tree marking each referenced block as used in the shadow bitmap.
Then compare the allocation and shadow bitmap files: they should be equal. (A
few global blocks must be specially marked.)

This algorithm is easily extended to repair inconsistent bitmaps. Simply copy
any blocks that differ from the shadow to the allocation bitmap. This can be
done in cache, then the above check may be repeated to verify success, prior
to committing repaired bitmap blocks to disk. Note: the log may need to be
adjusted to remove pending bitmap updates. Alternatively, log entries may be
added rather than flushing bitmap blocks, creating a state consistent with a
future rollup.

This is actually a respectable part of what we need in a finished fsck. The
largest remaining issues involve directories:

  Verify directory entry referential integrity

   * Each directory entry must reference a valid inode and each valid inode
     must either be referenced by at least one directory entry or be in the
     orphan table or have an orphan entry in the log.

   * The number of directory entries referencing a given inode must be
     consistent with the inode link count attribute.

   * The type of each directory must be consistent with the type attribute
     of the referenced inode.

Algorithm

For now we can do this with a simple table, the "shadow inode number" table,
with one byte per inode number, initially zero. It is impractical to map the
entire 48 bit inode number range in this way, so we will also need to enforce
a limit on the range of allocated inode numbers. For the time being, that can
be the smallest binary power greater than the volume size in blocks, which I
call the volume "wrap size". (Volume wrap size will be important in other
contexts as well, volume resizing in particular.)

Each inode shadow map byte has two fields: a type field (3 bits) and a count
field (5 bits). The count field is too small to represent the maximum
possible link count, which in theory is the same as the number of inodes. For
practical purposes, we will therefore limit link count to less than 2**32.
Large link counts are stored in a "large" shadow inode map file, 32 bits per
inode. Large link counts will be rare, so the shadow inode map will be sparse,
and (except for pathological cases) small in terms of total blocks, even
though each entry in it might force an entire block to be allocated.

The algorithm has two passes. Pass one walks all directories in inode table
order. For each directory entry, the inode shadow map byte corresponding to
the directory entry inode number is updated. If the map byte is zero, the
inode number has not been seen before, so the map byte is set to a count of
one and the map byte type field is set to the type of the directory entry.
Otherwise, the map byte type field is checked to ensure that it corresponds to
the directory entry type, and the count field is incremented if less than its
maximum value. If the count field is at its maximum value, then the
corresponding field in the large shadow inum count map is incremented.

Pass two walks all inodes in the inode table, checking that the inode type
attribute matches the inode shadow map type and that the inode link count
attribute matches the shadow map count. If the shadow map count is at its
maximum value then the sum of the shadow map count and the large shadow map
count is used.

Pass one of this algorithm can be combined with the existing fsck inode table
walk. Pass two must be a second inode table walk. It is possible that we will
only ever need two complete inode table walks, unless errors are detected and
we decide to attempt repair.

If this algorithm completes without detecting errors then we have proved an
important component of referential integrity at the namespace level. However
there are other checks on directories we might wish to perform. We should
check that the name of each directory entry is unique. (Versioning will
increase the complexity of this test.) If the directory is indexed, we should
verify referential integrity of the index.

Directory Connectivity

We need to check that directory parent links form a completely connected,
acylic (inverted) tree and that all directories belong to this tree. For this,
we may wish to build the complete tree in memory in pass one. After pass two,
we walk the complete "shadow" tree starting from the root inode, and set each
corresponding shadow inode map byte to zero for each tree node. If some shadow
inode map byte is already zero, we have detected a cycle. After walking the
shadow tree, we check that the shadow inode number map is completely zero. If
not, then some directory subtrees are disconnected.

Repair

Here are a few ideas for types of repair we can do, ranging from practical and
immediately doable to rampant flights of fancy.

Perhaps the most straightforward repair is allocation bitmaps. There is not
much left to the imagination: after we have discovered all properly allocated
blocks and all plausible lost subtrees, each block is either referenced from
the filesystem tree or it is not. We may simply copy any blocks that differ
from the shadow bitmap to the allocation bitmap.

If the type given in some dirent does not match the type of the referenced
inode then we must guess which one is wrong (and they both may be). Examining
the contents of the file data tree may help: if it consists entirely of
properly tagged directory entry blocks then the inode most probably is a
directory.

If we have many lost inodes then we could search the entire volume for lost
blocks tagged as directory entry blocks. These may lead us back to a missing
directory file, or perhaps we can reconstruct a directory in its entirety from
lost directory entry blocks.

More repair ideas coming later. Thanks for reading.

Regards,

Daniel
Daniel Phillips | 21 Jan 2013 23:24
Picon

Allocation Strategy

Allocation Strategy

Here, I have jotted down some obvious facts about allocation strategy and
do not go too deeply into what we will actually implement for Tux3. Partly
because our thinking is still evolving, and partly because I want to keep this
note short. And admittedly, it ended up far from short in spite of my
considerable efforts to contain it. So please bear with me, and whoever
makes it all the way to the end, please let me know.

Goals:

  * Favor algorithms that optimize for spinning media in a way that
    also helps solid state media or at least does not harm it.[1]

  * Perfect is the enemy of good enough. We want to improve our initial
    layout and aging behavior to the point where Tux3 moves to the middle
    of the pack, but not spend a lot of time ust now attempting to be the
    very best. However, in principle there is no reason[2] why Tux3 cannot
    do layout at least as well as any existing filesystem,

Ideal Layout

Let us define the goodness of a given disk layout in terms of how many seeks
are needed to read all the files of some directory tree. For this purpose, a
seek is defined as any gap at all between the last block of a region that we
read into cache synchronously, and the first block of the next.

This definition of seeking does not quite match reality because a modern disk
will cache read data in units of entire tracks, so that a small gap between
read targets introduces only a small probability of head movement, increasing
as the gap becomes larger. This fact will help us in algorithm design, however
for the moment we will content ourselves with this theoretical definition.

The ideal layout is one where each block read into cache by a recursive
traversal of all the data in a directory tree immediately follows the
previously read block. Assuming that the top level directory inode is already
cached, we proceed to read the first block of the directory. This requires
reading some btree metadata blocks, typically one tree node and one data
extent leaf (dleaf) in this case, followed by reading the directory data
block itself. So the ideal layout places these first three blocks in the
order: node, leaf, data.

The directory block contains entries that reference inodes. For the ideal
layout, we require that the storage order of directory entries corresponds
exactly to the referenced inode numbers. Now we need to read the inode of the
first directory entry in the block. That requires reading an inode table
block (ileaf) into cache, which in turn typically requires reading one more
btree node. So the next two blocks in our ideal layout should be a inode table
index block and the inode table block that contains the directory entry's
inode. The ileaf should be followed immediately by a data tree index node and
a dleaf, then all the data blocks of the file. If the file has so many data
extents that more dleaf blocks are needed, these will be interleaved with file
data extents so that each is encountered during recursive traversal exactly
when needed.

Continuing the traversal, more directory entry blocks and inode table blocks
are needed from time to time, and as above, the ideal layout places these and
their associated index blocks in position to be read just when needed, with no
seeks.

Aging

It would be wonderful if we could always lay out data on disk in exactly this
ideal pattern, however practical issues get in the way. We are allowed to go
back and rewrite file data at any point, and even if the size is unchanged, we
will not be able to write the new data at the original location because that
would violate Tux3's atomic commit guarantee: the new data must be durably
recorded on media before any old data may be overwritten. So file rewrites
always move data blocks to new locations, and usually move a few metadata
blocks as well. So much for our ideal layout aspirations.

Deletes also leave gaps in our ideal layout, consisting of free data blocks,
freed directory records, and gaps in the inode table, in addition to
relocating a few metadata blocks. This is called "aging".

Our challenge is to define algorithms that tend to maintain a layout similar
in some sense to our ideal layout, as a volume ages. I will now briefly
describe some of the tools I have identified to help us implement some
initial approximations of production quality layout algorithms.

Allocation Goal

Our primary tool is the "allocation goal", that is, the region of the volume
we will search for free space to store some metadata block or data extent. In
many cases, the best place to search is immediately after the last place we
found. In fact, the ideal layout above can be produced by simple sequential
allocation, provided that each free space search is performed in the order the
blocks or extents will be needed during read traversal.

A nice line of attack is to arrange the steps of our disk update algorithms
appropriately, so that a simple, sequentially allocation goal produces the
desired ideal layout, or an approximation of it diluted mainly by skipping
over previously allocated blocks. To support this, we maintain a "goal"
variable, which points to the next block after the last one allocated.

Sequential Allocation

Pure sequential allocation is generally satisfactory only for initial writes
into an empty region. For rewrites, we must establish a new allocation goal or
we will surely suffer from undesirable behavior where rewritten data migrates
upward endlessly, leaving any former neighbors far away. This suggests a
seemingly workable general strategy: allocate sequentially unless we detect a
good reason to jump the allocation goal to some new position. Hopefully, we
will allocate sequentially most of the time, which would be both efficient and
nearly ideal in many cases.

Three Way Interleaving

In an ideal layout, three different kinds of filesystem object are woven
together: directory blocks, inode table blocks and data blocks, each with
their associated index metadata. It is therefore desirable that inode
allocation order, data block allocation order, and directory entry order
all correspond well, and that associated blocks all lie close to each other.
This is just another way of saying that layout should be close to ideal,
but expressed in terms of the main objects involved, each of which tends to
be handled by its own constellation of code that we now must tie together
in some reasonable way.

The main tool I propose for tying inode allocation order to block allocation
order is simply to assume some fixed, linear distribution of inodes
throughout a volume. For now, we assume one inode per block, Under real world
loads, the actual inode density could easily be less than that, when files
are large. If we implement some optimization to store small files entirely in
the inode table, we could conceivably see more than one inode per volume
block. We do not need to worry much about the former case, because it is easy
to leave gaps in the inode table to limit the inode population. In the
latter case, we might simply allocate some inode numbers away from their ideal
positions, or we can use a technique I call "folding", which I will describe
in a future post.

Inode number corresponds to block allocation goal

In brief, I propose that layout generally be driven by the inode number of
the containing directory, which in turn corresponds to the allocation goal of
the first directory block of the parent directory (and associated index
blocks). The inode number goal of each regular file in the directory is
chosen by linear interpolation based on the offset of the file from the
beginning of the directory. The block allocation goal for each regular file
is just its inode number. In this way, three kinds of filesystem object are
tied together in a rough correspondence.

Bouncing

The accuracy of this correspondence depends on several crude assumptions, the
most significant being that the inode number of the directory is chosen
appropriately. This must necessarily be a guess, which will sometimes be
very wrong, perhaps placing a large directory in a region with insufficient
space. To deal with this and similar issues, we introduce the notion
of allocation "bounces". If our first choice based on the above relationships
fails due to lack of space, we bounce the search to some new place according
to a repeatable formula, so that neighbors that also bounce will tend to
bounce to nearby places.

Bouncing can continue over potentially many iterations if each successive
bounce happens to land in a densely populated region. In the limit, the bounce
algorithm must cover the entire volume, so that a volume may be filled
entirely, at the cost of increased fragmentation as the probability of
bouncing increases. As we bounce, we may decide to satisfy an extent
allocation in multiple fragments, in preference to allocating all of it
further away. The details of such heuristics are less important than the fact
that bouncing guarantees to satisfy any allocation request, provided
that sufficient free space remains on the volume.

Excessive bouncing may become an issue if the initial choice of directory
inode number is unlucky. In that case, we may override the default goal by
adding a goal attribute to the directory inode, to retarget all allocations
within that directory to some region with sufficient free space. New files
added to the directory will be located at the new goal, while rewrites of
existing files will eventually relocate some of the older files to the updated
goal as well.

We might contemplate implementing automatic migration of existing files in the
event an allocation goal is overridden, however this will need to be done with
care, because the additional IO traffic could well do more harm than good.

Subdirectories

Goal setting by interpolation based on inode number breaks down in the
presence of subdirectories, which may be arbitrarily large. While storing an
entire subdirectory complete with all its file data and its own
subdirectories inline within the parent directory is necessary to obtain an
ideal layout, we do not lose much be relaxing that requirement and "bumping"
each subdirectory to some region outside its parent directory. On recursive
read, this only introduces two new seeks, one to go to the subdirectory and
one to return to the parent.

Considering the cases: the subdirectory may have many files, in which case the
two additional seeks do not cost much in comparison the traversal, or the
subdirectory has few files, in which case the total read load is not likely to
be very heavy. The pathological case, many tiny directories, might not be
worth worrying about, because in the real world the purpose of creating a
directory is usually to store more than one thing there. Another way of
putting it is, we may be willing to sacrifice a few seeks in the name of
simple layout algorithms. Over time, we may find there is something to be
gained by relaxing this bumping behavior and storing entire subdirectories
inline when convenient. For now, I propose the concept of "fully bumped"
directory layout, with details of how to choose appropriate bumping goals left
as an open question.

Interpolation

In order to interpolate allocation goals effectively and efficiently, we need
to make some assumptions about the average size of directory entries and
files. For this, we may just pick some arbitrary plausible values, such as 20
bytes for an average directory entry and 32K for an average file. So, for any
given new directory entry, we estimate its inode number goal by dividing its
offset within the directory by the average directory entry size, and adding
to the inode number of the directory, giving both an inode number and a block
allocation goal, according to the proposed one-to-one correspondence between
inode number and block number.

Such assumptions might prove to be wildly wrong according to actual activity
on the filesystem. We might contemplate a "learning" algorithm, where we
iteratively improve our estimates according to observed behavior, or we
might simply rely on our bouncing algorithm to resolve collisions, hopefully
degrading layout efficiency only slightly.

Inode Number Clustering

At present, Tux3 employs a simple inode table leaf format that does not index
each inode individually, but instead indexes the leaf block as a whole, with
contained inodes laid out linearly from there. The indexing space thus saved
in the leaf is considerable, perhaps 8-10% reduction in total inode leaf size.
However, this introduces a bias in favor of dense inode number packing, in
the absence of which it is possible for each inode table block to end up with
just one inode in it, or about 2% leaf fullness, which would be a problem. To
combat this, we introduce a technique called inode number clustering, where
inode allocation goals falling within a given, fixed size region, are biased
towards the lower end of that region, helping to control leaf fragmentation.

When it comes to filesystem integrity checking, inode number clustering has
important benefits beyond simply supporting a simple leaf layout scheme and
reducing average leaf size. To verify that every inode has a correct link
count, we can walk all the directories, incrementing link counts for all
referenced inodes, then walk the inode table checking that each link count
matches its respective inode link count attribute. A large table is required
to record the counts. If allocated inodes tend to be densely packed with
large intervening gaps then a radix tree would be a very efficient structure
for this. Otherwise, we might need to consider some less efficient structure
such as a fully associative btree, at the cost of considerably larger memory
footprint.

Discussion

With its nondestructive atomic update model, Tux3 faces new and largely
unexplored challenges in terms of optimizing disk layout. Because data and
metadata must migrate to new locations on every rewrite, Tux3 could prove
considerably more sensitive to aging related issues than Ext4, for example.
However, with careful attention to detail, it is likely that such issues can
be controlled to the extent that Tux3 layouts are competitive with Ext4
more often than not. On the positive side, Tux3 is able to place inode table
blocks near their related directory entry and file data blocks, which could
go a long way towards evening the odds.

It is possible that by careful provisioning of free space near allocated
items, the layout degradation cost due to constant repositioning may prove
negligible. (Note that Ext4 also tends to provision a significant amount of
free space, in anticipation of future directory growth.)

If it turns out to matter, Tux3 supports a per file data overwrite mode,
relaxing its normally stringent file data atomicity guarantee. Though so far
we have been unable to detect any performance improvement from that, it is
possible that our test coverage has simply not been sufficient. We may find
later that such a default mode for file data improves aging behavior without
unduly impacting crash consistency. Or ideally, we may even find that aging is
not typically degraded, so that Tux3's stronger than usual integrity
guarantees are gained essentially for free.

Notes

[1] Many layout optimizations for spinning disk either also help solid state
layout or do not harm it. For example, minimizing read seeks also tends to
improve block packing of related data, which is "good" sharing, the opposite
of false sharing. This reduces write multiplication in the SSD flash
translation layer by reducing the number of long lived disk sectors that would
otherwise need to be copied to new locations prior to block erasure.

[2] A potential reason that Tux3 might not be able to achieve the same
performance level as other hypothetical filesystems under certain loads is
that it does not embed inodes in directories. To ameliorate this, Tux3
requires a relatively more complex allocation strategy to interleave
directory and inode table blocks efficiently. In some cases it might be
impossible to avoid extra disk reads, because each name lookup in a directory
entry block must be followed by an inode lookup in an inode table block. In
the case of an isolated, random name lookup with cold cache, an extra read
is clearly required, though perhaps not an extra seek due to disk track
caching. This can fairly be regarded as a rare case, and measurably slower
only if the random read load is massive. Against this, there are certain
advantages to a separate inode table, including a potentially smaller cache
footprint for some fsck steps, and higher performance for pure namespace mass
operations such as find and ls -R, which benefit from a densely packed
namespace with no inode data intrusions.

Regards,

Daniel
Daniel Phillips | 14 Jan 2013 02:45
Picon

What are uptags?

What are uptags, you ask? Uptags are the latest made-in-tux3 idea for
enhancing the recoverability of damaged Tux3 filesystems, in the unlikely
event that that should ever occur (joke).

The purpose of an uptag is to improve the odds in favor of a fsck repair
operation being able to reattach apparently unreferenced metadata to the
correct original filesystem object. An uptag is something like a reverse link,
but is not precisely that. Usually, an uptag will record only the least
significant bits of several interesting values. For example, a dleaf uptag
might record 16 bits of inode number, 16 bits of block offset, 8 bits of delta
number and 8 bits of version number (when we have versions). This rather
extreme value masking is purely to save space: uptags may be valuable, but in
general, metadata space is even more valuable, because uptags only affect you
in the rare event you need to repair a volume while efficient use of
metadata affects everything you do.

Uptags appear in the lowest few bytes of metadata blocks. Each type of
metadata block will have an uptag, except for metadata leaves where regular
binary size is more important (examples: bitmaps; atom blocks; atime table,
all of which are either expendable or recoverable by a full filesystem scan).
The first two bytes of each uptag is its magic number. There is a different 16
bit magic number for each metadata block type. The remaining 6 bytes of uptag
data is devoted to context information we hope will prove useful when
repairing volumes. Other fields are: parent object; object offset; delta
number; version number. As mentioned above, we only record the least
significant bits of each value. So a directory entry block uptag might be:

   magic:16 = UPTAG_DIRENT
   delta:8 = low bits of delta commit number
   version:8 = low bits of block version tag
   owner:16 = low bits of directory_inode_number
   block:16 = low bits of directory logical block

The uptag for a dleaf might be slightly different because the dleaf already
adequately records the logical range it covers, and therefore is relatively
easy to reattach to a file data index after becoming separated somehow. So we
might allocate more bits to the owner field and fewer to the logical block
number. This is also true of ileaf blocks. Interior btree index nodes have
slightly relaxed requirements because, if we know the correct offsets of all
leaf blocks, rebuilding the btree index is easy. Of course, we may not know
all the offsets exactly, and in that case, we may be able to make educated
guesses using additional information from what we think are remnants of the
original index tree.

So that is the basic concept of uptags. We should discuss the details here and
on irc, to convince ourselves that uptags really are likely to be helpful to a
filesystem check repair pass. Then after we are satisified with the general
concept, we should work out the details of each kind of metadata uptag and add
them to our metadata definitions. Then... the exciting part... we can try
crosschecking the uptags using Hirofumi's new fsck prototype. How cool is
that?

Daniel

Gmane