Alex Tomas | 1 Dec 2006 01:15

[RFC] delayed allocation, mballoc, etc


Good day,

I'd like to ask the community to discuss and review a few things
I've been working on. We propose a set of patches intended
to improve the performance of ext4:

 * locality groups

   to achieve good performance when writing many small files,
   we need to allocate them close to each other. the
   simplest way would be to allocate each small file at the
   next block after the previous small file, and this would
   work well for the single-job case. for the multi-job case (a few
   untars, for example) this would break job locality and
   cause a performance penalty on subsequent access. the locality
   groups idea may help here: let's group all files by some
   property, pgid for example. now, every time the kernel
   asks the filesystem to flush dirty pages, we flush inodes from
   the 1st group, then from the 2nd, and so on. this way we can form
   large contiguous allocations (for a whole group), achieving
   good throughput while preserving quite good locality.

 * scalable block reservation

   this is required to protect against -ENOSPC when pages enter
   the pagecache w/o space allocation (delayed allocation). it
   should also scale well on high-end SMP, as every cpu has
   one "pool" of blocks. when a pool is empty, the filesystem
   rebalances free blocks between all cpus
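
The per-cpu pool idea above can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions, not the code from the patches: `NR_POOLS`, `struct blk_reservation` and `reserve_blocks()` are hypothetical names, the rebalancing policy (an even split of whatever remains) is a guess, and a real implementation would need per-cpu locking that is omitted here.

```c
#include <assert.h>
#include <errno.h>

#define NR_POOLS 4	/* hypothetical number of cpus for this sketch */

/* Hypothetical per-cpu pools of free blocks available for reservation. */
struct blk_reservation {
	unsigned long pool[NR_POOLS];
};

/*
 * Reserve nblocks on behalf of pages entering the page cache on `cpu`.
 * Fast path: take the blocks from the local pool.
 * Slow path: the local pool is too small, so gather all pools, take the
 * request from the total and spread the remainder evenly again.
 * Returns 0 on success, -ENOSPC if all pools together cannot cover it.
 */
int reserve_blocks(struct blk_reservation *r, int cpu, unsigned long nblocks)
{
	unsigned long total = 0, share, extra;
	int i;

	assert(cpu >= 0 && cpu < NR_POOLS);

	if (r->pool[cpu] >= nblocks) {
		r->pool[cpu] -= nblocks;
		return 0;
	}

	for (i = 0; i < NR_POOLS; i++)
		total += r->pool[i];
	if (total < nblocks)
		return -ENOSPC;	/* pools are left untouched on failure */

	total -= nblocks;
	share = total / NR_POOLS;
	extra = total % NR_POOLS;
	for (i = 0; i < NR_POOLS; i++)
		r->pool[i] = share + ((unsigned long)i < extra ? 1 : 0);
	return 0;
}
```

The point of the split is that the common case touches only one cpu's counter, while the -ENOSPC decision is made against the sum of all pools rather than the local one.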

Alex Tomas | 1 Dec 2006 01:16

[RFC] delayed allocation testing on node zefir


Node zefir: 2 x PIII-933MHz, 896MB RAM, SCSI HDD
================================================

Test description:

    * dd512w - write 512MB
    * dd512r - read 512MB
    * dd512rw - rewrite 512MB
    * dd1024w - write 1024MB
    * dd1024r - read 1024MB
    * dd1024rw - rewrite 1024MB
    * dd2048w - write 2048MB
    * dd2048r - read 2048MB
    * dd2048rw - rewrite 2048MB
    * kernuntar - untar kernel
    * kernstat - stat kernel tree
    * kerncat - cat kernel tree to /dev/null
    * kernrm - remove kernel tree
    * kernuntar2 - 2 concurrent kernuntar
    * kernstat2 - 2 concurrent kernstat
    * kerncat2 - 2 concurrent kerncat
    * kernrm2 - 2 concurrent kernrm
    * dbench4 - dbench with 4 threads
    * dbench8 - dbench with 8 threads
    * dbench16 - dbench with 16 threads
    * dbench32 - dbench with 32 threads 

Legend:


akpm | 1 Dec 2006 08:05

+ ext3-4-dont-do-orphan-processing-on-readonly-devices.patch added to -mm tree


The patch titled
     ext3/4: don't do orphan processing on readonly devices
has been added to the -mm tree.  Its filename is
     ext3-4-dont-do-orphan-processing-on-readonly-devices.patch

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: ext3/4: don't do orphan processing on readonly devices
From: Eric Sandeen <sandeen@redhat.com>

If you do something like:

# touch foo
# tail -f foo &
# rm foo
# <take snapshot>
# <mount snapshot>

you'll panic, because ext3/4 tries to do orphan list processing on the
readonly snapshot device, and:

kernel: journal commit I/O error
kernel: Assertion failure in journal_flush_Rsmp_e2f189ce() at journal.c:1356: "!journal->j_checkpoint_transactions"
kernel: Kernel panic: Fatal exception

for a truly readonly underlying device, it's reasonable and necessary
to just skip orphan list processing.
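
The decision described above can be illustrated with a small user-space sketch. It is not the patch itself: `struct sb_state` and `orphan_cleanup()` are hypothetical stand-ins, and the sketch assumes that a read-only mount on a writable device may still process orphans while a read-only device may not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock state that matters here. */
struct sb_state {
	bool mounted_readonly;	/* filesystem mounted ro; device may still accept writes */
	bool device_readonly;	/* underlying device rejects writes, e.g. a snapshot */
	unsigned int orphans;	/* inodes currently on the orphan list */
};

/*
 * Process the orphan list unless the underlying device is read-only.
 * Returns the number of orphans processed.
 */
unsigned int orphan_cleanup(struct sb_state *sb)
{
	unsigned int done;

	assert(sb != NULL);

	/* Nothing can be written back, so leave the orphan list alone. */
	if (sb->device_readonly)
		return 0;

	/* Assumption: a read-only mount alone does not prevent cleanup. */
	done = sb->orphans;
	sb->orphans = 0;
	return done;
}
```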

sho | 1 Dec 2006 11:29

Re:[RFC][PATCH 0/3] Extent base online defrag

Hi,

Thank you for your further testing.

>From the test the file seems to be pretty fragmented, still the defrag
>utility gives ENOSPC error.

I found a problem in the journal-related workaround code in my
patches.  2048 blocks were always passed to ext3_journal_start so that
the journal blocks in a transaction didn't exceed j_max_transaction_buffers.
However, j_max_transaction_buffers can be adjusted according to the number
of filesystem blocks.  Your filesystem's j_max_transaction_buffers
may be less than 2048, in which case the error would occur.
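
One way to picture the mismatch is a helper that never requests more credits than a single transaction may hold. This is only a sketch of the idea, not the fix planned for the patches: `defrag_journal_credits()` and `DEFRAG_CREDITS` are hypothetical names, and whether the clamped number of credits is actually enough for the defrag operation is a separate question the sketch does not answer.

```c
#include <assert.h>

#define DEFRAG_CREDITS 2048	/* the fixed request described above */

/*
 * Hypothetical helper: clamp the fixed credit request to the journal's
 * per-transaction limit, passed in as max_transaction_buffers.
 */
int defrag_journal_credits(int max_transaction_buffers)
{
	assert(max_transaction_buffers > 0);

	if (DEFRAG_CREDITS < max_transaction_buffers)
		return DEFRAG_CREDITS;
	return max_transaction_buffers;
}
```

With a limit of 8192 the request stays at 2048; with a limit of 1024 it drops to 1024 instead of being rejected.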

When the above problem occurs, the following message is written
to syslog.  Could you check your syslog?

"JBD: e4defrag wants too many credits (2048 > XXXX)"

I am working on fixing the workaround code, but the work isn't
finished yet.  I will update my patches next week.
If you need to run my defrag soon, you can use the following
patch as a provisional solution to avoid the above problem.

After Alex's patches and my previous patches are applied,
you can apply the following patch.
---
diff -uprN -X linux-2.6.16.8-tnes-org/Documentation/dontdiff
linux-2.6.16.8-tnes-org/fs/ext3/extents.c linux-2.6.16.8-work/fs/ext3/extents.c
--- linux-2.6.16.8-tnes-org/fs/ext3/extents.c	2006-12-01 10:37:28.000000000 +0900

Andreas Dilger | 1 Dec 2006 13:06

Re: [RFC][PATCH 0/4] BIG_BG: support of large block groups

On Nov 30, 2006  14:41 -0500, Theodore Tso wrote:
> * We ignore the problem, and accept that there are some kinds of
> filesystem corruptions which e2fsck will not be able to fix --- or at
> least not without adding complexity which would allow it to relocate
> data blocks in order to make a contiguous range of blocks to be used
> for the allocation bitmaps.  
> 
> The last alternative sounds horrible, but if we assume that some other
> layer (i.e., the hard drive's bad block replacement pool) provides us
> the illusion of a flawless storage media, and CRC to protect metadata
> will prevent us from relying on a corrupted bitmap block, maybe it is
> acceptable that e2fsck may not be able to fix certain types of
> filesystem corruption.

I'd agree that even with media errors, the bad-block replacement pool
is almost certainly available to handle this case.  Even if there are
media errors on the read of the bitmap, they will generally go away
if the bitmap is rewritten (because of relocation).  At worst, we
would no longer allow new blocks/inodes to be allocated that are
tracked by that block, and if we are past 256TB then the sacrifice
of 128MB of space is not fatal.  It wouldn't even have to impact any
files that are already allocated in that space.
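
The 128MB figure can be checked with a little arithmetic, assuming 4096-byte blocks (the block size is not stated in the message and is an assumption here). A bitmap block holds one bit per tracked block, so it covers 8 * 4096 = 32768 blocks, which is 128MB of space. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of filesystem space tracked by a single bitmap block:
 * one bit per block, so block_size * 8 blocks of block_size bytes each.
 */
uint64_t bytes_per_bitmap_block(uint64_t block_size)
{
	uint64_t blocks_tracked;

	assert(block_size > 0);
	blocks_tracked = block_size * 8;
	return blocks_tracked * block_size;
}
```

Against a 256TB filesystem, one such 128MB region is a 1/2097152 (2^-21) share of the total space.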

> without any of these protections, I'd want to keep the block group
> size under 32k so we can avoid dealing with these issues for as long
> as possible.  Even if we assume laptop drives will double in size
> every 12 months, we still have a good 10+ years before we're in danger
> of seeing 512TB laptop drives.  :-)

Agreed, I think there isn't any reason to increase the group size

Alex Tomas | 2 Dec 2006 12:02

Re: Design alternatives for fragments/file tail support in ext4


>>>>> Theodore Ts'o (TT) writes:

 TT> Storing the tail as an extended attribute
 TT> =========================================
 TT> BSD FFS/UFS Fragments
 TT> =====================

what if we applied the BSD FFS/UFS Fragments design to inodes?
something like: the basic inode size is still 128 bytes, but we
could unite a few contiguous inodes into a larger one and use that
space to store extended attributes, including a tail?

I foresee a couple of issues:

a) as we can't predict tail size, we'd need to be able
   to reallocate inode or do some sort of delayed inode
   allocation

b) inode space is quite limited, which is why we may exhaust
   it very quickly

thanks, Alex

  
Johann Lombardi | 4 Dec 2006 18:21

[PATCH] ext4: fix credit calculation in ext4_ext_calc_credits_for_insert

Hello,

The patch below fixes a nit in ext4_ext_calc_credits_for_insert():
credits for the new root are already added in the index split
accounting, so they need not be counted again when the tree grows in depth.

Johann

Signed-off-by: Johann Lombardi <johann.lombardi@bull.net>
Signed-off-by: Alex Tomas <alex@clusterfs.com>
---

Index: linux-2.6.19/fs/ext4/extents.c
===================================================================
--- linux-2.6.19.orig/fs/ext4/extents.c	2006-12-04 16:59:41.760257656 +0100
+++ linux-2.6.19/fs/ext4/extents.c	2006-12-04 17:04:35.566592304 +0100
@@ -1534,16 +1534,17 @@ int inline ext4_ext_calc_credits_for_ins

 	/*
 	 * tree can be full, so it would need to grow in depth:
-	 * allocation + old root + new root
+	 * we need one credit to modify old root, credits for
+	 * new root will be added in split accounting
 	 */
-	needed += 2 + 1 + 1;
+	needed += 1;

 	/*
 	 * Index split can happen, we would need:
 	 *    allocate intermediate indexes (bitmap + group)

Erik Mouw | 4 Dec 2006 19:02

[PATCH] Update ext[23] mailing list address

The ext[23] mailing list moved from sf.net to vger.kernel.org so update
the MAINTAINERS file.

Signed-off-by: Erik Mouw <erik@harddisk-recovery.com>

diff --git a/MAINTAINERS b/MAINTAINERS
index 8385a69..e21ba04 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1091,13 +1091,13 @@ M:	miku@iki.fi
 S:	Maintained

 EXT2 FILE SYSTEM
-L:	ext2-devel@lists.sourceforge.net
+L:	linux-ext4@vger.kernel.org
 S:	Maintained

 EXT3 FILE SYSTEM
 P:	Stephen Tweedie, Andrew Morton
 M:	sct@redhat.com, akpm@osdl.org, adilger@clusterfs.com
-L:	ext2-devel@lists.sourceforge.net
+L:	linux-ext4@vger.kernel.org
 S:	Maintained

 F71805F HARDWARE MONITORING DRIVER
@@ -1673,7 +1673,7 @@ S:	Supported
 JOURNALLING LAYER FOR BLOCK DEVICES (JBD)
 P:	Stephen Tweedie, Andrew Morton
 M:	sct@redhat.com, akpm@osdl.org
-L:	ext2-devel@lists.sourceforge.net

Nikolai Joukov | 4 Dec 2006 19:33

[RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4

As we promised on the linux-ext4 list on October 31, here is the patch
that adds secure deletion via trash-bin functionality for ext4.  It is a
compromise solution that combines secure deletion with trash-bin support
(the latter had been requested by even more people than the former :-).

The patch moves the files marked with the -s (secure deletion) or the -u
(undelete) ext2/3/4 attributes to a .trash/<uid>/ directory.  It does so
reliably because it does it in the kernel upon every unlink operation.
User-mode tools can miss some unlinks, which is unacceptable for secure
deletion.  The per-user subdirectories allow us to solve the problem
with permissions.  The .trash directory is owned by root and has
drwxr-xr-x permissions.  The per-uid subdirectories are owned and only
accessible by their corresponding owners (drwx------) so users cannot
read each other's files in the trash.  Alternative solutions would
require changing a lot of ext3 code.

Right now we do not store the whole path to the file moved into the
trash-bin.  We keep the filenames, or append numbers to them in case of
collisions.  In the future we may change it to store the whole path
information if necessary.
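
The naming scheme described above could look something like the sketch below. The helper `trash_path()` is hypothetical and the exact form of the numeric suffix (`.N` appended to the name) is an assumption; the message only says that numbers are appended on collision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build a destination path under .trash/<uid>/ for a file called `name`.
 * attempt 0 keeps the original filename; attempt N > 0 appends ".N",
 * to be used when the previous candidate already exists in the trash.
 * Returns 0 on success, -1 if the path does not fit into buf.
 */
int trash_path(char *buf, size_t len, unsigned int uid,
	       const char *name, unsigned int attempt)
{
	int n;

	assert(buf != NULL && name != NULL);

	if (attempt == 0)
		n = snprintf(buf, len, ".trash/%u/%s", uid, name);
	else
		n = snprintf(buf, len, ".trash/%u/%s.%u", uid, name, attempt);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would start at attempt 0 and increase the counter until the candidate name is free in the user's subdirectory.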

Later (e.g., when the system is idle) a user-mode daemon can just scan
.trash's subdirectories, overwrite the files marked with -s, and unlink
the files.  The purging and data overwriting are performed entirely in user
space.  It may or may not be necessary to add a mechanism for the file
system to initiate the user-mode deletion once space becomes scarce.
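
The per-file step of such a daemon might be sketched as below, using only POSIX calls. `purge_file()` is a hypothetical name, a single pass of zeros is an assumption (the message does not say how files are overwritten), and the caller is assumed to have already decided that the file carries the -s attribute.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite a file's contents with zeros, flush them to disk, then
 * unlink the file. Returns 0 on success, -1 on any failure.
 */
int purge_file(const char *path)
{
	char zeros[4096];
	struct stat st;
	off_t left;
	int fd;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	memset(zeros, 0, sizeof(zeros));
	for (left = st.st_size; left > 0; ) {
		size_t chunk = left < (off_t)sizeof(zeros) ?
			       (size_t)left : sizeof(zeros);
		ssize_t written = write(fd, zeros, chunk);

		if (written <= 0) {
			close(fd);
			return -1;
		}
		left -= written;
	}

	/* Make sure the overwrite reaches the disk before the name goes away. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return unlink(path);
}
```

Files marked only with -u would skip the overwrite loop and go straight to unlink.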

The benefits of this secure deletion approach are obvious:
1) small kernel patch;
2) two solutions (trash-bin and secure deletion) in one;

Erik Mouw | 4 Dec 2006 21:17

Re: [PATCH] Update ext[23] mailing list address

(This patch applies on top of my previous patch to this list.)

Now that ext[23] have correct mailing list information, add information
about the ext4 filesystem to the MAINTAINERS file.

Signed-off-by: Erik Mouw <erik@harddisk-recovery.com>

diff --git a/MAINTAINERS b/MAINTAINERS
index e21ba04..7ec9a0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1100,6 +1100,12 @@ M:	sct@redhat.com, akpm@osdl.org, adilge
 L:	linux-ext4@vger.kernel.org
 S:	Maintained

+EXT4 FILE SYSTEM
+P:	Stephen Tweedie, Andrew Morton
+M:	sct@redhat.com, akpm@osdl.org, adilger@clusterfs.com
+L:	linux-ext4@vger.kernel.org
+S:	Maintained
+
 F71805F HARDWARE MONITORING DRIVER
 P:	Jean Delvare
 M:	khali@linux-fr.org
