akpm | 1 Apr 2007 07:10

+ revert-retries-in-ext3_prepare_write-violate-ordering-requirements.patch added to -mm tree


The patch titled
     revert "retries in ext3_prepare_write() violate ordering requirements"
has been added to the -mm tree.  Its filename is
     revert-retries-in-ext3_prepare_write-violate-ordering-requirements.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: revert "retries in ext3_prepare_write() violate ordering requirements"
From: Andrew Morton <akpm <at> linux-foundation.org>

Revert e92a4d595b464c4aae64be39ca61a9ffe9c8b278.

Dmitry points out

"When we block_prepare_write() failed while ext3_prepare_write() we jump to
 "failure" label and call ext3_prepare_failure() witch search last mapped bh
 and invoke commit_write untill it.  This is wrong!!  because some bh from
 begining to the last mapped bh may be not uptodate.  As a result we commit to
 disk not uptodate page content witch contains garbage from previous usage."

and

"Unexpected file size increasing."

   Call trace the same as it was in first issue but result is different. 

akpm | 1 Apr 2007 07:16

+ revert-retries-in-ext4_prepare_write-violate-ordering-requirements.patch added to -mm tree


The patch titled
     revert "retries in ext4_prepare_write() violate ordering requirements"
has been added to the -mm tree.  Its filename is
     revert-retries-in-ext4_prepare_write-violate-ordering-requirements.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: revert "retries in ext4_prepare_write() violate ordering requirements"
From: Andrew Morton <akpm <at> linux-foundation.org>

Revert b46be05004abb419e303e66e143eed9f8a6e9f3f.  Same reasoning as for ext3.

Cc: Kirill Korotaev <dev <at> openvz.org>
Cc: Ingo Molnar <mingo <at> elte.hu>
Cc: Ken Chen <kenneth.w.chen <at> intel.com>
Cc: Andrey Savochkin <saw <at> sw.ru>
Cc: <linux-ext4 <at> vger.kernel.org>
Cc: Dmitriy Monakhov <dmonakhov <at> openvz.org>
Cc: <stable <at> kernel.org>
Signed-off-by: Andrew Morton <akpm <at> linux-foundation.org>
---

 fs/ext4/inode.c |   85 +++++-----------------------------------------
 1 file changed, 10 insertions(+), 75 deletions(-)


akpm | 2 Apr 2007 08:49

[patch 07/12] revert "retries in ext4_prepare_write() violate ordering requirements"

From: Andrew Morton <akpm <at> linux-foundation.org>

Revert b46be05004abb419e303e66e143eed9f8a6e9f3f.  Same reasoning as for ext3.

Cc: Kirill Korotaev <dev <at> openvz.org>
Cc: Ingo Molnar <mingo <at> elte.hu>
Cc: Ken Chen <kenneth.w.chen <at> intel.com>
Cc: Andrey Savochkin <saw <at> sw.ru>
Cc: <linux-ext4 <at> vger.kernel.org>
Cc: Dmitriy Monakhov <dmonakhov <at> openvz.org>
Cc: <stable <at> kernel.org>
Signed-off-by: Andrew Morton <akpm <at> linux-foundation.org>
---

 fs/ext4/inode.c |   85 +++++-----------------------------------------
 1 file changed, 10 insertions(+), 75 deletions(-)

diff -puN fs/ext4/inode.c~revert-retries-in-ext4_prepare_write-violate-ordering-requirements fs/ext4/inode.c
--- a/fs/ext4/inode.c~revert-retries-in-ext4_prepare_write-violate-ordering-requirements
+++ a/fs/ext4/inode.c
@@ -1147,102 +1147,37 @@ static int do_journal_get_write_access(h
 	return ext4_journal_get_write_access(handle, bh);
 }

-/*
- * The idea of this helper function is following:
- * if prepare_write has allocated some blocks, but not all of them, the
- * transaction must include the content of the newly allocated blocks.
- * This content is expected to be set to zeroes by block_prepare_write().
- * 2006/10/14  SAW

akpm | 2 Apr 2007 08:49

[patch 06/12] revert "retries in ext3_prepare_write() violate ordering requirements"

From: Andrew Morton <akpm <at> linux-foundation.org>

Revert e92a4d595b464c4aae64be39ca61a9ffe9c8b278.

Dmitry points out

"When we block_prepare_write() failed while ext3_prepare_write() we jump to
 "failure" label and call ext3_prepare_failure() witch search last mapped bh
 and invoke commit_write untill it.  This is wrong!!  because some bh from
 begining to the last mapped bh may be not uptodate.  As a result we commit to
 disk not uptodate page content witch contains garbage from previous usage."

and

"Unexpected file size increasing."

   Call trace the same as it was in first issue but result is different. 
   For example we have file with i_size is zero.  we want write two blocks ,
   but fs has only one free block.

   ->ext3_prepare_write(...from == 0, to == 2048)
     retry:
     ->block_prepare_write() == -ENOSPC# we failed but allocated one block here.
     ->ext3_prepare_failure()
       ->commit_write( from == 0, to == 1024) # after this i_size becomes 1024 :)
     if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
        goto retry;

   Finally when all retries will be spended ext3_prepare_failure return
   -ENOSPC, but i_size was increased and later block trimm procedures can't
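The size growth described in the call trace above can be reproduced with a small userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the block size is fixed at 1024 bytes to match the trace, and none of this is the real ext3 code. It models just the behaviour Dmitry reports, where the failure path commits up to the last mapped block and so extends i_size.

```c
#include <errno.h>

#define TOY_BLOCK_SIZE 1024UL

/* Hypothetical stand-ins for the filesystem and inode state. */
struct toy_fs {
	unsigned int free_blocks;
};

struct toy_inode {
	unsigned long i_size;
	unsigned int mapped_blocks;
};

/*
 * Map the blocks needed to cover bytes [0, to).  On -ENOSPC the blocks
 * mapped so far stay mapped, as in the trace above.
 */
static int toy_block_prepare_write(struct toy_fs *fs, struct toy_inode *inode,
				   unsigned long to)
{
	unsigned long needed = (to + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;

	while (inode->mapped_blocks < needed) {
		if (fs->free_blocks == 0)
			return -ENOSPC;
		fs->free_blocks--;
		inode->mapped_blocks++;
	}
	return 0;
}

/*
 * Model of the reverted failure path: "commit" everything up to the
 * last mapped block, which pushes i_size forward.
 */
static void toy_prepare_failure(struct toy_inode *inode)
{
	unsigned long end = inode->mapped_blocks * TOY_BLOCK_SIZE;

	if (end > inode->i_size)
		inode->i_size = end;
}

/* Retry loop shaped like the one in the trace. */
static int toy_prepare_write(struct toy_fs *fs, struct toy_inode *inode,
			     unsigned long to, int retries)
{
	int ret;

	do {
		ret = toy_block_prepare_write(fs, inode, to);
		if (ret == 0)
			return 0;
		toy_prepare_failure(inode);
	} while (ret == -ENOSPC && retries-- > 0);

	return ret;
}
```

With one free block and a two-block write into an empty file, the model returns -ENOSPC while leaving i_size at 1024, which is the inconsistency the report complains about.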

Jan Kara | 2 Apr 2007 14:45

Re: 64bit inode number and dynamic inode table for ext4

On Thu 29-03-07 10:08:08, Mingming Cao wrote:
> > > To efficiently allocate and deallocate inode structures, we could link
> > > all free/used inode structures within the block group and store the
> > > first free/used inode number in the block group descriptor. 
> >   So you aren't expecting to shrink space allocated to inode, are you?
> > 
> In theory we could shrink space allocated to inodes, but I am not sure
> if this is worth the effort.
  Yes, I agree...

> > > There are some safety concern with dynamic inode table allocation in the
> > > case of block group corruption.  This could be addressed by checksuming
> > > the block group descriptor.
> >   But will it help you in finding those lost inodes once the descriptor
> > is corrupted? I guess it would make more sence to checksum each inode
> > separately. Then in case of corruption you could at least search through
> > all blocks (I know this is desperate ;) and find inodes by verifying
> > whether the inode checksum is correct. If it is, you have found an inode
> > block with a high probability (especially if a checksum for most of the
> > inodes in the block is correct). Another option would be to compute some
> > more robust checksum for the whole inode block...
> > 
> 
> If the block group descriptor is corrupted (with checksuming), we could
> locate majority of inodes in the group by scanning the directory
> entries, the inode number directly point to the inode table blocks.
  Yes, you're right that works in case inode numbers are easily translated
into physical location on disk.
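For reference, the translation that works today with static inode tables is plain arithmetic on the inode number. The sketch below assumes the classic ext2/ext3 layout (inode numbers start at 1, each group has a fixed number of inodes stored contiguously from the group's inode table start). The struct and function names are made up for illustration, and a dynamic inode table would need a different mapping.

```c
#include <stdint.h>

/* Hypothetical result type: where an inode lives on disk. */
struct toy_inode_loc {
	uint32_t group;   /* block group holding the inode */
	uint32_t index;   /* index of the inode within that group */
	uint64_t block;   /* filesystem block containing the inode */
	uint32_t offset;  /* byte offset of the inode inside that block */
};

/*
 * Translate an inode number into a physical location, assuming a
 * static per-group inode table.  inode_table_start[g] is the first
 * block of group g's inode table (normally read from the group
 * descriptor).
 */
static struct toy_inode_loc toy_locate_inode(uint32_t ino,
					     uint32_t inodes_per_group,
					     uint32_t inode_size,
					     uint32_t block_size,
					     const uint64_t *inode_table_start)
{
	struct toy_inode_loc loc;
	uint64_t byte;

	loc.group = (ino - 1) / inodes_per_group;
	loc.index = (ino - 1) % inodes_per_group;

	byte = (uint64_t)loc.index * inode_size;
	loc.block = inode_table_start[loc.group] + byte / block_size;
	loc.offset = (uint32_t)(byte % block_size);
	return loc;
}
```

Because nothing but the group descriptor's table start is needed, directory entries alone are enough to find most inodes in this layout, which is the property the quoted paragraph relies on.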

> Yeah, adding checksum for the whole inode block(s) is what I am

Jean-Pierre Dion | 2 Apr 2007 16:55

Re: Ext4 benchmarks

Hi Ric,

that may be useful. Checking with the team here and tell you.

Thanks.

jean-pierre

Ric Wheeler wrote:
>
> Jean-Pierre Dion wrote:
>> Hi Jose,
>>
>> thank you for the feedback.
>>
>> We took your remarks into account and we are doing some perfs with
>> iozone (close to desktop activity, mono-thread) and ffsb (allows to 
>> run benchs
>> in a multi-thread activity like a server does, different blocks 
>> sizes...).
>>
>> We compare ext3 and ext4 (with extents, w/ and w/o del alloc...)...
>>
>> We will publish the results on bullopensource.org
>>
>>
>> jean-pierre
>>
> I also have a benchmark that stresses a heavy synchronous write 
> workload that I can run.

Jean-Pierre Dion | 2 Apr 2007 16:56

Re: Ext4 benchmarks

Hi Jose,

I have to check with the team here.
Will tell you.

Thanks.

jean-pierre

Jose R. Santos wrote:
> Jean-Pierre Dion wrote:
>> Hi Jose,
>>
>> thank you for the feedback.
>>
>> We took your remarks into account and we are doing some perfs with
>> iozone (close to desktop activity, mono-thread) and ffsb (allows to 
>> run benchs
>> in a multi-thread activity like a server does, different blocks 
>> sizes...).
>>
>> We compare ext3 and ext4 (with extents, w/ and w/o del alloc...)...
>>
>> We will publish the results on bullopensource.org
>>   
>
> Hi Jean-Pierre,
>
> While it may be too late for the purposes of your OLS paper, one thing 
> that doesn't seem to be getting much attention is the performance of a 

Johann Lombardi | 2 Apr 2007 19:47

[PATCH] jbd2 stats through procfs

Hi all,

The patch below updates the jbd stats patch to 2.6.20/jbd2.
The initial patch was posted by Alex Tomas in December 2005
(http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
It provides statistics via procfs such as transaction lifetime and size.

Johann

Signed-off-by: Johann Lombardi <johann.lombardi <at> bull.net>
--

Index: linux-2.6.20.4/include/linux/jbd2.h
===================================================================
--- linux-2.6.20.4.orig/include/linux/jbd2.h	2007-04-02 13:34:44.000000000 +0200
+++ linux-2.6.20.4/include/linux/jbd2.h	2007-04-02 18:33:17.000000000 +0200
@@ -408,6 +408,16 @@ struct handle_s
 };

 
+/*
+ * Some stats for checkpoint phase
+ */
+struct transaction_chp_stats_s {
+	unsigned long		cs_chp_time;
+	unsigned long		cs_forced_to_close;
+	unsigned long		cs_written;
+	unsigned long		cs_dropped;
+};
+

Andreas Dilger | 2 Apr 2007 23:37

[PATCH] uninitialized groups kernel patch

This patch (against 2.6.16.27-0.9 = latest SLES10 kernel) adds
support for the uninit_groups feature to the kernel.  This speeds up
e2fsck on most filesystems by a factor of 2-10x in our tests.  It might
speed up the kernel by a tiny fraction because we don't have to read
uninitialized bitmaps from disk, but not a huge amount.  We already skip
reading inode table blocks from disk if they are unused, since 2.6.12 or
so (and that was a major speedup for file creation).

A separate patch will be sent for the e2fsprogs part of the work.

We keep a high water mark of used inodes for each group to improve e2fsck
time.  Block and inode bitmaps can be uninitialized on disk via a flag in
the group descriptor to avoid reading or scanning them at e2fsck time.

A checksum of each group descriptor is used to ensure that corruption in
the group descriptor's bit flags does not cause incorrect operation.  If the
checksum is incorrect, then either we refuse to mount (e2fsck will fix this
at the beginning of pass1), or we mark the group as unusable (0 free inodes
and 0 free blocks) so the kernel won't try to allocate from there (it may
be that the UNINIT flags were "cleared" by corruption and we would further
corrupt the filesystem by using invalid bitmaps).
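The "mark the group as unusable" fallback can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: the toy_* names, the descriptor layout, and above all the checksum function, which is a trivial stand-in and not the algorithm the patch uses.

```c
#include <stdint.h>

/* Hypothetical, simplified group descriptor. */
struct toy_group_desc {
	uint16_t flags;        /* e.g. "bitmap uninitialized" bits */
	uint16_t free_blocks;
	uint16_t free_inodes;
	uint16_t checksum;     /* covers the three fields above */
};

/* Stand-in checksum, NOT the one used by the real patch. */
static uint16_t toy_desc_checksum(const struct toy_group_desc *gd)
{
	uint32_t sum = 0x1234;

	sum = sum * 31 + gd->flags;
	sum = sum * 31 + gd->free_blocks;
	sum = sum * 31 + gd->free_inodes;
	return (uint16_t)(sum & 0xffff);
}

/*
 * Report the free counts the allocator should trust.  If the stored
 * checksum does not match, the flags cannot be trusted either, so the
 * group is reported as full and nothing is allocated from it.
 */
static void toy_group_free_counts(const struct toy_group_desc *gd,
				  unsigned int *free_blocks,
				  unsigned int *free_inodes)
{
	if (gd->checksum != toy_desc_checksum(gd)) {
		*free_blocks = 0;
		*free_inodes = 0;
		return;
	}
	*free_blocks = gd->free_blocks;
	*free_inodes = gd->free_inodes;
}
```

Reporting zero free blocks and inodes keeps the allocator away from a suspect group without needing a separate "bad group" state, which seems to be the point of the approach described above.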

To enable on a new filesystem:
- mke2fs -O uninit_groups /dev/XXX

To enable on a new filesystem and not zero out the inode table (faster, and
good for loopback filesystems, not so great for real world usage until TODO
is complete):
- mke2fs -O lazy_bg,uninit_groups /dev/XXX


Mingming Cao | 2 Apr 2007 23:47

Re: [PATCH] Add i_version_hi for 64-bit version

On Sun, 2007-03-18 at 15:43 +0530, Kalpak Shah wrote:
> Hi,
> 
> This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used
> for 64-bit inode versions. This field will store the higher 32 bits of the
> version, while Jean Noel's patch has added support to store the lower 32 bits
> in osd1.linux1.l_i_version.
> 
> Signed-off-by: Andreas Dilger <adilger <at> clusterfs.com>
> Signed-off-by: Kalpak Shah <kalpak <at> clusterfs.com>
> 
> Index: linux-2.6.20/include/linux/ext4_fs.h
> ===================================================================
> --- linux-2.6.20.orig/include/linux/ext4_fs.h
> +++ linux-2.6.20/include/linux/ext4_fs.h
> @@ -336,6 +336,7 @@ struct ext4_inode {
>         __le32  i_atime_extra;  /* extra Access time      (nsec << 2 | epoch) */
>         __le32  i_crtime;       /* File Creation time */
>         __le32  i_crtime_extra; /* extra File Creation time (nsec << 2 | epoch) */
> +       __u32   i_version_hi;   /* high 32 bits for 64-bit version */
>  };
> 

Any reason for not using __le32 for i_version_hi?

Thanks,
Mingming
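To spell out why the type matters: the other on-disk fields in the struct are __le32 and are read through le32_to_cpu(), so the bytes on disk mean the same thing on every architecture. A field declared as a plain __u32 and read directly would be interpreted in host byte order. The userspace sketch below shows the fixed-order encoding; the toy_* helpers are hypothetical stand-ins for the kernel's conversion helpers, not the kernel API itself.

```c
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, whatever the host. */
static void toy_put_le32(uint8_t *buf, uint32_t v)
{
	buf[0] = (uint8_t)(v & 0xff);
	buf[1] = (uint8_t)((v >> 8) & 0xff);
	buf[2] = (uint8_t)((v >> 16) & 0xff);
	buf[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Load a little-endian 32-bit value, whatever the host. */
static uint32_t toy_get_le32(const uint8_t *buf)
{
	return (uint32_t)buf[0] |
	       ((uint32_t)buf[1] << 8) |
	       ((uint32_t)buf[2] << 16) |
	       ((uint32_t)buf[3] << 24);
}

/* Combine the low and high on-disk halves into one 64-bit version. */
static uint64_t toy_version64(const uint8_t *lo_field, const uint8_t *hi_field)
{
	return ((uint64_t)toy_get_le32(hi_field) << 32) |
	       toy_get_le32(lo_field);
}
```

If both halves go through the same conversion, a filesystem written on a little-endian machine yields the same 64-bit version when read on a big-endian one.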

