Jan Kara | 1 Dec 2009 15:35

Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > 
> > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in
> > > 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least
> > > 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does
> > > not* fail on 2.6.26. I have not tested anything between 2.6.26 and
> > > 2.6.31.
> > >
> > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This
> > > time, the zero'd area of the file doesn't show up immediately upon
> > > writing the file. Instead, the kernel waits to mangle the file until
> > > it has to flush the buffer to disk. *THEN* it zeros out parts of the
> > > file.
> > >
> > > So, after writing out the new file with writev, and checking the
> > > md5sum (which is correct), this test case asks the kernel to flush
> > > the cache for that file, and then checks the md5sum again. ONLY THEN
> > > is the file corrupted. That is, I won't hesitate to say, *incredibly
> > > evil* behavior: it took me quite some time to figure out WTH was
> > > going wrong with my program before determining it was a kernel bug.
> > >
> > > This test case is distilled from an actual application which doesn't
> > > even intentionally use writev: it just uses C++'s ofstream class to
> > > write data to a file. Unfortunately, that class is smart and uses
> > > writev under the covers. Unfortunately, I guess nobody ever tests
> > > linux writev behavior, since it's broken _so_much_of_the_time_. I
> > > really am quite astounded to see such a bad track record for such a
> > > fundamental core system call....
> > >
> > > My /tmp is an ext3 filesystem, in case that matters.
> >
> > Further testing shows that the filesystem type *does* matter. The bug
> > does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs,
> > or xfs (and probably all the others too). Only, so far as I can
> > determine, on ext3.
> 
> I bisected it this morning.  Bisected cleanly to...
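
The write / checksum / flush / checksum-again procedure described in the
quoted report can be sketched in userspace roughly as follows. This is not
James's actual test program. The path, the buffer sizes, the simple rolling
checksum (standing in for md5sum) and the use of fsync() followed by
posix_fadvise(POSIX_FADV_DONTNEED) as the "flush the cache" step are all
assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Simple rolling checksum of a file's contents; a stand-in for md5sum. */
static unsigned long checksum_file(const char *path)
{
	unsigned char buf[4096];
	unsigned long sum = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return 0;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		for (ssize_t i = 0; i < n; i++)
			sum = sum * 31 + buf[i];
	close(fd);
	return sum;
}

/*
 * Write two buffers with a single writev(), checksum the file, ask the
 * kernel to write back and drop the cached pages, then checksum again.
 * Returns 0 if both checksums match, 1 on mismatch, -1 on I/O error.
 */
static int writev_flush_check(const char *path)
{
	static char a[3000], b[5000];	/* arbitrary, not block-aligned */
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = sizeof(a) },
		{ .iov_base = b, .iov_len = sizeof(b) },
	};
	unsigned long before, after;
	int fd;

	memset(a, 'A', sizeof(a));
	memset(b, 'B', sizeof(b));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (writev(fd, iov, 2) != (ssize_t)(sizeof(a) + sizeof(b))) {
		close(fd);
		return -1;
	}
	before = checksum_file(path);	/* served from the page cache */

	/* Assumed flush step: write back, then hint the cache away. */
	fsync(fd);
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);

	after = checksum_file(path);	/* ideally re-read from disk */
	return before == after ? 0 : 1;
}
```

Calling writev_flush_check() on a file that lives on the filesystem under
test returns 1 only if the contents changed across the flush. Note that
POSIX_FADV_DONTNEED is only a hint, so a result of 0 does not prove the
second read actually came from disk.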

Jan Kara | 1 Dec 2009 17:03

Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

On Tue 01-12-09 15:35:59, Jan Kara wrote:
> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > > 
> > > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in
> > > > 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least
> > > > 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does
> > > > not* fail on 2.6.26. I have not tested anything between 2.6.26 and
> > > > 2.6.31.
> > > >
> > > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This
> > > > time, the zero'd area of the file doesn't show up immediately upon
> > > > writing the file. Instead, the kernel waits to mangle the file
> > > > until it has to flush the buffer to disk. *THEN* it zeros out parts
> > > > of the file.
> > > >
> > > > So, after writing out the new file with writev, and checking the
> > > > md5sum (which is correct), this test case asks the kernel to flush
> > > > the cache for that file, and then checks the md5sum again. ONLY
> > > > THEN is the file corrupted. That is, I won't hesitate to say,
> > > > *incredibly evil* behavior: it took me quite some time to figure
> > > > out WTH was going wrong with my program before determining it was a
> > > > kernel bug.
> > > >
> > > > This test case is distilled from an actual application which
> > > > doesn't even intentionally use writev: it just uses C++'s ofstream
> > > > class to write data to a file. Unfortunately, that class is smart
> > > > and uses writev under the covers. Unfortunately, I guess nobody
> > > > ever tests linux writev behavior, since it's broken
> > > > _so_much_of_the_time_. I really am quite astounded to see such a
> > > > bad track record for such a fundamental core system call....
> > > >
> > > > My /tmp is an ext3 filesystem, in case that matters.
> > >
> > > Further testing shows that the filesystem type *does* matter. The bug
> > > does not exhibit when the test is run on ext2, ext4, vfat, btrfs,
> > > jfs, or xfs (and probably all the others too). Only, so far as I can
> > > determine, on ext3.
> > 

Mike Galbraith | 1 Dec 2009 17:47

Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

On Tue, 2009-12-01 at 17:03 +0100, Jan Kara wrote:

>   The patch below fixes the issue for me...

Ditto.

	-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Curt Wohlgemuth | 1 Dec 2009 19:17

[PATCH] ext4: Remove blocks from inode prealloc list on failure

This fixes a leak of blocks in an inode prealloc list if device failures
cause ext4_mb_mark_diskspace_used() to fail.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
---
diff -uprN orig/fs/ext4/mballoc.c new/fs/ext4/mballoc.c
--- orig/fs/ext4/mballoc.c	2009-12-01 09:27:25.000000000 -0800
+++ new/fs/ext4/mballoc.c	2009-12-01 09:28:38.000000000 -0800
@@ -3011,6 +3011,22 @@ static void ext4_mb_collect_stats(struct
 }

 /*
+ * Called on failure; free up any blocks from the inode PA for this
+ * context.
+ */
+static void ext4_discard_inode_pa(struct ext4_allocation_context *ac)
+{
+	struct ext4_prealloc_space *pa = ac->ac_pa;
+	int len;
+
+	if (pa && pa->pa_type == MB_INODE_PA) {
+		len = ac->ac_b_ex.fe_len;
+		pa->pa_free += len;
+	}
+
+}
+
+/*
  * use blocks preallocated to inode
  */

Andreas Dilger | 1 Dec 2009 23:02

Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

On 2009-12-01, at 09:03, Jan Kara wrote:
> On Tue 01-12-09 15:35:59, Jan Kara wrote:
>> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
>>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
>>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
>>>>> This test case is distilled from an actual application which  
>>>>> doesn't even intentionally use writev: it just uses C++'s  
>>>>> ofstream class to write data to a file. Unfortunately, that  
>>>>> class is smart and uses writev under the covers. Unfortunately, I  
>>>>> guess nobody ever tests linux writev behavior, since it's broken  
>>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
>>>>> bad track record for such a fundamental core system call....

I suspect an excellent way of exposing problems with the writev()  
interface would be to wire it into fsx, which is commonly run as a  
stress test for Linux.  I don't know if it would have caught this  
case, but it definitely couldn't hurt to get more testing cycles for it.
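
As a rough idea of what a randomized writev() operation in a stress tester
could look like, here is a standalone userspace sketch. It does not use or
reproduce fsx's actual code. The helper names (split_random, writev_all,
writev_roundtrip) and the MAX_SEGS limit are hypothetical. The sketch
splits one buffer into iovecs of random length, writes them with writev()
while handling short writes, then reads the data back and compares.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_SEGS 16	/* hypothetical cap on iovec count */

/* Split buf[0..len) into at most MAX_SEGS random-length iovecs. */
static int split_random(char *buf, size_t len, struct iovec *iov)
{
	int n = 0;

	while (len > 0 && n < MAX_SEGS) {
		/* The last slot takes whatever is left. */
		size_t seg = (n == MAX_SEGS - 1) ? len
						 : 1 + (size_t)rand() % len;

		iov[n].iov_base = buf;
		iov[n].iov_len = seg;
		buf += seg;
		len -= seg;
		n++;
	}
	return n;
}

/* Call writev() until every segment is written; modifies iov in place. */
static int writev_all(int fd, struct iovec *iov, int cnt)
{
	while (cnt > 0) {
		ssize_t n = writev(fd, iov, cnt);

		if (n <= 0)
			return -1;
		/* Skip fully written segments, then trim the partial one. */
		while (cnt > 0 && (size_t)n >= iov->iov_len) {
			n -= (ssize_t)iov->iov_len;
			iov++;
			cnt--;
		}
		if (cnt > 0) {
			iov->iov_base = (char *)iov->iov_base + n;
			iov->iov_len -= (size_t)n;
		}
	}
	return 0;
}

/* Returns 0 if the data reads back intact, 1 on mismatch, -1 on error. */
static int writev_roundtrip(int fd, char *buf, char *check, size_t len)
{
	struct iovec iov[MAX_SEGS];
	int cnt = split_random(buf, len, iov);

	if (lseek(fd, 0, SEEK_SET) < 0 || writev_all(fd, iov, cnt) < 0)
		return -1;
	if (pread(fd, check, len, 0) != (ssize_t)len)
		return -1;
	return memcmp(buf, check, len) ? 1 : 0;
}
```

A read-back straight after the write is served from the page cache, so on
its own this would probably not have caught the ext3 case in this thread.
The comparison would have to be repeated after the pages are written back
and dropped.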

>>  Ext4 also has this problem but delayed allocation mitigates the  
>> effect to an error in accounting of blocks reserved for delayed  
>> allocation and thus under normal circumstances nothing bad happens.

It looks like ext4 might still hit this problem, if delalloc is  
disabled.  Could you please submit a similar patch for ext4 also.

>  The patch below fixes the issue for me...
>
> 									Honza
>
> From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001

Paul E. McKenney | 1 Dec 2009 18:48

ext3-related deadlock: suggestions for problem isolation?

[Note: resending to proper ext4 list]

Hello!

I have been encountering the deadlock situation shown below during
rcutorture/CPU-hotplug testing.  It is quite rare, being triggered less
than once per hundred hours of intensive testing.  Once the deadlock
occurs, CPU-hotplug operations are stalled, but RCU grace periods
proceed normally.  It is of course quite possible that this deadlock is
due to my recent changes, but I thought I should check with you guys
before committing unnatural debugging acts with the ext3 code.

Any suggestions for things to look at when this error occurs, or
debugging patches to help isolate the problem?  Needless to say, any
debugging patches will need to be -very- selective in their output,
given how rare the problem is.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

INFO: task kjournald:1104 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
[c000000042d67630] [c000000042d676d0] 0xc000000042d676d0 (unreliable)
[c000000042d67800] [c000000000010d34] .__switch_to+0x160/0x1a8
[c000000042d678a0] [c00000000058e71c] .schedule+0x890/0x9f4
[c000000042d67990] [c00000000058e90c] .io_schedule+0x8c/0x118
[c000000042d67a20] [c00000000016acfc] .sync_buffer+0x68/0x80

Theodore Ts'o | 2 Dec 2009 02:14

[PATCH 1/2] jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()

OOM happens.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/jbd2/commit.c  |    4 ++++
 fs/jbd2/journal.c |    4 ++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index d4cfd6d..8896c1d 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -636,6 +636,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		JBUFFER_TRACE(jh, "ph3: write metadata");
 		flags = jbd2_journal_write_metadata_buffer(commit_transaction,
 						      jh, &new_jh, blocknr);
+		if (flags < 0) {
+			jbd2_journal_abort(journal, flags);
+			continue;
+		}
 		set_bit(BH_JWrite, &jh2bh(new_jh)->b_state);
 		wbuf[bufs++] = jh2bh(new_jh);

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index af60d98..3f473fa 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -358,6 +358,10 @@ repeat:

 		jbd_unlock_bh_state(bh_in);

Theodore Ts'o | 2 Dec 2009 02:14

[PATCH 2/2] ext4: Use slab allocator for sub-page sized allocations

Now that SLUB seems to be fixed so that it respects the requested
alignment, use kmem_cache_alloc() to allocate if the block size of
the buffer heads to be allocated is less than the page size.
Previously, we were using a 16k page on a Power system for each buffer,
even when the file system was using a 1k or 4k block size.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/jbd2/journal.c    |  116 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h |   11 +----
 2 files changed, 118 insertions(+), 9 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 3f473fa..8e0c0a9 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -39,6 +39,8 @@
 #include <linux/seq_file.h>
 #include <linux/math64.h>
 #include <linux/hash.h>
+#include <linux/log2.h>
+#include <linux/vmalloc.h>

 #define CREATE_TRACE_POINTS
 #include <trace/events/jbd2.h>
@@ -92,6 +94,7 @@ EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);

 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
+static int jbd2_journal_create_slab(size_t slab_size);

Josef Bacik | 2 Dec 2009 17:48

Re: [PATCH 2/2] ext4: Use slab allocator for sub-page sized allocations

On Tue, Dec 01, 2009 at 08:14:11PM -0500, Theodore Ts'o wrote:
> Now that SLUB seems to be fixed so that it respects the requested
> alignment, use kmem_cache_alloc() to allocate if the block size of
> the buffer heads to be allocated is less than the page size.
> Previously, we were using a 16k page on a Power system for each buffer,
> even when the file system was using a 1k or 4k block size.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Heh, funny you should bring this up: we just fixed a problem in RHEL5 related
to using slabs.  If you have a non-jbd based root filesystem and then mount a
bunch of jbd-based fs's at the same time, you will race with the cache creation
stuff and end up getting errors about creating the same slab cache multiple
times.  The way this was fixed in RHEL5 was to put a mutex around the slab
cache creation and deletion so this sort of race cannot happen.  It seems that
this patch has the same issue as the 2.6.18 jbd had.  Thanks,

Josef
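
The locking pattern Josef describes, a mutex around lazy cache creation so
that concurrent mounts cannot create the same cache twice, can be
illustrated with a small userspace analogy. This is neither the RHEL5 fix
nor the jbd2 code. struct fake_cache, get_slab(), NUM_SLABS and the
one-cache-per-power-of-two-size layout are assumptions invented for the
sketch, with pthread_mutex_t standing in for a kernel mutex and malloc()
standing in for kmem_cache_create().

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_SLABS 8	/* hypothetical: slot i serves objects of 1k << i */

/* Stand-in for a kernel slab cache. */
struct fake_cache {
	size_t obj_size;
};

static struct fake_cache *slabs[NUM_SLABS];
static pthread_mutex_t slab_lock = PTHREAD_MUTEX_INITIALIZER;
static int creations;	/* counts how many caches were really created */

/*
 * Return the cache for slot idx, creating it on first use. The check
 * and the creation happen under one lock, so two callers racing on the
 * same slot cannot both see "not created yet".
 */
static struct fake_cache *get_slab(int idx)
{
	struct fake_cache *c;

	if (idx < 0 || idx >= NUM_SLABS)
		return NULL;

	pthread_mutex_lock(&slab_lock);
	c = slabs[idx];
	if (!c) {
		c = malloc(sizeof(*c));
		if (c) {
			c->obj_size = (size_t)1024 << idx;
			slabs[idx] = c;
			creations++;
		}
	}
	pthread_mutex_unlock(&slab_lock);
	return c;
}
```

Without the lock, two threads could both observe slabs[idx] == NULL and
both create a cache, which corresponds to the "same slab cache created
multiple times" errors mentioned above. Deletion would need to take the
same lock.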