Andreas Dilger | 1 Sep 01:13 2009

Re: large file system & high object count testing

On Aug 31, 2009  14:25 -0700, Justin Maggard wrote:
> On Aug 31, 2009  13:02 -0400, Ric Wheeler wrote:
> > Mount after fsck:
> > Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> > ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> > Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> > corrupted!
> 
> Ah, so it's not just me.  It looks like you're seeing the exact same
> thing I reported a few days ago in the ">16TB issues" thread.  You
> don't even have to do anything fancy to make this happen.  My test
> case involves simply creating 5 directories on the newly-created
> 64-bit filesystem, and running e2fsck on it immediately after
> unmounting to get the same results.

Justin, could you please replicate this corruption, collecting some
additional information before and after each step?  My recollection is
that the corruption appears in the first few groups, so 64kB should be
plenty to capture the group descriptor tables (where the checksum is
kept).

- mke2fs
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-new.gz
- mkdir ...
- sync
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-mkdir.gz
- umount
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-umount.gz
- e2fsck
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-e2fsck.gz

(Continue reading)

Andreas Dilger | 1 Sep 01:16 2009

Re: large file system & high object count testing

On Aug 31, 2009  17:01 -0400, Ric Wheeler wrote:
> On 08/31/2009 04:19 PM, Andreas Dilger wrote:
>> Ouch, 4h is a long time, but hopefully not many people have to reformat
>> their 120TB filesystem on a regular basis.
>
> Seems like it should not take longer than fsck in any case?  Might be
> interesting to use blktrace/seekwatcher to see if it is thrashing these
> big, slow drives around...

Well, e2fsck + gdt_csum can skip reading large parts of an empty
filesystem, while ironically mke2fs is required to initialize it all.
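
As an aside, and depending on the e2fsprogs version in play, mke2fs can
be told to skip initializing the inode tables when the uninit_bg feature
is enabled, which is one obvious way to attack that 4h format time.
A sketch, with /dev/XXX standing in for the test device:

    mke2fs -t ext4 -O uninit_bg -E lazy_itable_init=1 /dev/XXX

The catch is that the skipped inode tables are left holding whatever the
device previously contained, so this is only safe together with the
checksummed group descriptors that mark those groups as uninitialized.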

>>> [root <at> megadeth e2fsck]# time ./e2fsck -f -tt /dev/vg_wdc_disks/lv_wdc_disks
>>> e2fsck 1.41.8 (20-Jul-2009)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
>>> 4630.05/780.40/3580.01
>>
>> Sigh, we need better memory accounting in e2fsck.  Rather than depending
>> on the VM/glibc to track that for us, how hard would it be to just add
>> a counter into e2fsck_{get,free,resize}_mem() to track this?
>
> That second number looks like a bug, not a real memory number. The 
> largest memory allocation I saw while it ran with top was around 6-7GB 
> iirc.

Sure, it is a 32-bit overflow (which is the most this API can provide),
which is why we should fix it.
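
As a rough sketch of the counter idea, using hypothetical wrapper names
patterned on the e2fsck_{get,free,resize}_mem() functions mentioned
above (the real helpers and their signatures may differ):

    #include <stdlib.h>

    /* Running totals kept by the allocation wrappers themselves, so the
     * report no longer depends on VM/glibc accounting and cannot wrap a
     * 32-bit value the way the numbers above do. */
    static unsigned long long bytes_in_use, bytes_peak;

    void *fsck_get_mem(size_t size)
    {
            /* Stash the size in front of the buffer so the free wrapper
             * knows how much to subtract. */
            size_t *p = malloc(size + sizeof(size_t));

            if (!p)
                    return NULL;
            *p = size;
            bytes_in_use += size;
            if (bytes_in_use > bytes_peak)
                    bytes_peak = bytes_in_use;
            return p + 1;
    }

    void fsck_free_mem(void *ptr)
    {
            size_t *p;

            if (!ptr)
                    return;
            p = (size_t *)ptr - 1;
            bytes_in_use -= *p;
            free(p);
    }

Printing bytes_peak alongside the pass timings would replace the
overflowed second number in the report above.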

>> Hmm, is e2fsck computing the 64-byte group descriptor checksum differently
(Continue reading)

Ron Johnson | 1 Sep 01:19 2009

Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 2009-08-31 17:26, Jesse Brandeburg wrote:
> On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<madduck <at> debian.org> wrote:
>> also sprach Jesse Brandeburg <jesse.brandeburg <at> gmail.com> [2009.08.31.1949 +0200]:
>>> In the case of a degraded array, could the kernel be more
>>> proactive (or maybe even mdadm) and have the filesystem remount
>>> itself withOUT journalling enabled?  This seems on the surface to
>>> be possible, but I don't know the internal particulars that might
>>> prevent/allow it.
>> Why would I want to disable the filesystem journal in that case?
> 
> I misspoke w.r.t. journalling; the idea I was trying to get across was
> to remount with -o sync while running on a degraded array, but given
> some of the other comments in this thread I'm not even sure that would
> help.  The idea was to make writes as safe as possible (at the cost of
> speed) when running on a degraded array, and to have the transition be
> as hands-free as possible: just have the kernel (or mdadm) remount by
> default.

Much better, I'd think, to "just" have it scream out DANGER!! WILL 
ROBINSON!! DANGER!! to syslog and to an email hook.
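
As a rough illustration of such hooks (a sketch only; the mail address
and the md-alarm script below are placeholders), mdadm's monitor mode
can already send mail and run a program when it sees events like
DegradedArray or Fail:

    # /etc/mdadm.conf (excerpt)
    # mail on events such as Fail, DegradedArray, SparesMissing
    MAILADDR admin@example.com
    # hypothetical hook script, invoked with the event name and device
    PROGRAM /usr/local/sbin/md-alarm

    # run the monitor as a daemon (many distros start this automatically)
    mdadm --monitor --scan --daemonise

Anything fancier, such as an automatic remount, would have to be wired
into that hook by the admin.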

--

-- 
Brawndo's got what plants crave.  It's got electrolytes!

Jiaying Zhang | 1 Sep 01:33 2009

Re: Question on fallocate/ftruncate sequence

On Mon, Aug 31, 2009 at 2:56 PM, Andreas Dilger<adilger <at> sun.com> wrote:
> On Aug 31, 2009  12:40 -0700, Jiaying Zhang wrote:
>> > It's better to define the flag as EXT4_KEEPSIZE_FL, and to use it as
>> > EXT4_KEEPSIZE_FL, but make a note of that bitfield position as being
>> > reserved in include/linux/fs.h.
>>
>> Here is the modified patch based on your suggestions.  I stuck with
>> the KEEPSIZE_FL approach, which I think allows us to handle the special
>> truncation accordingly during fsck.  Other file systems can also reuse
>> this flag when they want to support fallocate with KEEP_SIZE.  As you
>> suggested, I moved the EXT4_KEEPSIZE_FL checking to ext4_setattr,
>> which now calls vmtruncate if the KEEPSIZE flag is set in i_flags.
>> Please let me know what you think about this proposed patch.
>>
>> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c   2009-08-31
>> 12:08:10.000000000 -0700
>> +++ fs/ext4/extents.c 2009-08-31 12:12:16.000000000 -0700
>> @@ -3095,7 +3095,13 @@ static void ext4_falloc_update_inode(str
>>                       i_size_write(inode, new_size);
>>               if (new_size > EXT4_I(inode)->i_disksize)
>>                       ext4_update_i_disksize(inode, new_size);
>> +             inode->i_flags &= ~EXT4_KEEPSIZE_FL;
>
> Note that fallocate can be called multiple times for a file.  The
> EXT4_KEEPSIZE_FL should only be cleared if there were writes to
> the end of the fallocated space.  In that regard, I think the name
> of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
> to indicate that blocks are allocated beyond the end of file (i_size).

Thanks for catching this! I changed the patch to only clear the flag
(Continue reading)
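
A minimal sketch of the clearing rule Andreas describes, using
hypothetical structure and flag names rather than the real ext4 ones:

    #include <stdint.h>

    #define MY_EOFBLOCKS_FL 0x00400000u   /* "blocks allocated past i_size" */

    struct my_inode {
            uint32_t flags;
            uint64_t i_size;          /* logical file size */
            uint64_t prealloc_end;    /* end of the fallocated range, in bytes */
    };

    /* fallocate(..., KEEP_SIZE) sets MY_EOFBLOCKS_FL; repeated fallocate
     * calls must leave it alone.  Only a write (or size-changing truncate)
     * that actually reaches the end of the preallocated space may clear
     * it, because until then blocks still exist beyond i_size on purpose. */
    void eofblocks_after_write(struct my_inode *inode, uint64_t write_end)
    {
            if (!(inode->flags & MY_EOFBLOCKS_FL))
                    return;
            if (write_end >= inode->prealloc_end)
                    inode->flags &= ~MY_EOFBLOCKS_FL;
    }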

Justin Maggard | 1 Sep 01:37 2009

Re: large file system & high object count testing

On Mon, Aug 31, 2009 at 4:13 PM, Andreas Dilger<adilger <at> sun.com> wrote:
> Justin, could you please replicate this corruption, collecting some
> additional information before & after.  My recollection is that the
> corruption appears in the first few groups, so 64kB should be plenty
> to capture the group descriptor tables (where the checksum is kept).
>
> - mke2fs
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-new.gz
> - mkdir ...
> - sync
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-mkdir.gz
> - umount
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-umount.gz
> - e2fsck
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-e2fsck.gz
>

No problem.  I just sent you an email with those four attached.  If
anyone would like me to upload them somewhere else, just let me know.

-Justin

George Spelvin | 1 Sep 02:56 2009

Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

From: david <at> lang.hm
Date: Mon, 31 Aug 2009 08:45:38 -0700 (PDT)
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

>>> That's one thing I really like about ZFS: its policy of "don't trust
>>> the disks."  If nothing else, simply telling you "your disks f*ed up,
>>> and I caught them doing it", instead of the usual mysterious corruption
>>> detected three months later, is tremendously useful information.
>>
>> The more I learn about storage, the more I like idea of zfs. Given the
>> subtle issues between filesystem and raid layer, integrating them just
>> makes sense.
> 
> Note that all that zfs does is tell you that you already lost data (and
> then only if the checksumming algorithm would be invalid on a blank block
> being returned); it doesn't protect your data.

Obviously, there are limits, but it does provide useful protection:
- You know where the missing data is.
- The error isn't amplified by believing corrupted metadata
(Continue reading)

Akira Fujita | 1 Sep 03:35 2009

Re: [PATCH]ext4: Enable mount ext4 with AGGRESSIVE_TEST


Hi,
Theodore Tso wrote:
> On Thu, Aug 27, 2009 at 01:54:39PM +0900, Akira Fujita wrote:
>> When AGGRESSIVE_TEST is enabled, the maximum extent count of the
>> inode root becomes 3 in ext4_ext_space_root(), but __ext4_ext_check(),
>> which is called via ext4_fill_super(), rejects the on-disk eh_max of 4
>> as too large, so we always get -EIO and cannot mount ext4.
>> The patch fixes this issue.
> 
> This isn't really the right fix for the problem.
> 
> What's actually going on here is that the patch which added the
> sanity checks for extents, and which added (among other functions)
> __ext4_ext_check() and ext4_ext_max_entries(), doesn't quite work
> right in the presence of AGGRESSIVE_TEST.
> 
> The goal of AGGRESSIVE_TEST is to stress test the extent tree insert
> and deletion code by forcing the size of eh_max to be a smaller value
> than it otherwise would be.  It did this by dropping in limits to the
> values returned by ext4_ext_space_root(), ext4_ext_space_idx(),
> ext4_ext_space_block(), and ext4_ext_block_idx().  This worked all
> very well and good when these functions were only used to initialize
> the eh_max field.
> 
> The problem is that the extent checking code also used these functions
> to check whether or not the extent tree was sane, and if
> AGGRESSIVE_TEST was enabled, it caused the journal to be considered
> invalid.
> 
(Continue reading)
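
A toy illustration of the conflict described above, with simplified
names standing in for the real fs/ext4/extents.c functions:

    #include <stdio.h>

    #define AGGRESSIVE_TEST  /* pretend the stress-test option is on */

    /* How many extent entries fit in the inode's root node. */
    static int ext_space_root(void)
    {
            int max = 4;              /* what the on-disk layout allows */
    #ifdef AGGRESSIVE_TEST
            if (max > 3)
                    max = 3;          /* artificially small, to force splits */
    #endif
            return max;
    }

    /* The sanity check as written: it reuses the capped value, so a
     * filesystem made without AGGRESSIVE_TEST (eh_max == 4) is rejected. */
    static int check_eh_max(int eh_max_on_disk)
    {
            return eh_max_on_disk <= ext_space_root();
    }

    int main(void)
    {
            /* 4 was perfectly valid when the fs was created, but the
             * capped limit of 3 makes the check fail -> -EIO at mount. */
            printf("eh_max=4 considered %s\n",
                   check_eh_max(4) ? "valid" : "invalid");
            return 0;
    }

One natural fix is for the validation path to compare against the real
on-disk limit rather than the artificially capped test-time value.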

Theodore Tso | 1 Sep 06:01 2009

Re: [PATCH 0/7] Per-bdi writeback flusher threads v15

On Mon, Aug 31, 2009 at 09:41:26PM +0200, Jens Axboe wrote:
> 
> Here's the 15th version of the writeback patches.

The test-bdi branch of ext4.git tree is a combination of the patches I
plan to push once the merge window opens, plus v15 per-bdi patches.
On the ext4 call I've recruited some ext4 developers to do some
testing and benchmarking *before* the merge window opens.

I've found something potentially wrong with the writeback.  I was
careful to sync the disk before shutting down the system, and I got
the following error.

[  660.499155] ext4_da_writepages: jbd2_start: 32768 pages, ino 41; err -30
[  660.505664] Pid: 1848, comm: flush-8:0 Not tainted 2.6.31-rc8-01474-geef6bd0 #362
[  660.512402] Call Trace:
[  660.514982]  [<c064412f>] ? printk+0x14/0x16
[  660.518851]  [<c023ac4a>] ext4_da_writepages+0x213/0x40e
[  660.523356]  [<c023aa37>] ? ext4_da_writepages+0x0/0x40e
[  660.528383]  [<c01b65a8>] do_writepages+0x28/0x39
[  660.533534]  [<c01f1e07>] writeback_single_inode+0x15c/0x346
[  660.538401]  [<c0165b23>] ? down_read_trylock+0x3e/0x48
[  660.542921]  [<c01f2892>] generic_sync_wb_inodes+0x2ab/0x37b
[  660.547805]  [<c01f2a2a>] wb_writeback+0xc8/0xfa
[  660.551975]  [<c01f2abb>] wb_do_writeback+0x5f/0x118
[  660.556333]  [<c01f2b95>] bdi_writeback_task+0x21/0x92
[  660.560838]  [<c01c0f5d>] bdi_start_fn+0x63/0xb2
[  660.565647]  [<c01c0efa>] ? bdi_start_fn+0x0/0xb2
[  660.569710]  [<c0162363>] kthread+0x73/0x78
[  660.573528]  [<c01622f0>] ? kthread+0x0/0x78
(Continue reading)

Martin K. Petersen | 1 Sep 07:19 2009

Re: Data integrity built into the storage stack

>>>>> "Greg" == Greg Freemyer <greg.freemyer <at> gmail.com> writes:

Greg> We already have the scsi data integrity patches that went in last
Greg> winter and I believe fit into the storage stack below the
Greg> filesystem layer.

The filesystems can actually use it.  It's exposed at the bio level.

Greg> I do believe there is a patch floating around for device mapper to
Greg> add some integrity capability.

The patch is in mainline.  It allows passthrough so the filesystems can
access the integrity features.  But DM itself doesn't use any of them,
it merely acts as a conduit.

DIF is inherently tied to the storage device's logical blocks.  These are
likely to be smaller than the blocks we're interested in protecting.
However, you could conceivably use the application tag space to add a
checksum with filesystem or MD/DM blocking size granularity.  All the
hooks are there.

The application tag space is pretty much only available on disk
drives--array vendors use it for internal purposes.  But in the MD/DM
case we're likely to run on raw disk so that's probably ok.
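
As a conceptual sketch of that idea (not the kernel's block integrity
API; the tuple layout is the usual 8-byte T10 DIF format, and crc32 from
zlib stands in for whatever checksum a filesystem would pick):

    #include <stdint.h>
    #include <zlib.h>          /* crc32() */

    #define SECTOR_SIZE   512
    #define FS_BLOCK_SIZE 4096
    #define SECTORS_PER_FS_BLOCK (FS_BLOCK_SIZE / SECTOR_SIZE)   /* 8 */

    /* T10 DIF protection information: one 8-byte tuple per logical block. */
    struct dif_tuple {
            uint16_t guard_tag;  /* CRC of the 512-byte sector (HBA/target) */
            uint16_t app_tag;    /* free for the initiator, e.g. a filesystem */
            uint32_t ref_tag;    /* usually the low 32 bits of the LBA */
    };

    /* Spread a filesystem-block checksum across the application tags of
     * the sectors that make up the block: the 32-bit CRC is sliced into
     * two 16-bit halves and repeated across the 8 tags. */
    void tag_fs_block(const uint8_t *block, struct dif_tuple *pi)
    {
            uint32_t csum = crc32(0L, block, FS_BLOCK_SIZE);
            for (int i = 0; i < SECTORS_PER_FS_BLOCK; i++)
                    pi[i].app_tag = (uint16_t)(csum >> ((i % 2) ? 16 : 0));
    }

    /* On read, rebuild the checksum from the tags and compare. */
    int verify_fs_block(const uint8_t *block, const struct dif_tuple *pi)
    {
            uint32_t stored = (uint32_t)pi[0].app_tag |
                              ((uint32_t)pi[1].app_tag << 16);
            return crc32(0L, block, FS_BLOCK_SIZE) == stored;
    }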

That said, I really think btrfs is the right answer to many of the
concerns raised in this thread.  Everything is properly checksummed and
can be verified at read time.

The strength of DIX/DIF is that we can detect corruption at write time
(Continue reading)

martin f krafft | 1 Sep 07:45 2009

Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

also sprach Jesse Brandeburg <jesse.brandeburg <at> gmail.com> [2009.09.01.0026 +0200]:
> I misspoke w.r.t. journalling; the idea I was trying to get across
> was to remount with -o sync while running on a degraded array, but
> given some of the other comments in this thread I'm not even sure
> that would help.  The idea was to make writes as safe as possible
> (at the cost of speed) when running on a degraded array, and to
> have the transition be as hands-free as possible: just have the
> kernel (or mdadm) remount by default.

I don't see how that is any more necessary with a degraded array
than it is when you have a fully working array. Sync just ensures
that the data are written and not cached, but that has absolutely
nothing to do with the underlying storage. Or am I failing to see
the link?

--

-- 
 .''`.   martin f. krafft <madduck <at> d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems

"how do you feel about women's rights?"
"i like either side of them."
                                                       -- groucho marx
