Andrew Morton | 1 Jun 2007 06:15

Fw: ext3 dir_index causes an error


Ted is dir_index maintainer ;)

That's a nice-looking bug report, btw.  Thanks.

Begin forwarded message:

Date: Fri, 01 Jun 2007 13:01:07 +0900
From: hooanon05 <at> yahoo.co.jp
To: sct <at> redhat.com, akpm <at> linux-foundation.org, adilger <at> clusterfs.com
Subject: ext3 dir_index causes an error

Hello,

First of all, I really appreciate your great work.
I've found a problem with the dir_index feature.
Here is a report following linux/REPORTING-BUGS.

[1.] One line summary of the problem:
ext3 dir_index causes an error

[2.] Full description of the problem/report:
This is my local test program to reproduce the problem. readdir1.c
calls creat(2), opendir(3) and readdir(3), and the shell script
executes it repeatedly on a brand-new ext3fs image on a loopback
device.
When the script adds '-O dir_index' to mkfs, some errors appear.

On a system with linux-2.6.21.3, ext3fs produces these error messages,
and the filesystem seems to be corrupted.

Andi Kleen | 1 Jun 2007 10:46

Re: [RFC][PATCH] Multiple mount protection

Kalpak Shah <kalpak <at> clusterfs.com> writes:

> Hi,
> 
> There have been reported instances of a filesystem having been mounted
> at 2 places at the same time causing a lot of damage to the filesystem.
> This patch reserves superblock fields and an INCOMPAT flag for adding
> multiple mount protection(MMP) support within the ext4 filesystem
> itself. The superblock will have a block number (s_mmp_block) which
> will hold a MMP structure which has a sequence number which will be
> periodically updated every 5 seconds by a mounted filesystem. Whenever
> a filesystem will be mounted it will wait for s_mmp_interval seconds to
> make sure that the MMP sequence does not change. To further make sure,
> we write a random sequence number into the MMP block and wait for
> another s_mmp_interval secs. If the sequence no. doesn't change then
> the mount will succeed. In case of failure, the nodename, bdevname and
> the time at which the MMP block was last updated will be displayed.
> tune2fs can be used to set s_mmp_interval as desired.

That will make laptop users very unhappy if you spin up their disks every 5 seconds.  And
even on other systems it might reduce the MTBF if you write the super block much more
often than before. It might be better to set it up in some way to only increase
that number when the super block is written for some other reason anyways.

-Andi
Kalpak Shah | 1 Jun 2007 10:27

Re: [RFC][PATCH] Multiple mount protection

On Fri, 2007-06-01 at 10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <kalpak <at> clusterfs.com> writes:
> 
> > Hi,
> > 
> > There have been reported instances of a filesystem having been mounted
> > at 2 places at the same time causing a lot of damage to the filesystem.
> > This patch reserves superblock fields and an INCOMPAT flag for adding
> > multiple mount protection(MMP) support within the ext4 filesystem
> > itself. The superblock will have a block number (s_mmp_block) which
> > will hold a MMP structure which has a sequence number which will be
> > periodically updated every 5 seconds by a mounted filesystem. Whenever
> > a filesystem will be mounted it will wait for s_mmp_interval seconds to
> > make sure that the MMP sequence does not change. To further make sure,
> > we write a random sequence number into the MMP block and wait for
> > another s_mmp_interval secs. If the sequence no. doesn't change then
> > the mount will succeed. In case of failure, the nodename, bdevname and
> > the time at which the MMP block was last updated will be displayed.
> > tune2fs can be used to set s_mmp_interval as desired.
> 
> That will make laptop users very unhappy if you spin up their disks every 5 seconds.  And
> even on other systems it might reduce the MTBF if you write the super block much more
> often than before. It might be better to set it up in some way to only increase
> that number when the super block is written for some other reason anyways.

The super block only saves the block number of the MMP block. So the
super block is not updated but the contents of the MMP block are updated
every 5 seconds.

If a user is unhappy with the 5 second interval, they can increase
it, with the caveat that mounting the filesystem will then take
2*mmp_interval seconds.
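
The mount-time check described in this thread (watch the sequence number for one interval, write a random value, watch for a second interval) could be sketched roughly as follows. This is only an illustration of the idea, not the code from the patch: the function name, the callback types and the simulated disk are all hypothetical, and real code would read and write the on-disk MMP block instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callbacks standing in for MMP block I/O and sleeping. */
typedef uint32_t (*mmp_read_fn)(void *ctx);
typedef void (*mmp_write_fn)(void *ctx, uint32_t seq);
typedef void (*mmp_wait_fn)(void *ctx, unsigned seconds);

/*
 * Returns 0 if the mount may proceed, -1 if another node appears to be
 * using the filesystem. my_seq should be chosen randomly by the caller.
 * A successful check waits 2 * interval seconds in total.
 */
static int mmp_check_before_mount(void *ctx, unsigned interval, uint32_t my_seq,
                                  mmp_read_fn rd, mmp_write_fn wr,
                                  mmp_wait_fn wait_s)
{
    uint32_t seen;

    assert(interval > 0);

    /* Phase 1: the sequence number must stay unchanged for one interval. */
    seen = rd(ctx);
    wait_s(ctx, interval);
    if (rd(ctx) != seen)
        return -1;

    /* Phase 2: write our own value and make sure nobody overwrites it. */
    wr(ctx, my_seq);
    wait_s(ctx, interval);
    if (rd(ctx) != my_seq)
        return -1;

    return 0;
}

/* Simulated shared disk, used only to exercise the sketch. */
struct sim_disk {
    uint32_t seq;          /* contents of the MMP block */
    int other_node_alive;  /* nonzero: another node keeps bumping seq */
    unsigned waited;       /* total simulated seconds slept */
};

static uint32_t sim_read(void *ctx)
{
    return ((struct sim_disk *)ctx)->seq;
}

static void sim_write(void *ctx, uint32_t seq)
{
    ((struct sim_disk *)ctx)->seq = seq;
}

static void sim_wait(void *ctx, unsigned seconds)
{
    struct sim_disk *d = ctx;

    d->waited += seconds;
    if (d->other_node_alive)
        d->seq++;  /* the other node's periodic keep-alive update */
}
```

With the simulated disk, an idle device passes the check after 2 * interval simulated seconds, while a device whose sequence number is still being bumped by another node is rejected after the first interval.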


Andreas Dilger | 1 Jun 2007 11:14

Re: [RFC][PATCH] Multiple mount protection

On Jun 01, 2007  10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <kalpak <at> clusterfs.com> writes:
> > There have been reported instances of a filesystem having been
> > mounted at 2 places at the same time causing a lot of damage to the
> > filesystem....  The superblock will have a block number (s_mmp_block)
> > which will hold a MMP structure which has a sequence number which will be
> > periodically updated every 5 seconds by a mounted filesystem.
> 
> That will make laptop users very unhappy if you spin up their disks every
> 5 seconds.  And even on other systems it might reduce the MTBF if you
> write the super block much more often than before. It might be better to
> set it up in some way to only increase that number when the super block is
> written for some other reason anyways.

It was mentioned before but deserves mentioning again that this will
be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
might be accessed by multiple nodes at the same time.  That means there
will not be any impact for desktop users waiting 10s for each of their
filesystems to mount.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Andi Kleen | 1 Jun 2007 12:56

Re: [RFC][PATCH] Multiple mount protection

> It was mentioned before but deserves mentioning again that this will
> be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
> might be accessed by multiple nodes at the same time.  That means there
> will not be any impact for desktop users waiting 10s for each of their
> filesystems to mount.

A safety feature that is optional? Doesn't sound very safe to me.
If the safety is needed it should probably be the default; otherwise
it isn't needed.

-Andi
Theodore Tso | 1 Jun 2007 13:41

Re: [RFC][PATCH] Multiple mount protection

On Fri, Jun 01, 2007 at 10:46:19AM +0200, Andi Kleen wrote:
> 
> That will make laptop users very unhappy if you spin up their disks
> every 5 seconds.  And even on other systems it might reduce the MTBF
> if you write the super block much more often than before. It might
> be better to set it up in some way to only increase that number when
> the super block is written for some other reason anyways.

You would never want to use this feature on a laptop; it would buy no
benefit for its costs, since with (all common) laptops, their hard
drives can't be shared with other machines in a cluster.

Unfortunately, it's not possible to do what you suggest, since one of
the whole points of increasing the sequence number every 5 seconds is
to act as a keep-alive, so another machine trying to access the shared
hard drive can tell whether the machine which currently has the
hard drive mounted is still alive.

This is why I and others have been a little worried about implementing
this feature, since it adds complexity which has to be in a proper HA
system anyway, and what is there isn't really an optimal HA solution
(since it lacks STONITH) and so you have to implement the
functionality again _anyway_ using a proper HA solution.

The argument on the other side is that it protects against failed HA
solutions, and against users who are too stupid to know that they need
an HA solution.  It does do the first; the second would only apply if
the users who were too stupid to realize they needed an HA solution,
were smart enough to enable the MMP feature --- and because of its
many costs, including keeping the disk spun up on laptops, and

Andi Kleen | 1 Jun 2007 14:13

Re: [RFC][PATCH] Multiple mount protection

> Unfortunately, it's not possible to do what you suggest, since one of
> the whole points of increasing the sequence number every 5 seconds is
> to act as a keep-alive, so another machine trying to access the shared

Clusters usually have other ways to do this, haven't they? 
Typically they have STONITH too. It's probably too simple minded
to just replace a real cluster setup which also handles split 
brain and other conditions. So it's purely against mistakes.

Besides relying on it would seem dangerous because it is not synchronous
and you could do a lot of damage in 5 seconds. 

The rationale for lazy writing would be that, at least in normal
operation, the super block should be written regularly anyway; and
since the 5 second delay is only a not fully reliable heuristic, it
probably doesn't matter if the actual delay is a little longer.

-Andi
Theodore Tso | 1 Jun 2007 15:52

Re: [RFC][PATCH] Multiple mount protection

On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Unfortunately, it's not possible to do what you suggest, since one of
> > the whole points of increasing the sequence number every 5 seconds is
> > to act as a keep-alive, so another machine trying to access the shared
> 
> Clusters usually have other ways to do this, haven't they? 
> Typically they have STONITH too. It's probably too simple minded
> to just replace a real cluster setup which also handles split 
> brain and other conditions. So it's purely against mistakes.

Yes, its only real value is to protect against Cluster-HA
malfunctions or misconfiguration.

> Besides relying on it would seem dangerous because it is not synchronous
> and you could do a lot of damage in 5 seconds. 

Well, the MMP feature is assigned an incompatible feature bit, so a
kernel who doesn't know about MMP will refuse to touch it; and a
kernel which does follow the MMP protocol will check the MMP block
(delaying the mount by 10 seconds) to make sure no other system is
using the block.

So aside from being !@#!@ annoying (which is why it will never be the
default), it does work, modulo the problem that without STONITH or any
kind of I/O fencing, we do risk the other system coming back to life
and then modifying the filesystem in parallel.  So as everyone has
said, this is not a solution that works in isolation, but is really only
a backup.
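
The incompatible-feature-bit argument above can be illustrated with a small sketch: a kernel compares the filesystem's incompat bits against the set it understands and refuses the mount if any unknown bit is set. The names and bit values below are hypothetical, chosen only for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; hypothetical, not taken from the patch. */
#define DEMO_INCOMPAT_FILETYPE 0x0002u
#define DEMO_INCOMPAT_MMP      0x0100u

/* Returns nonzero if every incompat bit set on disk is understood. */
static int demo_can_mount(uint32_t sb_incompat, uint32_t supported)
{
    return (sb_incompat & ~supported) == 0;
}

/* Usage: an MMP-unaware kernel refuses, an MMP-aware kernel accepts. */
static void demo_self_check(void)
{
    uint32_t sb = DEMO_INCOMPAT_FILETYPE | DEMO_INCOMPAT_MMP;

    assert(!demo_can_mount(sb, DEMO_INCOMPAT_FILETYPE));
    assert(demo_can_mount(sb, DEMO_INCOMPAT_FILETYPE | DEMO_INCOMPAT_MMP));
}
```

This is why an MMP-enabled filesystem cannot be silently modified by a kernel that skips the protocol: the unknown bit alone is enough to make the mount fail.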

The question of whether the complexity and then 10 second mount delay

Jose R. Santos | 1 Jun 2007 17:52

[RFC][PATCH] Set JBD2_FEATURE_INCOMPAT_64BIT on filesystems larger than 32-bit blocks.

Set the journal's JBD2_FEATURE_INCOMPAT_64BIT flag at mount time on
devices with more than 32-bit block counts.  This ensures proper record
length when writing to the journal.

Signed-off-by: Jose R. Santos <jrs <at> us.ibm.com>

--- .pc/ext4-set_64-bit_JBD_incompat.patch/fs/ext4/super.c	2007-05-31 22:25:21.843984697 -0500
+++ fs/ext4/super.c	2007-05-31 22:26:14.913670672 -0500
@@ -1824,6 +1824,17 @@ static int ext4_fill_super (struct super
 		goto failed_mount3;
 	}

+        /* 
+	 * Make sure to set JBD2_FEATURE_INCOMPAT_64BIT on filesystems
+	 * with more than 32-bit block counts
+	 */
+	if(es->s_blocks_count_hi)
+		if (!jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0, 
+					       JBD2_FEATURE_INCOMPAT_64BIT)){
+			printk(KERN_ERR "ext4: Failed to set 64-bit journal feature\n");
+			goto failed_mount4;
+		}
+
 	/* We have now updated the journal if required, so we can
 	 * validate the data journaling mode. */
 	switch (test_opt(sb, DATA_FLAGS)) {

Thanks
-JRS
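
As a rough illustration of why the patch tests s_blocks_count_hi: if the block count is stored as two 32-bit halves, the total exceeds what fits in 32 bits exactly when the high half is nonzero. The helper names below are hypothetical and the sketch ignores on-disk endianness conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two 32-bit halves into one 64-bit block count. */
static uint64_t demo_blocks_count(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Nonzero when the count does not fit in 32 bits, i.e. when hi != 0. */
static int demo_needs_64bit_journal(uint32_t lo, uint32_t hi)
{
    return demo_blocks_count(lo, hi) > UINT32_MAX;
}

/* Usage: the largest 32-bit count does not need the flag, one more does. */
static void demo_blocks_self_check(void)
{
    assert(!demo_needs_64bit_journal(UINT32_MAX, 0));
    assert(demo_needs_64bit_journal(0, 1));
    assert(demo_blocks_count(0, 1) == (uint64_t)UINT32_MAX + 1);
}
```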

Andreas Dilger | 1 Jun 2007 20:00

Re: [RFC][PATCH] Multiple mount protection

On Jun 01, 2007  09:52 -0400, Theodore Tso wrote:
> On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Clusters usually have other ways to do this, haven't they? 
> > Typically they have STONITH too. It's probably too simple minded
> > to just replace a real cluster setup which also handles split 
> > brain and other conditions. So it's purely against mistakes.
> 
> Yes, its only real value is to protect against Cluster-HA
> malfunctions or misconfiguration.

While I agree that HA systems _should_ be enough for this, in our
experience even with an HA system some people get it wrong (e.g.
manually mounting and bypassing HA, HA itself is broken, comms failure,
STONITH failure, etc).

I agree it is not intended to be a replacement for an HA/STONITH
solution, just belt & suspenders that would have saved hundreds of
TB of user data in several cases if it were available.  We will
enable it by default on all of our filesystems, and of course I'd
advise anyone in a SAN environment (whether they _intend_ to have
shared disk access or not) to enable it also.

> > Besides relying on it would seem dangerous because it is not synchronous
> > and you could do a lot of damage in 5 seconds. 
> 
> Well, the MMP feature is assigned an incompatible feature bit, so a
> kernel who doesn't know about MMP will refuse to touch it; and a
> kernel which does follow the MMP protocol will check the MMP block
> (delaying the mount by 10 seconds) to make sure no other system is
> using the block.

