Jörg Sommer | 1 Oct 2006 01:57

Re: Bug#389772: e2fsprogs: e2fsck produces broken htree on ppc

Hello Theodore,

Theodore Tso wrote on Fri, 29 Sep, 17:15 (-0400):
> If the filesystem is empty (or at least has no hashtree directories),
> then when the kernel creates new directories and expands to the point
> where they become indexed, they will be indexed with the PPC variant
> of the hash algorithm.  This will be self-consistent, and everything
> will work fine --- until the filesystem gets corrupted to the point
> where e2fsprogs needs to rebuild one or more hashed directories.  At
> that point the directories will be rebuilt using the same conventions
> used by all other architectures, but the directories will no longer be
> useful on the PPC kernel.
> 
> Joerg, can you confirm this?

Yes.

> On a PPC machine, can you create a smallish ext3 filesystem (say, 4-8
> megabytes), create a directory with enough files in it that it becomes
> indexed (verify using lsattr), and show that it works just fine on a
> PPC.  Now take that image, and transfer it to an x86 machine;

I don't have an x86 machine.

> you should find that the kernel can't look up any of the directories on
> the x86 machine.  If you then run e2fsck -fD on that filesystem
> (running the e2fsck on either x86 or PPC; it shouldn't make a
> difference), then the resulting filesystem should work just fine on the
> x86, and fail on the PPC.
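
A minimal sketch of how two "variants" of one hash can arise on different
architectures. The quoted text does not say why the PPC result differs; the
sketch assumes the cause is the signedness of plain char (unsigned on PPC,
signed on x86), which only matters for name bytes >= 0x80. The toy_hash_*
functions are hypothetical and are not the ext3 directory hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hash reading name bytes as signed values (bytes >= 0x80 go negative).
 * Hypothetical stand-in, NOT the real ext3 dirhash. */
uint32_t toy_hash_signed(const char *name, size_t len)
{
    const signed char *s = (const signed char *)name;
    uint32_t h = 0x12345678u;

    assert(name != NULL);
    for (size_t i = 0; i < len; i++)
        h = h * 31u + (uint32_t)(int32_t)s[i];  /* sign-extended byte */
    return h;
}

/* Same toy hash, but reading name bytes as unsigned values. */
uint32_t toy_hash_unsigned(const char *name, size_t len)
{
    const unsigned char *u = (const unsigned char *)name;
    uint32_t h = 0x12345678u;

    assert(name != NULL);
    for (size_t i = 0; i < len; i++)
        h = h * 31u + (uint32_t)u[i];           /* zero-extended byte */
    return h;
}
```

Under that assumption, an ASCII-only name hashes identically in both
variants, while a name containing a byte such as 0xE4 does not, so an index
built with one variant cannot be searched with the other.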


Andreas Dilger | 2 Oct 2006 06:34

Re: ext4 compat flag assignments

On Sep 29, 2006  01:06 +0200, Andi Kleen wrote:
> Andreas Dilger wrote:
> > No work has been done on this yet.  Getting checksums to be efficient
> > depends on having a generic callback mechanism from the journal code
> > to avoid repeated checksums on a block while it is being modified.
> 
> You can just do incremental checksumming which is very cheap. 
> 
> Or did you mean the flushing to disk of the checksum?  If it's always in
> the same object that would be free, but that is not possible for bitmaps
> at least.  But I guess the checksum write in the block descriptor 
> could be done very lazily at least, perhaps keeping track on disk if invalid
> checksums are expected or not.

I'm not sure I understand what you mean.  My goal is that the ext4 code
modifies the block as many times as it wants during a transaction (this
may happen from multiple threads for a single block), then just before
the transaction is committed to disk the journal calls a callback for that
block (inode, group descriptor, bitmap, superblock, extent, index, etc.) and
computes the checksum only once for that block.  Then the block is flushed
to the filesystem.
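
A minimal sketch of the "checksum once at commit" idea described above,
under stated assumptions: struct sketch_block, sketch_block_modify() and
sketch_commit_callback() are hypothetical names, the block size is tiny for
illustration, and Adler-32 is only a stand-in for whatever checksum ext4
would actually use. No journal code is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 64   /* illustrative size, not a real block size */

struct sketch_block {
    uint8_t  data[SKETCH_BLOCK_SIZE];
    uint32_t checksum;        /* only meaningful after the commit callback */
    unsigned checksum_runs;   /* counts how often the checksum was computed */
};

/* Adler-32, used purely as a placeholder checksum. */
uint32_t sketch_adler32(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Any number of modifications during a transaction: no checksum work. */
void sketch_block_modify(struct sketch_block *blk, size_t off, uint8_t value)
{
    assert(blk != NULL && off < SKETCH_BLOCK_SIZE);
    blk->data[off] = value;
}

/* What the journal would call once per block just before commit. */
void sketch_commit_callback(struct sketch_block *blk)
{
    assert(blk != NULL);
    blk->checksum = sketch_adler32(blk->data, SKETCH_BLOCK_SIZE);
    blk->checksum_runs++;
}
```

However many times sketch_block_modify() runs within one transaction, the
checksum is computed a single time, in the callback.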

I'm not sure I like the idea of writing "this block doesn't have a valid
checksum" to disk, since there is some risk of that block being corrupted
during a crash and then we don't know if the block is valid or not.

> > Finally, the extents format has the capability (though no code is
> > implemented for this yet) to store a checksum in each index and extent
> > block... storing an ext3_extent_tail (checksum, inode+generation
> > backpointer) as the last entry in the block.

linux | 3 Oct 2006 02:40

e2fsck 1.39 bug report

Yes, an actual bug.  e2fsck 1.39, taken from Debian binary e2fsprogs_1.39-1.

I had a drive eat itself in a new and interesting way, which corrupted the
first batch of inodes on the disk, leading to losing the root directory
and much gnashing of teeth.  All of the top-level directories ended up
attached under /lost+found.

(I'm a little mystified HOW this happened in the first place, given
that the system has mirrored drives, ext3 journaling, and ECC RAM, but
I neglected to have the logs go to a separate machine, so the history
of what happened after the hard drive started erroring is murky.
I suspect it wrote data to the wrong place.)

Anyway, after some e2fsck runs, I got a clean file system.  Before
starting on the nasty work of piecing together the system from what got
dumped in lost+found and the last backup, I ran e2fsck one more time
(with -f) just to be sure, and got a dozen of these:

Pass 2: Checking directory structure
Entry '..' in /lost+found/#4563658/1148529338.15087.science/usr/src/linux-2.6/ipc/.sem.o.cmd (1203943) has an incorrect filetype (was 1, should be 2)
Fix<y>? yes
Entry '..' in /lost+found/#4563658/1148490363.19043.science (4563661) has an incorrect filetype (was 1, should be 2)
Fix<y>? yes
Entry '..' in /lost+found/#4563658 (4563658) has an incorrect filetype (was 1, should be 2)
Fix<y>?

Note that the maildir file (with all the numbers) and .sem.o.cmd were both
originally files before this mess, but they got turned into directories.

Vasily Averin | 4 Oct 2006 09:50

ext2/ext3 errors behaviour fixes

Hello all,

Current error behaviour for ext2 and ext3 filesystems does not fully correspond
to the documentation and should be fixed.

According to man 8 mount, ext2 and ext3 file systems allow one of three
on-error behaviours to be set:

---- start of quote man 8 mount ----
errors=continue / errors=remount-ro / errors=panic
    Define the behaviour when an error is encountered. (Either ignore errors and
just mark the file system erroneous and continue, or remount the file system
read-only, or panic and halt the system.) The default is set in the filesystem
superblock, and can be changed using tune2fs(8).
---- end of quote ----

However, EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
ERRORS_CONT is not saved in sbi->s_mount_opt. This leads to incorrect
error handling on ext3.

We then checked the corresponding code in ext2 and found that it is
buggy as well:
- EXT2_ERRORS_CONTINUE is not read from the superblock (the same problem);
- parse_options() does not clear the alternative values and thus
  something like (ERRORS_CONT|ERRORS_RO) can be set;
- if options are omitted, parse_options() does not set any of these options.
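
A minimal sketch of the invariant these points imply: exactly one
error-behaviour bit should be set at any time. The SKETCH_* constants and
both helpers are hypothetical; the bit values are illustrative and are not
claimed to match the kernel headers.

```c
#include <assert.h>

/* Illustrative flag bits (assumption, not taken from the kernel headers). */
enum {
    SKETCH_ERRORS_CONT  = 0x10,
    SKETCH_ERRORS_RO    = 0x20,
    SKETCH_ERRORS_PANIC = 0x40,
    SKETCH_ERRORS_MASK  = 0x10 | 0x20 | 0x40
};

/* Select one behaviour: clear every alternative first, then set the choice. */
unsigned long sketch_set_errors(unsigned long opts, unsigned long chosen)
{
    assert(chosen == SKETCH_ERRORS_CONT ||
           chosen == SKETCH_ERRORS_RO ||
           chosen == SKETCH_ERRORS_PANIC);
    opts &= ~(unsigned long)SKETCH_ERRORS_MASK;
    return opts | chosen;
}

/* True when exactly one of the three error-behaviour bits is set. */
int sketch_errors_valid(unsigned long opts)
{
    unsigned long e = opts & SKETCH_ERRORS_MASK;

    return e != 0 && (e & (e - 1)) == 0;
}
```

In this sketch the superblock default would be applied first with
sketch_set_errors(), and an explicit errors= mount option would go through
the same helper, so neither the empty set nor a combination such as
CONT|RO can result.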

Therefore it is possible to set any combination of these options on ext2:
- none of them may be set:
 EXT2_ERRORS_CONTINUE on superblock / empty mount options;

Vasily Averin | 4 Oct 2006 09:50

[PATCH 2.6.18] ext3: errors behaviour fix

From: Dmitry Mishin <dim@openvz.org>

EXT3_ERRORS_CONTINUE should be taken from the superblock as the default value for
error behaviour.

Signed-off-by:	Dmitry Mishin <dim@openvz.org>
Acked-by:	Vasily Averin <vvs@sw.ru>
Acked-by:	Kirill Korotaev <dev@openvz.org>

--- linux-2.6.18/fs/ext3/super.c.e3erop	2006-09-20 07:42:06.000000000 +0400
+++ linux-2.6.18/fs/ext3/super.c	2006-10-03 15:51:44.000000000 +0400
@@ -1466,6 +1466,8 @@ static int ext3_fill_super (struct super
 		set_opt(sbi->s_mount_opt, ERRORS_PANIC);
 	else if (le16_to_cpu(sbi->s_es->s_errors) == EXT3_ERRORS_RO)
 		set_opt(sbi->s_mount_opt, ERRORS_RO);
+	else
+		set_opt(sbi->s_mount_opt, ERRORS_CONT);

 	sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vasily Averin | 4 Oct 2006 09:51

[PATCH 2.6.18] ext2: errors behaviour fix

EXT2_ERRORS_CONTINUE should be read from the superblock as the default value for
error behaviour.
parse_options() should clear the alternative options and should not change
the default value taken from the superblock.

Signed-off-by:	Vasily Averin <vvs@sw.ru>
Acked-by:	Kirill Korotaev <dev@openvz.org>

--- linux-2.6.18/fs/ext2/super.c.e2erop	2006-09-20 07:42:06.000000000 +0400
+++ linux-2.6.18/fs/ext2/super.c	2006-10-03 15:50:31.000000000 +0400
@@ -365,7 +365,6 @@ static int parse_options (char * options
 {
 	char * p;
 	substring_t args[MAX_OPT_ARGS];
-	unsigned long kind = EXT2_MOUNT_ERRORS_CONT;
 	int option;

 	if (!options)
@@ -405,13 +404,19 @@ static int parse_options (char * options
 			/* *sb_block = match_int(&args[0]); */
 			break;
 		case Opt_err_panic:
-			kind = EXT2_MOUNT_ERRORS_PANIC;
+			clear_opt (sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt (sbi->s_mount_opt, ERRORS_RO);
+			set_opt (sbi->s_mount_opt, ERRORS_PANIC);
 			break;
 		case Opt_err_ro:
-			kind = EXT2_MOUNT_ERRORS_RO;
+			clear_opt (sbi->s_mount_opt, ERRORS_CONT);

Andreas Dilger | 4 Oct 2006 17:02

Re: ext2/ext3 errors behaviour fixes

On Oct 04, 2006  11:50 +0400, Vasily Averin wrote:
> However, EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
> ERRORS_CONT is not saved in sbi->s_mount_opt. This leads to incorrect
> error handling on ext3.
> 
> The following patches fix this.

{patch is missing}

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Andreas Dilger | 4 Oct 2006 17:07

Re: ext2/ext3 errors behaviour fixes

On Oct 04, 2006  09:02 -0600, Andreas Dilger wrote:
> On Oct 04, 2006  11:50 +0400, Vasily Averin wrote:
> > However, EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
> > ERRORS_CONT is not saved in sbi->s_mount_opt. This leads to incorrect
> > error handling on ext3.
> > 
> > The following patches fix this.
> 
> {patch is missing}

Sorry, nm.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Andreas Dilger | 4 Oct 2006 18:56

Directories > 2GB

For ext4 we are exploring the possibility of directories being larger
than 2GB in size.  For ext3/ext4 the 2GB limit is about 50M files, and
the 2-level htree limit is about 25M files (this is a limit of the kernel
code, not of the disk format).

Amusingly (or not) some users of very large filesystems hit this limit
with their HPC batch jobs because they have 10,000 or 128,000 processes
creating files in a directory on an hourly basis (job restart files,
data dumps for visualization, etc) and it is not always easy to change
the apps.

My question (esp. for XFS folks) is if anyone has looked at this problem
before, and what kind of problems they might have hit in userspace and in
the kernel due to "large" directory sizes (i.e. > 2GB).  It appears at
first glance that 64-bit systems will do OK because off_t is a long
(for telldir output), but that 32-bit systems would need to use O_LARGEFILE
when opening the file in order to be able to read the full directory
contents.  It might also be possible to return -EFBIG only in the case
that telldir is used beyond 2GB (the LFS spec doesn't really talk about
large directories at all).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org

Eric Sandeen | 4 Oct 2006 19:37

different defaults in mke2fs.conf vs. the code itself?

From mke2fs.conf in 1.39:

[defaults]
        base_features = sparse_super,filetype,resize_inode,dir_index
        blocksize = 4096

and yet (for example) in PRS():

        if (blocksize <= 0) {
                profile_get_integer(profile, "defaults", "blocksize", 0,
                                    1024, &use_bsize);
                profile_get_integer(profile, "fs_types", fs_type,
                                    "blocksize", use_bsize, &use_bsiz

So the code itself defaults to 1k blocks, but the shipped config file
defaults to 4k blocks?  We noticed this when mke2fs.conf went missing
from the install environment, and formatting large filesystems took a
-very- long time.

Shouldn't these match?  Or for that matter, why should the shipped
config file be either overriding or re-stating defaults in the code
itself?  Seems a bit strange to me.
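
A minimal sketch of the layering in question, assuming a flat
section/key table in place of the real profile library:
struct sketch_setting, sketch_get_integer() and sketch_pick_blocksize() are
hypothetical, and the fs_types lookup is simplified to a section named
after the filesystem type. Only the 1024 fallback and the 4096 config value
come from the excerpts above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_setting {
    const char *section;
    const char *key;
    int         value;
};

/* Return the configured value, or def_val when the entry (or the whole
 * config, i.e. cfg == NULL with n == 0) is missing. */
int sketch_get_integer(const struct sketch_setting *cfg, size_t n,
                       const char *section, const char *key, int def_val)
{
    assert(section != NULL && key != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cfg[i].section, section) == 0 &&
            strcmp(cfg[i].key, key) == 0)
            return cfg[i].value;
    }
    return def_val;
}

/* Built-in 1024 -> [defaults] override -> per-fs-type override. */
int sketch_pick_blocksize(const struct sketch_setting *cfg, size_t n,
                          const char *fs_type)
{
    int bsize = sketch_get_integer(cfg, n, "defaults", "blocksize", 1024);

    return sketch_get_integer(cfg, n, fs_type, "blocksize", bsize);
}
```

With no config at all, the sketch falls back to 1024, which is the
situation described above when mke2fs.conf went missing.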

-Eric