Michael Nishimoto | 1 Aug 02:49 2007

Re: RFC: log record CRC validation

David Chinner wrote:
> On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
>  > The log checksum code has not been used since the
>  > development phase of xfs.  It did work at one point; I remember
>  > using it and then decided to disable it and use just
>  > the current cycle stamping technique.  The checksum code was
>  > just advisory, so I could see if it ever occurred during
>  > development.
>  >
>  > When a CRC error is found, your suggestion is correct.  Recovery
>  > should backup and process only completely good log records.  The code
>  > backs up in this same fashion when it encounters a region of
>  > missing sector updates because of the async nature of log
>  > writes and disk caches.
> 
> Yes, but the hole usually exists only in the last 8 log-buffers'
> worth of the log (i.e. 256k by default). However, if the tail block
> has a CRC error, we've got to throw away the entire log and that,
> like zeroing a dirty log from xfs_repair, generally results in a
> corrupted filesystem image.
> 
> An example of where this could be a problem is reusing a just-freed
> extent. Before reusing it we force the log to get the transaction on
> disk and rely on log replay to ensure that the block is freed in the
> event of a crash. We then go and write over the contents of the
> block. If the log transaction that freed the extent is not replayed,
> then we've overwritten the previous contents of that extent and so
> the "current" contents of the extent after log replay are wrong.
> 
> IOWs, I think that if we come across a bad CRC in a log record we
> can replay up to that point but we've still got to abort the mount
> and ask the user to run repair....

William J. Earl | 1 Aug 03:32 2007

Re: RFC: log record CRC validation

David Chinner wrote:
> On Fri, Jul 27, 2007 at 07:00:32PM -0700, William J. Earl wrote:
>   
>> David Chinner wrote:
>>     
>>> ...
>>> The size of high-end filesystems is now at the same order of
>>> magnitude as the bit error rate of the storage hardware. e.g. 1PB =
>>> 10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
>>> bits.  For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
>>> SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
>>>       
>>       First, note that the claimed bit error rates are rates of 
>> reported bad blocks, not rates of silent data corruption.   The latter, 
>> while not quoted, are far lower.
>>     
>
> Ok, fair enough, but in the absence of hard numbers, and given that
> real world MTBF numbers are lower than what manufacturers quote, I'm
> always going to assume that this is the right ballpark.
>
>   
        The real world MTBF numbers are worse for people who use drives
outside their specified parameters (as at Google) and better, at least
in the first few years, for drives used inside their parameters, as far
as I have seen at Agami and my previous company.  Drive MTBF does not,
however, relate to data corruption, except in the case of a failed RAID
reconstruction, and even then the error is reported.
> ...

David Chinner | 1 Aug 04:24 2007

Re: RFC: log record CRC validation

On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
> David Chinner wrote:
> >On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> > > When a CRC error is found, your suggestion is correct.  Recovery
> > > should backup and process only completely good log records.  The code
> > > backs up in this same fashion when it encounters a region of
> > > missing sector updates because of the async nature of log
> > > writes and disk caches.
......
> >IOWs, I think that if we come across a bad CRC in a log record we
> >can replay up to that point but we've still got to abort the mount
> >and ask the user to run repair....
.....
> You are correct.  I didn't need the example -- just a reminder that
> CRC errors can occur anywhere, not just in the last 8 log-buffers
> worth of data. ;-)
> 
> And yes, repair is necessary now because we are no longer replaying
> a transaction which we thought was committed.

Ok, I'll make the mount replay up to the previous good record and then
abort with -EUCLEAN.
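
To make that behaviour concrete, here is a minimal user-space sketch
(this is not the actual xlog recovery code; the record layout and the
helper names are invented for illustration):

/*
 * Sketch only: walk the log records oldest-first, verify each
 * record's CRC, replay everything up to the first bad record, then
 * fail with -EUCLEAN so the admin knows to run xfs_repair.
 */
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct log_record {
	uint32_t		crc;	/* CRC32 over the payload */
	uint32_t		len;	/* payload length in bytes */
	const unsigned char	*payload;
};

/* Plain bitwise CRC32 (IEEE polynomial); real code would use an
 * optimised implementation. */
static uint32_t crc32_calc(const unsigned char *p, size_t len)
{
	uint32_t crc = ~0u;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return ~crc;
}

static void replay_record(const struct log_record *rec)
{
	printf("replaying %" PRIu32 " byte record\n", rec->len);
}

/* Replay records in order; stop dead at the first CRC mismatch. */
static int replay_log(const struct log_record *recs, int nrecs)
{
	for (int i = 0; i < nrecs; i++) {
		if (crc32_calc(recs[i].payload, recs[i].len) != recs[i].crc)
			return -EUCLEAN;  /* partial replay; abort the mount */
		replay_record(&recs[i]);
	}
	return 0;
}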

> I have a request.  Can this be made a mkfs.xfs option, so it can be
> disabled?

It's a definite possibility, because....

> What are your plans for adding CRCs to other metadata objects?

Other objects will require some on-disk format change, and that will
require a feature bit to be set.

Basically, we can add CRCs to individual inodes without a version bump
as we have some empty space in the v2 inode core. That will make v2
inodes the default, but we already have a superblock bit for that and
all versions of linux support v2 inodes so there is no issue there.
This will require userspace tool changes, though, because tools like
repair will revert v2 inodes back to v1 format if they are not using
project ids or the link count fits into 16 bits.....
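
To picture the "empty space in the v2 inode core" idea, a purely
hypothetical fragment (this is not the real xfs_dinode_core layout;
field names and placement are illustrative):

#include <stdint.h>

/* Sketch: four bytes of existing padding become a CRC, so the inode
 * can be checksummed without a version bump or a layout change. */
struct dinode_core_sketch {
	/* ... existing v2 fields ... */
	uint16_t	di_projid;	/* owner's project id */
	uint8_t		di_pad[4];	/* remaining unused padding */
	uint32_t	di_crc;		/* NEW: CRC32 of the inode core */
	/* ... rest of the existing core ... */
};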

Barry Naujok | 1 Aug 04:36 2007

Re: RFC: log record CRC validation

On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote:

> On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
>
>> What are your plans for adding CRCs to other metadata objects?
>
> Other objects will require some on-disk format change, and that will
> require a feature bit to be set.
>
> Basically, we can add CRCs to individual inodes without a version bump
> as we have some empty space in the v2 inode core. That will make v2
> inodes the default, but we already have a superblock bit for that and
> all versions of linux support v2 inodes so there is no issue there.
> This will require userspace tool changes, though, because tools like
> repair will revert v2 inodes back to v1 format if they are not using
> project ids or the link count fits into 16 bits.....

No, repair does not revert v2 inodes back to v1. Currently, inodes
are created as v1 in the kernel and moved to v2 as required.
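
In sketch form, the promotion rule is roughly this (a hypothetical
helper based on the constraints described above, not the actual
kernel code):

#include <stdint.h>

/* v1 inode cores only have a 16-bit link count and no project id,
 * so an inode needs the v2 core once either field outgrows v1. */
static int needs_v2_inode(uint32_t nlink, uint16_t projid)
{
	return nlink > 65535 || projid != 0;
}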

Barry.

David Chinner | 1 Aug 04:43 2007

Re: RFC: log record CRC validation

On Wed, Aug 01, 2007 at 12:36:21PM +1000, Barry Naujok wrote:
> On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote:
> 
> >On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
> >
> >>What are your plans for adding CRCs to other metadata objects?
> >
> >Other objects will require some on-disk format change, and that will
> >require a feature bit to be set.
> >
> >Basically, we can add CRCs to individual inodes without a version bump
> >as we have some empty space in the v2 inode core. That will make v2
> >inodes the default, but we already have a superblock bit for that and
> >all versions of linux support v2 inodes so there is no issue there.
> >This will require userspace tool changes, though, because tools like
> >repair will revert v2 inodes back to v1 format if they are not using
> >project ids or the link count fits into 16 bits.....
> 
> No, repair does not revert v2 inodes back to v1. Currently, inodes
> are created as v1 in the kernel and moved to v2 as required.

Ah, yes, you're right - it's too easy to make flow mistakes in
1200-line functions....

Cheers,

Dave.

David Chinner | 1 Aug 07:41 2007

Re: [PATCH] stop using uio in the readlink code

On Sun, Jul 29, 2007 at 11:13:54PM +0200, Christoph Hellwig wrote:
> Simplify the readlink code to get rid of the last user of uio.
> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
.....
> --- linux-2.6-xfs.orig/fs/xfs/linux-2.6/xfs_ioctl.c	2007-07-14 16:07:19.000000000 +0200
> +++ linux-2.6-xfs/fs/xfs/linux-2.6/xfs_ioctl.c	2007-07-29 14:28:29.000000000 +0200
> @@ -349,19 +349,44 @@ xfs_open_by_handle(
>  	return new_fd;
>  }
>  
> +/*
> + * This is a copy of fs/namei.c:vfs_readlink(), except for removing its
> + * unused first argument.
> + */
> +STATIC int
> +do_readlink(
> +	char __user		*buffer,
> +	int			buflen,
> +	const char		*link)
> +{
> +	int len;
> +
> +	len = PTR_ERR(link);
> +	if (IS_ERR(link))
> +		goto out;
> +
> +	len = strlen(link);
> +	if (len > (unsigned) buflen)
> +		len = buflen;
> +	if (copy_to_user(buffer, link, len))
> +		len = -EFAULT;
> +out:
> +	return len;
> +}
.....

David Chinner | 1 Aug 12:02 2007

Re: RFC: log record CRC validation

On Tue, Jul 31, 2007 at 06:32:59PM -0700, William J. Earl wrote:
> David Chinner wrote:
> >IMO, continuing down this same "the block device is perfect" path is
> >a "head in the sand" approach.  By ignoring the fact that errors can
> >and do occur, we're screwing ourselves when something does actually
> >go wrong, because we haven't put the mechanisms in place to detect
> >errors, having assumed they would never happen.
> >
> >We've spent 15 years so far trying to work out what has gone wrong
> >in XFS by adding more and more reactive debug into the code without
> >an eye to a robust solution. We add a chunk of code here to detect
> >that problem, a chunk of code there to detect this problem, and so
> >on. It's just not good enough anymore.
> >
> >Like good security, filesystem integrity is not provided by a single
> >mechanism. "Defense in depth" is what we are aiming to provide here
> >and to do that you have to assume that errors can propagate through
> >every interface into the filesystem.
> >
> >
> >  
>          I understand your argument, but why not simply strengthen the
> block layer, even if you do it with an optional XFS-based checksum
> scheme on all blocks?

You could probably do this using dm-crypt and an integrity hash.

Unfortunately, I don't trust dm-crypt as far as I can throw it
because we've been severely bitten by bugs in dm-crypt. Specifically,
the bug that existed from 2.6.14 through to 2.6.19 where it reported
.....
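
To picture the sort of optional "checksum on all blocks" layer being
suggested, here is a toy user-space sketch (every name is invented and
it assumes nothing about dm or XFS internals):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ	4096U
#define SLOTSZ	(BLKSZ + 4U)	/* payload plus a trailing CRC32 */

/* Same plain bitwise CRC32 as the earlier recovery sketch. */
static uint32_t crc32_calc(const unsigned char *p, size_t len)
{
	uint32_t crc = ~0u;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return ~crc;
}

/* Write one logical block with its checksum appended. */
static int blk_write(int fd, uint64_t blkno, const unsigned char *buf)
{
	unsigned char slot[SLOTSZ];
	uint32_t crc = crc32_calc(buf, BLKSZ);

	memcpy(slot, buf, BLKSZ);
	memcpy(slot + BLKSZ, &crc, sizeof(crc));
	if (pwrite(fd, slot, SLOTSZ, (off_t)(blkno * SLOTSZ)) != SLOTSZ)
		return -EIO;
	return 0;
}

/* Read one logical block, verifying the checksum before returning it. */
static int blk_read(int fd, uint64_t blkno, unsigned char *buf)
{
	unsigned char slot[SLOTSZ];
	uint32_t crc;

	if (pread(fd, slot, SLOTSZ, (off_t)(blkno * SLOTSZ)) != SLOTSZ)
		return -EIO;
	memcpy(&crc, slot + BLKSZ, sizeof(crc));
	if (crc32_calc(slot, BLKSZ) != crc)
		return -EUCLEAN;	/* silent corruption made loud */
	memcpy(buf, slot, BLKSZ);
	return 0;
}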

Andi Kleen | 1 Aug 14:11 2007

Re: RFC: log record CRC validation

David Chinner <dgc@sgi.com> writes:

> I haven't looked in great depth at other structures in terms of
> implementation details. I know that if we use a 16 bit CRC on
> directories we can get away without an on-disk format change as the
> xfs_da_blkinfo structure has 16 bits of padding. However, given that
> directory block size can reach 64k, a CRC16 check is really only
> capable of single bit error detection. Hence I think we really need
> CRC32 here which means an on-disk format change.

When the directory format is changed it would be nice to also support
DT_* types at the same time. They can speed up some operations nicely
because file system walkers can avoid a stat() (and seek to the inode)
just to find out if a name is a directory or not. Right now there is
no space for this unfortunately.
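
As a user-space illustration of the stat() avoidance (a sketch using
the standard dirent d_type extension; nothing XFS-specific here):

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Classify an entry from readdir() alone when the filesystem can
 * provide d_type; only fall back to stat() on DT_UNKNOWN. */
static int is_dir(const char *path, const struct dirent *de)
{
	struct stat st;

	if (de->d_type != DT_UNKNOWN)	/* fast path: no extra I/O */
		return de->d_type == DT_DIR;
	if (stat(path, &st) != 0)	/* slow path: seek to the inode */
		return 0;
	return S_ISDIR(st.st_mode);
}

int main(int argc, char **argv)
{
	const char *dirpath = argc > 1 ? argv[1] : ".";
	DIR *d = opendir(dirpath);
	struct dirent *de;
	char path[4096];

	if (!d)
		return 1;
	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		printf("%-5s %s\n", is_dir(path, de) ? "dir" : "other",
		       de->d_name);
	}
	closedir(d);
	return 0;
}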

-Andi

David Chinner | 1 Aug 14:17 2007

Re: Proper method of snapshotting XFS with external log using LVM2

On Mon, Jul 30, 2007 at 07:32:18AM +1200, Mario Becroft wrote:
> I am using XFS and LVM2 under Linux. My XFS filesystems have an external
> log.
> 
> What is the correct method of creating an LVM2 snapshot of volumes
> containing an XFS filesystem with external log?

Not sure. I've never tried it.

> Do I need to snapshot the main filesystem volume and the log volume?

Yes.

> Should I be using xfs_freeze to ensure that the filesystem is not
> modified between when I create the filesystem snapshot and the log
> snapshot?

Yes.

> Or is there a special procedure for atomically creating both snapshots?
>
> Or should I not be using an external log? If not, won't this decrease
> performance?

If it doesn't work, then don't use an external log and yes, it will
decrease performance.

> In the past I have successfully used xfs_freeze and created separate
> snapshots of the filesystem and log volumes. However, I am not sure
> whether this is the recommended approach.
.....

Mario Becroft | 1 Aug 14:23 2007

Re: Proper method of snapshotting XFS with external log using LVM2

Hi David,

Thanks very much for your clear explanation of how to snapshot XFS
filesystems with external log. It is good to know that I am doing it
right.

I found that with the latest Linux kernel, version 2.6.22.1, you cannot
use the method I have been using and which you confirmed is OK. The
problem is that if you xfs_freeze the filesystem, the subsequent LVM
snapshot command hangs forever. (Back in kernel 2.6.16 this did not
happen.) I guess the snapshot command is itself attempting to freeze
the filesystem, which doesn't work when it is already frozen. I suppose
how to fix this is a question for the linux-lvm mailing list.

Thanks again for your detailed answer.

-- 
Mario Becroft <mb@gem.win.co.nz>

