Steve Listopad | 1 Jan 2005 18:24

Advice for dealing with bad sectors on /

All,

Trying to figure out how to deal with, I assume, a dying disk that's
unfortunately on / (ext3).

Getting errors similar to:

Dec 31 20:44:30 mybox kernel: hdb: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 { UncorrectableError },
LBAsect=163423, high=0, low=163423, sector=163360
Dec 31 20:44:30 mybox kernel: end_request: I/O error, dev 03:41 (hdb), sector
163360
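A note on reading the log above, as a minimal sketch rather than anything authoritative: 2.4-era kernels print the device as hexadecimal major:minor, and on the first IDE channel (major 3) minors 64-127 belong to hdb, so 03:41 should be hdb1. The gap between LBAsect and sector is most likely the start offset of that partition. The helper names below (decode_dev, ide0_name) are made up for illustration.

```python
import re

MINORS_PER_DISK = 64  # on major 3, hda uses minors 0-63 and hdb uses 64-127


def decode_dev(dev):
    """Split a 'MM:mm' hex device string into integer (major, minor)."""
    major_hex, minor_hex = dev.split(":")
    return int(major_hex, 16), int(minor_hex, 16)


def ide0_name(minor):
    """Map a minor number on major 3 to a name such as 'hdb1'."""
    disk = "hda" if minor < MINORS_PER_DISK else "hdb"
    part = minor % MINORS_PER_DISK
    return disk if part == 0 else f"{disk}{part}"


# The error line from the log, joined back into one string.
log = ("hdb: dma_intr: error=0x40 { UncorrectableError }, "
       "LBAsect=163423, high=0, low=163423, sector=163360")
match = re.search(r"LBAsect=(\d+).*?sector=(\d+)", log)
lba, sector = int(match.group(1)), int(match.group(2))
offset = lba - sector  # whole-disk LBA minus partition-relative sector

major, minor = decode_dev("03:41")
print(major, minor, ide0_name(minor), offset)  # prints: 3 65 hdb1 63
```

An offset of 63 sectors is the usual start of the first partition on a DOS-partitioned disk, which fits the error being inside /dev/hdb1.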

When I rebooted, the system threw me into a shell, to get me to "fix" things. 
So, I did an e2fsck -c -v /dev/hdb1 to attempt to fix things.  The badblocks
checking took 20 hours (it's a 200GB disk).  Then I went through the
question/answer session, hoping to get through the problems...

I rebooted after this, and the machine is running, but I'm betting that failure
is close.  I've been trying to use smartctl to see if the bad locations were
actually remapped, but the Current_Pending_Sector count is 227.  I think this
means that these are sectors that are queued to be remapped, but have not been.
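For comparing the pending count against what has actually been remapped, here is a rough sketch that pulls raw values out of `smartctl -A` style output. Reallocated_Sector_Ct is the attribute that counts sectors already remapped. The sample text is made up in the usual column layout: only the 227 comes from this post, the other numbers are placeholders, and smart_raw_values is a hypothetical helper name.

```python
def smart_raw_values(text):
    """Return {attribute_name: raw_value_string} from smartctl -A style text."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values


# Illustrative sample only; not output captured from the disk in question.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       227
"""

values = smart_raw_values(sample)
pending = int(values["Current_Pending_Sector"])
remapped = int(values["Reallocated_Sector_Ct"])
print(f"pending={pending} remapped={remapped}")
```

If the pending count stays high while the reallocated count does not move, the sectors have been flagged but not yet remapped, which matches the reading given above.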

So...  I have a / disk that's flaky, and believe it or not, it's under
warranty, so I can replace it. 

Some questions:

1)  Is there a way to copy data off of this disk, so that I could simply swap

Joseph D. Wagner | 1 Jan 2005 20:28

RE: Advice for dealing with bad sectors on /

> Getting errors similar to:
> 
> Dec 31 20:44:30 mybox kernel: hdb: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 {
> UncorrectableError },
> LBAsect=163423, high=0, low=163423, sector=163360
> Dec 31 20:44:30 mybox kernel: end_request: I/O error, dev 03:41 (hdb),
> sector
> 163360

This may not be the disk; it could also be the controller.  I've seen it go both ways.  Any problems on hda?

Try adding ide=nodma to the kernel parameters.  If the problem goes away, the problem is in the kernel driver
for the controller or motherboard chipset.

> When I rebooted, the system threw me into a shell, to get me to "fix"
> things. So, I did an e2fsck -c -v /dev/hdb1 to attempt to fix things.
> The badblocks checking took 20 hours (it's a 200GB disk).  Then I went
> through the question/answer session, hoping to get through the problems...

A better way to go about this is booting off the rescue CD and doing the e2fsck scan there.  Otherwise, there
could be leftover problems from running the scan off of the partition you are scanning.

> Some questions:

Best advice to all 3 questions: get some sort of disk imaging software.

The disk imaging software may copy the bad sectors (i.e. sectors marked bad now may also be marked bad on the
new drive), but you can force e2fsck to rescan bad sectors.

Larry McVoy | 1 Jan 2005 20:33

Re: Advice for dealing with bad sectors on /

The one thing I'd add to Joseph's good advice is that when I see stuff like
this (which I do, I manage a lot of Linux boxes) I tend to start swapping
things.  Put the drive in a known good system with a known good cable on
the cable by itself and then see if you get errors.  If you don't get
errors in that situation it is likely your drive is fine and you have
some bad hardware elsewhere.

Hardware debugging is basically swapping parts until you find the guilty
party.

On Sat, Jan 01, 2005 at 01:28:39PM -0600, Joseph D. Wagner wrote:
> > Getting errors similar to:
> > 
> > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: status=0x51 { DriveReady
> > SeekComplete Error }
> > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 {
> > UncorrectableError },
> > LBAsect=163423, high=0, low=163423, sector=163360
> > Dec 31 20:44:30 mybox kernel: end_request: I/O error, dev 03:41 (hdb),
> > sector
> > 163360
> 
> This may not be the disk; it could also be the controller.  I've seen it go both ways.  Any problems on hda?
> 
> Try adding ide=nodma to the kernel parameters.  If the problem goes away, the problem is in the kernel
> driver for the controller or motherboard chipset.
> 
> > When I rebooted, the system threw me into a shell, to get me to "fix"
> > things. So, I did an e2fsck -c -v /dev/hdb1 to attempt to fix things.
> > The badblocks checking took 20 hours (it's a 200GB disk).  Then I went

Stephen C. Tweedie | 1 Jan 2005 23:19

Re: ext3 journal on software raid

Hi,

On Fri, 2004-12-31 at 02:49, Andy Smith wrote:

> Are the comments made in this posting from the linux-raid list
> correct?

> http://marc.theaimsgroup.com/?l=linux-raid&m=110444288429682&w=2

No.  :-)  Some of the facts are.  The conclusions are not.

> There is nothing that attempts explicitly to maintain the ordering in
> RAID (talking about mirroring here).

Disks and IO subsystems in general don't preserve IO ordering.  ext3 is
designed not to care.  As long as the raid device tells the truth about
when the data is actually committed to disk (all of the mirror volumes
are up to date) for a given IO, ext3 should be quite happy.

> What's wrong is that the journal will be mirrored (if it's a mirror).
> That means that (1) its data will be written twice, which is a big deal
> since ALL the i/o goes through the journal first

Not true; by default, only metadata goes through the journal, not data.

> and (2) the journal
> is likely to be inconsistent (since it is so active) if you get one of
> those creeping invisible RAID corruptions that can crop up inevitably
> in RAID normal use.


sam | 2 Jan 2005 05:06

numerous errors on newly built md volume

I'm constructing a RAID 5 array from three 160 GB SATA drives.  They're connected to 
a Promise SX4 150 card, using the kernel drivers (not Promise's binary-only 
drivers).

The RAID builds fine, as does creation of the ext3 filesystem.  
However, if I then run e2fsck (with the -f option, since the filesystem 
is brand new), I get thousands of errors like these:

Inode 15461 has imagic flag set.  Clear<y>? no

Inode 15462 is in use, but has dtime set.  Fix<y>? no

Inode 15463 is in use, but has dtime set.  Fix<y>? no

Inode 15464 is in use, but has dtime set.  Fix<y>? no

Inode 15465 is in use, but has dtime set.  Fix<y>? no

Inode 15466 is in use, but has dtime set.  Fix<y>? no

Inode 15466 has imagic flag set.  Clear<y>? no

Inode 15467 is in use, but has dtime set.  Fix<y>? no

Inode 15468 is in use, but has dtime set.  Fix<y>? no

Inode 15331 has compression flag set on filesystem without compression 
support.  Clear<y>? no

Inode 15331, i_size is 15275251971216207939, should be 0.  Fix<y>? no

Steve Listopad | 2 Jan 2005 05:16

Re: Advice for dealing with bad sectors on /

All,

Comments in-line.

--- Larry McVoy <lm <at> bitmover.com> wrote:

> The one thing I'd add to Joseph's good advice is that when I see stuff like
> this (which I do, I manage a lot of Linux boxes) I tend to start swapping
> things.  Put the drive in a known good system with a known good cable on
> the cable by itself and then see if you get errors.  If you don't get
> errors in that situation it is likely your drive is fine and you have
> some bad hardware elsewhere.
> 
> Hardware debugging is basically swapping parts until you find the guilty
> party.

Thanks to both you and Joseph for making me think about things that I simply
wouldn't have (or, at least not without first fixing something that wasn't
broke).  I would have immediately suspected the hard drive, not cables or other
hardware.  But, I guess that comes with experience, so thanks for sharing.  The
first thing I'll do is make sure that the cables are secure, and swapping
cables as a quick test is easy enough to do.  The controller is integrated into
the MB, so that would be more problematical :)

> On Sat, Jan 01, 2005 at 01:28:39PM -0600, Joseph D. Wagner wrote:
> > > Getting errors similar to:
> > > 
> > > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: status=0x51 { DriveReady
> > > SeekComplete Error }
> > > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 {

John Freer | 3 Jan 2005 03:46

Attempting To Recover, fsck infinite looping on me

Hey all,

Had a power failure and subsequent ext3 disk corruptions.  Attempting
to fix, but not working.

It's a 120 gig IDE disk, 3 partitions.  /boot, /, and swap.  

Basically can't boot up since the box can't get to the system files
in /usr/ or anything.

So I'm booting off of a FC2 disk 1 in recovery mode and trying to fix
the filesystem with e2fsck

The boot partition cleaned up fine.

The swap partition came back up when I did "/sbin/mkswap /dev/hda3"
and "/sbin/swapon -a"

I cannot, however, get fsck to run fully on /dev/hda2 (the main part
of the drive).  I run it and it goes through the first few recovery
steps (1A, 1B, 1C), and then it comes to the problem.

It lists a bunch of inodes (like 150) and asks:

"Clone duplicate/bad blocks? (y)"  I get about 6 of these messages
with varying lists of inodes.

I say yes, and fsck continues.  But, after about 6 of these messages,
the cycle of "Clone duplicate/bad blocks?" repeats.  I looked closely
at the listing of inodes, and the same 6 or so groupings of blocks

Jérôme Petazzoni | 3 Jan 2005 09:45

Re: Attempting To Recover, fsck infinite looping on me


> I cannot, however, get fsck to run fully on /dev/hda2 (the main part
> of the drive).  I run it and it goes through the first few recovery
> steps (1A, 1B, 1C), and then it comes to the problem.

I had the same problem with Debian's e2fsck 1.35 (28-Feb-2004) on a 1 
terabyte volume (I left it fscking for a week and it was still looping). I 
could still mount the volume, however, so I mounted it read-only and 
copied all the data onto another disk. Some files were corrupted, of 
course, and some had a wrong size (like 10 GB or even several TB), 
so I built an exclusion list when copying the data. I also had 
checksums of the important files, so afterwards I could find out which were 
corrupted and which were not.
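The copy-with-exclusions approach described above could look roughly like this. It is only a sketch under assumptions: the size cutoff, the choice of SHA-1, and the function names (copy_skipping_oversized, sha1_of) are illustrative and not what was actually used on that volume.

```python
import hashlib
import os
import shutil


def copy_skipping_oversized(src_root, dst_root, max_bytes):
    """Copy a tree, skipping files that are implausibly large or unreadable.

    Returns the list of skipped source paths (the exclusion list).
    """
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                if os.lstat(src).st_size > max_bytes:
                    skipped.append(src)  # bogus size, likely corrupted inode
                    continue
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError:
                skipped.append(src)  # I/O error while reading or writing
    return skipped


def sha1_of(path, chunk=1 << 20):
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()
```

After the copy, comparing sha1_of() for each important file against a checksum recorded before the corruption shows which copies are intact; anything in the returned exclusion list has to be restored from elsewhere.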

I *think* that the problem was, that the volume was full, and fsck 
needed free space to clone blocks or something like that. I couldn't 
free up space on the volume however, since when I tried it sometimes 
caused kernel oopses, sometimes made the volume go read-only, and when 
it didn't, it didn't free a single block (from du's point of view anyway).

I still have an image of the damaged filesystem here (but I think I will 
destroy it soon, since 1 TB isn't that cheap, as you can guess).

Hifumi Hisashi | 4 Jan 2005 12:34

[PATCH] BUG on error handlings in Ext3 under I/O failure condition

	Hello.

	I found bugs in the error handling of functions around the ext3 file
system, which cause synchronous write I/O operations to complete improperly
when disk I/O failures occur.  Both 2.4 and 2.6 have this problem.

	I carried out the following experiment:

1.  Mount an ext3 file system on a SCSI disk in ordered mode.
2.  Open a file on the file system with the O_SYNC|O_RDWR|O_TRUNC|O_CREAT flags.
3.  Write 512 bytes of data to the file by calling write() every 5 seconds, and
     examine the return values from write().
4.  Disconnect the SCSI cable, and examine messages from the kernel.
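Steps 2 and 3 of the experiment can be approximated with a short script. This is a Python rendering for illustration, not the test program used for the patch; the function name sync_write_loop and its parameters are assumptions.

```python
import os
import time


def sync_write_loop(path, count, interval=5.0):
    """Write 512 bytes to an O_SYNC file `count` times, recording each result.

    Each entry is the byte count returned by write(), or a negative errno
    value if the call failed.
    """
    flags = os.O_SYNC | os.O_RDWR | os.O_TRUNC | os.O_CREAT
    fd = os.open(path, flags, 0o644)
    results = []
    try:
        for _ in range(count):
            try:
                results.append(os.write(fd, b"\0" * 512))
            except OSError as exc:
                results.append(-exc.errno)
            time.sleep(interval)
    finally:
        os.close(fd)
    return results
```

Run against a file on the mounted ext3 file system, every entry should be 512 while the disk is healthy. Once the cable is pulled, the entries are expected to turn negative (for example -EIO); the bug reported here is that they stay at 512 for a while.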

	After the SCSI cable is disconnected, write() must fail.  But the
result was different: write() succeeded for a while even though kernel
messages reported SCSI I/O errors.

	Applying the following modifications solved the above problem.
Please consider applying this patch to the mainline kernels.

Signed-off-by: Hisashi Hifumi  <hifumi.hisashi <at> lab.ntt.co.jp>

diff -Nru linux-2.6.10-bk6/fs/buffer.c linux-2.6.10-bk6_fix/fs/buffer.c
--- linux-2.6.10-bk6/fs/buffer.c	2004-12-25 06:34:58.000000000 +0900
+++ linux-2.6.10-bk6_fix/fs/buffer.c	2005-01-04 19:58:48.000000000 +0900
@@ -311,10 +311,10 @@
  {
  	struct inode * inode = dentry->d_inode;

Joseph Wagner | 5 Jan 2005 11:51

Re: Attempting To Recover, fsck infinite looping on me

> I *think* that the problem was, that the volume was full, and fsck
> needed free space to clone blocks or something like that. I couldn't
> free up space on the volume however, since when I tried it sometimes
> caused kernel oopses, sometimes made the volume go read-only, and when
> it didn't, it didn't free a single block (from du's point of view
> anyway).

I'm sorry, but I cannot reproduce such an error in the manner you 
describe; hence, I am unable to create a bug report for it.  Screen output 
from my attempt to reproduce it is below.

I will continue to try several other scenarios before giving up completely, 
but I thought you should know someone is looking into this problem (even 
though I can't yet find it).

If you have more details, like screen output from during the error, feel 
free to contact me or anyone else on the e2fsprogs team.

Joseph D. Wagner
wagnerjd <at> users.sourceforge.net

## Here's the status of the disk before I do anything to it.
## Notice the file system only has 3 free blocks.

# e2fsck -f /dev/hdb11
e2fsck 1.35 (28-Feb-2004)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts