Re: 'smartctl -t long /dev/hdh' killed my Samsung SV1604N
Fredrik Persson <frepe <at> bredband.net>
2004-05-03 20:51:28 GMT
Hello, and thanks for your quick reply.
Short: it came back to life! How? I shut it down in the evening and started it
again about 12 hours later, and there the disk was, alive and kicking. So the
case went like this: I booted the machine, ran the long self-test, got the
errors I described below, and rebooted the machine to see if that got the drive
working. It didn't; it got worse, and the drive didn't exist at all
(no /dev/hdh). I turned it off, waited 12 hours, turned it on, and everything
was back to normal.
Before you dismiss me as a nutcase, please read the comments below. However,
what I'd *really* like to know is this: would '-F samsung' have made any
difference when I ran the long self-test?
On Monday 03 May 2004 17.26, Bruce Allen wrote:
> Hi Fredrik,
>
> On Sun, 2 May 2004, Fredrik Persson wrote:
> > I'm new to this list, but I've browsed the archive for my particular
> > problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
> > that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
> > should've included '-F samsung'?)
> >
> > It completely KILLED the HD!
>
> I'm sorry to hear this. If it's any consolation, the disk would have died
> anyway -- the long self-test was simply the little bit of extra load that
> pushed the disk past its failure point.
>
> Was there any prior sign that the disk was 'in trouble'?
Maybe. This is what I get from 'smartctl -a -F samsung /dev/hdh': (sorry about
the linebreaks, I hope it's still readable.)
----------------------------------------------
smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG SV1604N
Serial Number: S01FJ10X102037
Firmware Version: TR100-24
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Mon May 3 22:32:06 2004 CEST
==> WARNING: Contact developers; may need -F samsung enabled.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Off-line data collection status: (0x00) Offline data collection activity was
never started.
Auto Off-line Data Collection:
Disabled.
Self-test execution status: ( 39) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete off-line
data collection: (7200) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Automatic timer ON/OFF support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 120) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   073   070   000    Pre-fail  Always       -       4864
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       171
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       123448
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       101
194 Temperature_Celsius     0x0022   169   115   000    Old_age   Always       -       23
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       11375294
196 Reallocated_Event_Count 0x0012   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   010    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       1
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0
SMART Error Log Version: 1
Warning: ATA error count 1 inconsistent with error log pointer 5
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.
Error 1 occurred at disk power-on lifetime: 0 hours
When the command that caused the error occurred, the device was active or
idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 01 00 00 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name
-- -- -- -- -- -- -- -- --------- --------------------
b1 c0 00 01 00 00 a0 00 1663959.040 DEVICE CONFIGURATION RESTORE
ec 00 03 01 00 00 a0 00 1663959.040 IDENTIFY DEVICE
91 00 3f 01 00 00 af 00 1663959.040 INITIALIZE DEVICE PARAMETERS [OBS-6]
10 00 00 01 00 00 a0 00 1663959.040 RECALIBRATE [OBS-4]
ec 00 01 01 00 00 a0 00 623771.648 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
No self-tests have been logged
----------------------------------------------
I think there are a few interesting things to note here:
1. The self-test execution status. It says it was interrupted by the host with a
hard or soft reset after 39 minutes, which sounds correct according to what I
saw when it happened. So the disk acknowledges that something went wrong; the
question is what?
2. There's a SMART attribute called "Hardware_ECC_Recovered", with the value
11375294. I'm not sure what this means, but ECC should be some kind of error
correction, and the value is high.
3. The "UDMA_CRC_Error_Count" is 1. Could this have happened during the failed
self-test, or even be the cause of it? If so, what could have triggered this
error?
4. There is one error in the log, which seems to have occurred the first time
the disk was powered up.
Apart from this, I cannot see anything that could've caused this error.
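A note on item 1, with the caveat that this is my reading of the ATA SMART
data layout and not something smartctl printed: the number in parentheses may
be the raw status byte rather than a count of minutes. Under that assumption
the high four bits are the result code and the low four bits are the part of
the test still remaining, in steps of 10%. A minimal Python sketch (the
function name and the abbreviated code table are my own, for illustration):

```python
# Sketch: split a SMART self-test execution status byte into its two fields.
# Assumption: high nibble = result code, low nibble = tenths of the test remaining.
RESULT_CODES = {
    0: "completed without error, or never run",
    1: "aborted by the host",
    2: "interrupted by the host with a hard or soft reset",
    15: "self-test in progress",
}


def decode_self_test_status(status_byte):
    """Return (result code, description, percent of test remaining)."""
    code = status_byte >> 4
    percent_remaining = (status_byte & 0x0F) * 10
    description = RESULT_CODES.get(code, "other (see the ATA spec)")
    return code, description, percent_remaining


code, description, remaining = decode_self_test_status(39)
print(code, description, remaining)
```

For the value 39 this gives code 2 with 70% of the test remaining. With the
120-minute polling time reported above, that would put the interruption
roughly 36 minutes into the test, which is close to what I saw.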
> The long self-test read scans the entire disk surface. If the disk has an
> electronic or mechanical problem, then this extended read scan can provoke
> failure. (This type of failure is also commonly seen when people back up
> disks. Because the load of reading all the data from the disk is a heavy
> one, it often leads to catastrophic failure in the middle of the backup.
> This is why you should always have a PAIR of backups, and overwrite the
> older of the two, but preserve the newer of the two.)
>
> Before you give up on the disk, double check the power and signal cabling
> to be sure that nothing has worked loose. Additional comments below.
Power and signal cabling are untouched, and the disk is working again. I
didn't even open the machine.
> > After about an hour, this started to turn up when doing 'dmesg':
> >
> > May 2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
> > May 2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:31 rostig kernel: hdh: drive not ready for command
> > May 2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:32 rostig kernel: hdh: drive not ready for command
> > May 2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:33 rostig kernel: hdh: drive not ready for command
>
> The drive simply stopped responding to commands.
>
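For reference, the status=0xd0 in those kernel lines can be unpacked bit by
bit. The sketch below assumes the standard ATA status register bit layout; the
function name is made up for illustration. As far as I understand the ATA
spec, the other bits are not meaningful while BSY is set, which is presumably
why the kernel prints only { Busy }.

```python
# Sketch: name the bits set in an ATA status register value.
# Assumption: standard ATA bit layout, bit 7 (BSY) down to bit 0 (ERR).
STATUS_BITS = {
    0x80: "BSY",   # device is busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error
}


def decode_ata_status(value):
    """Return the names of all status bits set in value, highest bit first."""
    return [name for mask, name in STATUS_BITS.items() if value & mask]


print(decode_ata_status(0xD0))
```

The same function applied to the ST value 51 (hex) from the SMART error log
gives DRDY, DSC and ERR.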
> > Not good. I've also configured SMART to send me emails. I received four
> > of those, within a four-second period starting at 13:23:30.
> >
> > First:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, not capable of SMART self-check
> >
> > Second:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, failed to read SMART Attribute Data
> >
> > Third:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Error Log Failed
> >
> > Fourth:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Self Test Log Failed
>
> These four messages are because the disk wasn't reachable any more.
>
> > After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to
> > do SMART-communication. I then rebooted the machine. Now, the drive won't
> > even show up. 'dmesg' shows this:
> >
> > hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
> > hdc: SAMSUNG SV1204H, ATA DISK drive
> > hde: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdf: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdg: Maxtor 6Y120L0, ATA DISK drive
> >
> > No hdh anywhere.
>
> As I said, double check the power and signal cabling. But they are
> probably OK -- this looks like a straightforward electronic (not
> mechanical) drive failure.
Cabling untouched, and the disk works again as it has for months.
I'm curious; does this happen often? I mean, where the disk gets an error like
this and then works again after 12 hours switched off?
> > Disaster. What can possibly have happened here? The HD was fairly new
> > (just a few months old) and has NOT been running 24/7 or anything like
> > that although it's been running for 5-8 hours every day.
>
> Really there are just three possibilities. (1) The additional load of a
> self-test provoked catastrophic failure (would have happened anyway, when
> the disk was under load in the future) (2) sudden electrical failure
> unrelated to self-test (eg, voltage spike killed a chip in the disk) or
> (3) cabling problems (do double check to eliminate this possibility).
I did run self-tests on three other disks simultaneously, and they finished
fine. A cabling problem is not very probable, and voltage spikes are extremely
rare here. (Sweden)
> > Any help or hints about this problem would be greatly appreciated,
>
> If the disk has failed (and it's just a few months old) it should still be
> under warranty. Hopefully you can re-create the data that was on it.
The disk is alive so I can take a backup now. However, won't I have a
difficult time claiming warranty since it is fully functional now? Would you
have tried to get a new disk if you were in my shoes?
>
> Cheers,
> Bruce
>
Bruce, thank you very much for this very extensive reply!
Best Regards
Fredrik Persson