american eagle | 2 Dec 2008 18:20

help needed

Dear readers,

First, I want to thank you for this website because it is user-interactive. My problem is as follows: I have an 80 GB Toshiba external hard disk. It has bad sectors, and the situation seems serious because whenever any operation, such as a scan, comes near them, the drive starts making a lot of noise and gets stuck. I don't know what to do, but I am thinking of splitting it into two partitions that avoid the bad sectors. I think that will take the bad area out of use (am I right?). So I want a program that locates the bad sectors so that I can avoid them in gparted. I tried to test the disk with smartmontools, but every test gives me the following result: Short offline self test failed [unsupported scsi opcode]. I need all the help that you can provide.

Thank you very much.


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Smartmontools-support mailing list
Smartmontools-support <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/smartmontools-support
Bruce Allen | 2 Dec 2008 23:30

Re: smartd/smartctl offlining drives

Hi Richard,

I don't know what change might be responsible for this.  I have not seen 
other reports of similar behavior.

Sorry that I can't be of more help.

Cheers,
 	Bruce

On Mon, 1 Dec 2008, Richard Scobie wrote:

> Bruce,
>
> Just re-visiting this - see:
>
> http://marc.info/?l=linux-scsi&m=122492169723132&w=2
>
> After finding that the CVS version did not cause trouble during bootup, I 
> decided to revert to smartmontools-5.38-2.fc9.x86_64 when the machine was put 
> into production, on the basis that I had seen no problems with it as long as 
> smartd was started after bootup and I was unsure whether there were new 
> features in the cvs version that might possibly bite.
>
> Anyway, for the last month it has performed fine, running smartd with weekly 
> full offline checking and quite regular checks with smartctl -a -d sat 
> /dev/sdx, until Saturday when this smartctl command was run and the drive was 
> again offlined by the SAS controller.
>
> I suspect that there are certain conditions where the controller does 
> not like having SMART commands fed to it, but I was wondering if you knew of a 
> definite change made in the CVS version that may have fixed this, in light of 
> my previous testing? Maybe I was just lucky or some timing issue has changed.
>
> I ask, as this causes so much disruption to the array that I am reluctant to 
> use smartmontools at all now unless something definite has changed.
>
> Unfortunately I no longer have a system available for testing.
>
> Regards,
>
> Richard
>
> Bruce Allen wrote:
>>  OK, so to summarize:
>>
>>  smartmontools-5.38-2.fc9.x86_64: does not work correctly
>>  CVS HEAD: works correctly
>>
>>  Is that right?
>> 
>
> That is correct - smartmontools-5.38-2.fc9.x86_64 faults consistently
> and CVS has not, over 4 reboots.
>
> Regards,
>
> Richard
>
>

Richard Hartmann | 3 Dec 2008 15:56

Problems with two broken disks

Hi all,

I have two disks which used to be in a RAID which managed to die at the
same time. I will paste the relevant dd & smartctl output below.
Does anyone have any ideas how I could get more data off those disks? Is
professional help our only chance? A huge thanks in advance!

If you need any other information, please do not hesitate to contact me.

Richard

Disk 3LJ33MQ7 :

root <at> grml ~ # dd if=/dev/sda of=/mnt/sdc1/3LJ33MQ7.img bs=64k
dd: reading `/dev/sda': Input/output error
49164+1 records in
49164+1 records out
3222016000 bytes (3.2 GB) copied, 210.116 s, 15.3 MB/s
dd if=/dev/sda of=/mnt/sdc1/3LJ33MQ7.img bs=64k  0.06s user 9.42s system 4% cpu 3:30.12 total
1 root <at> grml ~ #smartctl -a /dev/sda
 smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
 Home page is http://smartmontools.sourceforge.net/

 === START OF INFORMATION SECTION ===

 Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
 Device Model: ST3200822AS
 Serial Number: 3LJ33MQ7
 Firmware Version: 3.01
 User Capacity: 200,049,647,616 bytes
 Device is: In smartctl database [for details use: -P show]
 ATA Version is: 6
 ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
 Local Time is: Wed Dec 3 15:19:28 2008 UTC
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===

 SMART overall-health self-assessment test result: PASSED

 General SMART Values:
 Offline data collection status:  (0x82) Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
 Self-test execution status:      (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
 Total time to complete Offline
 data collection:                 ( 430) seconds.
 Offline data collection
 capabilities:                    (0x5b) SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         No Conveyance Self-test supported.
                                         Selective Self-test supported.
 SMART capabilities:            (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
 Error logging capability:        (0x01) Error logging supported.
                                         No General Purpose Logging support.
 Short self-test routine
 recommended polling time:        (   1) minutes.
 Extended self-test routine
 recommended polling time:        ( 111) minutes.

 SMART Attributes Data Structure revision number: 10
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   052   049   006    Pre-fail  Always       -       168888113
   3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
   5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       40
   7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       22034494182
   9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       31926
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       56
 194 Temperature_Celsius     0x0022   040   053   000    Old_age   Always       -       40
 195 Hardware_ECC_Recovered  0x001a   052   049   000    Old_age   Always       -       168888113
 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       7
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       7
 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
 202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

 SMART Error Log Version: 1
 ATA Error Count: 13 (device log contains only the most recent five errors)
 CR = Command Register [HEX]
 FR = Features Register [HEX]
 SC = Sector Count Register [HEX]
 SN = Sector Number Register [HEX]
 CL = Cylinder Low Register [HEX]
 CH = Cylinder High Register [HEX]
 DH = Device/Head Register [HEX]
 DC = Device Command Register [HEX]
 ER = Error register [HEX]
 ST = Status register [HEX]
 Powered_Up_Time is measured from power on, and printed as
 DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
 SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 Error 13 occurred at disk power-on lifetime: 31924 hours (1330 days + 4 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 0f 0f 06 60 e0 Error: UNC 15 sectors at LBA = 0x0060060f = 6293007

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 25 00 00 00 05 60 e0 00 00:18:03.931 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:18:03.926 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:59.952 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:59.947 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:56.048 READ DMA EXT

 Error 12 occurred at disk power-on lifetime: 31924 hours (1330 days + 4 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 0f 0f 06 60 e0 Error: UNC 15 sectors at LBA = 0x0060060f = 6293007

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 25 00 00 00 05 60 e0 00 00:18:03.931 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:18:03.926 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:59.952 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:59.947 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:56.048 READ DMA EXT

 Error 11 occurred at disk power-on lifetime: 31924 hours (1330 days + 4 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 0f 0f 06 60 e0 Error: UNC 15 sectors at LBA = 0x0060060f = 6293007

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 25 00 00 00 05 60 e0 00 00:17:47.917 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:47.911 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:59.952 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:59.947 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:56.048 READ DMA EXT

 Error 10 occurred at disk power-on lifetime: 31924 hours (1330 days + 4 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 0f 0f 06 60 e0 Error: UNC 15 sectors at LBA = 0x0060060f = 6293007

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 25 00 00 00 05 60 e0 00 00:17:47.917 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:47.911 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:47.906 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:47.899 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:56.048 READ DMA EXT

 Error 9 occurred at disk power-on lifetime: 31924 hours (1330 days + 4 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 0f 0f 06 60 e0 Error: UNC 15 sectors at LBA = 0x0060060f = 6293007

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 25 00 00 00 05 60 e0 00 00:17:47.917 READ DMA EXT
 ec 00 0f 0f 06 60 a0 00 00:17:47.911 IDENTIFY DEVICE
 25 00 00 00 05 60 e0 00 00:17:47.906 READ DMA EXT
 25 00 00 00 03 60 e0 00 00:17:47.899 READ DMA EXT
 25 00 00 00 01 60 e0 00 00:17:45.030 READ DMA EXT

 SMART Self-test log structure revision number 1
 No self-tests have been logged. [To run self-tests, use: smartctl -t]

 SMART Selective self-test log data structure revision number 1
 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
 1 0 0 Not_testing
 2 0 0 Not_testing
 3 0 0 Not_testing
 4 0 0 Not_testing
 5 0 0 Not_testing
 Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
 If Selective self-test is pending on power-up, resume after 0 minute delay.

 64 root <at> grml ~ #
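
For anyone reading the error log above: the LBA that smartctl prints after each register dump can be recomputed by hand from the SN, CL, CH and DH columns. The sketch below is illustrative Python and not part of smartmontools. The function name `lba28_from_registers` is made up for this example, and it assumes the classic 28-bit layout in which the low four bits of DH carry LBA bits 24-27.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from ATA task-file register values.

    sn, cl and ch carry LBA bits 0-7, 8-15 and 16-23; the low nibble of
    dh carries bits 24-27 (its high nibble holds mode/device flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from "Error 13" above: SN=0f CL=06 CH=60 DH=e0
lba = lba28_from_registers(0x0F, 0x06, 0x60, 0xE0)
print(hex(lba), lba)  # 0x60060f 6293007
```

The result matches the "LBA = 0x0060060f = 6293007" value shown in the log.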

====================
====================
====================

Second disk, 3LJ2Y6CG :

root <at> grml ~ # dd if=/dev/sdb of=/mnt/sdd1/3LJ2Y6CG.img bs=64k
dd: reading `/dev/sdb': Input/output error
208847+1 records in
208847+1 records out
13687058432 bytes (14 GB) copied, 487.497 s, 28.1 MB/s
dd if=/dev/sdb of=/mnt/sdd1/3LJ2Y6CG.img bs=64k  0.27s user 38.35s system 7% cpu 8:07.52 total
1 root <at> grml ~ # smartctl -a /dev/sdb
 smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
 Home page is http://smartmontools.sourceforge.net/

 === START OF INFORMATION SECTION ===

 Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
 Device Model: ST3200822AS
 Serial Number: 3LJ2Y6CG
 Firmware Version: 3.01
 User Capacity: 200,049,647,616 bytes
 Device is: In smartctl database [for details use: -P show]
 ATA Version is: 6
 ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
 Local Time is: Wed Dec 3 15:19:49 2008 UTC
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===

 SMART overall-health self-assessment test result: PASSED

 General SMART Values:
 Offline data collection status:  (0x82) Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
 Self-test execution status:      (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
 Total time to complete Offline
 data collection:                 ( 430) seconds.
 Offline data collection
 capabilities:                    (0x5b) SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         No Conveyance Self-test supported.
                                         Selective Self-test supported.
 SMART capabilities:            (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
 Error logging capability:        (0x01) Error logging supported.
                                         No General Purpose Logging support.
 Short self-test routine
 recommended polling time:        (   1) minutes.
 Extended self-test routine
 recommended polling time:        ( 111) minutes.

 SMART Attributes Data Structure revision number: 10
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   051   048   006    Pre-fail  Always       -       32469081
   3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       49
   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       11
   7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       531666625
   9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       31923
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       57
 194 Temperature_Celsius     0x0022   044   053   000    Old_age   Always       -       44
 195 Hardware_ECC_Recovered  0x001a   051   048   000    Old_age   Always       -       32469081
 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
 202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

 SMART Error Log Version: 1
 ATA Error Count: 6 (device log contains only the most recent five errors)
 CR = Command Register [HEX]
 FR = Features Register [HEX]
 SC = Sector Count Register [HEX]
 SN = Sector Number Register [HEX]
 CL = Cylinder Low Register [HEX]
 CH = Cylinder High Register [HEX]
 DH = Device/Head Register [HEX]
 DC = Device Command Register [HEX]
 ER = Error register [HEX]
 ST = Status register [HEX]
 Powered_Up_Time is measured from power on, and printed as
 DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
 SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 Error 6 occurred at disk power-on lifetime: 31920 hours (1330 days + 0 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 88 f8 e7 97 e1 Error: UNC 136 sectors at LBA = 0x0197e7f8 = 26732536

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 90 70 e7 97 e1 00 01:32:13.111 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:13.106 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.328 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:09.327 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.325 READ DMA

 Error 5 occurred at disk power-on lifetime: 31920 hours (1330 days + 0 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 88 f8 e7 97 e1 Error: UNC 136 sectors at LBA = 0x0197e7f8 = 26732536

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 90 70 e7 97 e1 00 01:32:13.111 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:13.106 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.328 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:09.327 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.325 READ DMA

 Error 4 occurred at disk power-on lifetime: 31920 hours (1330 days + 0 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 88 f8 e7 97 e1 Error: UNC 136 sectors at LBA = 0x0197e7f8 = 26732536

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 90 70 e7 97 e1 00 01:32:13.111 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:13.106 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.328 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:09.327 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.325 READ DMA

 Error 3 occurred at disk power-on lifetime: 31920 hours (1330 days + 0 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 88 f8 e7 97 e1 Error: UNC 136 sectors at LBA = 0x0197e7f8 = 26732536

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 90 70 e7 97 e1 00 01:32:13.111 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:13.106 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.328 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:09.327 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.325 READ DMA

 Error 2 occurred at disk power-on lifetime: 31920 hours (1330 days + 0 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 88 f8 e7 97 e1 Error: UNC 136 sectors at LBA = 0x0197e7f8 = 26732536

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 90 70 e7 97 e1 00 01:32:13.111 READ DMA
 ec 00 88 f8 e7 97 a0 00 01:32:13.106 IDENTIFY DEVICE
 c8 00 90 70 e7 97 e1 00 01:32:09.328 READ DMA
 c8 00 70 00 e7 97 e1 00 01:32:09.327 READ DMA
 c8 00 98 68 e6 97 e1 00 01:32:09.325 READ DMA

 SMART Self-test log structure revision number 1
 No self-tests have been logged. [To run self-tests, use: smartctl -t]

 SMART Selective self-test log data structure revision number 1
 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
 1 0 0 Not_testing
 2 0 0 Not_testing
 3 0 0 Not_testing
 4 0 0 Not_testing
 5 0 0 Not_testing
 Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
 If Selective self-test is pending on power-up, resume after 0 minute delay.

 64 root <at> grml ~ #
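
A quick cross-check of the two transcripts: dd stops at the first block it cannot read, so the byte count it reports should land at or just before the LBA in the SMART error log. The snippet below is a sketch that assumes 512-byte logical sectors (an assumption, since the smartctl output does not state the sector size). `first_unread_sector` is a name invented for this example.

```python
SECTOR_SIZE = 512  # assumed logical sector size, not stated in the output


def first_unread_sector(bytes_copied: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the index of the first sector dd did not copy."""
    return bytes_copied // sector_size


# (bytes copied by dd, LBA from the SMART error log), values from above
disks = {
    "3LJ33MQ7": (3222016000, 6293007),
    "3LJ2Y6CG": (13687058432, 26732536),
}

for serial, (copied, error_lba) in disks.items():
    stop = first_unread_sector(copied)
    print(serial, "dd stopped at sector", stop, "SMART error LBA", error_lba)
```

With these numbers, dd stopped at sector 6293000 on the first disk, seven sectors before the logged LBA 6293007, and exactly at LBA 26732536 on the second. That suggests, without proving it, that in both cases the copy ended at the same unreadable region the drive logged.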

sunny.song | 3 Dec 2008 16:06

Sunny Song/Singapore/HGST is out of the office.


I will be out of the office starting 12/01/2008 and will not return until
12/08/2008.

If there are any urgent issues, please contact my backup, Mr. Kenny Lim, at
"kenny.lim <at> hitachigst.com", or Mr. Roger Kwan at
"roger.kwan <at> hitachigst.com".

Thanks.

Justin Piszcz | 3 Dec 2008 16:07

Re: Problems with two broken disks


On Wed, 3 Dec 2008, Richard Hartmann wrote:

> Hi all,
>
> I have two disks which used to be in a RAID which managed to die at the
> same time. I will paste the relevant dd & smartctl output below.
> Does anyone have any ideas how I could get more data off those disks? Is
> professional help our only chance? A huge thanks in advance!
>
> If you need any other information, please do not hesitate to contact me.

One option before going to the professionals is to mount the RAID1, if at
all possible, and use rsync to try to read all of the files off, skipping
the bad files (where rsync dies). I have saved a few disks like this in the
past.  Others may also mention dd_rescue, but I have not tried that
myself.

Justin.

Jeremy James | 3 Dec 2008 16:23

Re: Problems with two broken disks

Justin Piszcz wrote:
> On Wed, 3 Dec 2008, Richard Hartmann wrote:
> 
>> Hi all,
>>
>> I have two disks which used to be in a RAID which managed to die at the
>> same time. I will paste the relevant dd & smartctl output below.
>> Does anyone have any ideas how I could get more data off those disks? Is
>> professional help our only chance? A huge thanks in advance!
>>
>> If you need any other information, please do not hesitate to contact me.
> 
> One option before going to the professionals is to mount the RAID1 if at
> all possible and use rsync to try and read all of the files off and then
> skip the bad files (where rsync dies), saved a few disks like this in the
> past.  What others may also mention is dd_rescue but I have not tried that
> myself.

Indeed. dd_rescue is a great tool for this, since it skips over bad sectors.

Basically, you should be attempting to create a third disk with the data
from both: run dd_rescue from one, then use the other disk for the
chunks where there were failures. Once done, you could use mdadm to zero
the disk superblock and create a new RAID1 (possibly with a missing
device, if you intend to keep using this as a new RAID), then mount
your new RAID device, having first run fsck to see whether anything
(and what) is broken.

I'm assuming linux/md RAID - if you're on a hardware card then the
procedure should be similar, although more care might be needed around
the relevant equivalent to a superblock.

You're lucky to be working with RAID1 - this would be far more messy on
a RAID5!
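
The "use the other disk for the chunks where there were failures" step can be pictured with a small sketch. This is illustrative Python and not how dd_rescue works internally. The function name `merge_images`, the 4 KiB block size, and the assumption that unreadable blocks were written out as zeros in the first image are all choices made for this example. A real recovery should rely on the rescue tool's own record of bad regions rather than guessing from zeros.

```python
import io

BLOCK = 4096  # illustrative block size


def merge_images(primary, secondary, out, block=BLOCK):
    """Copy primary to out, taking a block from secondary whenever the
    primary block is all zeros (assumed here to mark a failed read).

    All three arguments are binary file objects. Returns the number of
    blocks that were taken from secondary.
    """
    patched = 0
    while True:
        a = primary.read(block)
        b = secondary.read(block)
        if not a:
            break
        if a.count(0) == len(a) and len(b) == len(a) and b != a:
            out.write(b)
            patched += 1
        else:
            out.write(a)
    return patched


# Tiny in-memory demonstration with two-block "images".
good = b"A" * BLOCK + b"B" * BLOCK
img1 = io.BytesIO(good[:BLOCK] + bytes(BLOCK))   # second block unreadable
img2 = io.BytesIO(bytes(BLOCK) + good[BLOCK:])   # first block unreadable
merged = io.BytesIO()
print(merge_images(img1, img2, merged), merged.getvalue() == good)  # 1 True
```

Because the two mirrors here failed at different LBAs, each image can in principle supply the blocks the other one lacks.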

-jeremy

Richard Hartmann | 3 Dec 2008 17:38

Re: Problems with two broken disks

On Wed, Dec 3, 2008 at 16:23, Jeremy James <jbj <at> forbidden.co.uk> wrote:
> Justin Piszcz wrote:

>>  What others may also mention is dd_rescue but I have not tried that
>> myself.
>
> Indeed. dd_rescue is a great tool for this to skip over bad sectors.

I am using it at the moment. It seems to be exactly what I wanted. Thanks :)

> I'm assuming linux/md RAID - if you're on a hardware card then the
> procedure should be similar, although more care might be needed around
> the relevant equivalent to a superblock.

It was a hardware RAID. Thankfully, it was a mirrored one, so I am
relatively fine. Once I know how much extra fat the 3ware controller
added to the beginning of the disks, I will trim that off the images
I created and run normal recovery tools on them.
I will treat both images independently, as if there never was any RAID.
That's probably the easiest thing to do.
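
One way to estimate how much the controller prepended is to look for the first sector in the image that ends with the 0x55 0xAA boot signature, which both an MBR and an NTFS boot sector carry in bytes 510-511. The sketch below is a hedged illustration in Python: `find_boot_sector_offset` is a hypothetical helper, 512-byte sectors are assumed, and a match is only a candidate, since arbitrary data can contain the same two bytes.

```python
import io

SECTOR = 512  # assumed sector size


def find_boot_sector_offset(image, max_sectors=1 << 16, sector=SECTOR):
    """Return the byte offset of the first sector ending in 0x55 0xAA,
    or None if no such sector exists in the first max_sectors sectors.

    image is a binary file object positioned at the start of the image.
    """
    for index in range(max_sectors):
        data = image.read(sector)
        if len(data) < sector:
            return None
        if data[510:512] == b"\x55\xaa":
            return index * sector
    return None


# Synthetic image: three sectors of controller metadata, then a boot sector.
boot = bytes(510) + b"\x55\xaa"
image = io.BytesIO(b"\xff" * (3 * SECTOR) + boot + bytes(SECTOR))
print(find_boot_sector_offset(image))  # 1536
```

The returned offset could then serve as the amount to skip when copying the image, but it would be wise to inspect the candidate sector before trusting it.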

In related news, what should I use to restore data from NTFS partitions
in whatever NTFS version Win Server 2003 uses? My expertise with
Windows is rather limited, I fear.

Richard

Justin Piszcz | 5 Dec 2008 14:19

Velociraptor drive related to the use of NCQ.

I swapped out my power supply, changed ALL cables and bought a $1000 RAID
controller with BBU, but the drives are still having problems: writing to
them in a RAID10 configuration locks up the card.

SMART shows the same thing on many of the disks in the RAID10:

Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   10 51 00 00 8d b8 40

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   61 80 18 00 6c 91 19 08      00:57:50.230  WRITE FPDMA QUEUED
   61 80 18 00 cd b7 19 08      00:57:50.148  WRITE FPDMA QUEUED
   61 80 e8 80 cc b7 19 08      00:57:50.147  WRITE FPDMA QUEUED
   61 80 18 00 cc b7 19 08      00:57:50.147  WRITE FPDMA QUEUED
   61 80 e8 80 cb b7 19 08      00:57:50.146  WRITE FPDMA QUEUED

The card has the latest 3ware BIOS/Firmware etc, the diag output from the 
card itself:

DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)

E=1019 T=19:57:26     : Drive removed
task file written out : cd dh ch cl sn sc ft
                       : 61 59 B8 8E 00 80 80
E=1019 T=19:57:26 P=Bh: Hard reset drive
P=Bh: HardResetDriveWait
   task file read back : st dh ch cl sn sc er
                       : 50 00 00 00 01 01 01
E=1019 T=19:57:26 P=B : Soft reset drive
E=0207 T=19:57:26 P=B : ResetDriveWait
E=1019 T=19:57:26 P=B : Inserting Set UDMA command
E=1019 T=19:57:26 P=B : Check power mode, active
E=1019 T=19:57:26 P=B : Check drive swap, same drive
E=1019 T=19:57:26 P=B : Check power cycles, initial=57, current=57
E=1019 T=19:57:26 P=Bh: exitCode = 0
Retrying chain
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)

Hm, the last thing I will try, I suppose, is disabling NCQ to see whether the
problem recurs.  So I did this: I ran the following on the RAID10 3-4 times
with NCQ enabled, and it crashed every time; then I ran it once more with NCQ
disabled.

dd if=/dev/zero of=bigfile bs=1M

--

I don't want to get too excited yet, but after disabling NCQ I was able to
write to the RAID10, over the entire array, without it crashing!

Before, it would get to 700-985GiB of 1.4TiB and then all processes 
would go into D-state, and I could not even echo b > sysrq-trigger to reboot
the host; it required a manual reboot.

--

I will let it run a few more times before making any further comments though.

echo "writing to raid10"
writing to raid10
dd if=/dev/zero of=file2 bs=1M
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s

Just as with Linux: when using NCQ with Western Digital Raptor drives (150s)
or Velociraptor drives (300s) in RAID, whether Linux SW RAID or 3ware HW
RAID, the result is the same: unstable drives. They appear to 'reset' or
time out, and this will obviously cause problems with any RAID
implementation.

NCQ+Velociraptor => bad in a RAID configuration; in non-RAID it may be OK, I
have not tested.

I also have another host here that uses the P35 chipset with Raptors in SW
RAID 1. When NCQ is enabled (and with the 750s as well) it gets nasty NCQ
errors, drive timeouts, etc.

Based on this data, the NCQ implementation on Raptors and WD 750s appears to
have problems in a RAID configuration.  I also have a 750 which I have used
for 1-2 years now (with NCQ) in a single-disk configuration with NO issues,
so whatever the problem is, it only shows up when the disk is in a RAID
configuration, in my case Linux SW RAID or 3ware HW RAID.
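
This message does not say how NCQ was turned off. For disks attached to a plain SATA port under Linux, one common way is to set the device queue depth to 1 through sysfs; drives behind a hardware RAID controller are normally configured through the controller's own tools instead. The following Python sketch shows the sysfs approach only. The helper name `set_queue_depth` and the `sysfs_root` parameter (there so it can be tried on a scratch directory) are inventions for this example, and writing to the real path requires root.

```python
import tempfile
from pathlib import Path


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write depth to <sysfs_root>/block/<device>/device/queue_depth.

    A depth of 1 effectively disables NCQ for a libata-managed disk.
    Returns the previous value read from the file.
    """
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    previous = int(path.read_text().strip())
    path.write_text(f"{depth}\n")
    return previous


# Dry run against a fake sysfs tree, so no real disk is touched.
with tempfile.TemporaryDirectory() as root:
    fake = Path(root) / "block" / "sda" / "device"
    fake.mkdir(parents=True)
    (fake / "queue_depth").write_text("31\n")
    print(set_queue_depth("sda", 1, sysfs_root=root))  # 31
```

The setting does not persist across reboots, so it would have to be reapplied at boot for a longer test.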

Justin.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Justin Piszcz | 5 Dec 2008 14:59

Re: Velociraptor drive related to the use of NCQ.


On Fri, 5 Dec 2008, Justin Piszcz wrote:

> I swapped out my power supply, changed ALL cables and bought a $1000 raid
> controller with BBU, the drives are still having problems, when writing to
> them in a RAID10 configuration, it locks up the card:
>
> SMART shows the same thing on many of the disks in the RAID10:
>
> Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 hours)
>  When the command that caused the error occurred, the device was active or 
> idle.

I spoke too soon. Turning off NCQ helped dramatically; it worked three times!

writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s
Fri Dec  5 06:00:25 EST 2008
writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 4063.25 s, 369 MB/s
Fri Dec  5 07:08:11 EST 2008
writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3926.71 s, 382 MB/s
Fri Dec  5 08:13:41 EST 2008
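
The MB/s figures in the log above can be rechecked from the byte counts and
elapsed times, since GNU dd reports MB as 10^6 bytes.  A small Python sketch
(the regular expression and the function name are illustrative, not part of
any tool used in the thread):

```python
import re

# Matches the summary line GNU dd prints, e.g.
# "1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes \(.*?\) copied, ([\d.]+) s, ")


def throughput_mb_s(line):
    """Return bytes / seconds / 10^6 for a dd summary line, else None."""
    match = DD_SUMMARY.search(line)
    if match is None:
        return None
    return int(match.group(1)) / float(match.group(2)) / 1e6


log = """\
1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s
Fri Dec  5 06:00:25 EST 2008
1499806973952 bytes (1.5 TB) copied, 4063.25 s, 369 MB/s
Fri Dec  5 07:08:11 EST 2008
1499806973952 bytes (1.5 TB) copied, 3926.71 s, 382 MB/s
"""

# Keep only the lines that parsed as dd summaries.
rates = [r for r in map(throughput_mb_s, log.splitlines()) if r is not None]
for rate in rates:
    print(f"{rate:.1f} MB/s")
```

Rounded to whole numbers, the recomputed rates agree with the 383, 369 and
382 MB/s that dd printed.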

Then it crashed again.  With NCQ enabled, it would not even complete one test.
So basically: a new system, new PSU, new cables, it's on a new APC UPS, and the
problem persists whether the disks are on a RAID card or in SW RAID; it does
not matter.  Velociraptors have problems.  I think it's time for me to get
regular 1TiB disks and be done with it.

Justin.

Justin Piszcz | 6 Dec 2008 10:26

Re: Velociraptor drive related to the use of NCQ.


I have swapped my disks:

Removed my 12 velociraptors.
Inserted my old 12 raptor150s (all ADFD).

The 12 velociraptors are now in a test system using md/raid.
The 12 raptor 150s are now in my main machine w/3ware.

I am going to run the same tests, benchmarks, etc. and see if any problems
repeat with the 150s, and I will also run a bunch more tests on the
velociraptors.

So far no problems with the good ol' raptor 150s and I have been running 
the same dd test for the past 8 hours+ on the same raid type and settings 
as I had with the velociraptors:

Fri Dec  5 21:07:43 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2747.77 s, 273 MB/s
Fri Dec  5 21:53:32 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2742.29 s, 273 MB/s
Fri Dec  5 22:39:17 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2685.21 s, 279 MB/s
Fri Dec  5 23:24:05 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2600.05 s, 288 MB/s
Sat Dec  6 00:07:28 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2550.45 s, 294 MB/s
Sat Dec  6 00:50:02 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2507.31 s, 299 MB/s
Sat Dec  6 01:31:52 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2247.31 s, 334 MB/s
Sat Dec  6 02:09:22 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2248.2 s, 334 MB/s
Sat Dec  6 02:46:53 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2245.41 s, 334 MB/s
Sat Dec  6 03:24:22 EST 2008
dd: writing `/t/bigfile': No space left on device
715067+0 records in
715066+0 records out
749801271296 bytes (750 GB) copied, 2494.1 s, 301 MB/s
Sat Dec  6 04:05:59 EST 2008
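
To summarize the ten runs above, the per-run throughput can be recomputed
from the fixed byte count and the elapsed times in the log.  A short Python
sketch (variable names are illustrative; the numbers are copied from the dd
output):

```python
from statistics import mean

BYTES_PER_RUN = 749801271296  # same for every run in the log above

# Elapsed seconds reported by dd for the ten runs, in order.
ELAPSED_S = [2747.77, 2742.29, 2685.21, 2600.05, 2550.45,
             2507.31, 2247.31, 2248.2, 2245.41, 2494.1]

# dd reports MB as 10^6 bytes.
rates = [BYTES_PER_RUN / seconds / 1e6 for seconds in ELAPSED_S]

summary = {
    "runs": len(rates),
    "min": min(rates),
    "max": max(rates),
    "mean": mean(rates),
}
print(f"{summary['runs']} runs: min {summary['min']:.0f}, "
      f"max {summary['max']:.0f}, mean {summary['mean']:.0f} MB/s")
```

The runs range from about 273 to 334 MB/s, and all ten completed.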

So far no problems, just as when I used these drives in the past: they are
rock solid (I will continue to test; the VRs are another story).  More info
to follow soon.

Justin.
