Mr. James W. Laferriere | 1 Oct 2008 05:50

Re: exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen

 	Hello Justin ,

On Tue, 30 Sep 2008, Justin Piszcz wrote:
> On Tue, 30 Sep 2008, Tom Mortensen wrote:
>
>> Don't know if this is the original poster's problem, but if the drive
>> is spun down, then enabling SMART or trying to read SMART attributes
>> causes the drive to spin up and the command is delayed until this has
>> occurred.
>> 
>> The fix is to increase the timeout given to scsi_execute() in
>> drivers/ata/libata-scsi.c.
>> 
>> ie, current code (2.6.26.5) is:
>>
>>        /* Good values for timeout and retries?  Values below
>>           from scsi_ioctl_send_command() for default case... */
>>        cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
>>                                  sensebuf, (10*HZ), 5, 0);
>> 
>> Should be changed to:
>>
>>        /* Good values for timeout and retries?  Values below
>>           from scsi_ioctl_send_command() for default case... */
>>        cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
>>                                  sensebuf, (30*HZ), 5, 0);
>> 
>> Using a 1TB Hitachi hard drive, this command times out because it

Justin Piszcz | 1 Oct 2008 10:06

Re: exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen


On Tue, 30 Sep 2008, Mr. James W. Laferriere wrote:

> 	Hello Justin ,
>
>> 
>> Justin.
> 	I take it you've tried different drive manufacturers?
It happens across 12-14 VelociRaptors.

> 	Or even a different drive from the same manufacturer?
It also occurs (I have seen it) on WD 750GiB drives on a different motherboard
and chipset (P35).

> 	Seeing as you've moved this same drive(?) across several chipsets &
> possibly motherboards, this leads me to believe that the difficulty is either
> with the driver or the drive (if it is always the same drive or drive model)
Other people have the same problem with Seagate and other drives.
This also occurs on Raptor 150s.

Justin.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Danilo Godec | 1 Oct 2008 11:25

SATA errors?

Hi,

I've been searching the web, Google, and mailing lists for a while now, but
can't really find the answer - so I'm hoping for a 'SATA guru' here...

On one of my servers, I run a 4-drive Linux software RAID5 array. One of the
drives keeps reporting these errors:

> Oct  1 10:11:29 bigxen2 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0x2 frozen
> Oct  1 10:11:29 bigxen2 kernel: ata1.00: (irq_stat 0x04400000, PHY RDY changed)
> Oct  1 10:11:29 bigxen2 kernel: ata1.00: cmd 25/00:08:5c:be:60/00:00:14:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Oct  1 10:11:29 bigxen2 kernel:          res 50/00:00:6b:6a:60/00:00:14:00:00/e0 Emask 0x10 (ATA bus error)
> Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7 secs)
> Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
> Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete
> Oct  1 10:11:47 bigxen2 kernel: SCSI device sda: 976771055 512-byte hdwr sectors (500107 MB)
> Oct  1 10:11:47 bigxen2 kernel: sda: Write Protect is off
> Oct  1 10:11:47 bigxen2 kernel: sda: Mode Sense: 00 3a 00 00

Wolfgang Denk | 1 Oct 2008 12:08

Re: SATA errors?

Dear Danilo,

In message <48E34221.1000008 <at> agenda.si> you wrote:
> 
> I've been searching the web, Google, mailing lists for a while now, but
> can't really find the answer - so I'm hoping for a 'SATA guru' here...

I'm not a guru, but I have had similar experiences before.

> As far as I can tell these happen randomly - sometimes it's 8 hours
> between two, sometimes it's a couple of minutes. There are about 5-20 of
> those per day, however Linux raid never kicks the drive out of the
> array. There are also no other signs of drive not functioning properly
> (such as filesystem corruption or similar).
> 
> Any ideas? Can anyone 'decode' the above errors?

In my experience, problems like this are often caused by
broken/unreliable cables / connectors / backplanes.

As a first measure, try replugging the SATA cables.

If this doesn't help, try swapping around the disks and cables to see
if the problem is with the cable (sticks with the disk) or with the
backplane (sticks with a physical port).

Then replace the faulty components.

Best regards,


Danilo Godec | 1 Oct 2008 12:41

Re: SATA errors?

Wolfgang Denk pravi:
> In my experience, problems like this are often caused by
> broken/unreliable cables / connectors / backplanes.
>
> As a first measure, try replugging the SATA cables.
>
> If this doesn't help, try swapping around the disks and cables to see
> if the problem is with the cable (sticks with the disk) or with the
> backplane (sticks with a physical port).
>   
That has been one of my ideas too and I have already checked and swapped
cables - but no success. I couldn't change the backplane as I don't have
a spare one.

But one thing crossed my mind just seconds after I hit the 'send'
button!

I periodically check (every minute)  the drive temperature using
'smartctl -a' and I only query one drive - '/dev/sda'! So I re-checked
my log files and indeed - the SATA error ALWAYS happens within one
second of the 'smartctl' command (but not every time).

So now I changed the scripts to query a different drive ('/dev/sdb'). In
a couple of hours I should know if that was it...

  Thanks for the help, Danilo

PS: Oh, one more thing - the 'sda' drive is a WDC, while all the others
are Seagate.
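
The correlation described above (SATA errors landing within one second of a
'smartctl' run) can be checked mechanically. What follows is only a minimal
sketch: the poll times are hypothetical stand-ins for whatever the monitoring
script records, the year 2008 is assumed because syslog stamps carry no year,
and the function names are illustrative rather than part of any existing tool.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2008):
    """Parse the leading 'Oct  1 10:11:29' stamp of a syslog line.

    Syslog omits the year, so it is supplied by the caller (assumed 2008).
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def errors_near_polls(log_lines, poll_times, window=timedelta(seconds=1)):
    """Return (error_time, poll_time) pairs that lie within `window` of each other."""
    matches = []
    for line in log_lines:
        if "exception Emask" not in line:
            continue  # only the first line of each libata error report counts
        err = parse_syslog_time(line)
        for poll in poll_times:
            if abs(err - poll) <= window:
                matches.append((err, poll))
                break
    return matches


kernel_log = [
    "Oct  1 10:11:29 bigxen2 kernel: ata1.00: exception Emask 0x10 SAct 0x0 "
    "SErr 0x10002 action 0x2 frozen",
    "Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete",
]

# Hypothetical once-a-minute poll times; a real script would log its own.
polls = [
    datetime(2008, 10, 1, 10, 10, 29),
    datetime(2008, 10, 1, 10, 11, 29),
]

hits = errors_near_polls(kernel_log, polls)
for err, poll in hits:
    print(f"error at {err:%H:%M:%S} is within 1s of poll at {poll:%H:%M:%S}")
```

If every error report pairs with a poll while most polls pair with no error,
that matches the "always within one second, but not every time" observation.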


Justin Piszcz | 1 Oct 2008 13:12

Re: exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen


On Wed, 1 Oct 2008, Justin Piszcz wrote:

>
>
> On Tue, 30 Sep 2008, Mr. James W. Laferriere wrote:
>
>> 	Hello Justin ,
>> 
>>> 
>>> Justin.
>> 	I take it you've tried different drive manufacturers?
> It happens across 12-14 VelociRaptors.
>
>> 	Or even a different drive from the same manufacturer?
> It also occurs (I have seen it) on WD 750GiB drives on a different
> motherboard and chipset (P35).
>
>> 	Seeing as you've moved this same drive(?) across several chipsets &
>> possibly motherboards, this leads me to believe that the difficulty is either
>> with the driver or the drive (if it is always the same drive or drive
>> model)
> Other people have the same problem with Seagate and other drives.
> This also occurs on Raptor 150s.
>
> Justin.
>
>
>

David Lethe | 1 Oct 2008 13:48

RE: SATA errors?


> -----Original Message-----
> From: linux-raid-owner <at> vger.kernel.org [mailto:linux-raid-
> owner <at> vger.kernel.org] On Behalf Of Danilo Godec
> Sent: Wednesday, October 01, 2008 5:41 AM
> To: Wolfgang Denk
> Cc: Linux RAID Mailing List
> Subject: Re: SATA errors?
> 
> Wolfgang Denk pravi:
> > In my experience, problems like this are often caused by
> > broken/unreliable cables / connectors / backplanes.
> >
> > As a first measure, try replugging the SATA cables.
> >
> > If this doesn't help, try swapping around the disks and cables to see
> > if the problem is with the cable (sticks with the disk) or with the
> > backplane (sticks with a physical port).
> >
> >
> That has been one of my ideas too and I have already checked and
> swapped cables - but no success. I couldn't change the backplane as I
> don't have a spare one.
> 
> But there is one thing though that crossed my mind just seconds after
> I've hit the 'send' button!
> 
> I periodically check (every minute)  the drive temperature using
> 'smartctl -a' and I only query one drive - '/dev/sda'! So I re-checked

David Greaves | 1 Oct 2008 14:17

Re: SATA errors?

David Lethe wrote:
> There is no cause of concern. The 0x25 command translates to
> READ_CAPACITY10.  (i.e., how many blocks does the disk hold).  This
> command is emulated because the disk doesn't natively speak SCSI
> commands, which is how your specific hardware/driver/controller
> combination configures such things.

and yet look at the timestamps...

> Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7 secs)
> Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
> Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete

That looks to me like 15-17 seconds of unresponsive disk; certainly the periods
around the resets are times when the driver isn't allowing disk access.

I'd say there was cause for something; although I'd cc the linux-ide group for
real insight, not linux-raid :)

David - maybe the response from the 0x25 command should not result in a reset -
or maybe the 0x25 should not be issued if it causes a state that does require a
reset.
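
The 15-17 second estimate can be read straight off the timestamps. This is a
rough sketch that takes the quoted excerpt as input and assumes the year 2008,
since syslog stamps carry no year; the helper name `stamp` is illustrative only.

```python
from datetime import datetime

# The excerpt quoted above, with each wrapped log line rejoined.
LOG = """\
Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7 secs)
Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete
"""


def stamp(line, year=2008):
    """Parse the leading syslog timestamp; the year is an assumption."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


times = [stamp(line) for line in LOG.splitlines() if line.strip()]

# Elapsed time from the first recovery message to "EH complete".
outage = (times[-1] - times[0]).total_seconds()
print(f"{outage:.0f} seconds between first and last message")  # 17 seconds
```

Syslog resolution is one second, so the figure is only good to about a second
either way.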


David Lethe | 1 Oct 2008 15:11

RE: SATA errors?


> -----Original Message-----
> From: David Greaves [mailto:david <at> dgreaves.com]
> Sent: Wednesday, October 01, 2008 7:18 AM
> To: David Lethe
> Cc: Danilo Godec; Wolfgang Denk; Linux RAID Mailing List
> Subject: Re: SATA errors?
> 
> David Lethe wrote:
> > There is no cause of concern. The 0x25 command translates to
> > READ_CAPACITY10.  (i.e., how many blocks does the disk hold).  This
> > command is emulated because the disk doesn't natively speak SCSI
> > commands, which is how your specific hardware/driver/controller
> > combination configures such things.
> 
> and yet look at the timestamps...
> 
> 
> > Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7 secs)
> > Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> > Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> > Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> > Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> > Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Bill Davidsen | 1 Oct 2008 17:09

Re: exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen

Tejun Heo wrote:
> Bill Davidsen wrote:
>   
>> Gwendal Grignou wrote:
>>     
>>> About ata1:0 problem, as reported in the bugzilla bug: I would try to
>>> disable NCQ to see if it helps. Your disks' firmware might not fully
>>> support it.
>>>
>>> You can either add the parameter "libata.force=noncq" when loading
>>> your kernel, or set queue_depth to 1 for all the Seagate drives behind
>>> the Marvell MV88SX6081 controller.
>>>
>>> About ata5:0 , someone - in user space probably - is trying to do a
>>> SMART ENABLE operation, but the device ignores it. I don't know which
>>> device you are using, but I assume it does not support ATA SMART
>>> feature set. Timeout is an acceptable but not a nice way to answer, a
>>> cancel would have been better; check if there is a firmware upgrade
>>> for your device.
>>>   
>>>       
>> You certainly called the SMART issue, I was wondering why a new
>> distribution install on some older hardware was getting all the errors,
>> clearly the Fedora "smartd" doesn't check SMART capability before trying
>> to enable the feature. Oddly the drive on which I see this does reply to
>> SMART requests, so the firmware must be "semi-functional." Not a
>> problem, in my case the drive is just used for testing handling of hot
>> swap, and has no data of any value.
>>     
>
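
The per-device alternative to "libata.force=noncq" mentioned above, setting
queue_depth to 1, is normally done by writing to the device's sysfs attribute.
A minimal sketch follows, assuming the usual /sys/block/<device>/device/queue_depth
layout; the device name "sdb" and the function name are hypothetical, and the
demo writes to a temporary fake tree so it runs without root or real hardware.

```python
import tempfile
from pathlib import Path


def set_queue_depth(device, depth=1, sysfs_root="/sys"):
    """Write `depth` to the device's queue_depth attribute and return its path.

    A depth of 1 effectively disables NCQ for that one drive. On a real
    system this needs root and sysfs_root="/sys".
    """
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    path.write_text(f"{depth}\n")
    return path


# Demo against a throwaway directory that mimics the sysfs layout.
with tempfile.TemporaryDirectory() as fake_sys:
    node = Path(fake_sys) / "block" / "sdb" / "device"
    node.mkdir(parents=True)
    (node / "queue_depth").write_text("31\n")  # pretend NCQ is currently on

    written = set_queue_depth("sdb", 1, sysfs_root=fake_sys)
    result = written.read_text().strip()
    print(result)  # 1
```

On real hardware the call would be set_queue_depth("sdb") for each affected
drive. A sysfs write does not survive a reboot, whereas the kernel parameter
applies on every boot.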

