Jonathan Bell | 1 Oct 2007 02:30

Strange arbitrary port resets on ICH9R with Seagate drives

Hello

I've just purchased a brand spanking new G33/ICH9R based system for use as  
a home fileserver with 4x ST3750840AS Seagate SATA drives as the main  
grunt drives.

The problem is that all of the Seagate drives keep resetting, as this  
dmesg excerpt shows:

[ 2114.613486] ata5: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0x2 frozen
[ 2114.613494] ata5: (irq_stat 0x00400040, connection status changed)
[ 2115.188869] ata5: waiting for device to spin up (8 secs)
[ 2116.832307] ata6: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0x2 frozen
[ 2116.832314] ata6: (irq_stat 0x00400040, connection status changed)
[ 2117.405372] ata6: waiting for device to spin up (8 secs)
[ 2123.316046] ata5: soft resetting port
[ 2123.487789] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 2123.529172] ata5.00: ata_hpa_resize 1: sectors = 1465149168, hpa_sectors = 1465149168
[ 2123.587389] ata5.00: ata_hpa_resize 1: sectors = 1465149168, hpa_sectors = 1465149168
[ 2123.587395] ata5.00: configured for UDMA/133
[ 2123.587400] ata5: EH complete
[ 2123.587628] SCSI device sdb: 1465149168 512-byte hdwr sectors (750156 MB)
[ 2123.587862] sdb: Write Protect is off
[ 2123.587866] sdb: Mode Sense: 00 3a 00 00
[ 2123.588054] SCSI device sdb: write cache: enabled, read cache: enabled,  
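For readers decoding the excerpt above, here is a minimal sketch of how the SStatus and SErr values break down. It assumes the SStatus field layout from the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that the values in the log are printed in hexadecimal. The helper names are illustrative only and are not taken from the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/* SStatus fields (assumed layout): DET bits 3:0, SPD bits 7:4, IPM bits 11:8. */
unsigned sstatus_det(uint32_t s) { return s & 0xf; }
unsigned sstatus_spd(uint32_t s) { return (s >> 4) & 0xf; }
unsigned sstatus_ipm(uint32_t s) { return (s >> 8) & 0xf; }

/* The two SError diagnostic bits that are set in 0x4010000. */
#define SERR_PHYRDY_CHG (UINT32_C(1) << 16) /* PhyRdy signal changed state */
#define SERR_DEV_XCHG   (UINT32_C(1) << 26) /* device presence changed */

/* Illustrative helper: print a readable summary of both registers. */
void describe_link(uint32_t sstatus, uint32_t serror)
{
    static const char *const speed[] = { "no link", "1.5 Gbps", "3.0 Gbps" };
    unsigned spd = sstatus_spd(sstatus);

    printf("DET=%u SPD=%u (%s) IPM=%u\n", sstatus_det(sstatus), spd,
           spd < 3 ? speed[spd] : "unknown", sstatus_ipm(sstatus));
    if (serror & SERR_PHYRDY_CHG)
        printf("SError: PhyRdy changed\n");
    if (serror & SERR_DEV_XCHG)
        printf("SError: device exchanged\n");
}
```

Calling describe_link(0x113, 0x4010000) reports an established link at 1.5 Gbps with both SError bits set, which is consistent with the "connection status changed" line: the port saw the link drop and come back rather than a data error.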

su henry | 1 Oct 2007 06:18

[patch] Add more device IDs for supporting ATI SB800 SATA controller

From: henry su <henry.su.ati@gmail.com>

ATI/AMD SB800 shares some device IDs with SB700,
and SB800 adds two more device IDs: 0x4394 and 0x4395.

Signed-off-by: henry su <henry.su.ati@gmail.com>
---------------------------
diff -Nur a/drivers/ata/ahci.c b/drivers/ata/ahci.c
--- a/drivers/ata/ahci.c        2007-09-17 22:07:04.000000000 +0800
+++ b/drivers/ata/ahci.c        2007-09-17 22:10:46.000000000 +0800
@@ -399,10 +399,12 @@

        /* ATI */
        { PCI_VDEVICE(ATI, 0x4380), board_ahci_sb600 }, /* ATI SB600 */
-       { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb600 }, /* ATI SB700 IDE */
-       { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb600 }, /* ATI SB700 AHCI */
-       { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb600 }, /* ATI SB700 nraid5 */
-       { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb600 }, /* ATI SB700 raid5 */
+       { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb600 }, /* ATI SB700/800 */
+       { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb600 }, /* ATI SB700/800 */
+       { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb600 }, /* ATI SB700/800 */
+       { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb600 }, /* ATI SB700/800 */
+       { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb600 }, /* ATI SB700/800 */
+       { PCI_VDEVICE(ATI, 0x4395), board_ahci_sb600 }, /* ATI SB800 */

        /* VIA */
        { PCI_VDEVICE(VIA, 0x3349), board_ahci_vt8251 }, /* VIA VT8251 */
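As background for the patch above: a driver's ID table is matched against the vendor/device pair a PCI device reports, and the matching entry selects the board type. The following is a simplified userspace sketch of that lookup, not the kernel's implementation. The struct, enum, and function names are hypothetical; 0x1002 is the PCI vendor ID assigned to ATI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ATI 0x1002 /* PCI vendor ID assigned to ATI */

/* Hypothetical stand-in for the driver's board index. */
enum board_id { BOARD_NONE = -1, BOARD_AHCI_SB600 = 0 };

struct id_entry {
    uint16_t vendor;
    uint16_t device;
    enum board_id board;
};

/* Device IDs as listed in the patch, ending with a zero terminator. */
static const struct id_entry ati_ids[] = {
    { VENDOR_ATI, 0x4380, BOARD_AHCI_SB600 }, /* SB600 */
    { VENDOR_ATI, 0x4390, BOARD_AHCI_SB600 },
    { VENDOR_ATI, 0x4391, BOARD_AHCI_SB600 },
    { VENDOR_ATI, 0x4392, BOARD_AHCI_SB600 },
    { VENDOR_ATI, 0x4393, BOARD_AHCI_SB600 },
    { VENDOR_ATI, 0x4394, BOARD_AHCI_SB600 }, /* added by the patch */
    { VENDOR_ATI, 0x4395, BOARD_AHCI_SB600 }, /* added by the patch */
    { 0, 0, BOARD_NONE }
};

/* Walk the table until the terminator; return the board for a match. */
enum board_id match_board(const struct id_entry *tbl,
                          uint16_t vendor, uint16_t device)
{
    assert(tbl != NULL);
    for (; tbl->vendor != 0; tbl++)
        if (tbl->vendor == vendor && tbl->device == device)
            return tbl->board;
    return BOARD_NONE;
}
```

Without the two new entries, a controller reporting 0x4394 or 0x4395 would fall through to BOARD_NONE in this sketch, which corresponds to the driver not claiming the device.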
Attachment (sb800sata.patch): text/x-patch, 1060 bytes

Alexander Sabourenkov | 1 Oct 2007 09:04

Promise SATA300 TX4: errors, oops in ext3 code

Hardware:  Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10 
320G, jumper-limited to SATA150.
Kernel : 2.6.22.9 amd64

Problem:
Heavy load causes errors and triggers oops.

History:
Problems were first encountered on kernel 2.6.19, both i686 ("old" 
system) and amd64 (Gentoo installation CD).
I can't say anything about older kernels; most probably they have the 
same issues (or worse).

Problems were blamed:
   - SATA300 being too 'hot'  (jumpered the drives)
   - cables (work perfectly on onboard controller)
   - interrupt sharing (found the only slot which does not share 
interrupt line)
   - cooling (3 fans installed, smartctl-reported temperature at max 
load dropped to 35C)
   - weak PSU (installed 600W FSP)
   - kernel bugs (upgraded to 2.6.22.9)

All those measures significantly reduced the error rate (from about 20 to 
2-4 per mirror rebuild) but did not eliminate the problem.

Errors are easily reproduced by performing resync on a md RAID-1. 
Raising overall system load (compilation, copy operations on other HDDs) 
makes errors happen sooner.


Clemens Koller | 1 Oct 2007 11:09

Re: Promise SATA300 TX4: errors, oops in ext3 code

Alexander Sabourenkov wrote:
 > Hardware:  Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
 > 320G, jumper-limited to SATA150.
 > Kernel : 2.6.22.9 amd64
 >
 > Problem:
 > Heavy load causes errors and triggers oops.

Have you checked your memory already (memtest86)?

We have several applications with Promise controllers on strange
hardware, and we have never had integrity problems, even with e.g.
non-standard SATA connections over custom vacuum-tight connectors.

 > Problems were blamed:
 >   - SATA300 being too 'hot'  (jumpered the drives)

Is this a common known problem with your harddrives or controller?
(ask google) Otherwise, it sounds like a problem with broken hardware.

 >   - cables (work perfectly on onboard controller)
 >   - interrupt sharing (found the only slot which does not share
 > interrupt line)
 >   - cooling (3 fans installed, smartctl-reported temperature at max load
 > dropped to 35C)

Try to heat up your memory a little (your wife's hair blower).
If it fails more often, your memory is most likely broken.

 >   - weak PSU (installed 600W FSP)

Alexander Sabourenkov | 1 Oct 2007 12:26

Re: Promise SATA300 TX4: errors, oops in ext3 code

Clemens Koller wrote:
> Alexander Sabourenkov wrote:
>  > Hardware:  Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
>  > 320G, jumper-limited to SATA150.
>  > Kernel : 2.6.22.9 amd64
>  >
>  > Problem:
>  > Heavy load causes errors and triggers oops.
> 
> Have you checked your memory already (memtest86)?

Last run was about a year ago.

This box gets regularly updated (rebuild of all installed software),
so I'm reasonably certain that memory is ok - gcc being almost as 
sensitive as memtest.

Will recheck anyway.

> 
> We have several applications with Promise controllers on strange
> hardware, and we have never had integrity problems, even with e.g.
> non-standard SATA connections over custom vacuum-tight connectors.

Judging from the Linux and FreeBSD mailing lists, the TX4 is now quite
well known for intermittent problems that are hard to reproduce on
different hardware.

I have two machines with those controllers, one FreeBSD-6.2 on MSI 
K8Neo2 motherboard (ATI chipset),

Jeff Garzik | 1 Oct 2007 14:28

Re: Polling (was Re: [PATCHSET 2/2] implement PMP support, take 6)

Alan Cox wrote:
> ty today, with 3+ month kernel release cycles (ugh!!).
>> I'm very much interested in hearing suggestions and comments.
> 
> I think PMP should go in for 2.6.24 and then get revised over time to not
> poll and to go via qc_issue. Which seems to be the consensus of everyone
> but you on this one.

It's not an "over time" issue.  SAS drivers that will be ready for 
2.6.24 (broadsas, mvsas) have both been coded to support PMP 
transparently -- but the current libata PMP code requires that PMP 
support in SAS be TURNED OFF, because it is fundamentally incompatible 
by its design.

This is an issue that has been present and known for many, many months 
-- as long as drivers/scsi/libsas/sas_ata.c has been in the tree.

The more we walk down the polling path, the more incompatible we become 
with SATA-capable controllers that are in users' hands today.

	Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jeff Garzik | 1 Oct 2007 14:38

Re: Polling (was Re: [PATCHSET 2/2] implement PMP support, take 6)

Mark Lord wrote:
> Linux kernel development is supposed to happen incrementally nowadays.
> Get a nice working solution in place, and then enhance/tune it.

It's not about enhancing and tuning.

It's about me (and/or James B) having to __undo__ the current code, just 
to get things working on an entire class of SATA-capable controllers out 
in the field.

	Jeff


Frans Pop | 1 Oct 2007 14:43

ata1.00: spurious completions during NCQ

Hi,

On 15 August 2007 Tejun Heo wrote:
> You don't need to worry too much as long as errors are properly
> recovered. All commands are retried and you won't lose any data.
> Please report if the spurious NCQ problem happens again.  Thanks.  

I've just received a logcheck mail with another one. I have not seen any in
the time between my initial mail [1] and now, so this is the second
occurrence in 1.5 months. The message is slightly different this time.

I'm currently running 2.6.23-rc8 + CFS patchset.

kernel: ata1.00: spurious completions during NCQ issue=0x0 SAct=0x8 FIS=005040a1:00000004
kernel: ata1.00: cmd 60/58:18:b3:56:bd/00:00:01:00:00/40 tag 3 cdb 0x0 data 45056 in
kernel:          res 50/00:58:b3:56:bd/00:00:01:00:00/40 Emask 0x2 (HSM violation)
kernel: ata1: soft resetting port
kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
kernel: ata1.00: configured for UDMA/133
kernel: ata1: EH complete
kernel: sd 0:0:0:0: [sda] 321672960 512-byte hardware sectors (164697 MB)
kernel: sd 0:0:0:0: [sda] Write Protect is off
kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Cheers,
Frans Pop

[1] http://marc.info/?l=linux-ide&m=118703097726277&w=2
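The "cmd" line in the log above packs the ATA taskfile into hex fields. Below is a minimal sketch of decoding it, assuming the field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and assuming that for an NCQ read (command 0x60) the sector count is carried in the feature fields and the tag in bits 7:3 of nsect. The struct and function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Taskfile fields in the order they appear in the log line (assumed). */
struct taskfile {
    unsigned command, feature, nsect, lbal, lbam, lbah;
    unsigned hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
    unsigned device;
};

/* Parse e.g. "60/58:18:b3:56:bd/00:00:01:00:00/40"; returns 0 on success. */
int parse_taskfile(const char *s, struct taskfile *tf)
{
    int n = sscanf(s, "%2x/%2x:%2x:%2x:%2x:%2x/%2x:%2x:%2x:%2x:%2x/%2x",
                   &tf->command, &tf->feature, &tf->nsect,
                   &tf->lbal, &tf->lbam, &tf->lbah,
                   &tf->hob_feature, &tf->hob_nsect,
                   &tf->hob_lbal, &tf->hob_lbam, &tf->hob_lbah,
                   &tf->device);
    return n == 12 ? 0 : -1;
}

/* 48-bit LBA: the "hob" bytes are the high-order halves. */
uint64_t taskfile_lba48(const struct taskfile *tf)
{
    return ((uint64_t)tf->hob_lbah << 40) | ((uint64_t)tf->hob_lbam << 32) |
           ((uint64_t)tf->hob_lbal << 24) | ((uint64_t)tf->lbah << 16) |
           ((uint64_t)tf->lbam << 8) | (uint64_t)tf->lbal;
}

/* NCQ commands carry the sector count in the feature fields. */
unsigned ncq_sector_count(const struct taskfile *tf)
{
    return (tf->hob_feature << 8) | tf->feature;
}

/* NCQ commands carry the queue tag in bits 7:3 of nsect. */
unsigned ncq_tag(const struct taskfile *tf)
{
    return tf->nsect >> 3;
}
```

Applied to the logged command, this yields tag 3 and 88 sectors; 88 sectors of 512 bytes is 45056 bytes, which agrees with the "tag 3 ... data 45056 in" part of the same line.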

Sergei Shtylyov | 1 Oct 2007 15:12

Re: [PATCH] pata_hpt3x2n: Clean up DPLL stuff

Hello.

Alan Cox wrote:

> Nobody commented when I asked for review earlier so it must be ok  8)

    It's not that I've seen this before

 > so it must be ok  8)

    Not by me at least, so let me NAK it. 8-)

> Let's stick it in -mm to be sure

    Now let's unstick it. :-)

> Signed-off-by: Alan Cox <alan@redhat.com>
> 
> diff -u --new-file --exclude-from /usr/src/exclude --recursive linux.vanilla-2.6.23rc8-mm1/drivers/ata/pata_hpt37x.c linux-2.6.23rc8-mm1/drivers/ata/pata_hpt37x.c
> --- linux.vanilla-2.6.23rc8-mm1/drivers/ata/pata_hpt37x.c	2007-09-26 16:46:48.000000000 +0100
> +++ linux-2.6.23rc8-mm1/drivers/ata/pata_hpt37x.c	2007-09-18 16:44:32.000000000 +0100

    Wait, I thought you're patching pata_hpt3x2n!

> @@ -844,6 +844,46 @@
>  	/* Never went stable */
>  	return 0;
>  }
> +

Jeff Garzik | 1 Oct 2007 15:31

Re: Polling (was Re: [PATCHSET 2/2] implement PMP support, take 6)

Tejun Heo wrote:
> Jeff Garzik wrote:
>> Polling ALREADY makes the job of fixing SAS/SATA exception handling
>> difficult.  Expanding polling to something SAS/SATA controllers treat as
>> fundamentally irq-driven and integrated with the rest of the command
>> flow is moving in the wrong direction.
>>
>> To re-re-re-summarize, polling in PMP is fundamentally broken for an
>> ENTIRE CLASS OF HARDWARE that we actively support today.  And
>> jgarzik/misc-2.6.git#sas is adding two more controllers to that list.
> 
> As an interim solution, it doesn't make anything worse tho.  Those
> drivers don't support PMP anyway.  After rc1 merge, polling PMP access
> can be replaced with new qc_issue (probably ata_exec_internal) based code.
> 
> The question here is whether it's worth including PMP support with
> polling PMP register access as an interim solution for 2.6.24.  I think
> it will be beneficial for both user convenience and testing as long as
> the said change is made soon after -rc1.

Polling PMP in 2.6.24 is completely unacceptable.  It screws the 2.6.24 SAS 
driver releases out of PMP.

I pulled your last PMP patchset, and will now endeavor to fix the API 
prior to 2.6.24 merge window opening.

Linux high-level message-submit / message-complete APIs should never 
_require_ polling, even if it's 100% polling under the hood.  There are 
far too many cases in the field where you don't have direct access to 
hardware registers to poll.  Or such polling would interfere with the 
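To illustrate the design point being argued here, the following is a rough sketch of a submit/complete interface in which the caller never polls: it hands over a request with a completion callback, and the backend decides whether completion is driven by an interrupt or by polling internally. This is not libata's or libsas's API; every name below is hypothetical, and the "interrupt" is simulated by an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>

struct request;
typedef void (*complete_fn)(struct request *req, int status);

/* A stand-in for one command; the caller supplies the completion callback. */
struct request {
    complete_fn complete;
    void *context;
};

/* A backend hides how completion is detected from the caller. */
struct backend {
    void (*submit)(struct backend *be, struct request *req);
    void (*poll)(struct backend *be); /* NULL when interrupt-driven */
    struct request *inflight;
};

/* High-level entry point: identical for every kind of backend. */
void submit_request(struct backend *be, struct request *req,
                    complete_fn fn, void *ctx)
{
    assert(fn != NULL);
    req->complete = fn;
    req->context = ctx;
    be->submit(be, req);
}

/* Shared helper: finish the in-flight request, if there is one. */
static void finish(struct backend *be, int status)
{
    struct request *req = be->inflight;

    if (req == NULL)
        return;
    be->inflight = NULL;
    req->complete(req, status);
}

/* Both sample backends just record the request when it is submitted. */
void queue_only(struct backend *be, struct request *req)
{
    be->inflight = req;
}

/* Polled hardware: progress happens only inside the backend's poll hook. */
void polled_poll(struct backend *be) { finish(be, 0); }

/* Interrupt-driven hardware: the simulated irq handler completes it. */
void simulated_irq(struct backend *be) { finish(be, 0); }

/* Sample callback: count completions through the context pointer. */
void count_completion(struct request *req, int status)
{
    if (status == 0)
        ++*(int *)req->context;
}
```

The caller's code path through submit_request() is the same in both cases; only the backend knows whether polled_poll() or simulated_irq() delivers the completion. A controller that offers no registers to poll can then still implement the interface.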

