Michal Soltys | 1 Oct 06:45 2007

Re: Backups w/ rsync

Wolfgang Denk wrote:
> Dear Bill,
> 
> in message <46FD1442.70707 <at> tmr.com> you wrote:
>>
>> Be aware that rsync is useful for making a *copy* of your files, which 
>> isn't always the best backup. If the goal is to preserve data and be 
>> able to recover in time of disaster, it's probably not optimal, while if 
>> you need frequent access to old or deleted files it's fine.
> 
> If you want to do real backups you should use real tools, like bacula
> etc.
> 

I wouldn't agree here. It all depends on how you organize your things, write
your scripts, and so on. It isn't any less of a "real" solution than amanda or
bacula. It is much more of a DIY solution, though, so not everyone will be
inclined to use it.
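
As a rough sketch of what I mean by DIY - paths and layout below are just
placeholders, not a recommendation - a snapshot-style backup with hard links
can be as small as:

  #!/bin/sh
  # keep dated, hard-linked snapshots; unchanged files take no extra space
  SRC=/home/
  DEST=/backup
  TODAY=$(date +%Y-%m-%d)
  rsync -a --delete --link-dest="$DEST/latest" "$SRC" "$DEST/$TODAY/"
  ln -snf "$TODAY" "$DEST/latest"

Old snapshots can then be expired simply by removing the dated directories.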

PS: Sorry for the off-topic noise - this is my last message on the subject.


Dale Dunlea | 1 Oct 11:16 2007

RAID5 lockup with AMCC440 and async-tx

Hi,

I have a board with an AMCC440 processor, running RAID5 using the
async-tx interface. In general, it works well, but I have found a test
case that consistently causes a hard lockup of the entire system.

What makes this case odd is that I have only been able to generate it
when accessing disks that are on two separate HBAs - in my case
mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
trivial to repeat. I simply create a RAID5 using disks from each HBA,
wait for it to resync, and then run

"dd if=/dev/zero of=/dev/md0 bs=512 count=100000".

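Spelled out, the whole reproduction is roughly the following - device names
and member count are placeholders, the point being that the member disks are
split across the two HBAs:

  # members deliberately taken from both mpt-fusion HBAs
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  watch cat /proc/mdstat    # wait here until the initial resync has finished
  dd if=/dev/zero of=/dev/md0 bs=512 count=100000
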
By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
away, but then so does my performance.

Any pointers on how to debug this? It feels like a race condition of
some description, but any serial port printing I enable causes the
problem to go away, and I can't print silently to /var/log/messages as
the system hangs before it can flush.
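
One thing that might capture the trace without the timing impact of the
serial console is netconsole, which sends printk output over UDP - the
addresses, MAC and interface name below are only placeholders:

  # on the board (needs CONFIG_NETCONSOLE):
  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.1/00:11:22:33:44:55
  dmesg -n 8        # raise the console loglevel so everything gets forwarded
  # on the receiving host:
  nc -u -l 6666     # some netcat variants need: nc -u -l -p 6666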

Regards,
Dale

Justin Piszcz | 1 Oct 12:13 2007

Re: RAID5 lockup with AMCC440 and async-tx


On Mon, 1 Oct 2007, Dale Dunlea wrote:

> Hi,
>
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.
>
> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run
>
> "dd if=/dev/zero of=/dev/md0 bs=512 count=100000".
>
> By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
> away, but then so does my performance.
>
> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.
>
> Regards,
> Dale

Wolfgang Denk | 1 Oct 12:32 2007

Re: RAID5 lockup with AMCC440 and async-tx

Dear Dale,

in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564 <at> mail.gmail.com> you wrote:
> 
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.

Please make sure to use the latest code - we found a bug recently.

> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run

We saw similar problems; in our case they showed up only with a large
number of disks in combination with big kernel page sizes (64 kB).

> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.

See above - please try the current code.

Best regards,

Wolfgang Denk


Dale Dunlea | 1 Oct 13:02 2007

Re: RAID5 lockup with AMCC440 and async-tx

On 01/10/2007, Wolfgang Denk <wd <at> denx.de> wrote:
> Dear Dale,
>
> in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564 <at> mail.gmail.com> you wrote:
> >
> > I have a board with an AMCC440 processor, running RAID5 using the
> > async-tx interface. In general, it works well, but I have found a test
> > case that consistently causes a hard lockup of the entire system.
>
> Please make sure to use the latest code - we found a bug recently.

Latest code from Dan or latest code from denx.de? I grabbed the latest
code from Dan, but I'm having trouble cloning denx.de:

"remote: error: object directory /home/git/linux-2.6/.git/objects does
not exist; check .git/objects/info/alternates."
>
> > What makes this case odd is that I have only been able to generate it
> > when accessing disks that are on two separate HBAs - in my case
> > mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> > trivial to repeat. I simply create a RAID5 using disks from each HBA,
> > wait for it to resync, and then run
>
> We saw similar problems; in our case they showed up only with a large
> number of disks in combination with big kernel page sizes (64 kB).
>
The problem occurs for me with both 4k and 64k pages.

Regards,
Dale

Daniel Santos | 1 Oct 13:04 2007

problem killing raid 5

Hello,

I had a RAID 5 array on three disks. Because of a hardware problem, two
disks disappeared one after the other. I have since been trying to
create a new array with them.

Between the two disk failures I tried removing one of the failed disks
and re-adding it to the array. When the second disk failed I noticed the
drive numbers on the broken array, and mysteriously a fourth drive
appeared on it. Now I have numbers 0, 1 and 3, but no number 2.
mdadm tells me that number 3 is a spare.

Now I want to start all over again, but even after zeroing the
superblocks on all three disks and creating a new array,
/proc/mdstat shows the same drive numbers while reconstructing the
third drive.
What should I do?
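
The wipe-and-recreate sequence I mean is roughly the following (device names
are placeholders for my three disks):

  mdadm --stop /dev/md0                                  # stop any half-assembled array
  mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1  # wipe the old metadata
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1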

Daniel Santos

Wolfgang Denk | 1 Oct 19:39 2007

Re: RAID5 lockup with AMCC440 and async-tx

Dear Dale,

in message <8a24fb800710010402u5aa0187bq4f850b8cb71483c9 <at> mail.gmail.com> you wrote:
>
> Latest code from Dan or latest code from denx.de? I grabbed the latest

From linux-2.6-denx

> code from Dan, but I'm having trouble cloning denx.de:
> 
> "remote: error: object directory /home/git/linux-2.6/.git/objects does
> not exist; check .git/objects/info/alternates."

Argh.. Stupid me.

Please try again - this one is fixed now.
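
(A fresh clone along these lines should work again; the exact repository
path here is an assumption on the editor's part, so check denx.de for the
canonical one:)

  git clone git://git.denx.de/linux-2.6-denx.git linux-2.6-denx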

> > We saw similar problems; in our case they showed up only with a large
> > number of disks in combination with big kernel page sizes (64 kB).
> >
> The problem occurs for me with both 4k and 64k pages.

Probably using more than one controller adds to the likelihood of
being hit by this race condition.

Best regards,

Wolfgang Denk


Daniel Santos | 1 Oct 20:20 2007

Re: problem killing raid 5

I retried rebuilding the array once again from scratch, and this time
checked the syslog messages. The reconstruction process is getting
stuck at a disk block that it can't read. I double-checked the block
number by repeating the array creation, and did a bad block scan. No bad
blocks were found. How could the md driver be stuck if the block is fine?

Supposing that the disk has bad blocks, can I have a RAID device on
disks that have bad blocks? Each one of the disks is 400 GB.

Probably not a good idea, because if a drive has bad blocks it probably
will have more in the future. But anyway, can I?
The bad blocks would have to be known to the md driver.
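
One check that might help narrow it down - sdX and BLOCK are placeholders,
and the block number from syslog may need translating into a per-disk sector
offset - is to read just the suspect sector directly, bypassing the cache,
and watch dmesg for an I/O error:

  dd if=/dev/sdX of=/dev/null bs=512 skip=BLOCK count=1 iflag=direct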

Daniel Santos wrote:
> Hello,
>
> I had a RAID 5 array on three disks. Because of a hardware problem, two
> disks disappeared one after the other. I have since been trying to
> create a new array with them.
>
> Between the two disk failures I tried removing one of the failed
> disks and re-adding it to the array. When the second disk failed I
> noticed the drive numbers on the broken array, and mysteriously a
> fourth drive appeared on it. Now I have numbers 0, 1 and 3, but no
> number 2.
> mdadm tells me that number 3 is a spare.
>
> Now I want to start all over again, but even after zeroing the
> superblocks on all three disks and creating a new array,
> /proc/mdstat shows the same drive numbers while reconstructing the

Michael Tokarev | 1 Oct 20:47 2007

Re: problem killing raid 5

Daniel Santos wrote:
> I retried rebuilding the array once again from scratch, and this time
> checked the syslog messages. The reconstruction process is getting
> stuck at a disk block that it can't read. I double-checked the block
> number by repeating the array creation, and did a bad block scan. No bad
> blocks were found. How could the md driver be stuck if the block is fine?
> 
> Supposing that the disk has bad blocks, can I have a RAID device on
> disks that have bad blocks? Each one of the disks is 400 GB.
> 
> Probably not a good idea, because if a drive has bad blocks it probably
> will have more in the future. But anyway, can I?
> The bad blocks would have to be known to the md driver.

Well, almost all modern drives can remap bad blocks (at least I know of no
drive that can't).  Most of the time this happens on write - because if such
a bad block is found during a read operation and the drive really can't
read the content of that block, it can't remap it either without losing
data.  From my experience (about 20 years, many hundreds of drives, mostly
(old) SCSI but (old) IDE too), it's pretty normal for a drive to develop
several bad blocks, especially during its first year of use.  Sometimes,
however, the number of bad blocks grows quite rapidly, and such a drive
definitely should be replaced - at least Seagate drives are covered
by warranty in this case.
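
A quick way to see whether a drive has already started remapping is to look
at its SMART counters - device names are placeholders, and the exact
attribute wording varies between drives:

  # ATA/SATA: look at Reallocated_Sector_Ct and Current_Pending_Sector
  smartctl -A /dev/sdX | grep -i -e reallocated -e pending
  # SCSI: smartctl usually reports "Elements in grown defect list"
  smartctl -a /dev/sgN | grep -i defect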

SCSI drives have two so-called "defect lists", stored somewhere inside the
drive - the factory-preset list (bad blocks found during internal testing
when producing the drive) and the grown list (bad blocks found by the drive
during normal usage).  The factory-preset list can contain from 0 to about
1000 entries or even more (depending on the drive size too), and the grown list can

Patrik Jonsson | 1 Oct 20:58 2007

Re: problem killing raid 5

Michael Tokarev wrote:
> Daniel Santos wrote:
>> I retried rebuilding the array once again from scratch, and this time
>> checked the syslog messages. The reconstruction process is getting
>> stuck at a disk block that it can't read. I double-checked the block
>> number by repeating the array creation, and did a bad block scan. No bad
>> blocks were found. How could the md driver be stuck if the block is fine?
>>
>> Supposing that the disk has bad blocks, can I have a RAID device on
>> disks that have bad blocks? Each one of the disks is 400 GB.
>>
>> Probably not a good idea, because if a drive has bad blocks it probably
>> will have more in the future. But anyway, can I?
>> The bad blocks would have to be known to the md driver.
> 
> Well, almost all modern drives can remap bad blocks (at least I know of no
> drive that can't).  Most of the time this happens on write - because if such
> a bad block is found during a read operation and the drive really can't
> read the content of that block, it can't remap it either without losing
> data.  From my experience (about 20 years, many hundreds of drives, mostly
> (old) SCSI but (old) IDE too), it's pretty normal for a drive to develop
> several bad blocks, especially during its first year of use.  Sometimes,
> however, the number of bad blocks grows quite rapidly, and such a drive
> definitely should be replaced - at least Seagate drives are covered
> by warranty in this case.
> 
> SCSI drives have two so-called "defect lists", stored somewhere inside the
> drive - the factory-preset list (bad blocks found during internal testing
> when producing the drive) and the grown list (bad blocks found by the drive
> during normal usage).  The factory-preset list can contain from 0 to about

