Andy Smith | 3 Jan 01:30

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote:
> Stephen Tweedie wrote:
> > Umm, if soft raid is expected to have silent invisible corruptions in
> > normal use,
> 
> It is, just as is all types of RAID.  This is a very strange thing for
> Stephen to say - I cannot believe that he is as naive as he makes
> himself out to be about RAID here and I don't know why he should say
> that (presuming that he really knows better).
> 
> > then you shouldn't be using it, period.  That's got zero to
> > do with journaling.
> 
> It implies that one should not be doing journalling on top of it.
> 
> (The logic for why RAID corrupts silently is that errors accumulate at
> n times the normal rate per sector, but none of them are detected by
> RAID (no crc), and when a disk drops out then you get a good chance of
> picking up a corrupted copy instead of a good copy, because nobody
> has checked the copy meanwhiles to see if it matches the original).

I have no idea which of you to believe now. :(

I currently only have one system using software raid, and several of
my employer's machines using hardware raid, all of which have
various raid-1, -5 and -10 setups and all use only ext3.

Let's focus on the personal machine of mine for now since it uses
Linux software RAID and therefore on-topic here.  It has /boot on a
small RAID-1, and the rest of the system is on RAID-5 with an

Neil Brown | 3 Jan 06:49

Re: Regarding RAID0 Code

On Saturday January 1, poonamsbox <at> yahoo.com wrote:
> Respected Sir,
>  
>             I am a student from India. I am studying the RAID 0 code
>             in the linux kernel 2.4.20. I got some problems in
>             understanding the code, especially in the
>             raid0_make_request function. I didnt understand the
>             following statements.. like why the anding is done, and
>             what is sect_in_chunk. Please if you could tell me the
>             details, it would be a very great help for me. Please
>             sir, hoping for some help. 
>  
>         sect_in_chunk = bh->b_rsector & ((chunk_size<<1) -1);
>         chunk = (block - zone->zone_offset) / (zone->nb_dev << chunksize_bits);
>         tmp_dev = zone->dev[(block >> chunksize_bits) % zone->nb_dev];
>         rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
>                 + sect_in_chunk;

"sect_in_chunk" is the number of the target sector within its chunk.
i.e. the remainder when the target sector is divided by the chunk
size.

If it still doesn't make sense, try writing the code for mapping an
array sector number to a (device number + device sector number).  Then
ask specifically about differences.

NeilBrown

> 
>                Thanking you in anticipation.

Neil Brown | 3 Jan 07:41

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday January 3, andy <at> strugglers.net wrote:
> 
> I have no idea which of you to believe now. :(

Well, how about I wade in.....

(almost*) No block storage device will guarantee that write ordering
is maintained.  Neither will read requests necessarily be ordered.

Any SCSI, IDE, or similar disc drive in Linux (or any other non-toy
OS) will have requests managed by an "elevator algorithm" which
coalesces adjacent blocks and tries to re-order requests to make
optimal use of the device.

A RAID controller, whether software, firmware, or hardware, will also
re-order requests to make best use of the devices.

Any filesystem that assumes that requests will not be re-ordered is
broken, as the assumption is wrong.
I would be *very* surprised if Reiserfs makes this assumption.

Until relatively recently, the only assumption that could be made is
that a write request will be handled sometime between when it is made,
and when the request completes (i.e. the end_io callback is called).
If several requests are concurrent they could commit in any order.

With only this guarantee, the simplest approach for a journalling
filesystem is to write the content of a journal entry, wait for the
writes to complete, and then write a single block "header" which
describes and hence commits that journal entry.  The journal entry is

Peter T. Breuer | 3 Jan 09:03

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Andy Smith <andy <at> strugglers.net> wrote:
> On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote:
> > Stephen Tweedie wrote:
> > > Umm, if soft raid is expected to have silent invisible corruptions in
> > > normal use,
> > 
> > It is, just as is all types of RAID.  This is a very strange thing for
> > Stephen to say - I cannot believe that he is as naive as he makes
> > himself out to be about RAID here and I don't know why he should say
> > that (presuming that he really knows better).
> > 
> > > then you shouldn't be using it, period.  That's got zero to
> > > do with journaling.
> > 
> > It implies that one should not be doing journalling on top of it.
> > 
> > (The logic for why RAID corrupts silently is that errors accumulate at
> > n times the normal rate per sector, but none of them are detected by
> > RAID (no crc), and when a disk drops out then you get a good chance of
> > picking up a corrupted copy instead of a good copy, because nobody
> > has checked the copy meanwhiles to see if it matches the original).
> 
> I have no idea which of you to believe now. :(

Both of us. We have not disagreed fundamentally. Read closely! Stephen
says "IF (my caps) soft raid is expected to have ...". Well, it is,
just like any RAID.


Peter T. Breuer | 3 Jan 09:37

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> Well, how about I wade in.....

Sure! :-)

> A RAID controller, whether software, firmware, or hardware, will also
> re-order requests to make best use of the devices.

Possibly.  I have written block device drivers that maintain write
order, however (or at least do so if you ask them to, with the right
switch), because ...

> Any filesystem that assumes that requests will not be re-ordered is
> broken, as the assumption is wrong.
> I would be *very* surprised if Reiserfs makes this assumption.

.. because that is EXACTLY what Hans Reiser has said to me. I don't
think I've kept the mail, but I remember it.  A quick Google search for
reiserfs + write ordering shows up some suggestive quotes:

  > We cannot use the buffer.c dirty list anyway because bdflush can write
  > those buffers to disk at any time.  Transactions have to control the
  > write ordering  ...

(hey, that was Hans quoting Stephen). From the Linux High Availability
website (http://linuxha.trick.ca/DataRedundancyByDrbd):

   Since later WRITE requests might depend on successful finished
   previous ones, this is needed to assure strict write ordering on
   both nodes. ...

Guy | 3 Jan 09:58

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

"Well, you can make somewhere. You only require an 8MB (one cylinder)
partition."

So, it is ok for your system to fail when this disk fails?
I don't want system failures when a disk fails, so mirror (or RAID5)
everything required to keep your system running.

"And there is a risk of silent corruption on all raid systems - that is
well known."
I question this....
I bet a non-mirrored disk has a similar risk to a RAID1.  But with a RAID1, you
know when a difference occurs, if you want.

Guy

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of Peter T. Breuer
Sent: Monday, January 03, 2005 3:03 AM
To: linux-raid <at> vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

Andy Smith <andy <at> strugglers.net> wrote:
> On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote:
> > Stephen Tweedie wrote:
> > > Umm, if soft raid is expected to have silent invisible corruptions in
> > > normal use,

Peter T. Breuer | 3 Jan 10:30

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

"Also sprach Guy:"
> "Well, you can make somewhere. You only require an 8MB (one cylinder)
> partition."
> 
> So, it is ok for your system to fail when this disk fails?

You lose the journal.  I don't remember what ext3fs does in that case.
I seem to recall that it feels ill and goes into read-only mode, but it
may feel sicker and just go toes up.  I don't recall.  That journals
kept dying on me for systems like /var taught me to put them on
disposable media when I did use to use them there ...

You can definitely react with a simple tune2fs -O ^journal or whatever
is appropriate. I know that because Ted Tso added the "force" option to
tune2fs at my request, when I showed him a system that had lost its
journal and wouldn't remove it from its metadata BECAUSE the journal
was not there to be removed.

> I don't want system failures when a disk fails,

Your scenario seems to be that you have the disks of your mirror on the
same physical system.  That's fundamentally dangerous - they're both
subject to damage when the system blows up.  I instead have an array
node (where the journal is kept), and a local mirror component and a
remote mirror component.

That system is doubled, and each half of the double hosts the other's
remote mirror component. Each half fails over to the other.

There it makes sense to have the journal separate, but local on the

Guy | 3 Jan 11:17

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

See notes below with **.

Guy

-----Original Message-----
From: ptb <at> inv.it.uc3m.es [mailto:ptb <at> inv.it.uc3m.es] 
Sent: Monday, January 03, 2005 4:17 AM
To: Guy
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

"Also sprach Guy:"
> "Well, you can make somewhere. You only require an 8MB (one cylinder)
> partition."
> 
> So, it is ok for your system to fail when this disk fails?

You lose the journal, that's all.  You can react with a simple tune2fs
-O ^journal or whatever is appropriate.  And a journal is ONLY there in
order to protect you against crashes of the SYSTEM (not the disk), so
what was the point of having the journal in the first place? 

** When you lose the journal, does the system continue without it?
** Or does it require user intervention?

> I don't want system failures when a disk fails,

"don't use a journal then" seems to be the easy answer for you, but
probably "put it on an ultrasafe medium like gold-plated persistent ram"
works better!

Brad Campbell | 3 Jan 11:18

Parity error detection - was Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:

> And there is a risk of silent corruption on all raid systems - that is
> well known. Different raid systems do different things to compensate,
> such as periodically recalculating the parity on everything. But when
> you have redundant data and a corruption occurs, which of the two
> datasets do you believe? You have to choose one of them! You guess
> wrong half the time, if you guess ("you" is a raid system). Hence
> "silent corruption".

Just on this point. With RAID-6 I guess you can get a majority rules type of ruling on which data 
block is the dud. Unless you come up with 3 different results in which case something is really not 
right in lego land.

I wonder perhaps about a userspace app that you can run to check all the parity blocks. On RAID-5 it 
should be able to tell you, you have a naff stripe, but on RAID-6 in theory if it's only a single 
drive that is a problem you should be able to correct the block.

Regards,
Brad
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter T. Breuer | 3 Jan 12:31

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Guy <bugzilla <at> watkins-home.com> wrote:
> "Also sprach Guy:"
> > "Well, you can make somewhere. You only require an 8MB (one cylinder)
> > partition."
> > 
> > So, it is ok for your system to fail when this disk fails?
> 
> You lose the journal, that's all.  You can react with a simple tune2fs
> -O ^journal or whatever is appropriate.  And a journal is ONLY there in
> order to protect you against crashes of the SYSTEM (not the disk), so
> what was the point of having the journal in the first place? 
> 
> ** When you lose the journal, does the system continue without it?
> ** Or does it require user intervention?

I don't recall. It certainly at least puts itself into read-only mode
(if that's the error mode specified via tune2fs). And the situation
probably changes from version to version.

On a side note, I don't know why you think user intervention is not
required when a raid system dies.  As a matter of likelihoods, I have
never seen a disk die while IN a working soft (or hard) raid system and
the system continue working afterwards.  Instead, the normal disaster
sequence as I have experienced it is:

   1) lightning strikes rails, or a/c goes out and room full of servers
      overheats. All lights go off.

   2) when sysadmin arrives to sort out the smoking wrecks, he finds
      that 1 in 3 random disks are fried - they're simply the points

