Andy Smith | 2 Jan 20:42 2005

ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Thu, Dec 30, 2004 at 10:39:42PM +0100, Peter T. Breuer wrote:
> In gmane.linux.raid Michael Tokarev <mjt <at> tls.msk.ru> wrote:
> > Peter T. Breuer wrote:
> > > In gmane.linux.raid Georg C. F. Greve <greve <at> fsfeurope.org> wrote:
> > > 
> > > Yes, well, don't put the journal on the raid partition. Put it
> > > elsewhere (anyway, journalling and raid do not mix, as write ordering
> > > is not - deliberately - preserved in raid, as far as I can tell).
> > 
> > This is a sort of a nonsense, really.  Both claims, it seems.
> 
> It's perfectly correct, as far as I know!

Not really wishing to get into the middle of a flame war, but I
didn't really see how this could be true so I asked for more info on
ext3-users.

I got the following response:

https://listman.redhat.com/archives/ext3-users/2005-January/msg00003.html

Peter T. Breuer | 2 Jan 21:18 2005

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Andy Smith <andy <at> strugglers.net> wrote:
> On Thu, Dec 30, 2004 at 10:39:42PM +0100, Peter T. Breuer wrote:
> > In gmane.linux.raid Michael Tokarev <mjt <at> tls.msk.ru> wrote:
> > > Peter T. Breuer wrote:
> > > > In gmane.linux.raid Georg C. F. Greve <greve <at> fsfeurope.org> wrote:
> > > > 
> > > > Yes, well, don't put the journal on the raid partition. Put it
> > > > elsewhere (anyway, journalling and raid do not mix, as write ordering
> > > > is not - deliberately - preserved in raid, as far as I can tell).
> > > 
> > > This is a sort of a nonsense, really.  Both claims, it seems.
> > 
> > It's perfectly correct, as far as I know!
> 
> Not really wishing to get into the middle of a flame war, but I
> didn't really see how this could be true so I asked for more info on
> ext3-users.
> 
> I got the following response:
> 
> https://listman.redhat.com/archives/ext3-users/2005-January/msg00003.html

Interesting - I'll post it (there is no flame war):

>     * From: "Stephen C. Tweedie" <sct redhat com>
>     * To: Andy Smith <andy lug org uk>
>     * Cc: Stephen Tweedie <sct redhat com>, ext3 users list <ext3-users redhat com>
(Continue reading)

Andy Smith | 3 Jan 01:30 2005

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote:
> Stephen Tweedie wrote:
> > Umm, if soft raid is expected to have silent invisible corruptions in
> > normal use,
> 
> It is, just as all types of RAID are.  This is a very strange thing for
> Stephen to say - I cannot believe that he is as naive as he makes
> himself out to be about RAID here and I don't know why he should say
> that (presuming that he really knows better).
> 
> > then you shouldn't be using it, period.  That's got zero to
> > do with journaling.
> 
> It implies that one should not be doing journalling on top of it.
> 
> (The logic for why RAID corrupts silently is that errors accumulate at
> n times the normal rate per sector, but none of them are detected by
> RAID (no crc), and when a disk drops out then you get a good chance of
> picking up a corrupted copy instead of a good copy, because nobody
> has checked the copy in the meantime to see if it matches the original).
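
One way to read the "n times the normal rate" figure above is as simple arithmetic: if each copy of a sector independently goes bad with some small probability p, the chance that at least one of n copies is bad is 1 - (1 - p)^n, which is close to n * p. The sketch below only illustrates that arithmetic. The error rate is a made-up placeholder, independence between copies is an assumption, and it says nothing about whether a given RAID level would detect the bad copy.

```python
def p_any_copy_bad(p_sector: float, n_copies: int) -> float:
    """Probability that at least one of n independent copies of a sector is bad."""
    return 1.0 - (1.0 - p_sector) ** n_copies


# Hypothetical per-copy undetected error probability, for illustration only.
p = 1e-6

for n in (1, 2, 3):
    # For small p this stays very close to n * p.
    print(n, p_any_copy_bad(p, n), n * p)
```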

I have no idea which of you to believe now. :(

I currently only have one system using software raid, and several of
my employer's machines using hardware raid, all of which have
various raid-1, -5 and -10 setups and all use only ext3.

Let's focus on my personal machine for now, since it uses
Linux software RAID and is therefore on-topic here.  It has /boot on a
small RAID-1, and the rest of the system is on RAID-5 with an
(Continue reading)

Neil Brown | 3 Jan 07:41 2005

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday January 3, andy <at> strugglers.net wrote:
> 
> I have no idea which of you to believe now. :(

Well, how about I wade in.....

(almost*) No block storage device will guarantee that write ordering
is maintained.  Neither will read requests necessarily be ordered.

Any SCSI, IDE, or similar disc drive in Linux (or any other non-toy
OS) will have requests managed by an "elevator algorithm" which
coalesces adjacent blocks and tries to re-order requests to make
optimal use of the device.
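
As a rough illustration of the coalescing and re-ordering described above, here is a toy sketch. It is not how any real kernel I/O scheduler is implemented: the `(start_sector, sector_count)` request shape and the `elevator_dispatch` name are assumptions made up for this example. It only shows that the order in which requests are dispatched need not match the order in which they were submitted.

```python
from typing import List, Tuple

# Hypothetical request shape: (start_sector, sector_count).
Request = Tuple[int, int]


def elevator_dispatch(queue: List[Request]) -> List[Request]:
    """Sort pending requests by start sector and merge adjacent ones."""
    merged: List[Request] = []
    for start, count in sorted(queue):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends: coalesce.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged


# Requests in the order the filesystem submitted them.
submitted = [(900, 8), (100, 8), (908, 8), (108, 8)]
dispatched = elevator_dispatch(submitted)
print(dispatched)
```

Four submitted writes become two larger ones, and the write submitted first (sector 900) is dispatched after the one submitted second (sector 100).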

A RAID controller, whether software, firmware, or hardware, will also
re-order requests to make best use of the devices.

Any filesystem that assumes that requests will not be re-ordered is
broken, as the assumption is wrong.
I would be *very* surprised if Reiserfs makes this assumption.

Until relatively recently, the only assumption that could be made is
that a write request will be handled sometime between when it is made,
and when the request completes (i.e. the end_io callback is called).
If several requests are concurrent they could commit in any order.

With only this guarantee, the simplest approach for a journalling
filesystem is to write the content of a journal entry, wait for the
writes to complete, and then write a single block "header" which
describes and hence commits that journal entry.  The journal entry is
(Continue reading)
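
The "write the entry, wait, then write the commit header" approach described in the message above can be sketched in user space as follows. This is a minimal sketch under stated assumptions, not ext3's on-disk journal format: the record layout, the `CMIT` marker, and the `append_entry` and `replay` names are all hypothetical, and `os.fsync` merely stands in for "wait for the writes to complete".

```python
import os
import struct
import tempfile
import zlib

MAGIC = b"CMIT"  # hypothetical commit marker, not a real ext3 structure


def append_entry(fd: int, payload: bytes) -> None:
    """Write a journal entry, then commit it with a separate record."""
    # Step 1: write the content of the journal entry.
    os.write(fd, struct.pack("<I", len(payload)) + payload)
    # Step 2: wait until that write has completed before going further.
    os.fsync(fd)
    # Step 3: write the small commit record that describes the entry.
    os.write(fd, MAGIC + struct.pack("<I", zlib.crc32(payload)))
    os.fsync(fd)


def replay(data: bytes) -> list:
    """Return only the entries that are followed by a valid commit record."""
    entries = []
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        body_end = pos + 4 + length
        commit_end = body_end + 8
        if commit_end > len(data):
            break  # payload or commit record is incomplete
        payload = data[pos + 4:body_end]
        if data[body_end:body_end + 4] != MAGIC:
            break  # no commit marker: entry was never committed
        (crc,) = struct.unpack_from("<I", data, body_end + 4)
        if crc != zlib.crc32(payload):
            break  # commit record does not describe this payload
        entries.append(payload)
        pos = commit_end
    return entries


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.bin")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        append_entry(fd, b"update inode 12")
        # Simulate a crash: the payload is written but never committed.
        os.write(fd, struct.pack("<I", 5) + b"lost!")
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        recovered = replay(f.read())

print(recovered)
```

Because the commit record is only issued after the payload write has completed, the only ordering this relies on is the completion guarantee mentioned earlier in the message; an entry without its commit record is simply ignored on replay.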

Peter T. Breuer | 3 Jan 09:03 2005

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Andy Smith <andy <at> strugglers.net> wrote:
> On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote:
> > Stephen Tweedie wrote:
> > > Umm, if soft raid is expected to have silent invisible corruptions in
> > > normal use,
> > 
> > It is, just as all types of RAID are.  This is a very strange thing for
> > Stephen to say - I cannot believe that he is as naive as he makes
> > himself out to be about RAID here and I don't know why he should say
> > that (presuming that he really knows better).
> > 
> > > then you shouldn't be using it, period.  That's got zero to
> > > do with journaling.
> > 
> > It implies that one should not be doing journalling on top of it.
> > 
> > (The logic for why RAID corrupts silently is that errors accumulate at
> > n times the normal rate per sector, but none of them are detected by
> > RAID (no crc), and when a disk drops out then you get a good chance of
> > picking up a corrupted copy instead of a good copy, because nobody
> > has checked the copy in the meantime to see if it matches the original).
> 
> I have no idea which of you to believe now. :(

Both of us. We have not disagreed fundamentally. Read closely! Stephen
says "IF (my caps) soft raid is expected to have ...". Well, it is,
just like any RAID.

(Continue reading)

Peter T. Breuer | 3 Jan 09:37 2005

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> Well, how about I wade in.....

Sure! :-)

> A RAID controller, whether software, firmware, or hardware, will also
> re-order requests to make best use of the devices.

Possibly.  I have written block device drivers that maintain write
order, however (or at least do so if you ask them to, with the right
switch), because ...

> Any filesystem that assumes that requests will not be re-ordered is
> broken, as the assumption is wrong.
> I would be *very* surprised if Reiserfs makes this assumption.

... because that is EXACTLY what Hans Reiser has said to me. I don't
think I've kept the mail, but I remember it.  A quick Google search for
reiserfs + write ordering turns up some suggestive quotes:

  > We cannot use the buffer.c dirty list anyway because bdflush can write
  > those buffers to disk at any time.  Transactions have to control the
  > write ordering  ...

(hey, that was Hans quoting Stephen). From the Linux High Availability
website (http://linuxha.trick.ca/DataRedundancyByDrbd):

   Since later WRITE requests might depend on successful finished
   previous ones, this is needed to assure strict write ordering on
   both nodes. ...
(Continue reading)

