Dave Chinner | 1 May 11:36 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

On Sat, Apr 30, 2011 at 04:27:48PM +0100, Peter Grandi wrote:
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations,

XFS will write log-stripe-unit sized records to disk. If the log
buffers are not full, it pads them. Supported log-sunit sizes are up
to 256k.
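
For concreteness, a hedged illustration of how one might inspect or set
this; the device name, mount point and data geometry below are made up,
not taken from the thread:

  # Show the log section, including its stripe unit, of an existing fs:
  xfs_info /mnt/data | grep -A1 '^log'

  # Set the log stripe unit explicitly at mkfs time (256k being the
  # maximum mentioned above); the data su/sw values are only an example:
  mkfs.xfs -d su=64k,sw=14 -l su=256k /dev/md0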

> even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small writes when they dequeue the
> operations from the journal to the actual metadata records
> involved.
> 
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set? I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.

Not bad at all, because the journal writes are sequential, and XFS
can have multiple log IOs in progress at once (up to 8 x 256k =
2MB). So in general while metadata operations are in progress, XFS
will fill full stripes with log IO and you won't get problems with
RMW.

> Are there studies, or even just impressions or anecdotes, on how
> bad this is?

Just buy decent RAID hardware with a BBWC and journal IO does not
(Continue reading)

Brad Campbell | 1 May 16:50 2011

Re: High mismatch count on root device - how to best handle?

On 01/05/11 06:51, Mark Knecht wrote:
> On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk<wd <at> denx.de>  wrote:
>> Dear Phil Turmel,
>>
>> In message<4DB8BEFE.3020009 <at> turmel.org>  you wrote:
>>>
>>> Hmmm.  Since it's not swap, this would make me worry about the hardware.
>>> Have you considered shuffling SATA port assignments to see if a pattern
>>> shows up?  Also consider moving some of the drive power load to another PS.
>>
>> I do not think this is hardware related.  I see this behaviour on at
>> least 5 different machines which show no other problems except for the
>> mismatch count in the RAID 1 partitions that hold the /boot partition.

root <at> srv:/server# grep . /sys/block/md?/md/mismatch_cnt
/sys/block/md0/md/mismatch_cnt:0
/sys/block/md1/md/mismatch_cnt:128
/sys/block/md2/md/mismatch_cnt:0
/sys/block/md3/md/mismatch_cnt:41728
/sys/block/md4/md/mismatch_cnt:896
/sys/block/md5/md/mismatch_cnt:0
/sys/block/md6/md/mismatch_cnt:4352

root <at> srv:/server# cat /proc/mdstat | grep md[1346]
md6 : active raid1 sdp6[0] sdo6[1]
md4 : active raid1 sdp3[0] sdo3[1]
md3 : active raid1 sdp2[0] sdo2[1]
md1 : active raid1 sdp1[0] sdo1[1]

root <at> srv:/server# cat /etc/fstab | grep md[1346]
(Continue reading)
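
For a RAID1 like the ones above, the usual way to act on a non-zero
count is md's sync_action interface; a hedged sketch, using md3 only
because it shows the largest count in the listing:

  # Rewrite inconsistent blocks (md copies one mirror's data over the
  # other, so this assumes you don't care which copy wins):
  echo repair > /sys/block/md3/md/sync_action

  # Wait for the pass to finish (watch /proc/mdstat), then re-check;
  # mismatch_cnt is updated by the check/repair pass:
  echo check > /sys/block/md3/md/sync_action
  cat /sys/block/md3/md/mismatch_cnt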

David Brown | 1 May 17:24 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

On 01/05/11 00:27, NeilBrown wrote:
> On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2 <at> xf2.for.sabi.co.UK (Peter Grandi)
> wrote:
>
>> While I agree with BAARF.com arguments fully, I sometimes have
>> to deal with legacy systems with wide RAID6 sets (for example 16
>> drives, quite revolting) which have op-journaled filesystems on
>> them like XFS or JFS (sometimes block-journaled ext[34], but I
>> am not that interested in them for this).
>>
>> Sometimes (but fortunately not that recently) I have had to deal
>> with small-file filesystems setup on wide-stripe RAID6 setup by
>> morons who don't understand the difference between a database
>> and a filesystem (and I have strong doubts that RAID6 is
>> remotely appropriate to databases).
>>
>> So I'd like to figure out how much effort I should invest in
>> undoing cases of the above, that is how bad they are likely to
>> be and how badly they degrade over time (usually very badly).
>>
>> First a couple of question purely about RAID, but indirectly
>> relevant to op-journaled filesystems:
>>
>>    * Can Linux MD do "abbreviated" read-modify-write RAID6
>>      updates like for RAID5? That is where not the whole stripe
>>      is read in, modified and written, but just the block to be
>>      updated and the parity blocks.
>
> No.  (patches welcome).

(Continue reading)

Peter Grandi | 1 May 17:31 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

[ ... ]

>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>> updates like for RAID5? [ ... ]

> No. (patches welcome).

Ahhhm, but let me dig a bit deeper, even if it may be implied in
the answer: would it be *possible*?

That is, is the double parity scheme used in MD such that it is
possible to "subtract" the old content of a page and "add" the
new content of that page to both parity pages?

[ ... ]

> The ideal config for a journalled filesystem is to put the
> journal on a separate smaller lower-latency device.  e.g. a
> small RAID1 pair.

> In a previous work place I had good results with:
>   RAID1 pair of small disks with root, swap, journal
>   Large RAID5/6 array with bulk of filesystem.

Sounds reasonable, except that I am allergic to RAID5 (except in
two cases) and RAID6 (in general) :-), but I guess it would work
equally well with RAID10 and its delightful MD implementation.
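
For XFS on such a layout the journal can live on the small RAID1 pair
as an external log; a hedged sketch with made-up device names (/dev/md1
as the small pair, /dev/md2 as the big array):

  # Create the filesystem with its log on the separate low-latency device:
  mkfs.xfs -l logdev=/dev/md1 /dev/md2

  # An external log must also be named at mount time:
  mount -o logdev=/dev/md1 /dev/md2 /mnt/bulk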

[ ... ]

(Continue reading)

Christoph Hellwig | 1 May 18:48 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

On Sun, May 01, 2011 at 05:24:09PM +0200, David Brown wrote:
> I suppose it also makes sense to put the write-intent bitmap for md
> raid on such a raid1 pair (typically SSD's).

Note that right now you can't actually put the bitmap on a device;
it requires a file on a filesystem, which seems rather confusing.

Also make sure that you don't already max out the solid state device's
IOPS rate with the log writes, in which case putting the bitmap on
it as well will slow it down.
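
So in practice the bitmap is given as a file path; a hedged mdadm sketch
with made-up paths (the file must of course not live on the array it
describes):

  # Keep the write-intent bitmap in a file on the SSD-backed filesystem:
  mdadm --grow /dev/md0 --bitmap=/ssd/md0.bitmap

  # Or store it in the array's own superblock area instead:
  mdadm --grow /dev/md0 --bitmap=internal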

Mark Knecht | 1 May 19:13 2011

Re: High mismatch count on root device - how to best handle?

On Sun, May 1, 2011 at 7:50 AM, Brad Campbell <lists2009 <at> fnarfbargle.com> wrote:
> On 01/05/11 06:51, Mark Knecht wrote:
>>
>> On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk<wd <at> denx.de>  wrote:
>>>
>>> Dear Phil Turmel,
>>>
>>> In message<4DB8BEFE.3020009 <at> turmel.org>  you wrote:
>>>>
> >>>> Hmmm.  Since it's not swap, this would make me worry about the hardware.
>>>>  Have you considered shuffling SATA port assignments to see if a pattern
>>>> shows up?  Also consider moving some of the drive power load to another PS.
>>>
>>> I do not think this is hardware related.  I see this behaviour on at
>>> least 5 different machines which show no other problems except for the
> >>> mismatch count in the RAID 1 partitions that hold the /boot partition.
>
> root <at> srv:/server# grep . /sys/block/md?/md/mismatch_cnt
> /sys/block/md0/md/mismatch_cnt:0
> /sys/block/md1/md/mismatch_cnt:128
> /sys/block/md2/md/mismatch_cnt:0
> /sys/block/md3/md/mismatch_cnt:41728
> /sys/block/md4/md/mismatch_cnt:896
> /sys/block/md5/md/mismatch_cnt:0
> /sys/block/md6/md/mismatch_cnt:4352
>
> root <at> srv:/server# cat /proc/mdstat | grep md[1346]
> md6 : active raid1 sdp6[0] sdo6[1]
> md4 : active raid1 sdp3[0] sdo3[1]
> md3 : active raid1 sdp2[0] sdo2[1]
(Continue reading)

David Brown | 1 May 20:32 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

On 01/05/11 17:31, Peter Grandi wrote:
> [ ... ]
>
>>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>>> updates like for RAID5? [ ... ]
>
>> No. (patches welcome).
>
> Ahhhm, but let me dig a bit deeper, even if it may be implied in
> the answer: would it be *possible*?
>
> That is, is the double parity scheme used in MD such that it is
> possible to "subtract" the old content of a page and "add" the
> new content of that page to both parity pages?
>

If I've understood the maths correctly, then yes it would be possible. 
But it would involve more calculations, and it is difficult to see where 
the best balance lies between cpu demands and IO demands.  In general, 
calculating the Q parity block for raid6 is processor-intensive - 
there's a fair amount of optimisation done in the normal calculations to 
keep it reasonable.

Basically, the first parity P is a simple calculation:

P = D_0 + D_1 + ... + D_(n-1)

But Q is more difficult:

Q = D_0 + g.D_1 + g².D_2 + ... + g^(n-1).D_(n-1)
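
For the abbreviated update being discussed, the same notation gives the
following (just the maths, not anything MD currently implements): if
data block D_i is replaced by D_i', and remembering that in GF(2^8)
addition is XOR and hence its own inverse, then

P' = P + D_i + D_i'
Q' = Q + g^i.(D_i + D_i')

so only the old D_i, P and Q need to be read back, at the price of one
extra multiplication by g^i.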
(Continue reading)

NeilBrown | 2 May 00:01 2011

Re: RAID6 r-m-w, op-journaled fs, SSDs

On Sun, 01 May 2011 17:24:09 +0200 David Brown <david.brown <at> hesbynett.no>
wrote:

> On 01/05/11 00:27, NeilBrown wrote:
> > On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2 <at> xf2.for.sabi.co.UK (Peter Grandi)
> > wrote:
> >

> >>    * Can Linux MD do "abbreviated" read-modify-write RAID6
> >>      updates like for RAID5? That is where not the whole stripe
> >>      is read in, modified and written, but just the block to be
> >>      updated and the parity blocks.
> >
> > No.  (patches welcome).
> 
> As far as I understand the raid6 mathematics, it shouldn't be too hard 
> to do such abbreviated updates, but it could quickly lead to
> complex code if you are trying to update more than a couple of blocks at 
> a time.

The RAID5 code already handles some of this complexity.

It would be quite easy to modify the code so that we have
 - list of 'old' data blocks
 - P and Q blocks
 - list of 'new' data blocks.

We then just need the (optimised) maths to deduce the new P and Q given all
of that.
Of course you would only bother with a 7-or-more disk array.
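
As a hedged back-of-the-envelope check of that threshold (one block per
disk per stripe, caching ignored): read-modify-write touches the old
data, P and Q, i.e. 3 reads plus 3 writes = 6 IOs, while recomputing
parity from the rest of the stripe on an n-disk array needs the other
n-3 data blocks plus the same 3 writes, i.e. n IOs.  The abbreviated
path only wins when n > 6, hence 7 or more disks.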
(Continue reading)

Jameson Graef Rollins | 2 May 00:06 2011

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)" in kern.log

On Fri, 29 Apr 2011 05:39:40 +0100, Ben Hutchings <ben <at> decadent.org.uk> wrote:
> On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
> > I run what I imagine is a fairly unusual disk setup on my laptop,
> > consisting of:
> > 
> >   ssd -> raid1 -> dm-crypt -> lvm -> ext4
> > 
> > I use the raid1 as a backup.  The raid1 operates normally in degraded
> > mode.  For backups I then hot-add a usb hdd, let the raid1 sync, and
> > then fail/remove the external hdd. 
> 
> Well, this is not expected to work.  Possibly the hot-addition of a disk
> with different bio restrictions should be rejected.  But I'm not sure,
> because it is safe to do that if there is no mounted filesystem or
> stacking device on top of the RAID.

Hi, Ben.  Can you explain why this is not expected to work?  Which part
exactly is not expected to work and why?

> I would recommend using filesystem-level backup (e.g. dirvish or
> backuppc).  Aside from this bug, if the SSD fails during a RAID resync
> you will be left with an inconsistent and therefore useless 'backup'.

I appreciate your recommendation, but it doesn't really have anything to
do with this bug report.  Unless I am doing something that is
*expressly* not supposed to work, it should work, and if it doesn't
then it's either a bug or a documentation failure (i.e. if this setup is
not supposed to work then it should be clearly documented somewhere what
exactly the problem is).

(Continue reading)

Ben Hutchings | 2 May 02:00 2011

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)" in kern.log

On Sun, 2011-05-01 at 15:06 -0700, Jameson Graef Rollins wrote:
> On Fri, 29 Apr 2011 05:39:40 +0100, Ben Hutchings <ben <at> decadent.org.uk> wrote:
> > On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
> > > I run what I imagine is a fairly unusual disk setup on my laptop,
> > > consisting of:
> > > 
> > >   ssd -> raid1 -> dm-crypt -> lvm -> ext4
> > > 
> > > I use the raid1 as a backup.  The raid1 operates normally in degraded
> > > mode.  For backups I then hot-add a usb hdd, let the raid1 sync, and
> > > then fail/remove the external hdd. 
> > 
> > Well, this is not expected to work.  Possibly the hot-addition of a disk
> > with different bio restrictions should be rejected.  But I'm not sure,
> > because it is safe to do that if there is no mounted filesystem or
> > stacking device on top of the RAID.
> 
> Hi, Ben.  Can you explain why this is not expected to work?  Which part
> exactly is not expected to work and why?

Adding another type of disk controller (USB storage versus whatever the
SSD interface is) to a RAID that is already in use.
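
The limits that clash here are visible in sysfs; a hedged example in the
style of the earlier transcripts, with made-up device names for the two
members:

  # Compare the request-size limits of the two member devices:
  grep . /sys/block/sda/queue/max_sectors_kb /sys/block/sdb/queue/max_sectors_kb

The numbers in the error message are 512-byte sectors, so "248 > 240" is
a 124 KiB bio hitting a member whose limit works out to 120 KiB, a
typical USB storage value.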

> > I would recommend using filesystem-level backup (e.g. dirvish or
> > backuppc).  Aside from this bug, if the SSD fails during a RAID resync
> > you will be left with an inconsistent and therefore useless 'backup'.
> 
> I appreciate your recommendation, but it doesn't really have anything to
> do with this bug report.  Unless I am doing something that is
> *expressly* not supposed to work, it should work, and if it doesn't
(Continue reading)

