Vincent Fox | 1 Dec 2007 06:15

Best stripe-size in array for ZFS mail storage?

We will be using Cyrus to store mail on 2540 arrays.

We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to the same host, and to mirror
and stripe the LUNs.  So a ZFS RAID-10 set composed of 4 LUNs.  Multi-pathing is also in use for redundancy.

My question: is there any guidance on the best choice in CAM for stripe size in the LUNs?

Default is 128K right now, can go up to 512K, should we go higher?

Cyrus stores mail messages as many small files, not big mbox files.  But there are so many layers in action
here that it's hard to know what the best choice is.

 
This message posted from opensolaris.org
Louwtjie Burger | 1 Dec 2007 07:43

Re: Best stripe-size in array for ZFS mail storage?

On Dec 1, 2007 7:15 AM, Vincent Fox <vincent_b_fox <at> yahoo.com> wrote:
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and
> stripe the LUNs.  So a ZFS RAID-10 set composed of 4 LUNs.  Multi-pathing also in use for redundancy.

Any reason why you are using a mirror of raid-5 lun's?

I can understand that perhaps you want ZFS to be in control of
rebuilding broken vdev's, if anything should go wrong ... but
rebuilding RAID-5's seems a little over the top.

How about running a ZFS mirror over RAID-0 luns? Then again, the
downside is that you need intervention to fix a LUN after a disk goes
boom! But you don't waste all that space :)

PS: It would be nice to know what the LSI firmware does (after 15
years of evolution) to writes into the controller... it might have
been better to buy JBODs ... I see Sun will be releasing some soon
(rumour?)
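
The space trade-off between the two layouts can be put in numbers. This is only a rough sketch: it assumes 4 LUNs of 5 disks each, mirrored in pairs, and the function name and parameters are made up for illustration.

```python
# Usable capacity, counted in whole disks, for LUNs that are mirrored in pairs.
# Assumption: 4 LUNs of 5 disks each, as described in the original post.

def usable_disks(luns, disks_per_lun, parity_per_lun, mirror_ways=2):
    """Disks' worth of usable space once LUN parity and mirroring are paid for."""
    data_per_lun = disks_per_lun - parity_per_lun
    return (luns // mirror_ways) * data_per_lun

raid5_mirrored = usable_disks(luns=4, disks_per_lun=5, parity_per_lun=1)
raid0_mirrored = usable_disks(luns=4, disks_per_lun=5, parity_per_lun=0)

print("mirror of RAID-5 LUNs:", raid5_mirrored, "of 20 disks usable")
print("mirror of RAID-0 LUNs:", raid0_mirrored, "of 20 disks usable")
```

Under these assumptions the mirrored RAID-5 layout gives up two more disks' worth of space than the mirrored RAID-0 layout, in exchange for each LUN surviving a single disk failure on its own.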
Marcin Woźniak | 1 Dec 2007 10:57

zfs export/import problem in cluster env.

I did some tests lately with ZFS. The environment is:
a 2-node Veritas Cluster 5.0 on Solaris 8/07 with recommended patches, 2 machines (V440 and V480), and shared
storage through a switch on a 6120 array.
Every ZFS pool has 2 LUNs from the array. The problem is that after installing an Oracle DB on one of the LUNs,
zpool import / export did not work. When I run the Oracle DB on one node, all is OK. The bad news for us is that
when I then export the zpool from node 1 and import it on node 2, the import does not work, and the Solaris
system panics and reboots.

Thank you for any help.

 
can you guess? | 1 Dec 2007 11:59

Re: Best stripe-size in array for ZFS mail storage?

> We will be using Cyrus to store mail on 2540 arrays.
> 
> We have chosen to build 5-disk RAID-5 LUNs in 2
> arrays which are both connected to same host, and
> mirror and stripe the LUNs.  So a ZFS RAID-10 set
> composed of 4 LUNs.  Multi-pathing also in use for
> redundancy.

Sounds good so far:  lots of small files in a largish system with presumably significant access parallelism
makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated.  ZFS
might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them
to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even
synchronous small writes could perform well if you mirror the ZFS small-update log:  flash - at least the
kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized
configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on
each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on
the disks, but probably not worth the management overhead unless the array makes it easy to do so) should
still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs.

> 
> My question is any guidance on best choice in CAM for
> stripe size in the LUNs?
> 
> Default is 128K right now, can go up to 512K, should
> we go higher?

By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each
data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default
above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB
...
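
The two readings of 'stripe size' distinguished above can be worked out for a 5-disk RAID-5 LUN (4 data segments plus 1 parity segment per stripe). This is a small sketch; the function names are illustrative only and say nothing about how CAM itself defines the setting.

```python
# Two interpretations of a "128K stripe size" on a 5-disk RAID-5 LUN.
DATA_DISKS = 4  # 5 disks per LUN: 4 data segments + 1 parity segment per stripe

def from_full_stripe(stripe_kb):
    """The setting is the data size of the whole stripe."""
    per_disk = stripe_kb // DATA_DISKS
    return {"per_disk_kb": per_disk, "data_kb": stripe_kb, "parity_kb": per_disk}

def from_segment(segment_kb):
    """The setting is the contiguous amount of data on each disk."""
    return {"per_disk_kb": segment_kb,
            "data_kb": segment_kb * DATA_DISKS,
            "parity_kb": segment_kb}

print(from_full_stripe(128))  # 32 KB per data disk, 32 KB parity segment
print(from_segment(128))      # 128 KB per data disk, 512 KB of data per stripe
```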

can you guess? | 1 Dec 2007 12:31

Re: Best stripe-size in array for ZFS mail storage?

> Any reason why you are using a mirror of raid-5
> lun's?

Some people aren't willing to run the risk of a double failure - especially when recovery from a single
failure may take a long time.  E.g., if you've created a disaster-tolerant configuration that separates
your two arrays and a fire completely destroys one of them, you'd really like to be able to run the survivor
without worrying too much until you can replace its twin (hence each must be robust in its own right).

The above situation is probably one reason why 'RAID-6' and similar approaches (like 'RAID-Z2') haven't
generated more interest:  if continuous on-line access to your data is sufficiently critical to consider
them, then it's also probably sufficiently critical to require such a disaster-tolerant approach
(which dual-parity RAIDs can't address).

It would still be nice to be able to recover from a bad sector on the single surviving site, of course, but you
don't necessarily need full-blown RAID-6 for that:  you can quite probably get by with using large blocks
and appending a private parity sector to them (maybe two private sectors just to accommodate a situation
where a defect hits both the last sector in the block and the parity sector that immediately follows it; it
would also be nice to know that the block size is significantly smaller than a disk track size, for similar
reasons).  This would, however, tend to require file-system involvement such that all data was organized
into such large blocks:  otherwise, all writes for smaller blocks would turn into read/modify/writes.
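
A minimal sketch of the private-parity idea, assuming 512-byte sectors and plain XOR parity (the message does not say which code would be used, so XOR is an assumption, and all names here are hypothetical):

```python
# One XOR parity sector appended to a large block lets a single unreadable
# sector be rebuilt from the surviving sectors of the same block.
from functools import reduce

SECTOR = 512  # assumed sector size in bytes

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity_sector(sectors):
    """XOR of all data sectors in the block."""
    return reduce(xor_bytes, sectors, bytes(SECTOR))

def recover(sectors, parity, bad_index):
    """Rebuild the sector at bad_index from the other sectors plus parity."""
    survivors = [s for i, s in enumerate(sectors) if i != bad_index]
    return reduce(xor_bytes, survivors, parity)

# An 8-sector block with distinct contents per sector.
block = [bytes([i]) * SECTOR for i in range(1, 9)]
parity = parity_sector(block)
restored = recover(block, parity, bad_index=3)
print(restored == block[3])
```

Note that the parity covers the whole block, which is why a write smaller than the block would have to read the rest of it (or the old data and parity) before the parity sector could be updated.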

Panasas (I always tend to put an extra 's' into that name, and to judge from Google so do a hell of a lot of other
people:  is it because of the resemblance to 'parnassas'?) has been crowing about something that it calls
'tiered parity' recently, and it may be something like the above.

...

> How about running a ZFS mirror over RAID-0 luns? Then
> again, the
> downside is that you need intervention to fix a LUN
...

can you guess? | 1 Dec 2007 14:21

Re: x4500 w/ small random encrypted text files

> If it's just performance you're after for small
> writes, I wonder if you've considered putting the ZIL
> on an NVRAM card?  It looks like this can give
> something like a 20x performance increase in some
> situations:
> 
> http://blogs.sun.com/perrin/entry/slog_blog_or_bloggin
> g_on

That's certainly interesting reading, but it may be just a tad optimistic.  For example, it lists a
throughput of 211 MB/sec with only *one* disk in the main pool - which unless that's also a solid-state disk
is clearly unsustainable (i.e., you're just seeing the performance while the solid-state log is filling
up, rather than what the performance will eventually stabilize at:  my guess is that the solid-state log
may be larger than the file being updated, in which case updates just keep accumulating there without
*ever* being forced to disk, which is unlikely to occur in most normal environments).

The numbers are a bit strange in other areas as well.  In the case of a single pool disk and no slog, 11 MB/sec
represents about 1400 synchronous 8 KB updates per second on a disk with only about 1/10th that IOPS
capacity even with queuing enabled (and when you take into account the need to propagate each such
synchronous update all the way back to the superblock it begins to look somewhat questionable even from
the bandwidth point of view).  One might suspect that what's happening is that once the first synchronous
write has been submitted a whole bunch of additional ones accumulate while waiting for the disk to finish
the first, and that ZFS is smart enough not to queue them up to the disk (which would require full-path
updates for every one of them) but instead to gather them in its own cache and write them all back at once in
one fell swoop (including a single update for the ancestor path) when the disk is free again.  This would
explain not only the otherwise suspicious performance but also why adding the slog provides so little
improvement; it's also a tribute to the care that the ZFS developers put into this aspect of their
implementation.
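
The arithmetic behind the 1400 figure can be checked quickly. The sketch below assumes 1 MB = 1024 KB and a disk capable of roughly 140 IOPS (a stand-in for "about 1/10th"); both are assumptions, not numbers from the blog entry.

```python
# How many synchronous 8 KB updates per second does 11 MB/sec imply, and how
# many updates would each disk write have to carry on a ~140 IOPS disk?
THROUGHPUT_MB_S = 11
UPDATE_KB = 8
ASSUMED_DISK_IOPS = 140  # assumption: roughly 1/10th of the implied update rate

updates_per_sec = THROUGHPUT_MB_S * 1024 / UPDATE_KB
updates_per_disk_io = updates_per_sec / ASSUMED_DISK_IOPS

print(f"{updates_per_sec:.0f} updates/sec")
print(f"about {updates_per_disk_io:.0f} updates gathered per disk write")
```

A batching factor of around ten per disk write is what the gathering behaviour guessed at above would have to deliver to reach the reported throughput.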

On the other hand, when an slog is introduced performance actually *declines* in systems with more than one
...

can you guess? | 1 Dec 2007 15:20

Re: Yager on ZFS

[Zombie thread returns from the grave...]

> > Getting back to 'consumer' use for a moment,
> though,
> > given that something like 90% of consumers entrust
> > their PC data to the tender mercies of Windows, and
> a
> > large percentage of those neither back up their
> data,
> > nor use RAID to guard against media failures, nor
> > protect it effectively from the perils of Internet
> > infection, it would seem difficult to assert that
> > whatever additional protection ZFS may provide
> would
> > make any noticeable difference in the consumer
> space
> > - and that was the kind of reasoning behind my
> > comment that began this sub-discussion.
> 
> As a consumer at home, IT guy at work and amateur
> photographer, I think ZFS will help change that.

Let's see, now:

Consumer at home?  OK so far.

IT guy at work?  Nope, nothing like a mainstream consumer, who doesn't want to know about anything like the
level of detail under discussion here.

Amateur photographer?  Well, sort of - except that you seem to be claiming to have reached the *final* stage
...

max@bruningsystems.com | 1 Dec 2007 15:34

Re: Best stripe-size in array for ZFS mail storage?

Hi Bill,
can you guess? wrote:
>> We will be using Cyrus to store mail on 2540 arrays.
>>
>> We have chosen to build 5-disk RAID-5 LUNs in 2
>> arrays which are both connected to same host, and
>> mirror and stripe the LUNs.  So a ZFS RAID-10 set
>> composed of 4 LUNs.  Multi-pathing also in use for
>> redundancy.
>>     
>
> Sounds good so far:  lots of small files in a largish system with presumably significant access
> parallelism makes RAID-Z a non-starter,
Why does "lots of small files in a largish system with presumably 
significant access parallelism makes RAID-Z a non-starter"?
thanks,
max
can you guess? | 1 Dec 2007 15:53

Re: Best stripe-size in array for ZFS mail storage?

> Hi Bill,

...

> lots of small files in a
> largish system with presumably significant access
> parallelism makes RAID-Z a non-starter,
> Why does "lots of small files in a largish system
> with presumably 
> significant access parallelism makes RAID-Z a
> non-starter"?
> thanks,
> max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe - so not only do N + 1 disks get
written for every block update, but N disks get *read* on every block *read*.

Normally, small files can be read in a single I/O request to one disk (even in conventional parity-RAID
implementations).  RAID-Z requires N I/O requests spread across N disks, so for parallel-access reads to
small files RAID-Z provides only about 1/Nth the throughput of conventional implementations unless the
disks are sufficiently lightly loaded that they can absorb the additional load that RAID-Z places on them
without reducing throughput commensurately.
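
As a rough illustration of the 1/N figure, here is a back-of-the-envelope model. It assumes the disks are saturated, that every disk can do the same number of random reads per second (100 here, an arbitrary value), and that load spreads evenly; the function name is made up for this sketch.

```python
# Aggregate small-file read rate when each logical read costs a given number
# of disk I/Os, on a group of saturated disks.

def small_reads_per_sec(disks, iops_per_disk, ios_per_read):
    return disks * iops_per_disk / ios_per_read

DISKS = 5             # N + 1 disks in the group
N = DISKS - 1         # data disks touched by each RAID-Z block read
ASSUMED_IOPS = 100    # assumption: random-read IOPS of one disk

conventional = small_reads_per_sec(DISKS, ASSUMED_IOPS, ios_per_read=1)
raidz = small_reads_per_sec(DISKS, ASSUMED_IOPS, ios_per_read=N)

print(conventional, raidz, raidz / conventional)
```

With N = 4 the model gives RAID-Z a quarter of the conventional read rate; on lightly loaded disks the gap would be smaller, as noted above.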

- bill

 
Vincent Fox | 1 Dec 2007 18:57

Re: Best stripe-size in array for ZFS mail storage?

> On Dec 1, 2007 7:15 AM, Vincent Fox
> 
> Any reason why you are using a mirror of raid-5
> lun's?
> 
> I can understand that perhaps you want ZFS to be in
> control of
> rebuilding broken vdev's, if anything should go wrong
> ... but
> rebuilding RAID-5's seems a little over the top.

Because the decision of our technical leads was that a straight
ZFS RAID-10 set made up of individual disks from the 2540 was
more risky.  A double-disk failure in a mirror pair would hose the
pool, and when the pool contains email for >10K people this was not
acceptable.  Another possibility is that one of the arrays goes offline:
now you are running a RAID-0 stripe set, and if a single disk then
fails you are again dead.
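
The failure cases described above can be checked by brute force. This is a sketch under stated assumptions: 10 disks per array, mirror pairs that always span the two arrays, and 2 five-disk RAID-5 LUNs per array; the disk numbering and function names are hypothetical.

```python
# Compare a plain RAID-10 of single disks with a mirror of RAID-5 LUNs.
from itertools import combinations

DISKS = [(array, i) for array in "AB" for i in range(10)]

def raid10_lost(failed):
    """Pool is lost if both disks of any cross-array mirror pair have failed."""
    return any(("A", i) in failed and ("B", i) in failed for i in range(10))

def lun_lost(failed, array, lun):
    """A 5-disk RAID-5 LUN is lost once 2 or more of its disks have failed."""
    members = [(array, lun * 5 + k) for k in range(5)]
    return sum(d in failed for d in members) >= 2

def mirrored_raid5_lost(failed):
    """Pool is lost if both LUNs of any cross-array mirror pair are lost."""
    return any(lun_lost(failed, "A", j) and lun_lost(failed, "B", j)
               for j in range(2))

def min_fatal_failures(lost):
    """Smallest number of failed disks that can take the pool down."""
    for n in range(1, len(DISKS) + 1):
        if any(lost(set(c)) for c in combinations(DISKS, n)):
            return n

array_b_down = {("B", i) for i in range(10)}
print(min_fatal_failures(raid10_lost), min_fatal_failures(mirrored_raid5_lost))
print(raid10_lost(array_b_down | {("A", 0)}),
      mirrored_raid5_lost(array_b_down | {("A", 0)}))
```

In this model the worst case for plain RAID-10 is 2 disk failures, against 4 for the mirrored RAID-5 LUNs, and only the latter survives one array going offline followed by a single disk failure.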

The setup we have can survive multiple failures,
and we have seen enough weird events in our careers that
we decided to do this. YMMV.  Let's move on; I just wanted
to describe our setup, not start an argument about it.

> How about running a ZFS mirror over RAID-0 luns? Then
> again, the
> downside is that you need intervention to fix a LUN
> after a disk goes
> boom! But you don't waste all that space :)
> 
...

