Jim Klimov | 2 Dec 05:42 2012

Digging in the bowels of ZFS

My problematic home NAS (which long-time list readers might still
remember from a year or two ago) is back online, thanks to a friend
who fixed it and turned it on. I'm going to try some more research on
the failure I had with the 6-disk raidz2 set, when it suddenly couldn't
read and recover some blocks (presumably scratched or otherwise damaged
on all disks at once while the heads hovered over similar locations).

My plan is to dig the needed sectors of the broken block out of
each of the 6 disks and try all reasonable recombinations of
redundancy and data sectors in an attempt to match the checksum - this
should give me a definite answer on whether ZFS (of that oi151.1.3-based
build) does all I think it can to save data or not. Either I put the
last nail into my itching question's coffin, or I nail down a bug to
yell about ;)

So... here are some applied questions:

1) With dd I found the logical offset after which it fails to read
    data from a damaged file, because ZFS doesn't trust the block.
    That's 3840 sectors (@ 512b), or 0x1e0000. With zdb I listed the
    file inode's block tree and got this, in particular:

# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
   > /var/tmp/brokenfile.ZDBreport 2>&1
           1c0000   L0 DVA[0]=<0:acbc2a46000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)

           1e0000   L0 DVA[0]=<0:acbc2a4f000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P
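
For reference, the offset arithmetic from question 1 can be sanity-checked in the shell, and (assuming the zdb of that build supports it) `zdb -R` can dump a raw allocated block by its DVA for exactly this kind of offline recombination experiment:

```shell
# 3840 sectors * 512 bytes/sector = 0x1e0000, matching the L0 entry above.
sectors=3840
printf '0x%x\n' $((sectors * 512))
# To pull the raw 0x9000-byte allocated block from vdev 0 (not run here):
#   zdb -e -R 1601233584937321596 0:acbc2a4f000:9000:r > block.raw
```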

Fung Zheng | 1 Dec 15:05 2012

6Tb Database with ZFS


I'm about to migrate a 6TB database from Veritas Volume Manager to ZFS. I want to set the arc_max parameter so ZFS can't use all of my system's memory, but I don't know how much to set. Do you think 24GB will be enough for a 6TB database? Obviously the more the better, but I can't dedicate too much memory. Has someone successfully implemented something similar?

We ran some tests and the memory usage was as follows:

(With arc_max at 30GB)
Kernel = 18GB
Anon = 90GB
Page Cache = 10GB
Free = 25GB

My system has 192GB RAM and the database SGA is 85GB.
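
For what it's worth, on Solaris the ARC cap is usually set in /etc/system (the value below expresses 24GB in bytes as an example only; a reboot is required):

```
* /etc/system fragment: cap the ZFS ARC at 24GB (24 * 2^30 = 0x600000000)
set zfs:zfs_arc_max = 0x600000000
```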

I would appreciate it if someone could tell me about their experience.

Best Regards

zfs-discuss mailing list
zfs-discuss <at> opensolaris.org
Albert Shih | 30 Nov 15:25 2012

Remove disk

Hi all,

I would like to know if it's possible to do something like this with ZFS:


meaning:

I have a zpool of 48 disks arranged as 4 raidz2 vdevs (12 disks each).
Of those 48 disks, 36 are 3TB and 12 are 2TB.
Can I buy 12 new 4TB disks, put them in the server, add them to the
zpool, ask the zpool to migrate all data from the 12 old disks onto the
new ones, and then remove the old disks?
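
For context, ZFS (as of 2012) cannot evacuate or remove a top-level vdev, so the usual route is to replace each old disk in place, one resilver at a time, and let the vdev grow once all members are larger. A sketch with hypothetical device names (the commands are only printed here, not run):

```shell
# Replace each of the 12 old 2TB disks with a new 4TB disk; autoexpand
# lets the raidz2 use the extra space once the last member is replaced.
for i in $(seq 0 11); do
  echo "zpool replace tank c2t${i}d0 c4t${i}d0"
done
echo "zpool set autoexpand=on tank"
```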



Albert SHIH
DIO bâtiment 15
Observatoire de Paris
5 Place Jules Janssen
92195 Meudon Cedex
Téléphone : 01 45 07 76 26/06 86 69 95 71
xmpp: jas <at> obspm.fr
Heure local/Local time:
ven 30 nov 2012 15:18:32 CET

query re disk mirroring

Say I have an LDoms guest that is using a mirrored zfs root pool,
where the two sides of the mirror come from two separate vds
servers, that is

where c3d0s0 is served by one vds server, and c4d0s0 is served by 
another vds server.

Now if for some reason this physical rig loses power, how do I
know which side of the mirror to boot off, i.e. which side is most recent?

As an example ( contrived now mind you )

I shut down the IO server domain serving c3d0s0, then copy a large file
into my root pool (it goes to the vds serving c4d0s0), then shut down the
guest and the other service domain gracefully, then boot the guest off
c3d0s0 (having restarted the service domain there) - the large file is
obviously missing now.

Is there any way, while the guest is stopped, to know which side of
the mirror is the most recent one to boot off?
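
One way to answer this (assuming the backing devices are reachable from a service domain) is to compare the transaction group (txg) recorded in each side's ZFS label with `zdb -l`; the higher txg is the more recent half. A sketch where sample label text stands in for the real command output, and the txg value is made up:

```shell
# Extract the txg field from `zdb -l` output; in real usage the label
# text would come from e.g.:  zdb -l /dev/rdsk/c3d0s0
label_c3d0s0='
    version: 28
    txg: 123456
'
printf '%s\n' "$label_c3d0s0" | awk -F': ' '/txg:/ {print $2}'
```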
Jim Klimov | 29 Nov 10:56 2012

ZFS QoS and priorities

I've heard a claim that ZFS relies too much on RAM caching, but
implements no sort of priorities (indeed, I've seen no knobs to
tune those) - so that if the storage box receives many different
types of IO requests with different "administrative weights" in
the view of admins, it cannot really throttle some IOs to boost
others, when such IOs have to hit the pool's spindles.

For example, I might want to have corporate webshop-related
databases and appservers to be the fastest storage citizens,
then some corporate CRM and email, then various lower priority
zones and VMs, and at the bottom of the list - backups.

AFAIK, now such requests would hit the ARC, then the disks if
needed - in no particular order. Well, can the order be made
"particular" with current ZFS architecture, i.e. by setting
some datasets to have a certain NICEness or another priority?

Thanks for info/ideas,

Question about degraded drive



I have a degraded mirror set, and this has happened a few times (not always the same drive) over the last two years. In the past I replaced the drive, ran zpool replace, and all was well. I am wondering, however, if it is safe to run zpool replace without replacing the drive, to see if it has in fact failed. On traditional RAID systems I have had drives drop out of an array but be perfectly fine; adding them back to the array returned the drive to service and all was well. Does that approach work with ZFS? If not, is there another way to test the drive before making the decision to yank and replace it?
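
For what it's worth, a common sequence for a dropped-but-possibly-healthy disk is to bring it back online and clear its error counters, then scrub so ZFS exercises the whole mirror; `zpool replace` onto the same disk is not needed for that. A sketch with hypothetical pool/disk names (commands are printed here, not run):

```shell
# Return the disk to service, clear errors, then verify with a scrub.
cat <<'EOF'
zpool online tank c1t2d0
zpool clear tank c1t2d0
zpool scrub tank
zpool status -v tank
EOF
```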


Thank you!

Eugen Leitl | 26 Nov 23:13 2012

zfs on SunFire X2100M2 with hybrid pools

Dear internets,

I've got an old SunFire X2100M2 with 6-8 GBytes ECC RAM, which
I wanted to put into use with Linux, using the Linux
VServer patch (an analogue of zones), and 2x 2 TByte
nearline (WD RE4) drives. It occurred to me that the
1U case had enough space to add some SSDs (e.g.
2-4 80 GByte Intel SSDs), and the power supply
should be able to take both the 2x SATA HDs as well 
as 2-4 SATA SSDs, though I would need to splice into
existing power cables.

I also have a few LSI and an IBM M1015 (potentially 
reflashable to IT mode) adapters, so having enough ports
is less of an issue (I'll probably use an LSI
with 4x SAS/SATA for 4x SSD, and keep the onboard SATA
for HDs, or use each 2x for SSD and HD).

Now there are multiple configurations for this. 
Some using Linux (root fs on a RAID10, /home on
RAID 1) or zfs. Now zfs on Linux probably wouldn't
do hybrid zfs pools (would it?) and it probably
wouldn't be stable enough for production. Right?

Assuming I won't have to compromise CPU performance
(it's an anemic Opteron 1210 1.8 GHz, dual core, after all, and
it will probably run several tens of zones in production) or
sacrifice data integrity, can I make e.g. an LSI SAS3442E
directly do SSD caching (it says something about CacheCade,
but I'm not sure it's an OS-side driver thing), as it
is supposed to boost IOPS? Unlikely shot, but probably
somebody here would know.

If not, should I go directly OpenIndiana, and use
a hybrid pool?

Should I use all 4x SATA SSDs and 2x SATA HDs to
do a hybrid pool, or would this be overkill?
The SSDs are Intel SSDSA2M080G2GC 80 GByte, so no speed demons
either. However, they've seen some wear and tear and
none of them has keeled over yet. So I think they'll
be good for a few more years.

How would you lay out the pool with OpenIndiana
in either case to maximize IOPS and minimize CPU
load (assuming it's an issue)? I wouldn't mind
to trade 1/3rd to 1/2 of CPU due to zfs load, if
I can get decent IOPS.
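
Not specific to the X2100 M2, but a typical hybrid layout for 2 HDs plus 4 such SSDs would be a mirrored data vdev, a mirrored SLOG on small SSD slices, and the remaining SSDs as L2ARC. A sketch with hypothetical device names (printed rather than run):

```shell
# Mirrored HDs for data, mirrored SSD slices for the log, whole SSDs
# for the cache; L2ARC needs no redundancy, the SLOG benefits from it.
cat <<'EOF'
zpool create tank mirror c0t0d0 c0t1d0
zpool add tank log mirror c1t0d0s0 c1t1d0s0
zpool add tank cache c1t2d0 c1t3d0
EOF
```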

This is terribly specific, I know, but I figured
somebody had tried something like that with an X2100 M2,
it being a rather popular Sun (RIP) Solaris box at
the time. Or not.

Thanks muchly, in any case.

-- Eugen
John Baxter | 23 Nov 16:49 2012

dm-crypt + ZFS on Linux

After searching for dm-crypt and ZFS on Linux and finding too little information, I shall ask here. Please keep in mind this is in the context of running in a production environment.

We need to encrypt our data, approximately 30TB on three ZFS volumes under Solaris 10. The volumes currently reside on iSCSI SANs connected via 10Gb/s Ethernet. We have tested Solaris 11 with ZFS encrypted volumes, found the performance to be very poor, and have an open bug report with Oracle.

We are a Linux shop, and since performance is so poor and there is still no resolution, we are considering ZFS on Linux with dm-crypt.
I have read once or twice that if we implemented ZFS + dm-crypt we would lose features; however, which features is never specified.
We currently mirror the volumes across identical iSCSI SANs with ZFS, and we use hourly ZFS snapshots to update our DR site.

Which features of ZFS are lost if we use dm-crypt? My guess would be that they are related to raidz, but I'm unsure.
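
For what it's worth, the usual layering is one dm-crypt/LUKS mapping per disk with the zpool built on the /dev/mapper nodes; pool-level features (raidz, mirroring, snapshots, send/recv) are unaffected, while what you give up versus native ZFS crypto is per-dataset keys and key management inside ZFS itself. A sketch with hypothetical device names (printed rather than run):

```shell
# LUKS under each leaf, then a normal zpool on the cleartext mappings.
cat <<'EOF'
cryptsetup luksFormat /dev/disk/by-id/wwn-A
cryptsetup luksOpen /dev/disk/by-id/wwn-A crypt-a
cryptsetup luksFormat /dev/disk/by-id/wwn-B
cryptsetup luksOpen /dev/disk/by-id/wwn-B crypt-b
zpool create tank mirror /dev/mapper/crypt-a /dev/mapper/crypt-b
EOF
```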

Sami Tuominen | 23 Nov 12:08 2012

Re: Directory is not accessible

How can one remove a directory containing corrupt files, or a corrupt file itself? For me, rm just gives an input/output error.


"Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) " <opensolarisisdeadlongliveopensolaris <at> nedharvey.com> wrote:

> From: zfs-discuss-bounces <at> opensolaris.org [mailto:zfs-discuss-
> bounces <at> opensolaris.org] On Behalf Of Sami Tuominen
> Unfortunately there aren't any snapshots.
> The version of zpool is 15. Is it safe to upgrade that?
> Is zpool clear -F supported or of any use here?

The only thing that will be of use to restore your data will be a backup.

To forget about the lost data and make the error message go away, simply rm the bad directory (and/or its parent).

You're probably wondering: you have redundancy and no faulted devices, so how could this happen?  There are a few possible explanations, but they all have one thing in common: at some point, something got corrupted before it was written, so both the data and its redundant copy were written corrupted.  It might be a CPU error, a parity error in non-ECC RAM, a bus glitch, or bad firmware in the HBA, for example.  The fact remains that something was written corrupted, and the redundant copy was also written corrupted.  All you can do is restore from a snapshot, restore from a backup, or accept it for what it is and make the error go away.

Sorry to hear it...

Jim Klimov | 22 Nov 17:24 2012

ZFS Appliance as a general-purpose server question

A customer is looking to replace or augment their Sun Thumper
with a ZFS appliance like 7320. However, the Thumper was used
not only as a protocol storage server (home dirs, files, backups
over NFS/CIFS/Rsync), but also as a general-purpose server with
unpredictably-big-data programs running directly on it (such as
corporate databases, Alfresco for intellectual document storage,
etc.) in order to avoid the networking transfer of such data
between pure-storage and compute nodes - this networking was
seen as both a bottleneck and a possible point of failure.

Is it possible to use the ZFS Storage appliances in a similar
way, and fire up a Solaris zone (or a few) directly on the box
for general-purpose software; or to shell-script administrative
tasks such as the backup archive management in the global zone
(if that concept still applies) as is done on their current
Solaris-based box?

Is it possible to run VirtualBoxes in the ZFS-SA OS, dare I ask? ;)

//Jim Klimov
Thomas Gouverneur | 22 Nov 00:17 2012

Changing a VDEV GUID?

Hi ZFS fellows,

I have already seen, in the list archives, some of you changing the
GUID of a pool's VDEV, to allow a cloned disk to be imported on the
same system as the source.

Can someone explain in detail how to achieve that?
Has someone already invented that wheel, so that I would not have to
write such a tool myself?

A subsidiary question: does Oracle have an official position on such
cases? How do they "officially" deal with binary-copied disks, as it's
common to make such copies with UFS when cloning SAP environments or databases...

Thanks in advance,