Randy Rue | 1 Dec 18:14 2011

3170 NFS weirdness

Hello All,

We run a VMware farm that mounts its datastores via NFS on two different
NetApp clusters, one a 3020 HA pair running 7.2.4 and the other a 3170
running 7.3.5P5.

We also mount those datastore exports to two Linux (CentOS 5.x) hosts,
which have the CommVault/Simpana backup agents installed. We effectively
back up a snapshot of those datastores.

We've been testing restores of files back to the datastores on the 3170
and they've been failing. We get a Linux error 13 (EACCES), basically "permission
denied," in spite of the fact that the exports are configured for root
access to both the ESX hosts and the backup hosts. We've confirmed that
the backup hosts are mounting as root.

This works on the 3020 pair. Can't spot any meaningful differences in
"options NFS" between the two NetApp systems, or in any other setting.

Thought it might be a POSIX permissions issue as a CV restore runs under
the local Linux group "simpana." The files are written with ownership of
root:simpana and then permissions are changed when the data is all there.

But if we try to rsync any test data to the 3170 datastores we get the
same error, even if we're logged in as root and copying from a local
directory on the backup host. Weirder, we CAN write to the datastore
folder using cp, rm, mkdir, rmdir, but NOT using rsync.
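
One way to narrow down why cp works while rsync does not is to replay, one step at a time, the kind of operations rsync performs on a destination. By default rsync writes each file under a temporary dot-prefixed name in the target directory and renames it into place, and with options such as -a it also sets group, mode, and modification time. The sketch below is only an illustration of that idea, not what rsync or CommVault actually runs; the function name probe_dir and the probe file names are made up for this example.

```sh
#!/bin/sh
# probe_dir DIR: try, one at a time, operations similar to those rsync -a
# performs on a destination directory, and report the first one that fails.
# Illustrative only; the probe file names are hypothetical.
probe_dir() {
    dir=$1
    tmp="$dir/.probe.$$.tmp"     # dot-prefixed temporary name
    final="$dir/probe.$$"

    # Step 1: create a new file and write data into it.
    echo "probe data" > "$tmp" || { echo "FAILED: create"; return 1; }
    # Step 2: change the group (here: to the caller's own primary group).
    chgrp "$(id -gn)" "$tmp" || { echo "FAILED: chgrp"; rm -f "$tmp"; return 1; }
    # Step 3: change the mode bits.
    chmod 640 "$tmp" || { echo "FAILED: chmod"; rm -f "$tmp"; return 1; }
    # Step 4: set an explicit modification time (1 Dec 2011, 09:14).
    touch -m -t 201112010914 "$tmp" || { echo "FAILED: set mtime"; rm -f "$tmp"; return 1; }
    # Step 5: rename the temporary file into place.
    mv "$tmp" "$final" || { echo "FAILED: rename"; rm -f "$tmp"; return 1; }

    # Clean up the probe file and report success.
    rm -f "$final"
    echo "OK: all steps succeeded in $dir"
}
```

Run as root on the backup host, for example "probe_dir /path/to/datastore/mount" (a placeholder path), once against a 3170 mount and once against a 3020 mount. A step that fails on one and succeeds on the other is the operation worth looking into.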

Help?


Learmonth, Peter | 1 Dec 18:36 2011

RE: 3170 NFS weirdness

Hi Randy
In your exports, are you using root=<hosts> or anon=0?  I've seen cases
where anon=0 didn't work 100%.

Peter


Stefan Funke | 1 Dec 19:40 2011

Re: 3170 NFS weirdness

On 01.12.2011 18:14, Randy Rue wrote:
> Weirder, we CAN write to the datastore folder using cp, rm, mkdir, rmdir, but NOT using rsync.

Could be a time problem. Do all filers and client hosts sync against a
time source (NTP?), and did you check that syncing is enabled? Does rsync
-v (or, for more detail, -vvvvv) report any errors?

-Stefan
_______________________________________________
Toasters mailing list
Toasters <at> teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters

Nick Bernstein | 1 Dec 19:44 2011

RE: 3170 NFS weirdness


A couple of things struck me right off the bat:

You say, "effectively back up the snapshot of the datastore"; what do
"effectively" and "snapshot" mean in this context?
You also say, "Can't spot any _meaningful_ differences in "options NFS"
between the two NetApp systems, or in any other setting."

Can you verify that all the NFS options on the machine that isn't working
are identical to the ones on the machine that is? That's where I'd start.
Secondly, I would simplify the situation, and then add complexity. Mimic
what the backup software is doing by hand: copy a file there as root, see
if it works, and gradually add complexity back in until it breaks.
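
The option comparison can be done mechanically rather than by eye. A minimal sketch, assuming the output of "options nfs" from each filer has already been saved to a text file on a Linux host; the function name compare_options and the file names in the usage note are hypothetical.

```sh
#!/bin/sh
# compare_options FILE_A FILE_B: show option lines that differ between two
# saved "options nfs" dumps. Returns 0 if differences were printed, 1 if
# the two dumps contain the same lines.
compare_options() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2
    # Squeeze runs of spaces and sort, so column alignment and ordering
    # differences do not show up as noise.
    tr -s ' ' < "$1" | sort > "$a_sorted"
    tr -s ' ' < "$2" | sort > "$b_sorted"
    # Keep only changed lines: "<" is from the first file, ">" from the second.
    diff "$a_sorted" "$b_sorted" | grep '^[<>]'
    status=$?    # grep's status: 0 if anything was printed, 1 if not
    rm -f "$a_sorted" "$b_sorted"
    return "$status"
}
```

For example, "compare_options filer3020-nfs.txt filer3170-nfs.txt". Some differences are expected between 7.2.4 and 7.3.5P5 simply because the newer release has options the older one lacks, so each reported line still needs a judgment call.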

Good luck!


Alon Zeltser | 6 Dec 12:00 2011

limit i/o for lun or volume

Hi all,
I have a virtual machine running on ESX 4.0 over iSCSI.
While the user is running his workload, this machine generates a lot of
I/O on a very small NetApp controller (a 2020) whose aggregate has only
5 data disks. It gets very bad latency, eventually hits LUN reset errors,
and the machine freezes for a few minutes.
I have increased the timeouts on the Windows side, but it doesn't seem
to help.
This application doesn't need that much I/O and could run on a 5.4k disk
in an old laptop, but given the resources of the NetApp it uses them to
the fullest.
It is important to mention that this application runs on an MSSQL
database, and the problem I/O goes to an RDM LUN.
My question: is there a way to limit the I/O of this machine from the
NetApp side, the VMware side, the Windows side, or the network side?
I'm aware of VMware Storage I/O Control, but it is only supported in
vSphere 4.1, and an upgrade is not an option right now.
I'm also aware of FlexShare on the NetApp side, but I don't think
giving low priority to this volume will help in this case.
Is there any other way you can think of to limit the I/O going from a
VMware virtual machine through iSCSI to an RDM LUN or its hosting volume?

thank you


Jeff Mohler | 6 Dec 09:17 2011

Re: limit i/o for lun or volume

Is the LUN aligned properly?

--
---
Gustatus Similis Pullus

Alon Zeltser | 6 Dec 12:20 2011

Re: limit i/o for lun or volume

I believe so.
Since we are talking about an RDM with an NTFS file system (created with SnapDrive), there usually isn't an alignment issue like there is on VMFS LUNs.


Fred Grieco | 6 Dec 15:32 2011

Re: limit i/o for lun or volume

Do you mean that there is a performance issue with that particular VM, or with other systems on the filer that are caused by I/O on that VM?

You should be able to use FlexShare to set a lower priority for a "cache-hogging" volume and reduce its impact on the other volumes.  On the command line, run "priority on".  Then run "priority set volume $volumename level=Low" for the problem volume.  This will give less priority to that volume in the write cache, free up the cache for other things, and (hopefully) reduce latency on those other volumes.

In your case, you'll need to have that RDM in its own volume.

Alon Zeltser | 6 Dec 15:43 2011

Re: limit i/o for lun or volume

Thank you for the reply, but as I wrote, I'm aware of FlexShare and I don't think it will help, since this system is not
impacting other systems and is not competing with other VMs for resources.
It's only hurting itself: it does so many IOPS and gets such bad latency that after a few minutes it triggers a
LUN reset, the VM freezes, and the user process exits abnormally.
I'm trying to limit the IOPS from this system so it won't load the NetApp so much and will get better latency. I don't
care if it takes much longer to run.

Thanks again

Fred Grieco | 6 Dec 16:01 2011

Re: limit i/o for lun or volume

Have you run sysstat on the command line of the filer?  (sysstat -x 1)  Are you getting a lot of "b" (deferred back-to-back) type CPs?  My understanding is that the filer starts rejecting new SCSI requests when there are back-to-back deferred CPs (because there's nowhere to put them in memory locally).  That's when you get resets and errors on the hosts.
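
To put a number on this, the CP type column can be tallied from a captured sysstat log. A rough sketch, assuming the output of "sysstat -x 1" was saved to a file on a Linux host and that you pass in the number of the whitespace-separated field that holds the "CP ty" value. Check that number against the header of your own capture; it is a parameter here precisely because the layout is not assumed. The function name count_b2b is made up for this example.

```sh
#!/bin/sh
# count_b2b FILE FIELD: tally back-to-back CP samples in a saved
# "sysstat -x 1" log. FIELD is the 1-based whitespace-separated column
# holding the CP type; verify it against the header of your capture.
# Prints counts of "B" (back-to-back) and "b" (deferred back-to-back) rows.
count_b2b() {
    awk -v col="$2" '
        # Data rows start with a CPU percentage such as "95%";
        # header lines and anything else are skipped.
        $1 ~ /^[0-9]+%$/ {
            if ($col ~ /^B/)      big++
            else if ($col ~ /^b/) small++
        }
        END { printf "B=%d b=%d\n", big, small }
    ' "$1"
}
```

For example, "count_b2b sysstat.log N", where N is the field number read off the header. A high "b" count during the periods when the VM freezes would support the deferred-CP explanation.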

FlexShare will limit the I/Os to that volume and will reduce the deferred CPs, which may prevent the resets and errors.  The host itself may not like that, though; there may be a timeout or some other iSCSI tuning that will help with this.
