Frederik Ferner | 11 Mar 19:07 2015

Re: shutdown work (plan)

All,

just a reminder and update on this planned work for this shutdown (with 
the obvious beamline typo fixed). The main new detail is that, IIRC, i02 
will want us to power off everything before Saturday (i.e. some time on 
Friday), and it will remain powered off for a while (a week or so).

IMHO it would be good if we could start considering the merge to UserMode 
tomorrow, to make sure we are ready to check it out on Friday.

The Lustre I/O errors are hopefully going to go away with the Lustre 
upgrade on the servers, so there is no need to reproduce them until after 
the Lustre maintenance.

The Lustre and GPFS client and kernel modules have been built, and 
cfengine has been updated with the correct versions (I hope).

Rsyncing /dls/i18 to Lustre might have to be delayed slightly, as 
lustre03 is currently ~88% full and I'd like to clear some/most of this 
first...
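
(For reference, a minimal sketch of checking how full the individual OSTs 
are from a client before starting the rsync; the mount point below is 
only a placeholder and not necessarily where lustre03 is mounted:)

    # per-OST and total usage, human-readable
    lfs df -h /mnt/lustre03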

Richard, did you have any more details on when the various switch 
upgrades etc. have been scheduled?

Cheers,
Frederik

On 04/03/15 14:47, Frederik Ferner wrote:
> All,
>

Marek "marx" Grac | 5 Mar 12:47 2015

fence-agents-4.0.16 stable release

Welcome to the fence-agents 4.0.16 release

This release includes several bugfixes and features:
* fence_kdump now implements a 'monitor' action that checks whether the 
local node is capable of working with kdump
* the path to snmp(walk|get|set) can be set at runtime
* a new operation 'validate-all' for the majority of agents that checks 
whether the entered parameters are sufficient without connecting to the 
fence device. Be aware that some checks can be done only after we receive 
information from the fence device, so these are not tested.
* a new operation 'list-status' that presents CSV output (plug_number, 
plug_alias, plug_status), where status is ON/OFF/UNKNOWN
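
For illustration only, an invocation of the new 'list-status' action 
might look like the following; the agent, address and credentials are 
placeholders, and the output lines simply follow the CSV format described 
above:

    # fence_apc -a 10.0.0.5 -l apc -p secret -o list-status
    1,webserver1,ON
    2,webserver2,OFF
    3,,UNKNOWN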

The Git repository was moved to https://github.com/ClusterLabs/fence-agents/ 
so this is the last release made from fedorahosted.

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.16.tar.xz 

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

    Join us on IRC (irc.freenode.net #linux-cluster) and share your

Megan . | 19 Feb 14:50 2015

Number of GFS2 mounts

Good Morning!

We have an 11-node CentOS 6.6 cluster configuration.  We are using it
to share SAN mounts between servers (GFS2 via iSCSI with LVM).  We
have a requirement to have 33 GFS2 mounts shared on the cluster (crazy,
I know).  Are there any limitations on doing this?  I couldn't find
anything in the documentation about the number of mounts, just the size
of the mounts.  Is there anything I can do to tune our cluster to handle
this requirement?
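
(For context, each of those would just be a clustered LV mounted along
these lines; the device, mount point and options below are placeholders,
with noatime/nodiratime being commonly suggested to reduce cross-node
lock traffic, not a requirement:)

    # mount -t gfs2 -o noatime,nodiratime /dev/vg_san/lv_data01 /mnt/data01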

Thanks!


Equipe R&S Netplus | 13 Feb 17:39 2015

NFS HA

Hello,

I would like to set up an NFS cluster.
With RHCS, I use the resource agent "nfsserver".

But I have a question:
Is it possible to manage an NFS server in such a way that the NFS clients "will not be aware of any loss of service"? In other words, if the NFS service fails over, the NFS clients don't see any change.

Currently, when there is a failover, I can't access the NFS server any more;
I get the message "Stale NFS file handle".
In the NFS client log:
<<
NFS: server X.X.X.X error: fileid changed
>>

Is there any solution, please?
Thank you.
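
(For what it's worth, the "fileid changed" message usually indicates that
the two servers hand out different NFS file handles for the same export;
a minimal, hedged sketch of pinning the fsid in /etc/exports so that both
nodes generate identical handles; the path and other options are
placeholders:)

    # cat /etc/exports
    /export/data  *(rw,sync,fsid=25)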

Vallevand, Mark K | 13 Feb 17:16 2015

Is there a way for a resource agent to know the previous node on which it was active?

Is there a way for a resource agent to know the previous node on which it was active?
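
(If this is a Pacemaker-based setup, one possible, untested approach is to
have the agent itself record where it last ran in a cluster property; the
attribute name below is made up for illustration:)

    # sketch for the agent's start action
    prev=$(crm_attribute --name "last-node-${OCF_RESOURCE_INSTANCE}" --query --quiet 2>/dev/null)
    crm_attribute --name "last-node-${OCF_RESOURCE_INSTANCE}" --update "$(crm_node -n)"
    # $prev now holds the node the resource was previously active on (empty on first start)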

 

Regards.
Mark K Vallevand   Mark.Vallevand@Unisys.com
Outside of a dog, a book is man's best friend.
 Inside of a dog, it's too dark to read.  - Groucho

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.

Masatake YAMATO | 5 Feb 10:57 2015

wrong error messages in fence-virt

Is fence-virt still maintained?
I cannot find the git repository for it.
There is one at sf.net; however, it looks obsolete.

With my broken configuration, I got the following debug output from
fence_xvm...

    #  fence_xvm -H targethost -o status -dddddd
    Debugging threshold is now 6
    -- args @ 0x7fff762de810 --
    ...
    Opening /dev/urandom
    Sending to 225.0.0.12 via 192.168.122.113
    Waiting for connection from XVM host daemon.
    Issuing TCP challenge
>   read: Is a directory
    Invalid response to challenge
    Operation failed

Look at the line marked with '>'. The error message seems strange to me
because, as far as I can tell from reading the source code, read() is
called there on a socket connected to fence_virtd.

So I walked through the code and found two bugs:

1. Checking the result of the read() (and write()) system calls

   perror is called even when the call succeeded but returned fewer bytes
   than requested, so errno is stale at that point.

2. "read" is passed as the argument to perror when the write() system call fails.

Neither is critical if fence_virtd is configured well, but users may be
confused when it is not.

The following patch is not tested at all, but it shows what I mean by the
two points above.

Masatake YAMATO

--- fence-virt-0.3.2/common/simple_auth.c	2013-11-05 01:08:35.000000000 +0900
+++ fence-virt-0.3.2/common/simple_auth.c.new	2015-02-05 18:40:53.471029118 +0900
@@ -260,9 +260,13 @@
 		return 0;
 	}

-	if (read(fd, response, sizeof(response)) < sizeof(response)) {
+	ret = read(fd, response, sizeof(response));
+	if (ret < 0) {
 		perror("read");
 		return 0;
+	} else if (ret < sizeof(response)) {
+		fprintf(stderr, "RESPONSE is too short(%d) in %s\n", ret, __FUNCTION__);
+		return 0;
 	}

 	ret = !memcmp(response, hash, sizeof(response));
@@ -333,7 +337,7 @@
 	HASH_Destroy(h);

 	if (write(fd, hash, sizeof(hash)) < sizeof(hash)) {
-		perror("read");
+		perror("write");
 		return 0;
 	}


cluster lab | 29 Jan 05:50 2015

GFS2: "Could not open" the file on one of the nodes

Hi,

In a two-node cluster, I received two different results from "qemu-img
check" on just one file:

node1 # qemu-img check VMStorage/x.qcow2
No errors were found on the image.

Node2 # qemu-img check VMStorage/x.qcow2
qemu-img: Could not open 'VMStorage/x.qcow2'

All other files are OK, and the cluster works properly.
What is the problem?
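
(As a first diagnostic step it may be worth comparing what each node
actually sees for that file, and checking the kernel log for GFS2
complaints; a minimal sketch, assuming VMStorage is the GFS2 mount point
on both nodes:)

    # run on both nodes and compare the output
    stat VMStorage/x.qcow2
    dmesg | grep -i gfs2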

====
Packages:
kernel: 2.6.32-431.5.1.el6.x86_64
GFS2: gfs2-utils-3.0.12.1-23.el6.x86_64
corosync: corosync-1.4.1-17.el6.x86_64

Best Regards


Vladimir Melnik | 13 Dec 17:04 2014

The file on a GFS2-filesystem seems to be corrupted

Dear colleagues,

I encountered a very strange issue and would be grateful if you would
share your thoughts on it.

I have a qcow2 image that is located on a GFS2 filesystem on a cluster.
The cluster works fine and there are dozens of other qcow2 images, but,
as far as I can see, one of the images seems to be corrupted.

First of all, it has quite an unusual size:
> stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
  File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
  Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
Device: fd06h/64774d    Inode: 220986752   Links: 1
Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-10-09 16:25:24.864877839 +0300
Modify: 2014-12-13 14:41:29.335603509 +0200
Change: 2014-12-13 15:52:35.986888549 +0200

By the way, I noticed that the block count looks rather okay.

Also qemu-img can't recognize it as an image:
> qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
file format: raw
virtual size: 6815746T (7493992262336241664 bytes)
disk size: 392G

The disk size, though, looks more reasonable: the image really should be
about 300-400G, as I remember.

Alas, I can't do anything with this image. I can't check it with qemu-img,
nor can I convert it to a new image, as qemu-img can't do anything
with it:

> qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'

Has anyone experienced the same issue? What do you think: is it a qcow2
issue or a GFS2 issue? What would you do in a similar situation?
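
(For what it's worth, a minimal, hedged sketch of a read-only consistency
check of the filesystem itself; it assumes the filesystem can first be
unmounted on every node, and the device path is only a placeholder:)

    # with the GFS2 filesystem unmounted on all nodes
    fsck.gfs2 -n /dev/mapper/vg_sp1-lv_sp1    # -n: report only, change nothing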

Any ideas, hints and comments would be greatly appreciated.

Yes, I have snapshots, which is good, but I wouldn't like to lose today's
changes to the data on that image. And I'm worried about the filesystem
as a whole: what if something goes wrong when I try to remove that file?

Thanks to all!

-- 
V.Melnik

P.S. I use CentOS-6 and I have these packages installed:
	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
	lvm2-cluster-2.02.100-8.el6.x86_64
	cman-3.0.12.1-59.el6_5.1.x86_64
	clusterlib-3.0.12.1-59.el6_5.1.x86_64
	kernel-2.6.32-431.5.1.el6.x86_64


Jürgen Ladstätter | 2 Dec 11:29 2014

Fencing and dead locks

Hi guys,

 

we’re running a 9-node cluster with 5 GFS2 mounts. The cluster is mainly used for load-balancing web-based applications. Fencing is done with IPMI and works.

Sometimes one server gets fenced, but after rebooting it isn’t able to rejoin the cluster. This triggers higher load and many open processes, leading to another server being fenced. That server then isn’t able to rejoin either, and this continues until we lose quorum and have to manually restart the whole cluster.

Sadly this is not reproducible, but it looks like it happens more often when there is more write IO.
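
(When it next happens, a hedged sketch of capturing the GFS2 lock state on
the affected nodes for later analysis; the cluster:filesystem name below
is a placeholder, and debugfs may already be mounted on your systems:)

    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    cat /sys/kernel/debug/gfs2/mycluster:myfs/glocks > /tmp/glocks.$(hostname)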

 

Since a whole-cluster deadlock rather defeats the point of a cluster, we’d appreciate some input on what we could do or change.

We’re running Centos 6.6, kernel 2.6.32-504.1.3.el6.x86_64

 

Has anyone of you tested GFS2 with CentOS 7? Are there any known major bugs that could cause deadlocks?

 

Thanks in advance, Jürgen

 

Megan . | 1 Dec 15:16 2014

new cluster acting odd

Good Day,

I'm fairly new to the cluster world, so I apologize in advance for
silly questions.  Thank you for any help.

We decided to use this cluster solution in order to share GFS2 mounts
across servers.  We have a 7-node cluster that is newly set up, but
acting oddly.  It has 3 VMware guest hosts and 4 physical hosts (Dells
with iDRACs).  They are all running CentOS 6.6.  I have fencing
working (I'm able to do fence_node <node> and it fences
successfully).  I do not have the GFS2 mounts in the cluster yet.

When I don't touch the servers, my cluster looks perfect with all
nodes online. But when I start testing fencing, I have an odd problem
where I end up with a split brain between some of the nodes.  They
don't seem to automatically fence each other when it gets like this.

In the corosync.log for the node that gets split out I see the totem
chatter, but it seems confused and just keeps doing the below over and
over:

Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b

Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
21 23 24 25 26 27 28 29 2a 2b 32
..
..
..
Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

I can manually fence it, and it still comes online with the same
issue.  I end up having to take the whole cluster down, sometimes
forcing a reboot on some nodes, then bringing it back up.  It takes a
good part of the day just to bring the whole cluster online again.

I used ccs -h <node> --sync --activate and double-checked to make sure
they are all using the same version of the cluster.conf file.

One issue I did notice is that when one of the VMware hosts is
rebooted, the time comes back slightly skewed (6 seconds), but I thought
I read somewhere that a skew that minor shouldn't impact the cluster.

We have multicast enabled on the interfaces

          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
and we have been told by our network team that IGMP snooping is disabled.

With tcpdump I can see the multicast traffic chatter.
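
(One way to sanity-check end-to-end multicast delivery between the nodes
is omping, run on every node at roughly the same time; a hedged sketch
with placeholder host names and default multicast settings:)

    # list all cluster nodes and run the same command on each of them
    omping -c 20 data1-uat map1-uat cache1-uat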

Right now:

[root@data1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:56:39 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Online
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Online
 data1-uat.domain.com                                   8 Online, Local

** Has itself as online **
[root@map1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:57:07 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Offline, Local
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Online
 data1-uat.domain.com                                   8 Online

[root@cache1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:57:39 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Online
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Offline, Local
 data1-uat.domain.com                                   8 Online

[root@mgmt1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:58:04 2014
Member Status: Inquorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Offline
 admin1-uat.domain.com                                  2 Offline
 mgmt1-uat.domain.com                                   3 Online, Local
 map1-uat.domain.com                                    4 Offline
 map2-uat.domain.com                                    5 Offline
 cache1-uat.domain.com                                  6 Offline
 data1-uat.domain.com                                   8 Offline

cman-3.0.12.1-68.el6.x86_64

[root@data1-uat ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="66" name="projectuat">
  <clusternodes>
    <clusternode name="admin1-uat.domain.com" nodeid="2">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="admin1-uat" ssl="on" uuid="421df3c4-a686-9222-366e-9a67b25f62b2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mgmt1-uat.domain.com" nodeid="3">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="mgmt1-uat" ssl="on" uuid="421d5ff5-66fa-5703-66d3-97f845cf8239"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="map1-uat.domain.com" nodeid="4">
      <fence>
        <method name="fencemap1uat">
          <device name="idracmap1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="map2-uat.domain.com" nodeid="5">
      <fence>
        <method name="fencemap2uat">
          <device name="idracmap2uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="cache1-uat.domain.com" nodeid="6">
      <fence>
        <method name="fencecache1uat">
          <device name="idraccache1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="data1-uat.domain.com" nodeid="8">
      <fence>
        <method name="fencedata1uat">
          <device name="idracdata1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="archive1-uat.domain.com" nodeid="1">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="archive1-uat" ssl="on" uuid="421d16b2-3ed0-0b9b-d530-0b151d81d24e"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_vmware_soap" ipaddr="x.x.x.130" login="fenceuat" login_timeout="10" name="vcappliancesoap" passwd_script="/etc/cluster/forfencing.sh" power_timeout="10" power_wait="30" retry_on="3" shell_timeout="10" ssl="1"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.47" login="fenceuat" name="idracdata1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.48" login="fenceuat" name="idracdata2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.82" login="fenceuat" name="idracmap1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.96" login="fenceuat" name="idracmap2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.83" login="fenceuat" name="idraccache1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;" ipaddr="x.x.x.97" login="fenceuat" name="idraccache2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
  </fencedevices>
</cluster>


Rajat | 28 Nov 06:51 2014

Cluster Overhead I/O, Network, Memory, CPU

Hey Team,

Our customer is using RHEL 5.x and RHEL 6.x clusters in their production stack.

The customer is looking for a doc/white paper they could share with their management showing what the cluster services themselves consume of:
Disk        %
Network     %
Memory      %
CPU         %
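
(There is no single number for this, as it depends heavily on the
workload, but a rough, hedged way to sample what the cluster daemons
consume on a given node is sketched below; the process names are the
usual RHEL 6 ones and may differ on RHEL 5:)

    # CPU and memory of the main cluster daemons on this node
    ps -o pid,pcpu,pmem,rss,comm -C corosync,dlm_controld,gfs_controld,fenced,rgmanager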

Gratitude


