Ferenc Wagner | 22 Sep 10:24 2014

ordering scores and kinds

Hi,

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html
says that optional ordering is achieved by setting the "kind" attribute
to "Optional".  However, the next section
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_advisory_ordering.html
says that advisory ordering is achieved by setting the "score" attribute
to 0.  Is there any difference between an optional and an advisory
ordering constraint?  How do nonzero score values influence cluster
behaviour, if at all?  Or is the kind attribute intended to replace all
score settings on ordering constraints?
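
For concreteness, I am comparing constraints like the following two (the resource names are made up):

    <rsc_order id="order-db-then-app" first="database" then="webserver" kind="Optional"/>
    <rsc_order id="order-db-then-app-alt" first="database" then="webserver" score="0"/>

Are these treated identically by the policy engine?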
-- 
Thanks,
Feri.

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Kaisar Ahmed Khan | 21 Sep 08:06 2014

GFS2 mount problem


 Dear All,

I have been experiencing a problem for a long time with GFS2 on a three-node cluster.

A short brief of my scenario:
All three nodes are KVM guests on one host. Storage is accessed by iSCSI on all three nodes.
One 50 GB LUN is presented to all three nodes and configured with a GFS2 file system.
The GFS2 file system is mounted on all three nodes persistently via fstab.

The problem is:
When I reboot or fence any machine, the GFS2 file system is not mounted when the node comes back up. It only gets mounted after applying the mount -a command.
 
What could be the possible cause of this problem?
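
For reference, I wonder whether the mount is simply attempted before the cluster and iSCSI services are up at boot. Is an fstab entry along these lines (the device path here is only an example, not my real one) what is normally recommended, with _netdev so the mount waits for the network?

    /dev/mapper/gfs2lun1    /mnt/gfs2    gfs2    defaults,_netdev    0 0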

Thanks
Kaisar


 
Nick Fisk | 17 Sep 17:45 2014

Wrong Variable name in iSCSILogicalUnit

Hi,

 

I have been trying to create a HA iSCSILogicalUnit resource and think I have come across a bug caused by a wrong variable name.

 

I have been using the master branch from cluster labs for my iSCSILogicalUnit resource agent running on Ubuntu 14.04.

 

Whilst the LUN and target are correctly created by the agent, when stopping the agent it was only removing the target, which cleared the LUN but left the iBlock device behind. This then locked the underlying block device, as it was still in use.

 

After spending a fair while trawling through the agent I believe I have discovered the problem; at least, the change I made has fixed it for me.

 

In the monitor and stop actions there is a check which uses the wrong variable, OCF_RESKEY_INSTANCE instead of OCF_RESOURCE_INSTANCE. I also found a "#{" in front of one of the variables that prepares the path string for removing the LUN. I have also added a few more log entries to give a clearer picture of what is happening during removal, which made the debugging process much easier.

 

 

Below is a diff which seems to fix the problem for me:

+++ /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit  2014-09-17 16:40:23.208764599 +0100
@@ -419,12 +419,14 @@
                                        ${initiator} ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
                        fi
                done
-               lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_#{${OCF_RESKEY_lun}/"
+               lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/"
                if [ -e "${lun_configfs_path}" ]; then
+                       ocf_log info "Deleting LUN ${OCF_RESKEY_target_iqn}/${OCF_RESKEY_lun}"
                        ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
                fi
-               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path"
+               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path"
                if [ -e "${block_configfs_path}" ]; then
+                       ocf_log info "Deleting iBlock Device iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}"
                        ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
                fi
                ;;
@@ -478,7 +480,7 @@
                [ -e ${configfs_path} ] && [ `cat ${configfs_path}` = "${OCF_RESKEY_path}" ] && return $OCF_SUCCESS

                # if we aren't activated, is a block device still left over?
-               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path"
+               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path"
                [ -e ${block_configfs_path} ] && ocf_log warn "existing block without an active lun: ${block_configfs_path}"
                [ -e ${block_configfs_path} ] && return $OCF_ERR_GENERIC



Nick Fisk
Technical Support Engineer

System Professional Ltd
tel: 01825 830000
mob: 07711377522
fax: 01825 830001
mail: Nick.Fisk <at> sys-pro.co.uk
web: www.sys-pro.co.uk

IT SUPPORT SERVICES | VIRTUALISATION | STORAGE | BACKUP AND DR | IT CONSULTING

Registered Office:
Wilderness Barns, Wilderness Lane, Hadlow Down, East Sussex, TN22 4HU
Registered in England and Wales.
Company Number: 04754200


Ferenc Wagner | 17 Sep 13:36 2014

transition graph elements

Hi,

Some cluster configuration helpers here do some simple transition graph
analysis (no action planned or single resource start/restart).  The
information source is crm_simulate --save-graph.  It works pretty well,
but recently, after switching on utilization-based resource placement,
load_stopped_* pseudo events started appearing in the graph even in cases
that previously produced an empty <transition_graph/>.  The workaround was obvious,
but I guess it's high time to seek out some definitive documentation
about the transition graph XML.  Is there anything of that sort
available somewhere?  If not, which part of the source shall I start
looking at?
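
In case it matters, the graphs are produced roughly like this (the file names are arbitrary, and I am quoting the flags from memory):

    crm_simulate -L --save-graph /tmp/transition.xml --save-dotfile /tmp/transition.dot
    grep -c '<rsc_op ' /tmp/transition.xml    # crude count of planned resource actions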
-- 
Thanks,
Feri.


Vallevand, Mark K | 16 Sep 23:20 2014

Cman (and corosync) starting before network interface is ready

It looks like there is some odd delay in getting a network interface up and ready.  So, when cman starts corosync, it can’t get to the cluster.  So, for a time, the node is a member of a cluster-of-one.  The cluster-of-one begins starting resources.  A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster.  The doubly-started resources are sorted out and all ends up OK.

 

Now, it is not a good thing to have these particular resources running twice.  I’d really like the clustering software to behave better, but I’m not sure what ‘behave better’ would be.

 

Is it possible to introduce a delay into cman or corosync startup?  Is that even wise?

Is there a parameter to get the clustering software to poll more often when it can’t rejoin the cluster?

 

Any suggestions would be welcome.

 

Running Ubuntu 12.04 LTS.  Pacemaker 1.1.6.  Cman 3.1.7.  Corosync 1.4.2.
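
Something like the following wait loop is the sort of delay I have in mind, if it comes to that (a rough sketch only; the bond0 interface name and the init script path are assumptions on my part):

    # wait up to 30 seconds for the interface before starting cman
    IFACE=bond0
    for i in $(seq 1 30); do
        ip link show "$IFACE" 2>/dev/null | grep -q 'state UP' && break
        sleep 1
    done
    /etc/init.d/cman start

But that feels like papering over the real problem, hence the questions above.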

 

Regards.
Mark K Vallevand

"If there are no dogs in Heaven, then when I die I want to go where they went."
-Will Rogers


Amjad Syed | 9 Sep 09:14 2014

Physical shutdown of one node causes both nodes to crash in active/passive configuration of 2-node RHEL cluster

Hi,

I have set up a 2-node cluster using RHEL 6.5.

The cluster.conf looks like this:

 

<?xml version="1.0"?>
<cluster config_version="7" name="oracleha">
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" ipaddr="10.10.63.93" login="ADMIN" name="inspuripmi" passwd="XXXXX"/>
                <fencedevice agent="fence_ilo2" ipaddr="10.10.63.92" login="test" name="hpipmi" passwd="XXXXX"/>
        </fencedevices>
        <fence_daemon post_fail_delay="0" post_join_delay="60"/>
        <clusternodes>
                <clusternode name="192.168.10.10" nodeid="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="" name="inspuripmi" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="192.168.10.11" nodeid="2">
                        <fence>
                                <method name="1">
                                        <device lanplus="" name="hpipmi" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <rm>
                <failoverdomains/>
                <resources/>
                <service autostart="1" exclusive="0" name="IP" recovery="relocate">
                        <ip address="10.10.5.23" monitor_link="on" sleeptime="10"/>
                </service>
        </rm>
</cluster>


The network is as follows:

1) Heartbeat (bonding) between node 1 and node 2 using Ethernet cables.

The IP addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node 2.

2) IPMI. This is used for fencing; the addresses are 10.10.63.93 and 10.10.63.92.

3) External Ethernet connected to the 10.10.5.x network.

If I do fence_node <ipaddress>, then fencing works.
However, if I physically shut down the active node, the passive node also shuts down. Even if I do ifdown bond0 (on the active node), both nodes shut down and have to be physically rebooted.

Is there anything I am doing wrong?
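
Would adding a delay to one of the fence devices be the right way to avoid the two nodes fencing each other, something like the following (an untested sketch; the 15-second value is an arbitrary guess)?

    <fencedevices>
            <fencedevice agent="fence_ipmilan" ipaddr="10.10.63.93" login="ADMIN" name="inspuripmi" passwd="XXXXX" delay="15"/>
            <fencedevice agent="fence_ilo2" ipaddr="10.10.63.92" login="test" name="hpipmi" passwd="XXXXX"/>
    </fencedevices>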




Neale Ferguson | 8 Sep 16:44 2014

Re: F_SETLK fails after recovery

Further to the problem described last week: what I'm seeing is that the node (NODE2) that keeps going when
NODE1 fails has many entries in its dlm_tool log_plocks output:

1410147734 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147734 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147734 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147736 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147736 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147736 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147738 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147738 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147738 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147740 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147740 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147740 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147742 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147742 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147742 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0

i.e. with no corresponding unlock entry. NODE1 is brought down by init 6 and when it restarts it gets as far as
"Starting cman" before NODE2 fences it (I assume we need a higher post_join_delay). When the node is
fenced I see:

1410147774 clvmd purged 0 plocks for 1
1410147774 lvclusdidiz0360 purged 3 plocks for 1

So it looks like it tried to do some cleanup, but then when NODE1 attempts to join, NODE2 examines the lockspace
and reports the following:

1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78067.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78068.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78059.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88464.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88478.0"
1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0
1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0

So it believes NODE1 will have 45 plocks to process when it comes back. NODE1 receives that plock
information: 

1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2
1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1

However, when NODE1 attempts to retrieve plocks it reports:

1410147820 lvclusdidiz0360 retrieve_plocks
1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0

Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled and the F_SETLK operation on the gfs2
target will fail on NODE1.

I am trying to understand the checkpointing process and where this information is actually being retrieved from.

Neale


urgrue | 3 Sep 11:07 2014

Possible to apply changes without restart?

Hi,
Using cman/rgmanager in RHEL6 - is it possible to add a resource to my
service and have it be picked up and started without having to restart
cman/rgmanager? I thought ccs --activate did this, and the rgmanager.log
does output:
Sep 03 10:50:07 rgmanager Stopping changed resources.
Sep 03 10:50:07 rgmanager Restarting changed resources.
Sep 03 10:50:07 rgmanager Starting changed resources.

But the resource I added is not running, nor is there any mention of it
in the logs. Is this normal, or did I do something wrong?
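
For comparison, is the following the expected way to push such a change (a rough outline; I may be misremembering details)?

    # after editing /etc/cluster/cluster.conf and bumping config_version:
    ccs_config_validate
    cman_tool version -r    # propagate the new configuration to all nodes
    clustat                 # check whether rgmanager picked up the new resource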


Neale Ferguson | 2 Sep 16:56 2014

F_SETLK fails after recovery

Hi,
 In our two-node system, if one node fails, the other node takes over the application and uses the shared gfs2
target successfully. However, after the failed node comes back, any attempt to lock files on the gfs2
resource results in -ENOSYS. The following test program exhibits the problem: in normal operation the
lock succeeds, but in the fail/recover scenario we get -ENOSYS:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int 
main(int argc, char **argv)
{
	int fd;
	struct flock fl;

	fd = open("/mnt/test.file",O_RDONLY);
	if (fd != -1) {
		if (fcntl(fd, F_SETFL, O_RDONLY|O_DSYNC) != -1) {
			fl.l_type = F_RDLCK;
			fl.l_whence = SEEK_SET;
			fl.l_start = 0;
			fl.l_len = 0;
			if (fcntl(fd, F_SETLK, &fl) != -1)
				printf("File locked successfully\n");
			else
				perror("fcntl(F_SETLK)");
		} else
			perror("fcntl(F_SETFL)");
		close (fd);
	} else 
		perror("open");
}

I've tracked things down to these messages:

1409631951 lockspace lvclusdidiz0360 plock disabled our sig 816fba01 nodeid 2 sig 2f6b
:
1409634840 lockspace lvclusdidiz0360 plock disabled our sig 0 nodeid 2 sig 2f6b

This indicates that the lockspace attribute disable_plock has been set by way of the other node calling send_plocks_stored().

Looking at cpg.c:

static void prepare_plocks(struct lockspace *ls)
{
        struct change *cg = list_first_entry(&ls->changes, struct change, list);
        struct member *memb;
        uint32_t sig;
        :
        :
        :
        if (nodes_added(ls))
                store_plocks(ls, &sig);
        send_plocks_stored(ls, sig);
}

If nodes_added(ls) returns false then an uninitialized "sig" value will be passed to
send_plocks_stored(). Do the "our sig" and "sig" values in the above log messages make sense?
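
Just to make concrete what I mean (purely illustrative on my part, not a patch I have tested against the real dlm_controld source), the naive guard would be something like:

    static void prepare_plocks(struct lockspace *ls)
    {
            struct member *memb;
            uint32_t sig = 0;       /* give sig a defined value even when nodes_added()
                                       is false and store_plocks() is never called */
            :
            if (nodes_added(ls))
                    store_plocks(ls, &sig);
            send_plocks_stored(ls, sig);
    }

though I have no idea whether sending a zero signature is actually valid in the protocol.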

If this is not the case, what is supposed to happen in order to re-enable plocks on the recovered node?

Neale


manish vaidya | 30 Aug 16:12 2014

Please help me on cluster error

I created a four-node cluster in a KVM environment, but I get an error when creating a new PV. For example, pvcreate /dev/sdb1
gives an error: lock from node 2 & lock from node 3.

I also see strange cluster logs:

Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e 5f
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f 60
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63 64
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69 6a
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84 85
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a 9b


Please help me on this issue

Neale Ferguson | 28 Aug 21:11 2014

Delaying fencing during shutdown

Hi,
 In a two-node cluster I shut down one of the nodes, and the other node notices the shutdown, but on rare occasions
that node will then fence the node that is shutting down. Is this a situation where setting
post_fail_delay would be useful, or setting the totem timeout to something higher than its default?
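
In cluster.conf terms, I am thinking of something along these lines (the values are guesses on my part, just to show which knobs I mean):

    <fence_daemon post_fail_delay="30" post_join_delay="60"/>
    <totem token="30000"/>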

Neale


