Neale Ferguson | 14 Oct 21:40 2014

Re: Permission denied

Yeah, I noted I was looking at the wrong lockspace. The gfs2 lockspace in
this cluster is vol1. Once I was looking at the right one, I think I found
the cause: it appears to be an endianness problem. In set_rcom_status():

        rs->rs_flags = cpu_to_le32(flags);

However, in receive_rcom_status() flags are checked:

        if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) {

But it should be:

        if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) {

I made this change and now the gfs2 volume is being mounted correctly on
both nodes. I've repeated it a number of times and it's kept working.

Neale

On 10/14/14, 3:20 PM, "David Teigland" <teigland <at> redhat.com> wrote:

>clvmd is a userland lockspace and does not use lockspace_ops or slots/jids
>like a gfs2 (kernel) lockspace.
>
>To debug the dlm/gfs2 control mechanism, which assigns gfs2 a jid based on
>dlm slots, enable the fs_info() lines in gfs2/lock_dlm.c.  (Make sure that
>you're not somehow running gfs_controld on these nodes; we quit using that
>in RHEL7.)


Thomas Meier | 13 Oct 21:10 2014

Fencing issues with fence_apc_snmp (APC Firmware 6.x)

Hi

When configuring PDU fencing in my 2-node cluster I ran into some problems with
the fence_apc_snmp agent. Turning a node off works fine, but
fence_apc_snmp then exits with an error.

When I do this manually (from node2):

   fence_apc_snmp -a node1 -n 1 -o off

the output of the command is not the expected:

   Success: Powered OFF

but in my case:

   Returned 2: Error in packet.
   Reason: (genError) A general failure occured
   Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21

When I check the PDU, the port is without power, so this part works.
But it seems that the fence agent can't read the status of the PDU
and then exits with an error. The same seems to happen when fenced
calls the agent: it also exits with an error, fencing can't succeed,
and the cluster hangs.

From the logfile:

    fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent
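To see whether the PDU answers for that object at all, the failing OID can be
queried directly with net-snmp (a sketch; the community string, SNMP version
and PDU address are placeholders, and firmware 6.x may need different SNMP
settings than 5.x):

   snmpget  -v1 -c private <pdu-address> .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21
   snmpwalk -v1 -c private <pdu-address> .1.3.6.1.4.1.318.1.1.4.4.2.1.3

If these fail the same way, the problem is on the SNMP side of the PDU rather
than in the fence agent itself.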


Neale Ferguson | 13 Oct 17:20 2014

Permission denied

I reported last week that I was getting permission denied when pcs was
starting a gfs2 resource. I thought it was due to the resource being
defined incorrectly, but it doesn't appear to be the case. On rare
occasions the mount works but most of the time one node gets it mounted
but the other gets denied. I've enabled a number of logging options and
done straces on both sides but I'm not getting anywhere.

My cluster looks like:

# pcs resource show
 Clone Set: dlm-clone [dlm]
   Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Resource Group: apachegroup
   VirtualIP	(ocf::heartbeat:IPaddr2):	Started
   Website	(ocf::heartbeat:apache):	Started
   httplvm	(ocf::heartbeat:LVM):	Started
   http_fs	(ocf::heartbeat:Filesystem):	Started
 Clone Set: clvmd-clone [clvmd]
   Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
   Started: [ rh7cn1.devlab.sinenomine.net ]
   Stopped: [ rh7cn2.devlab.sinenomine.net ]

The gfs2 resource is defined:

# pcs resource show clusterfs
 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)

Neale Ferguson | 3 Oct 21:32 2014

gfs2 resource not mounting

Using the same two-node configuration I described in an earlier post to this forum, I'm having problems
getting a gfs2 resource started on one of the nodes. The resource in question:

 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime 
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

pcs status shows:

Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

Failed actions:
    clusterfs_start_0 on rh7cn2.devlab.sinenomine.net 'unknown error' (1): call=46, status=complete,
last-rc-change='Fri Oct  3 14:41:26 2014', queued=4702ms, exec=0ms

Using pcs resource debug-start I see:

Operation start for clusterfs:0 (ocf:heartbeat:Filesystem) returned 1
 >  stderr: INFO: Running start for /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
 >  stderr: mount: permission denied
 >  stderr: ERROR: Couldn't mount filesystem /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
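When the Filesystem agent reports "mount: permission denied" for gfs2, the
kernel log on the failing node usually has the underlying dlm/gfs2 reason;
a sketch of where to look (only the device and mount point from the resource
definition above are used):

   dmesg | grep -iE 'gfs2|dlm'
   journalctl -b -u corosync -u pacemaker | tail -n 100
   # try the mount by hand to reproduce the raw error outside pacemaker
   mount -t gfs2 -o noatime /dev/vg_cluster/ha_lv /mnt/gfs2-demo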


Digimer | 3 Oct 19:56 2014

Re: clvmd issues

On 03/10/14 12:57 PM, manish vaidya wrote:
> First, I apologise for the late reply; the delay was because I couldn't
> believe there would be any response from the site. I am a newcomer and had
> already posted this problem on many online forums, but they didn't give any
> response.
>
> Thank you all for taking my problem seriously.
>
> ** response from you
>
> are you using clvmd? if your answer is = yes, you need to be sure your pv
> is visible to your cluster nodes
>
> *** I am using clvmd, and when I use the pvscan command the cluster hangs.
>
> I want to reproduce this situation again, such that when I try to run the
> pvcreate command in the cluster, the message about a lock from node2 &
> node3 should appear. I have created a new cluster, and this new cluster is
> working fine.
> How do I do this? Is there any setting in lvm.conf?

Can you share your setup please?

What kind of cluster? What version? What is the configuration file? Was 
there anything interesting in the system logs? etc.
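If it helps, the usual starting points for collecting that are something
like the following (a sketch; adjust paths and package names to whatever
distribution and cluster stack is actually in use):

   cat /etc/redhat-release                          # or the distro equivalent
   rpm -qa | grep -E 'lvm2|corosync|cman|pacemaker'
   grep locking_type /etc/lvm/lvm.conf              # clvmd needs locking_type = 3
   cat /etc/cluster/cluster.conf                    # or /etc/corosync/corosync.conf
   grep -iE 'dlm|clvmd|lvm' /var/log/messages | tail -n 100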

--

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 

Daniel Dehennin | 3 Oct 16:35 2014

cLVM unusable on quorate cluster

Hello,

I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
for an OpenNebula cluster.

As I'm new to the cluster world, I have a hard time figuring out why things
sometimes go really wrong and where I must look to find answers.

My OpenNebula frontend, running in a VM, does not manage to run the
resources, and my syslog has a lot of:

#+begin_src
ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
#+end_src

When this happens, the other nodes have problems:

#+begin_src
root@nebula3:~# LANG=C vgscan
  cluster request failed: Host is down
  Unable to obtain global lock.
#+end_src
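Before digging into clvmd itself, it may be worth confirming that corosync
still considers the node part of a quorate membership at that moment (a
sketch; the exact tools depend on the corosync/cluster stack version Wheezy
is running):

#+begin_src
corosync-quorumtool -s        # membership and quorum state
corosync-cfgtool -s           # ring status on this node
ps ax | grep -E 'clvmd|dlm_controld|ocfs2_controld'
#+end_src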

But things look fine in “crm_mon”:

#+begin_src
root@nebula3:~# crm_mon -1
============
Last updated: Fri Oct  3 16:25:43 2014
Last change: Fri Oct  3 14:51:59 2014 via cibadmin on nebula1

Neale Ferguson | 2 Oct 21:30 2014

Fencing of node

After creating a simple two-node cluster, one node is being fenced continually. I'm running pacemaker
(1.1.10-29) with two nodes and the following corosync.conf:

totem {
version: 2
secauth: off
cluster_name: rh7cluster
transport: udpu
}

nodelist {
  node {
        ring0_addr: rh7cn1.devlab.sinenomine.net
        nodeid: 1
       }
  node {
        ring0_addr: rh7cn2.devlab.sinenomine.net
        nodeid: 2
       }
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_syslog: yes
}
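For reference, a quick way to see what membership each node believes in and
what the fence history looks like (a sketch; command availability depends on
the installed corosync/pacemaker packages):

   corosync-quorumtool -s                  # both nodes joined? quorate?
   corosync-cmapctl | grep members         # membership as corosync sees it
   pcs stonith                             # configured fence devices
   journalctl -u corosync -u pacemaker | grep -iE 'fence|stonith' | tail -n 50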


Ferenc Wagner | 22 Sep 10:24 2014

ordering scores and kinds

Hi,

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html
says that optional ordering is achieved by setting the "kind" attribute
to "Optional".  However, the next section
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_advisory_ordering.html
says that advisory ordering is achieved by setting the "score" attribute
to 0.  Is there any difference between an optional and an advisory
ordering constraint?  How do nonzero score values influence cluster
behaviour, if at all?  Or is the kind attribute intended to replace all
score settings on ordering constraints?
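For concreteness, the two formulations the documentation describes would look
roughly like this in the CIB (a sketch; A and B are placeholder resource ids):

   <rsc_order id="order-A-B-kind"  first="A" then="B" kind="Optional"/>
   <rsc_order id="order-A-B-score" first="A" then="B" score="0"/>

or, through pcs:

   pcs constraint order start A then start B kind=Optional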
-- 
Thanks,
Feri.


Kaisar Ahmed Khan | 21 Sep 08:06 2014

GFS2 mount problem


 Dear All,

 I have been experiencing a problem for a long time with GFS2 on a three-node cluster.

A short brief of my scenario:
All three nodes are KVM guests on one host. Storage is accessed via iSCSI on all three nodes.
One 50GB LUN is attached on all three nodes and configured with a GFS2 file system.
The GFS2 file system is mounted on all three nodes persistently via fstab.

The problem is:
When I reboot or fence any machine, I find the GFS2 file system is not mounted. It gets mounted after running # mount -a manually.
 
What could be the cause of this problem?
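One common cause of this is the mount being attempted before the cluster
stack and the iSCSI session are up at boot; a hedged fstab sketch (device
path and mount point are placeholders, and the services that must be enabled
at boot depend on the distribution):

   /dev/mapper/cluster_vg-gfs2_lv  /mnt/gfs2  gfs2  noatime,_netdev  0 0

The _netdev option marks the filesystem as depending on the network, so the
mount is deferred until networking and network storage are available.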

Thanks
Kaisar


 
Nick Fisk | 17 Sep 17:45 2014

Wrong Variable name in iSCSILogicalUnit

Hi,

I have been trying to create an HA iSCSILogicalUnit resource and think I have come across a bug caused by a wrong variable name.

I have been using the master branch from ClusterLabs for my iSCSILogicalUnit resource agent, running on Ubuntu 14.04.

Whilst the LUN and target are correctly created by the agent, on stopping it was only removing the target, which cleared the LUN but left the iBlock device behind. This then kept the underlying block device locked, as it was still in use.

After spending a fair while trawling through the agent I believe I have found the problem; at least the change I made has fixed it for me.

In the monitor and stop actions there is a check which uses the wrong variable, OCF_RESKEY_INSTANCE instead of OCF_RESOURCE_INSTANCE. I also found a "#{" in front of one of the variables that prepares the path string for removing the LUN. I have also added a few more log entries to give a clearer picture of what is happening during removal, which made debugging much easier.

Below is a diff which seems to fix the problem for me:

+++ /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit  2014-09-17 16:40:23.208764599 +0100
@@ -419,12 +419,14 @@
                                        ${initiator} ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
                        fi
                done
-               lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_#{${OCF_RESKEY_lun}/"
+               lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/"
                if [ -e "${lun_configfs_path}" ]; then
+                       ocf_log info "Deleting LUN ${OCF_RESKEY_target_iqn}/${OCF_RESKEY_lun}"
                        ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
                fi
-               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path"
+               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path"
                if [ -e "${block_configfs_path}" ]; then
+                       ocf_log info "Deleting iBlock Device iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}"
                        ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
                fi
                ;;
@@ -478,7 +480,7 @@
                [ -e ${configfs_path} ] && [ `cat ${configfs_path}` = "${OCF_RESKEY_path}" ] && return $OCF_SUCCESS

                # if we aren't activated, is a block device still left over?
-               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path"
+               block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path"
                [ -e ${block_configfs_path} ] && ocf_log warn "existing block without an active lun: ${block_configfs_path}"
                [ -e ${block_configfs_path} ] && return $OCF_ERR_GENERIC
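For anyone wanting to verify the behaviour, the leftover backstore is visible
in configfs after a stop; something along these lines (a sketch, the resource
name is a placeholder and the paths follow the LIO configfs layout the agent
drives):

   pcs resource debug-stop <iscsi-lun-resource>   # or: crm resource stop <id>
   ls /sys/kernel/config/target/iscsi/            # target should be gone
   ls /sys/kernel/config/target/core/             # iblock_* backstore should be gone too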

Nick Fisk
Technical Support Engineer

System Professional Ltd
tel: 01825 830000
mob: 07711377522
fax: 01825 830001
mail: Nick.Fisk <at> sys-pro.co.uk
web: www.sys-pro.co.uk

Ferenc Wagner | 17 Sep 13:36 2014

transition graph elements

Hi,

Some cluster configuration helpers here do some simple transition graph
analysis (no action planned or single resource start/restart).  The
information source is crm_simulate --save-graph.  It works pretty well,
but recently, after switching on utilization-based resource placement,
load_stopped_* pseudo events appeared in the graph even where previously
there would have been an empty <transition_graph/>.  The workaround was obvious,
but I guess it's high time to seek out some definitive documentation
about the transition graph XML.  Is there anything of that sort
available somewhere?  If not, which part of the source shall I start
looking at?
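For reference, the invocation in question is along these lines (a sketch;
option names may differ slightly between pacemaker 1.1 releases):

   crm_simulate --live-check --save-graph transition.xml --save-dotfile transition.dot
   dot -Tsvg transition.dot -o transition.svg     # graphviz view of the same transition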
-- 
Thanks,
Feri.


