Andrew Kerber | 6 Jun 23:37 2016

Fencing Question

I am doing some experimentation with Linux clustering and am still fairly new to it. I have built a cluster as a proof of concept running a PostgreSQL 9.5 database on GFS2, using VMware Workstation 12.0 and RHEL 7.  GFS2 requires a fencing resource, which I have managed to create using fence_virsh, and the clustering software thinks the fencing is working.  However, it will not actually shut down a node, and I have not been able to figure out the appropriate parameters for VMware Workstation to get it to work.  I tried fence_scsi also, but that doesn't seem to work with a shared vmdk.  Has anyone figured out a fencing agent that will work with VMware Workstation?

Failing that, is there a comprehensive set of instructions for creating my own fencing agent?
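
For reference, this is roughly how I would expect to exercise the agent by hand, outside the cluster, to see where the power-off fails (a sketch only: the host address, login and guest name below are made up, and it assumes the Workstation guests are actually visible to libvirt via virsh on the host, which may be exactly the part that does not hold for VMware Workstation):

fence_virsh -a 192.168.1.10 -l fenceuser -p secret -o list
fence_virsh -a 192.168.1.10 -l fenceuser -p secret -n rhel7-node2 -o status
fence_virsh -a 192.168.1.10 -l fenceuser -p secret -n rhel7-node2 -o off

If the "off" action works there but not from the cluster, the problem would be in the stonith resource parameters; if it fails there as well, the agent presumably cannot control the guests at all.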


--
Andrew W. Kerber

'If at first you don't succeed, don't take up skydiving.'
Eivind Olsen | 24 May 10:45 2016

How to add an unimportant resource to an important cluster?

I have a cluster, running RHEL 6.7, with Ricci, Luci, rgmanager etc.
This is a 2 node cluster, where services are running on one node. The 
service is an Oracle database, and the cluster controls several 
resources:
* LVM volume (using clvmd)
* file system on a logical volume
* IP address
* Oracle listener
* Oracle RDBMS instance

I have now been asked to add another resource (another Oracle RDBMS 
instance), but with the requirement that this new resource shouldn't 
cause the rest of the cluster resources to fail over to the other node. 
Basically, what's been asked is to have another resource which will be 
started by the cluster but, if it fails its health check, will be left 
alone.

Is it possible to somehow mark one of the resources as "Not really 
important, attempt to restart if down but don't migrate the entire 
service with all the resources to the other node"?
My gut feeling tells me the better (correct, only etc.) way is probably 
to set up a separate service for this new less important RDBMS instance, 
giving it its own IP address, LVM volume, filesystem, listener etc.
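
The closest thing I have found so far is the rgmanager __independent_subtree 
resource attribute; as far as I understand the documentation, a value of "2" 
marks a subtree as non-critical, so a failure stops just that resource instead 
of failing the whole service over. A rough cluster.conf sketch of what I have 
in mind (resource names are made up, and the attribute values would need to be 
verified against the RHEL 6 documentation):

<service name="oracle-svc" domain="mydomain" recovery="relocate">
    <lvm .../>
    <fs .../>
    <ip .../>
    <oracledb name="main-db" .../>
    <!-- hypothetical second instance, marked non-critical -->
    <oracledb name="extra-db" __independent_subtree="2"
              __max_restarts="3" __restart_expire_time="600"/>
</service>

Whether that really behaves as "restart in place but never migrate" is 
something I would still need to test, which is why the separate-service 
approach also tempts me.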

Regards
Eivind Olsen


Jonathan Davies | 15 Apr 16:55 2016

I/O to gfs2 hanging or not hanging after heartbeat loss

Dear linux-cluster,

I have made some observations about the behaviour of gfs2 and would 
appreciate confirmation of whether this is expected behaviour or 
something has gone wrong.

I have a three-node cluster -- let's call the nodes A, B and C. On each 
of nodes A and B, I have a loop that repeatedly writes an increasing 
integer value to a file in the GFS2-mountpoint. On node C, I have a loop 
that reads from both these files from the GFS2-mountpoint. The reads on 
node C show the latest values written by A and B, and stay up-to-date. 
All good so far.
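
For concreteness, the writer loop on nodes A and B is essentially of this 
shape (a simplified sketch; the file name and mountpoint are made up, and dd 
with a padded 4096-byte block stands in for the O_DIRECT|O_SYNC fd mentioned 
in the notes at the end):

i=0
while true; do
    i=$((i+1))
    # pad the counter to a full 4096-byte block so the direct write stays aligned
    printf '%d\n' "$i" | dd of=/mnt/gfs2/node_a.dat bs=4096 count=1 \
        conv=notrunc,sync oflag=direct,sync 2>/dev/null
done

Node C simply re-reads /mnt/gfs2/node_a.dat and /mnt/gfs2/node_b.dat in a 
similar loop and prints the latest values.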

I then cause node A to drop the corosync heartbeat by executing the 
following on node A:

iptables -I INPUT -p udp --dport 5404 -j DROP
iptables -I INPUT -p udp --dport 5405 -j DROP
iptables -I INPUT -p tcp --dport 21064 -j DROP

After a few seconds, I normally observe that all I/O to the GFS2 
filesystem hangs forever on node A: the latest value read by node C is 
the same as the last successful write by node A. This is exactly the 
behaviour I want -- I want to be sure that node A never completes I/O 
that is not able to be seen by other nodes.

However, on some occasions, I observe that node A continues in the loop 
believing that it is successfully writing to the file but, according to 
node C, the file stops being updated. (Meanwhile, the file written by 
node B continues to be up-to-date as read by C.) This is concerning -- 
it looks like I/O writes are being completed on node A even though other 
nodes in the cluster cannot see the results.

I performed this test 20 times, rebooting node A between each, and saw 
the "I/O hanging" behaviour 16 times and the "I/O appears to continue" 
behaviour 4 times. I couldn't see anything that might cause it to 
sometimes adopt one behaviour and sometimes the other.

So... is this expected? Should I be able to rely upon I/O hanging? Or 
have I misconfigured something? Advice would be appreciated.

Thanks,
Jonathan

Notes:
  * The I/O from node A uses an fd that is O_DIRECT|O_SYNC, so the page 
cache is not involved.

  * Versions: corosync 2.3.4, dlm_controld 4.0.2, gfs2 as per RHEL 7.2.

  * I don't see anything particularly useful being logged. Soon after I 
insert the iptables rules on node A, I see the following on node A:

2016-04-15T14:15:45.608175+00:00 localhost corosync[3074]:  [TOTEM ] The 
token was lost in the OPERATIONAL state.
2016-04-15T14:15:45.608191+00:00 localhost corosync[3074]:  [TOTEM ] A 
processor failed, forming new configuration.
2016-04-15T14:15:45.608198+00:00 localhost corosync[3074]:  [TOTEM ] 
entering GATHER state from 2(The token was lost in the OPERATIONAL state.).

Around the time node C sees the output from node A stop changing, node A 
reports:

2016-04-15T14:15:58.388404+00:00 localhost corosync[3074]:  [TOTEM ] 
entering GATHER state from 0(consensus timeout).

  * corosync.conf:

totem {
   version: 2
   secauth: off
   cluster_name: 1498d523
   transport: udpu
   token_retransmits_before_loss_const: 10
   token: 10000
}

logging {
   debug: on
}

quorum {
   provider: corosync_votequorum
}

nodelist {
   node {
     ring0_addr: 10.220.73.6
   }
   node {
     ring0_addr: 10.220.73.7
   }
   node {
     ring0_addr: 10.220.73.3
   }
}


Stefano Panella | 12 Apr 14:45 2016

Help with corosync and GFS2 on multi network setup

Hi everybody,

We have been using corosync directly to provide clustering for GFS2 on our CentOS 7.2 pools with only one
network interface, and all has been working great so far!

We now have a new set-up with two network interfaces for every host in the cluster:
A -> 1 Gbit (the one we would like corosync to use, 10.220.88.X)
B -> 10 Gbit (used for iscsi connection to storage, 10.220.246.X)

When we run corosync in this mode, the logs get continuously spammed with messages like these:

[12880] cl15-02 corosyncdebug   [TOTEM ] entering GATHER state from 0(consensus timeout).
[12880] cl15-02 corosyncdebug   [TOTEM ] Creating commit token because I am the rep.
[12880] cl15-02 corosyncdebug   [TOTEM ] Saving state aru 10 high seq received 10
[12880] cl15-02 corosyncdebug   [MAIN  ] Storing new sequence id for ring 5750
[12880] cl15-02 corosyncdebug   [TOTEM ] entering COMMIT state.
[12880] cl15-02 corosyncdebug   [TOTEM ] got commit token
[12880] cl15-02 corosyncdebug   [TOTEM ] entering RECOVERY state.
[12880] cl15-02 corosyncdebug   [TOTEM ] TRANS [0] member 10.220.88.41:
[12880] cl15-02 corosyncdebug   [TOTEM ] TRANS [1] member 10.220.88.47:
[12880] cl15-02 corosyncdebug   [TOTEM ] position [0] member 10.220.88.41:
[12880] cl15-02 corosyncdebug   [TOTEM ] previous ring seq 574c rep 10.220.88.41
[12880] cl15-02 corosyncdebug   [TOTEM ] aru 10 high delivered 10 received flag 1
[12880] cl15-02 corosyncdebug   [TOTEM ] position [1] member 10.220.88.47:
[12880] cl15-02 corosyncdebug   [TOTEM ] previous ring seq 574c rep 10.220.88.41
[12880] cl15-02 corosyncdebug   [TOTEM ] aru 10 high delivered 10 received flag 1

[12880] cl15-02 corosyncdebug   [TOTEM ] Did not need to originate any messages in recovery.
[12880] cl15-02 corosyncdebug   [TOTEM ] got commit token
[12880] cl15-02 corosyncdebug   [TOTEM ] Sending initial ORF token
[12880] cl15-02 corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1
count 0, aru 0
[12880] cl15-02 corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
[12880] cl15-02 corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1
count 1, aru 0
[12880] cl15-02 corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
[12880] cl15-02 corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1
count 2, aru 0
[12880] cl15-02 corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
[12880] cl15-02 corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1
count 3, aru 0
[12880] cl15-02 corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
[12880] cl15-02 corosyncdebug   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
[12880] cl15-02 corosyncdebug   [TOTEM ] Resetting old ring state
[12880] cl15-02 corosyncdebug   [TOTEM ] recovery to regular 1-0
[12880] cl15-02 corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
Apr 11 16:19:54 [13372] cl15-02 pacemakerd:     info: pcmk_quorum_notification: Membership 22352: quorum
retained (2)
Apr 11 16:19:54 [13378] cl15-02       crmd:     info: pcmk_quorum_notification: Membership 22352: quorum
retained (2)
[12880] cl15-02 corosyncdebug   [TOTEM ] entering OPERATIONAL state.
[12880] cl15-02 corosyncnotice  [TOTEM ] A new membership (10.220.88.41:22352) was formed. Members
[12880] cl15-02 corosyncdebug   [SYNC  ] Committing synchronization for corosync configuration map access
Apr 11 16:19:54 [13373] cl15-02        cib:     info: cib_process_request:      Forwarding cib_modify operation for
section nodes to master (origin=local/crmd/27157)
[12880] cl15-02 corosyncdebug   [CMAP  ] Not first sync -> no action
Apr 11 16:19:54 [13373] cl15-02        cib:     info: cib_process_request:      Forwarding cib_modify operation for
section status to master (origin=local/crmd/27158)
[12880] cl15-02 corosyncdebug   [CPG   ] got joinlist message from node 0x2
[12880] cl15-02 corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.220.88.41) ; members(old:2 left:0)
[12880] cl15-02 corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.220.88.47) ; members(old:2 left:0)
[12880] cl15-02 corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.220.88.41) ; members(old:2 left:0)
[12880] cl15-02 corosyncdebug   [CPG   ] got joinlist message from node 0x1
[12880] cl15-02 corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
Apr 11 16:19:54 [13373] cl15-02        cib:     info: cib_process_request:      Completed cib_modify operation for
section nodes: OK (rc=0, origin=cl15-02/crmd/27157, version=0.18.22)
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[0] group:clvmd, ip:r(0) ip(10.220.88.41) , pid:35677
Apr 11 16:19:54 [13373] cl15-02        cib:     info: cib_process_request:      Completed cib_modify operation for
section status: OK (rc=0, origin=cl15-02/crmd/27158, version=0.18.22)
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[1] group:dlm:ls:clvmd\x00, ip:r(0)
ip(10.220.88.41) , pid:34995
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[2] group:dlm:controld\x00, ip:r(0)
ip(10.220.88.41) , pid:34995
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[3] group:crmd\x00, ip:r(0) ip(10.220.88.41)
, pid:13378
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[4] group:attrd\x00, ip:r(0)
ip(10.220.88.41) , pid:13376
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[5] group:stonith-ng\x00, ip:r(0)
ip(10.220.88.41) , pid:13374
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[6] group:cib\x00, ip:r(0) ip(10.220.88.41) , pid:13373
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[7] group:pacemakerd\x00, ip:r(0)
ip(10.220.88.41) , pid:13372
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[8] group:crmd\x00, ip:r(0) ip(10.220.88.47)
, pid:12879
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[9] group:attrd\x00, ip:r(0)
ip(10.220.88.47) , pid:12877
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[10] group:stonith-ng\x00, ip:r(0)
ip(10.220.88.47) , pid:12875
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[11] group:cib\x00, ip:r(0) ip(10.220.88.47)
, pid:12874
[12880] cl15-02 corosyncdebug   [CPG   ] joinlist_messages[12] group:pacemakerd\x00, ip:r(0)
ip(10.220.88.47) , pid:12873
[12880] cl15-02 corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice:
No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
[12880] cl15-02 corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 3 flags: 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice:
No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
[12880] cl15-02 corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 3 state=2, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
[12880] cl15-02 corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
[12880] cl15-02 corosyncdebug   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice:
No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
[12880] cl15-02 corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
[12880] cl15-02 corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
[12880] cl15-02 corosyncdebug   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
[12880] cl15-02 corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] node 3 state=2, votes=1, expected=3
[12880] cl15-02 corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
[12880] cl15-02 corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
[12880] cl15-02 corosyncnotice  [QUORUM] Members[2]: 1 2
[12880] cl15-02 corosyncdebug   [QUORUM] sending quorum notification to (nil), length = 56
[12880] cl15-02 corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
[12880] cl15-02 corosyncdebug   [TOTEM ] waiting_trans_ack changed to 0
[12880] cl15-02 corosyncdebug   [QUORUM] got quorate request on 0x7f5a907749a0
[12880] cl15-02 corosyncdebug   [TOTEM ] entering GATHER state from 11(merge during join).

We do not see these messages when there is only a single network interface in the systems.

--------------------------------------------------------------------------------------
These are the network configurations on the three hosts:

[root <at> cl15-02 ~]# ifconfig | grep inet
        inet 10.220.88.41  netmask 255.255.248.0  broadcast 10.220.95.255
        inet 10.220.246.50  netmask 255.255.255.0  broadcast 10.220.246.255
        inet 127.0.0.1  netmask 255.0.0.0

[root <at> cl15-08 ~]# ifconfig | grep inet
        inet 10.220.88.47  netmask 255.255.248.0  broadcast 10.220.95.255
        inet 10.220.246.51  netmask 255.255.255.0  broadcast 10.220.246.255
        inet 127.0.0.1  netmask 255.0.0.0

[root <at> cl15-09 ~]# ifconfig | grep inet
        inet 10.220.88.48  netmask 255.255.248.0  broadcast 10.220.95.255
        inet 10.220.246.59  netmask 255.255.255.0  broadcast 10.220.246.255
        inet 127.0.0.1  netmask 255.0.0.0

-----------------------------------------------------------------------------------
corosync-quorumtool output:

[root <at> cl15-02 ~]# corosync-quorumtool
Quorum information
------------------
Date:             Mon Apr 11 15:46:26 2016
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          18952
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 cl15-02 (local)
         2          1 cl15-08
         3          1 cl15-09

---------------------------------------------------------------------------
/etc/corosync/corosync.conf:

[root <at> cl15-02 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: gfs_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: cl15-02
        nodeid: 1
    }

    node {
        ring0_addr: cl15-08
        nodeid: 2
    }

    node {
        ring0_addr: cl15-09
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    debug: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
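
One thing we are not sure about, and would appreciate confirmation on: as far 
as we understand, with transport udpu corosync binds to the local interface 
whose address matches what ring0_addr resolves to, so if cl15-02/cl15-08/cl15-09 
ever resolve to the 10.220.246.x storage addresses, the ring could end up on the 
iSCSI network or flap between interfaces. Would pinning the nodelist to explicit 
addresses, along these lines, be the recommended way to force corosync onto the 
1 Gbit network?

nodelist {
    node {
        ring0_addr: 10.220.88.41
        nodeid: 1
    }

    node {
        ring0_addr: 10.220.88.47
        nodeid: 2
    }

    node {
        ring0_addr: 10.220.88.48
        nodeid: 3
    }
}

(Or keep the hostnames and make sure /etc/hosts on every node resolves them to 
the 10.220.88.x addresses.)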


Daniel Dehennin | 11 Apr 14:29 2016

GFS2 and LVM stripes

Hello,

My OpenNebula cluster has a 4TB GFS2 logical volume supported by two
physical volumes (2TB each).

The result is that nearly all I/O goes to a single PV.

Now I'm looking at a way to convert the linear LV to a striped one and have
only found the possibility of going via a mirror[1].

Do you have any advice on the use of GFS2 over striped LVM?
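
The fallback I can think of (a sketch only, with made-up names, and assuming I 
can temporarily free or add enough space) is to create a second LV striped 
across both PVs, make a fresh GFS2 on it, and copy the data over during a 
maintenance window:

lvcreate --type striped -i 2 -I 64 -L 4T -n lv_gfs2_striped vg_one /dev/sdb /dev/sdc
mkfs.gfs2 -p lock_dlm -t mycluster:datastore -j 3 /dev/vg_one/lv_gfs2_striped
# then rsync the data across and switch the mount over

but that obviously needs the space and the downtime, hence the question about 
a cleaner conversion.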

Regards.

Footnotes: 
[1]  http://community.hpe.com/t5/System-Administration/Need-to-move-the-data-from-Linear-LV-to-stripped-LV-on-RHEL-5-7/td-p/6134323

-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
Daniel Dehennin | 8 Apr 11:21 2016

GFS2: debugging I/O issues

Hello,

On our virtualisation infrastructure we have a 4TB GFS2 filesystem over a SAN.

For the last week or two we have been facing read I/O issues: 5k or 6k IOPS with
an average block size of 5kB.

I've been looking into possible causes and haven't found anything yet, so my
question is:

    Is it possible that filling the GFS2 filesystem beyond 80% can produce
    such a workload?
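
In case it is useful for diagnosing this, the glock state of the filesystem can 
be dumped through debugfs (a sketch; the directory is named after the 
filesystem's cluster:fsname lock table, which is a placeholder here):

mount -t debugfs none /sys/kernel/debug 2>/dev/null
# holders flagged W are waiting, which is a rough sign of lock contention
cat /sys/kernel/debug/gfs2/mycluster:datastore/glocks | head -50

One theory worth checking (just a guess) is that a nearly full filesystem 
spends more time scanning resource groups for free blocks, which could show up 
as extra small reads.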

Regards.

-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
Bob Peterson | 29 Mar 16:34 2016

Re: fsck.gfs2 The root dinode block is destroyed.

----- Original Message -----
> Good Morning,
> 
> We have a large cluster with 50 gfs2 SAN mounts.  The mounts range in
> size from 1TB to 15TB each.  We have some with 6-8TB of data but most
> average around 3TB used right now.  We were doing network testing a
> while back to check our redundancy in case of a switch failure, and the
> tests failed..  multiple times.  We ended up having the SAN mounts yanked
> out from under the cluster.  Long story short, we seem to have
> corruption.  I can still bring the volumes up with the cluster but
> when I take everything down and do a fsck I get the following:
> 
> 
> (ran with fsck -n /dev/$device)
> 
> Found a copy of the root directory in a journal at block: 0x501ca.
> Damaged root dinode not fixed.
> The root dinode should be at block 0x2f3b98b7 but it seems to be destroyed.
> Found a copy of the root directory in a journal at block: 0x501d2.
> Damaged root dinode not fixed.
> The root dinode should be at block 0x28a3ac7f but it seems to be destroyed.
> Found a copy of the root directory in a journal at block: 0x501da.
> Damaged root dinode not fixed.
> Unable to locate the root directory.
> Can't find any dinodes that might be the root; using master - 1.
> Found a possible root at: 0x16
> The root dinode block is destroyed.
> At this point I recommend reinitializing it.
> Hopefully everything will later be put into lost+found.
> The root dinode was not reinitialized; aborting.
> 
> 
> This particular device had 4698 "seems to be destroyed..  found a
> copy" messages before the final, "Can't find any dinodes" message.  I
> fear that we have a number of mounts in this state.
> 
>  Is there any way to recover?  Thanks in advance.

Hi Megan,

I can't tell what's "really" going on unless I examine the GFS2 file system
metadata up close. If you save the metadata (gfs2_edit savemeta <device> <file>)
and also the first 1MB of the block device, and somehow get it to me, I might
be able to figure out what's going on, how it got that way, and what to do to
recover it. Ordinarily, the root directory appears early in the metadata and
it should not be deleted. What was the history of the file system? Was it
converted from GFS1 with gfs2_convert or something?
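
Concretely, something along these lines (the device and output paths are
placeholders):

# save the GFS2 metadata (not the file contents) to a file
gfs2_edit savemeta /dev/mapper/vg_san-lv_gfs2 /tmp/lv_gfs2.meta

# save the first 1MB of the block device
dd if=/dev/mapper/vg_san-lv_gfs2 of=/tmp/lv_gfs2.first1M bs=1M count=1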

Regards,

Bob Peterson
Red Hat File Systems


Megan . | 29 Mar 15:52 2016

fsck.gfs2 The root dinode block is destroyed.

Good Morning,

We have a large cluster with 50 gfs2 SAN mounts.  The mounts range in
size from 1TB to 15TB each.  We have some with 6-8TB of data but most
average around 3TB used right now.  We were doing network testing a
while back to check our redundancy in case of a switch failure, and the
tests failed..  multiple times.  We ended up having the SAN mounts yanked
out from under the cluster.  Long story short, we seem to have
corruption.  I can still bring the volumes up with the cluster but
when I take everything down and do a fsck I get the following:

(ran with fsck -n /dev/$device)

Found a copy of the root directory in a journal at block: 0x501ca.
Damaged root dinode not fixed.
The root dinode should be at block 0x2f3b98b7 but it seems to be destroyed.
Found a copy of the root directory in a journal at block: 0x501d2.
Damaged root dinode not fixed.
The root dinode should be at block 0x28a3ac7f but it seems to be destroyed.
Found a copy of the root directory in a journal at block: 0x501da.
Damaged root dinode not fixed.
Unable to locate the root directory.
Can't find any dinodes that might be the root; using master - 1.
Found a possible root at: 0x16
The root dinode block is destroyed.
At this point I recommend reinitializing it.
Hopefully everything will later be put into lost+found.
The root dinode was not reinitialized; aborting.

This particular device had 4698 "seems to be destroyed..  found a
copy" messages before the final, "Can't find any dinodes" message.  I
fear that we have a number of mounts in this state.

 Is there any way to recover?  Thanks in advance.


Robert Hayden | 18 Mar 14:24 2016

ACPI like feature on RHEL 7 with Pacemaker?

I was testing fence_ipmilan on a RHEL 7 cluster and noticed that running the fence agent with the option to power off the remote node appears to cleanly shut down the remote node instead of removing power immediately.  I suspect something like ACPI is intercepting the power off and trying to stop RHEL 7 nicely.  I read through the documentation and did not see any mention of turning off acpid in RHEL 7 (maybe because it does not exist there), unlike in the RHEL 6 documentation.

 

Is the proper way to disable ACPI-like functionality to use the kernel command line option acpi=off?  I am curious what others are using with HP iLO.

 

Example of the remote node's /var/log/messages … the RHCS_TESTING lines are from my unit testing scripts, which insert the command I am running.  You can see where I attempt to power off the node and then how the system starts to cleanly stop the node, along with Pacemaker.

 

Mar 17 15:38:37 node2 RHCS_TESTING: .

Mar 17 15:38:37 node2 RHCS_TESTING: CMD/MSG: fence_ipmilan -P -a x.x.x.x -l XXXXXX -p XXXXX  -L OPERATOR -A password -o off

Mar 17 15:38:37 node2 RHCS_TESTING: .

Mar 17 15:38:37 node2 systemd-logind: Removed session 993.

Mar 17 15:38:37 node2 systemd: Removed slice user-0.slice.

Mar 17 15:38:37 node2 systemd: Stopping user-0.slice.

Mar 17 15:38:37 node2 systemd-logind: Power key pressed.

Mar 17 15:38:37 node2 systemd-logind: Powering Off...

Mar 17 15:38:37 node2 systemd-logind: System is powering down.

Mar 17 15:38:37 node2 systemd: Stopping Availability of block devices...

Mar 17 15:38:37 node2 systemd: Stopping LVM2 PV scan on device 8:2...

Mar 17 15:38:37 node2 systemd: Stopping Pacemaker High Availability Cluster Manager...

Mar 17 15:38:37 node2 systemd: Deactivating swap /dev/mapper/vg00-swaplv00...

Mar 17 15:38:37 node2 pacemakerd[79586]:  notice: Invoking handler for signal 15: Terminated

Mar 17 15:38:37 node2 pacemakerd[79586]:  notice: Shuting down Pacemaker

Mar 17 15:38:37 node2 pacemakerd[79586]:  notice: Stopping crmd: Sent -15 to process 79592

Mar 17 15:38:37 node2 crmd[79592]:  notice: Invoking handler for signal 15: Terminated

Mar 17 15:38:37 node2 crmd[79592]:  notice: Requesting shutdown, upper limit is 1200000ms

Mar 17 15:38:37 node2 crmd[79592]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]

Mar 17 15:38:37 node2 multipathd: mpathb: stop event checker thread (140595795502848
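
From the "Power key pressed" / "Powering Off..." lines it looks like systemd-logind, rather than acpid, is handling the power event, so I am wondering whether the right knob is logind's power-key handling instead of acpi=off.  Something like this (a sketch, not something I have verified with iLO yet):

# /etc/systemd/logind.conf
[Login]
HandlePowerKey=ignore

# then
systemctl restart systemd-logind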

 

 

Thanks

Robert

Elsaid Younes | 11 Mar 04:55 2016

Copying the result continuously.


Hi all,

I wish to be able to run a long simulation with the GROMACS program, using MPI. I want to modify the input data after every sub-task.
I think that is the meaning of the following code, which is part of my script.
cat <<EOF > copyfile.sh
#!/bin/sh
cp -p result*.dat $SLURM_SUBMIT_DIR
EOF
chmod u+x copyfile.sh
srun -n $SLURM_NNODES -N $SLURM_NNODES cp copyfile.sh $SNIC_TMP
And I have to srun copyfile.sh at the end of every processor's run:
srun -n $SLURM_NNODES -N $SLURM_NNODES copyfile.sh
Is there something wrong? And what is the meaning of result* here?

Thanks in advance,
/Elsaid
Shreekant Jena | 5 Mar 07:46 2016

CMAN Failed to start on Secondary Node

Dear All,

I have a 2-node cluster, but after a reboot the secondary node is showing as offline, and cman fails to start.

Please find below the logs from the secondary node:

root <at> EI51SPM1 cluster]# clustat
msg_open: Invalid argument
Member Status: Inquorate

Resource Group Manager not running; no service information available.

Membership information not available
[root <at> EI51SPM1 cluster]# tail -10 /var/log/messages
Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing connection.
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
[root <at> EI51SPM1 cluster]#
[root <at> EI51SPM1 cluster]# cman_tool status
Protocol version: 5.0.1
Config version: 166
Cluster name: IVRS_DB
Cluster ID: 9982
Cluster Member: No
Membership state: Joining
[root <at> EI51SPM1 cluster]# cman_tool nodes
Node  Votes Exp Sts  Name
[root <at> EI51SPM1 cluster]#
[root <at> EI51SPM1 cluster]#
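
Could this be related to the two-node quorum settings or to a cluster.conf 
mismatch between the nodes?  As far as I know, a two-node cman cluster needs 
the special case enabled, i.e. something like the following in cluster.conf 
(sketch only):

<cman two_node="1" expected_votes="1"/>

and the config version (166 above) should match what the primary node is running.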


Thanks & regards 
SHREEKANTA JENA
