John Anthony | 23 May 23:50

Changing the name of a resource

All:


I wanted to change the name of a DRBD resource to better reflect its functional use.

In a test setup, I tried changing the config file name and the resource name on the secondary side, then adjusting and reconnecting, along with a few slightly different sequences, to see what works.

The result is that when reconnecting after the rename, the primary side seems to think the device is out of sync and becomes the sync source, but the secondary side will have none of it and goes into the 'WFConnection' state.

Below is the log extract from when the secondary side is connected after the resource is renamed. (From the log it seems that it does become the sync target and actually syncs, but then objects to something.)

Is there a known sequence of steps that works in this case? Need I experiment any further? What else can I try?

-JA

block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [32028])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 96
block drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [14379])
block drbd0: data-integrity-alg: md5
block drbd0: drbd_sync_handshake:
block drbd0: self A3DA23FAA568B544:0000000000000000:CC820BAF75D3008A:CC810BAF75D3008B bits:0 flags:0
block drbd0: peer 518B27BC03329A91:A3DB23FAA568B545:A3DA23FAA568B545:CC820BAF75D3008B bits:721870 flags:0
block drbd0: Did not got last syncUUID packet, corrected:
block drbd0: peer 518B27BC03329A91:A3DA23FAA568B545:CC820BAF75D3008B:CC820BAF75D3008B bits:721870 flags:0
block drbd0: uuid_compare()=-1 by rule 51
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 239(1), total 239; compression: 99.9%
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 239(1), total 239; compression: 99.9%
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: updated sync uuid A3DB23FAA568B544:0000000000000000:CC820BAF75D3008A:CC810BAF75D3008B
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
block drbd0: Began resync as SyncTarget (will sync 2887480 KB [721870 bits set]).
block drbd0: Resync done (total 9 sec; paused 0 sec; 320828 K/sec)
block drbd0: 100 % had equal check sums, eliminated: 2887480K; transferred 0K total 2887480K
block drbd0: updated UUIDs 518B27BC03329A90:0000000000000000:A3DB23FAA568B544:A3DA23FAA568B545
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
block drbd0: bitmap WRITE of 0 pages took 0 jiffies
block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd0: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( TearDown -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: conn( WFConnection -> Disconnecting )
block drbd0: Discarding network configuration.
block drbd0: Connection closed
block drbd0: conn( Disconnecting -> StandAlone )
block drbd0: receiver terminated
block drbd0: Terminating receiver thread
block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [32028])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
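
For reading a log like the one above, it can help to reduce it to the connection-state changes alone. The following is a minimal Python sketch, not a DRBD tool: the function name conn_transitions and the SAMPLE text are made up for illustration, and the regular expression assumes only the "conn( Old -> New )" format visible in the extract.

```python
import re

# Matches the "conn( Old -> New )" fragment that appears in the log on each
# connection-state change. "disk(", "pdsk(" and "peer(" fragments are ignored.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(log_text):
    """Return the (old, new) connection-state pairs in log order."""
    return [
        match.groups()
        for line in log_text.splitlines()
        for match in CONN_RE.finditer(line)
    ]


# Four lines copied from the log extract above.
SAMPLE = """\
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
block drbd0: conn( TearDown -> Unconnected )
"""

if __name__ == "__main__":
    for old, new in conn_transitions(SAMPLE):
        print(f"{old} -> {new}")
```

Run over the full extract, this makes the sequence easier to see: the resync completes and the state reaches Connected before the connection is torn down again.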




_______________________________________________
drbd-user mailing list
drbd-user@...
http://lists.linbit.com/mailman/listinfo/drbd-user
Zev Weiss | 23 May 22:14

Recovering from erroneous sync state

Hi,

I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 --
getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). 
Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if
there were any known ways of recovering it without doing anything disruptive to the other resources (e.g.
rebooting or unloading the kernel module).

I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions?  It doesn't really matter
to me if it wipes the resource or anything, I'd just like to have my test device back in a working state
without disturbing anything else.

Thanks,
Zev Weiss
Mahmoud Alshinhab | 23 May 17:12

Need description and help

Dear all, kindly help: I can't create the DRBD device because of this
error message:

'drbdmeta /dev/drbd0 v08 /dev/sda2 internal create-md' terminated with
exit code 40

global {
        usage-count yes;
}
common {
        syncer {
                rate 10M;
        }
}
resource r0 {
        protocol C;
        on master {
                device    /dev/drbd1;
                disk      /dev/sdb1;
                address   192.168.1.1:7789;
                meta-disk internal;
        }
        on slave {
                device    /dev/drbd1;
                disk      /dev/sdb1;
                address   192.168.1.2:7789;
                meta-disk internal;
        }
}

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          38      305203+  83  Linux
/dev/sda2              39        2349    18563107+  83  Linux
/dev/sda3            2350        2610     2096482+  82  Linux swap / Solaris

Disk /dev/sdb: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         652     5237158+  83  Linux

[root@master ~]# drbdadm create-md r0
v08 Magic number not found
md_offset 5362843648
al_offset 5362810880
bm_offset 5362647040

Found ext3 filesystem which uses 5237156 kB
current configuration leaves usable 5236960 kB

Device size would be truncated, which
would corrupt data and result in
'access beyond end of device' errors.
You need to either
   * use external meta data (recommended)
   * shrink that filesystem first
   * zero out the device (destroy the filesystem)
Operation refused.

Command 'drbdmeta /dev/drbd1 v08 /dev/sdb1 internal create-md'
terminated with exit code 40
drbdadm aborting
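
The numbers in the refusal above can be checked by hand. The following is a small Python sketch of that arithmetic, using only values copied from the output. Two things are assumptions rather than facts from the post: that "kB" in the output means 1024 bytes, and that the metadata superblock at md_offset occupies 4 KiB. Both are chosen because they make the figures line up.

```python
# Values copied from the drbdadm create-md output above.
BM_OFFSET_BYTES = 5362647040  # bm_offset: lowest of the three metadata offsets
MD_OFFSET_BYTES = 5362843648  # md_offset: highest of the three metadata offsets
FS_SIZE_KB = 5237156          # "Found ext3 filesystem which uses 5237156 kB"
USABLE_KB = 5236960           # "current configuration leaves usable 5236960 kB"

# Assumption (not stated in the post): the block at md_offset is 4 KiB long.
SUPERBLOCK_BYTES = 4096


def kib(n_bytes):
    """Convert bytes to KiB (assumes the tool's 'kB' means 1024 bytes)."""
    return n_bytes // 1024


# Everything below bm_offset stays available for data.
usable_from_offset = kib(BM_OFFSET_BYTES)

# How much the existing filesystem would have to shrink.
shortfall_kb = FS_SIZE_KB - USABLE_KB

# Space from bm_offset to the end of the assumed 4 KiB superblock.
metadata_span_kb = kib(MD_OFFSET_BYTES + SUPERBLOCK_BYTES - BM_OFFSET_BYTES)

if __name__ == "__main__":
    print("usable kB from bm_offset:", usable_from_offset)
    print("filesystem is too large by (kB):", shortfall_kb)
    print("internal metadata span (kB):", metadata_span_kb)
```

Under those assumptions the usable size equals bm_offset in KiB, and the filesystem is 196 kB larger than what internal metadata would leave, which is why the tool offers external metadata, shrinking the filesystem, or zeroing the device as the three ways out.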

-- 
Eng. Mahmoud Alshinhab
NOC System Engineer
RHCT -RedHat Certified Technician-
Fedora Ambassador
Wiki : https://fedoraproject.org/wiki/User:Tuxawy
mahmoud.alshinhab@...
tuxawy@...
Keith Christian | 23 May 17:29

Moving a DRBD cluster from physical machines to VMWare machines

Looking for guidance from Linbit, or the list, on the case where DRBD's
disks remain on two separate physical servers but the peer servers
themselves are virtualized.

For instance: writes of approximately 18,000 blocks per second during the
busy hours, as reported by SAR.  How much of a performance hit is there
with the same hardware after VMWare is in the picture?  (I'm sure
there are VMWare config settings to optimize disk reads/writes for an
application like DRBD.)
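
To put the 18,000 blocks per second figure into more familiar units, here is a rough Python conversion. It assumes that the blocks reported by sar are 512-byte sectors, which is what the sar manual page describes for its block I/O counters; if the figure came from a different counter, the result would differ. The function name is made up for this sketch.

```python
# Assumption: sar counts 512-byte blocks (sectors).
SECTOR_BYTES = 512


def blocks_per_sec_to_bytes(blocks_per_sec, block_bytes=SECTOR_BYTES):
    """Convert a blocks-per-second rate to bytes per second."""
    return blocks_per_sec * block_bytes


if __name__ == "__main__":
    rate = blocks_per_sec_to_bytes(18000)
    print(f"{rate / (1024 * 1024):.2f} MiB/s")   # write rate in MiB per second
    print(f"{rate * 8 / 1_000_000:.1f} Mbit/s")  # same rate as network traffic
```

Under that assumption the write load is a little under 9 MiB/s, or roughly 74 Mbit/s of replication traffic before protocol overhead.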

Thanks.

=====Keith
Felix Frank | 21 May 10:27

Re: drbd wrong lower device doubt

No.

On 05/21/2012 10:24 AM, 陈楠 wrote:
> Actually, we just have one resource. /dev/vg01/share and /dev/vg02/share
> has the same size.
> Business system A
> 
> resource r0 {
>         meta-disk internal;
>         device /dev/drbd0;
>         disk /dev/vg01/share;  # <- Partition A: always vg01
>         on NodeA {
>                 address 2.2.2.150:7788;
>         }
>         on NodeB {
> 		address 2.2.2.151:7788;
>         }
> }
> 
> Business system B
> 
> resource r0 {
>         meta-disk internal;
>         device /dev/drbd0;
>         disk /dev/vg02/share;  # <- Partition A: always vg01
>         on NodeA {
>                 address 2.2.2.150:7788;
>         }
>         on NodeB {
> 		address 2.2.2.151:7788;
>         }
> }

Both nodes:

resource r0 {
        meta-disk internal;
        device /dev/drbd0;
        on NodeA {
	        disk /dev/vg01/share;
                address 2.2.2.150:7788;
        }
        on NodeB {
	        disk /dev/vg02/share;
		address 2.2.2.151:7788;
        }
}

It's OK for your nodes to be different, but *do include that in your
config*. Save yourself from pain down the road.
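
One way to catch this kind of drift is to compare the two nodes' copies of the file before trusting them. The following is a minimal Python sketch using the standard difflib module. The config_diff name and the two shortened config strings are illustrative only; how the remote copy is fetched is left out.

```python
import difflib


def config_diff(conf_a, conf_b):
    """Return the lines that differ between two config texts.

    Lines only in conf_a are prefixed "- ", lines only in conf_b "+ ".
    An empty list means the two copies are identical.
    """
    delta = difflib.ndiff(conf_a.splitlines(), conf_b.splitlines())
    return [line for line in delta if line.startswith(("- ", "+ "))]


# Shortened versions of the two per-host files from the original question.
HOST_A = """\
resource r0 {
    on A {
        disk /dev/vg01/share;
    }
    on B {
        disk /dev/vg01/share;
    }
}"""

HOST_B = HOST_A.replace("vg01", "vg02")

if __name__ == "__main__":
    for line in config_diff(HOST_A, HOST_B):
        print(line)
```

With a single shared file that names each node's disk inside its own "on" section, as in the config above, this check comes back empty.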

Regards,
Felix
_______________________________________________

Reasons not to use allow-two-primaries with DRBD

Hello,

I am in the process of setting up DRBD on my servers, the network
bandwidth being the bottleneck.  After having evaluated GlusterFS I
realised that I need the instant read access offered by DRBD.

Logically I am able to separate partitions that would require access
from both nodes, and partitions where an asynchronous master-slave
sync is sufficient.  But as far as I understand, the benefits from
using Protocol A instead of C are limited, when the network is stable.

My question:
Are there any additional benefits from NOT using two primaries, or
additional risks when using them? E.g. would there be a significant
performance gain from using ext4 instead of GFS2/OCFS2? Anything else I
should take into consideration?

Thanks for any ideas or pointers where to look.

Karel
Chris Dickson | 19 May 14:42

DRBD block script with Xen XL toolstack

Hello,

Currently, Xen's XL toolstack (unlike Xend) does not seem to support the drbd disk type by default, so the automatic promotion/demotion of DRBD devices is not available. Is there a way to get the behavior of Xend's current block-drbd script with XL?

Thanks,

Chris
陈楠 | 18 May 09:19

drbd wrong lower device doubt


Hi all, I have some doubts about DRBD. I have configured two servers, Host A and Host B. Host A's DRBD configuration is like this:
resource r0 {
        on A {
                device /dev/drbd0;
                disk /dev/vg01/share;
                address 2.2.2.150:7788;
                meta-disk internal;
        }
        on B {
                device /dev/drbd0;
                disk /dev/vg01/share;
                address 2.2.2.151:7788;
                meta-disk internal;
        }
}
and Host B is like this:
resource r0 {
        on A {
                device /dev/drbd0;
                disk /dev/vg02/share;
                address 2.2.2.150:7788;
                meta-disk internal;
        }
        on B {
                device /dev/drbd0;
                disk /dev/vg02/share;
                address 2.2.2.151:7788;
                meta-disk internal;
        }
}
You can see that the configuration files on Host A and Host B are not the same. The actual lower device on Host A is /dev/vg01/share and on Host B it is /dev/vg02/share, so on each server the lower device specified for the other host is wrong. The network settings are correct. I set Host A's disk state to UpToDate and Host B's disk state to Inconsistent, and I find that Host A is syncing to Host B. Why does it work normally when I have configured the wrong lower device?



Cristian Caceres | 17 May 23:14

I need to reset the DRBD service

Hi all, I have little experience with DRBD; in fact, I inherited a system with this implementation. My problem is that we had to restart one of the nodes, the secondary, and now I see that the two nodes are not connected. I have looked for a solution without success. If someone can help me decipher this, I would appreciate it.


the status of each server is the following:

Primary Server:

drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root-wDAxgLocYoTaKzBbEiVmYQ@public.gmane.orgltiexportfoods.com, 2008-12-23 13:00:05
m:res cs st ds p mounted fstype
0:??not-found?? StandAlone Primary/Unknown UpToDate/DUnknown -

Secondary Server:

drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root-wDAxgLocYoT0MrXhQeckh6qrw+pMJB36232DYX7GltoAvxtiuMwx3w@public.gmane.org, 2008-12-23 13:00:05
m:res cs st ds p mounted fstype
0:??not-found?? WFConnection Secondary/Unknown UpToDate/DUnknown B
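
The two status lines above follow the column header "m:res cs st ds p mounted fstype". As a reading aid, here is a small Python sketch that splits such a line into named fields. The parse_status_line function is hypothetical and written only against the two lines shown; it is not part of DRBD.

```python
def parse_status_line(line):
    """Split one device status line into named fields.

    Columns, per the header in the output: m:res, cs (connection state),
    st (local/peer role), ds (local/peer disk state). Later columns are ignored.
    """
    minor_res, cs, st, ds = line.split()[:4]
    minor, _, res = minor_res.partition(":")
    local_role, _, peer_role = st.partition("/")
    local_disk, _, peer_disk = ds.partition("/")
    return {
        "minor": int(minor),
        "res": res,
        "cs": cs,
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


# The two status lines quoted above.
PRIMARY = "0:??not-found?? StandAlone Primary/Unknown UpToDate/DUnknown -"
SECONDARY = "0:??not-found?? WFConnection Secondary/Unknown UpToDate/DUnknown B"

if __name__ == "__main__":
    for name, line in (("primary", PRIMARY), ("secondary", SECONDARY)):
        fields = parse_status_line(line)
        print(name, fields["cs"], fields["local_role"], fields["local_disk"])
```

Side by side, the fields show the asymmetry: the secondary is in WFConnection, waiting for its peer, while the primary is in StandAlone, a state in which a node has no active network configuration and is not trying to connect.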

Thanks..

rca
Matthew Bloch | 16 May 22:11

"PingAck not received" messages

I'm trying to understand a symptom for a client who uses drbd to run
sets of virtual machines between three pairs of servers (v1a/v1b,
v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
is buffered depending on what mode is chosen, and buffer settings.

Firstly, it surprised me that even in replication mode "A", the system
still seemed limited by the bandwidth between nodes.  I found this
out when the customer's bonded interface had flipped over to its 100Mb
backup connection, and suddenly they had I/O problems.  While I was
investigating this and running tests, I noticed that switching to mode A
didn't help, even when measuring short transfers that I'd expect would
fit into reasonable-sized buffers.  What kind of buffer size can I
expect from an "auto-tuned" DRBD?  It seems important to be able to
cover bursts without leaning on the network, so I'd like to know whether
that's possible with some special tuning.

The other problem is the "PingAck not received" messages that have been
littering the logs of the v3a/v3b servers for the last couple of weeks,
e.g. this has been happening every few hours for one DRBD or another:

May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did not arrive in time.
May 14 08:21:45 v3b kernel: [661127.875553] block drbd10: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 14 08:21:45 v3b kernel: [661127.875562] block drbd10: asender terminated
May 14 08:21:45 v3b kernel: [661127.875564] block drbd10: Terminating drbd10_asender
May 14 08:21:45 v3b kernel: [661127.875597] block drbd10: short read expecting header on sock: r=-512
May 14 08:21:45 v3b kernel: [661127.882896] block drbd10: Connection closed
May 14 08:21:45 v3b kernel: [661127.882899] block drbd10: conn( NetworkFailure -> Unconnected )
May 14 08:21:45 v3b kernel: [661127.882904] block drbd10: receiver terminated
May 14 08:21:45 v3b kernel: [661127.882908] block drbd10: Restarting drbd10_receiver
May 14 08:21:45 v3b kernel: [661127.882910] block drbd10: receiver (re)started
May 14 08:21:45 v3b kernel: [661127.882913] block drbd10: conn( Unconnected -> WFConnection )
May 14 08:21:46 v3b kernel: [661129.123506] block drbd10: Handshake successful: Agreed network protocol version 91
May 14 08:21:46 v3b kernel: [661129.123511] block drbd10: conn( WFConnection -> WFReportParams )
May 14 08:21:46 v3b kernel: [661129.123535] block drbd10: Starting asender thread (from drbd10_receiver [31418])
May 14 08:21:46 v3b kernel: [661129.123876] block drbd10: data-integrity-alg: <not-used>
May 14 08:21:46 v3b kernel: [661129.123898] block drbd10: drbd_sync_handshake:
May 14 08:21:46 v3b kernel: [661129.123900] block drbd10: self C5DC68A8AFD5BFEC:0000000000000000:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
May 14 08:21:46 v3b kernel: [661129.123903] block drbd10: peer F8BB238D22A7ACFF:C5DC68A8AFD5BFED:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
May 14 08:21:46 v3b kernel: [661129.123905] block drbd10: uuid_compare()=-1 by rule 50
May 14 08:21:46 v3b kernel: [661129.123908] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 14 08:21:46 v3b kernel: [661129.138101] block drbd10: conn( WFBitMapT -> WFSyncUUID )
May 14 08:21:46 v3b kernel: [661129.139563] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
May 14 08:21:46 v3b kernel: [661129.140282] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
May 14 08:21:46 v3b kernel: [661129.140286] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
May 14 08:21:46 v3b kernel: [661129.140292] block drbd10: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
May 14 08:21:47 v3b kernel: [661129.693954] block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
May 14 08:21:47 v3b kernel: [661129.693961] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
May 14 08:21:47 v3b kernel: [661129.693969] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
May 14 08:21:47 v3b kernel: [661129.694725] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 0 (0x0)

I've not been able to correlate these ping drops and reconnections to
any of:

1) interface capacity issues (a few times we might make a 400Mb spike,
but sometimes there's none at all);

2) loss of connectivity or ARP problems on the two servers' dedicated
DRBD interfaces (i.e. I've got an unbroken log of pings between the two
servers);

3) any kernel grumbles about the network interface, bonding, RAID or
anything remotely hardware-related.  Apart from the drbd messages
there's no other chatter from the kernel.

The customer's other two pairs of servers have been running 18 months
and not exhibited this behaviour.

The customer hasn't given me the data to show these blips (which are
anything from 2s-30s) correspond to any real performance problems and I
don't have access to the inside of their VMs to check for myself.  So my
questions are - would you expect these disconnections to cause
variations in I/O bandwidth or responsiveness?

And secondly, what should I be doing about it?  My unsatisfactory
response to the customer's worry is to reconnect all the drbds with a
longer ping-timeout, and in 10 hours it hasn't reoccurred, which is an
unusually long record.  I will be more convinced by the end of the day.

Even if that does solve these messages, I'm curious as to the cause.
We've not hit a network bandwidth ceiling, and so we've definitely not
hit an I/O ceiling (which is 4x146GB 15kRPM discs, RAID10, HP RAID).  I
can accept that some VMs will use more bandwidth than others, and so it
wouldn't be surprising that one VM on the machine was the "cause".

But when the disconnections happen, they appear to be completely random.
 Checking with grep/uniq -c, I see out of the 11 devices on the systems,
it happens pretty randomly (and drbd10 is just a test, getting
absolutely zero I/O).

      5 drbd0:
      5 drbd1:
     11 drbd2:
      8 drbd3:
     11 drbd4:
      4 drbd5:
      6 drbd6:
      7 drbd7:
      5 drbd8:
     14 drbd9:
     12 drbd10:
      7 drbd11:
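
For anyone who wants to reproduce a per-device tally like the one above without grep and uniq -c, here is a rough Python equivalent. The pingack_counts function is made up for this sketch, and of the sample lines only the drbd10 one is taken from the log earlier in this message; the drbd9 line is a hypothetical variant added to show grouping.

```python
import re
from collections import Counter

# Matches the kernel message seen in the log and captures the device name.
PINGACK_RE = re.compile(r"block (drbd\d+): PingAck did not arrive in time")


def pingack_counts(lines):
    """Count 'PingAck did not arrive in time' events per DRBD device."""
    return Counter(
        match.group(1)
        for line in lines
        if (match := PINGACK_RE.search(line))
    )


SAMPLE = [
    # Copied from the log above.
    "May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did not arrive in time.",
    # Hypothetical line for another device, for illustration only.
    "May 14 09:02:11 v3b kernel: [663553.000000] block drbd9: PingAck did not arrive in time.",
    # Unrelated message, ignored by the pattern.
    "May 14 08:21:45 v3b kernel: [661127.875562] block drbd10: asender terminated",
]

if __name__ == "__main__":
    for device, count in sorted(pingack_counts(SAMPLE).items()):
        print(f"{count:7d} {device}:")
```

In practice the list would come from the syslog file rather than a literal, for example by passing an open file object, since the function only needs an iterable of lines.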

So even if upping the ping time stops the problem, and even if the
effects of the disconnect/reconnect cycles are harmless - why might DRBD
exhibit these symptoms on one pair of servers, but not two other sets?
Is there some I/O pattern that might cause pings to get lost, even over
a lightly-loaded gigabit link?

Thanks for any insights in advance.

-- 
Matthew
Marcel Kraan | 16 May 21:40

NFS not starting with heartbeat

Hello,

I use 2 servers with CentOS 6.2.

But on one server (kvmstorage1), NFS does not start after a restart, or when I shut down heartbeat and restart it later.

# This is my haresources file. Heartbeat does not start all of the services I have listed in it.
Am I doing something wrong?

#kvmstorage1
cat /etc/ha.d/haresources
kvmstorage1.localdomain IPaddr::192.168.123.209/24/eth0 drbddisk::main Filesystem::/dev/drbd0::/datastore::ext4 nfs nfslock rpcidmapd mysql

ResourceManager[2506]:	2012/05/16_21:34:40 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /datastore ext4 start
Filesystem[2933]:	2012/05/16_21:34:40 INFO: Running start for /dev/drbd0 on /datastore
Filesystem[2921]:	2012/05/16_21:34:41 INFO:  Success
ResourceManager[2506]:	2012/05/16_21:34:41 info: Running /etc/init.d/nfslock  start
ResourceManager[2506]:	2012/05/16_21:34:41 info: Running /etc/init.d/rpcidmapd  start
ResourceManager[2506]:	2012/05/16_21:34:42 info: Running /etc/init.d/mysqld  start
May 16 21:34:43 kvmstorage1.localdomain heartbeat: [2489]: info: local HA resource acquisition completed (standby).
May 16 21:34:43 kvmstorage1.localdomain heartbeat: [1583]: info: Standby resource acquisition done [foreign].
May 16 21:34:43 kvmstorage1.localdomain heartbeat: [1583]: info: Initial resource acquisition complete (auto_failback)
May 16 21:34:43 kvmstorage1.localdomain heartbeat: [1583]: info: remote resource transition completed.

#kvmstorage2
cat /etc/ha.d/haresources
kvmstorage1.localdomain IPaddr::192.168.123.209/24/eth0 drbddisk::main Filesystem::/dev/drbd0::/datastore::ext4 nfs nfslock rpcidmapd mysql

Filesystem[16037]:	2012/05/16_21:33:53 INFO:  Resource is stopped
ResourceManager[15787]:	2012/05/16_21:33:53 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /datastore ext4 start
Filesystem[16117]:	2012/05/16_21:33:53 INFO: Running start for /dev/drbd0 on /datastore
Filesystem[16109]:	2012/05/16_21:33:53 INFO:  Success
ResourceManager[15787]:	2012/05/16_21:33:54 info: Running /etc/init.d/nfs  start
ResourceManager[15787]:	2012/05/16_21:33:54 info: Running /etc/init.d/nfslock  start
ResourceManager[15787]:	2012/05/16_21:33:54 info: Running /etc/init.d/mysqld  start
mach_down[15761]:	2012/05/16_21:33:56 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
May 16 21:33:56 kvmstorage2.localdomain heartbeat: [1528]: info: mach_down takeover complete.
mach_down[15761]:	2012/05/16_21:33:56 info: mach_down takeover complete for node kvmstorage1.localdomain.
