Scott Prater | 5 Mar 00:14 2014

[jgroups-users] Problem adding more than two nodes to a TCP cluster

I'm a new JGroups user, trying to get an existing application, built on 
top of Infinispan, working in a clustered (replication) configuration. 
I have five nodes, all on VMs in a private network.  I can start the 
first node up okay, and then the second node, but when I try to start 
the third node, fourth node, fifth node, these later nodes report this 
warning:

WARN 16:33:37.326 (TCP) JGRP000032: Node3-5858: no physical address for 
41c900b2-0a8c-6873-1d3b-0594577d57cd, dropping message

(with Node4, Node5, etc. in place of Node3)

After several of these, Infinispan shuts down with this exception:

Caused by: org.infinispan.CacheException: Initial state transfer timed 
out for cache MyRepository on Node3-5858
         at 
org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:221)

and my application shuts down.

Here's the config I am using for all the nodes:

https://gist.github.com/sprater/9357672

(In each node's config file, "MyNodeAddress" is replaced with that node's 
hostname.)
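For context, a minimal sketch of the kind of TCP discovery stack I'm describing (node1..node5 are placeholders for my real VM hostnames; the actual config is in the gist above):

```xml
<TCP bind_addr="MyNodeAddress"
     bind_port="7800" />
<TCPPING initial_hosts="node1[7800],node2[7800],node3[7800],node4[7800],node5[7800]"
         port_range="0"
         num_initial_members="5" />
```

My understanding is that every node's initial_hosts should list all five members; if it only lists the first couple, later joiners can end up without physical addresses for each other.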

I've tried using both UDP and TCP for discovery, with the same results, 
leading me to think it might be a lower-level problem.  I finally fell 
(Continue reading)

Bram Klein Gunnewiek | 27 Feb 13:46 2014

[jgroups-users] Can't recover from ifdown/ifup when a node uses a Linux bridge

We use JGroups in a set-up where all nodes use network cards that are 
configured as a bridge instead of a normal network card. We use Ubuntu 
as our OS; a typical Ubuntu config for a NIC looks like this:

auto eth0
iface eth0 inet dhcp

However, we use eth0 as a bridge and ours looks like this:

auto br0
iface br0 inet dhcp
     bridge_ports eth0
     bridge_stp off
     bridge_fd 0

The problem we have is that our application (using JGroups 3.4.1.Final) 
can't recover when we put the bridge down and up again (ifdown 
br0; ifup br0). JGroups does not reestablish the connection to the 
cluster and the node never gets merged back into it. When we do the 
exact same thing on a node with its network devices configured 
normally, JGroups does recover. Since JGroups seems to "eat" the 
exception, we also have no way of knowing that JGroups is in some sort 
of zombie state and needs to reconnect.

Here is a post from someone with the same problem: 
http://sourceforge.net/p/javagroups/mailman/message/9724641/

How should we handle these situations? Is this behavior caused by 
Linux drivers? I don't really understand why having the interfaces 
configured as a bridge causes different behavior. The output we get 
(Continue reading)
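In the meantime we are considering a workaround: since JGroups swallows the exception, poll the bridge state ourselves and force a reconnect when br0 comes back up. A sketch (the probe and reconnect hooks below are placeholders; in the real application the probe would call something like NetworkInterface.getByName("br0").isUp() and the hook would disconnect and reconnect our JChannel):

```java
import java.util.function.BooleanSupplier;

public class BridgeWatchdog {
    private final BooleanSupplier interfaceUp;  // e.g. polls br0 via NetworkInterface
    private final Runnable reconnect;           // e.g. channel.disconnect() + connect()
    private boolean wasUp = true;

    public BridgeWatchdog(BooleanSupplier interfaceUp, Runnable reconnect) {
        this.interfaceUp = interfaceUp;
        this.reconnect = reconnect;
    }

    /** Call periodically from a timer; fires the reconnect hook on a down->up transition. */
    public boolean poll() {
        boolean up = interfaceUp.getAsBoolean();
        boolean cameBack = !wasUp && up;
        wasUp = up;
        if (cameBack)
            reconnect.run();
        return cameBack;
    }
}
```

This only works around the symptom, of course; it doesn't explain why the bridge behaves differently from a plain NIC.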

sergez | 27 Feb 11:25 2014

[jgroups-users] JGroups communication in LAN

Hi! I've run into a problem.

I can launch the example from the Tutorial, and it is possible to test how a
bunch of nodes communicate.

But when I try to launch SimpleChat on two separate machines over the LAN,
nothing happens.

So, the question is: how do I program nodes separated over the LAN? Is the
SimpleChat example from the Tutorial supposed to run over the LAN without any
changes being made, or not? If not, it is not obvious from the Manual how to
program such things.

Thanks!

--
View this message in context: http://jgroups.1086181.n5.nabble.com/JGroups-communication-in-LAN-tp10087.html
Sent from the JGroups - General mailing list archive at Nabble.com.

pooja khambhayata | 24 Feb 18:17 2014

[jgroups-users] Configure TCP.sock_conn_timeout programmatically

While I was looking into the properties configuration, I realized there is no setter for sock_conn_timeout. I tried to find the setter in both TP and TCP. Is it not possible to set sock_conn_timeout programmatically? If anyone has tried this before by other means, please let me know.

Best,
Pooja
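In case it helps: since JGroups configures protocol properties via reflection anyway, one fallback I'm considering is setting the field by name with plain reflection. A sketch with a stand-in class (DummyTcp below is not the real TCP class, just an illustration; with a real stack you would first look up the TCP instance from the channel's ProtocolStack):

```java
import java.lang.reflect.Field;

public class ProtocolPropertySetter {
    /** Sets a (possibly private) field by name, walking up the class hierarchy. */
    public static void setProperty(Object target, String name, Object value) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Field f = c.getDeclaredField(name);
                f.setAccessible(true);
                f.set(target, value);
                return;
            } catch (NoSuchFieldException e) {
                // keep walking up: sock_conn_timeout is declared in TP, TCP's superclass
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        }
        throw new IllegalArgumentException("no field named " + name);
    }

    // Stand-in for the real TCP/TP protocol class, for illustration only.
    static class DummyTcp {
        int sock_conn_timeout = 2000;
    }
}
```

Whether this is safe to do after the channel is connected is a separate question, of course.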
_______________________________________________
javagroups-users mailing list
javagroups-users <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/javagroups-users
Bela Ban | 19 Feb 16:26 2014

[jgroups-users] JGroups 3.5.0.Alpha1

FYI,

I released an alpha1 of 3.5. Resolved issues and optimizations can be 
seen at [1].

[1] https://issues.jboss.org/browse/JGRP/fixforversion/12318576

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

Andrew Scully | 19 Feb 11:10 2014

[jgroups-users] enable_bundling warning?

Hi,

When starting a Jgroups stack like this...

<UDP singleton_name="udp" enable_bundling="false" ...

...I get a warning like this...

JGRP000014: TP.enable_bundling has been deprecated: will be ignored as bundling is on by default.

I want to turn bundling off on this stack, which is why we set it to "false", so the warning doesn't make sense.

Is there now a new way to disable bundling?

I'm running against 3.4.2.Final; we didn't get this with 3.2.7.Final.

My guess is it has something to do with this ticket:


Cheers, Andy.
Aleksandr Korostov | 13 Feb 21:06 2014

[jgroups-users] A node does not detect that it was excluded from the cluster

Hi,

We have a cluster of several dozen nodes running JGroups 2.12.1 (I know that it is a very old version). We use UDP; our current configuration is the following:

UDP(mcast_addr=${multicastAddress};ip_ttl=${timeToLive};mcast_port=${port}):
PING(timeout=2000;num_initial_members=4):
MERGE2(max_interval=30000;min_interval=10000):
FD(timeout=5000;max_tries=5):
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(gc_lag=50;retransmit_timeout=600,1200,2400,4800):
UNICAST(timeout=600,1200,2400,4800):
pbcast.STABLE(desired_avg_gossip=20000):
FRAG:
pbcast.GMS(join_timeout=3000;merge_timeout=10000;print_local_addr=true)

We faced a problem when testing the ability to restore the cluster communication after a major network failure.

So we were running 14 nodes on several separate machines and were disrupting and restoring the network between the machines. When the cluster was partitioned, we saw that the part with the active coordinator was excluding unreachable nodes one by one. For example, if we partition the cluster like this:

{1(coord), 2, 3, 4, 5, 6, 7 }              |          {8, 9}       |      {10, 11, 12, 13, 14}
 
FD on node 7 detects the failure of node 8 and notifies the coordinator; node 8 is excluded, and node 7 starts pinging node 9, detecting its failure after some time. So the number of cluster members in the first partition gradually drops to 7.

The second partition {8, 9} behaves quite differently. In the absence of the coordinator, these nodes "think" they are running in a cluster of 14 nodes for quite a long time, and only then does the size of the cluster drop to 2. I think I understand why this happens: node 9 pings node 10 and notifies the (unreachable) coordinator of its failure, then starts pinging node 11 (even without the coordinator excluding node 10), then node 12, and so on; after some time node 9 starts pinging node 1, its coordinator, and only when node 9 realizes that its coordinator is down does this partition form a cluster of 2 nodes. I know that FD is not optimal here and I'm going to switch to FD_ALL, but please read further.

Now consider what happens if we restore the network before node 9 detects the failure of the coordinator. Say at that point node 9 was pinging node 13. When we restore the network, node 13 responds to the ARE_YOU_ALIVE message. What we see is that sometimes nodes 8 and 9 get stuck and never rejoin the cluster. So we end up in a situation where the restored cluster has 12 nodes, while nodes 8 and 9 "think" they are part of a cluster of 14 nodes. I'm not sure this is 100% reproducible.

So my main question: what is the mechanism by which a node "understands" that it is no longer a member of the cluster, in the following cases:
   a. the coordinator stays the same; it excluded the node, but the node did not receive that message due to the network failure and still "thinks" it is a member of the cluster;
   b. the coordinator changed, so the node "thinks" it is part of the cluster with coordinator A while the active cluster has a different coordinator.
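To make the timing concrete, here is a toy model (my own simplification, not the real FD code) of FD's neighbor pinging: each member pings the next member in the view and, after timeout * max_tries, suspects it and moves on to the following one. It shows why node 9 in the {8, 9} partition takes so long to learn that coordinator 1 is gone:

```java
import java.util.*;

public class FdRingSim {
    /**
     * Returns the sequence of ping targets a member tries, in view order,
     * until it finds a reachable one. Each unreachable target costs roughly
     * timeout * max_tries (25 s with timeout=5000, max_tries=5) before FD
     * gives up on it and moves on.
     */
    static List<Integer> targetsTried(List<Integer> view, int self, Set<Integer> reachable) {
        List<Integer> tried = new ArrayList<>();
        int idx = view.indexOf(self);
        for (int i = 1; i < view.size(); i++) {
            int target = view.get((idx + i) % view.size());
            tried.add(target);
            if (reachable.contains(target))
                break;  // found a live neighbor; pinging stabilizes here
        }
        return tried;
    }

    public static void main(String[] args) {
        // 14-node view; from node 9's perspective only node 8 is reachable
        List<Integer> view = new ArrayList<>();
        for (int i = 1; i <= 14; i++) view.add(i);
        // Node 9 times out on 10..14 before even reaching coordinator 1,
        // so discovering the coordinator's death takes 6 * timeout * max_tries.
        System.out.println(targetsTried(view, 9, Set.of(8)));
    }
}
```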


--
All the best,
Aleksandr Korostov
arnaud.pillac | 6 Feb 11:39 2014

[jgroups-users] Cluster does not reconnect node

We are currently encountering an error in our production environment. We have
two nodes (oppodindex1 and oppodindex2) in our JGroups cluster. Sometimes
under load, the cluster members don't see each other and we get the following
errors in our logs:

node1:
22:04:19,971 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:04:19,975 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = MergeView::[oppodindex2-65413|3] (2) [oppodindex2-65413,
oppodindex1-12124], 2 subgroups: [oppodindex2-65413|2] (1)
[oppodindex2-65413], [oppodindex1-12124|1] (1) [oppodindex1-12124]
22:04:21,975 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK
22:04:21,975 | WARN  | ppodindex1-12124 | GMS                              |
253 - org.jgroups - 3.4.0.Final | oppodindex1-12124: failed to collect all
ACKs (expected=2) for view [oppodindex2-65413|3] after 2000ms, missing 2
ACKs from oppodindex2-65413, oppodindex1-12124
22:07:02,325 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:07:02,627 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = [oppodindex1-12124|4] (1) [oppodindex1-12124]
22:07:02,627 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK
22:07:58,088 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:07:58,089 | WARN  | ppodindex1-12124 | NAKACK2                          |
253 - org.jgroups - 3.4.0.Final | JGRP000011: oppodindex1-12124: dropped
message batch from non-member oppodindex2-65413 (view=[oppodindex1-12124|4]
(1) [oppodindex1-12124])
22:07:58,091 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = MergeView::[oppodindex2-65413|5] (2) [oppodindex2-65413,
oppodindex1-12124], 2 subgroups: [oppodindex2-65413|3] (1)
[oppodindex2-65413], [oppodindex1-12124|4] (1) [oppodindex1-12124]
22:08:02,039 | INFO  | ppodindex1-12124 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK

Node 2:
22:07:03,992 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:07:04,296 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = [oppodindex2-65413|2] (1) [oppodindex2-65413]
22:07:04,297 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK
22:07:35,154 | WARN  | ppodindex2-65413 | NAKACK2                          |
253 - org.jgroups - 3.4.0.Final | JGRP000011: oppodindex2-65413: dropped
message batch from non-member oppodindex1-12124 (view=[oppodindex2-65413|2]
(1) [oppodindex2-65413])
22:07:51,055 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:07:51,060 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = MergeView::[oppodindex2-65413|3] (2) [oppodindex2-65413,
oppodindex1-12124], 2 subgroups: [oppodindex2-65413|2] (1)
[oppodindex2-65413], [oppodindex1-12124|1] (1) [oppodindex1-12124]
22:07:53,059 | WARN  | ppodindex2-65413 | Merger                           |
253 - org.jgroups - 3.4.0.Final | oppodindex2-65413: failed to collect all
ACKs (2) for merge view MergeView::[oppodindex2-65413|3] (2)
[oppodindex2-65413, oppodindex1-12124], 2 subgroups: [oppodindex2-65413|2]
(1) [oppodindex2-65413], [oppodindex1-12124|1] (1) [oppodindex1-12124] after
2000 ms, missing ACKs from oppodindex1-12124
22:07:53,061 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK
22:11:29,176 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
22:11:29,178 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = MergeView::[oppodindex2-65413|5] (2) [oppodindex2-65413,
oppodindex1-12124], 2 subgroups: [oppodindex2-65413|3] (1)
[oppodindex2-65413], [oppodindex1-12124|4] (1) [oppodindex1-12124]
22:11:31,177 | WARN  | ppodindex2-65413 | Merger                           |
253 - org.jgroups - 3.4.0.Final | oppodindex2-65413: failed to collect all
ACKs (2) for merge view MergeView::[oppodindex2-65413|5] (2)
[oppodindex2-65413, oppodindex1-12124], 2 subgroups: [oppodindex2-65413|3]
(1) [oppodindex2-65413], [oppodindex1-12124|4] (1) [oppodindex1-12124] after
2000 ms, missing ACKs from oppodindex2-65413
22:11:31,178 | WARN  | ppodindex2-65413 | GMS                              |
253 - org.jgroups - 3.4.0.Final | oppodindex2-65413: failed to collect all
ACKs (expected=2) for view [oppodindex2-65413|5] after 2000ms, missing 1
ACKs from oppodindex1-12124
22:11:31,178 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
UNBLOCK
23:08:13,884 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster does
BLOCK
23:08:14,186 | INFO  | ppodindex2-65413 | IndexReceiver                    |
275 - fr.bull.telco.orange.pod.index-manager - 0.1.4.SNAPSHOT | Cluster new
VIEW = [oppodindex2-65413|6] (1) [oppodindex2-65413]

After these errors, the service is unavailable because the JGroups cluster is
down: there is no communication between the nodes (node 2 says: no physical
address for node1). We don't understand why the cluster creates a new view
with subgroups rather than just the two nodes. We use JGroups version 3.4.0.

We have the following configuration:
<config
	xmlns="urn:org:jgroups"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="urn:org:jgroups
http://www.jgroups.org/schema/JGroups-3.3.xsd">
	<UDP
		bind_addr="podclusterip"
		bind_port="8900"
		mcast_addr="224.9.9.9"
		mcast_port="8901" />
	<PING num_initial_members="2" />
	<MERGE2 />
	<FD_SOCK />
	<FD_ALL />
	<VERIFY_SUSPECT />
	<pbcast.NAKACK2 />
	<UNICAST3 />
	<pbcast.STABLE />
	<pbcast.GMS join_timeout="300000" />
	<UFC />
	<MFC />
	<FRAG2 />
	<RSVP />
	<pbcast.STATE_SOCK
		bind_addr="podclusterip"
		bind_port="8902" />
	<pbcast.FLUSH timeout="300000" />
</config>

Could you please send us feedback? Maybe we need to adjust timeouts when the
servers are under load, or upgrade JGroups to the latest release.
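One thing we are considering, since both logs show ACK collection giving up after exactly 2000 ms: raising GMS's ack timeout, if I read the GMS properties right. A sketch (the 10000 ms value is a guess on our part to be validated, not a tested recommendation):

```xml
<pbcast.GMS join_timeout="300000"
            view_ack_collection_timeout="10000" />
```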

Arnaud Pillac

--
View this message in context: http://jgroups.1086181.n5.nabble.com/Cluster-not-reconnect-node-tp10059.html
Sent from the JGroups - General mailing list archive at Nabble.com.

Marilen Corciovei | 5 Feb 11:47 2014

[jgroups-users] JGroups communication stops working after a few minutes

Hello everybody,

I am using JGroups as the base for Ehcache replication, and under certain 
conditions I am facing a very strange problem.

JGroups is configured for UDP, for a 2-node cluster. JGroups starts, 
connects, and works for 3-4 minutes, then stops working without any error. I 
have several clusters, and this problem might be related to the 
virtualization platform (KVM not OK, VirtualBox OK); however, what I cannot 
understand is why it works perfectly at the beginning.

I have spent the last days turning on TRACE for jgroups and I can see 
that at some point some sockets are closed:

2014-02-04 21:48:38,907 TRACE 
[INT-1,EH_CACHE,linux1-26104-org.jgroups.protocols.UNICAST3] 
linux1-26104: removed receive connection for linux2-63479
2014-02-04 21:48:38,907 TRACE 
[INT-1,EH_CACHE,linux1-26104-org.jgroups.protocols.UNICAST3] 
linux1-26104: removed receive connection for linux2-63479
2014-02-04 21:48:39,009 DEBUG 
[Timer-5,EH_CACHE,linux1-26104-org.jgroups.protocols.UNICAST3] 
linux1-26104: removing expired connection for linux1-26104 (60035 ms 
old) from send_table
2014-02-04 21:48:39,009 DEBUG 
[Timer-5,EH_CACHE,linux1-26104-org.jgroups.protocols.UNICAST3] 
linux1-26104: removing expired connection for linux1-26104 (60035 ms 
old) from send_table

and after this point nothing works anymore. However, some time later FD_ALL 
detects a problem and checks the suspected peer; the peer says it's OK, yet 
messages are no longer received. Maybe a listening thread is dead? It works 
with TCP, and I have tried dozens of UDP configurations with no luck.
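The 60-second age in those "removing expired connection" lines looks like UNICAST3's connection expiry. One diagnostic I plan to try is turning expiry off to see whether the stall goes away (my understanding is that 0 disables it; this is an experiment, not a fix):

```xml
<UNICAST3 conn_expiry_timeout="0" />
```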

Any ideas are appreciated. Regards,
Len
www.len.ro

Chris Lecompte | 5 Feb 07:43 2014

[jgroups-users] Long Delay in Receiving Messages

First let me say that I'm using a rather old version of JGroups (2.11.1); I'll be upgrading but can't
in this particular instance. I'm using the following protocols: UDP, PING, MERGE2, FD_ALL,
VERIFY_SUSPECT, NAKACK and GMS.

I am seeing an issue in a 24-node cluster where messages experience serious delays, but only
to/from certain nodes. For instance, I have a ping operation established to test communication in
the cluster. The protocol sends a message from a single node to the group over UDP using
channel.send(null, message) and then expects a reply message from each node (including itself).
When experiencing the problem, the node in question does not appear to receive any messages via
multicast: if I run the ping operation on a node that is not experiencing the problem, I receive
23 of 24 replies; if I run the same operation on the node experiencing the issue, I receive 0 of
24 replies, implying that the node could not send or receive any of the messages within the
10-second timeout that the operation will wait. After some time (~30 minutes) the messages are
received, and from that point on the issue no longer exists (until it crops up again). Other
nodes on the same host do not necessarily exhibit the problem.

Is there any particular diagnostic information that I could inspect, from the Probe command or
otherwise, that might indicate whether this is a network-related issue? It seems to be, but the
delay seems rather huge in this case.

Chris
Bela Ban | 24 Jan 14:36 2014

[jgroups-users] JGroups status presentation

FYI,
[1] http://belaban.blogspot.ch/2014/01/jgroups-status-and-outlook.html

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

