Craig Dibble | 2 May 2006 03:20

Alerts cause Smokeping to stop working

Hi all,

I've got two servers running Smokeping - one (Server A) monitoring 127 
hosts with fping, the other (Server B) in another city monitoring just 
11 hosts.

When one device stops responding on Server B it stops logging data for 
all the hosts it is monitoring, but when the same thing happens on 
Server A it steps over the failures and carries on.

During outages the logs on both servers are filled with messages like this:

May  1 10:55:20 mon01 smokeping[9540]: FPing: WARNING: smokeping took 
130 seconds to complete 1 round of polling. It should complete polling 
in 60 seconds. You may have unresponsive devices in your setup.

The strange thing is, the configs are identical, apart from the Target 
definitions, and the fact that Server B has:

concurrentprobes = yes

set (although we are only using one probe so I'm not sure this is 
relevant, unless I misunderstood).

Server A is running version 1.4, compared to 2.0.4 on Server B.

I have seen a few mentions of a similar problem in the list archives, 
but I haven't found a satisfactory answer. I know I should probably 
upgrade Server A to a newer version, but obviously am reluctant to do so 
when it works but the newer version doesn't.
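
For anyone comparing the two servers' logs, the polling warning quoted above can be picked out mechanically. This is a minimal sketch, assuming the warning arrives as a single syslog line (it is only wrapped here by the mail client); the function name `polling_overrun` is made up for illustration and is not part of Smokeping.

```python
import re

# Matches the FPing polling warning text shown in the log excerpt above.
WARNING_RE = re.compile(
    r"smokeping took (\d+) seconds to complete 1 round of polling\. "
    r"It should complete polling in (\d+) seconds"
)

def polling_overrun(log_line):
    """Return how many seconds a polling round overran, or None if no warning is present."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None
    took, allowed = int(match.group(1)), int(match.group(2))
    return took - allowed
```

Feeding it each syslog line and counting the non-None results per server gives a rough picture of how often, and by how much, each one falls behind.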

Niko Tyni | 2 May 2006 08:09

Re: Smokeping IOSPing

On Fri, Apr 28, 2006 at 09:24:28AM +0200, Andreas Schneider wrote:

> Sorry for the confusion! Yes, I sent several mails, because sometimes the mail I send
> or receive is filtered and not delivered, so I don't know whether it reaches the list.
> I still have problems with TelnetIOSPing and IOSPing. I do not insist on either probe;
> I just want to measure connectivity between two Cisco devices, or from a Cisco device to a network
> device.
> 
> I attached the smokeping --debug output and the config file.

OK, the problem is here:

> rcmd: 172.43.6.11: short readIOSPing: 172.43.6.18: got

Looks like there's a problem with getting the rsh connection.
Are you sure that rsh access to the router is enabled? Can you
log into it with rsh from the command line on your Smokeping
server?

Cheers,
-- 
Niko Tyni	ntyni <at> iki.fi

Niko Tyni | 2 May 2006 08:42

Re: Alerts cause Smokeping to stop working

On Tue, May 02, 2006 at 11:20:59AM +1000, Craig Dibble wrote:

> I've got two servers running Smokeping - one (Server A) monitoring 127 
> hosts with fping, the other (Server B) in another city monitoring just 
> 11 hosts.
> 
> When one device stops responding on Server B it stops logging data for 
> all the hosts it is monitoring, but when the same thing happens on 
> Server A it steps over the failures and carries on.
> 
> During outages the logs on both servers are filled with messages like this:
> 
> May  1 10:55:20 mon01 smokeping[9540]: FPing: WARNING: smokeping took 
> 130 seconds to complete 1 round of polling. It should complete polling 
> in 60 seconds. You may have unresponsive devices in your setup.

Hi,

some clarifications: 

- Do you have any alerts enabled in the Targets section?

- Is the above quote from server A or server B? If from A, please include
  it from server B too. (Server A is not interesting here; it's working
  'well enough' and is an ancient version.)

- When server B stops logging, does the smokeping daemon die or is it
  just doing nothing? Does it recover when the unresponsive devices
  come back?


Craig Dibble | 2 May 2006 09:10

Re: Alerts cause Smokeping to stop working


Hi Niko, thanks for your prompt reply. Responses inline.

Niko Tyni wrote:

> Hi,
> 
> some clarifications: 
> 
> - Do you have any alerts enabled in the Targets section?

Yes, I cut the Target section for reasons of brevity, but the alerts are 
set up in the following fashion on both servers:

*** Targets ***

probe = FPing

menu = Top
title = Network Latency Grapher
remark = Welcome to SmokePing

+ Server B
menu = Server B
title = Server B

++ core
menu = Core
title = Core
alerts = bigloss,someloss,startloss

Andreas Schneider | 2 May 2006 09:30

Re: Smokeping IOSPing

Hi,

this is what rsh produces on the commandline:

WSIZ10-RTT:~# rsh -l smokeping 172.43.6.11 ping
rcmd: 172.43.6.11: short readWSIZ10-RTT:~# ping 172.43.6.18
PING 172.43.6.18 (172.43.6.18) 56(84) bytes of data.
64 bytes from 172.43.6.18: icmp_seq=1 ttl=255 time=13.2 ms
64 bytes from 172.43.6.18: icmp_seq=2 ttl=255 time=1.55 ms
64 bytes from 172.43.6.18: icmp_seq=3 ttl=255 time=1.56 ms
64 bytes from 172.43.6.18: icmp_seq=4 ttl=255 time=1.50 ms
64 bytes from 172.43.6.18: icmp_seq=5 ttl=255 time=1.55 ms
64 bytes from 172.43.6.18: icmp_seq=6 ttl=255 time=1.52 ms

--- 172.43.6.18 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5048ms
rtt min/avg/max/mdev = 1.505/3.491/13.234/4.357 ms
WSIZ10-RTT:~#

I should mention that 172.43.6.11 is a Catalyst 4006 switch running IOS version 12.2.
After I typed 'rsh -l smokeping 172.43.6.11 ping', I had to enter 'ping 172.43.6.18 <ENTER>'
to get the ping going.

Regards,
Andreas

-----Original Message-----
From: Niko Tyni [mailto:ntyni+smokeping-users <at> mappi.helsinki.fi]
Sent: Tuesday, 2 May 2006 08:09

Niko Tyni | 3 May 2006 09:31

Re: Smokeping IOSPing

On Tue, May 02, 2006 at 09:30:52AM +0200, Andreas Schneider wrote:

> this is what rsh produces on the commandline:
> 
> WSIZ10-RTT:~# rsh -l smokeping 172.43.6.11 ping
> rcmd: 172.43.6.11: short readWSIZ10-RTT:~# ping 172.43.6.18
> PING 172.43.6.18 (172.43.6.18) 56(84) bytes of data.
> 64 bytes from 172.43.6.18: icmp_seq=1 ttl=255 time=13.2 ms
> 64 bytes from 172.43.6.18: icmp_seq=2 ttl=255 time=1.55 ms
> 64 bytes from 172.43.6.18: icmp_seq=3 ttl=255 time=1.56 ms
> 64 bytes from 172.43.6.18: icmp_seq=4 ttl=255 time=1.50 ms
> 64 bytes from 172.43.6.18: icmp_seq=5 ttl=255 time=1.55 ms
> 64 bytes from 172.43.6.18: icmp_seq=6 ttl=255 time=1.52 ms
> 
> --- 172.43.6.18 ping statistics ---
> 6 packets transmitted, 6 received, 0% packet loss, time 5048ms
> rtt min/avg/max/mdev = 1.505/3.491/13.234/4.357 ms
> WSIZ10-RTT:~#
> 
> I should mention that 172.43.6.11 is a Catalyst 4006 switch running IOS version 12.2.
> After I typed 'rsh -l smokeping 172.43.6.11 ping', I had to enter 'ping 172.43.6.18 <ENTER>'
> to get the ping going.

It looks like you aren't pinging anything from the router, but
rather from the WSIZ10-RTT host. The 'short read' is a fatal error,
and you end up back on the WSIZ10-RTT command line.

Try to get rsh access working first, or maybe better switch to
TelnetIOSPing. If that's still not working either, send a new

37asma | 3 May 2006 20:43

Need help urgently-XML format

hi all,
I need to convert historical PingER data into Smokeping (RRDTool) data in
order to plot it using Smokeping.
PingER is also a network monitoring tool, and it records the information
(a timestamp for each probe of x pings it sends, the round trip
time of each ping, and the packet loss) that the Smokeping database
needs in order to plot graphs. I can say the information is sufficient
because I have compared the PingER text data file with Smokeping's
file, after first converting an .rrd file generated by Smokeping into XML
format.

My strategy is to take pingER data and convert it at "run time" into the
XML format and then give it to Smokeping to plot after using "Restore"
function to restore the XML files to .rrd files.

Q1. Is my strategy OK? Please guide me. Can I use the Update function instead?
If yes, how can I use it if the timestamps of the PingER data are not exactly
30 min apart (PingER uses a 30 min interval between the probes it sends)?
Will I have to know the size of the file in advance if I use this strategy?
Also, will this method be faster than the one described in the paragraph above?

Q2. Is there any way in which Smokeping can be configured as:

Site 1 probe
10 pings
300 seconds

//at the same time, in the config file also specify

Site 2 probe

amedee | 3 May 2006 22:16

Re: Need help urgently-XML format


> hi all,
> I need to convert historical PingER data into Smokeping (RRDTool) data in
> order to plot it using Smokeping.
> PingER is also a network monitoring tool, and it records the information
> (a timestamp for each probe of x pings it sends, the round trip
> time of each ping, and the packet loss) that the Smokeping database
> needs in order to plot graphs. I can say the information is sufficient
> because I have compared the PingER text data file with Smokeping's
> file, after first converting an .rrd file generated by Smokeping into XML
> format.
>
> My strategy is to take pingER data and convert it at "run time" into the
> XML format and then give it to Smokeping to plot after using "Restore"
> function to restore the XML files to .rrd files.

I suggest you take a deep breath and dive into the RRDtool documentation.
I think you don't even need the XML files. You can inject your data
directly into an .rrd file using rrdupdate:
http://oss.oetiker.ch/rrdtool/doc/rrdupdate.en.html
That's my suggestion for live data.

For historical data, you are on the right track with an xml file and
rrdrestore: http://oss.oetiker.ch/rrdtool/doc/rrdrestore.en.html

-- 
Amedee Van Gasse
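
As a rough illustration of the rrdupdate route suggested above, here is a minimal sketch that builds the argument list for one `rrdtool update` call from a single PingER probe round. It rests on some assumptions that should be checked against an `rrdtool dump` of a real Smokeping file: that the data sources are named `loss`, `median` and `ping1`..`pingN`, that round-trip times are stored in seconds, and that the ping slots are filled in ascending order. The helper name `build_update_args` is made up for this example.

```python
from statistics import median

def build_update_args(rrd_path, timestamp, rtts, pings_sent):
    """Build an argv list for `rrdtool update` describing one probe round.

    rtts holds the round-trip times (assumed to be in seconds) of the pings
    that were answered; pings_sent is how many pings the round contained.
    """
    answered = sorted(rtts)
    loss = pings_sent - len(answered)
    # "U" is RRDtool's marker for an unknown value.
    med = f"{median(answered):.6e}" if answered else "U"
    slots = [f"{rtt:.6e}" for rtt in answered] + ["U"] * loss
    # Assumed Smokeping data source names; verify with `rrdtool dump`.
    ds_names = ["loss", "median"] + [f"ping{i}" for i in range(1, pings_sent + 1)]
    values = [str(loss), med] + slots
    return [
        "rrdtool", "update", rrd_path,
        "--template", ":".join(ds_names),
        f"{int(timestamp)}:" + ":".join(values),
    ]
```

The resulting list can be handed to `subprocess.run`. Note that `rrdtool update` only accepts timestamps newer than the last update already in the file, so historical rounds have to be fed in chronological order into a file that does not yet contain later data.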

Craig Dibble | 4 May 2006 02:51

Re: Alerts cause Smokeping to stop working

<debug output sent offlist>

Niko Tyni wrote:

> The 'no match for target' is just debugging info that shows the alert was
> not triggered this time. It's perfectly OK to have them.

Ok, that makes sense.

> This output doesn't help, but that's not your fault :)
> 
> As a datapoint, could you try on server B how long it takes to fping an
> unreachable target:
> 
> % time /usr/sbin/fping -C 10 -q -B1 -r1 -i10 <something that doesn't respond>
> 
> I'm more suspicious of the sendmail side, though. Now that I look at it,
> you seem to have 'mailhost' defined. This means that Smokeping will use
> the Net::SMTP module instead of the sendmail program to send the mail.
> It's possible that sending mail via Net::SMTP just takes long for some
> reason... 

You could be on to something there - it turns out the specified mailhost 
itself was not reachable (human error there), so to kill two birds with 
one stone, here's the output of fping for that address:

# time /usr/sbin/fping -C 10 -q -B1 -r1 -i10 10.11.2.136
10.11.2.136 : - - - - - - - - - -

real    0m9.600s

Craig Dibble | 4 May 2006 06:51

Re: Alerts cause Smokeping to stop working

Craig Dibble wrote:

>> You could try commenting out the 'mailhost' line (after making sure that
>> /usr/sbin/sendmail is properly configured so it can be used for sending
>> mail). That way the mail sending shouldn't take any time, as sendmail
>> is going to queue it if the smarthost is slow or unreachable. 

>> I think the old version on server A didn't use Net::SMTP at all, so
>> that at least might be the key.
> 
> Ah, hopefully that's it. I've CC'ed this back to the list in case it 
> helps anyone else and I'll post an update when I can confirm if it fixed 
> the problem (or not).

Quick update: due to the current setup of this box it turns out sendmail 
is not an option so I added the correct mailhost and removed the 
sendmail line from the config.

We have just simulated an outage and I'm happy to confirm it now works.

Thanks again for your help Niko.

Craig
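
Since the root cause in this thread was a 'mailhost' that could not be reached, a quick reachability check before putting a mail server into the config may save someone else the same trouble. This is a minimal sketch, not anything Smokeping itself does: the function name, the default port 25 and the 5 second timeout are all assumptions, and it only verifies that a TCP connection opens, not that the server will accept mail.

```python
import socket

def mailhost_reachable(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False
```

For example, `mailhost_reachable("10.11.2.136")` would be expected to return False for the unreachable address tested with fping earlier in the thread.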

