Michael Osburn | 12 Mar 16:44 2008

Putting a server into maintenance

All,

	We are preparing to roll mon into our production environment. There are still a few things 
that I need to prove to get the higher-ups' approval. How are most of you managing planned system 
downtime? Looking at the scripts, it appears that mon is designed to load the mon.cf
file at startup and stay with it. Is the preferred way to remove or add a host to edit
the config file and restart mon? Or am I missing a feature to have mon check its configuration 
file and reload if it changes?

Thanks all,

-- 
Michael Osburn
michael.osburn <at> echostar.com
Engineer I
Conditional Access Systems
EchoStar Operating Corp.

Michael Osburn | 12 Mar 19:19 2008

Re: Putting a server into maintenance

Thanks, 
	Looks like it will be easy to pass through in the meeting today. 
Thanks for the tips.

Michael

On Wed, 12 Mar 2008 12:07:38 -0400
Ed Ravin <eravin <at> panix.com> wrote:

> On Wed, Mar 12, 2008 at 09:44:59AM -0600, Michael Osburn wrote:
> >  How are most of you managing planned system 
> >  downtime?
> 
> In most cases, our engineers log into Mon and use the "host disable"
> or "service disable" to stop monitoring the stuff that's about to go
> down, and re-enable them when the maintenance is over.
> 
> Sometimes, we just ACK whatever's broken when Mon starts alarming.
> 
> If I had a really big planned outage I would comment out big chunks of
> the config file and restore it after the window.
> 
> > Or am I missing a feature to have mon check its configuration 
> > file and reload if it changes?
> 
> You are - look up "reset Mon" in the CGI or the API.  You can also
> send Mon a HUP signal to make it reload its config.

-- 
Michael Osburn

Ed Ravin | 12 Mar 17:07 2008

Re: Putting a server into maintenance

On Wed, Mar 12, 2008 at 09:44:59AM -0600, Michael Osburn wrote:
>  How are most of you managing planned system 
>  downtime?

In most cases, our engineers log into Mon and use the "host disable"
or "service disable" to stop monitoring the stuff that's about to go
down, and re-enable them when the maintenance is over.

Sometimes, we just ACK whatever's broken when Mon starts alarming.

If I had a really big planned outage I would comment out big chunks of
the config file and restore it after the window.

> Or am I missing a feature to have mon check its configuration 
> file and reload if it changes?

You are - look up "reset Mon" in the CGI or the API.  You can also
send Mon a HUP signal to make it reload its config.
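
For anyone who wants to script the HUP approach, here is a minimal Perl sketch. It assumes mon writes its process ID to a pid file; the default path below and the MON_PIDFILE environment variable are placeholders for illustration, not something mon defines. The helper name signal_from_pidfile is likewise made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a process ID from a pid file and send that process a signal.
# Returns the number of processes signalled (1 on success, 0 on failure).
sub signal_from_pidfile {
    my ($pidfile, $signal) = @_;

    open(my $fh, '<', $pidfile) or die "cannot open $pidfile: $!\n";
    my $pid = <$fh>;
    close($fh);

    die "$pidfile is empty\n" unless defined($pid);
    $pid =~ s/\s+//g;    # strip the trailing newline and any stray spaces
    die "$pidfile does not contain a pid\n" unless $pid =~ /^\d+$/;

    return kill($signal, $pid);
}

# Hypothetical location: substitute whatever pid file your mon install writes.
my $pidfile = $ENV{MON_PIDFILE} || '/var/run/mon.pid';

if (-e $pidfile) {
    signal_from_pidfile($pidfile, 'HUP')
        or warn "could not signal mon: $!\n";
}
```

Run it after editing the config file. Sending the signal needs the same privileges as the user mon runs as.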

Stephane Bortzmeyer | 18 Mar 15:13 2008

Re: Putting a server into maintenance

On Wed, Mar 12, 2008 at 12:07:38PM -0400,
 Ed Ravin <eravin <at> panix.com> wrote 
 a message of 23 lines which said:

> In most cases, our engineers log into Mon and use the "host disable"
> or "service disable" to stop monitoring the stuff that's about to go
> down, and re-enable them when the maintenance is over.
> 
> Sometimes, we just ACK whatever's broken when Mon starts alarming.

The good thing about "doing nothing when there is a planned
maintenance" is that it allows you to test that monitoring indeed
works.

Several times I have had the bad experience of an undetected failure
because the monitoring had a hidden problem.

Chris Hoogendyk | 18 Mar 20:22 2008

Re: Putting a server into maintenance


Stephane Bortzmeyer wrote:
> On Wed, Mar 12, 2008 at 12:07:38PM -0400,
>  Ed Ravin <eravin <at> panix.com> wrote 
>  a message of 23 lines which said:
>   
>> In most cases, our engineers log into Mon and use the "host disable"
>> or "service disable" to stop monitoring the stuff that's about to go
>> down, and re-enable them when the maintenance is over.
>>
>> Sometimes, we just ACK whatever's broken when Mon starts alarming.
>>     
>
> The good thing about "doing nothing when there is a planned
> maintenance" is that it allows you to test that monitoring indeed
> works.
>
> Several times I have had the bad experience of an undetected failure
> because the monitoring had a hidden problem.

mon is just so quiet and minimal when things are running alright. 8-)

sometimes I feel a need to go look, or even to kick it, to reassure 
myself that it is alright itself.

at some point I plan to implement a backup server in another department 
and have the backups back up each other and the mons mon each other. Then 
I could maybe have mon issue a "Good morning, Sysadmins!" with a summary 
of things that have been checked and are running alright. It would come 
on right before NPR's Morning Edition (in the US -- for other locations, 
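substitute appropriate national/regional/local morning news). If it 
could query the coffee maker as well, then we'd be all set. ;-)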

Ben Ragg | 18 Mar 22:06 2008

Re: Putting a server into maintenance

Over time we've been slowly modifying the code a little and adding our 
own features.

Two we've found really useful are "Ack All", which acks everything in the 
current view, and a hold feature, which stops alerts going out for 
up to 180 minutes while still showing what has failed. The hold records 
who put Mon into hold and their reason. At the end of the 180 minutes (or 
a shorter timeframe, if one was specified) Mon automatically comes out of 
hold and the alerts resume, so nobody can accidentally leave 
it on hold the way we could when we stopped the scheduler (which also had 
the disadvantage that we didn't know what was down).
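
The hold behaviour described above can be sketched in a few lines of Perl. This is not the actual patch, only an illustration of the idea under stated assumptions: the names start_hold, alerts_held and %hold, and the sample user and reason, are all made up for the example.

```perl
use strict;
use warnings;

use constant MAX_HOLD_SECS => 180 * 60;    # the 180-minute ceiling described above

my %hold;    # empty when no hold is active

# Start a hold. Requests longer than the ceiling are clamped to it.
# Returns the epoch time at which the hold will expire.
sub start_hold {
    my ($minutes, $who, $reason, $now) = @_;
    $now = time unless defined($now);

    my $secs = $minutes * 60;
    $secs = MAX_HOLD_SECS if $secs > MAX_HOLD_SECS;

    %hold = (start => $now, end => $now + $secs, by => $who, reason => $reason);
    return $hold{end};
}

# True while a hold is in force. An expired hold is cleared here, so
# alerting resumes on its own without anyone having to release it.
sub alerts_held {
    my ($now) = @_;
    $now = time unless defined($now);

    return 0 unless %hold;
    if ($now >= $hold{end}) {
        %hold = ();
        return 0;
    }
    return 1;
}

# A monitoring loop would still run every check and record failures;
# only the alert dispatch would be skipped while alerts_held() is true.
start_hold(30, 'oncall', 'switch firmware upgrade');
print "alerts held by $hold{by}: $hold{reason}\n" if alerts_held();
```

Because the expiry check lives in the same function the alert path calls, nothing else has to remember to release the hold.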

Stephane Bortzmeyer wrote:
> On Wed, Mar 12, 2008 at 12:07:38PM -0400,
>  Ed Ravin <eravin <at> panix.com> wrote 
>  a message of 23 lines which said:
>
>   
>> In most cases, our engineers log into Mon and use the "host disable"
>> or "service disable" to stop monitoring the stuff that's about to go
>> down, and re-enable them when the maintenance is over.
>>
>> Sometimes, we just ACK whatever's broken when Mon starts alarming.
>>     
>
> The good thing about "doing nothing when there is a planned
> maintenance" is that it allows you to test that monitoring indeed
> works.
>
> I had several times the bad experience of an undetected failure

Augie Schwer | 19 Mar 00:45 2008

Re: Putting a server into maintenance

Darn auto-complete.

---------- Forwarded message ----------
From: Augie Schwer <augie.schwer <at> gmail.com>
Date: Tue, Mar 18, 2008 at 4:44 PM
Subject: Re: Putting a server into maintenance
To: mon-devel <at> lists.sourceforge.net

On Tue, Mar 18, 2008 at 12:22 PM, Chris Hoogendyk
 <hoogendyk <at> bio.umass.edu> wrote:
 >  at some point I plan to implement a backup server in another department
 >  and have the backups backup each other and the mons mon each other. Then
 >  I could maybe have mon issue a "Good morning, Sysadmins!" with a summary
 >  of things that have been checked and are running alright. It would come
 >  on right before NPR's Morning Edition (in the US -- for other locations,
 >  substitute appropriate national/regional/local morning news). If it
 >  could query the coffee maker as well, then we'd be all set. ;-)

 You should definitely have a second box monitoring mon; if your main
 mon box dies and you don't know about it in the middle of the night,
 then you also won't know about all the other stuff that may have just
 failed too.
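
 As a rough illustration of what the second box could run, here is a
 minimal Perl sketch that checks whether the mon server's TCP port still
 accepts connections. It is only a liveness probe under stated
 assumptions: the function name port_answers is made up for the example,
 and the host and port come from the command line (2583 is, as far as I
 know, mon's usual port, but check your own setup).

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 1 if a TCP connection to host:port succeeds within the timeout,
# 0 otherwise. This proves only that something is listening, not that
# mon is scheduling its checks correctly.
sub port_answers {
    my ($host, $port, $timeout) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => (defined($timeout) ? $timeout : 5),
    );
    return 0 unless $sock;

    close($sock);
    return 1;
}

# Usage (host name is a placeholder): perl check_mon.pl mon1.example.com 2583
if (@ARGV == 2) {
    my ($host, $port) = @ARGV;
    if (port_answers($host, $port, 5)) {
        print "mon on $host:$port is answering\n";
    }
    else {
        warn "mon on $host:$port is NOT answering\n";
        exit 1;
    }
}
```

 The non-zero exit status on failure makes it easy to hang an alert off
 it from cron or from another monitor.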

 --
 Augie Schwer - Augie <at> Schwer.us - http://schwer.us
 Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072

-- 
Augie Schwer - Augie <at> Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072

Augie Schwer | 19 Mar 00:47 2008

Re: Putting a server into maintenance

On Tue, Mar 18, 2008 at 2:06 PM, Ben Ragg <bragg <at> internode.com.au> wrote:
>  Two we've found really useful... "Ack All" to ack everything in the
>  current view and a hold feature... so we can stop alerts going out for
>  up to 180 mins (but still see what's failed). The hold feature includes
>  who put Mon in to hold and their reason. At the end of the 180mins (or
>  timeframe specified less than that) Mon automatically comes out of hold
>  and the alerts automatically resume, so someone can't accidentally leave
>  it on hold like we could when we stopped the scheduler (which had the
>  disadvantage of not knowing what was down).

The "hold" feature sounds pretty interesting. At my site I find people
putting things into "disabled" and then forgetting all about them,
which is dangerous and annoying.

Care to share that code?

-- 
Augie Schwer - Augie <at> Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072

Ben Ragg | 19 Mar 01:20 2008

Re: Putting a server into maintenance

Augie Schwer wrote:
> The "hold" feature sounds pretty interesting. At my site I find people
> putting things into "disabled" and then forgetting all about them,
> which is dangerous and annoying.
>
> Care to share that code?

Always happy to share :)

The version we're using is "mon,v 1.22 2006/07/13 12:03:39 vitroth Exp $". 
It has been run through perltidy to clean it up a little, so the line 
numbers won't match up (hence I won't even bother giving them ;)

Attached is our copy of the main mon program and mon.cgi.

Changes to "mon"...

Under the global definitions, add a new array...

my @HOLD_ALERTS;       # don't send alerts: 0) end, 1) start, 2) by, 3) reason

In the main monitoring loop "for ( ; ; ) {" near the top add a check for 
an expired hold timer...

for ( ; ; ) {
  debug( 1, "$i" . ( $STOPPED ? " (stopped)" : "" ) . "\n" );
  $i++;
  $tm = time;

Ben Ragg | 22 Mar 00:01 2008

Re: Putting a server into maintenance

Sorry, forgot about that. It's been so long since I touched any of this :)

955c955
<       return (1, $l);
---
>       return (0, $l);
957c957,960
<       return (0, $1);
---
>       return (1, $1);
>     } elsif ($l =~ /alerts held by (\S*) from (\d+) to (\d+) comment (.*)$/) {
>         my $comment = _un_esc_str ($4);
>         return (2, $2, $1, $comment, $3);
1342a1346,1371
>     return $r;
> }
>
> sub hold {
>     my $self = shift;
>     my ($time, $comment) = @_;
>
>     undef $self->{"ERROR"};
>
>     if (!$self->{"CONNECTED"}) {
>         $self->{"ERROR"} = "not connected";
>         return undef;
>     }
>
>     $comment = _esc_str ($comment, 1);
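
To show what the new elsif branch in the diff is matching, here is a small standalone Perl sketch that applies the same pattern to a status line. It assumes the words in the wrapped regex are separated by single spaces. The function name parse_hold_line and the sample line are made up for illustration, and the comment is returned as-is, since the unescaping helper the patch calls is not reproduced here.

```perl
use strict;
use warnings;

# Parse a status line of the shape matched in the diff:
#   alerts held by <who> from <start> to <end> comment <text>
# Returns a hash reference, or nothing if the line does not match.
sub parse_hold_line {
    my ($line) = @_;

    if ($line =~ /alerts held by (\S*) from (\d+) to (\d+) comment (.*)$/) {
        return { by => $1, start => $2, end => $3, comment => $4 };
    }
    return;
}

# Hypothetical sample line, built only to fit the pattern.
my $sample = 'alerts held by oncall from 1205900000 to 1205903600 comment disk swap';
my $hold   = parse_hold_line($sample);
if ($hold) {
    my $minutes = ($hold->{end} - $hold->{start}) / 60;
    print "held by $hold->{by} for $minutes minutes: $hold->{comment}\n";
}
```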