Update RFE: 

Following up on my article, I am pleased to announce that the Nagios development team will incorporate my RFE (Request For Enhancement) as of Nagios version 3.0b1.

Response from the Nagios Development team:

Great idea! I’ll add two macros in 3.0b1: $MAXHOSTATTEMPTS$ and
$MAXSERVICEATTEMPTS$

Intro

This article describes how Nagios can attempt to automatically recover services on clients when it considers those services down. It assumes extensive knowledge of Nagios. The enhanced script this article describes evolved from an example script in the Nagios documentation.

I suggest you open the script for reading in a second window and then resume this article.

The average-sized Nagios cluster I use this script in is set up without passive checks. It only uses the nrpe daemon on clients, so that management stays centralized. Maybe more on that in another article.

Eventhandlers

This script is controlled from the Nagios server via an eventhandler. Nagios runs eventhandlers, if so configured, each time a service is in a soft problem state, and again when it first goes hard or recovers. To illustrate what actually happens from ‘hard up’ via ‘soft down’ back to ‘hard up’, a log example is shown below.

[1174574313] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574313] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;1;remote-event-handler-sc3
[1174574373] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574373] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;2;remote-event-handler-sc3
[1174574433] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;3;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574433] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;3;remote-event-handler-sc3
[1174574493] SERVICE ALERT: host01;rsync;OK;SOFT;4;PROCS OK: 1 process with args '/usr/bin/rsync'
[1174574493] SERVICE EVENT HANDLER: host01;rsync;OK;SOFT;4;remote-event-handler-sc3

In the example Nagios checks a host for a process named “/usr/bin/rsync” and detects that it is missing. Since an eventhandler is configured for this service, the eventhandler named ‘remote-event-handler-sc3’ is executed right after the first failed check puts the service in ‘soft down’. (More on the “-sc3” suffix in a while.) The log lines show an increasing number after “SOFT”, which is the Nagios macro $SERVICEATTEMPT$. When this counter reaches the service definition’s “max_check_attempts”, Nagios stops retrying and the service goes into ‘hard down’.
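To follow the attempt counter in a real log, the eventhandler lines can be reduced to their state, state type and attempt fields. The sketch below assumes the semicolon-separated layout of the log excerpt above; the function name `handler_attempts` is made up for illustration.

```shell
#!/bin/sh
# Print "state type attempt" for every SERVICE EVENT HANDLER line on stdin.
# Assumes the semicolon-separated layout of the log excerpt above.
handler_attempts() {
    awk -F';' '/SERVICE EVENT HANDLER:/ { print $3, $4, $5 }'
}

# Example: feed it a single line taken from the excerpt.
echo "[1174574433] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;3;remote-event-handler-sc3" | handler_attempts
```

Against a live installation the function would read the Nagios log file on stdin; where that file lives depends on your setup.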

Service definition

The service definition on the Nagios server, using this eventhandler, is shown below.

# NRPE: Rsync daemon check
define service{
    use                             generic-service         ; Name of service template to use
    host_name                       host01
    ; If 'remote-event-handler-scX' is used this name must be exactly the same as it is in /etc/init.d on the client
    service_description             rsync
    max_check_attempts              4       ; Make it 4,6 or 11 (see eventhandler line below)
    normal_check_interval           5
    retry_check_interval            1
    contact_groups                  office-group
    notification_interval           120
    notification_options            w,u,c,r
    ; The sc'3' at the end can be 3,5 or 10, and 'max_check_attempts' MUST be one HIGHER
    event_handler                   remote-event-handler-sc3
    ; The usual: remote command!warn-range!crit-range!ps string (range below 1 also triggers)
    check_command                   check_nrpe!check_proc_string!1:1 1:3 /usr/bin/rsync
}

(Comments within a service definition have a ‘;’ prefix).

The lines that matter here are the following:

service_description = rsync – The eventhandler has a generic setup, but for it to work the name of the service, “rsync” in this case, must be identical to the name of the script in the init.d directory on the client.

max_check_attempts = 4 – In this example, make it 4, 6 or 11 (more on that in a while).

event_handler = remote-event-handler-sc3 – This is the name of one of the custom made eventhandler commands.

There’s a reason why the number at the end of the eventhandler name is one lower than the value of max_check_attempts: you want the eventhandler script to restart the daemon one check before Nagios starts notifying.
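Because the two values live on different lines of the service definition, it is easy to change one and forget the other. As a rough sanity check, the rule "max_check_attempts is one higher than the -scX suffix" could be expressed like this; `handler_matches_attempts` is a hypothetical helper, not part of Nagios.

```shell
#!/bin/sh
# Return 0 when max_check_attempts is exactly one higher than the
# number in the handler's -scX suffix, non-zero otherwise.
# handler_matches_attempts is a hypothetical helper, not part of Nagios.
handler_matches_attempts() {
    handler=$1
    max_check_attempts=$2
    restart_at=${handler##*-sc}          # "remote-event-handler-sc3" -> "3"
    case $restart_at in
        ''|*[!0-9]*) return 1 ;;         # no numeric suffix found
    esac
    [ "$max_check_attempts" -eq $((restart_at + 1)) ]
}

handler_matches_attempts remote-event-handler-sc3 4 && echo "consistent"
```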

Eventhandler configuration

For this automatic service recovery to work, part of the eventhandler configuration resides on the server. The other part is on the client side, where commands are defined in the nrpe daemon configuration file “nrpe_local.cfg”. The eventhandler definition file shown below is the server part of the configuration.

# handlercommands.cfg - Handler commands for use with eventhandlers
# The scX at the end of the name stands for the last parameter that is sent to the client for handling the "soft state check attempt count"

define command{
    command_name    remote-event-handler-sc3
    command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c remote-event-handler-sc3 -a $SERVICESTATE$ $STATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ 3
}
define command{
    command_name    remote-event-handler-sc5
    command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c remote-event-handler-sc5 -a $SERVICESTATE$ $STATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ 5
}
define command{
    command_name    remote-event-handler-sc10
    command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c remote-event-handler-sc10 -a $SERVICESTATE$ $STATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ 10
}

(Don’t forget to include the file in nagios.cfg with e.g. “cfg_file=/etc/nagios/handlercommands.cfg”)

As you can see, the need for max_check_attempts to be either 4, 6 or 11 has to do with the last parameter in the three “command_line” definitions. This fifth argument, and thus the -scX suffix, is directly related to the “max_check_attempts” in the service definitions (that is, it’s always one lower). If “max_check_attempts” were available as a Nagios macro we’d only need one eventhandler definition, but more on that at the end of the article. The server contacts the nrpe daemon on the client, which has the following (remote) commands defined.

# Eventhandlers (remote)
command[remote-event-handler-sc3]=/etc/nagios-plugins/eventhandlers/restart-daemon $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
command[remote-event-handler-sc5]=/etc/nagios-plugins/eventhandlers/restart-daemon $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
command[remote-event-handler-sc10]=/etc/nagios-plugins/eventhandlers/restart-daemon $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
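The fourth argument, $SERVICEDESC$, is what ties a Nagios service to its init script on the client. Since that value arrives over the network, a script might want to check it before building a path from it. The following is only a sketch of that idea; `INIT_DIR` and `init_script_for` are illustrative names, and the actual restart-daemon script may handle this differently.

```shell
#!/bin/sh
# Print the init script path for a service description, or fail if the
# name is unsafe. INIT_DIR and init_script_for are illustrative names.
INIT_DIR=${INIT_DIR:-/etc/init.d}

init_script_for() {
    service=$1
    case $service in
        # reject empty names, leading dots, and anything outside A-Z a-z 0-9 . _ -
        ''|.*|*[!A-Za-z0-9._-]*) return 1 ;;
    esac
    printf '%s/%s\n' "$INIT_DIR" "$service"
}

init_script_for rsync
```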

The last value in the eventhandler command_line definition on the server is passed to the restart-daemon script via “$ARG5$”. From then on, within the script, it’s handled as “$maxattempts”. (Note that this is not the “max_check_attempts” from the service definition.) As the script states:

# The tries (in soft state) before a local service restart is invoked.
# Note: This value is set in an eventhandler definition on the Nagios server.
maxattempts=$5
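How the five arguments could drive the restart decision is sketched below. This is a simplified reading of the behaviour described in this article, not the script itself: it only answers the question "should this invocation restart the daemon?", and `should_restart` is a hypothetical name.

```shell
#!/bin/sh
# Decide whether this invocation should trigger a restart.
# Arguments mirror the NRPE command line: state, state type, attempt,
# service description, maxattempts. should_restart is a hypothetical name.
should_restart() {
    state=$1
    statetype=$2
    attempt=$3
    maxattempts=$5

    case $state in
        CRITICAL) ;;                 # only act on a failed service
        *) return 1 ;;
    esac
    case $statetype in
        SOFT) [ "$attempt" -eq "$maxattempts" ] ;;   # restart on the chosen retry
        *)    return 1 ;;                            # HARD: leave it to notifications
    esac
}

# Replay the three failed checks from the log with the -sc3 handler.
for attempt in 1 2 3; do
    if should_restart CRITICAL SOFT "$attempt" rsync 3; then
        echo "attempt $attempt: restart"
    else
        echo "attempt $attempt: wait"
    fi
done
```

With the -sc3 handler the first two invocations do nothing and only the third asks for a restart, which is the behaviour the log excerpt shows.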

The server controls when to invoke a restart of a daemon: just one check before people get notified. The advantage of this setup is that you don’t need to set the value remotely in the script; you simply choose the appropriate eventhandler. Let’s take another look at the log example from the beginning of this article.

[1174574313] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574313] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;1;remote-event-handler-sc3
[1174574373] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574373] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;2;remote-event-handler-sc3
[1174574433] SERVICE ALERT: host01;rsync;CRITICAL;SOFT;3;PROCS CRITICAL: 0 processes with args '/usr/bin/rsync'
[1174574433] SERVICE EVENT HANDLER: host01;rsync;CRITICAL;SOFT;3;remote-event-handler-sc3
[1174574493] SERVICE ALERT: host01;rsync;OK;SOFT;4;PROCS OK: 1 process with args '/usr/bin/rsync'
[1174574493] SERVICE EVENT HANDLER: host01;rsync;OK;SOFT;4;remote-event-handler-sc3

As the service definition showed, max_check_attempts was “4” (in combination with eventhandler …-sc3). Counting the checks, the script does its magic the third time it’s invoked. The rsync daemon is successfully (re)started, and when it is checked for the fourth time the service is “OK” again. It never reaches ‘hard down’, so people are not notified unnecessarily.

The restart-daemon script has mail functionality built in for the case where it successfully recovers a service, because you want to know about a service restart nevertheless.

Gotchas

This auto-restart construction seems very nice, and of course it is … 🙂 … but it will not work properly in all situations.

– Assume a daemon process hangs but is still visible to Nagios. Then the script will never be invoked and this whole setup is purely academic. You still lose the functionality of the service, even though the process is still in memory.

– Assume you have a service dependency defined and Nagios ‘knows’ there’s a relation between http checks and a process called apache2 on that same client. Then assume http requests are not being served because apache hangs, and Nagios tries a restart with the described eventhandler construction. The only way to get back to a proper ‘hard up’ is when the init.d script in question is able to do a “kill -9 …” to remove apache from memory. Since not all init scripts handle unconditional process removal during service shutdown, you should of course test this.
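For init scripts that lack such a fallback, the idea of "ask nicely first, then kill -9" could look roughly like the sketch below. `stop_or_kill` and its arguments are illustrative only; a real init script would usually read the PID from a pidfile and may already offer something similar.

```shell
#!/bin/sh
# Ask a process to stop, then force it if it is still alive after a
# grace period. stop_or_kill and its arguments are illustrative names.
stop_or_kill() {
    pid=$1
    grace=${2:-5}                               # seconds to wait after SIGTERM

    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # it exited on its own
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null               # the "kill -9" fallback
    return 0
}
```

A call such as `stop_or_kill "$(cat /var/run/apache2.pid)" 10` would give the daemon ten seconds before forcing it; the pidfile path here is an assumption and differs per distribution.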

RFE for Nagios

This solution might seem somewhat cumbersome, even though it has a very generic setup in which, once implemented on the client, most of the ‘management’ takes place on the central Nagios server. It could be a lot simpler though if the value “max_check_attempts” within a service definition were available as a Nagios macro as well. Then you would simply decrease $MAX_CHECK_ATTEMPTS$ by one to always attempt a restart on the second-to-last check. That would result in only one eventhandler instead of … well … as many as you have different values defined for “max_check_attempts”.
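With such a macro (the update at the top of this article names it $MAXSERVICEATTEMPTS$), the script could derive the restart attempt itself. The sketch below assumes the server would pass max_check_attempts as the fifth argument instead of a fixed 3, 5 or 10; `restart_attempt_for` is a hypothetical helper.

```shell
#!/bin/sh
# Derive the restart attempt from max_check_attempts instead of a -scX suffix.
# Assumes the server passes max_check_attempts as the fifth argument.
restart_attempt_for() {
    max_check_attempts=$1
    # need a number greater than 1, otherwise there is no retry to act on
    [ "$max_check_attempts" -gt 1 ] 2>/dev/null || return 1
    echo $((max_check_attempts - 1))
}

restart_attempt_for 4
```

Inside the script this would replace the plain assignment, along the lines of `maxattempts=$(restart_attempt_for "$5")`.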

Conclusion

When trying to create a self-healing environment, this service eventhandler setup covers quite a few predictable failures of services not doing what they are supposed to. To take full advantage of its generic nature, the only requirement is to set the name of the service (“service_description”) the same as the name of the init script on the client. In our example the rsync service is handled by “/etc/init.d/rsync”, and thus the service is called “rsync” under Nagios. It was tested on Debian Etch, but with minor adjustments the eventhandler restart script should work on all your *NIX boxes.

That’s all folks …