Automating Nagios service checks via SSH

by Rudd-O published 2006/09/17 06:00:00 GMT+0, last modified 2013-06-26T03:24:18+00:00
Nagios can let you sleep sound at night. Here's a collection of tips you can use to make its services even better.

Hello! Here's a nifty article in the Server management series. Today, we'll learn how to automate Nagios service checks via SSH, even when the remote server does not support password-less (key-based) SSH logins.

We'll cobble a lot of technologies together in order to achieve this amazing feat. While no experience is required with most of the technologies we'll be using, you'll need to be proficient in configuring Nagios. I'm also assuming you already have a Nagios instance up, running and operational.

Beware that this guide will require you to either generate SSH keys inside the Nagios home directory, or save your SSH password for the remote server in a file -- the second choice is not awfully secure. These keys and files will, however, be unreadable to anyone except the Nagios service checker -- not even the Web interface has access to it. I consider this an acceptable compromise under the circumstances, but you may not.

Without further ado, here's the guide.

Why this guide?

Now, what prompts me to do this guide? Of course, it's an itch to be scratched. My home computer runs Nagios, and I've set it up to periodically check for the health of the Apache service that runs this site, because it tends to go haywire and very slow during traffic storms.

The check_http Nagios plugin does work. So why don't I use it? Because it doesn't work as I intended. This is better explained by live example.

Most Nagios plugins are simple command-line programs which accept two timeout threshold arguments: the warning and the critical threshold. They report back the success, failure, warning or critical status to the Nagios server in a line of text.
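
The plugin convention described above can be sketched in a few lines of shell. This is only an illustration, not code from any real plugin: check_threshold is a made-up helper, and I'm assuming the usual exit statuses of 0 for OK, 1 for WARNING and 2 for CRITICAL (3 means UNKNOWN and is not used here).

```shell
#!/bin/sh
# Classify a measured time against warning and critical thresholds, following
# the Nagios plugin convention: one line of text on stdout plus an exit
# status of 0 (OK), 1 (WARNING) or 2 (CRITICAL).
# check_threshold is a hypothetical helper, not part of Nagios.
check_threshold() {
    value=$1
    warn=$2
    crit=$3
    # awk does the fractional-second comparison that plain sh cannot do.
    awk -v v="$value" -v w="$warn" -v c="$crit" 'BEGIN {
        if (v > c)      { printf "CRITICAL - %.3f seconds\n", v; exit 2 }
        else if (v > w) { printf "WARNING - %.3f seconds\n", v;  exit 1 }
        else            { printf "OK - %.3f seconds\n", v;       exit 0 }
    }'
}

check_threshold 0.793 2 5    # prints "OK - 0.793 seconds", exit status 0
```

Nagios only looks at the exit status and the text, so anything that behaves like this can act as a plugin.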

This makes for powerful, modular server supervision and testing. Remote testing of service health is a piece of cake: once you've set it up, it's fire and forget (until any problem with the server arises). In this particular case, Nagios periodically runs the check_http plugin, which times the response time of the Apache server; if the service takes too long to respond, Nagios issues a WARNING or a CRITICAL message to me.

Of course, in a perfect world, air poses no resistance, cows are perfectly round, and network congestion is a non-issue. But my computer is at 18 high-latency hops from the server I'm checking. To boot, I usually have a BitTorrent client open and, no matter how much tuning you do, that kind of traffic breeds congestion. While the service itself may be serving one page in 0.5 seconds, Nagios usually reports times above 10 seconds (in my book, that's CRITICAL).

So check_http, as shipped, is useless to me -- I cannot reliably determine if my remote host is truly slow, or if it's just me.

The solution: time server response times directly at the source

With that in mind, I set out to find a solution. And the first one that sprang to mind is the one that I implemented. Basically, running check_http directly on this server would yield good timing information, and drastically cut back on warnings. And you can run check_http via SSH, which is an even better proposition.

Keep reading to find out how I did it.

"Installing" check_http on the server

I could not install Nagios on the remote server, because it lacks a compiler, I don't have the root password and, besides, I'm too busy to go foraging for software and installing it. Damn.

Would copying the check_http binary suffice? As it turns out, it didn't. Lots of library errors, because my binary was linked against a different version of OpenSSL and the GNU C library. Damn. See for yourself:

[rudd-o@amauta2 ~]$ nagios/bin/check_http
   nagios/bin/check_http: error while loading shared libraries: cannot open shared object file: No such file or directory
Solution to this problem: the brute-force way. I copied the binary, the GNU C library, the OpenSSL libraries and the program that runs ELF binaries. Check this out:
[rudd-o@amauta2 ~]$ pwd
   [rudd-o@amauta2 ~]$ ll nagios/bin nagios/lib
   total 56
   -rwxr-xr-x  1 rudd-o rudd-o 50660 Sep 16 23:25 check_http

   total 3168
   -rwxr-xr-x  1 rudd-o rudd-o  120352 Sep 16 23:31
   -rwxr-xr-x  1 rudd-o rudd-o 1247272 Sep 16 23:28
   -rwxr-xr-x  1 rudd-o rudd-o 1572400 Sep 16 23:30
   -rwxr-xr-x  1 rudd-o rudd-o  279384 Sep 16 23:26
Great, now it's time to make it run. Keep reading.
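
If you'd rather not guess which libraries to copy, ldd can list them on the machine where the binary was built. Here's a rough sketch, assuming the usual ldd output format; list_lib_paths is a name I made up, and the library names in the comments are placeholders.

```shell
#!/bin/sh
# Print the file paths found in ldd-style output read from stdin.
# ldd prints lines such as "libfoo.so.1 => /lib/libfoo.so.1 (0x...)" for
# resolved libraries and "/lib/ld-foo.so.2 (0x...)" for the program loader.
# list_lib_paths is a hypothetical helper name.
list_lib_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }   # resolved library
        $1 ~ /^\//               { print $1 }         # loader, given as an absolute path
    '
}

# Intended use on the build machine (not run here):
#   ldd /path/to/check_http | list_lib_paths | xargs -I{} cp {} nagios/lib/
```

Lines without an absolute path (virtual libraries, or entries marked "not found") are skipped on purpose.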

Making check_http run properly on the remote server

OK, time to turn to the LD_LIBRARY_PATH trick. This tells the system to look for libraries first in the directory where I copied them.

[rudd-o@amauta2 ~]$ export LD_LIBRARY_PATH=$HOME/nagios/lib
[rudd-o@amauta2 ~]$ nagios/bin/check_http
nagios/bin/check_http: relocation error: /home/rudd-o/nagios/lib/ symbol dltls_get_addr_soft, version GLIBC_PRIVATE not defined in file with link time reference

Relowhat? I couldn't run check_http directly, since the server's dynamic linker comes from version 2.2 of the C library, and the program is linked against version 2.4. Damn.

But here comes the linker I copied (part of the GNU C library) to the rescue:

[rudd-o@amauta2 ~]$ nagios/lib/ nagios/bin/check_http
   check_http: Could not parse arguments
   Usage: check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
          [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L]
          [-a auth] [-f <ok | warn | critical | follow>] [-e <expect>]
          [-s string] [-l] [-r <regex> | -R <case-insensitive regex>] [-P string]
          [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>] [-A string] [-k string]
Great! Let's create a nagiosrun script to automate this odious task, and save it in my bin/ directory:
[rudd-o@amauta2 ~]$ cat bin/nagiosrun

export LD_LIBRARY_PATH=$HOME/nagios/lib
exec $HOME/nagios/lib/ "$@"
This snippet (once made executable) simply automates the setup of the library path, then execs the arguments I pass. Now I can do:
[rudd-o@amauta2 ~]$ nagiosrun nagios/bin/check_http
Keep reading to find out how I run this command from my home workstation.
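
A variation on the library-path trick used by nagiosrun: instead of exporting the variable, you can set it for a single command only, so nothing else in the session is affected. This is a sketch under my own naming (with_libs is hypothetical), using the standard env utility.

```shell
#!/bin/sh
# Run one command with LD_LIBRARY_PATH set for that command only, leaving
# the calling shell's environment unchanged.
# with_libs is a hypothetical helper name.
with_libs() {
    libdir=$1
    shift
    env "LD_LIBRARY_PATH=$libdir" "$@"
}

# Example (not run here): with_libs "$HOME/nagios/lib" some_command --its-args
```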

Invoking check_http on my home workstation

So, I got this to work -- partially. Let's see if it still runs when I use SSH from my computer:

[rudd-o@andrea] [01:00:56]
   [~/bin] > ssh bin/nagiosrun nagios/bin/check_http's password: <I enter the password>

check_http: Could not parse arguments
   Usage: check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
          [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L]
          [-a auth] [-f <ok | warn | critical | follow>] [-e <expect>]
          [-s string] [-l] [-r <regex> | -R <case-insensitive regex>] [-P string]
          [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>] [-A string] [-k string]
Great! But SSH is a big roadblock. It's prompting me for a password. How do I automate that? Keep reading.

But SSH prompts you for a password!

The next hurdle: making SSH not prompt me for a password. Nagios is fully automated, and there's no way in hell it can type passwords for me. Plus, SSH is very finicky about how it gets its passwords: either you type them in, or you can't use it. Piping the password does not work, because SSH reads the password from the terminal rather than from standard input (which is reserved for the program invoked on the command line). Damn!

There are two solutions to this problem:

  1. Using SSH public key authentication.
  2. Creating a script that talks interactively to the SSH command, providing the password when needed.

SSH public key authentication

There are many guides about that around the Internet, so I'll skip the general procedure. If you can enable public key authentication on the SSH server, these are the specifics for this use case:

  1. Create the Nagios home directory as listed in the /etc/passwd file.  Make sure it's mode 0700 and owned by the Nagios user.
  2. Use su - nagios -s /bin/bash to become the Nagios user temporarily.  Then create the public/private SSH key pair.
  3. Follow the procedure to register that key into the SSH server and enable public key authentication.
  4. Log in at least once to the server you're going to check as the Nagios user, so that SSH has a chance to register the public key of the server.
  5. Use the check_by_ssh Nagios plugin to run commands on the remote server.
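
Steps 2 to 4 above boil down to three commands, run as the Nagios user. The sketch below only prints them, so you can review them before running anything. REMOTE_USER and REMOTE_HOST are placeholders of mine, and I'm assuming ssh-copy-id is available on your machine (if it isn't, append the public key to the remote authorized_keys file by hand).

```shell
#!/bin/sh
# Dry-run sketch of steps 2-4: print each command instead of running it.
# REMOTE_USER and REMOTE_HOST are placeholders, not values from this article.
REMOTE_USER=${REMOTE_USER:-remoteuser}
REMOTE_HOST=${REMOTE_HOST:-server.example.com}

show_key_setup() {
    # Step 2: create a key pair with an empty passphrase, so Nagios can use it unattended.
    echo "ssh-keygen -t rsa -N '' -f \$HOME/.ssh/id_rsa"
    # Step 3: register the public key in the remote account's authorized_keys.
    echo "ssh-copy-id -i \$HOME/.ssh/id_rsa.pub $REMOTE_USER@$REMOTE_HOST"
    # Step 4: log in once so the server's host key gets recorded in known_hosts.
    echo "ssh $REMOTE_USER@$REMOTE_HOST true"
}

show_key_setup
```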

Here's a usable example of a Nagios command you can register in this scenario:

define command{
           command_name    check_http_via_ssh
           command_line    check_by_ssh -H $HOSTNAME$ -l rudd-o -C 'bin/nagiosrun nagios/bin/check_http -H $HOSTNAME$ -w $ARG1$ -c $ARG2$'
           }
# the -l argument should be the user name on the remote server

But what if you cannot enable SSH public key authentication?  No problem, there's an alternative solution.

Expect script

Expect to the rescue! Expect is a nifty program that automates tasks based on input and output. You should have installed Expect on your computer by now.

Using heavily Googled code, I concocted my own version of the sshexpect script:

#!/usr/bin/expect -f

#Expect script to supply root/admin password for remote ssh server
#and execute command.
#This script needs three arguments to connect to remote server:
#password = Password of remote UNIX server, for root user.
#ipaddr = IP Address of remote UNIX server, no hostname
#scriptname = Path to remote script which will execute on remote server
#For example:
#./sshlogin.exp password who
#Copyright (c) 2004 nixCraft project <>
#This script is licensed under GNU GPL version 2.0 or above
#This script is part of nixCraft shell script collection (NSSC)
#Visit for more information.
#set Variables

set ipaddr [lrange $argv 0 0]
   set password [lrange $argv 1 1]
   set scriptname [lrange $argv 2 2]
   set arg1 [lrange $argv 3 3]
   set arg2 [lrange $argv 4 4]
   set arg3 [lrange $argv 5 5]
   set arg4 [lrange $argv 6 6]
   set arg5 [lrange $argv 7 7]
   set arg6 [lrange $argv 8 8]
   set arg7 [lrange $argv 9 9]

#setting a timeout for the password prompt 5 seconds larger than the SSH ConnectionTimeout parameter

set timeout 35

#now connect to remote UNIX box (ipaddr) with given script to execute

set pid [spawn -noecho ssh -o "ConnectTimeout 30" -o "CheckHostIP no" -o "StrictHostKeyChecking no" $ipaddr $scriptname $arg1 $arg2 $arg3 $arg4 $arg5 $arg6 $arg7]
match_max 5000

#look for password prompt

log_user 0
   expect {
      "denied"                       {puts "CRITICAL: wrong SSH password" ; exit 2}
      "Name or service not known"    {puts "CRITICAL: cannot resolve SSH server name $ipaddr" ; exit 2}
      "Connection refused"           {puts "CRITICAL: SSH connection to $ipaddr refused" ; exit 2}
      "Connection timed out"         {puts "CRITICAL: SSH connection to $ipaddr timed out" ; exit 2}
      timeout                        {puts "CRITICAL: SSH server timed out while prompting for password" ; exit 2}
      "assword:"                     {}
   }

# send password

send -- "$password\r"

# send blank line to make sure we get back to gui

send -- "\r"
   expect "\r"
   log_user 1

# now we wait up to 30 seconds

set timeout 30
   expect {
       timeout                   {puts "CRITICAL: execution of $scriptname timed out after 30 seconds" ; exit 2}
       eof                       {}
   }

set waitret [wait]
   catch {close}

set state [lindex $waitret 2]
   exit [lindex $waitret 3]

What this script does is fairly easy to understand (once it's been explained to you!). It starts ssh with the passed arguments (a maximum of 8), against the server you specify and with a password you specify as well. It returns the status value of the remoted (remotely invoked) command.

The script suppresses any SSH output not related to the command, so beware: if the password is wrong, you will not be told. The script also makes SSH not prompt for host authentication, so if you're finicky about security, perhaps this is the wrong approach for you. But it works for me, so let's go on. Again, keep reading.

Testing our Frankenstein concoction to see if it works

We're going to use this as a "Nagios plugin" to remote the check_http command on the server, like this:

[rudd-o@andrea] [01:12:50]
[~/bin] > /usr/local/bin/sshexpect password123 bin/nagiosrun nagios/bin/check_http -H -w 2 -c 5
HTTP OK HTTP/1.1 200 OK - 42531 bytes in 0.793 seconds |time=0.793211s;2.000000;5.000000;0.000000 size=42531B;;;0
This time, I took the liberty of supplying check_http with some arguments it wants under normal operation. This is of no actual concern at this point, because Nagios does that for you, but I'll explain what they are:
  • -H: the host that check_http will check
  • -w 2: any request taking more than 2 seconds will be classified as a WARNING condition
  • -c 5: any request taking more than 5 seconds will be classified as a CRITICAL condition

You can clearly see that check_http reports OK, with page generation times of 0.793 seconds. Great!
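
Everything after the | in that output line is performance data, which Nagios stores alongside the status text. If you ever want to pull a number out of it yourself, here's a small sketch. perfdata_value is a helper name I made up, and the label=value[unit];warn;crit;min layout is what I'm assuming from the line above.

```shell
#!/bin/sh
# Extract the value of one performance-data label from a plugin output line.
# Everything after the "|" is perfdata: label=value[unit];warn;crit;min
# perfdata_value is a hypothetical helper name.
perfdata_value() {
    label=$1
    # Keep the text after "|", put one label=value item per line, then strip
    # the unit suffix and the thresholds from the matching item.
    sed 's/^[^|]*|//' | tr ' ' '\n' |
        sed -n "s/^$label=\([0-9.]*\).*/\1/p"
}

line='HTTP OK HTTP/1.1 200 OK - 42531 bytes in 0.793 seconds |time=0.793211s;2.000000;5.000000;0.000000 size=42531B;;;0'
echo "$line" | perfdata_value time    # prints 0.793211
echo "$line" | perfdata_value size    # prints 42531
```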

Configuring Nagios to do this for me automatically

I opened the Nagios configuration file (in my case, /etc/nagios/minimal.cfg), and added the following command definition:

# 'check_http' command definition

# arg1 user arg2 password arg3 warn arg4 crit

define command{
           command_name    check_http_via_ssh
           command_line    /usr/local/bin/sshexpect $ARG1$@$HOSTNAME$ $ARG2$ bin/nagiosrun nagios/bin/check_http -H $HOSTNAME$ -w $ARG3$ -c $ARG4$
           }

You can see that this Nagios command takes four arguments:

  1. the user name used to login
  2. the password
  3. the -w argument
  4. the -c argument

Further below, I added this service definition:

define service{
           use                             generic-service         ; Name of service template to use
           service_description             Apache via SSH
           is_volatile                     0
           normal_check_interval           300
           retry_check_interval            60
           check_command                   check_http_via_ssh!rudd-o!$USER3$!2!5
           }
You can see the arguments in check_command. But wait, whoa, what's that $USER3$? It's a user-defined macro that you can configure in /etc/nagios/private/resource.cfg, a private file. I add the password there:
# Store some usernames and passwords (hidden from the CGIs)
And now, it's time to restart the Nagios server. You can use your distro's service management commands to do that. Me, I'm perfectly happy with /sbin/service nagios restart.
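
In case the check_command syntax in the service definition looks cryptic: the value is split on "!" into the command name and the $ARG1$ to $ARG4$ macros. The sketch below only mimics that splitting in shell to show which piece ends up where; it is not Nagios code, and split_check_command is a made-up name.

```shell
#!/bin/sh
# Show how a check_command value breaks down into a command name and
# $ARG1$..$ARGn$. This mimics the splitting on "!"; it is not Nagios code.
# split_check_command is a hypothetical helper name.
split_check_command() {
    # Split on "!" only, with globbing disabled so arguments are taken literally.
    old_ifs=$IFS
    set -f
    IFS='!'
    set -- $1
    IFS=$old_ifs
    set +f
    echo "command_name=$1"
    shift
    n=1
    for arg in "$@"; do
        echo "ARG$n=$arg"
        n=$((n + 1))
    done
}

split_check_command 'check_http_via_ssh!rudd-o!$USER3$!2!5'
```

Here ARG2 shows up as the literal text $USER3$; Nagios itself replaces it with the value from the resource file when it builds the command line.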


Would you like to see a screenshot of it working as intended? Keep reading!

Voilà! Nagios is remoting a plugin through SSH!

[Screenshot: Nagios is running!]



Further ideas and tips

Using this approach, you can automate almost any Nagios plugin. I've already automated these:

  • checks of the SMTP server
  • checks of the average load

You can automate any check, and even perform checks on specialized software running on servers very, very far away. Go wild, and make sure you share your tricks with me!

Homework time!

But there are a few kinks with this approach.

  • There are race conditions in the expect script sshexpect.
  • Some error handling is not being done in the sshexpect script, so if something goes wrong, Nagios will act up, presenting WARNINGs or unknown status.
  • Of course, the SSH password is saved in a plain text file: /etc/nagios/private/resource.cfg.
  • The SSH password gets echoed in the command line. An attacker logged in to the computer running Nagios can see the password by typing ps axwww.
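
As a small first step on the plain-text password issue, you can at least verify that the file holding it is readable by its owner only. A sketch, with is_private being a helper name of my own; it checks the permission bits shown by ls and nothing more (no ACLs, no ownership).

```shell
#!/bin/sh
# Succeed only if the given file grants no permissions to group or others.
# is_private is a hypothetical helper; it inspects the mode string printed
# by "ls -ld" (characters 5-10 are the group and other permission bits).
is_private() {
    perms=$(ls -ld -- "$1" | cut -c5-10)
    [ "$perms" = "------" ]
}

# Example (not run here):
#   is_private /etc/nagios/private/resource.cfg || echo "WARNING: password file is too open"
```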

Here's where you come in. I challenge you to solve these issues (and share the solutions with me).

About error handling: All errors should be reported as "CRITICAL: appropriate error message" and make sshexpect return 2 to the operating system. You can follow the Nagios plug-in development guidelines to find out what Nagios expects. Pun intended: I fully expect that experienced expect programmers will do a lot better than I did.
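
To get you started on the error handling, here's one possible shape for it: a shell wrapper that runs any check and forces unexpected results into the convention above. It's a sketch, not a fix for sshexpect itself. nagios_guard is a made-up name, and I'm assuming that any exit status above 3, or an empty output, counts as a failure.

```shell
#!/bin/sh
# Run a check and force its result into the Nagios convention: print a
# CRITICAL line and return 2 whenever the command exits with a status
# above 3 or prints nothing at all.
# nagios_guard is a hypothetical helper name.
nagios_guard() {
    rc=0
    output=$("$@" 2>&1) || rc=$?
    if [ "$rc" -gt 3 ] || [ -z "$output" ]; then
        echo "CRITICAL: '$1' failed with status $rc: ${output:-no output}"
        return 2
    fi
    echo "$output"
    return "$rc"
}

# Example (not run here): nagios_guard /usr/local/bin/sshexpect <arguments as above>
```

Statuses 0 to 3 pass through untouched, so a legitimate WARNING from the remote plugin still reaches Nagios as a WARNING.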

And that's it!

As usual, flames to /dev/null, while comments are always welcome. Over and out.