Automating Nagios service checks via SSH
Nagios can let you sleep soundly at night. Here's a collection of tips you can use to make its services even better.
Hello! Here's a nifty article on the Server management series. Today, we'll learn how to automate Nagios service checks via SSH, even when the remote server does not support password-less (key-based) SSH logins.
We'll cobble a lot of technologies together in order to achieve this amazing feat. While no experience is required with most of the technologies we'll be using, you'll need to be proficient in configuring Nagios. I'm also assuming you already have a Nagios instance up, running and operational.
Beware that this guide will require you to either generate SSH keys inside the Nagios home directory, or save your SSH password for the remote server in a file -- the second choice is not awfully secure. These keys and files will, however, be unreadable to anyone except the Nagios service checker -- not even the Web interface has access to them. I consider this an acceptable compromise under the circumstances, but you may not.
Without further ado, here's the guide.
Why this guide?
Now, what prompts me to do this guide? Of course, it's an itch to be scratched. My home computer runs Nagios, and I've set it up to periodically check for the health of the Apache service that runs this site, because it tends to go haywire and very slow during traffic storms.
The check_http Nagios plugin does work. So why don't I use it? Because it doesn't work as I intended. This is best explained by a live example.
Most Nagios plugins are simple command-line programs which accept two threshold arguments: the warning and the critical threshold. They report back the OK, WARNING or CRITICAL status to the Nagios server in a line of text, backed by the program's exit code.
This makes for powerful, modular server supervision and testing. Remote testing of service health is a piece of cake: once you've set it up, it's fire and forget (until any problem with the server arises). In this particular case, Nagios periodically runs the check_http plugin, which measures the response time of the Apache server; if the service takes too long to respond, Nagios issues a WARNING or a CRITICAL message to me.
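To make the threshold convention concrete, here's a minimal sketch of a plugin-style check in shell. The function name classify_time and the wording of its messages are made up for illustration; the exit codes 0, 1 and 2 are the ones Nagios reads as OK, WARNING and CRITICAL.

```shell
#!/bin/sh
# classify_time ELAPSED WARN CRIT
# Hypothetical helper (not part of Nagios): prints one status line and
# returns the matching Nagios exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
classify_time() {
    elapsed=$1 warn=$2 crit=$3
    # awk does the floating-point comparison; plain sh only compares integers.
    state=$(awk -v t="$elapsed" -v w="$warn" -v c="$crit" \
        'BEGIN { if (t+0 > c+0) print 2; else if (t+0 > w+0) print 1; else print 0 }')
    case $state in
        0) echo "OK - response in ${elapsed}s" ;;
        1) echo "WARNING - response in ${elapsed}s" ;;
        2) echo "CRITICAL - response in ${elapsed}s" ;;
    esac
    return "$state"
}
```

Calling classify_time 0.793 2 5 prints an OK line and returns 0; a real plugin such as check_http does the measuring itself and then applies the same kind of comparison.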
Of course, in a perfect world, air poses no resistance, cows are perfectly round, and network congestion is a non-issue. But my computer is 18 high-latency hops away from the server I'm checking. To boot, I usually have a BitTorrent client open and, no matter how much tuning you do, that kind of traffic breeds congestion. While the service itself may be serving one page in 0.5 seconds, Nagios usually reports times above 10 seconds (in my book, that's CRITICAL).
So check_http, as shipped, is useless to me -- I cannot reliably determine if my remote host is truly slow, or if it's just me.
The solution: time server response times directly at the source
With that in mind, I set out to find a solution. And the first one that sprang to mind is the one that I implemented. Basically, running check_http directly on this server would yield good timing information, and drastically cut back on warnings. And you can run check_http via SSH, which is an even better proposition.
Keep reading to find out how I did it.
"Installing" check_http on the server
I could not install Nagios on the remote server, because it lacks a compiler, I don't have the root password and, besides, I'm too busy to go foraging for software and installing it. Damn.
Would copying the check_http binary suffice? As it turns out, it didn't. Lots of library errors, because my binary was linked against a different version of OpenSSL and the GNU C library. Damn. See for yourself:
[rudd-o@amauta2 ~]$ nagios/bin/check_http
nagios/bin/check_http: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
Solution to this problem: the brute-force way. I copied the binary, the GNU C library, the OpenSSL libraries and the ld-linux.so.2 program that runs ELF binaries. Check this out:
[rudd-o@amauta2 ~]$ pwd
/home/rudd-o
[rudd-o@amauta2 ~]$ ll nagios/bin nagios/lib
nagios/bin:
total 56
-rwxr-xr-x 1 rudd-o rudd-o 50660 Sep 16 23:25 check_http
nagios/lib:
total 3168
-rwxr-xr-x 1 rudd-o rudd-o 120352 Sep 16 23:31 ld-linux.so.2
-rwxr-xr-x 1 rudd-o rudd-o 1247272 Sep 16 23:28 libcrypto.so.6
-rwxr-xr-x 1 rudd-o rudd-o 1572400 Sep 16 23:30 libc.so.6
-rwxr-xr-x 1 rudd-o rudd-o 279384 Sep 16 23:26 libssl.so.6
Great, now it's time to make it run. Keep reading.
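If you'd rather not guess which libraries to copy, ldd on the machine where the plugin was built lists them. Here's a small sketch that turns ldd output into a plain list of paths; the function name list_needed_libs and the plugin path in the example comment are assumptions, so adjust them to your installation.

```shell
#!/bin/sh
# list_needed_libs: read `ldd` output on stdin, print one library path per line.
# Hypothetical helper; it only understands the two common ldd line shapes.
list_needed_libs() {
    awk '
        # "libssl.so.6 => /lib/libssl.so.6 (0x...)": the path is the third field
        $2 == "=>" && $3 ~ /^\// { print $3; next }
        # "/lib/ld-linux.so.2 (0x...)": the path is the first field
        $1 ~ /^\// { print $1 }
    '
}

# Example (the plugin path is an assumption):
#   ldd /usr/lib/nagios/plugins/check_http | list_needed_libs
```

Lines without an absolute path, such as the virtual linux-gate.so.1 entry, are skipped because there is no file to copy for them.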
Making check_http run properly on the remote server
OK, time to turn to the LD_LIBRARY_PATH trick. This tells the system to look for libraries first in the directory where I copied them.
[rudd-o@amauta2 ~]$ export LD_LIBRARY_PATH=$HOME/nagios/lib
[rudd-o@amauta2 ~]$ nagios/bin/check_http
nagios/bin/check_http: relocation error: /home/rudd-o/nagios/lib/libc.so.6: symbol dltls_get_addr_soft, version GLIBC_PRIVATE not defined in file ld-linux.so.2 with link time reference
Relowhat? I couldn't run check_http directly, since the linker is linked against version 2.2 of the C library, and the program is linked against version 2.4. Damn.
But here comes the linker I copied (part of the GNU C library) to the rescue:
[rudd-o@amauta2 ~]$ nagios/lib/ld-linux.so.2 nagios/bin/check_http
check_http: Could not parse arguments
Usage: check_http -H | -I [-u ] [-p ]
[-w ] [-c ] [-t ] [-L]
[-a auth] [-f ] [-e ]
[-s string] [-l] [-r | -R ] [-P string]
[-m :] [-4|-6] [-N] [-M ] [-A string] [-k string]
Great! Let's create a nagiosrun script to automate this odious task, and save it in my bin/ directory:
[rudd-o@amauta2 ~]$ cat bin/nagiosrun
#!/bin/bash
export LD_LIBRARY_PATH=$HOME/nagios/lib
exec "$HOME/nagios/lib/ld-linux.so.2" "$@"
This snippet (once made executable) simply automates the setup of the library path, then execs the copied linker with the arguments I pass. Now I can do:
[rudd-o@amauta2 ~]$ nagiosrun nagios/bin/check_http
Keep reading to find out how I run this command from my home workstation.
Invoking check_http on my home workstation
So, I got this to work -- partially. Let's see if it still runs when I use SSH from my computer:
[rudd-o@andrea] [01:00:56]
[~/bin] > ssh rudd-o.com bin/nagiosrun nagios/bin/check_http
rudd-o@rudd-o.com's password:
check_http: Could not parse arguments
Usage: check_http -H | -I [-u ] [-p ]
[-w ] [-c ] [-t ] [-L]
[-a auth] [-f ] [-e ]
[-s string] [-l] [-r | -R ] [-P string]
[-m :] [-4|-6] [-N] [-M ] [-A string] [-k string]
Great! But SSH is a big roadblock. It's prompting me for a password. How do I automate that? Keep reading.
But SSH prompts you for a password!
The next hurdle: making SSH not prompt me for a password. Nagios is fully automated, and there's no way in hell it can type passwords for me. Plus, SSH is very finicky about how it gets its passwords: either you type them in, or you can't use it. Piping the password does not work (because the input is reserved for the program that was invoked in the command line). Damn!
There are two solutions to this problem:
- Using SSH public key authentication.
- Creating a script that talks interactively to the SSH command, providing the password when needed.
SSH public key authentication
There are many guides about that around the Internet, so I'll skip this part and assume you have no way of enabling public key authentication in the SSH server. However, these are the specifics for this use case:
- Create the Nagios home directory as listed in the /etc/passwd file. Make sure it's mode 0700 and owned by the Nagios user.
- Use su - nagios -s /bin/bash to become the Nagios user temporarily. Then create the public/private SSH key pair.
- Follow the procedure to register that key into the SSH server and enable public key authentication.
- Log in at least once to the server you're going to check as the Nagios user, so that SSH has a chance to register the public key of the server.
- Use the check_by_ssh Nagios plugin to run commands on the remote server.
Here's a usable example of a Nagios command you can register in this scenario:
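define command{
command_name check_http_via_ssh
command_line check_by_ssh -H $HOSTNAME$ -l rudd-o -C "bin/nagiosrun nagios/bin/check_http -H $HOSTNAME$ -w $ARG1$ -c $ARG2$"
}
# the -l argument should be the user name on the remote server
# the -C argument carries the remote command as one quoted string, so that its -H, -w and -c are not mistaken for check_by_ssh's own options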
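The first two steps of the list above (a private home directory, then a key pair) might be sketched like this. The function names are invented for illustration, and a passphrase-less RSA key is an assumption: an unattended check can't type a passphrase, so weigh that against your own security needs.

```shell
#!/bin/sh
# prepare_ssh_dir HOME_DIR: create HOME_DIR/.ssh and make both directories
# accessible to their owner only (mode 0700).
prepare_ssh_dir() {
    mkdir -p "$1/.ssh" && chmod 700 "$1" "$1/.ssh"
}

# generate_key HOME_DIR: create an RSA key pair with an empty passphrase
# (-N '') so that Nagios can use it without anyone at the keyboard.
generate_key() {
    ssh-keygen -q -t rsa -N '' -f "$1/.ssh/id_rsa"
}

# Example, run after becoming the Nagios user:
#   prepare_ssh_dir "$HOME" && generate_key "$HOME"
```

The public half ends up in .ssh/id_rsa.pub; that's the file to register on the remote server.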
But what if you cannot enable SSH public key authentication? No problem, there's an alternative solution.
Expect script
Expect to the rescue! Expect is a nifty program that automates tasks based on input and output. You should have installed Expect on your computer by now.
Using heavily Googled code, I concocted my own version of the sshexpect script:
#!/usr/bin/expect -f
# Expect script to supply the password for a remote SSH server
# and execute a command.
# This script needs three arguments to connect to the remote server:
# ipaddr = address of the remote UNIX server, optionally as user@host
# password = password of that user on the remote UNIX server
# scriptname = path to the remote script which will execute on the remote server
# Up to seven further arguments are passed on to the remote script.
# For example:
# ./sshexpect 192.168.1.11 password who
#------------------------------------------------------------------------
# Copyright (c) 2004 nixCraft project
# This script is licensed under GNU GPL version 2.0 or above
#-------------------------------------------------------------------------
# This script is part of nixCraft shell script collection (NSSC)
# Visit http://bash.cyberciti.biz/ for more information.
#----------------------------------------------------------------------
# set variables (lindex returns each argument verbatim)
set ipaddr [lindex $argv 0]
set password [lindex $argv 1]
set scriptname [lindex $argv 2]
set arg1 [lindex $argv 3]
set arg2 [lindex $argv 4]
set arg3 [lindex $argv 5]
set arg4 [lindex $argv 6]
set arg5 [lindex $argv 7]
set arg6 [lindex $argv 8]
set arg7 [lindex $argv 9]
# set a timeout for the password prompt 5 seconds larger than the SSH ConnectTimeout parameter
set timeout 35
# now connect to the remote UNIX box (ipaddr) with the given script to execute
set pid [spawn -noecho ssh -o "ConnectTimeout 30" -o "CheckHostIP no" -o "StrictHostKeyChecking no" $ipaddr $scriptname $arg1 $arg2 $arg3 $arg4 $arg5 $arg6 $arg7]
match_max 5000
# look for the password prompt
log_user 0
expect {
    "denied" {puts "CRITICAL: wrong SSH password" ; exit 2}
    "Name or service not known" {puts "CRITICAL: cannot resolve SSH server name $ipaddr" ; exit 2}
    "Connection refused" {puts "CRITICAL: SSH connection to $ipaddr refused" ; exit 2}
    "Connection timed out" {puts "CRITICAL: SSH connection to $ipaddr timed out" ; exit 2}
    timeout {puts "CRITICAL: SSH server timed out while prompting for password" ; exit 2}
    "?assword:"
}
# send the password, followed by a carriage return
send -- "$password\r"
# send a blank line to make sure we get back to the command output
send -- "\r"
expect "\r"
log_user 1
# now we wait up to 30 seconds
set timeout 30
expect {
    timeout {puts "CRITICAL: execution of $scriptname timed out after 30 seconds" ; exit 2}
    eof
}
set waitret [wait]
catch {close}
set state [lindex $waitret 2]
# exit with the status of the remote command
exit [lindex $waitret 3]
What this script does is fairly easy to understand (once it's been explained to you!). It starts ssh with the passed arguments (a maximum of 8), against the server you specify and with a password you specify as well. It returns the status value of the remoted (remotely invoked) command.
The script suppresses any SSH output not related to the command, so beware: if the password is wrong, you will not be told. The script also makes SSH not prompt for host authentication, so if you're finicky about security, perhaps this is the wrong approach for you. But it works for me, so let's go on. Again, keep reading.
Testing our Frankenstein concoction to see if it works
We're going to use this as a "Nagios plugin" to remote the check_http command on the server, like this:
[rudd-o@andrea] [01:12:50]
[~/bin] > /usr/local/bin/sshexpect rudd-o@rudd-o.com password123 bin/nagiosrun nagios/bin/check_http -H rudd-o.com -w 2 -c 5
HTTP OK HTTP/1.1 200 OK - 42531 bytes in 0.793 seconds |time=0.793211s;2.000000;5.000000;0.000000 size=42531B;;;0
This time, I took the liberty of supplying check_http with some arguments it wants under normal operation. This is of no actual concern at this point, because Nagios does that for you, but I'll explain what they are:
- -H rudd-o.com: that's the host that check_http will check
- -w 2: any request taking more than 2 seconds will be classified as a WARNING condition
- -c 5: any request taking more than 5 seconds will be classified as a CRITICAL condition
You can clearly see that check_http reports OK, with a response time of 0.793 seconds. Great!
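The part of the output after the | character is performance data, written as label=value pairs. If you ever want to pull a number out of it (say, to log it), a sed one-liner will do. This is only a sketch: the function name perf_value is made up, and it assumes simple labels such as time and size.

```shell
#!/bin/sh
# perf_value LABEL: read one line of plugin output on stdin and print the
# numeric value that follows LABEL= in the performance data after the "|".
# Hypothetical helper; the unit suffix (s, B, ...) is dropped.
perf_value() {
    sed -n "s/.*|.*$1=\([0-9.]*\).*/\1/p"
}

# Example with the output shown above:
#   echo 'HTTP OK ... |time=0.793211s;2.000000;5.000000;0.000000 size=42531B;;;0' | perf_value time
```

With the real output line shown above, perf_value time prints 0.793211 and perf_value size prints 42531.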
Configuring Nagios to do this for me automatically
I opened the Nagios configuration file (in my case, /etc/nagios/minimal.cfg), and added the following command definition:
# 'check_http' command definition
# arg1 user arg2 password arg3 warn arg4 crit
define command{
command_name check_http_via_ssh
command_line /usr/local/bin/sshexpect $ARG1$@$HOSTNAME$ $ARG2$ bin/nagiosrun nagios/bin/check_http -H $HOSTNAME$ -w $ARG3$ -c $ARG4$
}
You can see that this Nagios command takes four arguments:
- the user name used to log in
- the password
- the -w argument
- the -c argument
Further below, I added this service definition:
define service{
use generic-service ; Name of service template to use
host_name rudd-o.com
service_description Apache via SSH
is_volatile 0
normal_check_interval 300
retry_check_interval 60
check_command check_http_via_ssh!rudd-o!$USER3$!2!5
}
You can see the arguments in check_command. But wait, whoa, what's that $USER3$? It's a user-defined string that you can configure in /etc/nagios/private/resource.cfg, a private file. I add the password there:
# Store some usernames and passwords (hidden from the CGIs)
$USER3$=password123
And now, it's time to restart the Nagios server. You can use your distro's service management commands to do that. Me, I'm perfectly happy with /sbin/service nagios restart.
Would you like to see a screenshot of it working as intended? Keep reading!
Voilà! Nagios is remoting a plugin through SSH!
Further ideas and tips
Using this approach, you can automate almost any Nagios plugin. I've already automated these:
- checks of the SMTP server
- checks of the average load
You can automate any check, and even perform checks on specialized software running on servers very, very far away. Go wild, and make sure you share your tricks with me!
Homework time!
But there are a few kinks with this approach.
- There are race conditions in the expect script sshexpect.
- Some error handling is not being done in the sshexpect script, so if something goes wrong, Nagios will act up, presenting WARNINGs or unknown status.
- Of course, the SSH password is saved in a plain text file: /etc/nagios/private/resource.cfg.
- The SSH password gets echoed in the command line. An attacker logged in to the computer running Nagios can see the password by typing ps axwww.
Here's where you come in. I challenge you to solve these issues (and share the solutions with me).
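As one possible starting point for the password kinks, here's a sketch of a helper that reads the password from a file and refuses to touch it unless the file is private to its owner. The name read_secret is invented, and this alone fixes nothing: sshexpect would still need to be changed to read the password from such a file instead of taking it on the command line.

```shell
#!/bin/sh
# read_secret FILE: print the first line of FILE, but only if group and
# others have no permissions on it. Otherwise print a CRITICAL line and
# return 2 (the Nagios CRITICAL code). Hypothetical helper.
read_secret() {
    # Characters 5-10 of the ls mode string are the group/other bits.
    perms=$(ls -ld "$1" 2>/dev/null | cut -c5-10)
    if [ "$perms" != "------" ]; then
        echo "CRITICAL: $1 is missing or accessible to group/others"
        return 2
    fi
    head -n 1 "$1"
}
```

A file created with chmod 600 passes the check; one left at mode 644 is rejected before its contents are read.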
About error handling: All errors should be reported as "CRITICAL: appropriate error message" and make sshexpect return 2 to the operating system. You can follow the Nagios plug-in development guidelines to find out what Nagios expects. Pun intended: I fully expect that experienced expect programmers will do lots better than I did.
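The rule above can also be enforced from outside the expect script. Here's a minimal sketch of a wrapper that passes exit codes 0, 1 and 2 through untouched and turns anything else into a CRITICAL line with exit code 2. The name run_as_plugin is made up, and treating every other code as CRITICAL is a choice that follows the rule stated above, not something Nagios requires.

```shell
#!/bin/sh
# run_as_plugin COMMAND [ARGS...]: run COMMAND; pass exit codes 0-2 through,
# report any other exit code as CRITICAL and return 2. Hypothetical wrapper.
run_as_plugin() {
    status=0
    "$@" || status=$?
    case $status in
        0|1|2) return "$status" ;;
        *) echo "CRITICAL: '$1' exited with unexpected status $status"
           return 2 ;;
    esac
}

# Example (arguments as in the sshexpect test run earlier):
#   run_as_plugin /usr/local/bin/sshexpect rudd-o@rudd-o.com password123 bin/nagiosrun nagios/bin/check_http -H rudd-o.com -w 2 -c 5
```

A missing binary, for instance, makes the shell return 127, which this wrapper reports as CRITICAL instead of leaving Nagios to guess.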
And that's it!
As usual, flames to /dev/null, while comments are always welcome. Over and out.