Hello all, this is @mythryll. I hope everyone is staying safe! This is going to be a long one, so strap in and enjoy the ride (hopefully)!
A little back story
A few years back I started a conversation with the great Michael Friedrich, who was working at Icinga at the time (I think). I argued that Icinga seemed better organized and structured than Nagios, but that it lacked generally available guides on how to reach a practical, real-world installation, the kind of guides that could get more people involved and help the community grow. Michael complained (rightfully) that they were deeply involved in developing the product and didn’t have time for guides (of course that’s not a job for the developers; I would say it’s marketing’s job to push for the product to be adopted), so he challenged me to produce whatever automation I could for Nagios first.
As it happens, I didn’t have time for that either, but I did manage to produce an Ansible playbook for upgrading Nagios and posted it here. That became the first in a series of posts about automation, and the reason to clean up my act in my blog here, so it can be a little more presentable and perhaps even helpful to others.
After that, I was so involved with projects and with learning about network automation that I didn’t find the time to create an Ansible playbook for a clean Nagios installation, as the necessary immersion in the Nagios development context seemed like a luxury.
I am now working on an important project that combines tools from different worlds, so I need a way to reproduce a Nagios installation quickly, in a standard and documented way and without the risk of human error. So naturally I dug up my Ansible war axe and started hacking at the problem.
So what did you find?
I found an older playbook covering some of this here, by Sahil Sharma. I changed a lot of things and adapted others. It was very useful as an example, but it only covers the installation process described in the docs, as does mine of course, so nothing too fancy here. Feel free to use it with the Nagios versions it references; with the current versions, however, you will get failures, as a lot of things no longer work that way. Sahil’s work includes a role where a main.yml file invokes all the other ones. I assume you can follow the same path and build one for yourself; I didn’t really want to go that far. I am not claiming to be an Ansible expert (far from it), and some of the modules I used do not deliver a major Ansible advantage: idempotency. So be careful and use snapshots if you are using VMs for your infrastructure.
I have tested the playbooks below on both Ubuntu 18.04.4 LTS and Ubuntu 20.04 LTS. My initial target was to go with 20.04 all the way, so that I could keep the new car smell as much as possible. Perl, Python and almost Ansible too foiled my plans at a few points. Here is what the installation includes:
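If you don't want to build a full role, one lighter way to chain the playbooks is a wrapper playbook. The following is only a sketch: the name site.yml and every file name except nagios-core-install.yml are hypothetical, so substitute whatever you saved the playbooks as.

```yaml
---
# site.yml (hypothetical name): runs the individual playbooks in order.
# Only nagios-core-install.yml is a file name taken from this post;
# the other names are placeholders.
- import_playbook: nagios-core-install.yml
- import_playbook: nagios-plugins-install.yml        # hypothetical file name
- import_playbook: check-nrpe-install.yml            # hypothetical file name
- import_playbook: check-esxi-hardware-install.yml   # hypothetical file name
```

Because target_host is passed as an extra variable, it should be visible to every imported playbook, so something like ansible-playbook site.yml --extra-vars "target_host=srv-nagios-01" would run the whole chain against one host.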
- Nagios Core 4.4.5 (Latest at this point) – Here is the github link.
- Nagios Plugins 2.3.3 (Latest at this point) – Here is the github link.
- NRPE v.4.0.2 (Latest at this point) – The playbook here only covers the check_nrpe plugin (the architecture is reversed with NRPE: the Nagios server is the NRPE client). The documentation describes how to install v3 from source; version 4 is only on GitHub. There is sad news there. Also watch the OpenSSL part, as nobody includes it properly in the instructions. Finally, the Windows component for Nagios can launch an NRPE server module, but you may need to either activate encryption properly or deactivate it entirely (that broke some checks for me on my production system, which was using NRPE 3.2.1). I will try to reproduce the rest of the process in a playbook for the NRPE servers (the distributed probes) and will update here.
- check_esxi_hardware.py plugin that can check ESXi standalone hosts for problems (including hardware issues)
- check_hp_bladechassis – to check the status of the HP Blade chassis in our DCs. I have yet to find and include a similar one for the Cisco UCS platform; they are heavy on PowerShell and Ansible, so I am sure something can be built. I will check with DevNet’s John McDonough about that.
- check_ipmi_sensor – this checks IPMI alerts and sensors on platforms that support the IPMI specification. It uses the FreeIPMI library and can work with a lot of hardware appliances. Here is the github link.
So what didn’t work?
Well a couple of things.
- check_cisco_ip_sla – This plugin will only work on Ubuntu 18.04.4 LTS. There seem to be issues with the easysnmp Python library when the Python version is 3.7 or later, and the Python 3 version on Ubuntu 18.04.4 LTS is currently 3.6.9. I have found a bug report here. Although I didn’t make an Ansible playbook for it, it’s pretty easy to make one for Ubuntu 18.04.4 if you follow the pattern of the other ones. Just make sure you install the prerequisites with APT and use the Ansible command module, instead of the pip module, to install the easysnmp Python library.
- check_vmware_api – This plugin is based on the Perl SDK from VMware; since version 4, the VMware vCLI is included in the installation. Versions 6.7 and 7.0 of this SDK are currently available from VMware. My tests were successful when the plugin was doing single checks using the new SDK versions, but when I loaded my configs on the server it started to time out. Apparently there is an issue with a random number generator or something like that (I guess that’s used to separate sessions?). These two posts helped me understand the problem and make up my mind to drop this:
- So I decided to drop that and switch to check_vmware_esx, which works, but I had to change my commands because a lot of things are defined differently: some things are still accessible but need different commands, others have disappeared entirely (runtime net, io), and others are available that were not available before (snapshots). The problem is that this plugin cannot be fully installed via an Ansible playbook; the plugin itself can, but the SDK cannot. The Perl SDK installer is interactive and needs a series of keys to be pressed or typed (Enter, q and yes) to proceed, even when the default options are requested for a silent installation. If I find a way to make this work, I will update the post and include that playbook as well (provided the SDK is downloaded beforehand). A possible solution could come from using the Ansible expect module. One thing to keep in mind here is that the plugin will not work if the connections to the vCenter servers go through a proxy. So no proxy please 🙂
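To make the expect idea a bit more concrete, here is an untested sketch of how the Ansible expect module could feed Enter, q and yes to the SDK installer. The prompt regexes are guesses, not copied from the real installer output, and the package name python3-pexpect assumes Ansible uses Python 3 on the target. The path assumes the SDK was unpacked under /usr/local/src.

```yaml
---
- hosts: "{{ target_host }}"
  tasks:
    # The expect module needs the pexpect Python library on the target.
    # ASSUMPTION: Ansible uses python3 on the target host.
    - name: install pexpect
      apt:
        name: python3-pexpect
        state: present
      become: true

    - name: run the Perl SDK installer and answer its prompts
      expect:
        command: ./vmware-install.pl
        chdir: /usr/local/src/vmware-vsphere-cli-distrib
        timeout: 600
        responses:
          # ASSUMPTION: these regexes are placeholders; adjust them to the
          # prompts the installer really prints on your system.
          '(?i)press enter': ''      # an empty answer sends just Enter
          '(?i)--more--': 'q'        # quit the license pager
          '(?i)do you accept': 'yes'
      become: true
```

Each response is sent followed by a newline, which is why an empty string stands in for a bare Enter. If the pager does not print anything matchable, this approach may still hang until the timeout, so treat it as a starting point only.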
Any other disclaimers?
No configuration for Nagios is included in this post. The post is about installing Nagios automatically. The nagios configuration can take up a lot of posts and none of it will be about automation, unless you are configuring an event handler to launch a task after an alert. I could include a post about that, however after my research here and even if I have the goal to move to an Icinga docker based installation for our infrastructure, I have the feeling that this whole ecosystem of plugins has served the industry for more than 10 years but is now old and sick and needs to be replaced. Perhaps Prometheus can serve as a replacement, I don’t know enough yet, but will research that in time.
Something else not included is the HTTPS implementation for the Nagios server, as well as using AD credentials to log in to the GUI. Although I have deployed those in our production environment, I feel I would not automate that task (it’s not that big a deal: it’s about changing a few lines in some configs, transferring files over to the Nagios server and restarting services). I will only do this once per server and prefer to do it manually.
Finally, be advised that the pip module for Ansible has remained broken since some point after version 2.6. You can read about it here; I used the same approach, replacing the pip module with command.
Enough words. The code?
So here is the code for the playbooks. I used a variable to define the target host for each playbook. The host passed on the command line when running the playbook must be included in your hosts file. Be careful when copying the code, as WordPress replaces sequences of '-' with longer dashes. Every playbook starts with 3 dashes.
ansible-playbook nagios-core-install.yml --extra-vars "target_host=srv-nagios-01"
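For reference, a minimal inventory that satisfies the "must be included in your hosts file" requirement could look like the sketch below. It uses Ansible's YAML inventory format; the file name hosts.yml, the IP address and the login user are made-up placeholders.

```yaml
# hosts.yml (hypothetical file name): a minimal YAML inventory.
all:
  hosts:
    srv-nagios-01:
      ansible_host: 192.0.2.10   # placeholder address, replace with your own
      ansible_user: ubuntu       # placeholder login user with sudo rights
```

With an inventory kept outside the default location, the command above would need -i hosts.yml added to it.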
Nagios Core Installation
We start with the playbook for installing Nagios Core v.4.4.5:
---
- hosts: "{{ target_host }}"
  tasks:
    # Clean up anything left behind by a previous (possibly failed) run
    - name: get rid of old installation dir
      file:
        path: /tmp/nagios
        state: absent
      become: true
    - name: get rid of semi-created installation dir
      file:
        path: /tmp/nagios-4.4.5
        state: absent
      become: true
    - name: get rid of old installation archive
      file:
        path: /tmp/nagios.tar.gz
        state: absent
      become: true
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install nagios reqs
      apt:
        name:
          - autoconf
          - gcc
          - libc6
          - make
          - wget
          - unzip
          - libssl-dev
          - apache2
          - php
          - libapache2-mod-php*
          - libgd-dev
          - build-essential
        state: latest
        force: true
        update_cache: true
      become: true
    # Fetch the source tarball and unpack it as /tmp/nagios
    - name: download and uncompress
      shell: cd /tmp;rm nagios*.tar.gz;wget -O nagios.tar.gz --no-check-certificate https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.4.5.tar.gz;tar -zxf nagios.tar.gz;mv nagios-4.4.5 nagios
      become: true
    - name: configure and make all
      shell: cd /tmp/nagios;./configure --with-mail=/usr/sbin/sendmail --with-httpd-conf=/etc/apache2/sites-enabled;make all;
      become: true
    - name: create users and groups
      command: chdir=/tmp/nagios make install-groups-users
      become: true
    - name: create nagios user homedir
      file:
        path: /home/nagios
        state: directory
        owner: nagios
        group: nagios
        mode: '0775'
      become: true
    - name: change nagios user home dir and add group members
      command: "{{ item }}"
      with_items:
        - usermod --home /home/nagios nagios
        - usermod -a -G nagios www-data
      become: true
    - name: make binaries
      command: chdir=/tmp/nagios {{ item }}
      with_items:
        - make install
        - make install-daemoninit
        - make install-commandmode
        - make install-config
        - make install-webconf
      become: true
    # Enabling Apache modules needs root, hence become
    - name: Enable apache2 modules
      apache2_module: state=present name={{ item }}
      with_items:
        - cgi
        - rewrite
      become: true
    - name: Restart Apache2 service
      service:
        name: apache2
        state: restarted
      become: true
    # Default web login is nagiosadmin / nagios; change it after installation
    - name: Create nagiosadmin User account (apache user)
      command: htpasswd -cb /usr/local/nagios/etc/htpasswd.users nagiosadmin nagios
      become: true
    # delegate_to the target itself so both src and dest are on the Nagios server
    - name: copy event handlers
      synchronize:
        src: /tmp/nagios/contrib/eventhandlers/
        dest: /usr/local/nagios/libexec
        recursive: yes
        mode: push
      delegate_to: "{{ target_host }}"
    - name: change ownership and permissions
      file:
        path: /usr/local/nagios/libexec/eventhandlers/
        owner: nagios
        group: nagios
        mode: u+rw,g+rw,o+r
        recurse: true
      become: true
    - name: Start Nagios service
      service:
        name: nagios
        state: restarted
      become: true
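The "download and uncompress" shell task is one of the steps that is not idempotent. As a hedged alternative, the get_url and unarchive modules could do the same job. This is only a sketch, not a drop-in replacement: it leaves the sources in /tmp/nagios-4.4.5 instead of /tmp/nagios, so the chdir paths of the later tasks would need adjusting, and the cleanup tasks at the top would need to go for the skip logic to have any effect.

```yaml
---
- hosts: "{{ target_host }}"
  tasks:
    - name: download the Nagios Core source archive
      get_url:
        url: https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.4.5.tar.gz
        dest: /tmp/nagios.tar.gz
        validate_certs: no   # mirrors wget --no-check-certificate
      become: true

    - name: unpack the archive unless it is already unpacked
      unarchive:
        src: /tmp/nagios.tar.gz
        dest: /tmp
        remote_src: yes
        creates: /tmp/nagios-4.4.5   # skip when the directory already exists
      become: true
```

On a second run get_url should find the archive already in place and unarchive should be skipped because of creates, so neither task reports a change.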
Nagios Plugins Installation
The playbook that follows installs Nagios Plugins version 2.3.3:
---
- hosts: "{{ target_host }}"
  tasks:
    # Clean up anything left behind by a previous (possibly failed) run
    - name: get rid of old installation dir
      file:
        path: /tmp/nagios-plugins
        state: absent
      become: true
    - name: get rid of semi-created installation dir
      file:
        path: /tmp/nagios-plugins-release-2.3.3
        state: absent
      become: true
    - name: get rid of old installation archive
      file:
        path: /tmp/nagios-plugins.tar.gz
        state: absent
      become: true
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install nagios plugins reqs
      apt:
        name:
          - autoconf
          - gcc
          - libc6
          - libmcrypt-dev
          - make
          - wget
          - bc
          - gawk
          - dc
          - snmp
          - libssl-dev
          - libnet-snmp-perl
          - gettext
          - libldap2-dev
          - dnsutils
          - build-essential
          - smbclient
          - fping
        state: latest
        force: true
        update_cache: true
      become: true
    # Fetch the release tarball from GitHub and unpack it as /tmp/nagios-plugins
    - name: download and uncompress
      shell: cd /tmp;rm nagios-plugin*.tar.gz;wget -O nagios-plugins.tar.gz --no-check-certificate https://github.com/nagios-plugins/nagios-plugins/archive/release-2.3.3.tar.gz;tar -zxf nagios-plugins.tar.gz;mv nagios-plugins-release-2.3.3 nagios-plugins
      become: true
    - name: configure and make all
      command: chdir=/tmp/nagios-plugins {{ item }}
      with_items:
        - ./tools/setup
        - ./configure
        - make
        - make install
      become: true
Install check_nrpe plugin
Here is the playbook for installing the check_nrpe plugin for NRPE v.4.0.2. I remind you that this is only for installing the plugin. With NRPE the architecture is reversed. The Nagios Core server is the client which connects to the NRPE servers in the probes to execute Nagios Plugins remotely and get back the results to be consumed by Nagios Core (reported by the GUI, fed to the alerting engine, etc).
---
- hosts: "{{ target_host }}"
  tasks:
    # The file module does not expand wildcards, so name the directory explicitly
    - name: get rid of old installation dir
      file:
        path: /tmp/nrpe
        state: absent
      become: true
    - name: get rid of semi-created installation dir
      file:
        path: /tmp/nrpe-nrpe-4.0.2
        state: absent
      become: true
    - name: get rid of old installation archive
      file:
        path: /tmp/nrpe.tar.gz
        state: absent
      become: true
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install nrpe reqs
      apt:
        name:
          - autoconf
          - gcc
          - libc6
          - libmcrypt-dev
          - make
          - wget
          - libssl-dev
          - openssl
        state: latest
        force: true
        update_cache: true
      become: true
    # Fetch the release tarball from GitHub and unpack it as /tmp/nrpe
    - name: download and uncompress
      shell: cd /tmp;rm nrp*.tar.gz;wget -O nrpe.tar.gz --no-check-certificate https://github.com/NagiosEnterprises/nrpe/archive/nrpe-4.0.2.tar.gz;tar -zxf nrpe.tar.gz;mv nrpe-nrpe-4.0.2 nrpe
      become: true
    - name: configure and make check_nrpe plugin then move it to plugins dir
      command: chdir=/tmp/nrpe {{ item }}
      with_items:
        - ./configure --enable-command-args --with-ssl=/usr/bin/openssl --with-ssl-lib=/usr/lib/x86_64-linux-gnu
        - make check_nrpe
        - make install-plugin
      become: true
    # Not idempotent: every run appends these lines again.
    # become already runs the commands as root, so no sudo is needed inside.
    - name: add services definition
      command: "{{ item }}"
      with_items:
        - sh -c "echo >> /etc/services"
        - sh -c "echo '# Nagios services' >> /etc/services"
        - sh -c "echo 'nrpe 5666/tcp' >> /etc/services"
      become: true
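The "add services definition" task appends to /etc/services on every run. A hedged alternative is the lineinfile module, which only touches the file when the entry is missing or different. This sketch drops the blank line and the comment line, and the regexp is an assumption about how an nrpe entry would start.

```yaml
---
- hosts: "{{ target_host }}"
  tasks:
    - name: make sure the nrpe port is declared in /etc/services
      lineinfile:
        path: /etc/services
        regexp: '^nrpe\s'        # assumption: an nrpe entry starts with "nrpe" plus whitespace
        line: 'nrpe 5666/tcp'
        insertafter: EOF
      become: true
```

If a line matching the regexp already exists, lineinfile replaces it with the given line instead of adding a second one, so repeated runs leave a single entry.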
Install check_esxi_hardware.py plugin
Here is the playbook to install the check_esxi_hardware.py plugin. It uses the pywbem Python module and works with both 2.x and 3.x versions of Python. It was necessary to switch to the Ansible command module for installing pywbem, due to the Ansible issue described here.
---
- hosts: "{{ target_host }}"
  tasks:
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install pywbem reqs
      apt:
        name:
          - python3-pip
        state: latest
        force: true
        update_cache: true
      become: true
    # command instead of the pip module, because of the Ansible issue mentioned above
    - name: install pywbem
      command: /usr/bin/pip3 install pywbem
      become: true
    - name: download and copy
      shell: cd /usr/local/nagios/libexec;rm check_esxi_hardware.py;wget --no-check-certificate https://www.claudiokuenzler.com/monitoring-plugins/check_esxi_hardware.py
      become: true
    # The file was downloaded as root, so editing it needs become as well
    - name: replace python interpreter
      lineinfile:
        path: /usr/local/nagios/libexec/check_esxi_hardware.py
        regexp: '^#!/usr/bin/python'
        line: '#!/usr/bin/python3'
      become: true
    - name: ensure permissions and ownership
      file:
        path: /usr/local/nagios/libexec/check_esxi_hardware.py
        state: file
        owner: nagios
        group: nagios
        mode: u=rwx,g=rx,o=rx
      become: true
Install check_hp_bladechassis plugin
Here is the playbook to install the check_hp_bladechassis Nagios plugin. This one uses SNMP to query the blade controller for status.
---
- hosts: "{{ target_host }}"
  tasks:
    # The file module does not expand wildcards, so name the directory explicitly
    - name: get rid of old installation dir
      file:
        path: /tmp/check_hp_bladechassis
        state: absent
      become: true
    - name: get rid of semi-created installation dir
      file:
        path: /tmp/check_hp_bladechassis-1.0.1
        state: absent
      become: true
    - name: get rid of old installation archive
      file:
        path: /tmp/check_hp_bladechassis.tar.gz
        state: absent
      become: true
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install check_hp_bladechassis reqs
      apt:
        name:
          - libnet-snmp-perl
        state: latest
        force: true
        update_cache: true
      become: true
    # Fetch the tarball and unpack it as /tmp/check_hp_bladechassis
    - name: download and uncompress
      shell: cd /tmp;rm check_hp_bladechassis*.tar.gz;wget -O check_hp_bladechassis.tar.gz http://folk.uio.no/trondham/software/files/check_hp_bladechassis-1.0.1.tar.gz;tar -zxf check_hp_bladechassis.tar.gz;mv /tmp/check_hp_bladechassis-1.0.1/ /tmp/check_hp_bladechassis/
      become: true
    - name: copy exec
      copy:
        src: /tmp/check_hp_bladechassis/check_hp_bladechassis
        dest: /usr/local/nagios/libexec/check_hp_bladechassis
        remote_src: true
        owner: nagios
        group: nagios
        mode: u=rwx,g=rx,o=rx
      become: true
    - name: copy man
      copy:
        src: /tmp/check_hp_bladechassis/check_hp_bladechassis.8
        dest: /usr/share/man/man8
        remote_src: true
      become: true
Install check_ipmi_sensor plugin
This one is the Ansible playbook to install the check_ipmi_sensor plugin.
---
- hosts: "{{ target_host }}"
  tasks:
    # Clean up anything left behind by a previous (possibly failed) run
    - name: get rid of old installation dir
      file:
        path: /tmp/check_ipmi_sensor
        state: absent
      become: true
    - name: get rid of semi-created installation dir
      file:
        path: /tmp/check_ipmi_sensor_v3-3.13
        state: absent
      become: true
    - name: get rid of old installation archive
      file:
        path: /tmp/check_ipmi_sensor.tar.gz
        state: absent
      become: true
    - name: refresh apt
      apt:
        name: '*'
        update_cache: yes
      become: true
    - name: install check_ipmi_sensor reqs
      apt:
        name:
          - libipc-run-perl
          - freeipmi
        state: latest
        force: true
        update_cache: true
      become: true
    # Fetch the release tarball from GitHub and unpack it as /tmp/check_ipmi_sensor
    - name: download and uncompress
      shell: cd /tmp;rm check_ipmi_sensor*.tar.gz;wget -O check_ipmi_sensor.tar.gz https://github.com/thomas-krenn/check_ipmi_sensor_v3/archive/v3.13.tar.gz;tar -zxf check_ipmi_sensor.tar.gz;mv check_ipmi_sensor_v3-3.13 check_ipmi_sensor
      become: true
    - name: copy exec
      copy:
        src: /tmp/check_ipmi_sensor/check_ipmi_sensor
        dest: /usr/local/nagios/libexec/check_ipmi_sensor
        remote_src: true
        owner: nagios
        group: nagios
        mode: u=rwx,g=rx,o=rx
      become: true
    # Start from the stock FreeIPMI config and keep a separate copy for Nagios
    - name: copy conf
      copy:
        src: /etc/freeipmi/freeipmi.conf
        dest: /etc/freeipmi/dnsipmi.conf
        remote_src: true
        owner: nagios
        group: nagios
        mode: u=rwx,g=rx,o=rx
      become: true
    # Writing under /etc needs root, hence become
    - name: add credentials to conf
      blockinfile:
        path: /etc/freeipmi/dnsipmi.conf
        insertafter: EOF
        block: |
          username noc
          password 6Cb506a!
          privilege-level user
      become: true
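The last task above keeps the IPMI credentials inside the playbook. A possible variation, sketched here under the assumption that you supply two variables yourself, pulls them from outside instead. The names ipmi_user and ipmi_password are hypothetical and would come from a vault-encrypted vars file or from --extra-vars.

```yaml
---
- hosts: "{{ target_host }}"
  tasks:
    - name: add credentials to conf without hard-coding them
      blockinfile:
        path: /etc/freeipmi/dnsipmi.conf
        insertafter: EOF
        block: |
          username {{ ipmi_user }}
          password {{ ipmi_password }}
          privilege-level user
      no_log: true    # keep the rendered password out of the Ansible output
      become: true
```

One way to feed the variables in would be ansible-playbook with --extra-vars "@secrets.yml" --ask-vault-pass, where secrets.yml is a hypothetical file encrypted with ansible-vault.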
What about the rest? Come on..
Well… ok. I will tell you how to install the check_vmware_api plugin. First make sure you install the prerequisites.
apt-get install lib32z1 lib32ncurses6 build-essential uuid uuid-dev libssl-dev perl-doc libxml-libxml-perl libcrypt-ssleay-perl libsoap-lite-perl libmodule-build-perl libmoosex-types-path-class-perl libuuid-tiny-perl libarchive-zip-perl libmonitoring-plugin-perl libio-socket-inet6-perl libnet-inet6glue-perl
Then also install one more prerequisite using cpan.
cpan install UUID
Now go get the Perl SDK v.7.0 (yes, it works with lower versions of vCenter too). Use the address I gave you before. I am afraid you need a login, so you have to register with VMware (another reason you can’t automate this). Download it to /usr/local/src, then uncompress it and run the installer.
tar -zxvf VMware-vSphere-Perl-SDK-7.0.0-15889270.x86_64.tar.gz
vmware-vsphere-cli-distrib/vmware-install.pl
Download the plugin and copy it to /usr/local/nagios/libexec. Make sure the ownership and attributes are set correctly.
wget --no-check-certificate https://github.com/ITRS-Group/system-addons-plugins-op5-check_vmware_api/archive/v2019.j.1.tar.gz
tar -zxvf v2019.j.1.tar.gz
cd system-addons-plugins-op5-check_vmware_api-2019.j.1/
cp check_vmware_api.pl /usr/local/nagios/libexec/
cd /usr/local/nagios/libexec/
chmod 755 check_vmware_api.pl
chown nagios:nagios check_vmware_api.pl
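Unlike the SDK, the plugin download and copy steps above need no interaction, so they could be expressed as Ansible tasks too. The following is an untested sketch that mirrors the manual commands. Unpacking under /tmp is an assumption of this example, not something the manual steps require.

```yaml
---
- hosts: "{{ target_host }}"
  tasks:
    - name: download and unpack the plugin archive
      unarchive:
        src: https://github.com/ITRS-Group/system-addons-plugins-op5-check_vmware_api/archive/v2019.j.1.tar.gz
        dest: /tmp               # assumption: any scratch directory would do
        remote_src: yes          # let the target host fetch the URL itself
        validate_certs: no       # mirrors wget --no-check-certificate
      become: true

    - name: copy the plugin into the Nagios plugins dir
      copy:
        src: /tmp/system-addons-plugins-op5-check_vmware_api-2019.j.1/check_vmware_api.pl
        dest: /usr/local/nagios/libexec/check_vmware_api.pl
        remote_src: true
        owner: nagios
        group: nagios
        mode: '0755'
      become: true
```

The copy task sets ownership and mode in one step, which covers the chmod and chown commands from the manual procedure.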
You are set! Read the doc and try it out (don’t forget about the issue with the proxy server or you will never find out what’s wrong). A basic command should be for example like:
./check_vmware_api.pl -D vcenterserver -u user@domain -p password -l runtime -s status
Done! What’s next?
Well, working on that took a while. Using these to install Nagios from scratch should get you a working system within the day. I won’t say something fancy like 30 minutes. It’s possible, but sometimes package installations take a while, especially with cpan, and you may run into other issues that have nothing to do with the playbooks. The Ansible version is 2.9 btw.
For the next steps I am thinking about dockerizing the Nagios server, but I have a feeling it will not be practical. First of all, it’s difficult to get around the interactive part. Then I have a feeling that the container will be huge. I may give it a try soon. If I do and succeed, you will get a sequel to this post.
My own plans include migrating to the Icinga platform (I will use my friend Gian Paolo’s suggestions on that), but since the plugins are more or less shared, I may seek a way to replace the whole thing with something totally different. Siming Yuan of Cisco pyATS has suggested Prometheus, so I intend to take a look at that too.
Enjoy, and if you have questions you know where to find me (@mythryll on Twitter).