Migrating from MRTG/RRD/CGI to Grafana/InfluxDB/Telegraf on Docker containers for Enterprise Network Performance Monitoring with SNMP – Part 2

In the previous episode…

In the previous part of this series of posts, I described how it came to be that our SNMP monitoring infrastructure based on MRTG/RRD/CGI had reached the end of its days and how I decided to use Grafana with InfluxDB and Telegraf to replace it. The last post stopped right before moving to a Docker based installation for the main components: Telegraf, InfluxDB and Grafana.

Do we really need Docker? Why? Oh.. and.. what is Docker btw?

Well, it’s understandable that once you start off trying to “patch a hole” in your monitoring ecosystem, especially in production, chances are you want it to be done ASAP and move on because you have more things to do and time is limited.

Normally, that probably should have been the case here, so why go the extra mile and deploy everything on Docker?

Two reasons mostly:

The first reason I have already mentioned in the previous post: the normal OS-based installation is not automated enough, not fast enough and not clean enough (software dependencies), although it may be simpler to set up, follow and build on.
The second reason is that I am curious by nature. That means I will often go out of my way to try new things, in the strong belief that the extra time and effort spent will give me skills that eventually allow me and my team to become more effective and broaden our capabilities. It’s like learning a new trade: you never know when it’s going to come in handy.

As I mentioned in the previous post too, there was no way to make a soft hand-off between the old platform and the new, so I took a leap of faith and went for the docker versions. Here is what I already knew about Docker:

Docker is essentially a platform for application virtualization in contrast with platforms that do host virtualization, like VMware or KVM.
The equivalent of a virtual machine in Docker is a container, which is an “isolated space” for an application. Only the files and libraries the application needs are kept in the container, while everything else is provided by the kernel of the underlying OS (the Docker host), which all containers share.
Containers are very light-weight in comparison with VMs and allow for:
- extreme versatility and scaling
- independence from host libraries and applications
- economy on host resources
- increased application performance
Containers are created from premade packages (images), which are available from central locations (registries).
Their networking under-layer is more or less a form of bridging.
Containers can be started, stopped and otherwise manipulated either through the command line, or by using integrated tools like docker-compose, or orchestration platforms like Kubernetes.

I know the above seems a bit vague, and it might even make DevOps people slightly angry but hey, it’s all I knew at the time, and I most certainly still have a lot to learn.

There were of course excellent resources I could use to learn more (I still intend to go back to those), like the NetDevOps Series by Julio Gomez on the Cisco Developer Blog, the Cisco DevNet Learning Labs/tracks like Docker 101 and Advanced Docker features, or even the Docker Documentation itself. (Btw, if you want to know more about Kubernetes, a lot of companies offer tutorials. VMware has even launched a Kubernetes Academy, which is probably related to their new project, Project Pacific.)

But I had no time for that so I dived right in.

Building the bike while riding it

Once again I started looking for what is out there. The docker images for Grafana, InfluxDB and Telegraf were available. In fact there were a lot more available than the defaults. I will elaborate on that later.

I decided to follow the procedure described in this post. The basic steps are the following:

Prepare your Docker host (the physical machine or VM where you will install Docker in order to host your containers).
Install the latest stable version of Docker, like in this post.
Create the necessary users that will own the persistent files:
- We want the influxdb database to be persistent even if we re-install the docker container for influxdb. That means we need to share files between the container and the host, essentially mounting directories/files from the host on the container. The same goes for the lib directory.
- We also want persistent configuration files for all three components.
Prepare the database initialization script (an InfluxQL script in a .iql file). This will be run once in the beginning, right before creating our Influxdb container, in order to prepare the database, create the necessary structures etc.
Run Docker command to download and launch the Influxdb container.
Same procedure for the other components (Telegraf and Grafana): prepare persistent files and directories, define user rights, run Docker command to download and launch the containers.
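
The preparation steps above can be sketched roughly as follows. This is only a minimal sketch under my own assumptions, not the exact procedure from the post: the BASE_DIR variable, the init.iql file name, the telegraf database and user names and the changeme password are all placeholders. BASE_DIR defaults to a scratch directory so the sketch can be tried safely; on the real Docker host it would be /opt.

```shell
#!/bin/bash
# Sketch: prepare the persistent directories and the InfluxDB init script.
set -eu

# Assumption: a scratch directory by default; use BASE_DIR=/opt on the real host.
BASE_DIR="${BASE_DIR:-$(mktemp -d)}"

# Directories that will later be mounted from the host into the containers.
mkdir -p "$BASE_DIR/influxdb/etc/scripts" \
         "$BASE_DIR/influxdb/varlib" \
         "$BASE_DIR/telegraf/etc/telegraf.d" \
         "$BASE_DIR/telegraf/etc/snmp/mibs"

# InfluxQL statements to run once when the database is initialized.
# Database name, user name and password are placeholders.
cat > "$BASE_DIR/influxdb/etc/scripts/init.iql" <<'EOF'
CREATE DATABASE telegraf;
CREATE USER telegraf WITH PASSWORD 'changeme';
GRANT ALL ON telegraf TO telegraf;
EOF

# On the real host, the data directory would then be handed to the
# dedicated user, for example:
#   chown -R 999:999 "$BASE_DIR/influxdb/varlib"
echo "Prepared $BASE_DIR"
```

Running it with BASE_DIR=/opt (as root) would create the same layout in the place where the containers expect to find it.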

There are a few points where the procedure described in the post didn’t fit our situation. Firstly, our setup would not be an all-in-one package, as our infrastructure is spread out and it’s a better idea to gather SNMP statistics a little closer to the targets (network devices) whenever possible. In our case there would have to be 3 separate docker hosts with Telegraf+InfluxDB containers and one for Grafana.

That in turn means that it makes no sense for us to hide any of the containers’ services behind the Docker host. The post includes a part on the networking setup needed to accomplish that, if that is your goal. Personally, I would recommend watching Linux Bridges, IP Tables, and CNI Plug-Ins – A Container Networking Deepdive from NetDevOps Live with Matt Johnson in order to understand container networking a little better.

Finally, while defining a retention policy, as described in the post, is a good idea, so that your database does not continue to grow indefinitely, doing that presented real problems down the line when I tried to import grafana dashboards, so I passed on that for the time being. The alternative is to delete data on your own, either manually or by creating a script. I may get back to it at a later time.
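
A manual clean-up could be as small as a single InfluxQL DELETE statement. The sketch below only builds and prints such a statement; it is not something I have put in production. The cutoff date, the telegraf database name and the credentials in the commented docker exec line are assumptions.

```shell
#!/bin/bash
# Sketch: build an InfluxQL statement that deletes all points older than a cutoff.
set -eu

# Prints a DELETE statement for the given RFC3339 cutoff timestamp.
build_prune_query() {
  local cutoff="$1"
  printf "DELETE WHERE time < '%s'" "$cutoff"
}

QUERY="$(build_prune_query "2019-01-01T00:00:00Z")"
echo "$QUERY"

# On the Docker host the statement could then be run inside the container, e.g.:
#   docker exec influxdb influx -username admin -password yourpassword \
#       -database telegraf -execute "$QUERY"
```

Scheduling something like this from cron would be one way to keep the database from growing without defining a retention policy.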

After all this, one gets a working installation of the TIG stack on Docker. But there were a few surprises for me. I guess I should have expected them, as the post isn’t meant for SNMP monitoring; it is just a demonstration of the very basic capabilities, monitoring the same host where Telegraf is installed (only the basic plugins are used):

I could not install the SNMP mib downloader on the telegraf container after the container was launched, so no MIB II support (not even the basic interface stats). In fact I could not install anything, as there is no sudo command, not even an editor I can use to change config files. I guess one can use docker cp to copy files over (for the mibs) but if you can’t edit configs, that doesn’t really do anything.
The container timezone was wrong. That can be important in the context of performance monitoring (e.g. all the graphs would have an incorrect time for the data).
Adding anything on top of the container image was practically impossible, as that is the basic idea behind the container principle: create a minimal image of the software packaged with just the necessary files, secure and not easy to break.

I figured out how to change the timezone in the container, but if you want to include software before launching the container you have to use a dockerfile (thank you Clay for the tip!). A dockerfile basically describes the procedure that Docker uses to build the image from which the container is launched. So in order to deviate from the basics, you either write your own additional dockerfile (it’s like importing the original one and then adding your own steps) or you package a different image and upload it to a registry. As I was and still am a rookie, I went for the first option. Here is what that dockerfile for Telegraf looked like:

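# Start from the official Telegraf image
FROM telegraf:latest

# snmp-mibs-downloader lives in Debian's non-free section, so add it to the sources.
# The release name (stretch here) has to match the Debian release of the base image.
RUN echo "deb http://ftp.us.debian.org/debian stretch main non-free" >> /etc/apt/sources.list

# Install the MIB downloader and an editor without any interactive prompts
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends snmp-mibs-downloader vim

# Copy any extra MIBs from ./mibs into the default MIB directory
COPY mibs/* /usr/share/snmp/mibs/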

This way, the SNMP MIB downloader gets installed while the image is being built, and the vim editor is added as well. Right after that, any extra MIBs are copied from the ./mibs directory to the default MIB directory in the image.

I am sorry.. can you do this again?

Still, using a lot of docker run command line sequences as the post describes, although great to get you going, doesn’t look like much of an automated procedure. I guess it’s possible to create scripts that would do that, but there is something better to use instead: docker-compose.

Using docker-compose was what was needed in this case: get everything in one place so that the whole container creation and launch process is automated. I had a little help from a colleague and a few tips from Clay Curtis to get a better understanding, but the basics are the following:

It’s a YAML file
You have to define the basic services that you want to group together for your application.
Under each service, you have to define a few things as well:
- image and container name
- environment variables: besides the ones that the container itself makes use of, you will find that some applications take initial configuration values this way, which may override the config file. In our case you can do that for Grafana.
- file paths (volumes) from the host that you want to mount inside the container with the access rights you want to define for them.
- network ports that the container will listen on.
- dependencies between services (e.g. you don’t want to launch telegraf before influxdb)
- command line options.
- a possible dockerfile you want to use additionally.
Once you are done, you can give the whole thing a go by running docker-compose up -d and watch the magic happen. If you want to check what else is possible using the docker-compose CLI, you can take a look at the documentation.
I had to look up the differences between the compose file versions; you might want to take a look at that too.

So after trying things a bit I ended up writing two scripts: one for initializing the influxdb (the procedure described in the post) and one for cleaning things up in case I wanted to start over. Here is the one for initializing the database:

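#!/bin/bash
# One-off run of the InfluxDB image's init script: it creates the admin user and
# executes the scripts mounted at /docker-entrypoint-initdb.d, writing the result
# to the persistent data directory. The container is removed afterwards (--rm).
docker run --rm -e INFLUXDB_HTTP_AUTH_ENABLED=true \
         -e INFLUXDB_ADMIN_USER=admin \
         -e INFLUXDB_ADMIN_PASSWORD=yourpassword \
         -v /opt/influxdb/varlib:/var/lib/influxdb \
         -v /opt/influxdb/etc/scripts:/docker-entrypoint-initdb.d \
         influxdb /init-influxdb.sh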

and this one is for cleaning up:

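#!/bin/bash
# Stop and remove both containers
docker stop telegraf
docker stop influxdb
docker rm influxdb
docker rm telegraf
# Remove the images so that they are downloaded and built again next time
docker rmi telegraf:latest influxdb:latest build_telegraf:latest
# Wipe the persistent InfluxDB data (this deletes the database!)
rm -R /opt/influxdb/varlib/*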

So to set up everything from a clean start, I had to create a basic file structure with my MIBs and the config files, run the influxdb init script, and then run docker-compose up -d to get everything up. It’s pretty impressive when you finally get to see it. After having spent so much time building up monitoring infrastructure, seeing one get built in seconds really blows your mind! The great thing about it is that if the persistent files are intact and you just delete your containers and images and run docker-compose again, the latest images are downloaded and you get your software upgraded.
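
The clean-start sequence could be wrapped in one small script. The sketch below is an assumption about how that might look, not the script I actually use: init-influxdb.sh is a hypothetical file name for the init script shown above, and DRY_RUN defaults to 1 so that the commands are only printed unless you explicitly ask for them to be executed on a real Docker host.

```shell
#!/bin/bash
# Sketch: bring the Telegraf+InfluxDB stack up from a clean start.
set -eu

# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rebuild_stack() {
  run ./init-influxdb.sh               # hypothetical name for the init script
  run docker-compose build --pull      # rebuild the custom Telegraf image on a fresh base image
  run docker-compose pull influxdb     # fetch the current InfluxDB image
  run docker-compose up -d             # create and start the containers in the background
}

PLAN="$(rebuild_stack)"
echo "$PLAN"
```

Keeping the dry-run default makes it easy to review what would happen before touching a production host.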

If you just want to add a few mibs, change the configuration for snmp in order to use the mibs and the config for telegraf in order to gather more SNMP statistics, then you can edit the files in the docker host and then just restart your telegraf container:

docker restart telegraf
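
As an illustration of such a change, the sketch below drops an extra SNMP input file into the telegraf.d directory. It is a minimal example under my own assumptions: the agent address 192.0.2.1, the public community string and the snmp-interfaces.conf file name are placeholders, and CONF_DIR defaults to a scratch directory (on the Docker host it would be /opt/telegraf/etc/telegraf.d).

```shell
#!/bin/bash
# Sketch: add a Telegraf SNMP input that polls the interface table of one device.
set -eu

# Assumption: a scratch directory by default; on the Docker host this would be
# /opt/telegraf/etc/telegraf.d, which is mounted into the Telegraf container.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/snmp-interfaces.conf" <<'EOF'
[[inputs.snmp]]
  # Placeholder device address and community string
  agents = ["udp://192.0.2.1:161"]
  version = 2
  community = "public"

  # Device name, attached as a tag to every measurement
  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  # One row per interface, taken from the standard interface table
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = ["hostname"]
    oid = "IF-MIB::ifTable"

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true
EOF

echo "Wrote $CONF_DIR/snmp-interfaces.conf"
# On the Docker host, pick up the change with: docker restart telegraf
```

Because the OIDs are given by name, this only works once the corresponding MIBs are available inside the container.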

This is what the docker-compose.yml file looks like:

version: '3'

services:
  influxdb:
    image: influxdb:latest
    container_name: influxdb
    # Run as the UID:GID that owns the persistent files on the host
    user: "999:999"
    volumes:
      - /opt/influxdb/etc/influxdb.conf:/etc/influxdb/influxdb.conf
      - /opt/influxdb/varlib:/var/lib/influxdb
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "8086:8086"

  telegraf:
    container_name: telegraf
    build: .
    environment:
      - HOST_PROC=/host/proc
    volumes:
      - /proc:/host/proc:ro
      - /opt/telegraf/etc/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /opt/telegraf/etc/telegraf.d:/etc/telegraf/telegraf.d:ro
      - /opt/telegraf/etc/snmp/snmp.conf:/etc/snmp/snmp.conf:ro
      - /opt/telegraf/etc/snmp/mibs:/etc/snmp/mibs
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    depends_on:
      - influxdb
    command:
      - "--config-directory"
      - "/etc/telegraf/telegraf.d"

Of course that has to be done for the Grafana node as well. In that case there were more issues to take care of inside the container, like how to activate Active Directory (LDAP) integration and how to use https for the Grafana server, and a few things outside it, like preparing the configuration, the Grafana database and the provisioning locations. This is what the docker-compose file looks like:

version: '3'

services:
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - ./data:/var/lib/grafana
      - ./etc/grafana.ini:/etc/grafana/grafana.ini
      - ./etc/ldap.toml:/etc/grafana/ldap.toml
      - ./etc/provisioning/dashboards:/etc/grafana/provisioning/dashboards
      - ./etc/provisioning/datasources:/etc/grafana/provisioning/datasources
      - ./ssl:/etc/grafana/ssl
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - GF_SERVER_ROOT_URL=https://yourgrafanawebsite:3000/
      - GF_SERVER_PROTOCOL=https
      - GF_SERVER_DOMAIN=yourdomain
      - GF_SECURITY_ADMIN_USER=adminuser
      - GF_SECURITY_ADMIN_PASSWORD=adminpassword
      - GF_SERVER_CERT_FILE=/etc/grafana/ssl/yourserver.cert.pem
      - GF_SERVER_CERT_KEY=/etc/grafana/ssl/yourserver.nopassword.key.pem

Are we done yet?

Ehm, sorry … no. We have covered most of the infrastructure part. However, once you start working on the whole stack, adding in MIBs, configuring Grafana and building dashboards, reports and alerts, you probably want to automate all of that too: backing up your data, wiping the whole thing and reinstalling everything in as little time as possible (the goal is in the order of a few minutes), and finding everything as you left it.

The Grafana setup along with LDAP, https, using the API for dashboards and datasources, the JSON structure for both, alerts and SMTP setup will have to wait until Part 3! Right now our Grafana setup is fully operational in our production environment, but I still have some more testing to do on backing up and restoring data and on re-provisioning the dashboards with the correct sources when recreating the environment. I promise to be done for Part 3. If in the meantime you need to ask a question, look me up on Twitter under @mythryll.

Until then, take a look at the docker cli reference, and give containers a try!
