{"id":1371,"date":"2021-04-10T18:49:44","date_gmt":"2021-04-10T16:49:44","guid":{"rendered":"https:\/\/www.mythryll.com\/?p=1371"},"modified":"2021-04-10T22:08:24","modified_gmt":"2021-04-10T20:08:24","slug":"how-to-clean-up-tags-in-your-malfunctioning-tig-stack-for-snmp-monitoring-or-chasing-runaway-tags-with-influxql-grep-and-a-python-script","status":"publish","type":"post","link":"https:\/\/www.mythryll.com\/?p=1371","title":{"rendered":"How to clean up tags in your malfunctioning TIG stack for SNMP monitoring &#8211; or chasing runaway tags with InfluxQL, grep and a Python script"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Hello folks, this is going to be a quick one (well compared to the others&#8230; ). A while back we had to switch our SNMP monitoring to SNMPv3, which was a bit of a problem as it wasn&#8217;t supported on all of our switches. The newer models supported both auth and priv modes but the older ones mostly supported just auth (no encryption). It was a bit of a pain to separate which was which, as there was also the hashing and encryption algorithms we needed to make sure were supported. After fighting a few battles with our multiple NMS software, we finally claimed victory. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Ninja InfluxDB Assassins or Runaway Tags<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Of course that wasn&#8217;t the case exactly with our TIG stack as the procedure meant that I had to separate the switches into two categories and have each config file in our telegraf.d directory (the folder from where each config file is loaded) split in two and adjust our settings for the SNMPv3 modes. It was a tedious task and took some time and, as I was interrupted multiple times, .. I left some mistakes behind. As most stuff worked when I finished, I never found out about them and went on to the next project with a happy smile on my face.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At some point during last week we faced a few problems with our VPN infrastructure. Naturally we turned to our TIG stack for performance graphs. But something was not right.. For some nodes, graphs were not working. I checked whether the containers were up.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker ps<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Not it. For the same TIG nodes some things worked and some didn&#8217;t. As I had neglected to study how to properly create and assign retention policies to my databases, I felt guilt and fear creeping up on me.. So I turned to the logs:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker-compose logs --follow --tail 100<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That&#8217;s when I noticed the message complaining about max tag values:<code> \"lvl=warn msg=\"max-values-per-tag limit may be exceeded soon\"<\/code> . I thought that total data retention had hit me so following what is described in this <a rel=\"noreferrer noopener\" href=\"https:\/\/docs.influxdata.com\/influxdb\/v1.8\/query_language\/manage-database\/#modify-retention-policies-with-alter-retention-policy\" target=\"_blank\">link<\/a> I had altered the default retention policy &#8220;autogen&#8221; with the following :<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ALTER RETENTION POLICY \"autogen\" ON \"telegraf\" DURATION 52w SHARD DURATION 120d DEFAULT<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That of course works after you enter your container and run the influx cli, authenticated and chosen your database:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker exec \u2013it influxdb \/bin\/bash \ninflux\nauth\nuse telegraf<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I also deleted some data before that point in time with <code>DELETE WHERE time &lt; '2020-03-01 01:01:01'<\/code> but I found out that deleting data doesn&#8217;t remove those from the index, the command <code>DROP<\/code> does that and I had some trouble using that. I started reading more from the section of the documentation for InfluxDB where the InfluxDB Query Language is referenced: <a rel=\"noreferrer noopener\" href=\"https:\/\/docs.influxdata.com\/influxdb\/v1.7\/query_language\/\" target=\"_blank\">https:\/\/docs.influxdata.com\/influxdb\/v1.7\/query_language\/<\/a> (version 1.7).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As the log messages kept coming, it dawned on me that deleting data wasn&#8217;t the answer, my problem was that I had too many tags, apparently (duh..), I just didn&#8217;t want to believe it. As the main tag source in my case was the different hostname values from our network nodes, that could not be normal, the number of nodes per TIG host was way below the max tag limit (default is 10000). I used the following command to see what tag values I had:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SHOW TAG VALUES WITH KEY = \"hostname\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That produced too many values indeed but what was strange was that the surplus values were all integers. Something was feeding bad data into my DBs. I needed to do two things:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Find the source of the problem so the flooding of bad values would stop.<\/li><li>Clean up the mess.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Learn to sleep on it<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">I was getting tired, it was time to go home so I gave up for the day. Next morning the result of too many google searches, big data and Big Brother watching us bore fruit.. The following post appeared on my google discover page as I was walking my dog before going to work: <a rel=\"noreferrer noopener\" href=\"https:\/\/www.influxdata.com\/blog\/solving-runaway-series-cardinality-when-using-influxdb\/\" target=\"_blank\">https:\/\/www.influxdata.com\/blog\/solving-runaway-series-cardinality-when-using-influxdb\/<\/a> . It was obvious, that was my issue. I didn&#8217;t quite understand what was that language and what were those references about buckets but I soon discovered that it didn&#8217;t apply to my version of InfluxDB but to version 2.0: <a rel=\"noreferrer noopener\" href=\"https:\/\/docs.influxdata.com\/influxdb\/v1.8\/concepts\/glossary\/#retention-policy-rp\" target=\"_blank\">https:\/\/docs.influxdata.com\/influxdb\/v1.8\/concepts\/glossary\/#retention-policy-rp<\/a> (&#8220;<em>A bucket is a named location where time series data is stored in <strong>InfluxDB 2.0<\/strong>. In InfluxDB 1.8+, each combination of a database and a retention policy (database\/retention-policy) represents a bucket.<\/em>&#8220;). So I had to find another way to clean things up.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solve the problem<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Just like the famous phrase from <em>&#8220;A.Friend&#8221;<\/em> in the film <a rel=\"noreferrer noopener\" href=\"https:\/\/www.imdb.com\/title\/tt0109635\/?ref_=nm_flmg_act_39\" target=\"_blank\">Disclosure<\/a>. I had to tackle the real issue. I used the commands <code>SHOW TAG VALUES WITH KEY = \"hostname\"<\/code> and <code>SHOW SERIES FROM snmp<\/code> (a table where I look up the hostname and define it as tag that is later inherited by the interface tables, basic monitoring stuff from my older post series, <a rel=\"noreferrer noopener\" href=\"https:\/\/www.mythryll.com\/?p=922\" target=\"_blank\">part1<\/a> and <a rel=\"noreferrer noopener\" href=\"https:\/\/www.mythryll.com\/?p=963\" target=\"_blank\">part2<\/a>). The second command gave me a quick idea on which nodes were giving me problems. I went back to my telegraf snmp plugin configs and found out I had two kind of problems:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>I had copied configs from other group of switches that did support auth and priv mode into configs for older switches that didn&#8217;t support encryption. So the responses came back encrypted and that&#8217;s how the bogus hostnames \/ tag values were created.<\/li><li>I had cloned configs and modified them correctly to support older switches with SNMPv3 but had left the nodes addresses in the original file too, thus resulting in the same problem as above.<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">For both of those issues I had to look inside the configs and do a lot of grep for the ip addresses contained in the &#8220;agent_hostname&#8221; field to find out if each ip address was present in multiple config files. For example, while in the telegraf.d folder, running the following :<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>grep \"10.0.1.10\" *<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">&#8230; will return all the occurrences of the ip address 10.0.1.10 along with the config filenames that contain it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Of course the list of bad tag values was enormous (thousands). So it was too difficult to proceed with finding the all the causes, even if I went first for the &#8220;low hanging fruit&#8221; as the blog post from Influx Data suggested (the values causing the most problems). My eyes literally hurt and I felt I was loosing the battle. I had to find a way to clean up my DBs faster.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Clean up the mess<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In an <a rel=\"noreferrer noopener\" href=\"https:\/\/www.mythryll.com\/?p=1004\" target=\"_blank\">older post<\/a> I described how I used the influxdb python library to create data points for the number of Checkpoint VPN users and then graphed the results with Grafana, a solution still in use today, as the pandemic is still fully going on. Using Python was probably a good idea, so I looked the <a rel=\"noreferrer noopener\" href=\"https:\/\/influxdb-python.readthedocs.io\/en\/latest\/\" target=\"_blank\">documentation<\/a> up again but the available methods to manage my db and in particular to <em>drop series with a certain set of criteria<\/em>, <strong>were not implemented<\/strong>. What <em>was<\/em> implemented, was the way to execute queries with influxql, which was better than nothing. It would not be as fast, but it would certainly be faster than me typing those queries by hand and my eyes would hurt less. So after some trial and error, I came up with the following script. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!\/usr\/bin\/python3\nfrom influxdb import InfluxDBClient\n\n#Define the connection object\nclient = InfluxDBClient(host='servername', port=8086, username='influxdbuser', password='influxdbpassword')\nclient.switch_database('telegraf')\n#set a query to find the tag values that come from the \"hostname\" tag defined in the telegraf config\nquery = 'show tag values from snmp with key = \\\"hostname\\\"'\n#execute the query\nresult = client.query(query)\n#store the different tag values\nvalues = result.raw&#91;'series']&#91;0]&#91;'values']\n#loop through the tag values, find if that value is a positive integer, and if so, drop that series from the index. That takes care of the tag value as well (it's removed)\nfor item in values:\n    if item&#91;1].isdigit():\n        print(\"item:\")\n        print(item)\n        dropquery = f'drop series where \\\"hostname\\\" = \\'{item&#91;1]}\\''\n        dropresult = client.query(dropquery)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After running that, I had to restart the influxdb instance so it was cleaned up and back in business as usual (I used docker-compose in my setup to combine telegraf and influxdb for each node, as is described in the series of posts I mentioned earlier).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker-compose restart<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I made sure (using the logs) that influxdb has booted up normally and was back accepting writes (POST requests) from telegraf before going in to check again.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker exec \u2013it influxdb \/bin\/bash \ninflux\nauth \nshow series \nshow series from snmp\nshow series from interface\nshow series from interface old\nshow tag values with key = \"hostname\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">No strange tag values there, so all good!!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Wrap-up!<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Well that was it! Not too long this time right? Same here, it only took about an hour to write and get everything right. If you want to know more about how to use the influxdb python client, besides taking a look at the <a rel=\"noreferrer noopener\" href=\"https:\/\/influxdb-python.readthedocs.io\/en\/latest\/\" target=\"_blank\">documentation<\/a>, take a look at their github page, where you can find code and examples: <a rel=\"noreferrer noopener\" href=\"https:\/\/influxdb-python.readthedocs.io\/en\/latest\/\" target=\"_blank\">https:\/\/influxdb-python.readthedocs.io\/en\/latest\/<\/a> . Take care, until next time, you can always look me up on Twitter under the handle mythryll.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello folks, this is going to be a quick one (well compared to the others&#8230; ). A while back we had to switch our SNMP monitoring to SNMPv3, which was a bit of a problem as it wasn&#8217;t supported on all of our switches. The newer models supported both auth and priv modes but the&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[50,37,52,51,36],"class_list":["post-1371","post","type-post","status-publish","format-standard","hentry","category-it","tag-cardinality","tag-influxdb","tag-influxdb-python-library","tag-tag-values","tag-telegraf"],"_links":{"self":[{"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/posts\/1371","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1371"}],"version-history":[{"count":36,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/posts\/1371\/revisions"}],"predecessor-version":[{"id":1411,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=\/wp\/v2\/posts\/1371\/revisions\/1411"}],"wp:attachment":[{"href":"https:\/\/www.mythryll.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mythryll.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}