Last one?
In the previous parts we set up the playground and looked at the planning. In this final part, we explain the process.
The process
After deciding what to make and how it should function, it was time to get cooking.. (here is a nice song for you to listen to while we get the fire going..)
Recipe Ingredients
To make this work, besides a functional legacy network, a functional SDA Network, Cisco DNAC, Cisco ISE and a Checkpoint Firewall Management appliance, you would need the following:
- A Nautobot instance, where all of your legacy switches are added with at least one ip address defined as the main ip address for each device. The code assumes at least sites/locations, platform definitions and device roles: access & distribution – the names are arbitrary, you can define whatever you prefer, it’s only about data collection. The code is tested and works against Nautobot version 1.6.7. For version 2.x some changes are needed and there is also a specific bug that needs a workaround; I will mention this later.
- A project folder where you will clone this repo, the one with the actual code. I will provide a lot more comments about this repo and the code in a following paragraph.
- A Python virtual environment where you will install the python modules required for the project code to work. Again comments will follow about the required modules.
- Credentials for your platforms
- A file with your user data (first/last name, username, telephone number, department/section, etc.), essentially a directory.
- Lots of patience.
Nautobot
My own instance is on docker with docker-compose, but you can set it up the way you like. Here is a github repo I prepared for a friend to help him get started; be aware of possible issues with creating a superuser if you start from scratch (I am reading there can be hiccups, look in the NTC slack, Glenn Mathews is up-to-date on the subject). NTC has their own github repo you can use to set up Nautobot on docker-compose too. Mine comes with ldap and a few other things as well. Theirs is.. well, they are NTC. Also, while looking for the link for that repo, I saw they have enriched the docs and features a lot, so you may want to check there first. I know I will soon take another look for sure.
By now you should already have a Nautobot instance, some IPAM information in it, and some devices in the DCIM part with main ip addresses defined (you need to have assigned an ip address to at least one interface and then declare that ip address as the main one for the device). Also, device roles need to be defined as ‘ac-access’ (for the switches where the users are connected) and their status set to active. Those values are used when filtering, both for Nautobot and Nornir. The access switches are queried for mac address and vlan data. For the distribution switches, which are used to collect arp data, the role is ‘ac-distribution’.
Furthermore, platforms are defined, where cisco-ios corresponds to devices supporting ssh transport and cisco-ios-telnet to devices supporting only telnet transport (Kirk, I know you call me the guy with the telnet switches, people told me; you can stop, as I am getting rid of those with this project, and look, I am using your code to do it: Netmiko and Nornir!)
I am using Sites, defined in the organization menu, and locations (locations don’t play a role in the code unless you move to v.2.x, where sites are also turned into locations; I plan to provide a new version of the code when I migrate the prod instance to v.2.x, so far I have only been testing). Sites do play a role: they are used to choose the right distribution switches for each site.
I am finally using tags to choose which switches I am migrating every time and filter for those only, when the dynamic Nornir inventory is created. Tags are a great way to group things together using arbitrary criteria. For example, we would choose locations to be migrated based on user and host distribution, or special cases that needed extra attention like IoT, access controllers, media encoders, etc. It usually wasn’t a case of “now we will migrate the 3rd floor of x building”, so using tags really helped. The tag names were chosen with practicality in mind, so “migrate-from-old-rack”, “migrate-to-new-rack”, and “migrate-from-legacy-network” were typically used to tag switches for the migration phases.
Nornir’s purpose here is to run collection tasks against the subset of switches that we get after the first filtering during the dynamic inventory creation (using the nornir-nautobot plugin for Nornir, that does exactly that). With a dynamic inventory based on Nautobot we avoid having to create a static one each time we need a different set of switches for the next migration and at the same time we get to choose the switches involved in an intuitive way, by tagging the switches with the appropriate tag within the Nautobot GUI.
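To make that more concrete, here is a minimal sketch (not the project’s actual code) of initializing the dynamic inventory with the nornir-nautobot plugin, filtered by tag, role and status. The URL, token and filter values are placeholders, and the exact filter keys should be checked against the plugin and Nautobot versions you run.

from nornir import InitNornir

# Minimal sketch: a dynamic Nornir inventory built from Nautobot,
# filtered by tag, role and status. All values below are placeholders.
nr = InitNornir(
    runner={"plugin": "threaded", "options": {"num_workers": 20}},
    inventory={
        "plugin": "NautobotInventory",
        "options": {
            "nautobot_url": "https://nautobot.example.com",
            "nautobot_token": "0123456789abcdef0123456789abcdef",
            "filter_parameters": {
                "tag": "migrate-from-legacy-access",
                "role": "ac-access",
                "status": "active",
            },
            "ssl_verify": False,
        },
    },
)
print(list(nr.inventory.hosts))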
Github Repo
Python Version and Modules
Choices
A lot of modules were dictated by the components themselves. Others were used to help in the code production and output processing. And the rest came as dependencies. The packages used in my case are in the file requirements.txt along with their current versions (be careful, current means as I am writing this post, things may be different as time goes by).
I started working with python 3.8+ but soon moved on to 3.9+. I finally settled on version 3.12, but anything from version 3.10 onwards should do fine.
Frameworks
Nornir of course, as already mentioned. It’s my go-to choice when running tasks against multiple network devices in parallel. I am not saying there are no other options for it. You can choose asyncio, pyATS in parallel, or other frameworks if your platforms allow for it. I like this one, it works, it’s fast and I understand it well enough to be productive with it. I would love to be able to use it inside Nautobot without launching the code externally, just make an app (Django), define jobs (python), choose the switches and press the button! But I need to study and test more for that and I can’t do it before this project is over.
An advanced feature of Nornir is the ability to filter the hosts you need from the initialized inventory. A subset of the initial inventory is then created, and any task created with that inventory as a parameter will only run against that subset of hosts. A different filter results in a different subset of hosts, without the need to re-initialize the dynamic inventory we created from Nautobot with the nornir-nautobot plugin. In our case this is used to separate the switches that support ssh from those that only support telnet, so we can define the rest of the connection parameters appropriately.
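As a rough sketch of that split (assuming the platform slugs from Nautobot, cisco-ios and cisco-ios-telnet, end up as the hosts’ platform attribute, and reusing the nr object from the inventory sketch above), it could look like this:

from nornir.core.filter import F

# nr is the Nornir object initialized from Nautobot in the previous sketch.
ssh_hosts = nr.filter(F(platform="cisco-ios"))
telnet_hosts = nr.filter(F(platform="cisco-ios-telnet"))

# For the telnet-only switches, point netmiko to its telnet device type.
for host in telnet_hosts.inventory.hosts.values():
    host.platform = "cisco_ios_telnet"

# For the ssh switches, netmiko's regular cisco_ios device type applies.
for host in ssh_hosts.inventory.hosts.values():
    host.platform = "cisco_ios"

Each subset can then run its own tasks without touching the other subset or re-initializing the inventory.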
SDKs
PyNautobot
I had to be able to ‘talk’ to Nautobot on some occasions, so pynautobot was necessary for that, but it was installed as a dependency anyway when installing nornir-nautobot, the plugin that allows Nautobot to be used as the source of a dynamic inventory for Nornir.
Cisco ISE Requests/SDK
‘Talking’ to ISE was also necessary. ISE has a rich collection of REST API endpoints. I usually prefer to use an SDK, as it minimizes the grind and the delay of having to deal with a lot of back and forth (like pagination). Such an SDK is this package and here is the documentation: ciscoisesdk.
In this case I just used regular access to the REST API with the requests library, as it was pretty simple for the active sessions list. If you want to explore the ISE REST API with Postman or another REST client, here is Nicolas Russo’s Cisco ISE postman collection.
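As a hedged sketch of that simple approach (the monitoring active-sessions endpoint and the XML element names below are what I believe recent ISE versions expose, so verify them against the API documentation for your version; host and credentials are placeholders):

import requests
from xml.etree import ElementTree

# Pull the active sessions list from the ISE monitoring (MnT) REST API.
ISE_HOST = "ise.example.com"
url = f"https://{ISE_HOST}/admin/API/mnt/Session/ActiveList"

response = requests.get(
    url,
    auth=("api-user", "api-password"),
    headers={"Accept": "application/xml"},
    verify=False,
    timeout=30,
)
response.raise_for_status()

root = ElementTree.fromstring(response.text)
for session in root.iter("activeSession"):
    print(
        session.findtext("user_name"),
        session.findtext("calling_station_id"),
        session.findtext("framed_ip_address"),
    )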
Checkpoint Management Python SDK
Finally, I had to get data from our Checkpoint infrastructure, as it maintained a fairly up-to-date cross-reference between users and workstations, through the firewall logs. I realize this may seem strange, but if one has set up Identity Awareness for Checkpoint firewalls, then the checkpoint appliances get to find out which AD user-id is behind an ip address (no need to get into that right now, there are obviously more missing pieces in that puzzle I am not telling you about, just accept it as the truth).
To get to that I could use the Checkpoint management REST API (API reference), but I chose to use the corresponding SDK even if it’s not officially developed by Checkpoint: cp_mgmt_api_python_sdk.
I can’t walk you through how to activate the REST API on your Management Console, but that’s not so hard to find in the docs. It works from a certain version onwards, but the first version supporting it hasn’t been current for a while, so you should be fine. There is also a community where the checkpoint people are very nice and helpful, but not all of the members act the same way.. Here is also a blog post that describes how to get started, by Yuri Slobodyanyuk.
One thing to remember is that you first need to create a fingerprint.txt file on the first connection to the management server. You can also use Postman to explore the Checkpoint REST API, here is my postman collection.
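Here is a minimal sketch of connecting with that SDK, assuming API access is already enabled on the management server; the server name and credentials are placeholders, and the SDK stores the accepted fingerprint in a local file on the first connection.

from cpapi import APIClient, APIClientArgs

client_args = APIClientArgs(server="cp-mgmt.example.com")

with APIClient(client_args) as client:
    # On the first connection the server fingerprint is verified and
    # stored locally by the SDK; abort if it cannot be obtained.
    if client.check_fingerprint() is False:
        raise SystemExit("Could not get the management server fingerprint")

    login = client.login("api-user", "api-password")
    if not login.success:
        raise SystemExit(f"Login failed: {login.error_message}")

    # Any management API command can now be called, e.g. show-session.
    result = client.api_call("show-session", {})
    print(result.data)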
Checkpoint is not the only firewall vendor integrating APIs into their appliances. Palo Alto, Cisco, Fortigate all have API implementations (or SDKs). Or you could try a more direct approach to finding out which user is behind an ip address, going to the source (MS-AD), maybe looking at Kerberos tickets. However, whenever data is already available at one source, I rarely go through the trouble of finding another way to collect it directly, unless there is a specific reason. But that’s just me. You do you. You would have to get access to the AD servers, though, possibly affecting enterprise security in the process..
Individual Packages
There were some unique needs. I used TTP (Template Text Parser) for templating when extracting data back from text based files (mainly csv formatted). That was incredibly useful while developing the code bit by bit, where I was using intermediate files to store info (mac addresses, ip addresses, hostnames, etc).
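A quick illustration of the idea (the column layout here is made up for the example, not the project’s real intermediate file format):

from ttp import ttp

# Each line of the intermediate file holds one host record.
data = """
switch-01,Gi1/0/10,10,aabb.cc00.0100,10.10.10.21
switch-01,Gi1/0/11,20,aabb.cc00.0200,10.10.10.22
"""

template = "{{ switch }},{{ interface }},{{ vlan }},{{ mac }},{{ ip }}"

parser = ttp(data=data, template=template)
parser.parse()

for record in parser.result()[0][0]:
    print(record["switch"], record["interface"], record["vlan"], record["mac"], record["ip"])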
Again, in the middle stages of development, printing data on the screen was needed a lot, and doing it in table format or with colors was important, so I did use prettytables and rich.
Rich probably deserves a whole section by itself but I am not the one to do it. So here are:
For MS-Teams messaging, pymsteams is used.
Directory Data
An MS-AD user-id on its own is not that useful when trying to find out where the user sits or how to contact him/her. What you need is to look them up in a user directory maintained by HR or another service. In our case this was obviously in the HR database, but instead of going through the process of requesting access to it and setting up another part of the application to go get the data, I remembered a nice project a colleague had developed that was extracting the same data daily and feeding it into a directory app made for his section colleagues. I asked him and it turned out he was storing the exported data in a csv formatted text file, and I already knew how to manipulate that. So I asked for a recent copy of that directory data and got to work to be able to get the user’s full name, office number, section/department and phone number, all from the user-id.
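A simplified sketch of that lookup (the column names below are assumptions; adjust them to whatever the exported directory file actually uses):

import csv

def load_directory(path):
    """Map each user-id to the directory details we care about."""
    directory = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            directory[row["username"].lower()] = {
                "full_name": f'{row["first_name"]} {row["last_name"]}',
                "office": row["office"],
                "department": row["department"],
                "phone": row["phone"],
            }
    return directory

directory = load_directory("Fake_Employee_data.txt")
print(directory.get("jdoe", "user-id not found"))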
All those were the recipe ingredients. We needed to put them together, to create a consumable product.
Fine. How do I cook this thing?
Start the fire, make the pizza
Set the table
To prepare for each migration, we need to tag the switches involved. Let’s suppose we have used a tag called (arbitrary name) ‘migrate-from-legacy-access‘.
Warm up the oven
The next thing we want to do is run the collection process with this script:
python gathermovedata-args.py -s site-1 -t migrate-from-legacy-access -hf hosts_before.txt
This will launch the process. Once the arguments are parsed, the function gather_tagged_switch_data is called, carrying over the necessary parameters to the main part of the code. The different tasks that follow branch out from this main path.
Collect data before the migration
First, we call the function responsible for gathering mac data. This function first creates the dynamic Nornir inventory from Nautobot data, filtering only for the switches that are tagged (and only those that have the role ‘ac-access’, are ‘active’ and have a primary ip address defined – so that we pick only access switches where we can actually log in). Then a task is formed twice: once by filtering (using a Nornir filter) only for switches that support ssh transport, and a second time for switches that support only telnet transport, so that in each case we can set the correct connection parameters for the Nornir task.
The same set of commands (sub-task) is run for each host in both sets:
- show mac address-table
- show cdp neighbors
This is realized by following the multi-command path we explained both in part-1 and part-2 (functions carrying functions).
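For illustration, a bare-bones sketch of that pattern might look like the following; the function name is illustrative, not the project’s own, and ssh_hosts / telnet_hosts are the filtered subsets from the earlier Nornir sketch.

from nornir_netmiko.tasks import netmiko_send_command

def collect_access_data(task):
    # Parent task: runs each show command as a sub-task on the host.
    for command in ("show mac address-table", "show cdp neighbors"):
        task.run(
            task=netmiko_send_command,
            name=command,
            command_string=command,
            use_textfsm=True,  # structured output where a template exists
        )

ssh_results = ssh_hosts.run(task=collect_access_data)
telnet_results = telnet_hosts.run(task=collect_access_data)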
The results are all added to a list of host data, which for the moment contains only mac address, interface, port number, vlan, switch name, ip address and site name. The mac address data are filtered so we don’t collect mac addresses from neighboring switches.
We know, however, that this is not enough, because if we need to troubleshoot any of the connections later, we need more data. So to make it complete we need to go to the next step: ip addresses.
The same function now branches out to collect arp data from the distribution switches of the access network. The same pattern is followed, only this time a single-command task path is chosen instead of the multiple-command path that was used before: show ip arp.
Results are returned and then a new function is called to get back hostnames for all the ip addresses we got from the arp data. All the data are put together into a final list of host data, which we have the option to save to a file (one of the parameters we used at the start), like the results we showed in part-2.
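The hostname lookup itself can be as simple as a reverse DNS query per ip address; a small sketch:

import socket

def resolve_hostnames(ip_addresses):
    """Return a mapping of ip address to hostname (or the ip itself if no PTR record)."""
    hostnames = {}
    for ip in ip_addresses:
        try:
            hostnames[ip] = socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            hostnames[ip] = ip
    return hostnames

print(resolve_hostnames(["10.10.10.21", "10.10.10.22"]))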
Get to the users
Since this is the first time we are running this ahead of the migration, we probably need to find out who is who, meaning who the users behind this data are, so we can contact the departments involved. This is done using the following script, with the filename where we saved the data as a parameter:
python user_loc_functions.py -hf hosts_before.txt -df Fake_Employee_data.txt -tf last-24-hours
The other two parameters are the filename containing the employee directory as explained in part-2 and the log filter parameter for Checkpoint Management Logs view. That’s one of a few valid options for defining a custom log filter time window.
Collection and comparison after re-patching – still on the legacy network
We have already shown in part-2 what the results look like after the script is done. The code first loads up the hostnames list from the result we got earlier and queries the Checkpoint Management Console, using the Checkpoint Management API Python SDK, for the last 2 log entries for each hostname, and then gets the user-id correlated with that hostname through Identity Awareness. We dodge a few bullets there, as I explained earlier, since in some cases there are fields missing from the structure. If no entry is found, we move on to the next hostname, until the list is exhausted. The resulting list of user-ids is returned, and those are checked against the user directory, so that we can get back user first/last names, phones, sections/departments and office numbers. Finally, the unique list of departments is printed to std-out so we can contact them ahead of the migration.
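A hedged sketch of that per-hostname query, reusing an already logged-in client from the Checkpoint SDK example earlier; the exact show-logs query fields should be verified against the Management API reference for your version, and the user field name in the log records can vary.

def get_user_for_host(client, hostname, time_frame="last-24-hours"):
    """Return the Identity Awareness user-id seen for a hostname, if any."""
    result = client.api_call(
        "show-logs",
        {
            "new-query": {
                "filter": hostname,
                "time-frame": time_frame,
                "max-logs-per-request": 2,
            }
        },
    )
    if not result.success:
        return None
    for log in result.data.get("logs", []):
        # Identity Awareness adds a user field, but it is not always present.
        user = log.get("src_user_name") or log.get("user")
        if user:
            return user
    return None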
Once the re-patching is done, we have to check if we got everyone back. If the migration is done towards a new rack, maintaining the same legacy switches, then we use another launch script, meant to run a slightly modified version of the same method of operation.
python gethostsaftermove-args.py -s site-1 -t moved-to-new-rack -hf hosts_before.txt
This time, before running the same collection algorithm, we load up the previous data from the file where we stored them earlier. Once the collection is complete, we check against the previous data. Every possible scenario is covered for each host, and for every scenario we build a separate list (a condensed sketch of the comparison follows the list below):
- Hosts that are lost
- Hosts that are new (not recorded in the previous run but present now).
- Hosts that have been recovered but are now on a different switch in the rack:
- same vlan/subnet: These hosts should be fine but we can take note anyway
- different vlan/subnet: For these hosts either a change or re-configuration might be necessary
- Hosts that have been recovered on the same switch and same port (as they were before)
- Hosts that have been recovered on the same switch but a different port:
- same vlan/subnet: These hosts should be fine but we can take note anyway
- different vlan/subnet: For these hosts either a change or re-configuration might be necessary
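Here is that condensed sketch of the comparison, assuming each data set is a dict keyed by mac address with switch, port and vlan fields (the project’s real records carry more fields than that):

def compare_hosts(before, after):
    """Sort the hosts found after the move into the scenario lists above."""
    lists = {"lost": [], "new": [], "other_switch": [], "same_port": [], "other_port": []}
    for mac, old in before.items():
        current = after.get(mac)
        if current is None:
            lists["lost"].append(old)
        elif current["switch"] != old["switch"]:
            # Second element notes whether the vlan stayed the same.
            lists["other_switch"].append((current, current["vlan"] == old["vlan"]))
        elif current["port"] == old["port"]:
            lists["same_port"].append(current)
        else:
            lists["other_port"].append((current, current["vlan"] == old["vlan"]))
    lists["new"] = [host for mac, host in after.items() if mac not in before]
    return lists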
There are a few reasons why we bother with hosts that are recovered, on the same or a different vlan, but on a different port/switch than their original connection.
- Those recovered on the same vlan may have taken the place of other hosts, which in turn may be patched incorrectly. Moving them back where they were may open the way for the rest to be corrected as well. In some cases this was also important for the building cabling mapping. Let’s not get into that, let’s just say that there are cases where any change matters, even if the impact for networking is minor or negligible.
- Those recovered on a different vlan/subnet may get access to the network without problems. However, if their ip addresses were reserved on the previous subnet so that they can be used to grant access to secure resources, there could be serious impact. Those mistakes should be corrected to avoid problems down the line.
When the lists are complete, a list of messages is constructed so that the report can be made. Depending on the choices made by the launch script (debug/print/log etc) we send the report towards std-out and/or MS-Teams. For MS-Teams messaging we need to have configured an incoming webhook on a channel and use that with pymsteams to send our messages to. If you have access to Medium, here is a nice post to get you started with that.
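Sending the messages with pymsteams then takes only a few lines; a minimal sketch, assuming the channel’s incoming-webhook URL is already at hand:

import pymsteams

WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/..."  # placeholder

def send_report(title, messages):
    card = pymsteams.connectorcard(WEBHOOK_URL)
    card.title(title)
    card.text("<br>".join(messages))
    card.send()

send_report("Migration check - site-1", ["2 hosts lost", "45 hosts back on the same port"])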
Collection and comparison after re-patching – towards the SD-Access network
If the migration is made towards the new SD-Access network then, as we said in previous parts, all the data are already on ISE. Instead of running the algorithm again to do a collection from the switches, we can get the active sessions data from ISE and look for the hosts we are interested in. Before I explain the data format in the answer from ISE, let’s entertain the question: why are we not getting the data from the switches again?
We could. At least for the mac address info. Getting arp data would be a little more complicated in the overlay network architecture. But why bother and waste so much time, when we only have to do a single request towards the Cisco ISE REST API and get everything we need in one go?
python getisehostsaftermove.py -hf hosts_before.txt
The answer from ISE gives us the following info for each host:
- user_name: pretty obvious. If we get a user name, then we have 802.1x authentication unless it contains the host mac address which means the host was authenticated with MAB. If the username is the same as the hostname, then the host is in IDLE mode.
- calling_station_id: this is the host mac address which you can use to cross-reference with the original mac address list from the file.
- framed_ip_address: the host ip address, which you can use to determine which network the host has ended up on (for example the guest network).
By processing the result of the comparison, we build a few different lists that depict the different host states (a small sketch of this classification follows the list below):
- hosts lost
- hosts recovered and authenticated with 802.1x / MAB
- hosts recovered but in IDLE mode
- hosts that have ended up in the Guest network (during the migration this probably means that something went wrong)
- hosts without ip addresses, which probably need their port flapped to restart the authentication process, so that they finally get an ip address in the correct network.
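Here is that small sketch of how the ISE session attributes can be mapped to the states above; the guest subnet and the mac address normalization are assumptions for illustration only.

import ipaddress

GUEST_NET = ipaddress.ip_network("192.168.100.0/24")  # assumed guest subnet

def normalize(value):
    """Strip the usual mac address separators so formats can be compared."""
    return value.lower().replace("-", "").replace(":", "").replace(".", "")

def classify_session(session, hostname):
    """Map one ISE active-session record to a host state."""
    user = session.get("user_name") or ""
    mac = session.get("calling_station_id") or ""
    ip = session.get("framed_ip_address")
    if not ip:
        return "no-ip-address"   # candidate for a port flap
    if ipaddress.ip_address(ip) in GUEST_NET:
        return "guest-network"   # during a migration this usually means trouble
    if user.lower() == hostname.lower():
        return "idle"
    if normalize(user) == normalize(mac):
        return "mab"
    return "dot1x"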
Find lost users
Since we went to all the trouble of finding a way to recover user names and other necessary data from the hostnames of their computers, we can use some of the functions built there to recover the user data for the hosts that are lost. With that we can try to recover them (find the office and seat of the user, reboot their pc, etc), notify the cabling tech people that those users might be in a disconnected state so they can check, or at least get a list of who to anticipate calls from on Monday morning and perhaps contact them proactively.
By processing the lists and checking for the user info, we construct again a list of messages. That list can then be printed to std-out and/or transmitted towards MS-Teams.
By using code for all these checks, we can try to recover the users and run the checks again, as many times as we like, while we get the results on MS-Teams. That helped a lot and allowed me to be on the move when trying to recover from patching errors.
We did get very successful results for our migrations, hardly any persistent malfunctions for host connections on Monday mornings. Then the necessary changes would be done (changes in ip reservations, modifying settings for the printers on the print servers, etc) and we could prepare for the next migration. It was a great run and the project is almost over.
Lessons Learned
Self-training
- I did learn a lot about cutting code down into small, easily consumable pieces, easier to maintain and expand, better for readability.
- I took my first attempt at building tests.
- I used readily available data when possible to minimize development time and effort and save my strength for the long run, even if that meant I had to compromise in certain areas.
- I learned to stop and go for results instead of endlessly chasing an optimum design.
Object-Oriented design
In that regard, I struggled a lot with the question of whether I should use classes, with member functions and variables, instead of treating every function as a separate entity. I already explained how that would require a lot more time to get right, and I still had no full answers on how to separate functions and variables across classes. How many different classes would I have to make? What would be the exact data model? (Again, Ivan came to mind.) I decided to leave it for another time.
What are the next steps?
- Check the code again for mistakes (I don’t expect to find a lot, the code has been used so many times but I am sure I missed a few things with minimal impact).
- Complete parts I wanted to work on more
- Try to integrate the code with the Nautobot GUI by using the internal Nornir platform to run the code: launch it from a form as a job or an app. There are also a lot of things to figure out, like how to store data between steps if we need to do that.
On that last part, perhaps the new book about Nautobot, coming out in a few days, will help. The goal is to learn how to make Nautobot apps, turn the code into one or more python packages, and integrate jobs into Nautobot by grouping them as an app. That’s a really ambitious goal for me, considering my workload and how much work and study it will take. I guess we’ll see.
If you are reading this, thank you for your patience. I hope you enjoyed the read and are inspired to try things with parts of this architecture (like the nornir-nautobot plugin). If you have questions, contact me on twitter (now x).
Happy automating!!