I love reading technical post mortems from big-name organisations or experts in their respective fields. If you’re interested in reading some, here’s a list. They’re fantastic insights into some very complex and highly technical issues.
I investigated an odd issue recently involving a misconfigured switch which caused some very abnormal symptoms. I don’t think this qualifies as a legit post mortem article, as I don’t have the niche expertise in any particular field to produce one, nor do I have in-depth knowledge of the network within which the issue occurred, but it’s about as close as I’ll be able to get to one at this time.
Here’s what happened.
An issue was noticed by the first staff to arrive at their place of work, a nearby school, in the morning. They found that they were unable to access either the wired or the wireless network in either of their two buildings. The on-site tech arrived shortly after and began diagnosing the issue.
After a short while, most of the network came back up except for a random selection of devices in the first building, and the wireless in the second building. Connectivity seemed to be restored on its own as the on-site tech had not made any changes anywhere.
The on-site tech was unable to figure out the issue so logged a support call with their remote support organisation. Unfortunately they were unable to visit quickly, so due to some urgent deadlines that were fast approaching (including the school nativity!) the on-site tech called us up to see if we could help out at all. I had nothing immediately urgent on, and it sounded like a fairly simple issue to resolve (first mistake) so I made my way down there just before lunch (second mistake) to take a look.
The school is composed of two buildings. In the first building they have a small rack with the connection to the Internet and their router, two 48-port HP switches, one 8-port HP PoE switch, and a Ruckus wireless controller. In the second building there are a couple more 48-port HP switches and a second Ruckus controller.
Remember that they’re HP switches – when I talk about Trunking later I’m talking about HP Trunking, not Cisco Trunking! An HP Trunk (warning: FTP link to PDF file) is the name of a method of combining two or more interfaces (network ports) so that they work as if they were one – three 1Gb ports can be combined to provide a single end-point with essentially 3Gb of bandwidth, minus overhead. In Cisco terms, a Trunk is used for carrying multiple VLANs, typically when communicating between switches rather than to an end-point.
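For illustration, on a ProCurve-style CLI an HP Trunk is created by assigning a list of physical ports to a logical trunk interface – something along these lines (the port numbers here are just placeholders, not the ones from this story):

```
trunk 21-23 trk1 trunk    (aggregate ports 21-23 into logical port trk1)
show trunks               (verify the trunk group and its member ports)
```

Once configured, the switch treats `trk1` as a single logical port, so everything plugged into those member ports had better actually be one device (or one properly-teamed set of NICs) on the far end – which is exactly what goes wrong later in this story.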
Also in the second building, they have their single Hypervisor which hosts a DC and some file storage VMs. Outside of these stated technologies the only other type of equipment on the physical network are desktops or wireless APs. A pretty simple setup. I’ve slapped together a really basic diagram using draw.io to better visualise it:
Diagnosing the issue
The first step when dealing with a reported issue: Verify the symptoms for yourself.
There was a random selection of devices in Building 1 that just did not pick up any network connectivity. The devices knew they were connected to something but were unable to talk to anything on the other side of their NIC. Unidentified Network. No IP address handed out by DHCP. Setting a static address on the NIC didn’t change a thing, either.
Oddly though, if I plugged my laptop into the same network port on the wall as a non-functioning device, it worked perfectly.
Initially, I thought it could be a server issue, perhaps a DHCP or DNS (it’s always DNS) issue where a duplicate IP or something had caused some kind of conflict. I gained access to the server and checked it all out.
As far as I could tell, the Hypervisor and the DC VM were operating fine. There were no immediate issues in DHCP, nor were there any DNS issues either. Plus the rest of the network was operating fine. Well, it was now. Earlier there were issues, as I mentioned in the Symptoms section above…
Bouncing the machines did not fix the issue, so I then started thinking about the switches. Unfortunately the ports in the wall and in the cabinet were not labelled. I needed to find out which switch the port I had been looking at was connecting back to, so I plugged my device into the network port, then with the trusty USB to Serial cable I got on to the switches in Building 1. Luckily the switches were in the same room as the ICT suite!
To find the switch I was on I attempted to locate my device’s MAC address.
show mac-address <mac>
I could see that this switch knew about my mac address, and it was located on the 2nd switch in the cab. I moved my Serial cable down to that switch, logged on and discovered that my device was in port 14, which I confirmed by unplugging the cable on my machine and plugging it back in, watching the switch’s Link light go off then come back on. I plugged the computer that wasn’t working directly into this port just to confirm that it absolutely was a port with issues, as my laptop was working fine, but still no luck.
I then got a list of all mac addresses the switch knew about:
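On HP ProCurve switches this is the same command, just without a MAC argument:

```
show mac-address
```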
and it turned out that Switch 2 was correctly seeing the mac addresses of devices on each active port. I double checked a couple of mac addresses of machines that I knew weren’t working and was surprised to find that the switch could in fact see the mac addresses of those devices, too. I looked into the other mac addresses the switch was reporting and discovered that all of the computers that weren’t working were in this switch.
Interestingly, all of the known working computers were in Switch 1. Except for my machine, which was in Switch 2. Something was causing existing machines in Switch 2 to not work but new machines to work and route fine.
I checked out the config of Switch 2 as well as the logs, believing the issue to be isolated to this switch. Perhaps some kind of issue with the ARP table or something?
Although I am not an expert, I couldn’t see anything wrong. I did a reset on the switch as it was just running the default config, expecting that perhaps there was some kind of power fluctuation during the night that caused some kind of memory corruption. Not something I had seen before, but I had heard tales of funky behaviour from issues related to “dirty power”.
As I was booting the switch back up, the on-site tech mentioned some key information. The previous night, their remote support organisation (who were unavailable to assist at this time – this is what you get if you only pay for the lowest-tier cheapest support contract folks) had someone visit the site to, and I quote, “Make the servers load faster.”
Ding dong! Alarm bells. What could they have done?
I jumped back on to the server and after digging around found that there were two virtual NICs set up on Hyper-V Manager. This isn’t anything strange, but as I was describing what I was doing to the on-site tech I mentioned this, at which point they responded by saying that the support engineer mentioned “something about a team of Nicks”. A team of Nicks?
Ah. NIC Teaming. Okay.
I double checked the servers; they were working fine. No NIC teaming was in place on the Hypervisor, only one physical NIC was active, and there were three which were disabled. I double checked the Hyper-V Virtual Switch Manager, and that too seemed to be fine, two virtual switches set up on the same physical NIC. But swinging my head around the back of the server I could see two network cables running into the rear of the machine. Tracing these back up to the cabinet I found one was neatly organised and plugged into port 48 of Switch 1, and the second was roughly flung up under the door of the cab and into port 41 of Switch 1. This latter port had no activity.
I rolled my eyes. I then connected to Switch 1 in Building 2 to see if any Trunk ports had been set up in error. Nothing. That’s a surprise. Why go through all the work of running a cable but not properly configuring Trunking on the switch or Teaming on the OS?
Dead end? Not quite…
I was suspicious. This was a change that happened the evening before these odd symptoms came about. Sure the network was out entirely this morning but had mostly recovered on its own. Switches are smart, perhaps Spanning Tree or something figured out what was going on once the clients started loading up and connecting. There’s still an issue with Building 1, though. The issue must be there somewhere!
I had a hunch. I logged into Switch 1 in Building 1, dumped the running config:
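On these switches that’s simply:

```
show running-config
```

In the output, a static trunk shows up as a single line along the lines of `trunk 41,48 trk1 trunk`, which is easy to spot on a switch that is otherwise running a default config.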
and what do I find but a mystery Trunk port configured on ports 48 and 41! These are of course the two ports which just happen to match up to the two cables that run from Switch 1 in building 2 to the server. Unfortunately, port 48 is being used as the uplink into Switch 2 in Building 1, hence all the issues on that particular switch!
Now, I’m not a network engineer but I know how to find my way around a switch config. Setting up a Trunk across an uplink port to another switch and a random client device is going to cause problems. Here, the problems were as described in the Symptoms section above. Very strange behaviour.
Fixing the fault
The second I disabled the Trunk on ports 41 & 48 on Switch 1 in Building 1, all the issues went away. Both the clients on Switch 2 and the Wireless came back to life.
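A sketch of the fix, assuming a ProCurve-style CLI – removing the ports from the trunk group and saving the config:

```
configure
no trunk 41,48
write memory
```

As soon as the ports stop being treated as one logical aggregated link, port 48 goes back to behaving as a normal uplink and the switch relearns MAC addresses correctly.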
I can only guess that the support engineer on the previous evening tried to configure NIC teaming, set it all up, but accidentally configured the wrong switch. After they were unable to figure out why this didn’t work, they undid most of their changes, disabled some NICs on the server, changed the Hyper-V virtual switch to point at the remaining enabled physical NIC, then went on their merry way assuming that everything was peaches.
Clearly it was not peaches.
I don’t understand exactly why the symptoms experienced appeared as a result of this configuration change. I can understand why clients in Switch 2 wouldn’t be able to talk to DHCP on the DC or anything else, but I’m not sure why new devices would work fine, and why the wireless would be affected (it was plugged into an unrelated port on Switch 1).
Either way, the issue was resolved. Hopefully I’ll have the opportunity to recreate this at some point in the future and try to understand what went on here. Unfortunately, as it’s not my network, once the issue was resolved I was thanked and ushered out so I didn’t get too much time to look any further.
Although this isn’t my network I like to try and deal with problems permanently. I’m not sure you can really solve the issue of humans making mistakes, however there are some things I’m planning on doing here in the future, when I get some time to volunteer my services, to mitigate the chance of a similar mistake happening again.
The first step will be to sort out their switch configs. As they’re all pretty much defaults, they need to be password protected first. I’d also use this as an opportunity to both label the switch ports and the network cabs. This includes labelling important ports (Server, Wifi, etc) but also naming the switches. As the switches have the default names, it’s not easy to identify which switch you’re SSHing on to at a glance. If I made these changes, any future network engineer that wanted to make a change would be able to easily see that they’re on the right switch.
On top of that, the switches will need static IP addresses in a range outside of the existing DHCP scope. Whilst looking around to try and fix the issue I learned that everything bar the server itself is on DHCP. No reservations. Even the printers!
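On a ProCurve-style switch those hygiene changes boil down to a handful of commands – all the names and addresses below are made up for illustration:

```
configure
hostname "SW1-BLDG1"
password manager
vlan 1
   ip address 192.168.0.251 255.255.255.0
write memory
```

The `password manager` command prompts for the new admin password rather than taking it on the command line, and the static management IP should of course land in a range reserved outside the DHCP scope.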
I expect that these small, quick changes will likely help out long term, but I have no idea when I’ll be able to get in there and do this. Or even if they’ll let me in to do them.
As I already mentioned, it wasn’t my network, but there are always lessons to be learned from a problem you have experienced or tried to resolve. Here, primarily that you need to monitor and somewhat understand changes made to your network, even if they’re done by a trusted third party. On top of that, if a third party comes in after changes were made, you should probably communicate those changes to that third party (me, in this case) sooner rather than later, as it may help diagnose and resolve the issue sooner.
Secondly: documentation, documentation, documentation. The questions that I did get answered were answered from memory by the on-site tech. If the on-site tech wasn’t about I’d have started from a place of zero knowledge. It would have likely taken even longer to get the issue resolved, especially without access to any passwords. Alternatively, if these things were documented, I wouldn’t have needed to constantly ask the on-site tech questions. I should add that I don’t blame the on-site tech for any of this; they’re incredibly knowledgeable but short on resources, both time and money. Not uncommon.
The third and final lesson: CHECK YOUR CHANGES! It’s not always possible to test things in a development environment, especially in the SMB or education sector, but you should at least know the system or network state before changing behaviour, then make your change, then test to see if your intended changes have occurred and that, importantly, any unforeseen changes haven’t occurred.
The issue was resolved, everyone was happy and I got to finally eat my lunch just before heading home for the day.
I enjoyed writing this little post mortem and will likely do more in the future. Although my job is mostly management and policy stuff now, I still find myself diagnosing some really interesting or odd issues.