Over the past few years, I have been trying to train myself out of First Responder mode and into the habit of taking a step back and evaluating the situation when it comes to network outages. After being in the ISP field for so long, you get into a mentality of just fixing the problem. This can cause you to jump in and focus on putting out whatever the latest fire is. This mindset differs from the one you use when engineering a new Point of Presence (POP). I have spent the last 10ish days putting out a lot of fires. That brings us to my rookie mistake this morning.
A transport circuit went down this morning. I immediately jumped into “let’s get it fixed” mode. I checked my gear, called the provider, and opened a ticket. They called back a while later and said they wanted to reboot the switch where the circuit is handed off to us. Still in firefighter mode, I didn’t step back and think about the ramifications of rebooting that switch. I learned long ago that you can’t expect the provider to know about your network or even about the other services they are providing you. In this case, I forgot that the redundant circuit was on the same switch. It was still functional. By rebooting the switch, I took everything down for 3ish minutes.
Now, in a normal situation, I would have remembered this. But because I was focused on fixing a problem, I forgot to consider the ramifications of this reboot. At the very least, I should have discussed with the team whether we should wait until after hours or reboot now. If we had chosen to go ahead with the reboot, the call staff would have been more prepared. You always weigh the pros and cons. We would have been in trouble if we held off on the reboot and the redundant circuit went down. At the same time, rebooting a production switch during the day has its ramifications. The provider was waiting on the reboot as their first troubleshooting step. Not rebooting could have delayed the troubleshooting. It also would have helped to tell them about the redundant, WORKING circuit on the same switch. There is no right answer here except what is right for your network. My mistake was not having that discussion.
I will leave you with something to ponder. Will AI be able to alleviate some of these issues in the future? I think it can. Imagine typing in “I am going to reboot XXXX switch,” and the AI says, “This will cause Sites A, B, and C to go offline.” The AI can remember things and dependencies we might not be able to. It can be that super-smart employee that never forgets anything.
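You do not even need AI to get a rough version of that answer. Here is a minimal Python sketch of the kind of dependency lookup I have in mind, assuming you already keep a list of which circuits are handed off on which switch and which sites ride on them. Every name in it (the `Circuit` record, `handoff-sw1`, the site names) is made up for illustration and does not come from any real inventory tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    """Hypothetical inventory record for one transport circuit."""
    name: str
    switch: str    # switch where the provider hands the circuit off
    sites: tuple   # sites that rely on this circuit
    up: bool = True


def sites_offline_if_rebooted(switch, circuits):
    """Return the sites that would lose every working circuit."""
    affected = set()   # sites with a working circuit on the switch
    survivors = set()  # sites with a working circuit somewhere else
    for circuit in circuits:
        if not circuit.up:
            continue  # a circuit that is already down protects nobody
        if circuit.switch == switch:
            affected.update(circuit.sites)
        else:
            survivors.update(circuit.sites)
    return sorted(affected - survivors)


def reboot_warning(switch, circuits):
    """Build the warning you would want to see before approving a reboot."""
    sites = sites_offline_if_rebooted(switch, circuits)
    if not sites:
        return f"Rebooting {switch} should not take any site offline."
    return f"Rebooting {switch} will cause {', '.join(sites)} to go offline."


# Assumed example data mirroring this morning: the primary circuit is down,
# and the redundant circuit is still working on the same switch.
CIRCUITS = [
    Circuit("transport-primary", "handoff-sw1",
            ("Site A", "Site B", "Site C"), up=False),
    Circuit("transport-redundant", "handoff-sw1",
            ("Site A", "Site B", "Site C")),
    Circuit("transport-other", "handoff-sw2", ("Site D",)),
]

print(reboot_warning("handoff-sw1", CIRCUITS))
```

The hard part is not the lookup, it is keeping the inventory accurate. A tool like this is only as good as the dependencies someone remembered to record, which is exactly where an assistant that never forgets could help.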