Over the years I have been able to narrow down the most common reasons a service provider goes down or has an outage. This is by no means an exhaustive list. Let’s jump in.
Physical layer outages are the easiest to check and where you should always start. If you have had any kind of formal training, you have run across the OSI model. Fiber cuts, equipment failure, and power are all physical layer issues. I have seen too many engineers spend time looking at configs when they should first see whether the port is up or the device is on.
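As a rough illustration of "check the port first", here is a minimal Python sketch that asks a Linux box whether an interface is up before anyone opens a config. It assumes a Linux host that exposes link state under /sys/class/net; the interface names are hypothetical, and a router or switch would need its own CLI or SNMP instead.

```python
from pathlib import Path


def link_state(interface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel-reported state ('up', 'down', ...) or 'missing'.

    Assumes a Linux host; sysfs_root is a parameter only so the
    function can be pointed at a test directory.
    """
    state_file = Path(sysfs_root) / interface / "operstate"
    try:
        return state_file.read_text().strip()
    except OSError:
        # No such interface, or the file could not be read.
        return "missing"


if __name__ == "__main__":
    for name in ("eth0", "eth1"):  # hypothetical interface names
        print(f"{name}: {link_state(name)}")
```

If this reports "down" or "missing", the problem is probably a cable, an optic, or power, and no amount of staring at the config will fix it.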
DNS is what makes the transition from the human world to the machine world (cue the Matrix movie music). Without DNS we would not be able to translate www.j2sw.com into an IP address that web servers and routers understand. DNS resolution is what you are checking when you do something like:
PING j2sw.com (184.108.40.206): 56 data bytes
64 bytes from 220.127.116.11: icmp_seq=0 ttl=52 time=33.243 ms
64 bytes from 18.104.22.168: icmp_seq=1 ttl=52 time=32.445 ms
--- j2sw.com ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 32.445/32.844/33.243/0.399 ms
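The same name-to-address check can be scripted. This is a minimal sketch using Python's standard socket.getaddrinfo; the resolve helper and its lookup parameter are my own names for illustration, not part of any tool.

```python
import socket


def resolve(hostname: str, lookup=socket.getaddrinfo) -> list[str]:
    """Return the sorted unique IP addresses for hostname, or [] on failure.

    lookup defaults to the system resolver; it is a parameter only so
    a stand-in can be swapped in for testing.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # The name could not be resolved (bad name, or resolver unreachable).
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({entry[4][0] for entry in results})
```

Calling resolve("j2sw.com") returns a list of addresses when DNS is working. If it comes back empty but you can still ping a known IP address, the problem is likely name resolution and not reachability.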
Software bugs are almost always reproducible. The challenge is working out the conditions that reproduce them. Sometimes a memory leak only shows up on a certain day. Sometimes five different criteria have to be met for the bug to happen.
When two or more routers talk to each other, they talk best when they are on the same software version. A later version may fix an earlier bug, and code may change enough between versions that certain calls and processes behave slightly differently. This can cause incompatibilities between routers running different versions.
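One way to spot version drift is to group your routers by the version they report. The sketch below assumes you already have an inventory of router name to version string (from your monitoring system or documentation); the router names and version numbers are made up.

```python
from collections import defaultdict


def group_by_version(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Map each software version to the sorted list of routers running it."""
    groups = defaultdict(list)
    for router, version in inventory.items():
        groups[version].append(router)
    return {version: sorted(routers) for version, routers in groups.items()}


def has_mismatch(inventory: dict[str, str]) -> bool:
    """True when the inventory contains more than one distinct version."""
    return len(set(inventory.values())) > 1


# Hypothetical inventory: router name -> reported software version.
inventory = {
    "core-1": "6.48.6",
    "core-2": "6.48.6",
    "tower-7": "6.45.9",
}

if has_mismatch(inventory):
    for version, routers in sorted(group_by_version(inventory).items()):
        print(f"{version}: {', '.join(routers)}")
```

A mismatch is not automatically a problem, but it is worth knowing about before you start chasing a bug between two routers.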
Human error, or “fat fingering” as we typically call it, is another common cause. A 3 was typed instead of a 2. This is why good version control and backups you can diff against are a good thing. Physical mistakes count too, such as cables getting bumped because they were not secured properly.
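To show why diffable backups matter, here is a small sketch using Python's standard difflib to compare a backed-up config with the running one. The config lines are a made-up generic syntax, not any vendor's real format, and config_diff is just an illustrative helper name.

```python
import difflib


def config_diff(old: str, new: str, name: str = "router.cfg") -> str:
    """Return a unified diff between two config texts ('' if identical)."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{name} (backup)",
        tofile=f"{name} (running)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical configs: a 3 was typed instead of a 2.
backup = "interface ether1\n address 192.0.2.2/24\n"
running = "interface ether1\n address 192.0.2.3/24\n"

print(config_diff(backup, running))
```

The one-character typo shows up as a single removed line and a single added line, which is a lot faster than reading two configs side by side.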
What can we do to mitigate these issues?
1. Have good documentation. Know what is plugged in where, what it looks like, and as much detail as possible. You want your documentation to stand on its own. A person should be able to pick it up and follow it without calling someone.
2. Proactive monitoring. Knowing about problems before customers call is a huge deal. Being able to identify trends over time is also a good way to troubleshoot issues, and monitoring systems help you narrow down the problem right away.
3. When it comes to networking, know the OSI model. Start from the bottom and work your way up.
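The bottom-up approach in the last point can be sketched as an ordered list of checks that stops at the first failure. The layer labels and the placeholder lambdas below are my own illustration; in practice each check would be a real test such as link state, a ping to the gateway, or a DNS lookup.

```python
from typing import Callable, Optional

# A check is a label plus a function that returns True when that layer is healthy.
Check = tuple[str, Callable[[], bool]]


def first_failure(checks: list[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first one that fails.

    Returns None when every check passes. Later checks are skipped once
    a lower layer fails, since their results would not be trustworthy.
    """
    for label, check in checks:
        if not check():
            return label
    return None


# Placeholder checks, ordered from the bottom of the OSI model upward.
checks: list[Check] = [
    ("Layer 1: link is up", lambda: True),
    ("Layer 3: gateway answers ping", lambda: False),
    ("Layer 7: DNS resolves", lambda: True),
]

failed = first_failure(checks)
print(failed or "all checks passed")
```

With the placeholder values above, the run stops at the gateway check and never bothers with DNS, which is the point of working from the bottom up.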
Books can be, and are, written about troubleshooting. These are just a few of the common things I have seen.