Common ISP outage causes

Over the years I have narrowed down the most common reasons a service provider goes down or has an outage. This is by no means an exhaustive list. Let's jump in.

Layer 1 outages
Physical layer outages are the easiest to find and where you should always start. If you have had any kind of formal training, you have run across the OSI model. Fiber cuts, equipment failure, and power problems are all physical layer issues. I have seen too many engineers spend time looking at configs when they should first check whether the port is up or the device is even powered on.
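As a quick sketch of that bottom-up check on a Linux box, you can verify link state before ever opening a config (the loopback interface is used here only as a safe example; substitute your uplink interface):

```shell
# Quick Layer 1 sanity checks on Linux before touching any configs.

# Carrier/link state of a single interface: "up", "down",
# or "unknown" (loopback reports "unknown" on most systems):
cat /sys/class/net/lo/operstate

# One-line state summary for every interface; anything showing
# "state DOWN" is worth physically checking first:
ip -o link show
```

If the port is down here, no amount of config staring will bring the service back.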

DNS related
DNS is what makes the transition from the human world to the machine world (cue the Matrix movie music). Without DNS we would not be able to translate a hostname into an IP address that web servers and routers understand. DNS resolution problems are one of the things you are checking when you do something like:

PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=52 time=33.243 ms
64 bytes from icmp_seq=1 ttl=52 time=32.445 ms
--- ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 32.445/32.844/33.243/0.399 ms
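One way to separate "DNS is broken" from "the network is broken" is to test name resolution and raw-IP reachability independently. A minimal sketch, where the hostname and IP below are only examples:

```shell
# Ask the system resolver directly -- no packets to the target needed.
getent hosts localhost

# If name resolution fails but a ping to a raw IP still works,
# the outage is DNS, not connectivity:
#   ping -c 2 8.8.8.8          # reachability check, no DNS involved
#   dig +short example.com     # resolver check, if dig is installed
```

When the ping-by-IP works but the ping-by-name fails, you can skip straight to the resolvers.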

Software bugs
Software bugs are usually reproducible; the challenge is reproducing them. Sometimes a memory leak only shows up on a certain day. Sometimes five different criteria have to be met before the bug appears.

Version mismatches
When two or more routers talk to each other, they talk best when they are on the same software version. A later version may fix an earlier bug, and code may change enough between versions that certain calls and processes speak slightly differently. This can cause incompatibilities between software versions.

Human mistakes
“Fat fingering” is what we typically call this: a 3 was typed instead of a 2. This is why good version control, and backups you can diff against, are a good thing. Things such as cables getting bumped because they were not secured properly are also an issue.
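A habit that catches fat-finger mistakes quickly is diffing the backed-up config against what is actually running. A small sketch with made-up file names and contents:

```shell
# Sample "backup" and "running" configs (contents are invented):
printf 'interface vlan 2\n' > /tmp/config.backup
printf 'interface vlan 3\n' > /tmp/config.running

# A unified diff shows exactly which line was fat-fingered.
# diff exits non-zero when the files differ, so swallow that status:
diff -u /tmp/config.backup /tmp/config.running || true
```

The `-` and `+` lines in the output point straight at the typo, which beats re-reading a thousand-line config by eye.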

What can we do to mitigate these issues?
1. Have good documentation. Know what is plugged in where, what it looks like, and as much detail as possible. You want your documentation to stand on its own: a person should be able to pick it up and follow it without calling someone.
2. Proactive monitoring. Knowing about problems before customers call is a huge deal, and being able to identify trends over time is a good way to troubleshoot issues. Monitoring systems also let you narrow down the problem right away.
3. When it comes to networking, know the OSI model; start from the bottom and work your way up.

Books can be, and are, written about troubleshooting. These have just been a few of the common things I have seen.

LibreNMS syslog.ibd cleanup

Recently I ran into an issue where a LibreNMS install was taking up a crazy amount of disk space. This was tracked down to the syslog.ibd file. Even though I had set my options to keep less than 10 days of syslog per this link, I still had a huge file.

Here is what I did to fix it. My root partition was too full to start MariaDB, so I went into /var and cleaned out enough log files to make space to start it. The following are the commands I ran on CentOS 7 to fix the problem.

mysql -u username -p

Once at a MySQL prompt after logging in, I issued the following command to verify I could see my LibreNMS database.

show databases;

Then I issued

use dbname;

In my case it was “librenms”. Once connected to the database, I ran the following command:

DELETE FROM syslog WHERE timestamp < '2021-1-28 08:00:00';

This command removes all syslog entries from the database older than the specified date and time. In my case this was close to 40 GB.
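One caveat worth knowing: with InnoDB, a DELETE frees space inside syslog.ibd for reuse but does not shrink the file on disk. If innodb_file_per_table is enabled, rebuilding the table hands the space back to the operating system, along the lines of:

```sql
-- Rebuilds the table and, with innodb_file_per_table enabled,
-- returns the reclaimed space to the operating system.
-- Note: this locks/rebuilds the table, so run it off-peak.
OPTIMIZE TABLE syslog;
```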

After that I restarted MariaDB and all was good.

Network troubleshooting tools

Recently, there was a thread on the NANOG list asking about favorite network troubleshooting tools. I have taken many of these tools and created the following list.
- Simple ping, port, and dig commands
- BGP looking glasses
- Traceroute from various hosts on the net
- IPv6 tools (ping, traceroute, etc.)
- Various DNS tools
- Routing registry object explorers
- DNS and mail tools

Frequently Asked Questions On OTDRs And Hints On Their Use


OTDRs (optical time domain reflectometers) are valuable fiber optic testers when used properly, but improper use can be misleading and, in our experience, lead to expensive mistakes for the contractor. We have personally been involved in several instances where misapplication of OTDR testing cost the contractor as much as $100,000 in wasted time and materials. Needless to say, it's extremely important to understand how to use these instruments correctly.

A little reboot now and then..

Just a reminder that rebooting does help. My home network was experiencing slowness and lag; Xbox games were having issues, etc. I started pings to various sites and they all looked this way, even to the provider's DNS. I rebooted the CPE and all was well. Sometimes it's the simple things.

Before the reboot

After the reboot

MTR Traceroute and you

For those of you who don't know about MTR, it can be a very helpful diagnostic tool. MTR is a visual application that combines the functionality of traceroute and ping in a single network diagnostic tool.

If you are a Mac user like me, MTR is available through Homebrew.

An mtr run from my Comcast connection.
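For scripting or pasting results into a ticket, mtr also has a non-interactive report mode. A sketch, where the target hostname and cycle count are just example values:

```shell
# -r: report mode (run, print a summary, exit)
# -w: wide report (don't truncate long hostnames)
# -c 10: send ten probe cycles instead of running forever
if command -v mtr >/dev/null 2>&1; then
  mtr -rw -c 10 example.com
else
  echo "mtr not installed"
fi
```

The per-hop loss and latency columns in the report make it easy to see where along the path things go sideways.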

Windows Users


Troubleshooting a fiber optic link on Cisco switches

This content is for Patreon subscribers of the j2 blog. Please consider becoming a Patreon subscriber for as little as $1 a month. This helps to provide higher quality content, more podcasts, and other goodies on this blog.

Podcast: Quick troubleshooting for ISP networks

It has been a little bit, so I wanted to do a short talk about troubleshooting in ISP networks. I see too many folks waste a lot of time when they should be starting at the lower levels of the OSI model and working up.