Overnight, we had an outage lasting a few hours, caused by an error made by an engineer at our Internet Service Provider. They corrected it at about 3:35 AM Pacific time, and everything is working normally again. We deeply regret the problem and have taken steps to ensure it will not happen again. For those of you interested in the details, the following explains precisely what happened:
Our service is built from multiple, highly redundant, constantly monitored systems so that no single point of failure can cause a general outage. We use the same provider and the same system that Netflix uses to distribute movies all around the world. None of our servers or data distribution nodes had a problem.
However, there is still one single point of failure — human error.
Every Web site and Web service on the Internet uses something called DNS (the Domain Name System) to map the names you type (like http://www.seattleavionics.com) to the internal numbering system of the Internet, what’s called an IP (Internet Protocol) address. Registering a domain name is a paid-for service, and Internet Service Providers routinely handle it for their clients. The registration has to be renewed every year or two for a small fee, and the fee is generally paid automatically to ensure no outage. This was precisely the case here: the renewal fee for our domain name (SeattleAvionics.com) was automatically charged to our account before the name expired, as usual.
However, the system our ISP uses to feed the renewal into the global Internet DNS apparently failed for some reason, and the ISP did not notice. Because computers keep a memory (what’s called a cache) of IP addresses for recently visited Web sites, the renewal failure was not immediately apparent: computers were still using the IP address they had in memory. At some point late last night, those caches began to expire. When computers asked other computers who SeattleAvionics.com was, the other computers began to answer that they did not know. For all intents and purposes, all our systems became invisible, although they were still running fine.
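For readers who like to see the mechanism, here is a minimal sketch of why a cache can hide a missing DNS record for a while. It is a toy simulation, not real DNS software: the `CachingResolver` class, the name `example.com`, the address `192.0.2.10`, and the one-hour lifetime are all made up for illustration and are not the values involved in this outage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CachedAnswer:
    address: str
    expires_at: float  # seconds on the simulated clock


class CachingResolver:
    """Toy resolver that remembers answers until their time-to-live runs out."""

    def __init__(self, authoritative: Dict[str, str], ttl: float) -> None:
        self.authoritative = authoritative  # stands in for the global DNS
        self.ttl = ttl                      # hypothetical cache lifetime in seconds
        self.cache: Dict[str, CachedAnswer] = {}

    def resolve(self, name: str, now: float) -> Optional[str]:
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.address  # still fresh: upstream is never consulted
        self.cache.pop(name, None)  # entry is missing or expired
        address = self.authoritative.get(name)
        if address is not None:
            self.cache[name] = CachedAnswer(address, now + self.ttl)
        return address  # None models "I do not know that name"


def simulate() -> List[Optional[str]]:
    # 192.0.2.0/24 is an address range reserved for documentation examples.
    upstream = {"example.com": "192.0.2.10"}
    resolver = CachingResolver(upstream, ttl=3600)
    results = [resolver.resolve("example.com", now=0)]         # looked up and cached
    del upstream["example.com"]                                # record vanishes upstream
    results.append(resolver.resolve("example.com", now=1800))  # cache hides the problem
    results.append(resolver.resolve("example.com", now=3600))  # cache expired: unknown
    return results
```

Calling `simulate()` returns the address twice and then `None`: the second lookup succeeds only because of the cached copy, and the failure becomes visible once that copy expires.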
We have automated alerts and human monitoring that frequently check for problems, and they detected this one. We immediately contacted our ISP (it was very, very early in the morning where they’re located), and their emergency technicians determined the cause of the problem and corrected it manually. Due to the nature of the Internet, it then took a little time for all the computers, iPads, and iPhones that use our system to get the new connection information. At about 3:35 AM Pacific time, most devices would have been able to see and connect to SeattleAvionics.com again.
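As a rough illustration of the kind of automated check that can catch this class of problem, the sketch below asks the operating system to resolve a hostname and turns a failed lookup into an alert message. This is not our actual monitoring code; the function names `resolve_addresses` and `check_dns` and the message format are hypothetical.

```python
import socket
from typing import Callable, List


def resolve_addresses(hostname: str) -> List[str]:
    """Ask the operating system's resolver for the addresses behind a hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def check_dns(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve_addresses,
) -> str:
    """Return a one-line status string; a failed lookup is reported, not raised."""
    try:
        addresses = resolver(hostname)
    except socket.gaierror as exc:
        # gaierror is what getaddrinfo raises when a name cannot be resolved.
        return f"ALERT: {hostname} does not resolve ({exc})"
    if not addresses:
        return f"ALERT: {hostname} returned no addresses"
    return f"OK: {hostname} -> {', '.join(addresses)}"
```

A scheduler could call `check_dns("www.seattleavionics.com")` every few minutes and page someone whenever the result starts with `ALERT`. The `resolver` parameter exists only so the check can be exercised without touching the network.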