The following is a full incident report for the downtime experienced by clients on some servers on January 11th - January 12th 2015.

On Sunday January 11th 2015, at around 7.30pm GMT, the server that hosts the D9 Hosting websites, client area, and DNS zones for the ns* nameservers became unresponsive. This server did not house any of our clients' websites, only our own website and the client area (where you log in to pay your bills, submit support tickets, etc.).

We rebooted the server and connected via KVM to troubleshoot the issue. We found that a large portion of the RAID array had become corrupted, so we rebooted again and forced a filesystem check (fsck) in an attempt to repair the array.

It quickly became apparent that the RAID array was corrupted beyond the point of recovery in an acceptable time frame, so we initiated our disaster recovery plan for that server. This was to replace all drives in the server with new drives, then reinstall the Operating System and then restore data from a backup.

Whilst the disaster recovery was ongoing, the D9 Hosting websites and nameservers were switched over to a secondary backup server that we keep running for situations where the primary server is unavailable. This is where the issues arose.

After the switch, we found that the DNS zone for the ns* nameservers, as well as the DNS zone for our white-label server hostname domain (the URL starting with h and ending in hub!), contained incorrect (outdated) IP addresses for some servers on the backup server. As a result, clients with websites on these servers had issues connecting to their websites and email.

When we noticed the IP address inconsistencies, we corrected them on the backup server, but customers then had to wait for the correct IP addresses to be picked up by their local Internet Service Provider (ISP). In most cases this only took a couple of hours, but for some customers the IP address propagation took much longer. Unfortunately, we cannot control how long your ISP takes to update, and some ISPs are slower than others. This is why some of our clients felt that it took a long time before they could see their websites again.

Once we had completed the installation of new hard drives and the Operating System on the server with the failed RAID array, we restored all of our data from a backup that was taken approximately 30 minutes before the outage, so only 30 minutes' worth of data was lost. Once we had verified that everything was back up and running correctly on the primary server, we switched everything back over to it (D9 Hosting website, client area and DNS zones) on Monday January 12th at around 7.00pm GMT.

To ensure the inconsistent IP address issue on the backup server doesn't recur, we have added an extra step to our IP address provisioning documentation: any new nameserver IP addresses added to the DNS zone on the primary server must also be added to the backup server, and the two are then cross-checked to confirm that the same IP addresses appear on both servers.
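
For readers who run their own primary and backup DNS servers, the cross-check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling we use: the function name diff_zone_records, the example.com hostnames and the IP addresses (taken from the ranges reserved for documentation) are all hypothetical, and each zone is assumed to have already been exported to a simple hostname-to-IP mapping.

```python
def diff_zone_records(primary, backup):
    """Return the records whose IP address differs between two zones.

    Both arguments are plain dicts mapping a hostname to an IP address
    (an assumed export format, not a real zone file parser). The result
    maps each mismatched hostname to a (primary_ip, backup_ip) tuple;
    None means the record is missing on that server.
    """
    mismatches = {}
    for name in sorted(set(primary) | set(backup)):
        primary_ip = primary.get(name)
        backup_ip = backup.get(name)
        if primary_ip != backup_ip:
            mismatches[name] = (primary_ip, backup_ip)
    return mismatches


# Hypothetical data: ns2 is outdated on the backup, ns3 was never added to it.
primary_zone = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.20",
    "ns3.example.com": "192.0.2.30",
}
backup_zone = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "198.51.100.20",
}

for name, (primary_ip, backup_ip) in diff_zone_records(primary_zone, backup_zone).items():
    print(f"{name}: primary={primary_ip} backup={backup_ip}")
```

An empty result means both servers hold the same addresses. Any entry in the result is a record that would serve a wrong or missing address if traffic were switched to the backup server.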

Tuesday, January 13, 2015
