Bullseye Server Incident Report - 22nd February 2015
- Friday, 27th February, 2015
- 09:56am
The following is a full incident report for the downtime experienced by clients on the bullseye.d9hosting.com server from the 22nd to the 26th February. All times mentioned in this report are EST.
At 19:48 on the 22nd February the server became unresponsive and we rebooted it in an attempt to bring it back online. After three reboots the server still would not boot into the OS, so we escalated the issue to the datacenter.
At 21:41 the datacenter was able to boot the server into maintenance mode. We could then see that the problem was with the /var partition and that a manual FSCK (filesystem check) would be needed to attempt a repair. At 22:29 we proceeded with the manual FSCK.
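For readers who are curious what such a check involves, the sketch below shows roughly how a forced, non-interactive filesystem check of an unmounted /var partition might be driven. The device name /dev/sdb1 is purely hypothetical, and the real check was of course run directly from the maintenance environment rather than through a script.

```python
import subprocess

# Hypothetical device holding /var; the real device name on the server differed.
VAR_DEVICE = "/dev/sdb1"

# Force a full check (-f) and answer "yes" to all repair prompts (-y).
# The partition must be unmounted, e.g. when booted into maintenance mode.
result = subprocess.run(
    ["fsck", "-f", "-y", VAR_DEVICE],
    capture_output=True,
    text=True,
)

# fsck exit codes: 0 = no errors, 1 = errors corrected,
# 4 or higher = errors were left uncorrected.
print(result.returncode)
print(result.stdout)
```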
After numerous manual FSCKs we were still unable to bring the corrupted filesystem back online, so at 23:43 we proceeded to reinstall the server to get it back online. The plan was then to reattach the original drives after the reinstallation and attempt to retrieve the data from them.
At 00:29 the server had been reinstalled and a RAID rebuild started. Because the array is a large RAID 10 array spanning multiple drives, the rebuild took some time to complete, eventually finishing at 13:20 on the 23rd February.
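As a rough illustration of how rebuild progress can be followed, the sketch below polls /proc/mdstat, which assumes a Linux software RAID (md) array. A hardware RAID controller would instead report rebuild progress through its own management utility, so treat the details here as hypothetical.

```python
import time

def rebuild_in_progress() -> bool:
    # While a rebuild or resync is running, /proc/mdstat shows a
    # "recovery =" or "resync =" progress line with a percentage.
    with open("/proc/mdstat") as f:
        status = f.read()
    return "recovery" in status or "resync" in status

# Check every five minutes until the array has finished rebuilding.
while rebuild_in_progress():
    time.sleep(300)
print("RAID rebuild finished")
```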
After the RAID rebuild had completed we were still unable to retrieve any data from the corrupt filesystem, despite numerous attempts using different methods. At 01:32 on the 24th February we therefore took the decision to invoke our disaster recovery plan: swap out all of the drives, replace them with new ones, and restore all accounts from our most recent backup.
At 11:48 we began syncing all data from our backup server and started the account restorations in alphabetical order. Due to the amount of data that needed to be restored this was a lengthy process, but all account restorations were completed just before midnight on the 26th February.
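For illustration only, the restore process looked broadly like the sketch below: pull the account archives from the backup server, then restore them one by one in alphabetical order. The hostname, paths, and the use of cPanel-style cpmove archives with /scripts/restorepkg are assumptions for the example and are not details taken from this report.

```python
import glob
import os
import subprocess

# Hypothetical backup host and paths, for illustration only.
BACKUP_HOST = "backup.example.com"
BACKUP_PATH = "/backup/weekly/accounts/"
LOCAL_STAGING = "/home/restore/"

# Pull the account backup archives from the backup server.
subprocess.run(
    ["rsync", "-a", f"{BACKUP_HOST}:{BACKUP_PATH}", LOCAL_STAGING],
    check=True,
)

# Restore each account archive in alphabetical order (assuming
# cPanel-style cpmove-USER.tar.gz archives and /scripts/restorepkg).
for archive in sorted(glob.glob(os.path.join(LOCAL_STAGING, "cpmove-*.tar.gz"))):
    subprocess.run(["/scripts/restorepkg", archive], check=True)
```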
We sincerely apologise for the extended downtime caused by the RAID failure. Although all of our servers run RAID 10 arrays (one of the most redundant RAID levels available), failures like this, while very rare (this was the first in 7 years!), can and do happen. Thankfully, because our disaster recovery plan worked correctly, we were able to recover with minimal data loss for clients.