At 5:30am on Thursday, a hard drive on our webserver malfunctioned. This drive was part of a redundant drive array that normally protects against this kind of failure: if one drive goes down, the other takes over immediately without missing a beat. Unfortunately, that did not happen. The second drive went offline instead of taking over, and that is what caused the server to go down.
The on-site team immediately swapped the bad drive and set the server to start rebuilding the drive array. This process usually takes anywhere from 2 to 6 hours, but because of file system errors caused by the bad drive, it took well over 24 hours to complete. In the meantime, a restore from backup was started on a new server. Whichever one finished first would be the one that went live.
The restore from backup also went much slower than anticipated. After a few hours of little progress, the team began manually restoring account settings onto the server so that mail would start queuing instead of being bounced back. You may have noticed this on Friday or Saturday if your password was rejected when you tried to retrieve your mail. Some of our sites started coming online as early as 7:30am on Friday, but most did not come back until early Saturday morning.
As of 7:30am Saturday morning, all sites and email are back online. If you are still having issues with your site, email, or settings, please let me know.
We apologize for the inconvenience. Please rest assured that we are already working on a better disaster recovery plan so that, in the case of a major failure, we can get things back online in a matter of hours, not days. All hardware components are being checked and rechecked, and new software to speed up the transfer of backup data is already in the works.
As changes to our support structure are made, I will be posting information about them here. Check back to see how we are working hard to keep you up and running around the clock.
-Shane Flynn
IT Administrator
Posted by Shane Flynn