We feel it’s important to be open, honest and transparent about our service. In the rare event of a systems failure we try to keep our users updated via our status page. The status page is most useful for short, real-time updates as an incident unfolds.
This post serves a different purpose: a more detailed overview of a fault we experienced this week. On September 30th 2015 we experienced a series of unrelated errors with our usually robust Mercury email hosting platform. Mercury is our next-generation email system, built and maintained 100% in-house and based on our 15+ years of managing email services, servers and third-party email platforms.
Hardware always fails given enough time; there’s no way around it. A hard disk may simply break or a processor may burn out. Mercury is designed to have no Single Point Of Failure (SPOF). This means everything in the array has at least N+1 redundancy, and in many cases much more. Should a failure occur, the system has automatic failover capability to protect users from the fault. Indeed, during 2015 there have been plenty of behind-the-scenes failures that have not impacted service to any detectable degree (at least for the end user). The system failed over as designed, and our engineers worked quietly in the background to replace the failed part unnoticed.
Like the old adage about waiting for a bus, after months of seamless service we sadly had 48 hours of unexpected faults this week. Things started on the morning of September 30th, when an engineer at our data center noticed that the ‘head’ system of the storage array had failed. The head is the master control system for the whole storage array. It does have N+1 redundancy and is also monitored by an external service provider, who is paid to proactively correct any fault in either head unit. That monitoring had failed, and the whole system had to be taken offline for a quick 10-minute fix. At this time we’re not sure why the external monitoring failed, and we have already raised this with the company in question. Suffice it to say, we expect a fix here very quickly.
While reviewing this fault we detected another failed drive in the array, this time an SSD used to cache user access patterns. While the cache does not affect Mercury’s ability to deliver email, it does boost performance substantially. This leads us to our next issue: the secondary caching drive had not been enabled, so with this fault the whole array lost its caching ability. This was a SPOF in the setup that we had not detected until now. The specific SAS SSD required to replace the failed drive is not a common part and could not be sourced locally at short notice, even in central Manchester. We contacted our storage hardware vendors, who promised next-day delivery, with installation planned for first thing the following morning. While this was a less than ideal solution, nothing was down. Email was still being received and delivered to users, albeit with delays.
Come October 1st, we detected a third and final fault, this time in the CPU of one of the mailhosts in the array. The host had locked up in such a way that it still reported to our monitoring services that everything was fine. Everything was not fine. The mailhost was actively receiving email but, due to the CPU error, not relaying it to users, resulting in a backlog of email. This was spotted around noon on October 1st and fixed immediately. We noted the nature of the fault and have now implemented a new monitoring rule to detect any recurrence.
Under normal circumstances this CPU error would have been detected almost immediately, simply by our support team responding to inbound customer queries. However, with the earlier caching disk failure already noted and on record, the symptoms of the CPU error were chalked up to the caching fault and were ultimately missed.
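To illustrate the kind of monitoring rule that catches a host which is accepting mail but not relaying it, here is a minimal sketch in Python. It is not the rule we actually deployed: the counter names (`received_total`, `delivered_total`), the thresholds and the `MailhostSample` type are all hypothetical, chosen only to show the idea of comparing inbound and outbound activity rather than trusting a host's own "I'm fine" report.

```python
from dataclasses import dataclass


@dataclass
class MailhostSample:
    """One reading of a mailhost's counters (hypothetical metric names)."""
    timestamp: float      # seconds since epoch when the sample was taken
    received_total: int   # messages accepted by the host so far
    delivered_total: int  # messages relayed on to user mailboxes so far


def is_relay_stalled(older, newer, min_received=50, min_delivery_ratio=0.1):
    """Return True when a host keeps accepting mail but has stopped relaying it.

    Compares two samples from the same host. The thresholds are
    illustrative assumptions, not tuned production values.
    """
    received = newer.received_total - older.received_total
    delivered = newer.delivered_total - older.delivered_total
    if received < min_received:
        # Too little inbound traffic in the window to draw a conclusion.
        return False
    # Plenty of mail came in, but almost none went out: likely a stalled relay.
    return delivered < received * min_delivery_ratio


# Example: 400 messages received in five minutes, only 2 delivered.
before = MailhostSample(timestamp=0, received_total=1000, delivered_total=990)
after = MailhostSample(timestamp=300, received_total=1400, delivered_total=992)
print(is_relay_stalled(before, after))
```

The point of a check like this is that it measures what the host is actually doing, so a machine that is locked up but still answering health probes would still trip the alert.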
Finally, the SAS SSD cache disk was delivered (sadly with some delay due to issues with the shipping vendor) and installed around 4PM on October 1st. The caching was re-initialized and immediately began to improve performance for users. As the system needs to re-analyse user email access patterns to build an effective cache, we estimate performance will be restored to full speed by around 4PM on October 2nd.
I should add that during this event, no email was bounced, lost or otherwise mishandled. The system largely performed as expected, although we did have a window of 24-48 hours (varying per user) in which email would have been delayed in reaching the end user's mailbox. This window of slowed performance was also the first customer-impacting event to occur on Mercury in 2015. To be clear, web services are completely unrelated to the Mercury platform and were not affected.
So what’s the takeaway from all of this?
- We’re immediately working with our storage vendors to understand why their monitoring of our array failed, and why the proactive failover of the head unit did not occur.
- We’re assessing why a secondary caching disk was not ready to take the strain in the event of failure. While disk failure is rare, this is a SPOF and we will be looking to eliminate this ASAP.
- We have implemented more robust checks against individual mail host CPUs to ensure that quasi-failed systems are now better detected and fixed.
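One way to catch a hidden SPOF like the disabled secondary caching drive is a periodic audit that counts how many members of each role are both enabled and healthy. The sketch below is illustrative only: the inventory layout, role names and device names are hypothetical and do not describe Mercury's real configuration.

```python
def find_single_points_of_failure(inventory, min_active=2):
    """Return the roles with fewer than min_active enabled, healthy members.

    inventory maps a role name to a list of member dicts with
    "name", "enabled" and "healthy" keys (a hypothetical layout).
    """
    spofs = []
    for role, members in inventory.items():
        active = [m for m in members if m["enabled"] and m["healthy"]]
        if len(active) < min_active:
            spofs.append(role)
    return sorted(spofs)


# Hypothetical inventory: the standby cache drive exists but was never enabled.
inventory = {
    "storage_head": [
        {"name": "head-a", "enabled": True, "healthy": True},
        {"name": "head-b", "enabled": True, "healthy": True},
    ],
    "cache_ssd": [
        {"name": "cache-a", "enabled": True, "healthy": True},
        {"name": "cache-b", "enabled": False, "healthy": True},
    ],
}

print(find_single_points_of_failure(inventory))
```

With N+1 redundancy as the target, `min_active=2` flags any role that would be left with nothing if its one working member failed.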