Mercury email platform fault review

If you’ve experienced a blip with your Mercury hosted email of late, we want to take a moment to go over recent events: what happened, why, and what we’re doing to address the larger points moving forward.

First up, Mercury itself. Many users have queried the redundancy of the system. Architecturally, Mercury is one of our most complex and stable platforms. We know how ultra important email is to people’s websites and businesses, so the system is designed from the ground up for maximum fault tolerance. This means each component in the system is replicated many times so that if one part fails, another seamlessly takes over.

Each part of the mail process is broken out into a separate micro service, and in turn replicated multiple times. Mail authentication, spam filtering, webmail processing, SMTP processing and IMAP/POP processing all operate as completely independent micro services, each in turn powered by many different virtual machines. We often see one of these virtual machines completely fail, with no impact to the end user experience.

The same goes for the storage of email on Mercury. At any given time we host double digit terabytes of email data; this means we rely on quite specialized storage appliances, leagues apart from the disk setup you might recognize in your laptop or PC. Disks are added to the device in pairs, and data is written to the storage device in such a manner that a disk in the pair can fail with no impact. The system also uses hot spares – live blank disks in the storage appliance that detect failures, and immediately copy data to themselves in the event of a disk pair failure. Data is also striped across the larger array of disks, for even greater fault tolerance: the device can sustain multiple simultaneous disk failures with no impact on performance. No one disk stores your email. Finally, we also run two of these devices, to separate email storage across two large pools.

So what went wrong recently, and why the service degradation? On March 23rd, one of our storage pools crossed a magic line of sorts. The device’s storage utilization crept above the 80% mark (with 20% of the disk space still free), and our service was impacted. The storage devices utilize ZFS – a specialized file system for large scale storage applications. Before the 80% mark, ZFS simply writes data to the end of the free disk space contiguously. Above 80%, ZFS switches to a different model, and starts to calculate where pockets of free space might exist on the disk – and instead writes to those empty spots on the disk. This dramatically increases load on the system due to the overhead of processing, and is the event we faced on the 23rd.
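To give a feel for why that switch is costly, here is a toy sketch in Python. It is not how ZFS is implemented; the block list and the `find_gap` function are hypothetical and exist purely for illustration. Appending at a known tail position needs no searching at all, whereas hunting for a pocket of free space means inspecting blocks until a large enough gap turns up, or until the whole disk has been scanned.

```python
from typing import List, Optional, Tuple


def find_gap(used: List[bool], size: int) -> Tuple[Optional[int], int]:
    """Scan a toy disk for the first run of `size` free blocks.

    `used[i]` is True when block i holds data.
    Returns (start index of the gap, or None; number of blocks inspected).
    """
    run = 0  # length of the current run of free blocks
    for i, is_used in enumerate(used):
        run = 0 if is_used else run + 1
        if run == size:
            return i - size + 1, i + 1
    return None, len(used)


# Toy disk with all free space at the tail: an append-style writer that
# remembers the tail position would not need to scan at all.
tail_free = [True] * 8 + [False] * 2

# Toy disk with free space scattered in one-block pockets: a two-block
# write forces a scan of every block and still finds nowhere to land.
fragmented = [True, False] * 5

for disk in (tail_free, fragmented):
    start, inspected = find_gap(disk, 2)
    print(f"start={start} inspected={inspected}")
```

The point of the toy is only that the cost of a write starts to depend on how the free space is laid out, not just on how much of it remains.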

Under normal circumstances we simply add more storage long before this ever becomes an issue. The storage appliances allow us to deploy new storage at any time without taking the system offline. Leading up to the event of the 23rd, however, an internal monitoring system had failed, and ultimately we did not spot the magic line being crossed. As soon as we did, we worked with our providers to dispatch new disks directly to us. Given the unique nature of the disks, none were readily available in our immediate area, and as such they had to be express couriered across the country. This took some time, but we had the disks installed by the evening of the 23rd and services restored to full operation thereafter.
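For readers curious what a utilization check of this kind might look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our actual monitoring code: the `PoolSample` type, the 70% early-warning level and the 15 minute staleness window are all hypothetical. The one idea worth highlighting is that a reading which has gone stale is reported as a problem in its own right, so a monitor that quietly stops reporting cannot hide a filling pool.

```python
from dataclasses import dataclass

WARN_PCT = 70.0         # hypothetical early-warning level, for illustration
CRITICAL_PCT = 80.0     # the utilization mark discussed in this post
MAX_SAMPLE_AGE_S = 900  # hypothetical: readings older than 15 minutes are stale


@dataclass
class PoolSample:
    name: str
    used_bytes: int
    total_bytes: int
    taken_at: float  # Unix timestamp of when the reading was taken


def utilization_pct(sample: PoolSample) -> float:
    """Percentage of the pool currently in use."""
    return 100.0 * sample.used_bytes / sample.total_bytes


def classify(sample: PoolSample, now: float) -> str:
    """Return 'stale', 'critical', 'warning' or 'ok' for one reading."""
    # Check freshness first: old data says nothing about the pool today.
    if now - sample.taken_at > MAX_SAMPLE_AGE_S:
        return "stale"
    pct = utilization_pct(sample)
    if pct >= CRITICAL_PCT:
        return "critical"
    if pct >= WARN_PCT:
        return "warning"
    return "ok"
```

A caller would feed in fresh readings on a schedule and alert on anything other than "ok"; how the readings are collected from the appliance is deliberately left out of this sketch.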

The impact of the ZFS based slow down meant that access became intermittent for some users. The Mercury email system as a whole becomes incrementally overloaded when it cannot access the back end storage quickly enough. The main impact of this is connection timeouts for users accessing the system. It does not, however, mean we bounce email: the system is still receiving email just fine and writing it to disk, albeit slowly.

Following the 23rd we thought we had the main issue resolved, but we ran into additional issues with the storage device a little over a week later. A small number of users started to report connection issues again. Again we saw increasing load on the system and made a number of configuration tweaks that we believed would help the situation. As we went through Friday the 31st, load on the system did come down; access improved and was largely faultless through the weekend of April 1st and 2nd.

However, come the afternoon of Monday April 3rd, with the increased demand for email (Monday is one of the busiest times for email access), the storage system again began to overload. Much like the event of the 23rd, users started to receive connection errors again.

At this point the system was well under the 80% storage utilization mark following the recent hardware additions, and despite multiple configuration tweaks we were struggling to reduce the load on the system. As our remaining avenues to reduce load began to dwindle, we started to liaise with the vendors of the storage appliance on what options might be open to us.

While awaiting their response and review of the situation we continued to make best effort improvements to the system to reduce load on the affected pool of disks. We started to look at specific mailboxes and users across the system – those with the biggest usage, and how we could work with those accounts to perhaps lessen overall load. Strikingly, we came across a mailbox with some 2 million emails stored. Emails are simply stored as individual small files on the disk. Theoretically, 2 million files in one directory is way under the ZFS file limits, by orders of magnitude in fact, but still we wanted to reduce load on the affected pool.
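As a rough sketch of how such a survey could be done, the Python below counts files under each mailbox directory and lists the largest. It assumes a hypothetical layout in which every mailbox is a directory directly beneath a single mail root, with one file per message somewhere inside it; the function names and that layout are illustrative assumptions, not a description of Mercury’s internals.

```python
import os
from typing import List, Tuple


def count_files(path: str) -> int:
    """Count every regular file beneath `path`, including subfolders."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(path):
        total += len(filenames)
    return total


def largest_mailboxes(mail_root: str, top_n: int = 10) -> List[Tuple[str, int]]:
    """Return (mailbox name, file count) pairs, largest first."""
    counts = {}
    with os.scandir(mail_root) as entries:
        for entry in entries:
            # Each directory directly under the root is treated as a mailbox.
            if entry.is_dir(follow_symlinks=False):
                counts[entry.name] = count_files(entry.path)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

On a storage pool that is already struggling, a walk like this adds read load of its own, so it would be sensible to run it off-peak or against a sample of accounts.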

The sheer number of files meant removing 2 million of them was an overnight process, but load was immediately improved as the process completed. In the interim we’d also advised users with particularly large inboxes to similarly reduce their inbox numbers; when accessing by webmail, for example, each email is read by the system before loading in the browser. The overall work started to really make an impact on performance, and by the early hours of Tuesday April 4th we started to see dramatically improved disk performance and access. Most users were now back on their feet.

Two final compounding events were masked by the overall storage issues. When combating such large events, smaller items can be hard to spot among the sheer volume of support load we receive. In short order then: a new hack had been making its presence felt on certain accounts, and spam was leaving a subset of accounts in large volume; this caused some SMTP blacklisting and mail delivery delays. We also saw some corruption of Dovecot files – users could access their webmail, but would find an iOS device stubbornly refusing to connect. These simply require a quick configuration rebuild at the user level to fix and, again, result in no email loss, just access issues.

So what next? Despite the wobbles, Mercury remains a competent and capable system. Some of the limitations of our ZFS implementation were admittedly new to us, but with the knowledge in hand, we’re taking steps to move forward as follows:

* Significant new disk capacity has been deployed to the storage arrays, to help combat any capacity constraints for the immediate future.

* Monitoring of the device’s storage utilization has already been revamped and new processes implemented. We’re also in the process of adding in even more fail safes to the monitoring tools and process.

* We will be reviewing usage of certain user accounts and will shortly be deploying new limits on usage of the platform. Nothing here will impact any standard usage of Mercury, but it will prevent extreme edge cases from causing deleterious events.

* A third pool of disk storage is being planned.

* We also have a Ceph platform internally that has recently moved from alpha to beta status. While it is not quite ready for prime time usage, we’re evaluating Ceph for our long term storage needs and are presently very optimistic about its capabilities.

We’d like to thank everyone affected for their patience and understanding. We know email is critical; we use Mercury ourselves and rely on it just as heavily. When these events happen we work around the clock to mitigate them as best we can.

If you have further thoughts or queries on the event or your affected services, please do send an email marked FAO me personally – Stuart; I’ll be happy to discuss all of the above in more detail directly with you.