Given some of the recent instabilities on our Mercury email platform, we wanted to take a moment to go into some detail on what’s been happening of late. Certainly it’s been one of the top queries we’ve been fielding during the past few weeks. To start with, we wanted to answer the number one question, “Why isn’t there enough capacity? Why haven’t you added more servers?” Which means we need to start with…
How exactly is Mercury built?
Mercury was built a number of years ago to address some of the fundamental issues we encountered in our earlier shared hosting days. Single servers would host email for a discrete number of clients. In instances of hardware or software failure, the impact would be severe. Emails would bounce the moment a failure occurred.
Mercury was implemented to avoid this scenario using a highly redundant and fault-tolerant design. At its simplest, you can imagine it as follows:
Mercury is a cluster of servers and services, tens and tens of physical servers that work on different tasks: authentication, serving email, filtering spam and viruses, delivering outbound email – and the most complex part of all – storing email.
The design of Mercury is such that a failure of any one part of the cluster does not impact clients. Unlike the old days of single-server shared email hosting, a component of Mercury can fail without a user ever noticing. This actually happens pretty regularly given the size of the platform, and users never bat an eyelid. Things fail over seamlessly and we work quietly in the background, replacing failed servers quickly and discreetly.
The clustered implementation also means we can scale up any one service in response to demand and load, at any time. Need more IMAP processing servers? Not a problem, just drop them into the cluster and hey presto: more capacity. This was never possible using the older shared hosting approach, where a system running at capacity needed to be replaced entirely, leading to major disruption.
O.k. that’s great, so what’s with all the faults?
Since its inception, Mercury has relied on a very complex and sophisticated disk array (provided by Nexenta) to store emails for users. The system uses two physical collections of disks (internally we call these tanks), separated into different machines. Each machine houses dozens of physical hard disks. The data is then stored across both machines and across multiple disks in each. The idea is that a failure anywhere in the setup has no effect on clients; again, disks have failed with no client-facing impact. We’ve simply replaced them and life has gone on. Also, just for good measure, the whole storage is duplicated a third time as a backup.
In total we currently manage around 80TB of primary email storage (not counting the 40TB of triplicated backups). That array of disks is in turn managed by two physically separated head controllers: two physical servers that oversee all of that data storage. Again, should one fail, the other is a live standby.
The problem we’ve been facing – repeatedly – is with this storage system. Let’s look at that.
The issues to date…
The first issue encountered with the storage hardware was related to the ZFS file system; performance of the disk clusters can degrade when overall capacity exceeds 70%. If you recall faults from 2017, this is what you were experiencing. Due to a failure of internal reporting, we were caught flat-footed as the disks crept above 70% usage, impacting performance. This particular problem was resolved by layer upon layer of reporting and monitoring – we even now have a big screen in our offices reporting live Mercury usage data for everyone to see in real time.
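To make the 70% rule concrete, here is a minimal sketch of the kind of check such monitoring might run. It is an illustration only, not our actual monitoring code: the `PoolUsage` type, the `pools_over_threshold` function, the pool names and the usage figures are all invented for the example. Only the 70% threshold comes from the paragraph above.

```python
from dataclasses import dataclass

# The 70% figure quoted above; everything else in this sketch is hypothetical.
WARN_THRESHOLD = 0.70


@dataclass
class PoolUsage:
    """Usage snapshot for one storage pool (illustrative type, not a real API)."""
    name: str
    used_tb: float
    total_tb: float

    @property
    def fraction_used(self) -> float:
        return self.used_tb / self.total_tb


def pools_over_threshold(pools, threshold=WARN_THRESHOLD):
    """Return the names of pools whose usage is at or above the threshold."""
    return [p.name for p in pools if p.fraction_used >= threshold]


# Made-up figures purely for illustration.
example_pools = [
    PoolUsage("pool-a", used_tb=30.0, total_tb=40.0),  # 75% used
    PoolUsage("pool-b", used_tb=20.0, total_tb=40.0),  # 50% used
]
print(pools_over_threshold(example_pools))
```

In practice the interesting part is not the comparison itself but making sure the alert reaches someone before the threshold is crossed, which is where our reporting originally fell short.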
Moving forward a little, to mid-2018, and most recently the last couple of months – a troubling recurrence of a seemingly similar problem has been rearing its head. A fault that can appear at random, and persist for anything from a few minutes through to a full day in some extreme cases.
It’s hard to overstate how many hours have gone into understanding this, both from our own team and from that of our supplier, Nexenta. Several theories arose, each with an attempted fix, but all pointing to the same net effect: a cascade failure of sorts. That is to say, a smaller issue triggers a load increase; as load increases, services slow down, in turn causing more load as user demand builds. The cascading effect multiplies and multiplies until the system can’t easily recover, taking time to catch up with the pent-up load.
A basic analogy would be a traffic jam. The capacity is there, but some event is causing an extreme compression of the load on that capacity. Using the same analogy it doesn’t really matter if the road was five or ten lanes wide, the capacity would always be overwhelmed by the compression of load. The real issue is preventing the accident in the first place.
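For the curious, the feedback loop can be sketched as a toy simulation. This is a deliberately simplified model under invented assumptions, not a description of how Mercury actually behaves: every number below is made up, and `simulate_backlog` is a hypothetical function written for this post. Each tick, a fixed number of new requests arrive, waiting requests generate extra retry traffic in proportion to the backlog, and the system serves up to its capacity, except during a short stall at the start when it serves nothing.

```python
def simulate_backlog(capacity, arrivals, retry_rate, stall_ticks, total_ticks):
    """Toy model of a cascade: return the backlog after each tick.

    capacity    -- requests served per tick once the stall is over
    arrivals    -- new requests arriving per tick
    retry_rate  -- extra requests generated per waiting request per tick
    stall_ticks -- number of initial ticks in which nothing is served
    """
    backlog = 0.0
    history = []
    for tick in range(total_ticks):
        served = 0.0 if tick < stall_ticks else capacity
        retries = retry_rate * backlog  # impatient users and clients retrying
        backlog = max(0.0, backlog + arrivals + retries - served)
        history.append(backlog)
    return history


def recovers(history):
    """True if the backlog has fully drained by the end of the run."""
    return history[-1] == 0.0


# All figures are invented for illustration.
for capacity, stall in [(100, 1), (100, 3), (200, 3), (200, 10)]:
    run = simulate_backlog(capacity, arrivals=80, retry_rate=0.1,
                           stall_ticks=stall, total_ticks=60)
    print(f"capacity={capacity} stall={stall} recovers={recovers(run)}")
```

In this toy model a one-tick stall drains away on its own, while a three-tick stall at the same capacity never recovers. Doubling the capacity absorbs the three-tick stall, but a ten-tick stall still tips it over. Extra capacity moves the tipping point; it does not remove the trigger.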
Suffice to say, a vast number of causes for this triggering event have been explored and remedies deployed, time and again. At any rate, the decision was taken in 2018 to draw a final line under the issue.
One solution to rule them all
Since the autumn of 2018 we’ve been testing non-Nexenta-based storage. This new system (tank3) was built in-house and has been in testing for many months; we had tried to replicate some of the sophisticated redundancy ourselves. Naturally we didn’t want to make matters worse for clients by moving them from a mostly working system to one with further unknown challenges, so we’d spent significant time and energy on detailed testing. This was still ongoing during March’s most recent Mercury event.
Given the unusually prolonged event, we took the drastic step of pressing a completely new system into immediate use. This tank4 storage device was built around RAID-organized SSDs, with a mantra of “Keep It Simple, Stupid” in mind. The vastly simplified storage architecture meant we could keep testing to a minimum, and get real-world results ASAP.
Tank4 has now been in successful use for several weeks. We’re happy with the performance, and if you’ve purchased a new account in recent weeks, your data is stored here. In the background we’re also moving data from tank2 (Nexenta) over to tank4. We hope that by reducing strain on the existing Nexenta storage, we can in turn reduce the likelihood of the recurring load event. Moreover, in time we hope to decommission the Nexenta storage entirely as data is migrated to tank4 and to a series of additional SSD-based alternatives as capacity is required.
One more thing
We’re currently also working on offering our clients a Gmail-based solution. Our status as a potential partner for Google’s platform is presently being evaluated, but if all goes to plan, we’ll be offering an additional email service backed by Google’s massive global footprint.
Our plan is to offer the G Suite business package from within our control panel using just a single click. That means you can port your existing email over to Gmail without any technical know-how – we will do the rest. You’ll be able to keep using any existing IMAP/POP devices as you do now, and you’ll be free to use Gmail apps/sites too if you like.