Assessment of the February 9th network outage

In keeping with our policy of open and honest communication, I wanted to take a moment to go over the most recent major issue we experienced – the network outage on Sunday, February 9th.

First of all, our apologies for the disruption many of our clients experienced. We work hard to avoid any interruption to service, but on this occasion we encountered a complex networking fault that touched all of our systems. The fault resulted in generalised packet loss of around 10-15% for a subset of our users: only certain VLANs on our network were affected. Those on the affected VLANs saw anything from degraded performance to completely inaccessible service.
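For the technically curious: a loss figure like 10-15% comes from simple probing – send a batch of probes and count the replies. A minimal sketch of the arithmetic (a hypothetical helper, not our actual monitoring stack):

```python
def loss_percentage(sent: int, received: int) -> float:
    """Return packet loss as a percentage of probes sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent

# e.g. 100 probes sent, 88 replies received -> 12.0% loss,
# squarely in the 10-15% range affected users saw
print(loss_percentage(100, 88))
```

In practice a tool such as ping or mtr does the probing; the percentage it reports is computed exactly this way.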

The issue came to light around 11.30 a.m., and our team worked through the day and into the small hours of the morning to assess, identify and ultimately rectify the problem. We also engaged the third-party networking consultants we keep on retainer, to bolster our ability to work through a problem at pace. Access improved through the evening, with many affected users reporting success as the night wore on.

Ultimately we identified a problem with IPv4 BGP routing via one of our three transit providers. That part of our network was isolated and disabled, and after this correction full service was confirmed restored to all users by around 9.30 a.m. on February 10th.
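Conceptually, isolating a transit provider means dropping the routes learned from that provider's BGP session so traffic flows via the remaining transits instead. A toy illustration of the idea (provider names and prefixes are placeholders, using documentation address ranges, not our real routing table):

```python
# Toy routing table: prefixes learned per transit provider
# (hypothetical names; prefixes from RFC 5737 documentation ranges)
routes = {
    "transit-a": ["203.0.113.0/24", "198.51.100.0/24"],
    "transit-b": ["192.0.2.0/24", "198.51.100.0/24"],
    "transit-c": ["203.0.113.0/24"],
}

def isolate_provider(table: dict, provider: str) -> dict:
    """Drop every route learned from a misbehaving provider,
    keeping the paths learned via the remaining transits."""
    return {p: prefixes for p, prefixes in table.items() if p != provider}

healthy = isolate_provider(routes, "transit-a")
# traffic to 198.51.100.0/24 now flows via transit-b instead
```

On real routers this is done by shutting down or filtering the BGP session, but the effect on the routing table is the same as above.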

In the end we discovered that two simultaneous, independent failures had occurred, compounding one another and slowing the resolution on the day.

First, two different SFP transceiver modules failed. These are small pluggable interfaces that interconnect devices on our network, such as switches and routers. The failed modules were initially physically routed around before ultimately being replaced in full with new units (from a different batch).

The second fault was a failed network interface card (NIC) in one of our core routers. Extensive research revealed that this was in fact a known problem with certain combinations of NIC firmware and Linux kernel drivers, which can cause network traffic to halt or become unreliable after some time. For now, a reboot of the equipment has temporarily resolved matters, and we will shortly be upgrading the NIC firmware to a known-good version so that we shouldn't encounter similar issues again.
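On Linux, the installed driver and firmware versions can be read with `ethtool -i <interface>`, and a fleet-wide audit boils down to comparing that output against the known-bad versions. A small sketch of such a check (the version strings here are placeholders, not the actual affected releases):

```python
# Placeholder list of affected firmware versions - illustrative only
KNOWN_BAD_FIRMWARE = {"1.2.3", "1.2.4"}

def parse_ethtool_info(output: str) -> dict:
    """Parse the 'key: value' lines produced by `ethtool -i`."""
    info = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

def firmware_needs_upgrade(output: str) -> bool:
    """True if the reported firmware version is on the bad list."""
    return parse_ethtool_info(output).get("firmware-version") in KNOWN_BAD_FIRMWARE

# Sample output as `ethtool -i` might print it (hypothetical driver name)
sample = "driver: exampledrv\nversion: 5.4.0\nfirmware-version: 1.2.3\n"
```

A check like this, run regularly, flags affected hardware before the fault manifests rather than after.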

Suffice it to say, with almost twenty years of hosting under our belts, we've encountered a *lot* of faults. The cold hard truth of the matter is that things break. When something fails, we've usually got a fast fix up our sleeves. Events like February 9th are rare, but sadly they can still happen. When they do, we work around the clock to understand and mitigate them as quickly as we can.

To echo a previous post of mine: make sure you always have an offline copy of your data, just in case. And remember to follow us on your social platform of choice – we're keen to stay in touch on days like February 9th and keep you in the loop: