When we have any large-scale issue, we like to give users as much detail as possible once it has been resolved. Today we experienced one of those dreaded outages, and we wanted to break down what happened and why.
At about 11:45am BST today, one of our core routers rebooted. This isn’t unprecedented, but it is impactful – particularly as all our office connectivity goes through this specific piece of equipment by default.
When the router went down, the magic that is BGP kicked in to avoid any serious disruption. Our network uses multiple routers in a multi-homed BGP configuration, so any single failure should have little to no impact on clients: the remaining routers recalculate the various network routes and take over the networking load.
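For the curious, here's a minimal sketch of that failover behaviour in Python – not our actual router configuration, and the router names and preference values are invented for illustration. Each router advertises paths with a preference (akin to BGP local-pref); traffic follows the most-preferred router still advertising, and when one withdraws its routes, selection falls back to the rest.

```python
# Hypothetical routers and path preferences (higher = preferred),
# loosely modelling BGP local-pref in a multi-homed setup.
routers = {"core-1": 200, "core-2": 100, "core-3": 100}


def best_path(available):
    """Return the most-preferred router that is still advertising routes."""
    candidates = [r for r in routers if r in available]
    if not candidates:
        return None
    return max(candidates, key=lambda r: routers[r])


available = set(routers)
print(best_path(available))   # core-1 carries traffic normally

available.discard("core-1")   # core-1 reboots and withdraws its routes
print(best_path(available))   # traffic fails over to a remaining router
```

The key property is that failover needs no manual intervention: route selection is simply re-run over whatever paths remain.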
The issue started to compound, though, as the failed router came back up, meaning the now-recalculated network paths switched back again. Then it went down again, so routes moved again. Then it came back up… This went on for about 35 minutes before we asked engineers at Equinix (our datacentre operators in Manchester) to pull the plug on the faulty piece of equipment. This had the intended effect, and customer services (i.e. sites and e-mail) came back up, with all traffic going via functioning systems, approximately 40 minutes after the outage began.
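This up-down-up pattern is known as route flapping, and BGP has a standard countermeasure called route-flap damping (RFC 2439). A rough sketch of the idea, with invented threshold values for illustration: each flap adds a penalty, the penalty decays exponentially over time, and a route whose penalty crosses a threshold is suppressed until the penalty decays back below a reuse limit.

```python
# A simplified model of BGP route-flap damping (RFC 2439).
# The constants here are illustrative, not real router defaults.
PENALTY_PER_FLAP = 1000
SUPPRESS_LIMIT = 2000   # suppress the route above this penalty
REUSE_LIMIT = 750       # un-suppress once penalty decays below this
HALF_LIFE = 900.0       # penalty halves every 15 minutes


class DampedRoute:
    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def flap(self, now):
        """Record a flap: decay the penalty, then add to it."""
        self._decay(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def usable(self, now):
        """Is the route currently accepted into the routing table?"""
        self._decay(now)
        if self.suppressed and self.penalty <= REUSE_LIMIT:
            self.suppressed = False
        return not self.suppressed

    def _decay(self, now):
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE)
        self.last_update = now


route = DampedRoute()
for t in (0, 60, 120):        # three flaps in two minutes
    route.flap(t)
print(route.usable(130))      # False: repeated flaps suppress the route
print(route.usable(7200))     # True: penalty has decayed, route reusable
```

With damping in place, a bouncing router stops dragging the rest of the network through repeated route recalculations – the effect we achieved manually by having the faulty unit powered off.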
Service was restored for customers at that point, but we still had the unfortunate situation of our office being without internet access, and therefore the company being without phone service (we run a VoIP-based system out of our office). Our team decamped to work from home, whilst a few of our engineers remained to restore office services.
After further investigation, we identified and corrected a software bug on the faulty router before putting it back into the general networking mix. Phone support was turned back on at about 15:10.
We did detail the event over at status.34sp.com, but as some users noted, that system itself became unavailable for a time. While we run the status site off-network, the load generated during the outage caused it to fail as well. We will be addressing this immediately, improving the off-site system's capacity to cope with future events.