
Network issues overview May 5th

When we have any large-scale issue, we like to give users as much detail as possible after it has been resolved. Today we experienced one of those dreaded outages, and we want to break down what happened and why.

At about 11:45am BST today, one of our core routers rebooted. This isn’t unprecedented but is impactful – particularly as all our office connectivity goes through this specific piece of equipment by default.

When the router went down, the magic that is BGP kicked in to avoid any serious disruption. Our network uses multiple routers in a multi-homed BGP configuration so that any single failure should have little to no impact on clients. The remaining routers recalculated the various network routes and took over the networking load.
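To illustrate the failover behaviour described above, here is a minimal sketch of best-path selection in a multi-homed setup. The router names and preference values are made up for illustration and are not our actual configuration:

```python
# Hypothetical multi-homed setup: three routers, each advertising the same
# routes with a different local preference. Traffic follows the live router
# with the highest preference; when it fails, the next one takes over.

routers = {
    "core-1": {"up": True, "local_pref": 200},  # default path (the router that rebooted)
    "core-2": {"up": True, "local_pref": 150},
    "core-3": {"up": True, "local_pref": 100},
}

def best_path(routers):
    """Pick the available router with the highest local preference."""
    live = {name: r for name, r in routers.items() if r["up"]}
    if not live:
        return None  # no path at all: total outage
    return max(live, key=lambda name: live[name]["local_pref"])

print(best_path(routers))          # core-1 carries traffic normally
routers["core-1"]["up"] = False    # core-1 reboots...
print(best_path(routers))          # ...and core-2 takes over automatically
```

This is why a single clean failure should be nearly invisible to clients: the backup path is selected automatically as soon as the failed router's routes are withdrawn.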

The issue compounded, though, when the failed router came back up, causing the recalculated network paths to switch back again. Then it went down again, so routes moved again. Then it came back up… This flapping continued for about 35 minutes before we asked engineers at Equinix (our datacentre operators in Manchester) to pull the plug on the faulty piece of equipment. This had the intended effect, and customer services (i.e. sites/email) came back up with all traffic going via functioning systems approximately 40 minutes after the outage began.
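This kind of up/down oscillation is exactly what BGP route-flap dampening (RFC 2439) is designed to suppress automatically: each flap adds a penalty to the route, and once the penalty crosses a threshold the route is held down until the penalty decays. The threshold and half-life values below are illustrative, not the values used on any particular network:

```python
# Minimal sketch of the RFC 2439 penalty model. Each withdraw/re-announce
# cycle adds FLAP_PENALTY; above SUPPRESS_LIMIT the route is suppressed,
# and the penalty halves every HALF_LIFE_MINUTES of stability until it
# falls below REUSE_LIMIT and the route is advertised again.

FLAP_PENALTY = 1000
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
HALF_LIFE_MINUTES = 15

class DampenedRoute:
    def __init__(self):
        self.penalty = 0.0
        self.suppressed = False

    def flap(self):
        """A withdraw/re-announce cycle adds penalty."""
        self.penalty += FLAP_PENALTY
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def decay(self, minutes):
        """Penalty halves every HALF_LIFE_MINUTES of stability."""
        self.penalty *= 0.5 ** (minutes / HALF_LIFE_MINUTES)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

route = DampenedRoute()
route.flap()                 # first flap: penalty 1000, still advertised
route.flap()                 # second flap: penalty 2000, route suppressed
print(route.suppressed)      # True
route.decay(30)              # 30 quiet minutes: penalty decays to 500
print(route.suppressed)      # False: route advertised again
```

Pulling the plug on the faulty router achieved the same end by hand: the flapping route was withdrawn permanently so the remaining paths could stabilise.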

Service was restored for customers at that point, but we still had the unfortunate situation of our office being without internet access, which also left the company without phone service (we run a VoIP-based system out of our office). Most of the team decamped to work from home, while a few of our engineers remained to restore office services.

After further investigation, we identified and corrected a software bug on the faulty router before putting it back into the general networking mix. Phone support was turned back on at about 15:10.

We did detail the event over at status.34sp.com, but as some users noted, that system briefly became unavailable itself. While we do run the status site off-network, the load generated by the outage caused it to fail as well. We will be looking to address this immediately and improve the resilience of the off-site system in the face of future events.

2 Comments


  1. Olly Sampson
    Friday May 6th, 2016

    Still no out and out apology though! I had customers in the middle of the working day screaming at me down my working customer service line, making us look like complete idiots as we could do nothing. This is simply not acceptable and surely you would have tested the scenario whereby a router drops and comes back up briefly and drops again! This is not isolated as we had exactly the same problem last week on Thursday the 28th April at 13.50 BST till 14.40 BST when all services went down completely then causing more calls and stress. Surely this was the warning before this week. And prior to both of these events you had a “cyber attack” and they went down. Once is concerning, twice is fully alarming and three times is unacceptable. Not even a sorry!

    • Olly, this incident was not related to any prior events. It was an isolated fault that had not been demonstrated in prior testing; you’re correct that these systems should be tested, and indeed they are. This was a very esoteric and unique fault that we believe is now fixed. Ultimately, clients should have been affected for up to about 30 minutes, meaning services would still be within SLA parameters over the month for most users.

      Regarding the other issues you’ve seen, I will make sure we reach out privately to get to the bottom of what happened in those cases.