Why is my website down? Investigating downtime step by step

Hello. My site was down for 20 minutes yesterday, what happened?

Whilst we pride ourselves on our uptime commitments, from time to time we get messages like the one above from customers, and when we do we need to investigate what happened.

But first, how do people know if there has been downtime?

Normally either they or another person has experienced issues viewing the site, they’ve tried to log in and it didn’t work, or a tool has told them there is an issue.

External monitoring, sometimes called uptime monitoring, is where a service sends an HTTP request to your server every few minutes and warns you if it hasn’t had the correct response. These tools can be really useful for determining when a site is down.

Reports coming from the site’s owner or someone notifying them have the advantage that the person reporting the issue might have access to more information, like the type of error and what they were doing at the time.

What is a website being down, anyway?

That’s actually a much harder question to answer. From a monitoring solution’s perspective, the answer appears simple: if the server doesn’t respond with an HTTP status code 200 (a code sent in the response headers, alongside any HTML, to specifically say “I’m ok”) then it’s not ok and a problem is occurring.

If the server responds with an HTTP status code of 200 but then encounters an error and produces a WordPress white screen of death or similar, then for an end user the site is down but for the external monitoring it’s up.

Likewise, an issue with an SSL certificate or malware warnings might mean browsers fail to connect when an external monitoring service will happily report everything as fine.
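To illustrate the gap between "returned 200" and "looks fine to a visitor", here is a minimal sketch of a stricter check. The function name looks_up and the list of error phrases are assumptions made for this example, not something any particular monitoring service is known to use.

```python
# Assumption: phrases that suggest a broken page even when the status is 200.
# The list is illustrative, not exhaustive.
ERROR_MARKERS = (
    "there has been a critical error on this website",
    "error establishing a database connection",
)


def looks_up(status: int, body: str, expected_text: str = "") -> bool:
    """Return True only if the response would look healthy to a human visitor."""
    if status != 200:
        return False
    text = body.strip().lower()
    if not text:
        # An empty body with a 200 is the classic white screen of death.
        return False
    if any(marker in text for marker in ERROR_MARKERS):
        return False
    # Optionally require a phrase that only appears on a working page.
    return expected_text.lower() in text
```

Many uptime monitors offer something similar as a keyword check. It does not cover certificate or malware warnings, which happen before or outside the page content.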

What happens when we get a “why was my site down?” message?

If we get a support ticket asking why a site was down for a specific period we have several things we can investigate. We will often ask a client to confirm the time (and timezone) and why they believe it was down.

If the client says that it was down for them we will also ask for their IP address so we can see if there was something specifically affecting them.

With that information, our first port of call is our status page which is where we post information that affects multiple customers. It will let us know if we were having network issues around that time or something specific to the container or platform.

Our next step is to look at two internal tools: notes on the account, and internal alerts for the container. If a member of staff needed to do something on the container it will have been noted, and we can also see whether our internal monitoring picked up any issues with a given service within the container. Where possible these systems are automated so that they correct any issues without the delay of waiting for human intervention.

With the global bits out of the way, we will look to see if the IP address the client gave us, or the one used by the monitoring tool, has been blocked. We use a range of tools to block bad bots trying to brute-force logins or run malicious content, and sometimes we inadvertently block a user who just forgot their password and tried to log in unsuccessfully many times. If we find a block which is still in effect, we can quickly and easily remove it.

Next we visit the statistics/logs folder for the site in question, where we are interested in three files: the Nginx access log for the period in question, the Nginx error log, and the php-error log. Our first stop is the access log, where we can quickly navigate to the log entries around the time of reported downtime. We are looking for a large jump in timestamps within the access log, which would indicate that requests weren’t hitting Nginx at all, potentially putting the source of any issue outside the hosting account.
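If you want to look for the same kind of jump in your own access log, something like the following can help. It is a minimal sketch that assumes Nginx's default "combined" log format; the function name find_gaps and the five-minute threshold are choices made for this example.

```python
import re
from datetime import datetime, timedelta

# Matches the [time_local] field of Nginx's default "combined" log format,
# e.g. [10/Oct/2023:13:55:36 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (before, after) pairs where consecutive entries are further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines that don't look like access log entries
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open file works, for example find_gaps(open("access.log")). A sensible threshold depends on how busy the site is: a quiet site may naturally go several minutes between requests.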

Normally we find that there is some activity, so the next step is to check to see if it’s mostly 200 status codes. This would indicate that the issue wasn’t with the server or application but rather an issue specific to the user or monitoring. We can also use a tool like GoAccess to compare the hour of the downtime with the same period the previous day and with the hours around it to get a feel for whether site errors have increased or decreased.
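A rough version of that hour-by-hour comparison can be done with a short script. This is a sketch only, assuming Nginx's default "combined" log format; status_classes_by_hour and error_rate are names invented for this example.

```python
import re
from collections import Counter

# Assumption: Nginx's default "combined" format. Captures the date and hour
# (e.g. 10/Oct/2023:13) and the three-digit status code after the request.
ENTRY = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}):\d{2}:\d{2} [^\]]+\] "[^"]*" (\d{3}) ')


def status_classes_by_hour(lines):
    """Map each hour to a Counter of status classes such as '2xx' or '5xx'."""
    hours = {}
    for line in lines:
        match = ENTRY.search(line)
        if not match:
            continue  # skip lines that don't match the expected format
        hour, status = match.groups()
        hours.setdefault(hour, Counter())[status[0] + "xx"] += 1
    return hours


def error_rate(counter):
    """Share of requests in one hour that returned a 5xx status."""
    total = sum(counter.values())
    return counter["5xx"] / total if total else 0.0
```

Comparing error_rate for the reported hour against the same hour the day before gives a quick feel for whether server errors were unusually high.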

For an increase in HTTP 500 errors, our next place to look is the PHP error log. While the default page for a 500 error says Internal Server Error, a 500 error is nearly always an application error. On our WordPress Hosting this would mean either a plugin or theme is causing the issue. If the issue is 502 or 504, this tends to indicate an issue with the database or, on our WordPress Hosting, potentially Redis. If we are seeing 502 or 504 for the period in question we will start investigating further as to the cause.
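When the PHP error log is long, grouping fatal errors by the plugin or theme named in the file path can point to a likely culprit. The sketch below assumes the usual PHP log wording ("PHP Fatal error") and the standard WordPress wp-content layout; fatal_errors_by_component is a name made up for this example.

```python
import re
from collections import Counter

# Matches the plugin or theme directory in a file path such as
# /var/www/wp-content/plugins/example-plugin/main.php
COMPONENT = re.compile(r"wp-content/(plugins|themes)/([^/\s:]+)")


def fatal_errors_by_component(lines):
    """Count PHP fatal errors per plugin or theme directory."""
    counts = Counter()
    for line in lines:
        if "PHP Fatal error" not in line:
            continue  # ignore warnings and notices
        match = COMPONENT.search(line)
        key = f"{match.group(1)}/{match.group(2)}" if match else "unknown"
        counts[key] += 1
    return counts
```

The file named in a fatal error is where the error surfaced, which is a strong hint but not proof of which component is at fault.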

By this stage we should know:

  • If we had any network-wide issues
  • If the container had any specific known issues
  • If the client’s IP has been blocked by our system
  • If we have any notes of issues that might be related
  • If there was traffic to the container
  • If there was an increase in errors

Either we have something to go back to the client with (e.g. a WordPress plugin seems to be the issue, or we did have an issue with x) or, as is often the case, nothing seems to be obviously wrong.

Frustratingly, most of the investigative work will prove inconclusive, or not support what the client or the external monitoring has said. It’s not unusual for us to come back to the client with access log records showing the external monitor making a request, and receiving a response, during the period it reports the site as down.
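Pulling those records out of an access log is straightforward if you know the monitor's IP address. A minimal sketch, again assuming Nginx's default "combined" log format, with requests_from as a hypothetical helper name:

```python
import re
from datetime import datetime

# Assumption: Nginx's default "combined" format, where the client IP comes first.
ENTRY = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3}) ')


def requests_from(lines, ip, start, end):
    """Return (time, status) for each request from ip between start and end, inclusive.

    start and end must be timezone-aware datetimes, because the parsed
    log timestamps carry a UTC offset.
    """
    found = []
    for line in lines:
        match = ENTRY.match(line)
        if not match or match.group(1) != ip:
            continue
        when = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
        if start <= when <= end:
            found.append((when, int(match.group(3))))
    return found
```

If this returns a run of 200 responses inside the reported window, the server answered the monitor and the problem most likely sits elsewhere.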

Why would external monitoring be wrong?

There are many reasons external monitoring might have false positives. Just like your website, an external monitoring service is an application on the web. It is also on its own hosting and needs to use networks to access your site.

All the reasons your site might be down could also be reasons the external monitoring service is down. To get around this, most external monitors check from multiple locations on different networks, and this is a good feature to look for when picking a provider. That doesn’t prevent them from being down or having issues but it does help mitigate them.

Network issues while communicating between the external monitor and ourselves can happen. A human site visitor might just hit F5 to refresh and all will be sorted, but cURL (the tool used by most monitoring systems) will have to completely retry the request.
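One way a monitor can guard against a single dropped request is to retry before reporting downtime. The sketch below is illustrative only: confirmed_down and the check callable are hypothetical names, and this is not a description of how any particular monitoring service behaves.

```python
import time


def confirmed_down(check, attempts=3, delay=0.0):
    """Return True only if check() fails on every one of `attempts` tries."""
    for attempt in range(attempts):
        try:
            if check():
                return False  # one success is enough to call the site up
        except OSError:
            pass  # treat network-level errors as a failed attempt
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before trying again
    return True
```

Here check is any function that returns True when the site responds correctly. Catching OSError covers connection resets and timeouts raised by Python's standard networking code.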

What happens when we don’t know what the issue is?

Sometimes we reach a stage where we can’t identify an issue. That doesn’t mean we rule out it having occurred: normally when something or someone says they can’t access the site, they can’t access the site. While we will advise that it’s probably temporary, we always ask clients to contact us if they see a repeat. For our WordPress Hosting we go one step further and, in most cases where we are sure something is causing an issue, we add the site to our own external monitoring. We use a service called updown.io, and at any time we might have dozens of sites being monitored. We normally leave a site on external monitoring for at least a month after it’s been added, which allows us to see any downtime and get a monthly uptime average.
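To put a monthly uptime average in context, here is the arithmetic as a small sketch. The function name uptime_percent and the 30-day month are assumptions made for this example.

```python
def uptime_percent(downtime_minutes: float, days: int = 30) -> float:
    """Uptime over a period of `days`, as a percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return 100 * (1 - downtime_minutes / total_minutes)


# 20 minutes of downtime in a 30-day month is about 99.954% uptime.
print(round(uptime_percent(20), 3))
```

A single 20-minute outage still leaves the month above 99.9%, which is why one incident can feel significant to a site owner while barely moving the average.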

One of the advantages of putting the site on external monitoring is that the WordPress team get notified of any downtime directly. This, combined with our internal server alerts, lets us know when there is an issue, usually before the client notices. We therefore get an opportunity to see the problem as it’s happening and hopefully fix it.

How can you help?

Most of the checks we do, you can do as well. You can check your access and error logs, along with visiting status.34sp.com.

You can run an uptime monitor for yourself, though as above we do recommend you choose a service with multiple locations and you set the check frequency to every 2 minutes or longer. You might also want to consider a visual regression testing service, to identify if parts of the page are not showing.

If you do experience downtime, the more details you can give us the better:

  • Was it you or your uptime monitor that noticed?
  • When did it start, and when was the site up again?
  • Was the connection intermittent or continuously offline?
  • Did you get an error message? If so, what was it?
  • What was your IP address at the time?

All of these help us diagnose issues quickly. The stability of a hosting platform is one of its most important features, and on our WordPress Hosting each and every ticket regarding downtime is investigated, so don’t be worried or afraid to ask why your site was down.