Whether it's from a planned upgrade or a blown RAID, your site is going to go down eventually. This was brought to light for a lot of people by a recent outage on Hacker News, an outage that was made worse by HN responding with 200 status codes. During the subsequent discussion I posted a few quick pieces of advice which got a bit of attention, so I thought it was worth writing up a real post about it. Since this all started with a website and a status code, it's only fair to focus on HTTP and how it can help.
One of the biggest things anyone can do to keep their site up is to put it behind a CDN. These Content Distribution Networks are essentially networks of HTTP proxy servers that cache your content all over the world to make delivery much faster for your users. I could ramble about CDNs for hours (this is, unfortunately, not an exaggeration), but the reason for bringing them up here is that they give you quite a few more options for dealing with downtime. So I'm going to direct you to this excellent overview of the basics of HTTP headers and CDNs and wait for you to take that all in. The very, very abridged version is that CDNs greatly reduce the load on your systems and can keep serving cached content in the event that those systems go down.
A quick sidebar. Even though we're discussing this in the context of CDNs, these rules are good to follow even if you aren't using one. HTTP proxy servers are not just used by CDNs: they can be found in most corporate environments and at several ISPs, and the caching rules in general are honored by most browsers.
One of my favorite underused status codes is 503 Service Unavailable, the code to use for maintenance and outages. This hidden gem of HTTP tells any application (browser, RSS reader, API consumer, Googlebot, CDN) that the site is temporarily unavailable but will return. This allows those applications to take a custom action for that state. In the case of browsers, proxies and CDNs that may mean serving stale content, while API consumers (client-side applications or other web-based services that use your platform) can try back later and take whatever action they need. Googlebot, of course, simply returns at a later date and continues using the data it already has for its search engine results.
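As a rough illustration of the "custom action" an API consumer might take, here is a minimal sketch of a client that keeps its last good copy of each resource and reuses it while the origin answers 503. The class name `StaleFallbackClient` and the injected `fetch` callable are hypothetical, made up for this example; a real client would wrap whatever HTTP library it already uses.

```python
from typing import Callable, Dict, Tuple

# Hypothetical fetch signature: takes a URL, returns (status_code, body).
Fetch = Callable[[str], Tuple[int, bytes]]


class StaleFallbackClient:
    """Keep the last good response per URL and reuse it while the origin says 503."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._last_good: Dict[str, bytes] = {}

    def get(self, url: str) -> Tuple[bytes, bool]:
        """Return (body, is_stale) for the given URL."""
        status, body = self._fetch(url)
        if status == 200:
            # Remember the good copy so it can be reused during an outage.
            self._last_good[url] = body
            return body, False
        if status == 503 and url in self._last_good:
            # Temporarily unavailable: serve the copy we already have.
            return self._last_good[url], True
        raise RuntimeError(f"{url} returned {status} and no cached copy exists")
```

The important part is that the client can only make this decision if the server is honest about its state: a 200 error page would overwrite the good copy instead.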
When returning a 503 status code you also have the option of adding the "Retry-After" header to the HTTP response. This header lets clients know how long they should wait before reattempting their requests, given either as a number of seconds or as an HTTP date. This really comes in handy when downtime lasts for an extended period, since otherwise new requests keep building up as more systems realize they have stale data and try to refresh.
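To make that concrete, here is a minimal sketch of a maintenance responder written as a plain WSGI app using only the Python standard library. The one-hour `RETRY_AFTER_SECONDS` value, the page body, and the port are assumptions chosen for illustration; pick a value that matches your own estimate of the downtime.

```python
from wsgiref.simple_server import make_server

# Assumption for this sketch: we expect to be back within an hour.
RETRY_AFTER_SECONDS = 3600

BODY = b"<h1>Down for maintenance</h1><p>We will be back shortly.</p>"


def maintenance_app(environ, start_response):
    """Answer every request with 503 Service Unavailable and a Retry-After hint."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(BODY))),
        ("Retry-After", str(RETRY_AFTER_SECONDS)),
    ]
    start_response("503 Service Unavailable", headers)
    return [BODY]


def serve(port=8080):
    """Run the maintenance app until interrupted (port is an arbitrary choice)."""
    with make_server("", port, maintenance_app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts the responder; any WSGI-capable server could host `maintenance_app` instead.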
Most web developers are familiar with the headers that tell browsers (and thus CDNs and proxy servers) when their content should expire, but few know that you can tell caches to keep using that content anyway. With a max-stale setting you can tell a CDN to keep your content for an extended period of time and serve it even though it has expired. (Strictly speaking, max-stale in standard HTTP is a Cache-Control directive sent by clients; the directive an origin sends for this purpose is stale-if-error, from RFC 5861, and some CDNs expose the same behavior as a configuration option.) When your content expires the CDN will still try to refresh it, but if that fails with any of the 5xx error codes or with something at the network layer (timeout, host not found, etc.), the CDN will simply continue serving that content until the max-stale timeout is reached.
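The decision a cache makes here can be sketched as a small function. This is only a model of the behavior described above, not any particular CDN's implementation; the function name, the `refresh` callable, and the returned labels are all hypothetical.

```python
from typing import Callable


def serve_from_cache(age: int, max_age: int, max_stale: int,
                     refresh: Callable[[], int]) -> str:
    """Decide what a cache serves for an object that is `age` seconds old.

    `refresh` contacts the origin and returns its status code, or raises
    OSError (which includes timeouts) on a network-level failure.
    """
    if age <= max_age:
        return "fresh"  # not expired yet, no origin contact needed

    try:
        status = refresh()
    except OSError:
        status = None  # timeout, host not found, connection refused, ...

    if status is not None and status < 500:
        return "origin"  # the origin answered, so use its response

    # Origin failed: keep serving the expired copy inside the stale window.
    if age <= max_age + max_stale:
        return "stale"
    return "error"
```

For example, with `max_age=60` and `max_stale=3600`, an object that is two minutes old is still served as "stale" when the origin answers 503, but the error is passed through once the object is more than 3660 seconds old.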
Not all content is cacheable, and not all systems are going to hit your cache. In the event of a full outage you still want to be able to issue some sort of message to your users, and that's where error pages come in.
The most important thing you can do with your error pages is make sure they issue error-level status codes. Hacker News issued 200 status codes, indicating that everything was working properly. This resulted in many people thinking the site was down for longer than it was, as their browsers were still caching the error pages themselves. Having your error page issue a 503 status code will make your recovery a lot smoother.
You don't want to have to set this all up while things are already down, but it is remarkably easy to set up in advance. The most straightforward and effective way is to set up a web server, preferably in a region away from your main systems, that already has your error page on it. Make this server respond to any request that comes in with a 302 Found (temporary redirect) response that points to that error page, and have the error page itself return a 503 status code. In the event of an outage you can simply point your traffic at that server.
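A minimal sketch of such an error server, again as a standard-library WSGI app, might look like the following. The path `/maintenance.html`, the page body, and the `no-store` header on the redirect (added so caches don't hold on to it after recovery) are assumptions for this example rather than part of any specific setup.

```python
# Hypothetical location of the pre-built error page on the failover server.
ERROR_PAGE_PATH = "/maintenance.html"

ERROR_PAGE = b"<h1>We are having problems</h1><p>Please try again soon.</p>"


def failover_app(environ, start_response):
    """Redirect every request to the error page; serve that page with a 503."""
    path = environ.get("PATH_INFO", "/")

    if path == ERROR_PAGE_PATH:
        # The error page itself must carry an error-level status code.
        start_response("503 Service Unavailable", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(ERROR_PAGE))),
        ])
        return [ERROR_PAGE]

    # Everything else is temporarily redirected to the error page.
    start_response("302 Found", [
        ("Location", ERROR_PAGE_PATH),
        ("Cache-Control", "no-store"),
        ("Content-Length", "0"),
    ])
    return [b""]
```

It can be run with `wsgiref.simple_server.make_server("", 8080, failover_app).serve_forever()` or behind any WSGI server.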
How you go about automating this is going to depend on your setup. If it's a fairly simple one, there are great DNS-based failover systems, like those run by Dyn and Edgecast, which will actually monitor your servers and point your traffic at a failover server (presumably the maintenance server from above) when there's an issue. Even if you're hosting on something like AWS or in your own datacenter, where you can do this at the load balancer level, it may be worth using the DNS-based system as well, since it'll work even if there's a widespread outage.
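If you end up scripting this yourself, the core logic is small. The sketch below is a hypothetical monitor, not how any DNS provider implements failover: `is_healthy` and `point_traffic_at` are callables you would supply (a health check and whatever DNS or load balancer update your provider offers), and the three-failure threshold is an arbitrary choice to avoid flapping on a single bad check.

```python
from typing import Callable


class FailoverMonitor:
    """Switch traffic to a maintenance host after repeated failed health checks."""

    def __init__(self, is_healthy: Callable[[], bool],
                 point_traffic_at: Callable[[str], None],
                 primary: str, maintenance: str, threshold: int = 3) -> None:
        self.is_healthy = is_healthy
        self.point_traffic_at = point_traffic_at
        self.primary = primary
        self.maintenance = maintenance
        self.threshold = threshold
        self.failures = 0
        self.active = primary

    def tick(self) -> str:
        """Run one health check and return the host traffic now points at."""
        if self.is_healthy():
            self.failures = 0
            target = self.primary
        else:
            self.failures += 1
            # Only fail over once the failure count reaches the threshold.
            target = self.maintenance if self.failures >= self.threshold else self.active

        if target != self.active:
            self.point_traffic_at(target)
            self.active = target
        return self.active
```

You would call `tick()` on a timer, for example once every 30 seconds, from a machine that does not share a failure domain with the primary.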
Putting it all together
- Use a CDN to cache your content.
- Take advantage of max-stale (or the stale-if-error directive) to let systems know how long they can use your data while your server is down.
- When in an error state use the 503 Service Unavailable status code.
- Set up an "error server" in advance so you can easily put this all in place during an emergency.
- Automate it!
At 11am Monday morning your server goes down. Your monitoring system notices this and sends you an alert. At the same time it triggers an action to change the DNS of your origin server to point at your maintenance server, which is issuing your 503 status code. You manage to poke the server at 11:10am, realize that there's a failed drive, and begin restoring things to a separate machine. At 3pm you're back up and running. More importantly, during those four hours of downtime 95% of your website was still functional and the places that had to be dynamic and were not working had an error message explaining the situation. When you came back up your site was usable again quickly and without issue.
This was my first blog post here, and I hope you all got something out of it. This was a rather general overview, so feel free to ask any questions and point me at anything you'd like me to follow up on in more depth.