For some time now we have been looking at ways to increase our availability from 99.99% (down just under an hour a year) to 99.999% (down no more than about five minutes a year). No matter how careful we are, we have had a couple of minor outages every year for the last couple of years, normally because of a power failure or somebody pulling out the wrong cable in a router in our main datacentre, and luckily always at 03:00, so not many customers noticed.
When this happens we have a “failover” system in another datacentre that is on hot standby and has a replicated copy of all our customers’ data, up to the last second. Our engineers haul themselves out of bed in the middle of the night and try to work out what has gone wrong, and if it can’t be fixed quickly we switch everybody to the failover system.
This is all fine (except for the engineers!) but it still takes about 30 minutes to switch, which doesn’t get us to the holy grail of 99.999% availability.
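To put numbers on that, here is a rough back-of-the-envelope calculation in Python. It assumes a 365-day year, and the function names are purely illustrative.

```python
# Back-of-the-envelope availability arithmetic.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_from_downtime(downtime_minutes: float) -> float:
    """Availability (as a percentage) for a year with the given downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows {budget:.1f} minutes of downtime a year")

    # A single 30-minute switchover, and nothing else, in a year.
    print(f"One 30-minute outage gives {availability_from_downtime(30):.4f}%")
```

On those assumptions, 99.99% allows roughly 52.6 minutes of downtime a year and 99.999% roughly 5.3 minutes, so a single 30-minute switchover uses up more than five years’ worth of the five-nines budget.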
The recent outages at the likes of Amazon’s EC2 service (Amazon’s proprietary hosting platform for third-party applications, which went down for 36 hours last month, taking many products down with it) have heightened awareness of such problems. So over the Easter weekend we implemented “master-master” synchronisation between the two datacentres, instead of the “master-slave” sync that we had before. What that means is that users can now log on to their data in either datacentre without us having to initiate a formal process, as both datacentres are now running as live systems instead of one live and one failover. Over the next month or so users may notice that they are logging on to system2.reallysimplesystems.com instead of system.reallysimplesystems.com as we move customers from one datacentre to the other.
The next stage in the rollout is to automatically detect when one datacentre goes down and instantly switch users to the other datacentre. This should be in place in the next month.
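As a purely illustrative sketch, not a description of the real implementation, automatic detection and switching could look something like this in Python. The `FailoverRouter` class, the `probe` callback and the three-strikes threshold are all hypothetical; in practice the probe would be something like an HTTP health check against each datacentre.

```python
from typing import Callable, Dict, List


class FailoverRouter:
    """Hypothetical sketch: route users away from a datacentre that
    has failed several health checks in a row."""

    def __init__(
        self,
        datacentres: List[str],
        probe: Callable[[str], bool],  # returns True if the datacentre responds
        max_failures: int = 3,         # assumed threshold, not a real setting
    ) -> None:
        if not datacentres:
            raise ValueError("at least one datacentre is required")
        self.datacentres = list(datacentres)
        self.probe = probe
        self.max_failures = max_failures
        # Consecutive failed probes per datacentre.
        self.failures: Dict[str, int] = {dc: 0 for dc in self.datacentres}

    def check_all(self) -> None:
        """Probe every datacentre once and update the failure counters."""
        for dc in self.datacentres:
            if self.probe(dc):
                self.failures[dc] = 0  # any success resets the count
            else:
                self.failures[dc] += 1

    def is_up(self, dc: str) -> bool:
        """A datacentre counts as up until it reaches max_failures in a row."""
        return self.failures[dc] < self.max_failures

    def route(self, preferred: str) -> str:
        """Return the preferred datacentre if it is up, else any other live one."""
        if self.is_up(preferred):
            return preferred
        for dc in self.datacentres:
            if dc != preferred and self.is_up(dc):
                return dc
        raise RuntimeError("no datacentre is currently available")
```

Requiring several consecutive failures before switching is one way to avoid bouncing users between datacentres on a single dropped probe; the trade-off is that detection takes a few probe intervals rather than being truly instant.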
As well as delivering better uptime, this design should also deliver better performance, as we will be making use of all the servers all the time, instead of having half of them twiddling their thumbs in the failover datacentre.
Posted by reallysimplesystems