Apologies for the server glitches today

15 June 2011

We had a couple of five-minute blips on one of our servers today, which meant we had to log people off that server and onto another one. The server in question became overloaded and failed faster than our monitoring could react, so we had to shut it down and move everybody who was logged onto it to another server, which meant those people had to log in again.

The good news is that our failover/replication worked perfectly, so no data was lost and users could just log in again and continue working.

This evening we’ve turned on some more servers and smartened up the load balancing so things should run smoothly again tomorrow.

It is a strange fact that servers never seem to degrade slowly: one minute they are fine, the next they slip into a death spiral with no warning. We’ve been signing up a lot of new users in the last few months, and although we had new servers racked and standing by ready to go, the suddenness of the extra load took us by surprise.

So our apologies to those users who were inconvenienced. We are working hard here to make sure it doesn’t happen again.


Infrastructure improvements at Really Simple Systems – going for 99.999%

9 May 2011

For some time now we have been looking at ways to increase our availability from 99.99% (down about an hour a year) to 99.999% (down no more than five minutes a year). No matter how careful we are, we have had a couple of minor outages every year for the last couple of years, normally because of a power failure or somebody pulling out the wrong cable in a router in our main datacentre, and luckily always at 03:00 so not many customers noticed.

When this happens we have a “failover” system in another datacentre that is on hot standby and has a replicated copy of all our customers’ data, up to the last second. Our engineers haul themselves out of bed in the middle of the night and try to work out what has gone wrong, and if it can’t be fixed quickly we switch everybody to the failover system.

This is all fine (except for the engineers!) but it still takes about 30 minutes to switch, which doesn’t get us to the holy grail of 99.999% availability.
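
To put those availability figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only an illustrative sketch: the function name is made up for this example, and it assumes an average year of 365.25 days.

```python
# Minutes in an average year (365.25 days, an assumption for this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} minutes/year")
# prints:
# 99.99% -> 52.6 minutes/year
# 99.999% -> 5.3 minutes/year
```

In other words, a single 30-minute manual switch uses up several years' worth of the downtime that 99.999% allows.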

The recent outages at the likes of Amazon’s EC2 service (Amazon’s proprietary hosting platform for third-party applications, which went down for 36 hours last month, taking many products down with it) have heightened awareness of such problems, so over the Easter weekend we implemented “master-master” synchronisation between the two datacentres, instead of the “master-slave” sync that we had before. What that means is that users can now log onto their data on either datacentre without us having to initiate a formal process, as both datacentres are now running as live systems instead of one live and one failover. Over the next month or so users may notice that they are logging onto system2.reallysimplesystems.com instead of system.reallysimplesystems.com as we move customers from one datacentre to another.

The next stage in the rollout is to automatically detect when one datacentre goes down and instantly switch users to the other datacentre. This should be in place in the next month.
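
For the technically curious, the general idea of automatic detection can be sketched in a few lines of Python. This is a simplified illustration and not our production code: the function names, the port and the timeout are assumptions made for the example, and a real system would also need repeated checks and protection against flapping.

```python
import socket
from typing import Callable, Sequence


def tcp_check(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Hypothetical health check: can we open a TCP connection in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_datacentre(hosts: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first host, in preference order, that passes its health check."""
    for host in hosts:
        if is_healthy(host):
            return host
    raise RuntimeError("no healthy datacentre available")


HOSTS = ["system.reallysimplesystems.com", "system2.reallysimplesystems.com"]

# Simulate the first datacentre being down, so the example needs no network.
down = {"system.reallysimplesystems.com"}
print(pick_datacentre(HOSTS, lambda host: host not in down))
# prints: system2.reallysimplesystems.com
```

Passing `tcp_check` as the second argument instead of the simulated check would probe the real hosts over the network.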

As well as delivering better uptime this design should also deliver better performance, as we will be making use of all the servers all the time, instead of having half of them twiddling their thumbs in the failover datacentre.


LINX failure causes CRM access problems

14 December 2009

Last Thursday afternoon (10-Dec-2009) we had a few users reporting that access to www.reallysimplesystems.com was slow and/or inconsistent, with page timeouts occasionally happening. We spent a manic few hours trying to diagnose the problem, and discovered it was a major fault in the UK’s Internet routing system: the London Internet Exchange (LINX) had a major failure that affected many thousands of sites. It was fixed later that same afternoon.

The Internet is normally pretty robust: when one part fails it is usually able to reconfigure itself to find another route to and from end points, but this time it took a while for engineers to fix.

Thanks to all the customers who reported this. There is a full description of the problem on The Register.

