How Effective is IP Takeover at Amazon EC2?

24 April 2008: Amazon is building a revolutionary cloud computing platform with their Elastic Compute Cloud (EC2) service. The recently announced Elastic IP feature dramatically expands the possibilities of EC2 as a true hosting environment.

For standard website requirements, the current implementation appears suitable, but for projects that require high availability, there is at least one significant limitation.

We envision a load-balanced cluster wholly within EC2. The front end of this setup would be managed by two small EC2 instances serving as load balancers, or routers. Requests would arrive at the primary router and be directed to the least loaded instance within the cluster. Because a single router is a single point of failure, at least one additional router is required for a truly highly available system. A monitor would regularly ping the primary router; if it detects a problem, the secondary router would reassign the IP address to itself and take over as the primary router.
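The monitoring loop described above can be sketched roughly as follows. This is a minimal illustration, not the monitor we ran: the names monitor_and_failover, is_primary_healthy and take_over_ip are hypothetical, as are the default threshold and interval. The health check and the call that actually re-associates the Elastic IP are passed in as functions, since their details depend on the setup.

```python
import time
from typing import Callable


def monitor_and_failover(
    is_primary_healthy: Callable[[], bool],  # hypothetical: e.g. a ping or HTTP check
    take_over_ip: Callable[[], None],        # hypothetical: re-associates the IP with this host
    max_failures: int = 3,                   # assumed threshold, not from our tests
    interval_s: float = 1.0,                 # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the primary until it fails max_failures checks in a row, then take over.

    Returns the total number of health checks performed.
    """
    consecutive_failures = 0
    checks = 0
    while True:
        checks += 1
        if is_primary_healthy():
            # A single good response resets the counter, so one dropped
            # ping does not trigger a takeover.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                take_over_ip()
                return checks
        sleep(interval_s)
```

Requiring several consecutive failures trades a slightly slower reaction for fewer false takeovers. On the secondary router, this would run with a real health check and a real reassignment call in place of the two placeholders.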

In conjunction with Amazon's Availability Zones, such a system would not have any single points of failure. To test the feasibility of this layout, we spawned two small EC2 instances and measured the time it took for the second instance to take over the IP address of the first. Across three tests, the takeover took three and a half minutes on average and never less than three minutes.
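A measurement like this can be automated along the following lines. This is a sketch under assumptions rather than the procedure we used: time_until_takeover and ip_served_by_secondary are hypothetical names, and the check itself (for example, requesting a page that reports which instance answered) is left to the caller. The clock and sleep functions are parameters so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def time_until_takeover(
    ip_served_by_secondary: Callable[[], bool],  # hypothetical probe of the shared IP
    timeout_s: float = 600.0,                    # assumed upper bound on the wait
    poll_interval_s: float = 1.0,                # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the seconds elapsed until the probe succeeds, or None on timeout.

    Start this immediately after requesting the IP reassignment.
    """
    start = clock()
    while True:
        if ip_served_by_secondary():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The result is only as precise as the polling interval, which is negligible next to a takeover measured in minutes but would matter when timing a takeover of a few seconds.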

We run a similar cluster in a traditional hosting environment and IP takeovers take approximately 2 seconds.

The upshot is that if the primary router fails, there would be a theoretical downtime of roughly 3.5 minutes while the secondary router waits for the IP reassignment to propagate. We presume that the large number of routers within Amazon's network makes quicker IP propagation a nontrivial task.
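To put the two takeover times side by side, the total outage can be estimated as the time needed to detect the failure plus the time for the takeover itself. The sketch below uses our measured figures (210 seconds on EC2, 2 seconds in the traditional environment); the 5-second check interval and the three-failure threshold are illustrative assumptions, and estimated_downtime_s is a hypothetical helper.

```python
def estimated_downtime_s(check_interval_s: float, failures_needed: int, takeover_s: float) -> float:
    """Estimate the outage: time to detect the failure plus time for the IP takeover."""
    # Assumes the primary fails just after a successful check, so detection
    # takes failures_needed full intervals.
    detection_s = check_interval_s * failures_needed
    return detection_s + takeover_s


# Assumed monitor settings: one check every 5 seconds, 3 failures before takeover.
ec2 = estimated_downtime_s(5, 3, 210)        # 3.5 minutes measured on EC2
traditional = estimated_downtime_s(5, 3, 2)  # about 2 seconds in traditional hosting

print(f"EC2: {ec2} s, traditional: {traditional} s")
```

Under these assumptions, failure detection dominates the outage in the traditional environment, while on EC2 the takeover time dwarfs it, so tuning the monitor would do little to shorten the downtime.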

One solution would be to move the routers outside of EC2. However, the added latency between external routers and the instances inside EC2 makes this solution suboptimal.

Ideally, Amazon would offer a dedicated load-balancing solution designed specifically for such purposes. Unless some other solution is offered, "highly available" clusters wholly within Amazon's EC2 service will not be truly highly available.