[...]
Primary Outage
At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly: rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower-capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that the affected nodes had neither a functioning primary nor a functioning secondary network, because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from the other EBS nodes in their cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary networks simultaneously, leaving the affected nodes completely isolated from one another.
When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes "stuck" in a loop, continuously searching the cluster for free space. This quickly led to a "re-mirroring storm," where a large number of volumes were effectively "stuck" while the nodes searched the cluster for the storage space they needed for their new replicas. At this point, about 13% of the volumes in the affected Availability Zone were in this "stuck" state.
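To make the failure mode concrete, here's a rough Python sketch of the dynamic being described: every node that lost its replica hunts for free space, the cluster's remaining capacity is consumed almost immediately, and everyone else loops retrying. The names and numbers are invented for illustration; this is not AWS's actual code.

# Illustrative model of a "re-mirroring storm" -- all names/numbers are hypothetical.
CLUSTER_FREE_GB = 5_000      # free space left in the degraded cluster (assumed)
VOLUME_SIZE_GB = 100         # assume uniform volume size for simplicity

def remirror_storm(num_disconnected_volumes: int) -> None:
    """Simulate many volumes re-mirroring at once against limited free capacity."""
    free_gb = CLUSTER_FREE_GB
    stuck = 0
    for _ in range(num_disconnected_volumes):
        if free_gb >= VOLUME_SIZE_GB:
            free_gb -= VOLUME_SIZE_GB   # replica placed; this volume recovers quickly
        else:
            stuck += 1                  # no space anywhere: volume keeps searching ("stuck")
    print(f"recovered: {num_disconnected_volumes - stuck}, stuck searching: {stuck}")

remirror_storm(num_disconnected_volumes=2_000)
# Only the first 50 volumes find space; the rest retry indefinitely, and every
# retry adds more load to nodes that are already saturated.

The point of the toy model is just that once free capacity hits zero, the search never succeeds, so the retries themselves become the dominant load on the cluster.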
[...]
The team faced two challenges which delayed getting capacity online. First, when a node fails, the EBS cluster does not reuse the failed node until every data replica is successfully re-mirrored. This is a conscious decision so that we can recover data if a cluster fails to behave as designed. Because we did not want to re-purpose this failed capacity until we were sure we could recover affected user volumes on the failed nodes, the team had to install a large amount of additional new capacity to replace that capacity in the cluster. This required the time-consuming process of physically relocating excess server capacity from across the US East Region and installing that capacity into the degraded EBS cluster. Second, because of the changes made to reduce the node-to-node communication used by peers to find new capacity (which is what stabilized the cluster in the step described above), the team had difficulty incorporating the new capacity into the cluster. The team had to carefully adjust their negotiation throttles to allow negotiation with the newly-built servers without again inundating the old servers with requests they could not service. This process took longer than expected, as the team had to navigate a number of issues while working around the disabled communication. At about 2:00 AM PDT on April 22nd, the team successfully started adding significant amounts of new capacity and working through the replication backlog. Volumes were restored steadily over the next nine hours, and all but about 2.2% of the volumes in the affected Availability Zone were restored by 12:30 PM PDT on April 22nd. While the restored volumes were fully replicated, not all of them immediately became "unstuck" from the perspective of the attached EC2 instances, because some were blocked waiting for the EBS control plane to be contactable so they could safely re-establish a connection with the EC2 instance and elect a new writable copy.
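The "negotiation throttle" idea maps naturally onto a rate limiter. Here's a small, hypothetical token-bucket sketch in Python (not the real EBS mechanism; the class name, rates, and API are my own invention) showing how negotiation requests against newly added servers could be admitted gradually instead of all at once:

# Hypothetical negotiation throttle -- not the actual EBS implementation.
import time

class NegotiationThrottle:
    """Token bucket: admit at most `rate_per_sec` negotiation requests per second."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # let this node negotiate for replica space now
        return False       # defer and retry later, so servers aren't flooded again

# Each "stuck" volume would check the throttle before contacting the new capacity:
throttle = NegotiationThrottle(rate_per_sec=50, burst=100)   # invented numbers
if throttle.allow():
    ...  # proceed with re-mirroring negotiation against the newly built servers

Tuning that admission rate up slowly is essentially the careful process the summary describes: open the tap enough to drain the replication backlog, but not so far that the old servers drown in requests again.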
[...]
I'm sure a lot of network engineers earned their overtime pay during this exercise. :-(
I can't wait for the days when failures like this take place across national (and planetary) boundaries....
It's good that they're being so transparent about what went wrong and how they're going to address their processes to learn from it and keep it from happening again.
Cheers,
Scott.