High availability NAT instances in AWS VPCs

There are plenty of things Amazon is getting right with VPCs, such as peering, VPN gateways, and fine-grained security via security groups and network ACLs, but the NAT requirement is a big miss. It’s a nuisance to self-manage (and pay for) multiple systems just to provide outbound Internet access for instances in private subnets. Sadly, it’s been this way for some time now.

There are some reasonable ways to make NAT highly available, albeit with some complexity. Google yields a few articles on the topic. The approach documented by Amazon uses a monitoring script to automate the route takeover process. Another article uses one autoscale group per zone.

This article describes another method to achieve HA NAT. It’s similar to the second approach above, but uses a single autoscale group with one instance per zone. When an instance in a zone is marked unhealthy, autoscale boots another instance to replace it. I provide all the details in code, comprising the following:

A Packer template to build a NAT AMI, provisioned by Ansible
A CloudFormation template that creates 1) a VPC with three public and six private subnets and 2) an autoscale group that ensures a NAT in each availability zone
A script that runs at boot via user data to take over routing

Packer

Packer is another tool from the fine folks at HashiCorp that builds machine images for multiple cloud platforms. I use it to create the NAT AMI used by the autoscale group. Find the Packer template and usage information on GitHub. Install Packer, then use the template to create an AMI for the NAT instance. Make a note of the AMI ID.

Don’t forget to give Packer an IAM Role to use for creating the AMI. The policy can be found in the documentation.

NAT IAM Role

Once the AMI is created, don’t forget to create an IAM Role for the NAT instance to assume. A sample policy looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:ModifyInstanceAttribute",
        "ec2:DescribeSubnets",
        "ec2:DescribeRouteTables",
        "ec2:CreateRoute",
        "ec2:ReplaceRoute"
      ],
      "Resource": "*"
    }
  ]
}

CloudFormation

The CloudFormation template, also found on GitHub, creates a VPC laid out as I describe in a previous blog post. Launch multiple VPCs using this template and then peer them to recreate that model exactly.

Submit the template to AWS. It asks for a name to use for the VPC, the first two octets of an address range, the name of a key pair to use on the NAT instances, and the AMI ID to use for the NAT instance. Once submitted, it creates the VPC and all the subnets, then launches an autoscale group with one instance per availability zone.
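To make the "first two octets" parameter concrete, here is a minimal Python sketch of how a /16 range could be carved into three public and six private subnets. The /20 subnet size, the ordering of the blocks, and the function name plan_subnets are my assumptions for illustration; the actual template on GitHub may lay out its subnets differently.

```python
import ipaddress


def plan_subnets(first_two_octets, new_prefix=20, public=3, private=6):
    """Carve a /16 into equally sized blocks and hand them out in order.

    Hypothetical layout: the first `public` blocks become public subnets and
    the next `private` blocks become private subnets.
    """
    vpc = ipaddress.ip_network(f"{first_two_octets}.0.0/16")
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    if public + private > len(blocks):
        raise ValueError("not enough blocks for the requested subnets")
    return {
        "vpc": str(vpc),
        "public": [str(b) for b in blocks[:public]],
        "private": [str(b) for b in blocks[public:public + private]],
    }


if __name__ == "__main__":
    plan = plan_subnets("10.20")
    print(plan["vpc"])
    for kind in ("public", "private"):
        for cidr in plan[kind]:
            print(kind, cidr)
```

With nine /20 blocks in use, a /16 still has seven spare blocks for later growth.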

The template could be improved to have a map of regions to AMI IDs, but it will need regular updates as the AMIs are rebaked. You do rebake AMIs for patching frequently, right?
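As a rough sketch of that improvement, the following Python builds a CloudFormation Mappings fragment from a region-to-AMI dictionary, along with the Fn::FindInMap expression that would replace the AMI ID parameter. The map name RegionToAmi, the key AMI, and the AMI IDs are placeholders I made up; substitute the IDs Packer reports after each rebake.

```python
import json


def ami_mapping(amis_by_region, map_name="RegionToAmi"):
    """Return (Mappings fragment, ImageId expression) for a template.

    `map_name` and the "AMI" key are hypothetical names chosen for this sketch.
    """
    mappings = {
        map_name: {
            region: {"AMI": ami} for region, ami in sorted(amis_by_region.items())
        }
    }
    # Resolves to the AMI for whichever region the stack is launched in.
    image_id = {"Fn::FindInMap": [map_name, {"Ref": "AWS::Region"}, "AMI"]}
    return mappings, image_id


if __name__ == "__main__":
    # Placeholder IDs, not real AMIs.
    mappings, image_id = ami_mapping(
        {"us-east-1": "ami-PLACEHOLDER1", "us-west-2": "ami-PLACEHOLDER2"}
    )
    print(json.dumps({"Mappings": mappings}, indent=2))
    print(json.dumps({"ImageId": image_id}, indent=2))
```

Generating the fragment from the build output keeps the map in step with each rebake instead of relying on hand edits.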

HA NAT Bash script

The NAT takeover script was originally written by AWS. The CloudFormation template above tags each private subnet with network=private, and the script uses this tag to determine which subnets are private. Using that information, the script finds the route tables for those subnets, then either creates or replaces the default route in each one so that it points at the NAT instance itself.
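The takeover logic can be sketched in Python as follows. This is not the Bash script itself, only an illustration of the same steps under some assumptions: the ec2 argument is a boto3-style EC2 client (for example boto3.client("ec2")), the function names are hypothetical, and disabling the source/destination check is something I infer from the ec2:ModifyInstanceAttribute permission in the policy above.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def private_subnet_ids(ec2, vpc_id, zone):
    """Return IDs of subnets tagged network=private in this VPC and zone."""
    resp = ec2.describe_subnets(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "availability-zone", "Values": [zone]},
            {"Name": "tag:network", "Values": ["private"]},
        ]
    )
    return [subnet["SubnetId"] for subnet in resp["Subnets"]]


def take_over_routes(ec2, instance_id, vpc_id, zone):
    """Point the default route of each private route table at this instance.

    Returns the IDs of the route tables that were updated.
    """
    # Assumption: a NAT forwards traffic it did not originate, so the
    # source/destination check has to be switched off first.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, SourceDestCheck={"Value": False}
    )

    subnet_ids = private_subnet_ids(ec2, vpc_id, zone)
    if not subnet_ids:
        return []

    # Only finds explicitly associated tables; subnets that fall back to the
    # VPC's main route table are not covered by this sketch.
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": subnet_ids}]
    )["RouteTables"]

    updated = []
    for table in tables:
        has_default = any(
            route.get("DestinationCidrBlock") == DEFAULT_ROUTE
            for route in table.get("Routes", [])
        )
        # Replace an existing default route, otherwise create one.
        change = ec2.replace_route if has_default else ec2.create_route
        change(
            RouteTableId=table["RouteTableId"],
            DestinationCidrBlock=DEFAULT_ROUTE,
            InstanceId=instance_id,
        )
        updated.append(table["RouteTableId"])
    return updated
```

On a real instance, the instance ID, VPC ID, and zone would come from the instance metadata service before calling take_over_routes.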

Summary

This approach uses a single autoscale group with one instance per zone to ensure that each zone has a dedicated NAT, and that the NAT is self-healing if an instance is marked as unhealthy.

Some time can pass before EC2 marks an instance as unhealthy and autoscale replaces it. During this time the private instances in the same zone will have no working default route and thus no Internet connectivity. I haven’t measured how long that period might be.

There’s also a chance that the NAT fails to route traffic but EC2 doesn’t mark it as unhealthy. EC2 status checks cover only a few specific conditions: loss of power or network connectivity, vaguely defined “software or hardware issues on the host”, failed startup configuration, exhausted memory, corrupt filesystems, or an incompatible kernel. If the kernel stopped routing packets or something else went wrong, EC2 would not notice and the instance would not be replaced. This could be addressed by running a script on each instance that sends a heartbeat message. If the heartbeat isn’t received, the instance is replaced.
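A minimal sketch of the receiving side of such a heartbeat, in Python, might look like this. How the heartbeat is delivered and stored is left open; the code only assumes that a monitor knows the timestamp of the last heartbeat, and that autoscaling is a boto3-style Auto Scaling client (for example boto3.client("autoscaling")). The function names and the 60-second threshold are hypothetical.

```python
import time


def is_stale(last_heartbeat, now, max_age_seconds=60):
    """True if the last heartbeat is older than the allowed age."""
    return now - last_heartbeat > max_age_seconds


def mark_unhealthy_if_stale(autoscaling, instance_id, last_heartbeat,
                            now=None, max_age_seconds=60):
    """Tell the autoscale group to replace an instance whose heartbeat stopped.

    Returns True if the instance was marked unhealthy.
    """
    now = time.time() if now is None else now
    if not is_stale(last_heartbeat, now, max_age_seconds):
        return False
    # Marking the instance unhealthy lets autoscale terminate and replace it.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=True,
    )
    return True
```

A more useful heartbeat would originate from an instance in a private subnet and travel through the NAT, so that it exercises the routing path and not just the NAT's own process table.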