I was a bit dumbfounded when I first learned that EC2 instances did not come with an out-of-the-box solution for automating backups. Sure, it was easy enough to create a machine image manually, but what about weekly, daily, or even hourly backups? So, in my early days with AWS, I cobbled together a set of scripts to automate creating AMIs and rotating them out on a schedule. To my surprise, years down the road those simple scripts are still chugging away and saving my bacon on a regular basis. Let’s explore them.

For one reason or another we started this in Node.js. All was well until we ran into async hell. Luckily the backup script was simple and worked well in Node.js, but the rotation script became unnecessarily complicated, so we rewrote it in Python to save time. Sometimes it is easier to start over than to fix something broken.

You are more than welcome to copy and paste these two scripts into Lambda functions and use them in your own environments.

Backup Script (Node.js 6.10)

The backup script takes an AMI of each of your EC2 instances. It could be enhanced to look for an EC2 tag such as "backupFrequency".

var aws = require('aws-sdk');
aws.config.region = 'us-east-1'; //Change this to the region you like
var ec2 = new aws.EC2();

String.prototype.lpad = function(padString, length) {
    var str = this;
    while (str.length < length)
        str = padString + str;
    return str;
}

function getDateTime(format){
    var outDate = "";
    var dt = new Date();
    for (var i = 0, len = format.length; i < len; i++) {
        switch(format[i]) {
            case 'Y':
                outDate += dt.getFullYear();
                break;
            case 'm':
                outDate += (dt.getMonth() + 1).toString().lpad("0",2); // getMonth() is zero-based
                break;
            case 'd':
                outDate += dt.getDate().toString().lpad("0",2);
                break;
            case 'H':
                outDate += dt.getHours().toString().lpad("0",2);
                break;
            case 'M':
                outDate += dt.getMinutes().toString().lpad("0",2);
                break;
            case 'S':
                outDate += dt.getSeconds().toString().lpad("0",2);
                break;
            default:
                outDate += format[i];
                break;
        }
    }
    return outDate;
}

//Lambda handler
exports.handler = function (event, context) {
    ec2.describeInstances(function (err, data) {
        if (err) {
            console.log(err, err.stack);
        } else {
            for (var i in data.Reservations) {
                for (var j in data.Reservations[i].Instances) {
                    var instanceid = data.Reservations[i].Instances[j].InstanceId;
                    var name = "";
                    for (var k in data.Reservations[i].Instances[j].Tags) {
                        if (data.Reservations[i].Instances[j].Tags[k].Key == 'Name') {
                            name = data.Reservations[i].Instances[j].Tags[k].Value;
                        }
                    }
                    if (data.Reservations[i].Instances[j].State.Name != 'terminated') {
                        // Build the name once so the logged name and the AMI name cannot differ
                        var imageName = name + "_" + instanceid + "_" + getDateTime('YmdHMS');
                        console.log(imageName);
                        var params = {
                            InstanceId: instanceid,
                            Name: imageName,
                            Description: 'Autobackup with Lambda function (ec2AmiCreation)',
                            NoReboot: true
                        };
                        ec2.createImage(params, function (err, data) {
                            if (err) console.log(err, err.stack); // an error occurred
                            else console.log(data);           // successful response
                        });
                    }
                }
            }
        }
    });
}
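
The AMI name doubles as a record of when the backup was taken, so its format matters. As a rough illustration in Python (the helper names build_ami_name and parse_ami_name are hypothetical and not part of the scripts in this post), a zero-padded YmdHMS timestamp sorts correctly as a plain string:

```python
from datetime import datetime


def build_ami_name(instance_name, instance_id, when):
    """Mirror the backup script's naming: <Name tag>_<instance id>_<YmdHMS>."""
    return "_".join([instance_name, instance_id, when.strftime("%Y%m%d%H%M%S")])


def parse_ami_name(ami_name):
    """Split from the right so underscores in the Name tag do not break parsing."""
    parts = ami_name.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError("unexpected AMI name: " + ami_name)
    return parts[0], parts[1], parts[2]


# Made-up instance name and id, one backup on each side of a month boundary.
older = build_ami_name("web_01", "i-0abc123", datetime(2018, 9, 30, 6, 0, 0))
newer = build_ami_name("web_01", "i-0abc123", datetime(2018, 10, 1, 6, 0, 0))

# Fixed-width, zero-padded timestamps compare the same way the dates do.
assert parse_ami_name(older)[2] < parse_ami_name(newer)[2]
print(parse_ami_name(newer))
```

This is also why the month has to be written as 01 to 12 and padded: a timestamp with a missing digit or a zero-based month would still sort, but in the wrong order.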

Rotation Script (Python 3.6)

The rotation script retains the newest x AMIs for each instance and deletes the rest along with their corresponding snapshots.

import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client('ec2')
retain_count = 14

# log each ami that is retained
def log_ami_retention(i_name, i_id, i_amis):
    print('Retaining', len(i_amis), 'AMIs for', i_name, '(' + i_id + ')')
    for ami in i_amis:
        print(' - Retaining AMI:', ami['ImageId'], 'created', ami['CreationDate'], '(' + i_id + ')')
    return

for reservation in ec2.describe_instances()['Reservations']:
    for instance in reservation['Instances']:
        try:
            # time_stamp = time.strftime("%Y%m%d%H%M%S")
            instance_id = instance['InstanceId']
            instance_state = instance['State']['Name']
            instance_name = ''
            # Untagged instances have no 'Tags' key, so fall back to an empty list
            name_tag = next((x for x in instance.get('Tags', []) if x['Key'] == 'Name'), None)
            if name_tag is not None:
                instance_name = name_tag['Value']

            img_filters = [
                {'Name': 'description', 'Values': ['Autobackup with Lambda function (ec2AmiCreation)']},
                {'Name': 'name', 'Values': ['*' + instance_id + '*']}
            ]

            if instance_state != 'terminated':
                instance_images = ec2.describe_images(Filters=img_filters)['Images']
                print(len(instance_images))
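                # Collect valid AMIs in a new list; removing items from a list
                # while iterating over it skips elements.
                valid_images = []
                for ami in instance_images:
                    # Split from the right so underscores in the Name tag are tolerated
                    name_items = ami['Name'].rsplit('_', 2)
                    if len(name_items) != 3 or name_items[1] != instance_id:
                        print('Error: Parsing timestamp for AMI',
                                ami['ImageId'],
                                '('+ami['Name']+')',
                                'for instance',
                                instance_id)
                    else:
                        ami['CreationTimestamp'] = name_items[2]
                        valid_images.append(ami)
                instance_images = valid_images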

                instance_images.sort(key=lambda x: x['CreationTimestamp'], reverse=True)

                if len(instance_images) > retain_count:
                    # The slice keeps the newest retain_count images (indexes 0 to retain_count - 1)
                    log_ami_retention(
                        i_name=instance_name,
                        i_id=instance_id,
                        i_amis=instance_images[:retain_count])

                    images_to_delete = instance_images[retain_count:]
                    
                    print('Deregistering',
                            len(images_to_delete),
                            'AMIs for',
                            instance_name,
                            '(' + instance_id + ')')

                    for ami in images_to_delete:
                        try:
                            print(' - Removing AMI:',
                                    ami['ImageId'],
                                    'created',
                                    ami['CreationDate'],
                                    '(' + instance_id + ')')
                            ec2.deregister_image(ImageId=ami['ImageId'])
                            for device in ami['BlockDeviceMappings']:
                                try:
                                    print(' - Removing snapshot:',
                                            device['Ebs']['SnapshotId'],
                                            '(' + ami['ImageId'] + ')')
                                    ec2.delete_snapshot(SnapshotId=device['Ebs']['SnapshotId'])
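                                except (ClientError, KeyError) as err:
                                    # KeyError: the mapping has no EBS snapshot (e.g. an instance store volume)
                                    print('Encountered an error while removing a snapshot of AMI (' + ami['ImageId'] + '):', err)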
                        except Exception as err:
                            print('Encountered an error while deregistering AMI (' + ami['ImageId'] + '):', err)
                else:
                    log_ami_retention(
                        i_name=instance_name,
                        i_id=instance_id,
                        i_amis=instance_images)

        except Exception as err:
            print('Encountered an error while processing instance (' + instance['InstanceId'] + '):', err)

Things of interest

  • retain_count - the number of AMIs to keep for each instance. With one backup per day, this is also the number of days of backups retained.
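
To make the effect of retain_count concrete, here is a minimal sketch of the keep/delete split on made-up data. The function name split_for_rotation and the sample AMI ids are hypothetical. It only assumes that each image carries the CreationTimestamp string parsed from its name:

```python
def split_for_rotation(images, retain_count):
    """Return (kept, deleted): the newest retain_count images and the rest."""
    ordered = sorted(images, key=lambda ami: ami["CreationTimestamp"], reverse=True)
    return ordered[:retain_count], ordered[retain_count:]


# Hypothetical data: one AMI per day for 16 days.
images = [
    {"ImageId": "ami-%02d" % day, "CreationTimestamp": "201810%02d060000" % day}
    for day in range(1, 17)
]

kept, deleted = split_for_rotation(images, retain_count=14)
print(len(kept), len(deleted))
print([ami["ImageId"] for ami in deleted])
```

With 16 daily AMIs and retain_count set to 14, the two oldest images end up in the delete list. If the backup ran hourly instead, the same setting would cover only 14 hours.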

Permissions

Both scripts require write permission to CloudWatch Logs and various EC2 permissions. It is recommended to configure least-privilege policies for each script, so that the backup script can create AMIs but not delete them, and the rotation script can delete AMIs but not create them.
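
As a starting point, the two policies could look roughly like the sketch below. The action lists are my reading of the API calls each script makes and are an assumption, not a tested policy. The make_policy helper is hypothetical, and the resource scopes are deliberately broad and worth tightening for a real account:

```python
import json

# Actions a Lambda function needs to write its own CloudWatch logs.
LOG_ACTIONS = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]


def make_policy(ec2_actions):
    """Build an IAM policy document allowing logging plus the given EC2 actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": LOG_ACTIONS, "Resource": "arn:aws:logs:*:*:*"},
            {"Effect": "Allow", "Action": list(ec2_actions), "Resource": "*"},
        ],
    }


# Backup: may list instances and create images, but not remove anything.
backup_policy = make_policy(["ec2:DescribeInstances", "ec2:CreateImage"])

# Rotation: may list and delete, but not create images.
rotation_policy = make_policy([
    "ec2:DescribeInstances",
    "ec2:DescribeImages",
    "ec2:DeregisterImage",
    "ec2:DeleteSnapshot",
])

print(json.dumps(backup_policy, indent=2))
print(json.dumps(rotation_policy, indent=2))
```

The printed JSON can be pasted into the policy editor of the role attached to each function.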

Triggers

To automate these Lambda functions, attach a CloudWatch Events rule to each. For example, I created a CloudWatch cron event, cron(0 6 * * ? *), which triggers each script every morning at 06:00 UTC.
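
The trigger can also be set up from code instead of the console. The sketch below assumes boto3-style clients for CloudWatch Events and Lambda are passed in. The function name schedule_daily, the rule name, and the statement id are hypothetical choices of mine:

```python
def schedule_daily(events_client, lambda_client, function_name, function_arn,
                   expression="cron(0 6 * * ? *)"):
    """Create a scheduled rule, point it at the function, and allow it to invoke."""
    rule_name = function_name + "-daily"  # hypothetical naming scheme

    # 1. Create (or update) the scheduled rule. Schedule expressions are in UTC.
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=expression,
        State="ENABLED",
    )["RuleArn"]

    # 2. Register the Lambda function as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )

    # 3. Let CloudWatch Events invoke the function.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    return rule_arn


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   schedule_daily(boto3.client("events"), boto3.client("lambda"),
#                  "ec2AmiCreation", "<ARN of the function>")
```

Passing the clients in keeps the function easy to try against stubs before pointing it at a real account. Note that add_permission fails if the same statement id already exists, so the function is meant to be run once per Lambda function.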

Enhancements

In the future I would like to build a notification system around the CloudWatch logs to be more proactive in monitoring backup failures. Along with notifications, I think it would be beneficial to enhance the backup script so that each EC2 instance can define its own schedule through tags. Finally, the rotation script should also read an EC2 tag that sets the number of AMIs to retain.
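
A rough sketch of what the tag-driven part might look like follows. Nothing here exists in the current scripts: the tag names backupFrequency and backupRetainCount, the helper functions, and the idea of passing in the name of the schedule that fired are all assumptions for illustration. The instance dictionaries follow the shape returned by describe_instances:

```python
DEFAULT_RETAIN_COUNT = 14
DEFAULT_FREQUENCY = "daily"


def tag_value(instance, key, default=None):
    """Return the value of an instance tag, or default when the tag is absent."""
    for tag in instance.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return default


def retain_count_for(instance):
    """Read the per-instance retention count, falling back on bad or missing values."""
    raw = tag_value(instance, "backupRetainCount")
    try:
        count = int(raw)
    except (TypeError, ValueError):
        return DEFAULT_RETAIN_COUNT
    return count if count > 0 else DEFAULT_RETAIN_COUNT


def is_backup_due(instance, trigger):
    """trigger names the schedule that fired, e.g. 'hourly', 'daily' or 'weekly'."""
    return tag_value(instance, "backupFrequency", DEFAULT_FREQUENCY) == trigger


# Hypothetical instances: one tagged, one without any tags.
tagged = {"InstanceId": "i-0abc123", "Tags": [
    {"Key": "backupFrequency", "Value": "hourly"},
    {"Key": "backupRetainCount", "Value": "30"},
]}
untagged = {"InstanceId": "i-0def456"}

print(retain_count_for(tagged), is_backup_due(tagged, "hourly"))
print(retain_count_for(untagged), is_backup_due(untagged, "daily"))
```

Falling back to the defaults for missing or malformed tags means untagged instances keep the behavior they have today.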

As always, we are looking forward to hearing your feedback and comments.