Operational Excellence Means Automation

People use the term “operational excellence” in a lot of different ways. In its vaguest sense, it means continuous improvement as applied to operations. But you’re interested in what it means in the context of technology operations. And I’m here to tell you that it means automation.

Operational Excellence is one of the five pillars of the AWS Well-Architected Framework. The AWS whitepaper lists six design principles for achieving operational excellence. I’ve paraphrased these principles for clarity. Here they are:

Define everything as code

This is easily the most obvious. Turn everything into code that can be automatically executed by a machine. This includes the building of infrastructure, application deployments, testing, recovery, and anything that requires or benefits from being defined in a runbook. If it’s a repeatable process, code it and let a machine do it.

Documentation as input and output

The delightful side-effect of defining everything as code is that code can serve as documentation. It becomes trivial to have a machine take code as input, execute it, and then generate some pretty documentation based on a template. The resulting documentation can then be used by another machine. All automatically, of course.

Changes should be as small and frequent as possible

Without getting into the rationale behind this, the point is that the only way to make small changes as frequently as possible is to use automation. Pushing a code change to a repo whence it’s automagically built, deployed, and documented is faster than doing any of that manually. Reversing a change automatically is faster, too.

Look for things to automate

If you’re not automating something, and you can, then do it. Of course, you should avoid automating a bad process. Fix the process and automate it. And if there’s nothing to automate right now, keep looking, because changes will inevitably bring opportunities for automation.

Inject failures

Break things to cause failures. If recovering from those failures requires manual intervention, automate the recovery steps.

Tell other people in the organization to automate

The idea is to share what you’ve learned with others. Of course, what you’ve learned is that automation is the key to achieving operational excellence. So just keep it simple and tell them to automate.

But isn’t operational excellence more than just automation?

What operational excellence actually looks like depends on the organization. But no matter how you slice it, you’re closer to operational excellence if you automate than you are if you don’t. So yes, there is more to it than just automation, just as there’s more to driving than going from point A to point B. But if operational excellence is the goal, you need the vehicle to get there, and the only vehicle that will do it is automation.

Using AWS Systems Manager to Upgrade WordPress

After years of manually upgrading my self-hosted WordPress installation, I decided it was finally time to apply some devops principles (namely automation) to this process.

This site runs on an EC2 instance on AWS, so I decided to use AWS Systems Manager (aka SSM). I started out by creating the following Command Document (which happens to be in YAML format because JSON is ugly):

schemaVersion: "2.2"
description: "Download and install WordPress"
mainSteps:
- action: "aws:runShellScript"
  name: "example"
  inputs:
    # The agent runs these lines in order as one shell script
    runCommand:
    - "wget https://wordpress.org/latest.zip"
    - "mv latest.zip /var/www/html"
    - "cd /var/www/html"
    - "service httpd stop"
    - "unzip -o latest.zip"
    - "service httpd start"
    - "rm -f latest.zip"

The Command Document executes the bash commands in the runCommand section. It downloads the latest version of WordPress, stops Apache, unzips the files, restarts Apache, and then cleans up.

SSM uses an agent to carry out the bash commands. My instance runs Amazon Linux which comes with the agent preinstalled, so I didn’t need to install it.

Systems Manager can execute the Command Document at regular intervals to keep up with the typical WordPress release schedule of every 1-2 months. I can also trigger it manually if there’s a security or bugfix release I need.

To avoid catastrophe, I have the Amazon Data Lifecycle Manager for EBS Snapshots take daily snapshots of the instance, just in case something goes terribly wrong with an upgrade.
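
The Command Document above unpacks whatever wget saved without checking it first. As an added example (not the post's own code), the script below verifies the archive and only then stops Apache; the digest URL https://wordpress.org/latest.zip.sha1, the function names and the --upgrade switch are assumptions.

```shell
#!/bin/sh
# Added example: fail-closed WordPress upgrade for an SSM aws:runShellScript step.
# Assumptions (not from the post): DIGEST_URL, function names, --upgrade switch.
ARCHIVE_URL="https://wordpress.org/latest.zip"
DIGEST_URL="https://wordpress.org/latest.zip.sha1"

# verify_sha1 FILE DIGEST_FILE: succeed only if FILE's SHA-1 matches the digest.
verify_sha1() {
  local file="$1" digest_file="$2" expected actual
  # An empty archive or digest file means the download failed.
  [ -s "$file" ] && [ -s "$digest_file" ] || return 1
  # The digest file may be a bare hex string or "hex  filename"; keep field 1.
  expected=$(awk 'NR==1 {print tolower($1)}' "$digest_file")
  # Reject anything that is not exactly 40 hex characters (e.g. an HTML error page).
  printf '%s' "$expected" | grep -Eq '^[0-9a-f]{40}$' || return 1
  actual=$(sha1sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# upgrade_wordpress DOCROOT: download to a private temp dir, verify, then unpack.
upgrade_wordpress() {
  local docroot="$1" work rc=1
  work=$(mktemp -d) || return 1
  if wget -q -O "$work/latest.zip" "$ARCHIVE_URL" &&
     wget -q -O "$work/latest.zip.sha1" "$DIGEST_URL" &&
     verify_sha1 "$work/latest.zip" "$work/latest.zip.sha1"; then
    # Apache is only stopped once the archive is known to be intact.
    service httpd stop
    unzip -o -q "$work/latest.zip" -d "$docroot"
    rc=$?
    service httpd start   # always bring Apache back, even if unzip failed
  else
    echo "download or checksum check failed; current install left untouched" >&2
  fi
  rm -rf "$work"
  return "$rc"
}

# Run only when asked, so the functions can be sourced and tested offline.
if [ "${1:-}" = "--upgrade" ]; then
  upgrade_wordpress /var/www/html
fi
```

Because the digest comes from the same host as the archive, this catches truncated or corrupted downloads rather than a compromised origin. A non-zero exit from the script marks the SSM command invocation as failed, which makes a bad run visible.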