Test Driven Operations

Test Driven Development has become reasonably ubiquitous in the software world, even in early stage startups. Skipping it is a big ingredient in the kind of technical debt that will crush a startup. But development isn't the only place where "test driven" can be used. In fact, I'll argue that if you're following Lean principles, everything you do should be based on tests. (I'll write more about this sometime soon.)

Today, I'm going to apply the "test driven" philosophy to the Operations part of DevOps. It's an approach that we've been using more lately at Startup Services and it's been working pretty well. The process is very much like TDD, but with some extra documentation thrown in along the way. Let's walk through an example of how we set up a basic website health monitor.

At this point, there is no website yet. These steps work just as well on an existing site, as long as each test fails when it is first created. The key is to collect all the critical information about the conditions that are necessary for the test to pass, so that when there's a failure at a future time, the responding person has all the information at their fingertips.

We aggregate all our alarms into VictorOps. I've been a big fan of their service for years now. (As much as I love them, we also push alerts directly to Slack, because: redundancy.) VictorOps has this cool thing called the Transmogrifier which, among many other things, will take the node name of an alert and link it to a GitHub Wiki page. This page is where we'll keep our notes about the system.

One of the several tools we use for monitoring is NodePing. NodePing will check all kinds of things (including ping, of course). For our purposes, we want to ensure that our new website is working properly. To do this, I know I'm going to create a dedicated page called /healthcheck, and it will include a randomized string of content. NodePing is going to check the page each minute to ensure that this string is present on the page. (This isn't the only test, of course, but it's a nice low-level test that will fail for all kinds of good reasons: server down, DNS problem, DDoS attack, etc.)
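Here is a minimal sketch of how the randomized string and the page that carries it could be produced. The function names and the bare-bones HTML are my own assumptions for illustration; any platform that can publish a static page would do.

```python
import secrets


def new_health_token(nbytes: int = 16) -> str:
    """Return a random hex string (2 hex characters per byte).

    A random value is presumably very unlikely to show up by accident on an
    error page or a parked-domain page, so finding it means the real page loaded.
    """
    return secrets.token_hex(nbytes)


def render_healthcheck(token: str) -> str:
    """Return a tiny HTML document whose only real content is the token."""
    return (
        "<!doctype html>\n"
        "<html><head><title>healthcheck</title></head>\n"
        f"<body><p>{token}</p></body></html>\n"
    )
```

The token is generated once, published in the page, and pasted into the monitor's "contains" field. It should not change between requests, or the check would never pass.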

First, we create the monitor: a content test that will poll the /healthcheck page each minute and alarm if the "contains" text is missing.
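NodePing does the polling for us, so the following is only a sketch of what a "contains" check amounts to, not NodePing's implementation. The names `fetch_body` and `content_check`, and the injectable `fetch` parameter, are assumptions made for illustration.

```python
from typing import Callable
from urllib.request import urlopen


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its body as text; raises on network or HTTP errors."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def content_check(
    url: str, expected: str, fetch: Callable[[str], str] = fetch_body
) -> bool:
    """Return True only if the page loads and its body contains `expected`."""
    try:
        body = fetch(url)
    except (OSError, ValueError):
        # urllib's URLError and HTTPError are OSError subclasses, as are
        # timeouts; ValueError covers a malformed URL. All count as a failure.
        return False
    return expected in body
```

Because any failure along the way returns False, a down server, a DNS problem, and a page that loads without the string all look the same to the monitor: the check is red.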

We haven't launched the site yet, so of course this test will fail. When it does, it will push an alarm to VictorOps, and we'll move on to our next step. (If we don't get an alarm, that means our monitoring tool isn't configured correctly, and that's exactly the kind of information that we crave in this approach.)
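The "fail first" rule can be written down as a small decision table. This is just a sketch of the reasoning above; the `Verdict` names and the function are hypothetical and not part of any tool we use.

```python
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"                # red check, alarm arrived: build the site
    FIX_CHECK = "fix_check"            # check is green before the site exists
    FIX_MONITORING = "fix_monitoring"  # check is red but nobody was told


def red_first_verdict(check_passed: bool, alarm_received: bool) -> Verdict:
    """Decide the next step right after creating a monitor for a site that isn't live."""
    if check_passed:
        # A test that passes before the thing it tests exists is testing
        # something else (wrong URL, wrong string, cached page, ...).
        return Verdict.FIX_CHECK
    if not alarm_received:
        # The check failed as it should, but the alert pipeline is broken.
        return Verdict.FIX_MONITORING
    return Verdict.PROCEED
```

Only the last case lets us move on. The other two are failures of the monitoring itself, which is exactly what this step is meant to surface.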

Lucky for us, the alarm fired as expected and we now have an incident open in VictorOps. The Transmogrifier has Transmogrified the entity_id into a nice clickable link that takes us to a GitHub Wiki page. On this page, we'll collect all the information about the site that will be useful when this alarm fires for real.

On this page, I'll make notes about this particular alarm, but I'll also create links to a wiki page for all the alarms related to the website [[cbdds.website]], and for the client herself: [[cbdds]].

As we build, we'll document everything related to the process at the appropriate level. (The domain info, DNS and CloudFlare settings at the top level, the webhost and platform information at the next level, and then specifics about this test at the lowest level.)
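If alarm names are dotted, as the page names [[cbdds]] and [[cbdds.website]] suggest, the levels can be derived mechanically. This is a sketch under that assumption; the entity_id "cbdds.website.healthcheck" used below is a made-up example, and the function names are mine.

```python
from typing import List


def wiki_pages(entity_id: str) -> List[str]:
    """Return the wiki page names for every level, from top level down to the alarm."""
    parts = entity_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def parent_links(entity_id: str) -> List[str]:
    """Return [[wiki links]] to every level above the alarm's own page."""
    return [f"[[{page}]]" for page in wiki_pages(entity_id)[:-1]]


# Hypothetical entity_id, for illustration only.
print(wiki_pages("cbdds.website.healthcheck"))
# ['cbdds', 'cbdds.website', 'cbdds.website.healthcheck']
print(parent_links("cbdds.website.healthcheck"))
# ['[[cbdds]]', '[[cbdds.website]]']
```

The links returned by `parent_links` are the ones that go at the top of the alarm's own page, so a responder can climb from the specific test to the website and then to the client.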

Then, it's "operations" as usual. We register the domain, launch the hosting platform, and build our healthcheck page with its custom string. As soon as it's published, like magic, the healthcheck starts passing.

Now our site is operating as expected, but I'm confident that should there be a problem, we'll know right away and have all the information we need to troubleshoot.