Our CTO Sleeps Well At Night

July 30, 2018

[devops] [scaling]

I am PacerPro’s chief technical officer, which means that I get the first phone call at 3 am when things go pear-shaped. Fortunately, I don’t get many of those calls: I’ve built a development team and process that delivers reliability as a core feature.

No one gives a damn about where the electrons come from until the lights go out.

The number one core value of PacerPro, from the very beginning, has always been rock-solid reliability. “You only get your customer’s trust once,” says Gavin McGrane, CEO, “and if you lose that, you’ll never get it back.” So we’ve built our product with reliability as a primary design constraint. That’s a very, very tough requirement. Most startups (as we were, way back in 2011) write their code with duct tape and bubble gum. We took the opposite approach, writing our code using a “Test Driven Development” (TDD) methodology. (We write the tests first, even before we write the first line of production code.) We run a lot of tests, thousands of them. We run our test suites dozens of times a day, sometimes hundreds. We only ship code to production when all of our tests pass with flying colors.

I’ve been working in the software development industry since 1985. Back then, in the bad old days, production deploys were fraught with worry and prayer. Even so, I had a reputation for shipping bug-free code–annoyingly so, since sometimes that required delaying the software’s release until the code was ready to go.

Stay focused on product & core competencies.

We’re a lean company. Infrastructure is not our product; I don’t want to spend precious headcount budget on operations. These days, with everything in the cloud, PacerPro gets to leverage the expertise of some of the largest IT corporations in the world: Amazon, Salesforce, Google. As CTO, I find “best in class” providers to deliver computing infrastructure for our product. I don’t have to pay for a DevOps engineer or an expert DBA. Instead, I get my provider’s expertise in their core competencies. So, for a fraction of the cost of a single DBA, for example, I have scalable databases with redundancy, hot backups, and immediate roll forwards. All ACID, encrypted at rest, and privacy compliant.

You’re not paranoid if the world is out to get you.

There’s an interesting side-effect to building cloud-based applications: You have to engineer with the expectation that any of your service dependencies can and will fail at any moment. This is where an always-testing culture comes in handy. Part of our production test suite has what is sometimes called “enemy testing,” where we simulate that something “bad” has happened. We don’t have to guess whether our application runs or not in the face of failures, because we have already empirically proved that it will. One of our web servers dies? No problem, we have 3 more running. Someone mentions us on Hacker News? Our automated scaling system spins up more servers to handle the extra load. One of the PACER sites goes down? We’ll trip a “circuit breaker” in software to prevent jobs from stacking up on the dead resource until it comes back online again.
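To make the “circuit breaker” idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not PacerPro’s actual code: the class name, the failure threshold, and the cool-down period are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    """Stops calling a failing resource until a cool-down has passed.

    max_failures and reset_after are illustrative defaults, not values
    taken from the article.
    """

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: refuse immediately so work does not
                # stack up on the dead resource.
                raise CircuitOpenError("resource marked down; skipping call")
            # Cool-down elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open) the breaker once the threshold is reached
            # or when a trial call fails.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A job runner would wrap each call to the flaky resource in `breaker.call(...)` and treat `CircuitOpenError` as “try again later” rather than as a new failure.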

Commodity services are just another API

PacerPro is remarkably flexible when it comes to service providers. There’s virtually no lock-in anywhere in our toolchain. As part of our relentless obsession with reliability, our software is built to talk to services, not service providers. For example, we have a mail service class. Inside the service code, we keep pluggable references to our providers, including all authentication and configuration information, so if (and when) we need to switch, we can change a configuration variable in our production environment and route around the failure. PacerPro generates over 50,000 individualized custom emails every single business day; we can’t wait for a downstream outage to “resolve itself.”
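The mail service pattern might look something like the sketch below. Everything named here is an assumption made for illustration: the `MAIL_PROVIDER` variable, the provider keys, and the stand-in provider class are hypothetical, and a real provider would wrap a vendor’s client along with its credentials.

```python
import os


class RecordingProvider:
    """Stand-in provider that records messages instead of sending them."""

    def __init__(self, name):
        self.name = name
        self.sent = []

    def deliver(self, to, subject, body):
        self.sent.append((to, subject, body))
        return self.name


class MailService:
    """Callers talk to this service; they never name a provider directly."""

    def __init__(self, providers, env=os.environ, default="primary"):
        self.providers = providers  # maps a config value to a provider object
        self.env = env
        self.default = default

    def send(self, to, subject, body):
        # Look up the setting on every send, so changing the (hypothetical)
        # MAIL_PROVIDER variable switches providers without a code change.
        key = self.env.get("MAIL_PROVIDER", self.default)
        return self.providers[key].deliver(to, subject, body)
```

Because application code only ever calls `MailService.send`, swapping vendors is a configuration change rather than a refactor.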

Privacy by design

There’s much chatter these days about security breaches, which is sad because, while a challenging engineering problem, security is by no means unsolvable, or even unsolved. The trouble starts with senior managers who don’t know anything about security. (And why should they, after all?) They don’t budget for it, and their engineering staff does not or cannot communicate how mission-critical security is. So security becomes an afterthought. PacerPro does several things to correct this:

- Security design is part of every story (usually it is a non-issue, but we do check it).
- We try not to store any sensitive data, period. You’d be surprised how much you don’t need.
- We don’t trust ourselves to do encryption “right,” so we, again, offload it to the experts. For example, we use Stripe.com to handle all of our credit card transactions, so we never see a credit card number.
- We encrypt the data that we must store.
- We run regular scans for code vulnerabilities and patch what they find as “Level-1” bugs (usually released the same day).

If asked whether all this extra work is vital, my reply is always, “Do you want our company logo to appear next to the headline, ‘Hacked’?”

Building a culture of “writing things down.”

No one is irreplaceable. (Not even the CTO.) Sometimes we joke about “bus counts,” which is the number of people that have to be run over by a bus before the company can no longer operate. A bus count of “Ken” is unacceptable: PacerPro has to be able to keep running, even if, goodness forbid, something were to happen to me. So, we write down procedures in a company “runbook.” We automate the heck out of anything that we do more than 3 times. (3 is a magic number.) We have a company chat channel where we can call an “all hands on deck” emergency, and because we are a distributed company, we’ve got bi-coastal coverage. Our engineering culture is “full-stack”; everyone has the skills to work with any part of the product. Some of us are more expert on some things than others, so we use pair programming to “level-up” our team knowledge. (Pair programming is an “Extreme Programming” technique where two engineers work on a single story at the same time. I sometimes describe it as “Two brains, one keyboard.”) Thus we increase reliability. Pair programming may have a bum rap because you are paying two developers to deliver one feature, but empirical studies show that the code quality is much higher; enough so that the overall code cost (in time and dollars) is about a wash. But you get to higher-quality, more reliable code sooner, for the win.

If you don’t measure it, it doesn’t exist.

Another core part of our development process is a priori support for metrics. We measure everything, from email delivery latency to database transactions per second. Sometimes we’ll only care about a metric for a short while, such as when we want to optimize a critical block of code. (Most of the code that we write is optimized for developer productivity, by the way. Between our skills and Moore’s Law, 85% of our code base is fast enough, as is.) That way, we only have to focus on a few critical points in the software, instead of guessing. Other metrics go onto the production dashboard that we distribute internally. We have scripts that monitor for out-of-bounds values and then send notices to the engineering and DevOps Slack channels. We used to have a PagerDuty account, but we found that to be too heavyweight for us. Since we already listen to Slack, alerting there creates less friction in our regular work process.
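A monitoring script of the kind described above can be very small. The sketch below is a hypothetical illustration: the metric names, the bounds, and the function name are assumptions, and the step that actually posts to a chat channel is left out so the example stays self-contained.

```python
def out_of_bounds(readings, bounds):
    """Return one alert message per metric that falls outside its bounds.

    readings maps a metric name to its latest value; bounds maps a metric
    name to an inclusive (low, high) pair. Metrics without bounds are
    skipped rather than treated as errors.
    """
    alerts = []
    for name, value in readings.items():
        if name not in bounds:
            continue
        low, high = bounds[name]
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


# Hypothetical readings and thresholds, for illustration only.
readings = {"email_latency_s": 42.0, "db_tps": 150}
bounds = {"email_latency_s": (0, 30), "db_tps": (10, 1000)}

for message in out_of_bounds(readings, bounds):
    # A real script would send this to the team's chat channel instead.
    print(message)
```

Keeping the check separate from the notification makes it easy to change where alerts go without touching the thresholds.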

No. You can’t have a pony.

Gavin can attest to the fact that I’m a certified pain-in-the-ass when it comes to building new features. Before we start designing, I ask sales, product, and the C-levels many questions. Some of my favorites: “Would this be stupid for us not to do, in the next 90 days?” “Based on the feature, it is going to cost thousands of dollars to build, test, verify, deliver and maintain. What’s our ROI over the next year?” “What’s our opportunity cost if we build this feature instead of that one?” That keeps us focused. 90 days may seem pretty short, but our world changes so quickly that 90 days from now, everything that we thought we knew about our business will either change or be called into question. Long-term planning in this context is a fool’s game. That’s not to say that we don’t have longer-range plans, but they are not specific implementation plans.