Keeping a Product Alive: On-Call & Maintenance

At Coursera, we've split our engineering into three main teams:

User-facing: the engineers creating the interfaces that our students and admins use, which often includes an API layer. ⬅ This is the team I work on!
Infrastructure: the engineers working on our servers, core service layer, monitoring, logging, testing, and deploy processes.
Analytics: the engineers creating tools for data analytics, designing A/B experiments, and writing up their findings for internal learning and research publications.

Product Priorities

As the proud owners of a product with users, we have two main priorities in life (work):

Keep the product alive: make sure it's up and working as promised.
Grow the product: keep innovating and adding the features that we think will bring it to the next level.

Sometimes we think of "keeping the product alive" as the domain of the infrastructure team, the engineers who are on-call and ready when server issues arise. But in fact, we face many little issues that are the realm of the user-facing team: issues that arise when users use our product in an unexpected way, or that surface in code that's not as well tested. When one of those issues falls in our lap, we have to context switch, decide how important it is, put aside what we're currently working on, and re-adjust our schedule after the unexpected distraction. It's easy to get frustrated as an engineer when you feel like you're spending your time just keeping the product alive and not making progress towards growing the product.

So it's up to us to figure out how we balance those two priorities. What percentage of our week do we spend on bugs vs. features? How do our users communicate what the most important things are to work on right now? How do we make sure engineers don't spend all their time wading through maintenance requests?

After asking ourselves these questions many times over the last few months, we think we've come up with an interesting approach to striking that balance while keeping both our users and our engineers happy.

Product Emergency On-Call

We already used PagerDuty for our infrastructure on-call, and now we've added another on-call line-up to it for our "product emergency on-call". We have a primary ("Captain On-Call") and a secondary ("First Mate Firefighter"), and the schedule rotates every Wednesday, right before our weekly sprint planning meetups. We make sure everybody knows who's on-call by plastering their photos on the wall next to the lunch line:

Captain On-Call and First Mate Firefighter
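
PagerDuty handles the actual schedule for us, but the rotation rule itself is simple enough to sketch. Here's a minimal Python version of "rotate every Wednesday with a primary and a secondary". The roster names, the start date, and the choice to make next week's primary this week's secondary are all assumptions for illustration, not a description of our real setup.

```python
from datetime import date

# Hypothetical roster; the names and ordering are illustrative only.
ROSTER = ["alice", "bob", "carol", "dave"]

# Assumed first handoff date; 2013-03-06 was a Wednesday.
ROTATION_START = date(2013, 3, 6)


def on_call_pair(day, roster=ROSTER, start=ROTATION_START):
    """Return (primary, secondary) for the on-call week containing `day`."""
    if day < start:
        raise ValueError("day is before the rotation started")
    # Each on-call week runs Wednesday through the following Tuesday.
    weeks_elapsed = (day - start).days // 7
    primary = roster[weeks_elapsed % len(roster)]
    # Assumption: next week's primary serves as this week's secondary.
    secondary = roster[(weeks_elapsed + 1) % len(roster)]
    return primary, secondary
```

Calling on_call_pair(date(2013, 3, 13)) with this roster gives the second week's pair, since that date is exactly one Wednesday after the assumed start.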

When our Course Ops team finds out about an urgent issue, they send it to a designated email address that both creates a Jira ticket and alerts PagerDuty, setting off our page. Captain On-Call does what they can to resolve the issue, calling on the expertise of the secondary or beyond if it's outside of their knowledge, and often sends out a post-mortem afterwards. We send the post-mortem both so that our colleagues know how to tackle the issue if they encounter it during their duty, and so that everyone knows about any underlying flaws in our platform that, had they been fixed, might have prevented it from happening entirely.
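
The "one email, two effects" flow can be sketched as a small fan-out function. This is only a rough illustration in Python: create_ticket and trigger_page are hypothetical callables standing in for whatever Jira and PagerDuty integrations are in place, and UrgentReport is a made-up type, not part of either product's API.

```python
from dataclasses import dataclass


@dataclass
class UrgentReport:
    """A hypothetical parsed email sent to the designated address."""
    sender: str
    subject: str
    body: str


def handle_urgent_report(report, create_ticket, trigger_page):
    """Fan one report out to the tracker and the pager; return the ticket key.

    `create_ticket(subject, body)` must return a ticket key, and
    `trigger_page(summary)` must alert whoever is on-call. Both are
    injected so the routing logic stays independent of any real service.
    """
    ticket_key = create_ticket(report.subject, report.body)
    # Put the ticket key in the page so the on-call can find the details.
    trigger_page(f"[{ticket_key}] {report.subject}")
    return ticket_key
```

Creating the ticket first means the page can always point at a written record, which also gives the later post-mortem a place to start from.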

Product Maintenance Duty

But we don't want to just respond to urgent drop-everything requests; we also want to address the many little (yet important) requests that accumulate. So, Captain On-Call also becomes a product maintenance engineer for the week. We have a Jira filter of issues tagged with product-maintenance-duty, with relevant priorities and deadlines, and Captain On-Call works on those while they're not tackling pages. We know during our sprint planning meeting that they'll be fully dedicated to maintenance that week, so we put nothing else on their plate. Hopefully, by the end of their week on-call, they'll have closed 5-20 of our Jira tickets, learnt about more parts of our codebase than they knew before, and increased the number of tests in our codebase.
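
To make "a filter with priorities and deadlines" concrete, here's a small Python sketch of how a maintenance queue could be ordered. The Ticket type, the priority names, and the ordering rule (priority first, then earliest deadline, with undated tickets last) are assumptions for illustration; the real filter lives in Jira and may sort differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Assumed priority names; lower rank means "work on this sooner".
PRIORITY_RANK = {"blocker": 0, "major": 1, "minor": 2}


@dataclass
class Ticket:
    """A hypothetical, simplified view of an issue."""
    key: str
    priority: str
    deadline: Optional[date]
    labels: Tuple[str, ...]


def maintenance_queue(tickets, label="product-maintenance-duty"):
    """Return tagged tickets ordered by priority, then by deadline."""
    tagged = [t for t in tickets if label in t.labels]
    return sorted(
        tagged,
        key=lambda t: (
            PRIORITY_RANK[t.priority],
            t.deadline is None,       # tickets without a deadline go last
            t.deadline or date.max,   # otherwise earliest deadline first
        ),
    )
```

With an ordering like this, whoever is on duty can simply start at the top of the list between pages.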

We're only on our second cycle of this approach, but so far, it's been pretty successful. It is a bit hard to go from working full-time on a project to working full-time on maintenance, but it's also refreshing. I'd love to hear in the comments about the approach you take wherever you work.

pamela fox's blog

Wednesday, March 13, 2013