How our engineering team improves the on-call experience with Tines

Written by Izabela Kuźniar, Software Engineer, Tines

Published on August 16, 2023

This is the first of a series of posts about ways we use Tines at Tines to simplify our processes. I’m Izabela from the engineering team and will share how we improve the on-call experience with our own product. 

When it comes to on-call, there are differing views. For some, it is an easy and enjoyable task. For others, it is a stressful time on their calendars. At Tines, we have two types of on-call: daytime and out-of-hours. In this post, we explain the daytime on-call shift, its pros and cons, and how we make it easier by using Tines.

What is on-call? 

On-call is when we hold the pager and respond to any incidents. It is also when we perform other responsibilities and tasks like monitoring the alerts, triaging bugs, responding to any queries the rest of the company has for the engineering team, and ensuring we quickly resolve the incidents so that everything continues to run smoothly.

At Tines, we use PagerDuty and Slack to manage our on-call rota. We ensure we have two on-call engineers per day and, during working hours, the full team is available to support if needed. We use PagerDuty's escalation rota to get additional support from engineering managers during off-hours or when no one is available online.

Who is on call? 

Our daytime on-call rota is mandatory for all engineers once they have completed their onboarding milestones.

On-call is a great learning experience, so even engineering managers can opt in to the rotation. However, it's optional for them since they already have a great deal on their plates.

What are the tasks? 

To ensure our on-call team is as efficient as possible, we defined clear tasks and rules to follow when on-call. And, as an automation company, we implemented automation when and where it makes sense for the process. In most cases, we apply Tines to monitor issues and automate tasks. 

For example, we automate handover reminders at the end of the on-call week, including a summary of the shift. Below, you can see our Tines story powering the summary message:

On-call at Tines powered by Tines 

Whether on-call duty is overwhelming to you or not, there are lots of ways you can ease the burden of the job and make it fun using Tines. A few areas our team uses the product include: 

  1. Responding to incidents

  2. Checking Slack notifications

  3. Creating a self-hosted release

  4. Answering queries

  5. Triaging and tracking bugs

Responding to incidents 

The main priority for the on-call team is responding to incidents quickly to ensure minimum impact to the platform and user experience. 

On the communication front, this involves notifying the support team, updating the status page, and, when needed, informing impacted customers. On the engineering side, we check the incident's root cause to determine whether we need to revert changes, roll back a migration, or introduce a new change to resolve the issue. Once resolved, we summarize the incident for internal and external visibility.

This could be highly manual, with many opportunities to make mistakes or miss a step. That's why we use Tines. It automates many communication steps and sets up the relevant communication channels, including a video call, to ensure our internal alignment and transparency. 

Checking Slack notifications 

When you’re on-call, the day starts with checking Slack to see if everything is operating as expected. Some tools have great native Slack integrations, but sometimes we need something more customized.

We use Tines to build and send custom Slack notifications in response to webhooks for many important on-call events, such as Snyk vulnerability reports, new bug reports, and flaky tests. This way, all the notifications are consolidated in one place, Slack, for the on-call person to sort through and work on.
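To make the idea concrete, here is a minimal Python sketch of the kind of transformation such a notification flow performs: it takes an incoming webhook event and builds a Slack message from it. This is not our actual Tines story, which is built in the product rather than in code. The event fields (`source`, `title`, `url`), the emoji mapping, and the function name are all assumptions made up for illustration. The only real convention relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json

# Hypothetical emoji prefixes per event source; illustrative only.
SOURCE_PREFIX = {
    "snyk": ":lock:",
    "bug_report": ":bug:",
    "flaky_test": ":warning:",
}


def build_slack_message(event: dict) -> dict:
    """Turn a webhook event (assumed shape) into a Slack webhook payload."""
    source = event.get("source", "unknown")
    prefix = SOURCE_PREFIX.get(source, ":bell:")
    title = event.get("title", "(no title)")
    text = f"{prefix} [{source}] {title}"
    url = event.get("url")
    if url:
        # Slack's mrkdwn link syntax is <url|label>.
        text += f" <{url}|details>"
    return {"text": text}


if __name__ == "__main__":
    # Made-up example event, not real data.
    example = {"source": "flaky_test", "title": "A test failed intermittently"}
    print(json.dumps(build_slack_message(example)))
```

Keeping the message-building step separate from the sending step makes it easy to give every source the same format, which is what lets the on-call person scan one channel instead of several tools.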

Creating a self-hosted release

We offer cloud and self-hosted deployment models for our customers. The on-call person is responsible for deploying all releases to self-hosted customers. While our cloud releases are continuous, our self-hosted releases roll out every other week. We use GitHub for source control, and releases happen via that platform.

We use a dedicated story for our releases to tell us when to deploy self-hosted updates. It is simple, but highly efficient. There are never concerns about what to do or when to release.
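As a rough illustration of the "when to release" logic, the Python sketch below checks whether a given day falls on an every-other-week cadence. It is only a sketch under stated assumptions: the anchor date is invented, the function names are hypothetical, and our real story may decide this differently.

```python
from datetime import date, timedelta

# Hypothetical anchor: the date of some past self-hosted release (an assumption).
ANCHOR_RELEASE = date(2023, 1, 2)
CADENCE_DAYS = 14  # "every other week"


def is_release_day(today: date, anchor: date = ANCHOR_RELEASE) -> bool:
    """Return True if `today` is a whole number of cadences after the anchor."""
    days = (today - anchor).days
    return days >= 0 and days % CADENCE_DAYS == 0


def next_release(today: date, anchor: date = ANCHOR_RELEASE) -> date:
    """Return the next release date on or after `today`."""
    days = (today - anchor).days
    if days <= 0:
        return anchor
    remainder = days % CADENCE_DAYS
    if remainder == 0:
        return today
    return today + timedelta(days=CADENCE_DAYS - remainder)
```

A reminder flow could run a check like `is_release_day` once a day and notify the on-call person only when it returns true, so nobody has to track the schedule by hand.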

Answering queries 

The on-call team also responds to questions about anything relating to code, infrastructure, feature gating, bugs, etc. Most of these come from our customer and technical success teams surfacing questions directly from our customers. The on-call person prioritizes the questions, finds the problem or answer, and responds or escalates to a more experienced colleague as needed. 

When we see queries repeated or notice a pattern to them, our Ask Engineering app helps the team find their answers quickly and effectively without waiting on an engineer. This saves us a lot of time on answering repetitive questions and helps our customer teams respond faster.

Triaging and tracking bugs