This is the first of a series of posts about ways we use Tines at Tines to simplify our processes. I’m Izabela from the engineering team and will share how we improve the on-call experience with our own product.
When it comes to on-call, there are differing views. For some, it is an easy and enjoyable task. For others, it is a stressful time on the calendar. At Tines, we have two types of on-call: daytime and out-of-hours. In this post, we explain the daytime on-call shift, its pros and cons, and how we make it easier by using Tines.
On-call is when we hold the pager and respond to any incidents. It is also when we perform other responsibilities and tasks like monitoring the alerts, triaging bugs, responding to any queries the rest of the company has for the engineering team, and ensuring we quickly resolve the incidents so that everything continues to run smoothly.
At Tines, we use PagerDuty and Slack to manage our on-call rota. We ensure we have two on-call engineers per day and, during working hours, the full team is available to support if needed. We use PagerDuty's escalation rota to get additional support from engineering managers during off-hours or when no one is available online.
Our daytime on-call rota is mandatory for all engineers once they have completed their onboarding milestones.
On-call is a great learning experience, so engineering managers can opt in to the rotation too. For them it is optional, since they already have a great deal on their plates.
To ensure our on-call team is as efficient as possible, we defined clear tasks and rules to follow when on-call. And, as an automation company, we implemented automation when and where it makes sense for the process. In most cases, we apply Tines to monitor issues and automate tasks.
For example, we automate handover reminders at the end of each on-call week and include a summary of the shift. Below, you can see our Tines story powering the summary message:
Whether on-call duty is overwhelming to you or not, there are lots of ways you can ease the burden of the job and make it fun using Tines. A few areas our team uses the product include:
Responding to incidents
Checking Slack notifications
Creating a self-hosted release
Triaging and tracking bugs
The main priority for the on-call team is responding to incidents quickly to ensure minimum impact to the platform and user experience.
On the communication front, this involves notifying the support team, updating the status page, and, when needed, informing impacted customers. On the engineering side, we check the incident's root cause to determine whether we need to revert changes, roll back a migration, or introduce a new change to resolve the issue. Once resolved, we summarize the incident for internal and external visibility.
This could be highly manual, with many opportunities to make mistakes or miss a step. That's why we use Tines. It automates many communication steps and sets up the relevant communication channels, including a video call, to ensure our internal alignment and transparency.
When you’re on-call, the day starts with checking Slack to see if everything is operating as expected. Some tools have great native Slack integrations, but sometimes we need something more customized.
We use Tines to build and send custom Slack notifications in response to webhooks for lots of important on-call events: Snyk vulnerability reports, new bug reports, flaky tests, and many more. This way, all the notifications are consolidated in one place, Slack, for the on-call person to sort through and work on.
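To make the idea concrete outside of Tines, here is a minimal Python sketch of turning a webhook event into a Slack message. It is not our Tines story: the event fields (`source`, `title`, `url`) and the `SOURCE_LABELS` mapping are assumptions made for illustration. The only real convention it relies on is that Slack incoming webhooks accept a JSON body with a `text` field and render `<url|label>` as a link.

```python
# Hypothetical mapping from event source to a human-readable label.
# The real set of sources and their wording is an assumption here.
SOURCE_LABELS = {
    "snyk": "Snyk vulnerability report",
    "bug": "New bug report",
    "flaky_test": "Flaky test",
}


def build_slack_message(event: dict) -> dict:
    """Build a Slack incoming-webhook payload from a webhook event.

    `event` is an assumed shape: {"source": ..., "title": ..., "url": ...}.
    """
    label = SOURCE_LABELS.get(event.get("source"), "On-call event")
    title = event.get("title", "(no title)")
    text = f"*{label}*: {title}"
    if event.get("url"):
        # Slack renders <url|label> as a clickable link.
        text += f"\n<{event['url']}|View details>"
    return {"text": text}
```

Every source goes through the same function, so the on-call engineer sees one consistent message format in a single channel regardless of which tool raised the event.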
We offer cloud and self-hosted deployment models for our customers. The on-call person is responsible for deploying all releases to self-hosted customers. While our cloud releases are continuous, our self-hosted releases roll out every other week. We use GitHub for source control, and releases happen through the platform.
We use a dedicated story for our releases to tell us when to deploy self-hosted updates. It is simple but highly efficient. There are never concerns about what to do or when to release.
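The "every other week" cadence is easy to reason about in code. The sketch below is a hedged illustration, not our release story: `ANCHOR_RELEASE` is a made-up date standing in for any past self-hosted release, and the 14-day interval is simply how "every other week" is modelled here.

```python
from datetime import date, timedelta

# Assumption: any past release date works as the anchor; this one is made up.
ANCHOR_RELEASE = date(2023, 1, 2)
CADENCE_DAYS = 14  # "every other week"


def is_release_day(today: date) -> bool:
    """Return True if `today` falls on the two-week release cadence."""
    days_since = (today - ANCHOR_RELEASE).days
    return days_since >= 0 and days_since % CADENCE_DAYS == 0


def next_release_day(today: date) -> date:
    """Return the next release date on or after `today`."""
    days_since = (today - ANCHOR_RELEASE).days
    if days_since <= 0:
        return ANCHOR_RELEASE
    remainder = days_since % CADENCE_DAYS
    if remainder == 0:
        return today
    return today + timedelta(days=CADENCE_DAYS - remainder)
```

A scheduled job could call `is_release_day` each morning and only post a reminder when it returns `True`, which keeps the decision about when to release out of anyone's head.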
The on-call team also responds to questions about anything relating to code, infrastructure, feature gating, bugs, etc. Most of these come from our customer and technical success teams surfacing questions directly from our customers. The on-call person prioritizes the questions, finds the problem or answer, and responds or escalates to a more experienced colleague as needed.
When we see queries repeated or notice a pattern to them, our Ask Engineering app helps the team find their answers quickly and effectively without waiting on an engineer. This saves us a lot of time on answering repetitive questions and helps our customer teams respond faster.
Bugs are a natural part of the software development lifecycle. The on-call engineer is responsible for triaging, prioritizing, adding context, and assigning the team owning the resolution. This context helps the owning team resolve the bug quickly.
In June, we launched an amazing feature called cases, which we use to track all of our bugs. Combining cases with pages in Tines let us make our job easier, especially when we’re on call. Our page captures the bug information into a case, which is assigned to the on-call engineer, who receives a notification in Slack with a link to the case. From there, they triage, prioritize, and assign the case. We use tags to associate the case with the owning team.
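The flow described above (form submission, case creation, assignment, Slack notification, triage with tags) can be sketched as plain Python. This is only a model of the steps, not the Tines cases or pages API: `BugCase`, `open_case`, `triage`, `slack_notification`, and the `team:` tag format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BugCase:
    """Hypothetical stand-in for a case that tracks one bug."""
    title: str
    description: str
    assignee: str
    priority: str = "untriaged"
    tags: list[str] = field(default_factory=list)


def open_case(form: dict, on_call: str) -> BugCase:
    """Create a case from a submitted bug form and assign it to on-call."""
    return BugCase(
        title=form["title"].strip(),
        description=form.get("description", "").strip(),
        assignee=on_call,
    )


def slack_notification(case: BugCase, case_url: str) -> str:
    """Build the notification text sent to the on-call engineer."""
    return f"New bug assigned to {case.assignee}: {case.title} <{case_url}|Open case>"


def triage(case: BugCase, priority: str, team: str) -> BugCase:
    """Set the priority and tag the case with the owning team."""
    case.priority = priority
    tag = f"team:{team}"  # assumed tag format
    if tag not in case.tags:
        case.tags.append(tag)
    return case
```

Keeping intake, notification, and triage as separate steps mirrors the handoff in the text: the reporter only fills in the form, and everything about priority and ownership is decided by the on-call engineer afterwards.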
Since the feature is brand new, we’ve used our own experience with it to improve the usability for us and our customers.
On-call does not have to be overwhelming. Assigning clear task ownership and on-call responsibilities makes it much easier. Better still, introducing automation where you can improves communication, transparency, and efficiency for the on-call team. That is why, at Tines, we use Tines to make on-call go as smoothly as possible.