Thankful for incidents: embracing chaos to find clarity

Written by Shayon Mukherjee, Staff Software Engineer, Tines

Published on July 19, 2024

In this blog post, Tines software engineer Shayon Mukherjee shares how lessons from a recent incident led to improved platform resilience and more comprehensive testing practices.

It was a typical late June afternoon when we embarked on what seemed like a routine Redis cluster upgrade across approximately 40 customer stacks. The upgrade was essential. It was prompted by a previous outage that highlighted the risks of not running on more robust instance types with better networking support.

This wasn't the first time we had performed such an upgrade, nor will it be the last. But little did we know that this one would soon reveal an issue that had lain dormant and undetected, ready to teach us a valuable lesson.

A few moments into the upgrade process, our monitoring systems flagged the Toolkit API Down alarm. Tines Toolkit is a service powered by Tines response-enabled webhooks, and this alert was the first signal that something was amiss. The webhooks were timing out, affecting crucial customer workflows. The immediate response was swift: our engineers triggered a force deployment to flush out any bad state from our containers, bringing things back online within a few minutes. But this was just the beginning. Now we needed to understand what had actually happened.

How does it all work? 

In our system, response-enabled webhooks play a critical role, especially in how they interact with Redis Pub/Sub to manage real-time data flows. Before we dive further into the bug, let's take a quick minute to understand the system first.

Our application is built using Ruby on Rails, with Puma as our web server and Rack as the web server interface. For response-enabled webhooks, we use a technique known as Socket Hijacking to make async webhooks function like synchronous ones. 
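To make the idea concrete, here is a minimal sketch of full socket hijacking in a plain Rack app. It is an illustration under stated assumptions, not our production code: the class name HijackingWebhookApp, the use of an X-Request-Id header as the lookup key, and the in-memory registry of pending sockets are all hypothetical. The only Rack pieces it relies on are the env keys "rack.hijack?" and "rack.hijack", which servers such as Puma provide.

```ruby
require "socket"

# Hypothetical Rack app: takes over the client socket on request,
# parks it in a registry, and writes the HTTP response later.
class HijackingWebhookApp
  def initialize
    @pending = {}        # request id => hijacked socket (hypothetical registry)
    @lock = Mutex.new    # the registry is shared across server threads
  end

  def call(env)
    unless env["rack.hijack?"]
      return [501, { "content-type" => "text/plain" }, ["hijack not supported"]]
    end

    # Full hijack: the app now owns the raw socket and must write
    # the entire HTTP response itself.
    io = env["rack.hijack"].call

    # Assumption: the caller supplies an id we can key the socket on.
    request_id = env["HTTP_X_REQUEST_ID"]
    @lock.synchronize { @pending[request_id] = io }

    # After a full hijack the server is expected to disregard this
    # return value, so the request thread is free to go back to the pool.
    [200, {}, []]
  end

  # Called later, e.g. once the asynchronous result is available.
  # Returns true if a waiting client was found and answered.
  def respond(request_id, body)
    io = @lock.synchronize { @pending.delete(request_id) }
    return false unless io

    io.write(
      "HTTP/1.1 200 OK\r\n",
      "Content-Type: application/json\r\n",
      "Content-Length: #{body.bytesize}\r\n",
      "Connection: close\r\n\r\n",
      body
    )
    io.close
    true
  end
end
```

In this sketch, call returns almost immediately while the client keeps waiting on the open socket. A later call to respond with the same request id completes the HTTP exchange, which is what lets an asynchronous workflow look synchronous to the caller.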

When a webhook request is received, Rack performs what's known as “Full Socket Hijacking.” This process closes the Rails response object, which frees up the request handling thread (inside Puma) to return to the pool of available workers while maintaining the socket connection to the client. This allows the Puma server to accept new web requests without blo