With the self-hosted release of v34.7.0, Tines can send OpenTelemetry (OTEL) traces to observability stacks that support OTEL ingestion. Tines provides documentation on configuring this new feature, so this first article will focus on how to use it. In this series, we will:
Define key OTEL & Observability concepts
Explain why tracing matters in Tines
Walk through an example observability stack in part 2
Design a Grafana dashboard and troubleshoot a problematic story with OTEL data in part 3
Baseline observability concepts and terms
Before we dive into why and how we're designing our observability stack and dashboard, it's important to establish the terminology we'll use throughout the rest of the series. Keep this glossary of observability terms and concepts handy as a reference for the later articles.
What are we trying to solve with OpenTelemetry?
When something in Tines feels slow, fails intermittently, or just behaves differently under load, the default tools may not give you enough information to act on. You can check system metrics, confirm that stories are running, and dig through logs, and still be left with the same questions:
What exactly is slow?
Where is the time actually going?
Is it the story itself, or a single action within it, that's acting up?
Why is OpenTelemetry important?
This is where OpenTelemetry helps. It lets us see how a workflow moves through Tines, rather than just whether it finished. With tracing enabled, you can watch:
How requests and background jobs move through the system
Which parts of a run are really driving latency
How the different spans fit together inside a story
What changes when the system is under heavy load or stories back up
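To make the "spans fit together" idea concrete, here is a minimal sketch of a trace as a tree of timed spans. The span and action names are hypothetical, and this is a simplified model rather than the real OpenTelemetry data structure (actual OTEL spans also carry trace/span IDs, attributes, and events), but it shows how child spans reveal where the time in a story run actually goes:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # Simplified, illustrative span: name plus start/end timestamps
    # is enough to reason about where latency comes from.
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> int:
        # Time spent in this span itself, excluding its child spans.
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# A made-up story run: one root span containing two action spans.
run = Span("story_run", 0, 900, children=[
    Span("action:http_request", 10, 760),   # the real latency driver
    Span("action:transform", 770, 890),
])

for span in [run, *run.children]:
    print(f"{span.name}: total={span.duration_ms}ms self={span.self_time_ms}ms")
# → story_run: total=900ms self=30ms
# → action:http_request: total=750ms self=750ms
# → action:transform: total=120ms self=120ms
```

Reading the output, the root span's total of 900 ms is almost entirely explained by the `http_request` action; a trace view in Grafana or a similar tool presents exactly this breakdown visually.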
Tracing changes conversations from:
“The system feels slow.”
to something more useful, like:
“Queue latency is zero, but a couple of actions in Story 42 are slow. We should fix the story, not the infrastructure.”