---
title: Scaling Tines
url: https://www.tines.com/docs/self-hosted/reference-architecture/sizing-and-scaling/scaling-tines/
updated: 2026-03-18T10:46:13+00:00
---

# Scaling Tines

The [deployment tier guides](https://www.tines.com/docs/deployment-tiers/) are a starting point, not a ceiling. As your workload grows, use OpenTelemetry (OTEL) tracing to base scaling decisions on actual usage rather than estimates. Tines can export OTEL traces from both tines-app and tines-sidekiq containers; see [Exporting OpenTelemetry Traces](/docs/self-hosted/configuring-tines/opentelemetry-traces/) for setup.

## Enabling Observability

Set the following environment variables on **both** tines-app and tines-sidekiq containers:

| Variable | Value | Purpose |
| --- | --- | --- |
| `OTEL_ENABLED` | `true` | Enable trace export from Tines |
| `OTEL_AUTO_INSTRUMENTATION` | `true` | Enable auto-instrumented tracing (web requests, GraphQL, Sidekiq jobs, Postgres, Redis, external HTTP) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Your collector endpoint | e.g., `http://otel-collector:4317` (gRPC) or `http://otel-collector:4318` (HTTP) |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc`, `http/protobuf`, or `http/json` | Default: `grpc` |
| `OTEL_SERVICE_NAME` | Your identifier | Defaults to “Tines” if unset |

> **NOTE:** Auto instrumentation generates significantly more telemetry data. Sampling is recommended for production — see the OTEL Collector filter processor configuration in the [Tines docs](/docs/self-hosted/configuring-tines/opentelemetry-traces/) for recommended sampling rules.
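
As a rough illustration, the variables in the table can be generated once and applied to both containers so the two never drift apart. The sketch below is only a convenience helper: `build_otel_env` and the service name `tines-prod` are hypothetical, and how the resulting values reach your containers (Compose file, Helm values, task definition) depends on your deployment.

```python
# Protocol values taken from the table above.
VALID_PROTOCOLS = {"grpc", "http/protobuf", "http/json"}


def build_otel_env(endpoint, protocol="grpc", service_name=None,
                   auto_instrumentation=True):
    """Return the OTEL variables from the table as a dict of strings.

    Hypothetical helper: it only assembles values, it does not talk to Tines.
    """
    if protocol not in VALID_PROTOCOLS:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    env = {
        "OTEL_ENABLED": "true",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }
    if auto_instrumentation:
        # Assumption: the variable is simply left unset when not wanted.
        env["OTEL_AUTO_INSTRUMENTATION"] = "true"
    if service_name is not None:
        env["OTEL_SERVICE_NAME"] = service_name
    return env


# Apply the same output to both tines-app and tines-sidekiq.
env = build_otel_env("http://otel-collector:4317", service_name="tines-prod")
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The printed `NAME=value` lines can be pasted into an env file shared by both containers.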

## Key Metrics for Scaling Decisions

Before adjusting pod counts, tune `SIDEKIQ_CONCURRENCY` to fit the per-container CPU and memory limits. Once concurrency is tuned, use the following metrics to decide when to add or remove tines-sidekiq workers.

### When to Scale Up

| Metric | Condition | Action |
| --- | --- | --- |
| `action_run.default_queue_latency` | \> 1 second, CPU is low | Increase `SIDEKIQ_CONCURRENCY` |
| `action_run.default_queue_latency` | \> 1 second, CPU is high | Add more tines-sidekiq pods/containers |
| `percentage_of_workers_available` | < 20% | Add more tines-sidekiq pods/containers |

-   **`action_run.default_queue_latency`** measures how long jobs wait in the default Sidekiq queue before being picked up. A value of 0 means no jobs are waiting; sustained values above 1 second indicate the workers cannot keep up with the ingest rate.
-   **`percentage_of_workers_available`** tracks worker availability over time (requires auto instrumentation).
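
The scale-up table can be read as a small decision function. The sketch below is one possible encoding, not a Tines API: `scale_up_actions` is a hypothetical name, and because the table does not define what counts as "high" CPU, the caller passes that judgment in as `cpu_is_high`.

```python
LATENCY_THRESHOLD_S = 1.0          # table: default queue latency > 1 second
WORKERS_AVAILABLE_MIN_PCT = 20.0   # table: workers available < 20%

ADD_PODS = "add tines-sidekiq pods/containers"
RAISE_CONCURRENCY = "increase SIDEKIQ_CONCURRENCY"


def scale_up_actions(queue_latency_s, cpu_is_high, workers_available_pct=None):
    """Return every scale-up action from the table that currently applies.

    workers_available_pct may be None when auto instrumentation is off
    and the metric is therefore not available.
    """
    actions = []
    if queue_latency_s > LATENCY_THRESHOLD_S:
        # High latency: CPU headroom decides between concurrency and pods.
        actions.append(ADD_PODS if cpu_is_high else RAISE_CONCURRENCY)
    if (workers_available_pct is not None
            and workers_available_pct < WORKERS_AVAILABLE_MIN_PCT
            and ADD_PODS not in actions):
        actions.append(ADD_PODS)
    return actions


print(scale_up_actions(2.5, cpu_is_high=False))
# ['increase SIDEKIQ_CONCURRENCY']
print(scale_up_actions(2.5, cpu_is_high=True, workers_available_pct=10.0))
# ['add tines-sidekiq pods/containers']
print(scale_up_actions(0.0, cpu_is_high=False, workers_available_pct=60.0))
# []
```

Returning all matching actions, rather than picking one, avoids inventing a precedence between rows that the table itself does not state.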

### When to Scale Down

| Metric | Condition | Action |
| --- | --- | --- |
| `action_run.default_queue_latency` | < 1 second (sustained) | Remove one tines-sidekiq pod/container |
| `percentage_of_workers_available` | \> 50% (sustained) | Remove tines-sidekiq pods/containers |
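
Both scale-down conditions are qualified as "sustained", so a check needs a window of samples rather than a single reading. The sketch below assumes "sustained" means every sample in the window satisfies the condition; that definition, the window length, and the name `scale_down_actions` are illustrative choices, not something the table specifies.

```python
LATENCY_THRESHOLD_S = 1.0          # table: default queue latency < 1 second
WORKERS_AVAILABLE_MAX_PCT = 50.0   # table: workers available > 50%


def scale_down_actions(latency_samples_s, workers_available_samples_pct=()):
    """Return the scale-down actions whose condition held for the whole window.

    Each argument is a sequence of readings over the same observation
    window. An empty sequence never triggers an action.
    """
    actions = []
    if latency_samples_s and all(
            s < LATENCY_THRESHOLD_S for s in latency_samples_s):
        actions.append("remove one tines-sidekiq pod/container")
    if workers_available_samples_pct and all(
            p > WORKERS_AVAILABLE_MAX_PCT
            for p in workers_available_samples_pct):
        actions.append("remove tines-sidekiq pods/containers")
    return actions


# Latency stayed low, but worker availability dipped to 40% once.
print(scale_down_actions([0.0, 0.2, 0.1], [70.0, 40.0, 65.0]))
# ['remove one tines-sidekiq pod/container']

# A single latency spike in the window blocks the latency-based action.
print(scale_down_actions([0.0, 1.8, 0.1]))
# []
```

After removing capacity, re-check the scale-up conditions before removing more, since a pod removal changes both metrics.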

### Tracking Story-Level Performance

For deeper investigation into which stories or actions are driving load, use these trace attributes (requires auto instrumentation):

| Attribute | Purpose |
| --- | --- |
| `story_container.id` on `AgentReceiveJob` | Identifies which story an action run belongs to |
| `__trace.action_run_latency_ms` | Total action run latency |
| `__trace.action_run_time_to_start_ms` | Time from creation to execution start |
| `__trace.action_run_time_to_enqueued_ms` | Time from creation to queue entry |
| `scheduled.action_run_enqueue_job_v2.stories_with_pending_jobs` | Number of stories with pending work |
| `scheduled.action_run_enqueue_job_v2.pending_jobs_per_story` | Pending job count per story |

If only one story has a large backlog, the issue is likely story-specific (e.g., a slow external API call) rather than an infrastructure scaling problem.
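
One way to apply this check is to compare each story's pending jobs against the total backlog. The sketch below assumes you have already collected `pending_jobs_per_story` values into a mapping of story ID to pending count; the function name, the input shape, and the 80% share threshold are all illustrative assumptions rather than Tines defaults.

```python
def dominant_story(pending_jobs_per_story, share_threshold=0.8):
    """Return the story ID holding at least share_threshold of the backlog.

    pending_jobs_per_story maps story ID -> pending job count.
    Returns None when the backlog is empty or spread across stories.
    """
    total = sum(pending_jobs_per_story.values())
    if total == 0:
        return None
    story_id, pending = max(pending_jobs_per_story.items(),
                            key=lambda item: item[1])
    return story_id if pending / total >= share_threshold else None


# One story holds 950 of 1000 pending jobs: likely a story-specific issue.
print(dominant_story({101: 950, 102: 30, 103: 20}))
# 101

# Backlog spread across stories: more likely an infrastructure limit.
print(dominant_story({101: 40, 102: 35, 103: 25}))
# None
```

When a dominant story is found, the `__trace.action_run_latency_ms` values for that story's action runs are a reasonable next place to look before adding workers.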
