You can now export OpenTelemetry traces on self-hosted Tines installations. This additional tracing data gives you better insight into the performance of your Tines instance. To enable this capability, you will need to set the correct environment variables in both the tines-app and tines-sidekiq containers. For additional configuration, you can use the environment variables provided by the OTEL SDK.
Tines utilizes the OTLP Exporter to batch export spans to a user provided endpoint. This is the standard OpenTelemetry exporter that sends telemetry data using the OpenTelemetry Protocol.
This solution supports both HTTP and gRPC transport protocols.
The exporter has been configured with the BatchSpanProcessor for efficient export of spans.
The exporter endpoint, as well as other optional settings, are configurable with environment variables provided by the OpenTelemetry SDK configuration.
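For example, batching behavior can be tuned with the standard OpenTelemetry SDK batch span processor variables. The following is a sketch only: the values shown are the OTel specification defaults, and whether a given variable is honored depends on the OpenTelemetry SDK version bundled with Tines.

environment:
  OTEL_BSP_SCHEDULE_DELAY: "5000"          # milliseconds between batch export attempts
  OTEL_BSP_EXPORT_TIMEOUT: "30000"         # milliseconds before an export attempt is cancelled
  OTEL_BSP_MAX_QUEUE_SIZE: "2048"          # maximum spans buffered before new spans are dropped
  OTEL_BSP_MAX_EXPORT_BATCH_SIZE: "512"    # maximum spans sent in a single export request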
Environment Variables
The primary means of enabling and configuring the export of traces from Tines is by setting the following environment variables in your tines-app and tines-sidekiq containers.
Required
OTEL_ENABLED=true This is the variable that we use to determine whether to build and export traces for self-hosted installations.
Optional (but recommended)
OTEL_EXPORTER_OTLP_PROTOCOL=grpc, http/protobuf, or http/json The default is grpc.
OTEL_EXPORTER_OTLP_ENDPOINT=<where to receive traces> Defaults to http://localhost:4317 (gRPC) or http://localhost:4318 (HTTP) unless this environment variable is set to something else.
OTEL_SERVICE_NAME=<your company or service> The app will default this to “Tines” if none is provided.
Only the first variable mentioned, OTEL_ENABLED, is directly provided by Tines. The others are provided by the OTEL SDK Configuration and are recommended for successful export of OTEL traces from Tines.
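As a minimal sketch, assuming a Docker Compose style deployment (the service names, collector endpoint, and service name below are placeholders; adjust them to your own installation), the variables could be set like this:

services:
  tines-app:
    environment:
      OTEL_ENABLED: "true"
      OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"   # placeholder collector address
      OTEL_SERVICE_NAME: "tines-self-hosted"                      # placeholder service name
  tines-sidekiq:
    environment:
      OTEL_ENABLED: "true"
      OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
      OTEL_SERVICE_NAME: "tines-self-hosted"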
Optional: Auto Instrumentation
Tines can enable the following auto-instrumented tracing:
Web Requests: All HTTP requests to your Tines instance
GraphQL Queries: Complete GraphQL operation tracing with field-level paths
Background Jobs: Sidekiq job execution and processing
Database Operations: PostgreSQL queries (with sensitive data obfuscation)
External API Calls: HTTP requests to external services
Redis Operations: Cache and session operations
To enable auto instrumentation, set OTEL_AUTO_INSTRUMENTATION=true.
Details on sampling auto instrumentation can be found below.
Manage Performance with Tracing Data
When considering scaling your Tines instance up or down, we recommend tracking a few trace fields. Before tuning the number of worker containers, be sure to select an ideal SIDEKIQ_CONCURRENCY value, based on the container's resource limits (CPU/memory). Once that value is set, you can scale workers using the following fields.
Scaling Up
The first field to look at when investigating your Tines app’s performance is the action_run.default_queue_latency. This field is a Sidekiq queue performance metric that measures how long jobs have been waiting in the "default" queue before being processed. It tells us the time difference between when the oldest job in the queue was enqueued and the current time.
Performance Indicators:
Latency = 0: No jobs waiting in queue
Latency > 0: Jobs are waiting to be processed
Recommendations:
We consider high latency as anything over 1 second.
If latency is high but CPU is low → Increase SIDEKIQ_CONCURRENCY
If latency is high and CPU is high → Add more tines-sidekiq workers instead
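As a sketch of where these knobs live, again assuming a Docker Compose style deployment (the concurrency value and replica count below are illustrative, not recommendations):

services:
  tines-sidekiq:
    environment:
      SIDEKIQ_CONCURRENCY: "25"   # tune against the container's CPU/memory limits first
    deploy:
      replicas: 2                 # add workers when queue latency and CPU are both high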
The second field you can look at is percentage_of_workers_available, which tracks the availability of Sidekiq workers over time. It is only available when auto instrumentation is enabled, because its parent span comes from the auto instrumentation.
Recommendations
percentage_of_workers_available < 20% → Increase tines-sidekiq workers
Scaling Down
The same attributes above can be used to determine whether to scale down.
Recommendations
action_run.default_queue_latency < 1 second → Decrease tines-sidekiq workers by one
percentage_of_workers_available > 50% → Decrease tines-sidekiq workers
Tracking Story Performance
Most of the action run telemetry is only available when using auto instrumentation. The story_container.id attribute on AgentReceiveJob, provided via the OpenTelemetry Sidekiq integration, lets you determine which story an action run comes from. Without auto instrumentation, you can look at the run_action trace durations.
While percentage_of_workers_available is a good indicator of infrastructure performance, it may not provide insight into performance at the story and action level. For example, if only one story has pending jobs, it may be worth investigating that specific story to see if certain actions are running slowly. The following fields are useful here, and a collector-side filter sketch follows the list:
__trace.action_run_latency_ms
__trace.action_run_time_to_start_ms
__trace.action_run_time_to_enqueued_ms
scheduled.action_run_enqueue_job_v2.stories_with_pending_jobs
scheduled.action_run_enqueue_job_v2.pending_jobs_per_story
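If you rely on run_action trace durations, one option is to keep only the slower spans at the collector. The following is a sketch, assuming you route traces through the OTel Collector's filter processor (the processor name and the one-second threshold are examples, not defaults):

processors:
  filter/slow_action_runs:
    error_mode: ignore
    traces:
      span:
        # Drop run_action spans that finished in under 1 second (1,000,000,000 nanoseconds)
        - 'name == "run_action" and (end_time_unix_nano - start_time_unix_nano) < 1000000000'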
Optional: Sampling Auto Instrumentation
We recommend sampling the auto instrumentation tracing. Auto instrumentation can provide far more data, but it is also a lot more to manage. If you plan to use auto instrumentation together with the OTel Collector, we have some recommendations for sampling rules within the collector. Adding a filter processor will help reduce the number of traces you'll receive.
processors:
  # Reference: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/filterprocessor/README.md
  filter:
    error_mode: ignore # Recommended: ignore OTTL evaluation errors and continue processing
    traces:
      span:
        # Drop PostgreSQL empty statements (just a semicolon) - ALWAYS DROP
        - 'attributes["db.system"] == "postgresql" and attributes["db.statement"] == ";"'
        # Sidekiq PostgreSQL Sampling (keep 1 in 2000 = 0.05% retention)
        # Filters high-volume PostgreSQL operations from Sidekiq workers to reduce telemetry noise
        # Targets common database operations like EXECUTE, SELECT, INSERT, UPDATE, BEGIN, COMMIT, PREPARE
        - 'attributes["db.system"] == "postgresql" and resource.attributes["service.name"] == "sidekiq" and IsMatch(name, "^(tines|EXECUTE tines|SELECT tines|INSERT tines|UPDATE tines|BEGIN tines|COMMIT tines|PREPARE tines)$") and (end_time_unix_nano - start_time_unix_nano) < 100000000 and FNV(String(trace_id)) > -9214589400813948007'
        # Puma PostgreSQL Sampling (keep 1 in 500 = 0.2% retention)
        # Filters high-volume PostgreSQL operations from Puma web servers with moderate sampling
        # Targets common database operations like EXECUTE, SELECT, INSERT, UPDATE, BEGIN, COMMIT, PREPARE
        - 'attributes["db.system"] == "postgresql" and resource.attributes["service.name"] == "puma" and IsMatch(name, "^(tines|EXECUTE tines|SELECT tines|INSERT tines|UPDATE tines|BEGIN tines|COMMIT tines|PREPARE tines)$") and (end_time_unix_nano - start_time_unix_nano) < 100000000 and FNV(String(trace_id)) > -9131194718863876096'
        # PostgreSQL Transaction Control Sampling (keep 1 in 5000 = 0.02% retention)
        # Filters transaction control statements that are high-volume but low-value for observability
        - 'attributes["db.system"] == "postgresql" and attributes["db.statement"] != nil and IsMatch(attributes["db.statement"], "^(BEGIN|COMMIT|ROLLBACK|SAVEPOINT|RELEASE)") and FNV(String(trace_id)) > -9219685678674763806'
        # PostgreSQL Prepared Statement Sampling (keep 1 in 5000 = 0.02% retention)
        # Targets prepared statement operations that generate high telemetry volume
        - 'attributes["db.system"] == "postgresql" and attributes["db.operation"] != nil and (attributes["db.operation"] == "EXECUTE" or attributes["db.operation"] == "PREPARE" or attributes["db.operation"] == "DEALLOCATE") and FNV(String(trace_id)) > -9219685678674763806'
        # Fast PostgreSQL Query Sampling (keep 1 in 50 = 2% retention)
        # Aggressive filtering for sub-100ms database operations (< 100 million nanoseconds)
        # Preserves slow queries (≥ 100ms) for performance monitoring
        - 'attributes["db.system"] == "postgresql" and attributes["db.operation"] != nil and (attributes["db.operation"] == "SELECT" or attributes["db.operation"] == "INSERT" or attributes["db.operation"] == "UPDATE" or attributes["db.operation"] == "DELETE") and (end_time_unix_nano - start_time_unix_nano) < 100000000 and FNV(String(trace_id)) > -8854558173021978624'
        # All Redis Operations Sampling (keep 1 in 2000 = 0.05% retention)
        # Filters all Redis operations to reduce telemetry noise while preserving error patterns
        - 'attributes["db.system"] == "redis" and FNV(String(trace_id)) > -9214589400813948007'
        # Redis Pipeline Operations Sampling (keep 1 in 5000 = 0.02% retention)
        # Filters high-volume Redis pipeline operations to reduce telemetry noise
        - 'attributes["db.system"] == "redis" and name == "PIPELINED" and FNV(String(trace_id)) > -9219685678674763806'
        # ActiveRecord ORM Sampling (keep 1 in 100 = 1% retention)
        # Reduces noise from Ruby on Rails ActiveRecord operations while preserving error patterns
        - 'IsMatch(name, "^ActiveRecord.*") and FNV(String(trace_id)) > -9131194718863876096'
        # Health Check Sampling (keep 1 in 5000 = 0.02% retention)
        # Filters out most successful health checks (/is_up on localhost:3000) to reduce telemetry noise
        - 'attributes["http.target"] == "/is_up" and attributes["http.status_code"] == 200 and FNV(String(trace_id)) > -9219685678674763806'
      spanevent:
        # Sidekiq Auto-Generated Events - ALWAYS DROP
        # Drops all auto-generated Sidekiq instrumentation events that generate high telemetry volume
        - 'resource.attributes["process_type"] == "sidekiq" and IsMatch(name, "^(created_at|enqueued_at)$")'
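The filter processor above only takes effect once it is referenced in a traces pipeline. Below is a minimal sketch of the surrounding collector configuration, assuming an OTLP receiver for traces from Tines and an OTLP exporter to a placeholder backend; replace the exporter endpoint with your own tracing backend.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp:
    endpoint: tracing-backend.example.com:4317   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter]
      exporters: [otlp]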