Observability

Monitoring 

AWS 

If you are running on AWS, we recommend setting up CloudWatch alarms for the following metrics:

  • AWS Aurora

• Alert when CPUUtilization is above 80%. Reference here.

• Alert when FreeableMemory falls below 20% of the instance's total memory (i.e., memory utilization is above 80%). Reference here.

  • AWS Elasticache

    • Alert when DatabaseMemoryUsagePercentage is above 80%. Reference here.

    • Alert when EngineCPUUtilization is above 80%. Reference here.

  • AWS ALB

• Alert when UnHealthyHostCount is above 50% of the desired host count for over 2 minutes. For example: if you have set the desired task count for tines-app to 2, set the threshold for UnHealthyHostCount to 1. Reference here.

• Alert when HTTPCode_ELB_502_Count exceeds 5 requests. This metric indicates that your load balancer cannot successfully route requests to its backends and your traffic is being dropped. Reference here.

• If this alert fires frequently, increase your desired task count.

  • AWS ECS Fargate

• Alert when CPUUtilization is consistently (5 minutes or more) above 80%. Note: this can also be a sign that you need to increase the number of tasks on the service, i.e., scale horizontally. Reference here.

If an alert fires frequently and any of these metrics stay consistently above the thresholds, it's best to scale up the instance type. For example, if you are on db.r7g.large, upgrade the Aurora cluster to db.r7g.xlarge.
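As a sketch of how one of the alarms above could be defined, here is the Aurora CPUUtilization alarm created with the AWS CLI. The cluster identifier, alarm name, and SNS topic ARN are placeholders you would replace with your own values:

```shell
# Sketch: CloudWatch alarm for Aurora CPUUtilization above 80%
# sustained over two 5-minute evaluation periods.
# "my-aurora-cluster" and the SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name tines-aurora-cpu-high \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```

The other alarms follow the same pattern with the namespace, metric name, and dimensions swapped (e.g., AWS/ElastiCache with DatabaseMemoryUsagePercentage, or AWS/ApplicationELB with HTTPCode_ELB_502_Count).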

Non-AWS setup 

For non-AWS setups, our recommendations are similar. For example:

• You should set up monitoring for your storage system to alert when usage exceeds 80% of total capacity.

• You should set up monitoring to alert when the CPU utilization of your compute systems is consistently above 80%.
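If your platform lacks a managed alarm service, the 80% rule can be applied with a simple threshold check run from cron or your scheduler of choice. A minimal sketch; the disk-usage command is illustrative and the alerting action (here just an echo) would be replaced with your notification mechanism:

```shell
# Minimal threshold check: prints ALERT when a metric crosses its threshold.
check_threshold() {
  local name="$1" value="$2" threshold="$3"
  if [ "$value" -gt "$threshold" ]; then
    echo "ALERT: ${name} at ${value}% (threshold ${threshold}%)"
  else
    echo "OK: ${name} at ${value}%"
  fi
}

# Example: root filesystem usage from df (column 5, trailing % stripped).
disk_pct=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
check_threshold "disk" "$disk_pct" 80
```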

Tracing 

For deeper application-level observability, we recommend enabling OpenTelemetry (OTEL) tracing. This provides visibility into web requests, background job execution, database queries, external API calls, and more — going beyond infrastructure metrics to help you diagnose performance bottlenecks.

At a minimum, set OTEL_ENABLED=true on your tines-app and tines-sidekiq containers and point OTEL_EXPORTER_OTLP_ENDPOINT at your collector. Enabling OTEL_AUTO_INSTRUMENTATION=true captures a broad set of spans with no additional code changes.
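As a sketch, the minimum environment described above might look like this on a container-based deployment; the collector endpoint is a placeholder for your own OTLP collector address:

```shell
# Sketch: OTEL environment for the tines-app and tines-sidekiq containers.
# The endpoint below is a placeholder; point it at your own collector.
export OTEL_ENABLED=true
export OTEL_AUTO_INSTRUMENTATION=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
```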

See OpenTelemetry Traces configuration for full setup details and collector filter recommendations.
