AWS
If you are running on AWS, we recommend setting up CloudWatch alarms for the following metrics:
AWS Aurora
AWS ElastiCache
AWS ALB
Alert when UnHealthyHostCount is above 50% of the desired host count for over 2 minutes. For example, if you have set the desired task count for tines-app to 2, set the threshold for UnHealthyHostCount to 1. Reference here.
Alert when HTTPCode_ELB_502_Count is above 5 requests. This metric indicates that your load balancer cannot successfully route requests to its backends and your traffic has been dropped. Reference here. If you see frequent occurrences of this alert, increase your desired task count.
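As a rough illustration of the two ALB alerts above, the sketch below builds keyword arguments that could be passed to boto3's CloudWatch `put_metric_alarm` call. The helper names, the alarm names, the `Maximum` and `Sum` statistics, the one-minute period, rounding the 50% threshold up, and the `>=` comparison for the unhealthy-host alarm are all assumptions made for this example, not settings prescribed by this guide. The load balancer and target group values are placeholders you would replace with your own.

```python
import math

ALB_NAMESPACE = "AWS/ApplicationELB"


def unhealthy_host_threshold(desired_count: int) -> int:
    """Return 50% of the desired task count, rounded up, and never below 1.

    Rounding up is an assumption; the guide only gives the case 2 -> 1.
    """
    return max(1, math.ceil(desired_count * 0.5))


def unhealthy_host_alarm(desired_count: int, load_balancer: str, target_group: str) -> dict:
    """Build put_metric_alarm arguments for the UnHealthyHostCount alert."""
    return {
        "AlarmName": "tines-app-unhealthy-hosts",  # hypothetical alarm name
        "Namespace": ALB_NAMESPACE,
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Maximum",  # assumption: worst value seen in each period
        "Period": 60,  # seconds
        "EvaluationPeriods": 2,  # two 1-minute periods, i.e. "over 2 minutes"
        "Threshold": float(unhealthy_host_threshold(desired_count)),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # assumption
    }


def elb_502_alarm(load_balancer: str) -> dict:
    """Build put_metric_alarm arguments for the HTTPCode_ELB_502_Count alert."""
    return {
        "AlarmName": "tines-app-elb-502",  # hypothetical alarm name
        "Namespace": ALB_NAMESPACE,
        "MetricName": "HTTPCode_ELB_502_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",  # count of 502 responses in each period
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 5.0,  # alert when above 5 requests
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: treat periods with no 502 data points as healthy.
        "TreatMissingData": "notBreaching",
    }
```

With boto3 installed and credentials configured, either dictionary could be applied with `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. You would typically also add `AlarmActions` pointing at your own notification target.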
AWS ECS Fargate
Alert when CPUUtilization is consistently (5 minutes or more) above 80%. Note: this could also be a sign that you need to increase the number of tasks on the service; in other words, scale horizontally. Reference here.
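The "consistently above 80% for 5 minutes" condition can be expressed in CloudWatch as several consecutive one-minute periods that all breach the threshold. The following is a minimal sketch of the alarm arguments under that reading. The function name, the alarm name pattern, the `Average` statistic, and the choice of five one-minute periods are assumptions made for this example.

```python
def ecs_cpu_alarm(
    cluster_name: str,
    service_name: str,
    threshold_percent: float = 80.0,
    minutes: int = 5,
) -> dict:
    """Build put_metric_alarm arguments for sustained high ECS service CPU."""
    return {
        "AlarmName": f"{service_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "Statistic": "Average",  # assumption: average across the service's tasks
        "Period": 60,  # one-minute data points
        # Require every one of the last `minutes` data points to breach,
        # which approximates "consistently above" the threshold.
        "EvaluationPeriods": minutes,
        "DatapointsToAlarm": minutes,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

As with any CloudWatch alarm, the resulting dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)` once you have substituted your own cluster and service names.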
If you find that the alert is frequent and any of the metrics are consistently above the mentioned thresholds, it's best to scale up the instance type. For example, if you are on db.r7g.large, you should upgrade the Aurora cluster to db.r7g.xlarge.
Non-AWS setup
For non-AWS setups, our recommendations are similar to those for AWS setups. For example:
You should set up monitoring that alerts you when your storage system is occupying more than 80% of total storage.
You should set up monitoring that alerts you when the CPU utilization of your compute systems is consistently above 80%.
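If your environment has no managed alarm service, the two checks above can be approximated with a small script run on a schedule. The sketch below uses only the Python standard library. The function names, the five-sample window used to interpret "consistently", and the idea of feeding in CPU readings from your own collector are assumptions made for this example.

```python
import shutil
from typing import Sequence

THRESHOLD_PERCENT = 80.0


def storage_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def consistently_above(
    samples: Sequence[float],
    threshold: float = THRESHOLD_PERCENT,
    min_samples: int = 5,
) -> bool:
    """True when the most recent `min_samples` readings all exceed the threshold.

    Returns False if fewer than `min_samples` readings are available.
    """
    recent = samples[-min_samples:]
    return len(recent) >= min_samples and all(s > threshold for s in recent)


def storage_alert(path: str = "/") -> bool:
    """True when storage usage at `path` is above the 80% threshold."""
    return storage_used_percent(path) > THRESHOLD_PERCENT
```

For example, with one CPU reading per minute, `consistently_above(cpu_readings)` reports whether the last five minutes were all above 80%, and `storage_alert("/var/lib/data")` (a hypothetical mount point) reports whether that volume needs attention.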