Infrastructure metrics
AWS
If you are running on AWS, we recommend setting up CloudWatch alarms for the following metrics.
AWS Aurora
AWS ElastiCache
AWS ALB
Alert when UnHealthyHostCount is above 50% of the desired host count for over 2 minutes. For example, if you have set the desired task count for tines-app to 2, set the threshold for UnHealthyHostCount to 1. Reference here↗.
Alert when HTTPCode_ELB_502_Count is above 5 requests. This metric indicates that your load balancer cannot successfully route requests to its backends and that your traffic has been dropped. Reference here↗. If you see this alert frequently, increase your desired task count.
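As a rough sketch, the two ALB alerts above could be written as parameter sets for CloudWatch's PutMetricAlarm API. The helper names and alarm names are hypothetical. The Maximum statistic, the greater-than-or-equal comparison (chosen so that 1 unhealthy host out of 2 fires, matching the example), and the one-minute window for the 502 count are assumptions, not settings stated in this guide.

```python
def unhealthy_host_alarm(desired_count, load_balancer, target_group):
    """Build PutMetricAlarm parameters for the UnHealthyHostCount alert."""
    return {
        "AlarmName": "alb-unhealthy-hosts",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Maximum",  # assumption: worst case within each period
        "Period": 60,  # seconds
        "EvaluationPeriods": 2,  # two 1-minute periods = 2 minutes
        "Threshold": desired_count * 0.5,  # 50% of the desired host count
        # Assumption: ">=" so that a threshold of 1 fires at 1 unhealthy host.
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def elb_502_alarm(load_balancer):
    """Build PutMetricAlarm parameters for the HTTPCode_ELB_502_Count alert."""
    return {
        "AlarmName": "alb-502-count",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_502_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",  # total 502 responses in the period
        "Period": 60,  # assumption: the guide does not state a window
        "EvaluationPeriods": 1,
        "Threshold": 5,
        "ComparisonOperator": "GreaterThanThreshold",
        # The metric is only reported when non-zero, so treat gaps as healthy.
        "TreatMissingData": "notBreaching",
    }


def create_alarms(cloudwatch, alarms):
    """Submit each parameter set; `cloudwatch` is assumed to be a boto3 CloudWatch client."""
    for params in alarms:
        cloudwatch.put_metric_alarm(**params)
```

With boto3 installed and credentials configured, passing `boto3.client("cloudwatch")` to `create_alarms` would submit the alarms. To receive notifications you would also add `AlarmActions` (for example, an SNS topic ARN) to each parameter set.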
AWS ECS Fargate
Alert when CPUUtilization is consistently (for 5 minutes or more) above 80%. Note: this could also be a sign that you need to increase the number of tasks on the service, in other words, scale horizontally. Reference here↗.
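A minimal sketch of the ECS CPU alert as PutMetricAlarm parameters follows. The function name and alarm name are hypothetical, and reading "consistently" as five consecutive one-minute averages is an assumption rather than a setting stated here.

```python
def ecs_cpu_alarm(cluster_name, service_name):
    """Build PutMetricAlarm parameters for the ECS service CPUUtilization alert."""
    return {
        "AlarmName": f"{service_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "Statistic": "Average",  # assumption: average across running tasks
        "Period": 60,  # seconds
        "EvaluationPeriods": 5,  # five 1-minute periods = 5 minutes
        "Threshold": 80,  # percent
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Example: parameters for a service named tines-app in a hypothetical cluster.
params = ecs_cpu_alarm("my-cluster", "tines-app")
```

The resulting dictionary can be passed as keyword arguments to a boto3 CloudWatch client's `put_metric_alarm` call.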
If the alert fires frequently and any of the metrics are consistently above the mentioned thresholds, it's best to scale up the instance type. For example, if you are on db.r7g.large, you should upgrade the Aurora cluster to db.r7g.xlarge.
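The "scale up one size" step can be sketched as a small helper that picks the next instance class within the same family. The function name and the size ordering below are assumptions for illustration. Not every family offers every size, so check what is available for your engine and region before applying a change.

```python
# Assumed size ordering within an instance family, smallest to largest.
SIZES = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge", "12xlarge", "16xlarge"]


def next_instance_class(current):
    """Return the next larger class in the same family, e.g. db.r7g.large -> db.r7g.xlarge."""
    prefix, family, size = current.split(".")
    index = SIZES.index(size)  # raises ValueError for an unknown size
    if index + 1 >= len(SIZES):
        raise ValueError(f"{current} is already the largest size in the assumed list")
    return f"{prefix}.{family}.{SIZES[index + 1]}"
```

The returned value is what you would supply as the new instance class, for example as `DBInstanceClass` in an RDS `modify_db_instance` call.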
Non-AWS setup
For non-AWS setups, our recommendations are similar to those for AWS setups. For example:
You should set up monitoring that alerts when your storage system is using more than 80% of its total storage.
You should set up monitoring that alerts when the CPU utilization of your compute systems is consistently above 80%.
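For setups without a managed alarm service, the two checks above can be sketched with the Python standard library. The function names are hypothetical, and treating "consistently" as five consecutive samples is an assumption. How you collect CPU samples depends on your monitoring stack.

```python
import shutil

THRESHOLD_PERCENT = 80.0


def storage_used_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def consistently_above(samples, threshold=THRESHOLD_PERCENT, min_samples=5):
    """True if the most recent `min_samples` readings all exceed `threshold`."""
    recent = samples[-min_samples:]
    return len(recent) >= min_samples and all(s > threshold for s in recent)


def storage_alert(path="/"):
    """True if storage usage at `path` is above the 80% threshold."""
    return storage_used_percent(path) > THRESHOLD_PERCENT
```

For example, feeding `consistently_above` one CPU utilization reading per minute makes it fire only after five minutes of sustained load, which avoids alerting on short spikes.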