EFS mount watchdog and CPU usage

The moral of today’s story is that there is a direct relationship between the number of EFS mounts on an EC2 instance and the resting level of CPU usage.

We ship our Fargate deployments with a bastion host if they contain an EFS filesystem and/or an RDS cluster. This takes the form of an EC2 autoscaling group. As we only need the bastion host for time-limited tasks, we keep the desired capacity at 0 by default and raise it to 1 as needed (a minimal sketch of such a group follows the scaling policy below). We use scaling policies to ensure that the instance is turned down after a period of inactivity:

import { Duration } from 'aws-cdk-lib';
import { AdjustmentType } from 'aws-cdk-lib/aws-autoscaling';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';

const bastionCPUThreshold = opts.cpuThreshold ?? 1;
const bastionScalingPeriod = opts.scalingPeriod ?? Duration.minutes(30);

// Average CPU utilization for the bastion autoscaling group, evaluated
// over the scaling period.
const CPUUtilizationMetric = new Metric({
  namespace: 'AWS/EC2',
  metricName: 'CPUUtilization',
  dimensionsMap: {
    AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
  },
  period: bastionScalingPeriod,
});

// Step scaling: drop to 0 instances while CPU stays below the threshold,
// hold 1 instance while it is above it.
const scalingPolicy = this.autoScalingGroup.scaleOnMetric('ScaleToCPU', {
  metric: CPUUtilizationMetric,
  scalingSteps: [
    { upper: bastionCPUThreshold, change: 0 },
    { lower: bastionCPUThreshold, change: 1 },
  ],
  adjustmentType: AdjustmentType.EXACT_CAPACITY,
});
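
The policy attaches to the bastion autoscaling group defined elsewhere in the same construct. A minimal sketch of that group, with the VPC, instance type, AMI choice, and construct ID as illustrative assumptions:

import { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';
import { InstanceClass, InstanceSize, InstanceType, MachineImage } from 'aws-cdk-lib/aws-ec2';

// 'vpc' is assumed to be the VPC hosting the EFS filesystem and/or RDS cluster.
this.autoScalingGroup = new AutoScalingGroup(this, 'BastionASG', {
  vpc,
  instanceType: InstanceType.of(InstanceClass.T3, InstanceSize.MICRO),
  machineImage: MachineImage.latestAmazonLinux2(),
  minCapacity: 0,     // nothing runs by default
  desiredCapacity: 0, // raised to 1 manually when a bastion session is needed
  maxCapacity: 1,
});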

With the scaling policy above, the autoscaling group scales back to 0 instances after a 30-minute period in which average CPU usage stayed below 1%. For long-running tasks that don’t spike CPU usage, you can also simply suspend the “Terminate” process on the autoscaling group for the duration of the work, as sketched below.
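
One way to do that from a script, assuming the AWS SDK for JavaScript v3 and an illustrative group name, is to suspend the Terminate process before the task and resume it afterwards:

import {
  AutoScalingClient,
  SuspendProcessesCommand,
  ResumeProcessesCommand,
} from '@aws-sdk/client-auto-scaling';

const client = new AutoScalingClient({});

// Stop the ASG from terminating instances while a long-running task is active.
await client.send(new SuspendProcessesCommand({
  AutoScalingGroupName: 'bastion-asg', // illustrative name
  ScalingProcesses: ['Terminate'],
}));

// ...long-running work on the bastion...

// Re-enable termination so the scale-to-zero policy can take over again.
await client.send(new ResumeProcessesCommand({
  AutoScalingGroupName: 'bastion-asg',
  ScalingProcesses: ['Terminate'],
}));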

You do need to tune this a bit to account for workload configuration. We standardized on 1% CPU usage based on observation of existing workloads. We recently deployed a more complex workload and found that the bastion host was never terminated. The CloudWatch metric showed that CPU usage was a steady 2.5%. Why? Getting a shell and listing processes showed a python3 process accounting for the CPU usage. We examined that process more closely and found out what it was doing:

[root@ip-10-2-103-56 ~]# ps -p 3260 -o pid,vsz=MEMORY -o user,group=GROUP -o comm,args=ARGS
  PID MEMORY USER     GROUP    COMMAND ARGS
 3260 223756 root     root     python3 python3 /usr/bin/amazon-efs-mount-watchdog

amazon-efs-mount-watchdog is part of the EFS mount helper package that you install on any EC2 instance interacting with EFS. If you encrypt in transit (which you should), the mount helper brings up an stunnel process, and the watchdog ensures that the mounts stay healthy. This particular workload mounted multiple filesystems and access points, and that was enough overhead to account for the difference in CPU usage. We redeployed with a higher threshold for the scaling policy and resolved the issue. Lesson learned.
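
The redeploy itself was just a matter of passing a higher threshold through the opts shown at the top of the scaling snippet. A sketch, with the wrapper construct name and the exact threshold as illustrative assumptions:

import { Duration } from 'aws-cdk-lib';

// Hypothetical wrapper construct around the bastion autoscaling group; the
// cpuThreshold and scalingPeriod props feed opts.cpuThreshold and
// opts.scalingPeriod in the scaling policy above.
new BastionHost(this, 'Bastion', {
  vpc,
  cpuThreshold: 4, // % CPU, an illustrative value comfortably above the watchdog's steady ~2.5%
  scalingPeriod: Duration.minutes(30),
});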