When running applications on Amazon Web Services (AWS), monitoring the performance of CPU and GPU resources helps ensure optimal performance, cost-efficiency, and scalability. AWS provides various tools that enable users to monitor the utilization and health of compute resources running in the cloud. The sections below explore the key concepts and best practices for monitoring CPU and GPU performance on AWS.

Understanding CPU and GPU Metrics

CPU Utilization: Percentage of Allocated CPU Resources in Use

Analyzing CPU utilization is crucial for optimizing performance and managing costs effectively on AWS. CPU utilization refers to the percentage of allocated CPU resources that are actively being used by applications and processes. Monitoring CPU utilization metrics helps identify bottlenecks, scale resources as needed, and ensure efficient resource allocation.

CPU Credit Usage and Balance (for Burstable Instances): Consumption and Accrual of CPU Credits

CPU credit usage and balance are important metrics for burstable instances (T2, T3, and T4g) on AWS. These instances earn CPU credits at a fixed rate and spend them whenever utilization rises above the baseline, so credits accumulate during periods of low usage and are drawn down during bursts of activity. Monitoring CPU credit usage and balance shows how efficiently burstable instances are using their CPU allocation, which translates into informed decisions about instance sizing and optimization.
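To make the credit mechanics concrete, the sketch below estimates the credit balance of a t3.micro (2 vCPUs, 10% baseline per vCPU, 12 credits earned per hour, maximum accrued balance of 288 credits) over an hourly utilization profile; the figures are for that instance type only, so check the AWS documentation for others.

```python
# Sketch: estimating the CPU credit balance of a burstable instance over time.
# Figures assume a t3.micro; one CPU credit = one vCPU at 100% for one minute.

EARN_RATE = 12        # credits earned per hour (t3.micro)
VCPUS = 2
MAX_BALANCE = 288     # a t3.micro can accrue up to 24 hours of credits

def credits_spent(avg_utilization: float, hours: float = 1.0) -> float:
    """Credits consumed at a given average utilization (0.0-1.0 per vCPU)."""
    return avg_utilization * VCPUS * 60 * hours

def simulate_balance(hourly_utilization, start_balance=0.0):
    """Return the credit balance after each hour of the utilization profile."""
    balance, history = start_balance, []
    for util in hourly_utilization:
        balance += EARN_RATE - credits_spent(util)
        balance = max(0.0, min(balance, MAX_BALANCE))  # clamp to [0, max]
        history.append(round(balance, 1))
    return history

# Two idle hours bank credits; a 50%-utilization burst then drains them.
print(simulate_balance([0.0, 0.0, 0.5]))  # → [12.0, 24.0, 0.0]
```

The takeaway: a sustained burst above baseline depletes the balance quickly, which is exactly what the CPUCreditBalance metric surfaces in CloudWatch.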

GPU Utilization & GPU Memory Usage: Percentage of Allocated GPU Resources in Use & Memory Consumption

On the GPU side, monitoring GPU utilization and memory usage is essential for applications that leverage GPU resources, such as machine learning, rendering, and scientific computing workloads. GPU utilization indicates the percentage of allocated GPU resources actively being used, while GPU memory usage tracks the amount of GPU memory being consumed by processes and applications.

AWS Monitoring

Amazon CloudWatch is a powerful tool for monitoring and analyzing various metrics, logs, and events across AWS services. It offers a centralized platform to collect, visualize, and set CPU/GPU alarms along with other performance metrics. CloudWatch can be integrated with other AWS services for comprehensive monitoring and automation of resource management tasks.

Setting Up CloudWatch Metrics

Amazon EC2 publishes basic CPU metrics such as CPUUtilization to CloudWatch by default. Collecting additional metrics requires the CloudWatch Agent; some AWS-managed AMIs include the agent pre-installed, but otherwise it must be installed and configured manually on the EC2 instances. For GPU monitoring, additional configuration may be necessary to collect and publish GPU-specific metrics such as utilization and memory usage.

It is important to ensure that the EC2 instance has an IAM role with the necessary permissions to publish metrics to CloudWatch. Users can also define and publish custom metrics tailored to their applications, and further tune metric collection by selecting appropriate aggregation periods and granularity levels.
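A minimal sketch of publishing a custom GPU metric with boto3 is shown below. The namespace ("GPUMonitoring") and metric name ("GPUUtilization") are illustrative choices, not AWS-defined names, and the instance role needs the cloudwatch:PutMetricData permission for the call to succeed.

```python
# Sketch: publishing a custom GPU utilization metric to CloudWatch.
# Namespace and metric name are assumptions chosen for this example.

def build_metric_payload(instance_id: str, gpu_utilization: float) -> dict:
    """Assemble the arguments for CloudWatch's put_metric_data call."""
    return {
        "Namespace": "GPUMonitoring",   # custom namespace (assumption)
        "MetricData": [{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": gpu_utilization,
            "Unit": "Percent",
        }],
    }

def publish(instance_id: str, gpu_utilization: float) -> None:
    import boto3  # imported lazily so the builder stays testable offline
    boto3.client("cloudwatch").put_metric_data(
        **build_metric_payload(instance_id, gpu_utilization))

payload = build_metric_payload("i-0123456789abcdef0", 87.5)
print(payload["MetricData"][0]["Value"])  # → 87.5
```

Separating payload construction from the API call keeps the metric format easy to inspect and unit-test without AWS credentials.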

Monitoring Best Practices

Setting Up Alarms and Notifications

CloudWatch alarms can be configured to trigger notifications based on predefined thresholds to enable proactive resource management. For example, alarms can be set to notify when CPU or GPU utilization exceeds a certain threshold, allowing admins to take actions such as shutting down or resizing unused resources to optimize cost and performance.
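As an illustration, the sketch below configures a CloudWatch alarm that fires when average CPU utilization stays above 80% for three consecutive 5-minute periods. The alarm name and SNS topic ARN are placeholders, and the thresholds are starting points to adjust per workload.

```python
# Sketch: a high-CPU CloudWatch alarm definition. The SNS ARN below is a
# placeholder; substitute a real notification topic before use.

def build_cpu_alarm(instance_id: str, threshold: float = 80.0) -> dict:
    """Assemble the arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                 # seconds per evaluation period
        "EvaluationPeriods": 3,        # must breach 3 periods in a row
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:alerts"],
    }

def create_alarm(instance_id: str) -> None:
    import boto3  # lazy import: keeps the builder usable without AWS access
    boto3.client("cloudwatch").put_metric_alarm(**build_cpu_alarm(instance_id))

print(build_cpu_alarm("i-0123456789abcdef0")["AlarmName"])
```

Requiring several consecutive breaching periods avoids paging on short, harmless utilization spikes.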

Utilizing Dashboards

Custom CloudWatch dashboards can significantly enhance a company's ability to visualize and analyze key metrics. Dashboards can display real-time and historical data side by side, providing insight into resource utilization trends and performance.

Optimizing Resource Allocation

Optimizing resource allocation is a continuous process that involves adjusting instance types and sizes based on monitoring data. By analyzing CPU and GPU metrics, underutilized or overburdened resources can be identified, paving the way for informed decisions that help optimize performance and cost efficiency. This process involves resizing instances, choosing appropriate instance types, and implementing auto-scaling policies to dynamically adjust resources based on workload demands.
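A rightsizing decision of this kind can be sketched as a simple heuristic over observed utilization. The 20% and 80% thresholds below are illustrative starting points, not AWS recommendations.

```python
# Sketch: a simple rightsizing heuristic from average and peak CPU
# utilization percentages. Thresholds are illustrative assumptions.

def rightsizing_recommendation(avg_util: float, peak_util: float) -> str:
    """Suggest an action from utilization percentages in [0, 100]."""
    if peak_util > 80.0:
        return "upsize"    # sustained pressure: move to a larger instance
    if avg_util < 20.0 and peak_util < 50.0:
        return "downsize"  # consistently idle: a smaller instance suffices
    return "keep"          # utilization within a healthy band

print(rightsizing_recommendation(avg_util=10.0, peak_util=30.0))  # → downsize
```

In practice these inputs would come from CloudWatch statistics gathered over days or weeks, not a single sample, so that the recommendation reflects the workload's real shape.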

GPU-Specific Monitoring

For GPU-specific monitoring on AWS, users can utilize the NVIDIA System Management Interface (nvidia-smi), which is a command-line utility designed for monitoring NVIDIA GPUs. This tool provides detailed information about GPU utilization, memory usage, temperature, and more. 
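A common pattern is to invoke nvidia-smi with a machine-readable query and parse the result, as sketched below. The query fields are standard nvidia-smi options; the parser is split out so it can be exercised on sample output without a GPU present.

```python
# Sketch: querying NVIDIA GPUs via nvidia-smi and parsing its CSV output.
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_output: str) -> list[dict]:
    """One dict per GPU: utilization %, memory used/total in MiB."""
    stats = []
    for line in csv_output.strip().splitlines():
        util, mem_used, mem_total = (float(f) for f in line.split(","))
        stats.append({"utilization": util,
                      "memory_used_mib": mem_used,
                      "memory_total_mib": mem_total})
    return stats

def query_gpus() -> list[dict]:
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    return parse_gpu_stats(subprocess.check_output(QUERY, text=True))

# Parsing sample output for a 2-GPU instance:
print(parse_gpu_stats("45, 3210, 16160\n12, 1024, 16160"))
```

The parsed values can then be forwarded to CloudWatch as custom metrics, closing the loop between on-instance GPU telemetry and centralized monitoring.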

Additionally, GPU performance can be monitored on Amazon EC2 instances that are optimized for GPU workloads, ensuring efficient utilization of GPU resources for tasks such as machine learning, rendering, and data processing.

Examples

1. High-Performance Computing (HPC) Applications

Scenario: Organizations running simulations, scientific research, or other heavy compute jobs on Amazon EC2 instances that utilize GPUs.

Implementation: Setting up monitoring enables the automatic shutdown of instances during inactivity or underutilization, saving costs without human intervention. Python scripts can be utilized to monitor usage and trigger the necessary actions based on predefined thresholds.

2. Machine Learning Model Training

Scenario: Data scientists training models on GPU-enabled instances. These tasks can be resource-intensive and expensive, especially when models are in the training phase for prolonged periods of time.

Implementation: Automating the monitoring of CPU/GPU usage helps maintain efficient utilization rates. If the GPU is underutilized for a specified duration, an auto-shutdown Lambda function can be used to stop or terminate the instance, optimizing resource use and controlling costs.
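An auto-shutdown Lambda of this kind might look like the sketch below. The metric namespace and name follow the custom-metric conventions used earlier in this article and are assumptions, as is the event shape ({"instance_id": ...}); the idle-detection logic is kept as a pure function so it can be tested outside Lambda.

```python
# Sketch of an auto-shutdown Lambda: stop an instance whose average GPU
# utilization stayed below a threshold for the whole lookback window.

IDLE_THRESHOLD = 5.0     # percent (assumption; tune per workload)
LOOKBACK_MINUTES = 60

def is_idle(datapoints: list[float], threshold: float = IDLE_THRESHOLD) -> bool:
    """Idle only if we have data and every datapoint is below the threshold."""
    return bool(datapoints) and all(d < threshold for d in datapoints)

def handler(event, context):
    import boto3  # lazy import so is_idle() is testable outside Lambda
    from datetime import datetime, timedelta, timezone
    cw, ec2 = boto3.client("cloudwatch"), boto3.client("ec2")
    instance_id = event["instance_id"]  # illustrative event shape
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="GPUMonitoring", MetricName="GPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(minutes=LOOKBACK_MINUTES), EndTime=now,
        Period=300, Statistics=["Average"])
    averages = [p["Average"] for p in resp["Datapoints"]]
    if is_idle(averages):
        ec2.stop_instances(InstanceIds=[instance_id])
        return {"stopped": True}
    return {"stopped": False}
```

Note that is_idle() returns False when no datapoints exist, so an instance that simply stopped reporting metrics is not mistaken for an idle one.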

Setting Up Automated GPU and CPU Monitoring on Amazon EC2 Instances

The ‘Automated GPU and CPU Monitoring on AWS EC2 Instances’ Gist provides a comprehensive solution for setting up automated GPU and CPU monitoring on EC2 instances using a combination of Terraform configurations, Python scripts, and PowerShell scripts. The setup is designed to handle the creation of IAM roles, Lambda functions, and CloudWatch alarms, with a specific focus on Windows systems.

Conclusion

As studios increasingly transition to cloud-based workflows for tasks such as rendering and content creation, the need for efficient CPU and GPU monitoring becomes paramount. Effective CPU and GPU monitoring ensures optimal performance, resource utilization, and cost management, aligning with the demands of studio workflows.

About TrackIt

TrackIt is an Amazon Web Services Advanced Tier Services Partner based in Marina del Rey, CA, specializing in cloud management, consulting, and software development solutions.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization, with particular expertise in Media & Entertainment workflows including AWS Studio in the Cloud (SIC), Retail workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.