DevOps Guru: ML-Powered Cloud Operations Service to Improve Application Availability

Amazon DevOps Guru is a machine learning (ML) service that is simple to set up and helps improve an application’s operational performance and availability. The service detects behaviors that deviate from normal operating patterns so you can identify operational issues before they affect users.

DevOps Guru identifies anomalous application behavior (for example, increased latency, elevated error rates, or resource constraints) that could cause outages or service disruptions. When a critical issue is identified, it automatically sends an alert with a summary of related anomalies, the likely root cause, context about when and where the issue occurred, and, where possible, remediation recommendations.

DevOps Guru application architecture

Source: AWS


♦  Automatically detect operational issues

♦  Resolve issues quickly with ML-powered insights

♦  Easily scale and maintain availability 

♦  Reduce noise and alarm fatigue

Use Cases

Operational Audits
Summarize all the operationally significant events that have been identified, sorted by their severity. Use the System Health Dashboard to search for issues in specific applications and identify trends.

Proactive planning for resource exhaustion
Predictive alarming for exhaustible resources such as memory, CPU, and disk space. Notifications for when resource utilization will exceed the provisioned capacity.

Preventative maintenance
Flags medium- and low-severity findings that might not be critical, but may worsen over time, enabling you to prioritize and avoid unforeseen events in the future.
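
To make the idea of predictive alarming for exhaustible resources more concrete, here is a minimal Python sketch. It is not how DevOps Guru works internally (the service's models are not described here). It simply assumes a straight-line trend fitted by least squares, and the function name, parameters, and sample values are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_exhaustion(
    hours: Sequence[float],
    used_pct: Sequence[float],
    capacity_pct: float = 100.0,
) -> Optional[float]:
    """Estimate hours from the last sample until utilization reaches capacity.

    Fits a least-squares line to (hour, utilization %) samples and extrapolates.
    Returns None when utilization is flat or shrinking (no exhaustion predicted).
    This is an illustrative assumption, not DevOps Guru's actual algorithm.
    """
    n = len(hours)
    if n < 2 or n != len(used_pct):
        raise ValueError("need at least two paired samples")

    mean_t = sum(hours) / n
    mean_u = sum(used_pct) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Slope of the fitted line: percentage points of utilization per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(hours, used_pct)) / var_t
    if slope <= 0:
        return None

    intercept = mean_u - slope * mean_t
    # Hour at which the fitted line crosses the capacity threshold.
    t_full = (capacity_pct - intercept) / slope
    return max(0.0, t_full - hours[-1])


# Hypothetical disk-usage samples taken every six hours.
sample_hours = [0, 6, 12, 18, 24]
sample_used = [60.0, 62.0, 64.0, 66.0, 68.0]
remaining = hours_until_exhaustion(sample_hours, sample_used)
```

With these sample values, usage grows by two percentage points every six hours, so the estimate comes out at roughly 96 hours of headroom. A real predictive alarm would need to handle noisy and seasonal data, which a single straight line does not.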