Today’s cloud landscape is characterized by rapid technological advancements and growing data demands. Organizations are turning to Managed Cloud Services Providers (MCSPs) to effectively navigate the intricacies of cloud infrastructure and operations. By partnering with an MCSP, businesses can leverage specialized expertise, reduce operational overhead, and enhance their overall cloud performance. However, one of the crucial factors that is often overlooked when selecting an MCSP is the effectiveness and transparency of their incident management process. 

This article presents TrackIt’s approach to incident management and serves as a guide for companies looking to build their own incident management teams. The sections below outline the principles of Site Reliability Engineering (SRE) incident management and describe the key practices TrackIt has implemented to ensure service reliability for customers.

SRE Incident Management

Site Reliability Engineering (SRE) is a discipline that blends aspects of both software engineering and IT operations to ensure that systems and services run smoothly. SRE incident management refers to the systematic approach and set of practices employed by engineering teams to manage and mitigate issues that can disrupt the availability, reliability, and performance of systems and applications. 

TrackIt’s Incident Management Approach

TrackIt employs a meticulously structured incident management process to ensure swift and effective resolution of issues while adhering to stringent service level agreements (SLAs). With an industry-standard SLA target response time of 15 to 20 minutes, a dedicated team of five is focused on efficient incident handling and rapid response times.

Application Monitoring: A custom application monitoring system that proactively identifies and addresses issues is often deployed. This system continuously tracks the health and performance of critical components and promptly alerts the team to any deviations from normal operation. This proactive monitoring approach ensures swift incident response and minimizes downtime, aligning with TrackIt’s commitment to meeting stringent SLAs.
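
As a rough illustration of what "deviations from normal operation" can mean in practice, the sketch below flags a metric sample that falls far outside a recent baseline. It is a minimal example under stated assumptions, not TrackIt's monitoring software: the function name `deviates`, the three-standard-deviation threshold, and the latency figures are all hypothetical.

```python
from statistics import mean, stdev


def deviates(baseline, latest, k=3.0):
    """Return True when `latest` is more than k standard deviations
    away from the mean of the baseline samples (assumed rule)."""
    if len(baseline) < 2:
        # Not enough history to estimate normal behavior.
        return False
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return latest != mu
    return abs(latest - mu) > k * sigma


# Hypothetical response-time samples in milliseconds.
baseline_ms = [100, 102, 98, 101, 99]
print(deviates(baseline_ms, 103))  # within the normal range
print(deviates(baseline_ms, 250))  # far outside the normal range
```

A real monitoring system would typically combine several such signals per component before alerting; a single threshold is used here only to keep the idea visible.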

PagerDuty: The incident management toolkit relies on PagerDuty to streamline incident notification and management. PagerDuty is configured through Terraform, which makes the setup easy to adapt and customize to specific needs. This tool plays a pivotal role in ensuring prompt alerts for the on-call team, enabling strict adherence to the demanding SLAs.
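
To make the alerting step more concrete, the sketch below builds the JSON body of a "trigger" event in the shape accepted by the PagerDuty Events API v2. It only constructs the payload and does not send it. The article states that PagerDuty is configured through Terraform; this Python example is a separate illustration of what an alert could look like, and the helper name `build_trigger_event`, the routing key, and the summary and source strings are hypothetical placeholders.

```python
import json

# Severity levels accepted by the PagerDuty Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the body of an Events API v2 'trigger' event (not sent here)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",    # opens a new alert
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # component that raised the alert
            "severity": severity,
        },
    }


# Hypothetical values for illustration only.
event = build_trigger_event(
    "EXAMPLE_ROUTING_KEY",
    "API latency above threshold",
    "staging-api-1",
)
print(json.dumps(event, indent=2))
```

Keeping payload construction separate from delivery makes the alert format easy to test without network access.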

No Follow-the-Sun: On-call rotations are scheduled to commence on Mondays at 12 PM Pacific Time, ensuring comprehensive coverage at the start of the workweek. TrackIt does not use a follow-the-sun model, which relies on “handovers” between teams and can introduce delays and complications during critical incidents.
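
The weekly rotation boundary can be computed directly from the schedule described above. The sketch below finds the start of the rotation that is active at a given moment, assuming a weekly cadence and the `America/Los_Angeles` time zone for "Pacific Time". The function name `current_rotation_start` is hypothetical, and the weekly length is an assumption, since the article only states when rotations begin.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
ROTATION_START = time(12, 0)  # 12 PM Pacific, per the schedule above


def current_rotation_start(now):
    """Return the most recent Monday 12:00 Pacific at or before `now`.

    `now` must be a timezone-aware datetime.
    """
    local = now.astimezone(PACIFIC)
    # weekday() is 0 for Monday, so this steps back to this week's Monday.
    monday = local.date() - timedelta(days=local.weekday())
    start = datetime.combine(monday, ROTATION_START, tzinfo=PACIFIC)
    if start > local:
        # It is Monday morning before noon: the previous rotation is still active.
        start = datetime.combine(
            monday - timedelta(days=7), ROTATION_START, tzinfo=PACIFIC
        )
    return start


# Hypothetical timestamp: Wednesday, January 10, 2024, 9:00 AM Pacific.
print(current_rotation_start(datetime(2024, 1, 10, 9, 0, tzinfo=PACIFIC)))
```

Anchoring the calculation to a named time zone rather than a fixed UTC offset keeps the handoff at local noon across daylight saving changes.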

Flexibility: SLA calculations are based on a one-month average rather than on each incident individually. This provides flexibility to accommodate variations in incident response times, including incidents whose response time exceeds the industry-standard target.
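
A small worked example shows how a monthly average absorbs a single slow response. The sketch below is illustrative only: the 20-minute threshold is taken as the upper end of the 15 to 20 minute target mentioned earlier, and the function name `monthly_sla_met` and the sample response times are hypothetical.

```python
from statistics import mean

# Assumed threshold: upper end of the 15 to 20 minute response target.
SLA_TARGET_MINUTES = 20


def monthly_sla_met(response_minutes, target=SLA_TARGET_MINUTES):
    """Return (average, met) for one month of incident response times."""
    if not response_minutes:
        # No incidents in the month: nothing to breach.
        return 0.0, True
    average = mean(response_minutes)
    return average, average <= target


# Hypothetical month: one response (35 minutes) exceeds the target on its own.
samples = [8, 12, 35, 10, 15]
average, met = monthly_sla_met(samples)
print(f"average={average} min, SLA met={met}")
```

Here the five responses average 16 minutes, so the month meets the assumed target even though one incident took 35 minutes to acknowledge.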

Data-Driven: A crucial element of the incident management process is the emphasis on data-driven decision-making. For every incident or downtime event, the team places a priority on identifying and incorporating relevant metrics. This practice ensures continuous improvement and a proactive approach to incident prevention and mitigation.

Activities

TrackIt’s SRE framework encompasses various essential activities, including: 

Monthly tabletop exercises: Theoretical stress test scenarios used to document responses to potential issues (e.g., load balancer failures).

Post-mortem evaluations: Conducted after each incident. Documentation for these evaluations is maintained in platforms such as Coda or Notion to facilitate learning and process refinement.

Disaster recovery exercises (DREs): Carried out regularly (monthly or quarterly). DREs involve the intentional disruption of the staging environment to assess response and recovery procedures.

Response Process

When a Sev0 incident arises, the incident management team springs into action. This team is structured to efficiently manage the incident, with roles including the following: 

Incident Commander: Typically a Technical Program Manager (TPM), responsible for overall incident coordination.

Scribe/Communication Lead: Maintains regular stakeholder communication, providing updates every 30 minutes through a dedicated Slack channel.

Engineers: Responsible for incident resolution and executing recommended actions from the Incident Commander, ensuring a well-coordinated and effective response to any critical incident. 
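
The 30-minute update cadence handled by the Scribe/Communication Lead can be expressed as a simple schedule. The sketch below lists the times at which a stakeholder update was due between the start of an incident and the present moment. It is a minimal illustration: the function name `update_due_times` and the timestamps are hypothetical, and posting to Slack is deliberately left out.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence stated for stakeholder updates


def update_due_times(started_at, now, interval=UPDATE_INTERVAL):
    """List every update deadline after `started_at` up to and including `now`."""
    due = []
    next_due = started_at + interval
    while next_due <= now:
        due.append(next_due)
        next_due += interval
    return due


# Hypothetical Sev0 incident that started at 14:00 and is checked at 15:10.
started = datetime(2024, 1, 8, 14, 0)
checked = datetime(2024, 1, 8, 15, 10)
for deadline in update_due_times(started, checked):
    print(deadline.strftime("%H:%M"))
```

Comparing this list against the updates actually posted gives a quick way to check during a post-mortem whether the communication cadence was kept.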

About CloudWise – AWS Cloud Managed Services 

CloudWise, TrackIt’s AWS Cloud Managed Services offering, includes a suite of services such as monitoring, optimization, and support, enabling companies to manage their AWS infrastructure with ease and efficiency. CloudWise allows customers to stay focused on their core business while TrackIt experts handle all the technicalities of cloud infrastructure management.

Built on in-house custom monitoring software, the offering includes real-time monitoring, customized dashboards, monthly cost analysis and coverage reports, annual architecture reviews, and quarterly security assessments. Customers also benefit from 24/7/365 global support from AWS-certified TrackIt engineers working to ensure that their cloud investments are optimized to their full potential.

About TrackIt

TrackIt is an Amazon Web Services Advanced Tier Services Partner based in Marina del Rey, CA, specializing in cloud management, consulting, and software development solutions.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.