Today’s cloud landscape is characterized by rapid technological advancements and growing data demands. Organizations are turning to Managed Cloud Services Providers (MCSPs) to effectively navigate the intricacies of cloud infrastructure and operations. By partnering with an MCSP, businesses can leverage specialized expertise, reduce operational overhead, and enhance their overall cloud performance. However, one of the crucial factors that is often overlooked when selecting an MCSP is the effectiveness and transparency of their incident management process.
This article presents TrackIt’s approach to incident management and serves as a guide for companies looking to build their own incident management teams. The following sections outline the principles of Site Reliability Engineering (SRE) incident management and the key considerations TrackIt has implemented to ensure service reliability for customers.
SRE Incident Management
Site Reliability Engineering (SRE) is a discipline that blends aspects of both software engineering and IT operations to ensure that systems and services run smoothly. SRE incident management refers to the systematic approach and set of practices employed by engineering teams to manage and mitigate issues that can disrupt the availability, reliability, and performance of systems and applications.
TrackIt’s Incident Management Approach
TrackIt employs a meticulously structured incident management process to ensure swift and effective resolution of issues while adhering to stringent service level agreements (SLAs). A dedicated team of five handles incidents against an industry-standard SLA target response time of 15 to 20 minutes.
Application Monitoring: A custom application monitoring system that proactively identifies issues is often deployed. This system continuously tracks the health and performance of critical components and promptly alerts the team to any deviations from normal operation. This proactive monitoring approach ensures swift incident response and minimizes downtime, aligning with TrackIt’s commitment to meeting stringent SLAs.
PagerDuty: The incident management toolkit relies on PagerDuty to streamline incident notification and management. PagerDuty is configured through Terraform, which makes the setup easier to adapt and customize to specific needs. This tool plays a pivotal role in ensuring prompt alerts for the on-call team, enabling strict adherence to the demanding SLAs.
No Follow-the-Sun: On-call rotations begin on Mondays at 12 PM Pacific Time, ensuring comprehensive coverage from the start of the workweek. A follow-the-sun model relies on “handovers” between teams in different time zones, which can introduce delays and complications during critical incidents. TrackIt’s incident management process avoids these handovers to prioritize efficiency and effectiveness.
Flexibility: SLA calculations are based on a one-month average, which provides flexibility to accommodate variations in incident response times. An individual incident whose response time exceeds the industry-standard target can therefore be absorbed as long as the monthly average stays within the target.
Data-Driven: A crucial element of the incident management process is the emphasis on data-driven decision-making. For every incident or downtime event, the team places a priority on identifying and incorporating relevant metrics. This practice ensures continuous improvement and a proactive approach to incident prevention and mitigation.
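The rotation start time and the one-month SLA average described above can be sketched in a few lines of Python. This is only an illustration, not TrackIt’s actual tooling: the function names, the 20-minute target (the upper end of the 15 to 20 minute range), and the sample response times are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean

# Assumption: the SLA target is the upper end of the 15-20 minute range.
SLA_TARGET_MINUTES = 20.0


def rotation_start(pacific_now: datetime) -> datetime:
    """Return the most recent Monday 12 PM at or before `pacific_now`.

    `pacific_now` is expected to already be expressed in Pacific wall-clock
    time (for example, converted with zoneinfo.ZoneInfo("America/Los_Angeles")).
    """
    noon_today = pacific_now.replace(hour=12, minute=0, second=0, microsecond=0)
    # weekday() is 0 for Monday, so this steps back to Monday of the same week.
    monday_noon = noon_today - timedelta(days=pacific_now.weekday())
    if monday_noon > pacific_now:
        # It is Monday morning: the current rotation began the previous Monday.
        monday_noon -= timedelta(days=7)
    return monday_noon


def monthly_sla_met(response_minutes: list, target: float = SLA_TARGET_MINUTES) -> bool:
    """Check the SLA against the month's average response time.

    A single slow response does not fail the month as long as the
    average stays at or below the target.
    """
    if not response_minutes:
        return True  # no incidents this month, nothing to breach
    return mean(response_minutes) <= target


if __name__ == "__main__":
    # Hypothetical response times (minutes) for one month; one exceeds the target.
    sample = [12, 18, 35, 10]
    print(mean(sample))             # 18.75
    print(monthly_sla_met(sample))  # True
    print(rotation_start(datetime(2024, 3, 13, 9, 0)))  # 2024-03-11 12:00:00
```

In the sample data, the 35-minute response is above the target on its own, but the monthly average of 18.75 minutes still satisfies the SLA, which is the flexibility the averaging approach provides.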
Activities
TrackIt’s SRE framework encompasses various essential activities, including:
Monthly tabletop exercises: Theoretical stress-test scenarios used to document responses to potential issues (e.g., load balancer failures).
Post-mortem evaluations: Conducted after each incident and documented in platforms such as Coda or Notion to facilitate learning and process refinement.
Disaster recovery exercises (DREs): Carried out regularly (monthly or quarterly), DREs involve the intentional disruption of the staging environment to assess response and recovery procedures.
Response Process
When a Sev0 incident arises, the incident management team springs into action. This team is structured to efficiently manage the incident, with roles including the following:
Incident Commander: Typically a Technical Program Manager (TPM), responsible for overall incident coordination.
Scribe/Communication Lead: Maintains regular stakeholder communication, providing updates every 30 minutes through a dedicated Slack channel.
Engineers: Responsible for incident resolution and executing recommended actions from the Incident Commander, ensuring a well-coordinated and effective response to any critical incident.
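The 30-minute update cadence kept by the Scribe/Communication Lead can be tracked with a small helper. The sketch below is a hypothetical illustration only: the function names are invented for this example, and posting to Slack is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Optional

# Stakeholder updates are expected every 30 minutes during a Sev0 incident.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(incident_start: datetime, last_update: Optional[datetime]) -> datetime:
    """Return when the next stakeholder update is due.

    Before any update has been sent, the clock runs from the incident start.
    """
    base = last_update if last_update is not None else incident_start
    return base + UPDATE_INTERVAL


def is_update_overdue(
    incident_start: datetime, last_update: Optional[datetime], now: datetime
) -> bool:
    """True when the 30-minute window has passed without a new update."""
    return now > next_update_due(incident_start, last_update)
```

A check like this could run on a timer and remind the Scribe when an update is about to be missed; how the reminder is delivered is outside the scope of the sketch.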
About CloudWise – AWS Cloud Managed Services
CloudWise, TrackIt’s AWS Cloud Managed Services offering, includes a suite of services such as monitoring, optimization, and support, enabling companies to manage their AWS infrastructure with ease and efficiency. CloudWise allows customers to stay focused on their core business while TrackIt experts handle all the technicalities of cloud infrastructure management.
Built on in-house custom monitoring software, the offering includes real-time monitoring, customized dashboards, monthly cost analysis and coverage reports, annual architecture reviews, and quarterly security assessments. Customers also benefit from 24/7/365 global support from AWS-certified TrackIt engineers working to ensure that their cloud investments are optimized to their full potential.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code, and event-driven serverless architectures built on the latest AWS services. Together with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, this allows us to provide complete solutions for the media industry.