Building an Automated Sports Video Analysis Pipeline on AWS
Author
Raymond Lim
Automated sports video analysis is becoming a priority for organizations looking to scale analytics and content operations. From performance analysis and highlight generation to fan engagement and statistical modeling, the ability to automatically identify actions and events in sports video offers a clear operational advantage.
Traditionally, these workflows have depended heavily on manual tagging and annotation. Analysts often spend hours reviewing footage frame-by-frame to identify key moments, categorize events, and generate structured datasets for downstream applications. While effective at smaller scales, manual processes quickly become difficult to maintain as video libraries grow and the demand for faster turnaround increases.
Exploring Automated Sports Video Intelligence
Recent advances in computer vision and machine learning are making it possible to automate large portions of this workflow. By combining cloud-native infrastructure with AI/ML services, organizations can process large volumes of sports footage, detect actions in near real time, and generate structured metadata that supports analytics, searchability, clip generation, and operational efficiency.
TrackIt developed a proof of concept designed to evaluate how AWS-native machine learning pipelines could support automated sports event detection and video analysis at scale. The project focused on building a repeatable, cloud-based workflow capable of ingesting sports footage, training machine learning models, and performing automated inference and evaluation.
The Challenge
The primary objective was to determine whether machine learning models could reliably detect players, objects, and sport-specific actions directly from broadcast footage while maintaining consistency across varying camera angles, lighting conditions, motion blur, and gameplay speed.
A major challenge in sports video analysis lies in the temporal nature of events. Many actions cannot be accurately identified from a single frame alone and instead require contextual understanding across sequences of frames. Fast-moving objects, partial occlusions, inconsistent broadcast quality, and class imbalance between common and rare events further increase complexity.
In addition, the project required a scalable infrastructure capable of:
- Processing large video datasets efficiently
- Automating ingestion and preprocessing workflows
- Supporting iterative model experimentation
- Evaluating predictions against labeled reference datasets
- Reducing manual review effort
The broader goal was to assess whether an automated pipeline could provide a strong foundation for scalable sports analytics workflows while reducing operational overhead.
Solution Overview
TrackIt implemented a cloud-native machine learning pipeline on AWS to support automated sports video analysis workflows from ingestion through inference and evaluation.
The solution evolved through multiple iterations as the models and workflows matured.
Data Preparation
Pre-annotated sports video assets were ingested and normalized through a serverless preprocessing workflow. The pipeline extracted technical metadata, generated lightweight proxy assets, and prepared datasets for machine learning training and evaluation.
AWS Lambda and Amazon S3 were used to orchestrate ingestion and storage operations, enabling scalable and event-driven processing workflows.
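As a rough illustration of the event-driven pattern described above, the sketch below shows how a Lambda function might pick out newly uploaded videos from an S3 event notification. The function names, the accepted file extensions, and the return shape are assumptions made for this example; the source does not describe the actual handler.

```python
from urllib.parse import unquote_plus

# Assumed set of accepted video formats (not specified in the project description).
VIDEO_EXTENSIONS = (".mp4", ".mov", ".mxf")


def extract_video_objects(event):
    """Return (bucket, key) pairs for video uploads found in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(VIDEO_EXTENSIONS):
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda entry point: list the uploads to preprocess."""
    videos = extract_video_objects(event)
    # A real handler would start metadata extraction and proxy generation here.
    return {"accepted": [f"s3://{bucket}/{key}" for bucket, key in videos]}
```

Keeping the event parsing in a separate pure function makes it easy to unit test without any AWS credentials.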
Initial Model Architecture
The first iteration focused on single-frame image classification using ResNet-50, a 50-layer residual convolutional neural network (CNN) widely used for image recognition tasks.
Video frames were extracted using FFmpeg and labeled through Amazon SageMaker Ground Truth to support supervised training workflows. While this approach demonstrated promising early results, the architecture struggled to fully capture the temporal relationships required for accurate action recognition across continuous video sequences.
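A minimal sketch of the frame extraction step might look like the following. It only builds the FFmpeg argument list; the sampling rate, output naming pattern, and JPEG quality setting are assumptions for illustration, since the source does not state which options were used.

```python
from pathlib import Path


def build_frame_extraction_command(video_path, output_dir, fps=1.0):
    """Build an ffmpeg argument list that samples frames at a fixed rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    pattern = Path(output_dir) / "frame_%06d.jpg"  # assumed naming scheme
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps:g}",  # fps filter: keep N frames per second of video
        "-q:v", "2",            # high JPEG quality (lower value = better quality)
        str(pattern),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed. Sampling at a low rate keeps the labeling workload manageable, at the cost of possibly skipping very short events.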
Transition to Video-Based Classification
To improve temporal understanding, the pipeline evolved toward video-based classification using an R(2+1)D spatiotemporal architecture.
Unlike frame-based models, which classify each image independently, this approach processes short sequences of frames. Each 3D convolution is factorized into a 2D spatial convolution followed by a 1D temporal convolution, so spatial and temporal features are extracted in distinct operations. This design improves the model's ability to capture motion dynamics and event progression over time.
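To make the factorization concrete, the sketch below compares the weight count of a full 3D convolution with its (2+1)D counterpart. The rule for sizing the intermediate channel count follows the published R(2+1)D design, which picks it so that both variants have roughly the same number of parameters; whether the proof of concept used exactly this sizing is not stated in the source. Bias terms are ignored, and the function names are hypothetical.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution."""
    return t * d * d * n_in * n_out


def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channels chosen to roughly match the 3D parameter count."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def r2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    mid = r2plus1d_mid_channels(n_in, n_out, t, d)
    spatial = d * d * n_in * mid   # 2D convolution applied to each frame
    temporal = t * mid * n_out     # 1D convolution applied across frames
    return spatial + temporal


print(conv3d_params(64, 64), r2plus1d_params(64, 64))  # 110592 110592
```

For the same parameter budget, the factorized block inserts an extra nonlinearity between the spatial and temporal steps, which is one reason this design is often reported to train more easily than a plain 3D convolution.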
The classification pipeline was further enhanced with a lightweight object-tracking component optimized for the fast-moving objects typical of sports footage. The tracking layer was designed to handle challenging broadcast conditions such as motion blur, partial occlusions, and rapid scene transitions. Together, these systems addressed many of the spatial and temporal limitations observed in the earlier implementation.
Technical Architecture

The final proof-of-concept architecture combined multiple AWS services to support ingestion, training, inference, monitoring, and evaluation workflows.
Key components included:
- Video ingestion pipeline using Amazon API Gateway, AWS Lambda, and Amazon S3
- Machine learning workflows built on Amazon SageMaker for preprocessing, training, and model management
- Inference pipelines triggered automatically through Amazon S3 events
- Automated evaluation services comparing model predictions against human-labeled datasets
- Metadata and results storage using Amazon DynamoDB
- Monitoring and observability through Amazon CloudWatch Logs
Amazon SageMaker was additionally used to support model registration, repeatable experimentation workflows, and automated evaluation pipelines, enabling consistent iteration across training and inference cycles.
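One simple way to compare model predictions against human-labeled references is per-class precision, recall, and F1. The sketch below assumes each clip carries exactly one reference label and one predicted label; the label names in the usage example are placeholders, and the source does not describe the evaluation service at this level of detail.

```python
from collections import Counter


def per_class_f1(reference, predicted):
    """Compute precision, recall, and F1 for every label seen in either list."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(reference, predicted):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted label was wrong for this clip
            fn[ref] += 1   # true label was missed for this clip
    scores = {}
    for label in sorted(set(reference) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(s["f1"] for s in scores.values()) / len(scores)
```

For example, `per_class_f1(["pass", "pass", "pass", "shot"], ["pass", "pass", "shot", "shot"])` yields an F1 of 0.8 for "pass" and about 0.67 for "shot". Reporting the macro average alongside per-class scores helps surface weak performance on rare events that overall accuracy would hide.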
The architecture followed AWS Well-Architected principles with an emphasis on scalability, operational efficiency, reliability, and cost optimization.
Results
The proof of concept demonstrated that automated sports video analysis pipelines can significantly reduce manual review and annotation effort while establishing a scalable foundation for future AI-driven workflows.
Key outcomes included:
- Improved detection consistency compared to earlier frame-based approaches
- Reliable detection performance for frequently occurring actions
- Automated end-to-end processing across ingestion, training, inference, and evaluation workflows
- Reduced manual tagging requirements for video review operations
- Faster identification of key moments for downstream highlight generation and analytics workflows
Early evaluation demonstrated measurable improvements over the initial frame-based implementation, particularly for frequently occurring actions and sequence-based event recognition tasks.
The project also highlighted several important insights regarding dataset diversity, temporal modeling, and event imbalance within sports video analysis systems.
While additional refinement would be required for production-scale deployment, the proof of concept validated the feasibility of building scalable sports intelligence workflows using AWS-native AI/ML services.
Areas for Future Improvement
To elevate the pipeline to production readiness, several enhancements have been proposed. These steps aim to resolve current bottlenecks and improve overall performance.
- Optimize Object Tracking: Reduce processing time (currently ~5 hours) by tracking only every n-th frame and interpolating positions in between, or by using ground-truth bounding boxes.
- Improve Team Filtering: Replace filtering based on frame height with filtering based on player bounding boxes to focus on the camera-side team.
- Add Player Pose Layer: Incorporate pose estimation to enhance event classification accuracy.
- Address Class Imbalance: Collect more diverse data, including rare events, multiple camera angles, and additional games.
- Model Refinement: Retrain the models with these additional inputs, which is expected to improve F1 scores.
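The object-tracking optimization listed above (tracking every n-th frame and interpolating positions) could be sketched as follows. This is a minimal example that assumes boxes are stored as (x, y, width, height) tuples keyed by frame index and that linear interpolation is acceptable between tracked frames; the function name and data layout are hypothetical.

```python
def interpolate_boxes(keyframes):
    """Fill in boxes for untracked frames by linear interpolation.

    keyframes maps a frame index to an (x, y, w, h) box for frames that
    were actually tracked. The result covers every frame from the first
    to the last tracked frame.
    """
    frames = sorted(keyframes)
    result = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        span = end - start
        for frame in range(start, end):
            alpha = (frame - start) / span  # 0.0 at start, approaches 1.0 at end
            result[frame] = tuple(a + alpha * (b - a) for a, b in zip(box_a, box_b))
    if frames:
        # The last tracked frame is not covered by the loop above.
        result[frames[-1]] = tuple(float(v) for v in keyframes[frames[-1]])
    return result
```

With tracked frames `{0: (0, 0, 10, 10), 4: (8, 4, 10, 10)}`, frame 2 is assigned `(4.0, 2.0, 10.0, 10.0)`. Linear interpolation is a simplification: fast direction changes between tracked frames would be smoothed over, so the stride n would need to be tuned against tracking accuracy.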
Conclusion
Automated sports video analysis represents a growing opportunity for sports organizations, media platforms, and analytics providers seeking to scale content operations and unlock deeper insights from video data.
By combining AWS-native infrastructure with modern computer vision architectures, organizations can move toward intelligent, scalable video processing pipelines capable of reducing manual effort and accelerating analytics workflows.
This proof of concept demonstrates how cloud-native AI/ML services can support the development of automated sports intelligence platforms while providing a flexible foundation for future innovation in video analytics.



