
Nov. 25, 2020


How to Optimize Your Data Pipeline Implementation


Data Pipeline Implementation

 

Data Pipeline

Data pipelines are essential for companies looking to leverage their data to gather reliable business insights. Pipelines allow companies to consolidate, combine, and modify data originating from various sources and make it available for analysis and visualization.

However, the numerous benefits that data pipelines provide depend on a company’s ability to extract and aggregate data from different sources, and therefore on the quality of its pipeline architecture choices.

One of TrackIt’s clients had implemented a big data pipeline running on AWS that needed to be optimized. The client was leveraging the big data pipeline to enable its data scientists to gain additional insights by exploiting data that originated from CSV files.

However, the company was running into certain architecture-related problems with its pipeline and sought TrackIt’s expertise to address them.

This article details the TrackIt team’s approach to optimizing the architecture of the data pipeline.

Initial Pipeline

[Diagram: Initial pipeline]

How the initial pipeline worked:
•  The company’s CSV files were first added to an S3 bucket

•  Once the files were added to the S3 bucket, an AWS Glue job was automatically triggered to fetch the data from the CSV files and make it available to a Python Spark script, the next step in the pipeline

•  A Python Spark script then modified the data to make it more suitable for use in Redshift Spectrum (a simplified sketch of this transformation step follows the list)

•  The modified data was then stored in a new S3 bucket (now in the Parquet file format)

•  Once files were added to the new S3 bucket, another Glue job was triggered that made the data available for use by Amazon Redshift, a SQL-based data warehouse

•  Files from this S3 bucket were also replicated into another S3 bucket hosted in a different AWS region using S3’s cross-region replication feature. Separate buckets were needed in each region because Redshift Spectrum, which was used independently in both regions, requires the S3 bucket it reads from to be in the same region as the Redshift cluster.

•  An AWS Glue job then fetched data from the latter S3 bucket and made it available to Redshift

•  Data scientists could then use a Python script to query the data on Redshift
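The Spark transformation step referenced above could look roughly like the sketch below. This is a minimal illustration, not the client’s actual script: the bucket names, paths, and column operations are hypothetical placeholders.

# Minimal sketch of the Spark step: read CSVs from S3, reshape them, write Parquet.
# Bucket names and the transformation logic are placeholders, not the client's code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV files uploaded to the source bucket
raw = spark.read.option("header", "true").csv("s3://example-raw-bucket/incoming/")

# Example transformation: normalize column names and cast a date column
cleaned = raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
cleaned = cleaned.withColumn("event_date", F.to_date("event_date", "yyyy-MM-dd"))

# Write Parquet, partitioned by date, for Redshift Spectrum to query efficiently
cleaned.write.mode("overwrite").partitionBy("event_date").parquet("s3://example-parquet-bucket/curated/")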

Problems:

The pipeline implemented by the company had certain issues that were hindering the company’s ability to make the most of its data.

Problem #1: Inability to individually test jobs
The initial pipeline did not provide the company with the ability to isolate and test individual components of the pipeline. Manually triggering one of the steps of the pipeline launched all the other events that followed it.

Problem #2: Too many steps in the pipeline
The initial pipeline included quite a few additional steps – such as the Lambda functions and CloudWatch events before and after the Glue job – that made the pipeline harder to test and manage. These extra steps could have been avoided with different architectural choices.

Problem #3: Cross-region data replication
There was also an issue caused by the cross-region data replication feature on S3. Data replication between AWS region 1 and AWS region 2 was not instantaneous and took a few minutes. However, the completion of the AWS Glue job in region 1 immediately triggered the Glue job in region 2, before the data had finished replicating between the S3 buckets.

Problem #4: No error notifications
The initial pipeline provided the company with no error notifications. The company often discovered errors weeks after they occurred, and then only because a data scientist noticed that data was missing. When an error did occur, the next job was simply never launched.

The pipeline did not include an error notification component that would have allowed the company to become aware of errors as soon as they happened. Instead, the company’s engineers had to go into the console and dig through the execution history to identify and pinpoint errors in the pipeline.

Optimized Pipeline

The TrackIt team first proposed the following modifications to the client.

The first modification to the pipeline proposed by the TrackIt team was to use Glue Workflows, a feature of AWS Glue that runs a sequence of Glue jobs automatically. A Glue workflow would also allow the company to launch and test jobs individually without triggering the entire workflow.

The implementation of the Glue Workflow would also enable the company to simplify the pipeline by getting rid of extraneous Lambda functions and CloudWatch events that had been implemented in the initial pipeline. Instead of having multiple Lambda functions, the new pipeline would have just one Lambda function that triggers the Glue Workflow when files are uploaded into the S3 bucket.
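As a rough illustration of this simplification, the single Lambda function would only need one boto3 call to start the workflow. The workflow name below is a hypothetical placeholder.

# Sketch of the single Lambda that starts the Glue workflow when new CSV files land in S3.
# "csv-to-redshift-workflow" is a placeholder, not the client's actual workflow name.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated notification on the raw CSV bucket
    for record in event.get("Records", []):
        print("New object detected:", record["s3"]["object"]["key"])

    # Kick off the whole workflow; individual jobs can still be run on their
    # own from the Glue console for testing without triggering everything else.
    response = glue.start_workflow_run(Name="csv-to-redshift-workflow")
    return {"workflowRunId": response["RunId"]}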

[Diagram: ETL Pipeline 2 (proposed optimized pipeline)]

The second modification proposed by the TrackIt team was the addition of an error notification component using Amazon CloudWatch. A CloudWatch event would be triggered immediately when an error occurred in the Glue workflow and could then send an HTTP request or an email to the team, or trigger a Lambda function to execute additional remediation tasks.
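One possible way to wire up such a notification, sketched below with placeholder names, is a CloudWatch Events (EventBridge) rule that matches failed Glue job runs and publishes them to an SNS topic with an email subscription.

# Sketch of an error-notification rule: match failed Glue jobs and route them to an
# SNS topic that emails the team. Rule name and topic ARN are placeholders.
import json
import boto3

events = boto3.client("events")

pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}

events.put_rule(
    Name="glue-job-failure-rule",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Route matching events to an SNS topic that has an email subscription for the team
events.put_targets(
    Rule="glue-job-failure-rule",
    Targets=[{"Id": "notify-team", "Arn": "arn:aws:sns:us-west-2:123456789012:pipeline-alerts"}],
)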

The third modification to the pipeline proposed by the TrackIt team was to eliminate the use of S3 cross-region replication. Instead, the files are directly added to both S3 buckets (each located in a different region) so that when the Glue job is triggered in region 2, all the files are already up to date in both S3 buckets.
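A minimal sketch of this dual upload, assuming hypothetical bucket names and regions, could look like the following.

# Sketch of uploading each new CSV file to both regional buckets at ingest time,
# replacing cross-region replication. Bucket names and regions are placeholders.
import boto3

s3_region_1 = boto3.client("s3", region_name="us-west-2")
s3_region_2 = boto3.client("s3", region_name="eu-west-1")

def upload_to_both(local_path, key):
    # Upload the same object to the bucket in each region so both Glue jobs
    # see a complete, up-to-date copy when they run.
    s3_region_1.upload_file(local_path, "example-data-us-west-2", key)
    s3_region_2.upload_file(local_path, "example-data-eu-west-1", key)

upload_to_both("daily_export.csv", "incoming/daily_export.csv")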

The client was pleased with this proposal and wanted to incorporate these changes into the pipeline using Apache Airflow, a tool for creating and managing complex workflows.

Apache Airflow Implementation

The TrackIt team assisted the client in incorporating the suggested modifications and implementing the pipeline on Apache Airflow. The different parts of the pipeline were coded as Python modules that the client could reuse in the future to build similar pipelines or to further modify the existing one.
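A condensed sketch of what such an Airflow DAG might look like is shown below. It assumes the Amazon provider package for Airflow; the DAG, job, and crawler names are hypothetical, and the client’s actual modules wrapped these steps for reuse.

# Condensed sketch of an Airflow DAG orchestrating the main pipeline steps.
# DAG, job, and crawler names are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

with DAG(
    dag_id="csv_to_redshift_pipeline",
    start_date=datetime(2020, 11, 1),
    schedule_interval=None,  # triggered externally when new CSV files arrive
    catchup=False,
) as dag:

    # Catalog the raw CSV files
    crawl_raw_csv = GlueCrawlerOperator(
        task_id="crawl_raw_csv",
        config={"Name": "raw-csv-crawler"},
    )

    # Run the Spark job that converts CSV to Parquet
    transform_to_parquet = GlueJobOperator(
        task_id="transform_to_parquet",
        job_name="csv-to-parquet-spark-job",
    )

    # Catalog the Parquet output for Redshift Spectrum
    crawl_parquet = GlueCrawlerOperator(
        task_id="crawl_parquet",
        config={"Name": "parquet-crawler"},
    )

    crawl_raw_csv >> transform_to_parquet >> crawl_parquet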

[Diagrams: ETL Pipeline 3, 4, and 5 (final Apache Airflow pipeline)]

How the final Apache Airflow pipeline works:

•  The company’s CSV files are first added to an S3 bucket
•  An AWS Glue crawler fetches the data from S3
•  A Python Spark script is executed that modifies the data and makes it more suitable for use in Redshift Spectrum

•  The modified data is then stored in a new S3 bucket (now in the Parquet file format)
•  In one region, a Glue crawler then fetches the Parquet data
•  In the other region, a Glue crawler fetches the data, and a Redshift script modifies the data and applies the updates

•  Data scientists can use a Python script to query the data on Redshift (a brief example of such a query follows this list)
•  If any error occurs within the pipeline, a CloudWatch event is immediately triggered and sends an email to notify the team
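For illustration, the kind of Python query script the data scientists could run against Redshift might resemble the sketch below; the connection details, schema, and table names are hypothetical.

# Sketch of a Python script querying Redshift (via the psycopg2 driver).
# Connection details, schema, and table names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # Query the external (Spectrum) table that points at the Parquet data in S3
    cur.execute("SELECT event_date, COUNT(*) FROM spectrum_schema.events GROUP BY event_date")
    for event_date, total in cur.fetchall():
        print(event_date, total)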

 

About TrackIt

TrackIt, an Amazon Web Services Advanced Consulting Partner based in Marina del Rey, CA, offers a range of cloud management, consulting, and software development solutions. Their expertise includes Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization, with a focus on Media & Entertainment workflows, High-Performance Computing environments, and data storage.

TrackIt excels in cutting-edge software design, particularly in the areas of containerization, serverless architectures, and pipeline development. The company’s team of experts can help you design and deploy a custom solution tailored to your specific needs.

In addition to cloud management and modern software development services, TrackIt also provides an open-source AWS cost management tool to help users optimize their costs and resources on the platform. With its innovative approach and expertise, TrackIt is the ideal partner for organizations seeking to maximize the potential of their cloud infrastructure.

 
