Written by Arnaud Brown, Senior Full Stack Engineer, and Joffrey Escobar, Cloud Data Engineer

The first article in the Data Quality series highlighted the significance of data quality and the key dimensions that assist in defining high-quality data. The article also discussed the common challenges associated with data quality and introduced techniques to help establish and maintain high data quality standards. 

This second installment demonstrates the data quality features offered by AWS Glue that help prevent a data source (such as a data lake) from degrading into a disorderly, unmanageable repository, commonly referred to as a Data Swamp. AWS Glue provides a unified approach to data quality that works for data lakes, traditional databases, and other forms of data storage. It addresses fundamental needs such as improving data reliability, ensuring compliance, and enhancing operational efficiency, and it validates the critical data quality dimensions discussed in the first article.

The article explores how AWS Glue addresses the central challenges associated with data quality, encompassing Data Profiling, Data Cleansing, and Data Governance, and how it helps sustain a high level of data quality over time.

To begin, the article provides a detailed guide to integrating data quality checks into AWS Glue jobs. This is followed by a discussion of AWS Glue DataBrew, a service known for its proficiency in data profiling and cleansing that offers a transparent view of data lineage and streamlines the automation of these processes on incoming data.

Data Quality in AWS Glue jobs

AWS Glue is a cloud-based service provided by Amazon Web Services that offers data integration, ETL (Extract, Transform, Load), and data preparation capabilities to simplify and automate the process of managing and transforming data. 

Glue provides a robust platform for creating scripts that leverage the extensive capabilities of AWS cloud infrastructure. This allows for seamless handling of massive data volumes, with cost-effective serverless execution. These scripts are highly effective for building ETL/ELT pipelines. The platform also offers a user-friendly visual editor, making it easy to work with data through nodes for tasks like data retrieval, transformation, and writing. Notably, any actions performed in the visual editor can be easily converted into a Python script.
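For instance, a minimal job script of the kind the visual editor generates initializes a GlueContext and reads a table from the Glue Data Catalog. The sketch below uses placeholder database and table names.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a source table registered in the Glue Data Catalog (placeholder names)
orders_node = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
)

job.commit()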

AWS Glue data quality

A notable addition to AWS Glue is the Data Quality node. This feature enables users to establish precise rules for data fields. With more than 25 preset rules at their disposal, users can set thresholds for each rule to maintain data accuracy, and custom rules can be added quickly when the preset ones do not cover a requirement. Available preset rules include ColumnCount, ColumnDataType, ColumnLength, Completeness, DataFreshness, StandardDeviation, and Uniqueness.
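For example, a ruleset combining several of these preset rules can be expressed in DQDL, Glue's Data Quality Definition Language. The sketch below defines it as a Python string ready to be passed to a Glue job; the column names and thresholds are hypothetical.

# A hypothetical DQDL ruleset covering several preset rule types
dq_ruleset = """
Rules = [
    ColumnCount = 8,
    ColumnDataType "order_id" = "INTEGER",
    ColumnLength "country_code" = 2,
    Completeness "customer_id" > 0.95,
    DataFreshness "order_date" <= 24 hours,
    StandardDeviation "amount" between 10 and 500,
    Uniqueness "order_id" > 0.99
]
"""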

AWS Glue data quality - Rules

The Data Quality node provides two output options. The primary output keeps the original data while adding predefined columns, allowing for row-by-row data quality results. The secondary output provides an overall view of data quality outcomes, which can be valuable for making decisions on actions like data cleansing.
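In a script, the Data Quality node typically corresponds to the EvaluateDataQuality transform. The sketch below reuses the orders_node source and dq_ruleset string from the earlier sketches, and assumes the rowLevelOutcomes and ruleOutcomes collection keys and the DataQualityEvaluationResult column added to the row-level output; these names and the publishing options should be verified against the current Glue documentation.

from awsgluedq.transforms import EvaluateDataQuality
from awsglue.transforms import SelectFromCollection, Filter

# Evaluate the ruleset and return both outputs as a DynamicFrameCollection
dq_results = EvaluateDataQuality().process_rows(
    frame=orders_node,        # source node from the earlier sketch
    ruleset=dq_ruleset,       # DQDL string defined above
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# Primary output: the original rows plus per-row data quality columns
row_level = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes")

# Secondary output: one record per rule with its overall outcome
rule_outcomes = SelectFromCollection.apply(dfc=dq_results, key="ruleOutcomes")

# Example follow-up action: isolate rows that failed one or more rules
failed_rows = Filter.apply(
    frame=row_level,
    f=lambda row: row["DataQualityEvaluationResult"] == "Failed",
)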

AWS Glue Data Quality Node

AWS Glue offers an all-in-one solution for data quality with its editor. A wide range of preset rules helps to ensure high standards of data quality in every aspect. It effectively tackles the inherent challenges of data quality management highlighted in the first article.

AWS Glue Data Quality Results

Data Quality results for a successful execution (above) and a failed execution (below)

Data Quality with Glue DataBrew

AWS Glue DataBrew is a user-friendly tool created for visual data preparation, allowing data analysts and data scientists to refine and standardize data for analytics and machine learning (ML) purposes. The service is equipped with a library of over 250 pre-built transformations for automating various data preparation tasks without the need for coding, including spotting and removing anomalous data, making data formats consistent, and correcting incorrect data entries. Once the data is ready, it can be used for analytics and ML projects. AWS Glue DataBrew charges are based on actual usage, eliminating the need for an initial investment.

AWS Glue DataBrew Data Quality

To gain a deeper understanding of the data quality features offered by AWS Glue DataBrew, it is essential to understand its architecture. At its core, the system revolves around four components:

  • Jobs: In the context of AWS DataBrew, Jobs represent the execution units responsible for performing specific tasks within the data preparation process. They interconnect with other components, orchestrating the flow of data transformations and ensuring seamless processing from input to output.
  • Datasets: Datasets represent the data that DataBrew reads and transforms. A dataset points to raw or processed data stored in sources such as Amazon S3 or the AWS Glue Data Catalog, making that data accessible for retrieval and manipulation throughout the data preparation lifecycle.
  • Recipes: Recipes define the transformation parameters applied to the data. These configurations are based on a series of actions, specifying how the data should be modified or cleaned. Recipes are essential in shaping the data to meet specific requirements, providing a structured approach to the data preparation workflow.
  • Data Quality Rules: Data quality rules define the criteria used to profile and evaluate the data, such as thresholds for accuracy, completeness, and consistency. They play a crucial role in ensuring that the data meets predefined standards and is fit for analytical and machine learning purposes. The sketch after this list shows one way these components fit together programmatically.
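To make the relationship between these components concrete, the following sketch uses the boto3 databrew client to register a dataset, attach a data quality ruleset to it, and run a profile job that validates the ruleset. The bucket, role, ARNs, and rule expression are placeholders, and the exact request shapes should be checked against the current boto3 documentation.

import boto3

databrew = boto3.client("databrew")

# Dataset: points DataBrew at source data in S3 (placeholder bucket and key)
databrew.create_dataset(
    Name="orders",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-data-bucket", "Key": "raw/orders.csv"}},
)

# Data quality ruleset: criteria evaluated against the dataset (illustrative expression)
databrew.create_ruleset(
    Name="orders-quality",
    TargetArn="arn:aws:databrew:us-east-1:123456789012:dataset/orders",  # placeholder
    Rules=[
        {
            "Name": "order_id has no missing values",
            "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) <= :val1",
            "SubstitutionMap": {":val1": "0"},
            "ColumnSelectors": [{"Name": "order_id"}],
        }
    ],
)

# Profile job: profiles the dataset and validates the ruleset against it
databrew.create_profile_job(
    Name="orders-profile",
    DatasetName="orders",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder
    OutputLocation={"Bucket": "my-data-bucket", "Key": "profile-output/"},
    ValidationConfigurations=[
        {"RulesetArn": "arn:aws:databrew:us-east-1:123456789012:ruleset/orders-quality"}
    ],
)

databrew.start_job_run(Name="orders-profile")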


Creating rules using the web console interface is a straightforward process. The interface enables users to easily create tailored rules and assign threshold levels for each. 

AWS Glue DataBrew also provides a notable feature: rule suggestions based on the dataset’s profile. This simplifies the process of creating essential checks and helps users become more familiar with the tool’s capabilities.


After initiating a job with data quality rules, the outcomes can be immediately reviewed within the ‘Dataset’ section. This section provides detailed insights into the performance of each rule and helps identify both successes and areas of concern. These insights serve as valuable feedback, guiding subsequent actions and fostering greater trust in the data.

Conclusion

This article has delved into the practical aspects of upholding data quality within the AWS ecosystem, with a particular focus on AWS Glue jobs and AWS Glue DataBrew. It has underscored the significance of AWS Glue jobs in the management, transformation, and validation of data, establishing a solid foundation for data quality. Additionally, it has shed light on the capabilities of AWS Glue DataBrew, a robust tool that facilitates data transformation and excels in data profiling and cleansing.

By harnessing the capabilities of AWS Glue, organizations can establish robust data management pipelines that effectively handle inconsistencies and anomalies. This not only ensures the reliability and quality of data for critical operations but also does so consistently over the long run with minimal ongoing effort. 

About TrackIt

TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.

We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.

Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.

About Arnaud Brown

Arnaud Photo

As a Full Stack Engineer at TrackIt, Arnaud specializes in building serverless projects within the AWS ecosystem. He is passionate about managing large datasets and enjoys creating big data systems.

Arnaud’s goal is to help clients visualize their data effectively and enhance their decision-making processes.

About Joffrey Escobar

Joffrey Photo

As a Cloud Data Engineer at TrackIt, Joffrey brings over five years of experience in developing and implementing custom AWS solutions. His expertise lies in creating high-capacity serverless systems, scalable data infrastructures, and integrating advanced search solutions.

Joffrey is passionate about leveraging technology to meet diverse client needs and ensure robust, secure, and efficient operations.