Nov. 20, 2023
The first article in the Data Quality series highlighted the significance of data quality and the key dimensions that assist in defining high-quality data. The article also discussed the common challenges associated with data quality and introduced techniques to help establish and maintain high data quality standards.
This second installment demonstrates the data quality features offered by AWS Glue, which can be used to prevent a data source (such as a data lake) from turning into a disorderly and unmanageable repository, commonly referred to as a Data Swamp. AWS Glue provides a unified approach to data quality that works for data lakes, traditional databases, and other forms of data storage. It addresses fundamental issues such as improving data reliability, ensuring compliance, and enhancing operational efficiency, and it validates the critical data quality dimensions discussed in the initial article.
The article explores how AWS Glue addresses the central challenges associated with data quality, namely Data Profiling, Data Cleansing, and Data Governance, and how Glue can sustain a high level of data quality over an extended period of time.
To begin, a detailed guide on integrating data quality checks into AWS Glue jobs is provided, followed by a discussion of AWS Glue DataBrew. Known for its proficiency in data profiling and cleansing, DataBrew offers a transparent view of data lineage and streamlines the automation of these processes on incoming data.
AWS Glue is a cloud-based service provided by Amazon Web Services that offers data integration, ETL (Extract, Transform, Load), and data preparation capabilities to simplify and automate the process of managing and transforming data.
Glue provides a robust platform for creating scripts that leverage the extensive capabilities of AWS cloud infrastructure. This allows for seamless handling of massive data volumes, with cost-effective serverless execution. These scripts are highly effective for building ETL/ELT pipelines. The platform also offers a user-friendly visual editor, making it easy to work with data through nodes for tasks like data retrieval, transformation, and writing. Notably, any actions performed in the visual editor can be easily converted into a Python script.
A notable addition to AWS Glue is the Data Quality node. This feature enables users to establish precise rules for data fields. With more than 25 preset rules at their disposal, users can set thresholds for each rule to maintain data accuracy. AWS Glue also permits the inclusion of custom rules for checks the preset rules do not cover. As an example, some of the available rules include ColumnCount, ColumnDataType, ColumnLength, Completeness, DataFreshness, StandardDeviation, and Uniqueness.
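Rules for the Data Quality node are expressed in AWS Glue's Data Quality Definition Language (DQDL). The following ruleset is an illustrative sketch combining several of the rule types listed above; the column names and thresholds are made up for the example:

```
Rules = [
    ColumnCount = 8,
    ColumnDataType "order_id" = "STRING",
    ColumnLength "country_code" = 2,
    Completeness "order_id" > 0.95,
    DataFreshness "order_date" <= 24 hours,
    Uniqueness "order_id" > 0.99
]
```

Each rule can carry its own threshold, so a single ruleset can mix strict structural checks (such as column count and type) with statistical tolerances (such as a 95% completeness floor).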
The Data Quality node provides two output options. The primary output keeps the original data while adding predefined columns, allowing for row-by-row data quality results. The secondary output provides an overall view of data quality outcomes, which can be valuable for making decisions on actions like data cleansing.
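The row-level (primary) output is convenient for routing records downstream, for example keeping passing rows and quarantining failing ones. The sketch below simulates this with plain Python dictionaries rather than an actual Glue DynamicFrame; the outcome column name follows the convention of Glue's row-level results, and the sample records are invented for illustration:

```python
# Simulated rows as they might look after the Data Quality node has
# appended a per-row outcome column (sample data, not real Glue output).
rows = [
    {"order_id": "A-1", "price": 19.99, "DataQualityEvaluationResult": "Passed"},
    {"order_id": None,  "price": 5.00,  "DataQualityEvaluationResult": "Failed"},
    {"order_id": "A-3", "price": 24.50, "DataQualityEvaluationResult": "Passed"},
]

def split_by_quality(records):
    """Separate records that passed the data quality rules from those that failed."""
    passed = [r for r in records if r["DataQualityEvaluationResult"] == "Passed"]
    failed = [r for r in records if r["DataQualityEvaluationResult"] == "Failed"]
    return passed, failed

good, quarantined = split_by_quality(rows)
```

In a real pipeline the two lists would typically be written to separate targets, such as a clean table and a quarantine location for later inspection.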
AWS Glue offers an all-in-one solution for data quality with its editor. A wide range of preset rules helps to ensure high standards of data quality in every aspect. It effectively tackles the inherent challenges of data quality management highlighted in the first article.
Data Quality results for a successful execution (above) and a failed execution (below)
AWS Glue DataBrew is a user-friendly tool created for visual data preparation, allowing data analysts and scientists to refine and standardize data for analytics and machine learning (ML) purposes. The service is equipped with a library of over 250 pre-built transformations for automating various data preparation tasks without the need for coding. This includes tasks such as spotting and removing unusual data, making data formats consistent, and correcting incorrect data entries. Once the data is ready, it can be used for analytics and ML projects. AWS Glue DataBrew charges are based on actual usage, eliminating the need for an initial investment.
To gain a deeper understanding of the data quality features offered by AWS Glue DataBrew, it is essential to understand its architecture. At its core, the service revolves around four components: datasets, projects, recipes, and jobs.
Creating rules using the web console interface is a straightforward process. The interface enables users to easily create tailored rules and assign threshold levels for each.
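The same rules can be defined programmatically. The JSON below sketches the shape of a ruleset as accepted by the DataBrew CreateRuleset API; the rule names, column names, thresholds, account ID, and table ARN are all placeholders for illustration:

```json
{
  "Name": "orders-quality-ruleset",
  "TargetArn": "arn:aws:glue:us-east-1:111122223333:table/sales_db/orders",
  "Rules": [
    {
      "Name": "order_id is populated",
      "CheckExpression": ":col1 is not null",
      "SubstitutionMap": { ":col1": "`order_id`" },
      "Threshold": { "Value": 95, "Type": "GREATER_THAN_OR_EQUAL", "Unit": "PERCENTAGE" }
    },
    {
      "Name": "price within expected range",
      "CheckExpression": ":col1 between :val1 and :val2",
      "SubstitutionMap": { ":col1": "`price`", ":val1": "0", ":val2": "10000" }
    }
  ]
}
```

Defining rulesets as code in this way makes the thresholds reviewable and versionable alongside the rest of the pipeline configuration.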
AWS Glue DataBrew also provides a notable feature: rule suggestions based on the dataset’s profile. This simplifies the process of creating essential checks and helps users become more familiar with the tool’s capabilities.
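These suggestions are derived from the statistics computed during profiling. As a minimal illustration of the kind of statistic involved (pure Python, not the DataBrew API; the helper functions and sample records are hypothetical), the sketch below computes per-column completeness and proposes a completeness check for columns that are already mostly populated:

```python
# Sample records with some missing values (invented for illustration).
sample = [
    {"order_id": "A-1", "country": "US"},
    {"order_id": "A-2", "country": None},
    {"order_id": None,  "country": "FR"},
    {"order_id": "A-4", "country": "US"},
]

def completeness(records, column):
    """Fraction of records whose value in `column` is not null."""
    non_null = sum(1 for r in records if r.get(column) is not None)
    return non_null / len(records)

def suggest_completeness_rules(records, columns, threshold=0.9):
    """Suggest a completeness check for each column already populated
    above `threshold`; returns {column: observed completeness}."""
    return {c: completeness(records, c) for c in columns
            if completeness(records, c) >= threshold}
```

A profiler computes many more statistics (distinct counts, value distributions, string lengths, and so on), but the principle is the same: observed values become candidate thresholds for rules.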
After initiating a job with data quality rules, the outcomes can be immediately reviewed within the ‘Dataset’ section. This section provides detailed insights into the performance of each rule and helps identify both successes and areas of concern. These insights serve as valuable feedback, guiding subsequent actions and fostering greater trust in the data.
This article has delved into the practical aspects of upholding data quality within the AWS ecosystem, with a particular focus on AWS Glue jobs and AWS Glue DataBrew. It has underscored the significance of AWS Glue jobs in the management, transformation, and validation of data, establishing a solid foundation for data quality. Additionally, it has shed light on the capabilities of AWS Glue DataBrew, a robust tool that facilitates data transformation and excels in data profiling and cleansing.
By harnessing the capabilities of AWS Glue, organizations can establish robust data management pipelines that effectively handle inconsistencies and anomalies. This not only ensures the reliability and quality of data for critical operations but also does so consistently over the long run with minimal ongoing effort.
TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA.
TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.
In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.