Aug. 28, 2023
This article constitutes the third in a sequence of five articles dedicated to the implementation of data lakes on AWS. For readers that are new to this series, it is advisable to read the first two articles focused on data ingestion and data storage & cataloging.
The subsequent sections below detail the process of optimizing and enriching data to make it available for analysis.
AWS Glue is a managed solution designed to handle data preparation, loading, and transformation, commonly known as ETL (Extract, Transform, Load). The service functions in a serverless manner, facilitating the execution of necessary actions to enhance your data. It relies on open-source technologies, specifically Spark 3.3, Scala 2, and Python 3, while also incorporating a unique layer for streamlining certain processes.
With Glue, a visual editor can be used to accomplish simple operations with ease, while switching to script mode helps address more complex tasks. The number of workers and their associated computational capabilities can also be modified to meet volume execution requirements.
Amazon EMR (Elastic MapReduce) is a service that facilitates the processing of large volumes of data. The service uses popular open-source frameworks such as Apache Spark and Hadoop to distribute data processing across EC2 instances. EMR offers flexibility and allows choosing from various instance types and quantities to optimize performance and cost.
Amazon EMR is best suited for processing large volumes of data but requires trained personnel. In contrast, AWS Glue, with its intuitive visual editor and serverless architecture is easier to use and ideal for smaller projects.
AWS Glue and Amazon EMR are ideal for processing structured or semi-structured data. However, files such as images, PDFs, and videos do not fit within these categories. In such scenarios, AWS machine learning services help address data processing needs.
The following is a non-exhaustive list of services that AWS AI/ML services:
Amazon SageMaker is a fully-managed service that facilitates the creation, training, and deployment of machine learning models. SageMaker includes modules for data processing, model training, and large-scale deployment.
Amazon Rekognition is an object detection service that simplifies the addition of image and video analysis to applications. Rekognition can identify objects, people, text, scenes, and activities, and can also be used to detect inappropriate content. In a data lake, Rekognition can be employed to analyze images and extract useful metadata.
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. Comprehend can be leveraged to analyze textual data in a data lake and extract valuable information.
Amazon Textract automatically extracts text, handwriting, and data from scanned documents. This service can be used to extract structured information from PDF files.
When considering data storage, the choice of file format is crucial, particularly for large-scale data management systems such as data lakes. Commonly used file formats are generally classified into two categories: row-based file formats and columnar file formats.
Row-Based File Formats (ex: CSV): Store data in rows. Each row of the file contains a complete entry. These formats are particularly useful when working with transactions where each row represents a distinct and independent entity. However, for data analysis operations where calculations are often performed on specific columns rather than entire rows, row-based file formats may be less efficient.
Columnar File Formats (ex: Parquet & ORC): Store data by column. Values for a specific column are stored together. Columnar formats offer two key advantages for data analysis operations, making them the preferred choice for data lakes and warehouses :
A significant hurdle in conventional data lakes is the immutability of data: once recorded, data remains unalterable. This trait can pose challenges when the modification or removal of data becomes necessary for reasons involving compliance, precision, or the upkeep of data quality.
To address this challenge, technologies such as Hudi, Delta Lake, and Iceberg have been developed. These technologies bring large-scale transactional data management capabilities to data lakes. They facilitate the flexible management of data insertions, updates, and deletions while maintaining high performance. Additionally, they also offer features such as data history management, real-time data processing, and data snapshots that help ensure data integrity.
This section explores common transformations that help optimize data manipulation:
Cleaning: Involves detecting and correcting (or removing) errors such as outliers, missing values, or incorrect data. This step helps enhance data quality and renders data more suitable to analysis.
Standardization: Converts data into a common format to facilitate comparison and analysis. Examples of standardization include converting all dates to a specific format or converting all text to lowercase.
Data Type Transformation: Changes the data type of a field. For example, a field stored as a string could be converted to a number or a date.
Filtering: Removes unnecessary data for analysis. A typical example of filtering is the elimination of certain rows or columns that are not relevant to workflow needs.
Joining: Combines data from two or more sources. For example, customer data from one source can be joined with sales data in another to obtain a comprehensive view of customer activity.
Aggregation: Summarizes data by calculating values such as sum, average, maximum, minimum, etc. For instance, sales can be aggregated by month, region, and product.
Deduplication: Removes duplicate records in the data. For example, deduplication helps ensure that only one record is kept for each customer.
Enrichment: Adds new information to the data from other sources. For example, customer data can be enriched with data acquired through machine learning analysis.
All these transformations can be accomplished without code through the visual interface provided by AWS Glue.
Compression is a common transformation that offers several benefits to a data lake implementation. Compressing files directly contributes to lowered storage expenses. The extent of compression, varying according to file characteristics and content, can span from 20% to 80%. This reduction in file sizes subsequently translates to diminished expenses associated with data querying. The cost of queries is contingent on the bytes scanned, and smaller file dimensions culminate in more economical query operations.
Within AWS Glue, the option to compress files during S3 bucket write operations is readily accessible by a simple checkbox selection.
In the scenario provided below, two Glue jobs are introduced. The first job undertakes the transformation of files from data sources into Hudi format. This facilitates the observation of the present data state across diverse sources, the exclusion of unnecessary fields for subsequent pipeline processes, and the population of null values with default values.
The second job joins the “users” tables originating from distinct data sources. While this task could theoretically occur during table queries, it is more resource-efficient to execute it in advance as opposed to executing it recurrently within queries.
The Glue jobs are added to the Step Function and must be executed in order.
Step Function in the Visual Editor – Step 3
The Glue visual editor helps examine and modify scripts. Spark scripts can also be manually composed in either Python or Scala.
Note: Once the script is modified, the graphical interface can no longer be used to edit it.
Example of Transformation in the Glue Visual Editor
The data is now ready to be exploited.
Data Lake Architecture Diagram – Step 3
The next article in this series will focus on the querying and visualization of data using services such as Amazon Athena and Amazon Quicksight.
TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA.
TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.
In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.