About this Series of Articles on Data Lakes

Managed services provided by major cloud providers like AWS have made it feasible for businesses of all sizes to establish data lakes. Building and maintaining a data lake no longer requires substantial expense or a large workforce. Nevertheless, the implementation process is more complex than setting up a traditional database.

This series of articles explores the fundamentals of data lakes, covering the services and methodologies needed to design a customized solution.

Data Engineering Lifecycle - Data Lakes

Data engineering lifecycle from “Fundamentals of Data Engineering” by Joe Reis and Matt Housley

Addressing Data Ingestion

This inaugural chapter of the series focuses on the ingestion of data into the data lake.

Data Ingestion Challenges

Data ingestion within a data lake requires the ability to accommodate a wide variety of data sources, regardless of their structure, how they are acquired, and how frequently the data must be retrieved.

While manually writing scripts to retrieve this data remains a viable option, doing so entails prolonged and arduous labor as well as continual upkeep. AWS offers an array of services that address these challenges and help streamline data ingestion.

AWS Services for Data Ingestion

Application Data with DMS: For data sourced from application databases, data ingestion can be implemented using AWS Database Migration Service (DMS). This service enables the import of data into the data lake and employs a Change Data Capture (CDC) system to ensure continuous, near real-time replication of data modifications.
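
To illustrate, the sketch below uses boto3 (the AWS SDK for Python) to create and start a DMS replication task configured for a full initial load followed by ongoing CDC. All ARNs and the schema name are hypothetical placeholders for endpoints and a replication instance that would already exist in the target environment.

```python
import boto3

dms = boto3.client("dms")

# Hypothetical ARNs: replace with the source/target endpoints and
# replication instance created for your environment.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="app-db-to-datalake",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    # Full initial load, then ongoing Change Data Capture (CDC)
    MigrationType="full-load-and-cdc",
    # Replicate every table in the "public" schema (illustrative)
    TableMappings="""{
      "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-public-schema",
        "object-locator": {"schema-name": "public", "table-name": "%"},
        "rule-action": "include"
      }]
    }""",
)

# In practice, poll describe_replication_tasks until the task status
# is "ready" before starting it.
dms.start_replication_task(
    ReplicationTaskArn=response["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```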

Real-Time Data with Kinesis: In scenarios demanding real-time data management, Amazon Kinesis emerges as a pivotal solution. Engineered to proficiently handle data streams from diverse origins, including mobile applications, IoT devices, website interactions, and log files, Kinesis undertakes the extensive tasks of capturing, processing, and real-time analysis. This functionality empowers expedited responses, as it consistently enriches the data lake with up-to-date and pertinent information.
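
The producer side can be as simple as the sketch below, which writes a JSON event to a hypothetical "clickstream-events" stream using boto3.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name: "clickstream-events" is assumed to exist.
event = {"user_id": "42", "action": "page_view", "page": "/pricing"}

kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    # Records with the same partition key land on the same shard,
    # preserving per-user ordering.
    PartitionKey=event["user_id"],
)
```

On the consuming side, a Kinesis Data Firehose delivery stream can buffer these records and deliver them to the data lake's S3 bucket without custom code.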

IoT Data Collection with AWS IoT: For data originating from IoT (Internet of Things) devices, AWS IoT Core proves to be an indispensable tool. This service securely connects devices such as sensors and home appliances to collect, store, and analyze data. AWS IoT helps process and transmit information in large quantities to a data lake, providing a constant stream of real-time data that serves as the basis for in-depth analysis.
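
A common pattern is an IoT topic rule that forwards device messages toward the data lake. The sketch below, using boto3, routes every message published on a hypothetical "sensors/+/telemetry" topic to a Firehose delivery stream that writes into S3; the role ARN and stream name are placeholders.

```python
import boto3

iot = boto3.client("iot")

# Hypothetical rule: forward every message published on
# "sensors/<device>/telemetry" to a Firehose delivery stream that
# buffers and writes the records into the data lake's S3 bucket.
iot.create_topic_rule(
    ruleName="telemetry_to_datalake",
    topicRulePayload={
        # topic(2) extracts the device ID segment from the MQTT topic
        "sql": "SELECT *, topic(2) AS device_id FROM 'sensors/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",
                    "deliveryStreamName": "telemetry-to-s3",
                    "separator": "\n",
                }
            }
        ],
    },
)
```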

Custom and Publicly Accessible Data with AWS Lambda or AWS Batch: For engagements involving custom or publicly accessible data, the optimal choice between AWS Lambda and AWS Batch hinges on the nature and duration of tasks at hand. AWS Lambda is ideally suited for brief tasks, limited to 15 minutes, activated by events like the addition of a new record to a database or the availability of fresh data via a public API. Conversely, AWS Batch caters to extensive tasks demanding prolonged execution times such as the retrieval and processing of substantial datasets from publicly accessible origins.
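
For the long-running case, jobs can be submitted to AWS Batch programmatically. The sketch below assumes a hypothetical job queue and job definition whose container image downloads and processes a public dataset; a Lambda-based fetcher is sketched later in the example section.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical queue and job definition: the job definition points at a
# container image whose entrypoint downloads and processes the dataset.
batch.submit_job(
    jobName="fetch-public-dataset",
    jobQueue="datalake-ingestion-queue",
    jobDefinition="public-dataset-fetcher:1",
    containerOverrides={
        "environment": [
            {"name": "DATASET_URL", "value": "https://example.com/data.csv"},
            {"name": "TARGET_BUCKET", "value": "my-datalake-raw"},
        ]
    },
)
```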

SaaS Data Integration with Amazon AppFlow: For interactions with SaaS applications such as Salesforce, Slack, or Zendesk, Amazon AppFlow stands as a definitive solution for data ingestion. It orchestrates the seamless transfer of data between these applications and AWS without the need for programming. By means of a few simple configurations, data flows for real-time or scheduled transfers can be established. For example, AppFlow can be used for the periodic synchronization of Salesforce sales data with the data lake or the automated transmission of Zendesk ticket data following any modifications.

Pipeline Orchestration

A multitude of services can be employed to orchestrate pipelines. Conventionally, tools such as Apache Airflow or mage.ai (a more recent option) have been used. However, these tools must be installed and maintained on sizable instances.

In alignment with the preference for serverless and user-friendly solutions, this series of articles will focus on AWS Step Functions.

Distinguished by its serverless architecture and robust integration into the AWS ecosystem, Step Functions makes it possible to define a sequence of actions or services to be initiated in response to certain events. Pipelines can be built either through the graphical interface of Workflow Studio or by writing state machine definitions in Amazon States Language for integration into Infrastructure as Code files (using CDK, Terraform, etc.).
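
As an illustration of the second approach, the sketch below uses boto3 to create a state machine whose Amazon States Language definition runs two hypothetical ingestion Lambdas in parallel; all ARNs are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical Lambda ARNs; replace with real ingestion functions.
BRANCHES = {
    "FetchSourceA": "arn:aws:lambda:us-east-1:123456789012:function:fetch-source-a",
    "FetchSourceB": "arn:aws:lambda:us-east-1:123456789012:function:fetch-source-b",
}

# Amazon States Language definition: a single Parallel state whose
# branches invoke each ingestion Lambda concurrently.
definition = {
    "Comment": "Data lake ingestion pipeline",
    "StartAt": "IngestAllSources",
    "States": {
        "IngestAllSources": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": name, "States": {name: {"Type": "Task", "Resource": arn, "End": True}}}
                for name, arn in BRANCHES.items()
            ],
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="datalake-ingestion",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-ingestion-role",
)
```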

Example

The following example is of a digital services company that wants to gain insights from employee data to understand its profitability rate, employee time allocation, and resource allocation for projects.

The following is the list of data sources to integrate into our data lake:

  • Monday: A project management tool.
  • Clockify: A tool for managing hours allocated to each project.
  • Google Calendar: A calendar tool.

Lambda functions are designed to retrieve pertinent information from the three designated data sources. The next step is to define the frequency of execution. In this scenario, a daily frequency seems reasonable since data from the past 24 hours covers decision-making needs.
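
As an illustration, the sketch below shows what one of these fetchers might look like: a Lambda handler that retrieves the last 24 hours of Clockify time entries and stores them as a raw JSON object in the lake. The bucket name, workspace ID, and endpoint layout are illustrative assumptions; consult the Clockify API documentation for the exact endpoints.

```python
import os
import datetime

import boto3
import urllib3  # bundled with the Lambda Python runtime via botocore

http = urllib3.PoolManager()
s3 = boto3.client("s3")


def handler(event, context):
    """Fetch the last 24 hours of Clockify time entries and store them
    as a raw JSON object in the data lake bucket. Bucket name, workspace
    ID, and endpoint layout are assumptions for illustration."""
    now = datetime.datetime.now(datetime.timezone.utc)
    start = (now - datetime.timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")

    resp = http.request(
        "GET",
        f"https://api.clockify.me/api/v1/workspaces/{os.environ['WORKSPACE_ID']}"
        f"/time-entries?start={start}",
        headers={"X-Api-Key": os.environ["CLOCKIFY_API_KEY"]},
    )

    # Partition raw files by ingestion date for easy cataloging later.
    key = f"raw/clockify/{now:%Y/%m/%d}/time-entries.json"
    s3.put_object(Bucket=os.environ["DATA_LAKE_BUCKET"], Key=key, Body=resp.data)
    return {"status": resp.status, "s3_key": key}
```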

A Step Functions state machine is scheduled to run every day at 3 AM UTC.
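
One way to set up this schedule is with an EventBridge rule, as sketched below; the state machine and role ARNs are placeholders.

```python
import boto3

events = boto3.client("events")

# Cron expression: minute 0, hour 3, every day (EventBridge cron syntax).
events.put_rule(
    Name="daily-ingestion-3am-utc",
    ScheduleExpression="cron(0 3 * * ? *)",
)

# Hypothetical ARNs: the state machine created earlier and a role that
# allows EventBridge to start executions of it.
events.put_targets(
    Rule="daily-ingestion-3am-utc",
    Targets=[{
        "Id": "start-ingestion-pipeline",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:datalake-ingestion",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-role",
    }],
)
```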


Step Function in the Visual Editor – Step 1

The following is a visual representation of the data lake after this initial step. The diagram will expand as the series progresses.


Data Lake Architecture Diagram – Step 1

Conclusion and Next Article

Effective data ingestion forms the foundation of any robust data lake strategy. The ability to seamlessly integrate diverse data sources regardless of their structural variations or acquisition methods ensures accurate and informed analysis. The AWS services discussed in this article help streamline and simplify data ingestion.

The next article in this series will focus on data storage and cataloging.

About TrackIt

TrackIt is an Amazon Web Services Advanced Tier Services Partner based in Marina del Rey, CA, specializing in cloud management, consulting, and software development solutions.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.
