This is the fifth and final article in a five-part series on implementing data lakes on AWS. Readers new to the series are advised to start with the first four articles, which cover data ingestion, data storage & cataloging, data processing, and data querying & visualization.

The sections below focus on data governance within the data lake and explore the processes used to ensure data quality, security, and efficiency.

Data Governance

Data governance serves as a structured framework for efficiently managing data within an organization. It defines who can access and use the data, how they can do so, and for what purposes. In essence, data governance ensures responsible, accurate, and secure data handling.

In the context of a data lake, data governance becomes particularly crucial. Unlike more structured systems, data lakes can rapidly accumulate vast and diverse datasets. Without effective governance, a data lake can quickly devolve into a “data swamp,” in which data is difficult to locate, understand, or trust. Effective governance in this setting involves categorizing data correctly, keeping it accessible, making it traceable, and protecting it. It also encompasses maintaining data quality, understanding where data comes from, and aligning its use with existing policies and regulations.

Lake Formation

AWS Lake Formation is an Amazon Web Services (AWS) service that streamlines setting up, securing, and managing a data lake. It automates many of the labor-intensive tasks involved in data lake setup and provides a suite of tools for enforcing data security and governance standards.

Easy Data Lake Setup

AWS Lake Formation offers a simplified way to get a data lake started. Rather than requiring many complex settings, it streamlines the essential setup steps: once the data sources and cataloging preferences are defined, the service handles the remaining setup tasks. This reduces the usual learning curve associated with creating a data lake and provides a well-structured starting point for data integration and use.

Tag-based Access Control

One significant concern associated with data lakes pertains to the need for strict access control, ensuring that only authorized individuals gain entry to specific data. Lake Formation introduces a tag-based control system, which represents a departure from the traditional approach of managing permissions on a per-file basis. Instead, this system allows for the assignment of tags to data, such as “finance” or “personnel,” facilitating the precise delineation of who may access data associated with particular tags. This approach streamlines the management of permissions, a particularly advantageous feature when dealing with substantial data volumes.

These tags can be applied at various levels of granularity, from entire databases down to individual tables or even specific columns within those tables. For instance, broad access can be granted to entire tables while access to highly sensitive columns is restricted to specific team members. This precision ensures that users interact only with the data relevant to their responsibilities.
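The inheritance and matching behavior described above can be illustrated with a small, self-contained sketch. This is a simplified model for intuition only, not Lake Formation's actual evaluation logic; the tag key `domain` and the column names are hypothetical.

```python
from typing import Dict, Set


def effective_tags(*levels: Dict[str, str]) -> Dict[str, str]:
    """Merge tag dicts ordered from broadest (database) to narrowest (column).

    A value set on a narrower level overrides the inherited one.
    """
    merged: Dict[str, str] = {}
    for level in levels:
        merged.update(level)
    return merged


def is_allowed(granted: Dict[str, Set[str]], tags: Dict[str, str]) -> bool:
    """Return True if, for every granted key, the resource's value is permitted."""
    return all(tags.get(key) in values for key, values in granted.items())


# Hypothetical tags: the database is "finance"; one column is overridden to "personnel".
database_tags = {"domain": "finance"}
amount_column = effective_tags(database_tags, {}, {})
salary_column = effective_tags(database_tags, {}, {"domain": "personnel"})

analyst_grant = {"domain": {"finance"}}
print(is_allowed(analyst_grant, amount_column))  # True
print(is_allowed(analyst_grant, salary_column))  # False
```

In this model, a principal granted the “finance” value can read every column that inherits the database tag, while the column retagged as “personnel” stays hidden.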

Integration with IAM

AWS Identity and Access Management (IAM) is a service that helps control access to AWS resources. The integration of Lake Formation with IAM represents a natural progression, enhancing the efficiency and consistency of access management within the broader AWS ecosystem.

Through the combined use of Lake Formation and IAM, administrators gain the ability to harmonize their access policies. This convergence enables roles, users, and permissions established within IAM to be directly employed in governing access to data lake resources. For example, a designated IAM role can be given rights to interact with specific data tagged within Lake Formation, thereby ensuring uniformity across AWS services.

Data Filters

Lake Formation data filters offer the capability to refine permissions further. Instead of granting access to an entire dataset, these filters enable limitations on specific data portions. For example, users may be permitted to view all financial transactions from a particular month, but not those of other months.

Auditing and Monitoring

Security extends beyond defining permissions; it also involves monitoring data access: who accessed the data, when, and how. Lake Formation includes built-in auditing that records data lake operations. Whether investigating an unusual incident or performing a routine access review, the activity log is available for scrutiny. This transparency supports compliance with regulatory requirements.
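As one possible way to review this activity programmatically, Lake Formation API calls are recorded by AWS CloudTrail, so recent events can be looked up by event source. The sketch below only builds the request parameters; the helper name is hypothetical, and using CloudTrail here is an assumption, since the article does not specify which audit tooling was used.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lake_formation_audit_query(hours: int, now: Optional[datetime] = None) -> dict:
    """Build CloudTrail lookup parameters covering the last `hours` hours."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }


query = lake_formation_audit_query(24)
# With boto3 installed and credentials configured, the query could be run as:
#   boto3.client("cloudtrail").lookup_events(**query)
```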

Example

The final task involves managing permissions within the data lake.

This example keeps things simple and creates three users:

  • Admin
  • PM Europe
  • PM US

Creating three users with different permissions for the data lake implementation.

Administrative users will have access to all data, whereas the PMs will be granted access solely to non-sensitive user data pertinent to their respective regions.

The first step involves adding the S3 buckets in Lake Formation, so the service can manage permissions for the data they contain.

S3 buckets and AWS Lake Formation
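The same registration can also be scripted instead of done in the console. The following is a minimal sketch that builds the parameters for the Lake Formation `register_resource` call; the bucket names are hypothetical, and using the service-linked role is an assumption rather than a detail taken from this setup.

```python
def register_bucket_request(bucket_name: str) -> dict:
    """Build register_resource parameters for one S3 bucket."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        # Assumption: let Lake Formation use its service-linked role.
        "UseServiceLinkedRole": True,
    }


# Hypothetical bucket names, for illustration only.
requests = [register_bucket_request(name) for name in ("my-raw-data", "my-processed-data")]
# With boto3 installed and credentials configured, each could be sent as:
#   boto3.client("lakeformation").register_resource(**request)
```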

Next, an LF-Tag named “sensitive” is created with two possible values: “yes” and “no.”

The “no” value of the “sensitive” tag is assigned to the database, and all child elements (tables and columns) automatically inherit it.
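For readers who prefer scripting these two steps, the sketch below builds the parameters for the Lake Formation `create_lf_tag` and `add_lf_tags_to_resource` calls. The database name is hypothetical; the tag key and values come from the example.

```python
DATABASE_NAME = "datalake_db"  # hypothetical database name

# Parameters for creating the tag with both allowed values.
create_tag_request = {
    "TagKey": "sensitive",
    "TagValues": ["yes", "no"],
}

# Parameters for tagging the whole database as non-sensitive;
# tables and columns inherit this value unless it is overridden.
tag_database_request = {
    "Resource": {"Database": {"Name": DATABASE_NAME}},
    "LFTags": [{"TagKey": "sensitive", "TagValues": ["no"]}],
}

# With boto3 installed and credentials configured:
#   client = boto3.client("lakeformation")
#   client.create_lf_tag(**create_tag_request)
#   client.add_lf_tags_to_resource(**tag_database_request)
```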

To restrict access to sensitive information, the “yes” value is assigned to the relevant columns within the tables:

Navigate to Lake Formation > Table > Edit > Edit Schema > Edit LF-Tags
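The console path above has a scripted counterpart in `add_lf_tags_to_resource` with a `TableWithColumns` resource. The sketch below only builds the request; the database, table, and column names are hypothetical, since the article does not list which columns were marked sensitive.

```python
from typing import Iterable


def tag_columns_request(database: str, table: str, columns: Iterable[str], value: str) -> dict:
    """Build parameters that set the "sensitive" tag on specific columns."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "LFTags": [{"TagKey": "sensitive", "TagValues": [value]}],
    }


# Hypothetical table and column names, for illustration only.
request = tag_columns_request("datalake_db", "users", ["email", "phone_number"], "yes")
# With boto3 installed and credentials configured:
#   boto3.client("lakeformation").add_lf_tags_to_resource(**request)
```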

Lastly, it is necessary to establish a data filter for European and US users, ensuring that they exclusively access data pertaining to their respective regions.

Creation of the Data Filter to Restrict Visibility for European Users
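A region filter like this one can also be defined through the Lake Formation `create_data_cells_filter` call. The sketch below builds its parameters under several assumptions: the account ID is a placeholder, the table and filter names are hypothetical, and the table is assumed to have a `region` column and the sensitive columns shown, none of which is confirmed by the article.

```python
from typing import Iterable


def region_filter_request(
    account_id: str,
    database: str,
    table: str,
    filter_name: str,
    region: str,
    excluded_columns: Iterable[str] = (),
) -> dict:
    """Build parameters for a row filter that keeps only one region's rows."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Assumption: the table has a "region" column.
            "RowFilter": {"FilterExpression": f"region = '{region}'"},
            # All columns except the excluded (sensitive) ones.
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
        }
    }


# Hypothetical values, for illustration only.
europe_filter = region_filter_request(
    "123456789012", "datalake_db", "users", "europe-only", "EU",
    excluded_columns=["email", "phone_number"],
)
# With boto3 installed and credentials configured:
#   boto3.client("lakeformation").create_data_cells_filter(**europe_filter)
```

A matching filter for US users would differ only in the filter name and the region value.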

The final step entails assigning the appropriate permissions to the users as follows:

  • Admin users are granted permissions for both the “yes” and “no” values.
  • PMs are provided permissions exclusively for the “no” value.

Additionally, the data filter is applied to the PMs.

Admin Permissions Tags and Data Filter

PM Permissions Tags

PM Data Filter
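The grants described above map onto the Lake Formation `grant_permissions` call. The following sketch builds the parameters for a tag-based grant and a data-filter grant; the principal ARNs, account ID, and resource names are hypothetical, and granting only `SELECT` is an assumption (a real setup may need additional permissions).

```python
from typing import Iterable


def grant_by_tag_request(principal_arn: str, tag_values: Iterable[str]) -> dict:
    """Build a SELECT grant on tables whose "sensitive" tag matches tag_values."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "sensitive", "TagValues": list(tag_values)}],
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_filter_request(
    principal_arn: str, account_id: str, database: str, table: str, filter_name: str
) -> dict:
    """Build a SELECT grant on an existing data filter."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principals, for illustration only.
ADMIN = "arn:aws:iam::123456789012:user/admin"
PM_EUROPE = "arn:aws:iam::123456789012:user/pm-europe"

admin_grant = grant_by_tag_request(ADMIN, ["yes", "no"])
pm_tag_grant = grant_by_tag_request(PM_EUROPE, ["no"])
pm_filter_grant = grant_filter_request(
    PM_EUROPE, "123456789012", "datalake_db", "users", "europe-only"
)
# With boto3 installed and credentials configured, each could be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```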

Following these steps, the admin account can still query all data, whereas each PM account can access only non-sensitive data for users within its designated region.

The infrastructure setup is now complete, with AWS Lake Formation responsible for the management of resources. The updated architecture diagram is as follows:

Data Lake Architecture Diagram – Step 5

Conclusion

This article marks the end of a five-part series dedicated to the implementation of Data Lakes on AWS. Throughout this series, we have provided a comprehensive overview of the essential steps required to establish a data lake on AWS. For companies considering the deployment of a data lake solution on AWS, it is recommended to collaborate with an AWS partner like TrackIt with deep expertise in AWS Data Analytics workflows to ensure a successful implementation.

About TrackIt

TrackIt, based in Marina del Rey, CA, is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.
