TrackIt

Written by Joffrey Escobar, Cloud Data Engineer

This article is the fifth and final installment in a five-part series on implementing data lakes on AWS. Readers who are new to the series are advised to start with the first four articles, which cover data ingestion, data storage & cataloging, data processing, and data querying & visualization.

The sections below focus on data governance within the data lake and explore the processes used to ensure data quality, security, and efficiency.

Data Governance

Data governance serves as a structured framework for efficiently managing data within an organization. It defines who can access and use the data, how they can do so, and for what purposes. In essence, data governance ensures responsible, accurate, and secure data handling.

In the context of a data lake, data governance becomes particularly crucial. Unlike more organized systems, data lakes can rapidly accumulate vast and diverse datasets. Without efficient governance, a data lake can quickly devolve into what is often termed a "data swamp," where data is difficult to locate, understand, or trust. Effective governance in this setting involves categorizing the data correctly, keeping it accessible, making it traceable, and protecting it. It also covers maintaining data quality, understanding where the data originates, and aligning its use with existing policies and regulations.

Lake Formation

AWS Lake Formation, offered by Amazon Web Services (AWS), simplifies setting up, securing, and managing a data lake. The service automates many of the labor-intensive tasks involved in data lake setup and provides a suite of tools for upholding data security and governance standards.

Easy Data Lake Setup

AWS Lake Formation offers a simplified means to kick-start a data lake. Instead of delving into numerous complex settings, Lake Formation streamlines the essential setup steps. Once the data sources and cataloging preferences are defined, the service takes care of all the necessary setup tasks. This user-friendly approach significantly reduces the usual learning curve associated with creating a data lake, providing a well-structured starting point for data integration and utilization.

Tag-based Access Control

A significant concern with data lakes is the need for strict access control, so that only authorized individuals can reach specific data. Lake Formation introduces a tag-based control system, a departure from the traditional approach of managing permissions on a per-file basis. Tags such as "finance" or "personnel" are assigned to data, and permissions then define precisely who may access data carrying particular tags. This simplifies permission management, which is particularly advantageous when dealing with substantial data volumes.

These tags can be applied at various levels of granularity, ranging from entire databases down to individual tables or even specific columns within those tables. For instance, broad access can be granted to an entire table while access to its highly sensitive columns is restricted to specific team members. This precision ensures that users interact only with data directly pertinent to their responsibilities.
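The matching and inheritance behavior described above can be illustrated with a small toy model. This is only a sketch of the idea and not the Lake Formation API: the tag key "domain", the "sensitive" key, and the sample columns are assumptions chosen for illustration.

```python
# Toy model of tag-based access control; this is NOT the Lake Formation API.
# The tag keys, tag values, and sample columns below are illustrative assumptions.

def effective_tags(database_tags, table_tags=None, column_tags=None):
    """Resolve inheritance: a child keeps its parent's tags unless it overrides them."""
    tags = dict(database_tags)
    tags.update(table_tags or {})
    tags.update(column_tags or {})
    return tags


def matches(expression, resource_tags):
    """Return True if the resource satisfies every key of the grant expression.

    expression:    {tag_key: set of allowed values}; keys are ANDed, values are ORed
    resource_tags: {tag_key: value} attached to (or inherited by) the resource
    """
    return all(
        resource_tags.get(key) in allowed
        for key, allowed in expression.items()
    )


# A grant for analysts who may read non-sensitive finance data.
finance_grant = {"domain": {"finance"}, "sensitive": {"no"}}

database_tags = {"domain": "finance", "sensitive": "no"}
amount_column = effective_tags(database_tags)  # inherits everything
salary_column = effective_tags(database_tags, column_tags={"sensitive": "yes"})

print(matches(finance_grant, amount_column))  # True
print(matches(finance_grant, salary_column))  # False
```

The point of the sketch is that permissions are written once against tag values, so newly tagged tables or columns are covered without touching the grants themselves.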

Integration with IAM

AWS Identity and Access Management (IAM) is a service that helps control access to AWS resources. The integration of Lake Formation with IAM represents a natural progression, enhancing the efficiency and consistency of access management within the broader AWS ecosystem.

Through the combined use of Lake Formation and IAM, administrators gain the ability to harmonize their access policies. This convergence enables roles, users, and permissions established within IAM to be directly employed in governing access to data lake resources. For example, a designated IAM role can be given rights to interact with specific data tagged within Lake Formation, thereby ensuring uniformity across AWS services.

Data Filters

Lake Formation data filters offer the capability to refine permissions further. Instead of granting access to an entire dataset, these filters restrict access to specific rows or columns of a table. For example, users may be permitted to view all financial transactions from a particular month, but not those of other months.

Auditing and Monitoring

Security extends beyond defining permissions; it also means monitoring data access: who accessed the data, when, and how. Lake Formation incorporates auditing tools that record data lake operations. Whether an unusual incident needs investigating or access is being verified routinely, the activity log is readily available for review. This transparency supports compliance with diverse regulatory requirements.
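One possible way to read such an activity log programmatically is through AWS CloudTrail, which records Lake Formation API calls. The sketch below assumes boto3 is installed and credentials are configured; the helper names (lake_formation_lookup_args, summarize, recent_activity) are hypothetical and exist only for this illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def lake_formation_lookup_args(hours=24, now=None):
    """Build arguments for CloudTrail's lookup_events, limited to Lake Formation."""
    now = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
    }


def summarize(event):
    """Reduce one CloudTrail event to who did what, when, and from where."""
    detail = json.loads(event.get("CloudTrailEvent", "{}"))
    return {
        "when": event.get("EventTime"),
        "who": event.get("Username"),
        "action": event.get("EventName"),
        "source_ip": detail.get("sourceIPAddress"),
    }


def recent_activity(cloudtrail_client, hours=24):
    """Yield summaries of recent Lake Formation events, following pagination."""
    paginator = cloudtrail_client.get_paginator("lookup_events")
    for page in paginator.paginate(**lake_formation_lookup_args(hours)):
        for event in page["Events"]:
            yield summarize(event)


# Usage (requires AWS credentials):
#   import boto3
#   for entry in recent_activity(boto3.client("cloudtrail")):
#       print(entry)
```

For long-term retention or large volumes, querying trail logs delivered to S3 may be a better fit than the lookup API; the sketch only covers the quick, interactive case.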

Example

The final task involves managing permissions within the data lake.

This example keeps things relatively simple and creates three users:

Admin
PM Europe
PM US

Administrative users will have access to all data, whereas the PMs will be granted access solely to non-sensitive user data pertinent to their respective regions.

The first step involves registering the S3 buckets with Lake Formation so that the service can manage permissions for the data they contain.
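For readers who prefer scripting to the console, this step could look roughly like the sketch below, which uses the boto3 Lake Formation client's register_resource call. The bucket names and helper functions are hypothetical, and using the service-linked role is an assumption; a dedicated IAM role can be supplied through RoleArn instead.

```python
def register_request(bucket_name, prefix=""):
    """Build the arguments for register_resource for a bucket or a prefix in it."""
    arn = f"arn:aws:s3:::{bucket_name}"
    if prefix:
        arn += "/" + prefix.strip("/")
    # Assumption: the Lake Formation service-linked role is used for access.
    return {"ResourceArn": arn, "UseServiceLinkedRole": True}


def register_buckets(lakeformation_client, bucket_names):
    """Register each bucket so Lake Formation can manage access to its data."""
    for name in bucket_names:
        lakeformation_client.register_resource(**register_request(name))


# Usage (requires AWS credentials; bucket names are placeholders):
#   import boto3
#   register_buckets(boto3.client("lakeformation"),
#                    ["example-datalake-raw", "example-datalake-processed"])
```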

Subsequently, a "sensitive" tag will be established, encompassing two distinct values: "yes" and "no."
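Scripted with boto3, creating this tag might look like the following sketch. The create_lf_tag call is part of the Lake Formation client; the helper names are hypothetical, and the optional catalog ID defaults to the caller's account when omitted.

```python
def sensitive_tag_request(catalog_id=None):
    """Build the arguments for create_lf_tag for the "sensitive" tag."""
    request = {"TagKey": "sensitive", "TagValues": ["yes", "no"]}
    if catalog_id:
        # Only needed when the tag lives in another account's Data Catalog.
        request["CatalogId"] = catalog_id
    return request


def create_sensitive_tag(lakeformation_client, catalog_id=None):
    """Create the "sensitive" LF-Tag with its two allowed values."""
    lakeformation_client.create_lf_tag(**sensitive_tag_request(catalog_id))


# Usage (requires AWS credentials):
#   import boto3
#   create_sensitive_tag(boto3.client("lakeformation"))
```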

The "no" value of the "sensitive" tag (that is, non-sensitive) is assigned to the database, and all child elements inherit it automatically.

To restrict access to sensitive information, the "yes" value is then assigned to the sensitive columns within the tables:

Navigate to Lake Formation > Table > Edit > Edit Schema > Edit LF-Tags
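The two tagging steps can also be scripted. The sketch below uses the boto3 add_lf_tags_to_resource call; the database name, table name, and column names are hypothetical, since the article does not list which columns are treated as sensitive.

```python
def tag_database_request(database, value="no"):
    """Arguments that tag a whole database; tables and columns inherit the value."""
    return {
        "Resource": {"Database": {"Name": database}},
        "LFTags": [{"TagKey": "sensitive", "TagValues": [value]}],
    }


def tag_columns_request(database, table, columns, value="yes"):
    """Arguments that override the inherited value on specific columns."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "LFTags": [{"TagKey": "sensitive", "TagValues": [value]}],
    }


def apply_tags(lakeformation_client, request):
    """Send one tagging request and return any failures reported by the service."""
    response = lakeformation_client.add_lf_tags_to_resource(**request)
    return response.get("Failures", [])


# Usage (requires AWS credentials; all names below are placeholders):
#   import boto3
#   lf = boto3.client("lakeformation")
#   apply_tags(lf, tag_database_request("datalake_db"))
#   apply_tags(lf, tag_columns_request("datalake_db", "users", ["email", "phone"]))
```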

Next, it is necessary to establish a data filter for European and US users, ensuring that they exclusively access data pertaining to their respective regions.

Creation of the Data Filter to Restrict Visibility for European Users
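An equivalent filter could be created through boto3 with create_data_cells_filter, as in the sketch below. The account ID, database, table, region column, and region values are all assumptions made for illustration; the row filter expression must match how regions are actually stored in the table.

```python
def region_filter_request(account_id, database, table, region,
                          region_column="region", excluded_columns=()):
    """Build the arguments for a row-level filter that keeps a single region."""
    column_wildcard = {}  # an empty wildcard means "all columns"
    if excluded_columns:
        column_wildcard["ExcludedColumnNames"] = list(excluded_columns)
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": f"{table}_{region.lower()}_only",
            # Assumption: regions are stored as plain strings such as 'EU' or 'US'.
            "RowFilter": {"FilterExpression": f"{region_column} = '{region}'"},
            "ColumnWildcard": column_wildcard,
        }
    }


def create_region_filter(lakeformation_client, **kwargs):
    """Create the data filter in Lake Formation and return its name."""
    request = region_filter_request(**kwargs)
    lakeformation_client.create_data_cells_filter(**request)
    return request["TableData"]["Name"]


# Usage (requires AWS credentials; all values are placeholders):
#   import boto3
#   create_region_filter(boto3.client("lakeformation"),
#                        account_id="111122223333", database="datalake_db",
#                        table="users", region="EU")
```

A second call with region="US" would produce the matching filter for US users.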

The final step entails assigning the appropriate permissions to the users as follows:

Admin users are granted permissions for both the "yes" and "no" values.
PMs are provided permissions exclusively for the "no" value.

Additionally, the data filter is applied to the PMs.
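The permission assignments above could be expressed with the boto3 grant_permissions call roughly as follows. The principal ARNs, account ID, database, table, and filter name are hypothetical, and granting SELECT and DESCRIBE is an assumption about what "access" should mean here.

```python
def tag_grant_request(principal_arn, tag_values,
                      resource_type="TABLE", permissions=("SELECT", "DESCRIBE")):
    """Grant permissions on every resource whose "sensitive" tag has one of tag_values."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,
                "Expression": [
                    {"TagKey": "sensitive", "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
    }


def filter_grant_request(principal_arn, account_id, database, table, filter_name):
    """Grant SELECT through an existing data filter (row-level restriction)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT"],
    }


# Usage (requires AWS credentials; all identifiers are placeholders):
#   import boto3
#   lf = boto3.client("lakeformation")
#   admin = "arn:aws:iam::111122223333:user/admin"
#   pm_eu = "arn:aws:iam::111122223333:user/pm-europe"
#   lf.grant_permissions(**tag_grant_request(admin, ["yes", "no"]))
#   lf.grant_permissions(**tag_grant_request(pm_eu, ["no"]))
#   lf.grant_permissions(**filter_grant_request(
#       pm_eu, "111122223333", "datalake_db", "users", "users_eu_only"))
```

Running a test query as each principal afterwards is a quick way to confirm the combined effect of the tag grants and the data filter.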

Admin Permissions Tags and Data Filter

PM Permissions Tags

PM Data Filter

Following these steps, the admin account will retain the capability to query all data, whereas each PM account will be limited to non-sensitive data for users within its designated region.

The infrastructure setup is now complete, with AWS Lake Formation responsible for the management of resources. The updated architecture diagram is as follows:

Data Lake Architecture Diagram - Step 5

Conclusion

This article marks the end of a five-part series dedicated to the implementation of data lakes on AWS. Throughout this series, we have provided a comprehensive overview of the essential steps required to establish a data lake on AWS. Companies considering the deployment of a data lake solution on AWS are encouraged to collaborate with an AWS partner such as TrackIt that has deep expertise in AWS Data Analytics workflows to ensure a successful implementation.

About TrackIt

https://www.youtube.com/watch?v=QBiJ156cA2I

TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.

We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.

Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.

Data Lakes on AWS (Part 5/5) - Data Governance