TrackIt

Written by Joffrey Escobar, Cloud Data Engineer

This article constitutes the fourth installment in a sequence of five articles dedicated to the implementation of data lakes on AWS. For readers who are new to this series, it is advisable to read the first three articles focused on data ingestion, data storage & cataloging, and data processing.

The subsequent sections below delve into the process of querying and visualizing data.

Querying Data using Amazon Athena

Data is currently stored in Amazon S3 and the Glue Data Catalog holds essential metadata detailing its structure and location.

The traditional approach to querying data involves the moving of data into an OLAP (Online Analytical Processing) database to facilitate query operations. However, AWS provides an alternative solution called Amazon Athena that helps perform ad-hoc queries directly on files using SQL.

When using Athena, the cost associated with each query is calculated based on the number of bytes scanned. (As of the time of writing the in us-west-2 region, the cost is $5/TB*. To prevent unnecessary costs, it is advisable to avoid using `SELECT *` queries without limits, or queries without filters.

Note: Columnar file formats such as PARQUET and AVRO can greatly reduce query costs. For example, a query such as `SELECT id, name FROM users` will exclusively scan the content within the ID and name columns as opposed to scanning the entire file (which would have been the case had the files been retained in JSON format).

Upon query execution, the outcomes can be located within the designated bucket for a span of 30 days by default.

Amazon Athena – Console Example

Visualization using Amazon QuickSight

The following section focuses on the implementation of Amazon QuickSight, a serverless data visualization and business intelligence tool offered by AWS that enables the creation of interactive and insightful dashboards, reports, and visualizations from various data sources.

Data Sets

To create visualizations, the datasets need to be imported into QuickSight. The data can originate from sources within and without the AWS ecosystem.

List of Data Sources for Amazon Quicksight

After defining a dataset, a choice between two modes needs to be made: Direct query or SPICE.

Direct query

Each time visualizations are viewed, data is fetched from the source to display the dashboard. No information is stored on QuickSight. This mode is particularly useful for real-time data updates on databases with swift query systems such as RDS or Redshift.

SPICE

SPICE (Super-fast, Parallel, In-memory Calculation Engine) is a robust in-memory storage option provided by Quicksight. SPICE duplicates data to facilitate quick-access dashboards, eliminating the need to query data for each new visualization. While each user receives 10 GB of free SPICEs, extra capacity can be procured as necessary. SPICE storage proves more advantageous when queries are resource-intensive or sluggish for the system (ex:, Athena or Spark). Furthermore, it offers cost management benefits, especially for moderately sized datasets.

Pricing

The cost structure for QuickSight is based on the user count. The service classifies users into "Author" and "Reader" categories. "Author" users possess the capability to generate visualizations and dashboards utilizing the pre-existing datasets. They incur a fee of $24 per month*. In contrast, "Reader" users are restricted to viewing dashboards formulated by authors. These users are charged $0.30* per session, with a maximum of $5 per month*.

* The prices display are for the region us-west-2 in August 2023.

Example

In the example provided below, Athena is employed to import data into Quicksight. This necessitates the formulation of SQL queries enabling the extraction of desired information.

Example of Athena Request to Fetch All Meetings Attendees

Taking into account the limited nature of the datasets being used coupled with the fact that Athena does not provide an immediate response, retaining data within SPICEs is a preferable approach.

SPICEs serve as a snapshot of data at a given moment. In order to have up-to-date data on dashboards, the SPICEs need to be periodically refreshed.

To achieve this, a cron job can be configured within the QuickSight console. This ensures our dataset is refreshed daily at a designated time. While this represents a straightforward implementation, it does carry a drawback. In instances where the data pipeline encounters issues due to data corruption, the dashboards might exhibit inconsistent information.

This supports the decision to use a Lambda function to update the SPICEs. The execution of this Lambda will take place at the conclusion of the pipeline, orchestrated by the Step function in order to avoid refreshing data in the event of an error in the pipeline.

Step Function in the Visual Editor - Step 4

Visualizing Data Stored in the Data Lake

At this point, the remaining task involves crafting the visualizations and dashboards. The following is a list of desired insights in this example:

The resources dedicated to a project
The evolving trends of allocated and estimated project over the ensuing 90 days.
The distribution of resources based on roles (Backend, Frontend, Ops, etc.)
Identifying the individuals working full-time on the project

The next step involves the creation of dashboards.

Dashboard Example - Project-related Data

Below is an updated version of the data lake architecture diagram:

Data Lake Architecture Diagram - Step 4

The last article in this will be dedicated to data governance in a data lake - how to manage data and ensure that only authorized people have access to it.

About TrackIt

https://www.youtube.com/watch?v=QBiJ156cA2I

TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.

We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.

Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.

Data Lakes on AWS (Part 4/5) - Data Querying and Visualization

Querying Data using Amazon Athena

Visualization using Amazon QuickSight

Data Sets

Direct query

SPICE

Pricing

Example

Visualizing Data Stored in the Data Lake

Next Article

About TrackIt

About Joffrey Escobar

Your AWS Engineering Team

Contact

Socials