This article is the fourth in a five-part series on implementing data lakes on AWS. Readers new to the series are encouraged to start with the first three articles, which cover data ingestion, data storage & cataloging, and data processing.
The sections below cover querying and visualizing data.
Querying Data using Amazon Athena
The traditional approach to querying data involves moving it into an OLAP (Online Analytical Processing) database to run queries against. AWS offers an alternative: Amazon Athena, which performs ad-hoc SQL queries directly on files.
When using Athena, each query is billed based on the number of bytes scanned (as of this writing, $5 per TB scanned in the us-west-2 region*). To prevent unnecessary costs, it is advisable to avoid `SELECT *` queries without limits, as well as queries without filters.
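As a rough sketch of this pricing model, a query's cost can be estimated from the bytes it scans. The $5/TB rate is the us-west-2 figure quoted above, and the 10 MB per-query minimum is Athena's published billing floor; verify current pricing before relying on these constants.

```python
PRICE_PER_TB = 5.00          # USD per TB scanned (us-west-2, August 2023)
MIN_BYTES = 10 * 1024 ** 2   # Athena bills a minimum of 10 MB per query
TB = 1024 ** 4

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a single Athena query."""
    billed = max(bytes_scanned, MIN_BYTES)
    return billed / TB * PRICE_PER_TB

# A full scan of a 200 GB dataset vs. a filtered 2 GB scan:
full_scan = athena_query_cost(200 * 1024 ** 3)    # ~ $0.98
pruned_scan = athena_query_cost(2 * 1024 ** 3)    # ~ $0.01
```

This is why partitioning data and filtering on partition columns pays off directly: the bill tracks bytes scanned, not query runtime.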
Note: Columnar file formats such as Parquet and ORC can greatly reduce query costs. For example, a query such as `SELECT id, name FROM users` scans only the contents of the `id` and `name` columns rather than the entire file (which would be the case if the files were stored in JSON format).
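The effect can be illustrated with a toy byte-count simulation (not an actual file format): in a row-oriented layout such as JSON lines, a query projecting two columns still scans every full record, whereas a columnar layout reads only the projected columns.

```python
import json

# Hypothetical user records with a wide column the query never touches.
users = [
    {"id": i, "name": f"user-{i}", "email": f"user-{i}@example.com",
     "bio": "x" * 200}
    for i in range(1000)
]

# Row-oriented scan: SELECT id, name still reads every whole record.
row_bytes = sum(len(json.dumps(u)) for u in users)

# Columnar scan: only the id and name columns are read.
col_bytes = sum(len(json.dumps(u["id"])) + len(json.dumps(u["name"]))
                for u in users)
```

With records dominated by the unused `bio` column, the columnar scan touches a small fraction of the bytes, and in Athena the bill shrinks proportionally.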
Once a query has run, the results can be found in the designated S3 bucket, where they are retained for 30 days by default.
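Queries can also be submitted programmatically via the boto3 Athena client. Below is a minimal sketch using the `StartQueryExecution` API; the database name and results bucket are hypothetical placeholders.

```python
def build_query_request(sql: str, database: str, output_s3: str) -> dict:
    """Assemble the parameters for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_query(sql: str, database: str, output_s3: str) -> str:
    """Submit the query and return its execution ID; results land in S3."""
    import boto3  # requires AWS credentials at runtime
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_query_request(sql, database, output_s3)
    )
    return response["QueryExecutionId"]

# Example (hypothetical database and bucket names):
# run_query("SELECT id, name FROM users LIMIT 10",
#           "datalake_db", "s3://my-athena-results/queries/")
```

The call is asynchronous: the execution ID can later be passed to `get_query_execution` to poll for completion before reading the result file from S3.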
Amazon Athena – Console Example
Visualization using Amazon QuickSight
The following section covers Amazon QuickSight, a serverless data visualization and business intelligence tool from AWS for building interactive, insightful dashboards, reports, and visualizations from a variety of data sources.
To create visualizations, datasets first need to be imported into QuickSight. The data can originate from sources both inside and outside the AWS ecosystem.
List of Data Sources for Amazon QuickSight
After defining a dataset, a choice must be made between two modes: Direct query or SPICE.
With Direct query, data is fetched from the source each time a visualization is viewed; nothing is stored in QuickSight. This mode is particularly useful for real-time data on databases with fast query engines such as RDS or Redshift.
SPICE (Super-fast, Parallel, In-memory Calculation Engine) is QuickSight's in-memory storage option. SPICE copies the data so dashboards load quickly, eliminating the need to query the source for each new visualization. Each user receives 10 GB of free SPICE capacity, and extra capacity can be purchased as needed. SPICE is most advantageous when queries are resource-intensive or slow for the source system (e.g., Athena or Spark), and it also helps control costs, especially for moderately sized datasets.
QuickSight pricing is based on user count. The service classifies users as “Authors” and “Readers”. Authors can create visualizations and dashboards from the existing datasets; they are charged $24 per month*. Readers can only view dashboards created by authors; they are charged $0.30* per session, capped at $5 per month*.
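Using the rates above, a monthly bill can be sketched as follows. This is a simplification that covers only the per-user charges quoted in this article; actual QuickSight billing also has annual-commitment and capacity-pricing variants.

```python
AUTHOR_MONTHLY = 24.00       # USD per author per month (us-west-2, Aug 2023)
READER_SESSION = 0.30        # USD per reader session
READER_MONTHLY_CAP = 5.00    # per-reader maximum charge per month

def monthly_cost(authors: int, reader_sessions: list) -> float:
    """Estimate the monthly bill; reader_sessions holds each
    reader's session count for the month."""
    reader_total = sum(min(sessions * READER_SESSION, READER_MONTHLY_CAP)
                       for sessions in reader_sessions)
    return authors * AUTHOR_MONTHLY + reader_total

# Two authors, one occasional reader (1 session), one heavy reader
# (100 sessions, so the $5 cap applies):
bill = monthly_cost(2, [1, 100])   # 48.00 + 0.30 + 5.00 = 53.30
```

The per-reader cap means heavy dashboard consumers cost at most $5/month, which is what makes the Reader tier attractive for large audiences.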
* Prices shown are for the us-west-2 region as of August 2023.
In the example below, Athena is used to import data into QuickSight, which requires writing SQL queries to extract the desired information.
Example of Athena Request to Fetch All Meetings Attendees
Given the small size of the datasets used here, and the fact that Athena does not respond instantly, storing the data in SPICE is the preferable approach.
A SPICE dataset is a snapshot of the data at a given moment. To keep dashboards up to date, it needs to be refreshed periodically.
One option is to configure a scheduled (cron-style) refresh in the QuickSight console so the dataset is refreshed daily at a designated time. While straightforward to set up, this approach has a drawback: if the data pipeline fails due to corrupted data, the dashboards may display inconsistent information.
This motivates using a Lambda function to refresh SPICE instead. The Lambda runs at the end of the pipeline, orchestrated by the Step Functions state machine, so the data is not refreshed if the pipeline encounters an error.
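A minimal sketch of such a Lambda, using the QuickSight `CreateIngestion` API to trigger a SPICE refresh. The environment variable names and the ingestion-ID format are hypothetical; the dataset and account IDs would come from the actual deployment.

```python
import time

def make_ingestion_id(now=None) -> str:
    """Build a unique, human-readable ID for each SPICE refresh."""
    ts = time.gmtime(now if now is not None else time.time())
    return "pipeline-refresh-" + time.strftime("%Y-%m-%d-%H-%M-%S", ts)

def lambda_handler(event, context):
    # Invoked as the final state of the Step Functions workflow,
    # so it only runs when the pipeline succeeded.
    import os
    import boto3  # available in the Lambda runtime

    quicksight = boto3.client("quicksight")
    response = quicksight.create_ingestion(
        AwsAccountId=os.environ["AWS_ACCOUNT_ID"],   # hypothetical env vars
        DataSetId=os.environ["DATASET_ID"],
        IngestionId=make_ingestion_id(),
    )
    return {"ingestionStatus": response["IngestionStatus"]}
```

The Lambda's execution role needs `quicksight:CreateIngestion` permission on the dataset, and the Step Functions state machine simply invokes it as its last task.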
Step Function in the Visual Editor – Step 4
Visualizing Data Stored in the Data Lake
At this point, the remaining task is to build the visualizations and dashboards. The following insights are desired in this example:
- The resources dedicated to a project
- The trend of allocated and estimated project resources over the next 90 days
- The distribution of resources based on roles (Backend, Frontend, Ops, etc.)
- Identifying the individuals working full-time on the project
The next step involves the creation of dashboards.
Dashboard Example – Project-related Data
Below is an updated version of the data lake architecture diagram:
Data Lake Architecture Diagram – Step 4
The last article in this series will be dedicated to data governance in a data lake: how to manage data and ensure that only authorized people have access to it.
TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA.
TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.
In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.