How to Build a Data Lake on AWS?
Many customers need a data storage and analytics solution that is more flexible and robust than traditional data management systems. Amazon Web Services (AWS) offers the data lake: a new way to store and analyze huge volumes of data securely, at low cost, with easy search and analysis capabilities across a variety of data types.
A Data Lake on AWS is a central storage repository that lets you store both structured and unstructured data at any scale. With the Data Lake on AWS architecture, data can be stored as-is (without having to structure it first), and different kinds of analytics, from big data processing to machine learning, can be run against it for better decision making. For this purpose, AWS offers an automated, cost-effective, and highly available Data Lake architecture with a user-friendly console for searching and requesting datasets in real time.
The core AWS services are automatically configured to support tagging, searching, sharing, and governing specific subsets of data across an organization or with external users. Using this architecture, users can catalog and upload new datasets of any size with searchable metadata, and can catalog existing datasets in Amazon S3 with minimal effort. These datasets integrate with AWS Glue and Amazon Athena for further transformation and analysis.
Figure 1: Data Lake on AWS
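As a simple illustration of storing data as-is, the snippet below lands a raw JSON event in Amazon S3 using boto3; the bucket name and key are assumptions for illustration only:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key: raw events are stored exactly as they
# arrive, with no upfront schema; structure is applied later by tools
# such as AWS Glue and Amazon Athena.
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/clickstream/2020/02/events.json",
    Body=b'{"user": "u123", "page": "/home", "ts": "2020-02-01T12:00:00Z"}',
)
```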
Why Do You Need a Data Lake?
A data lake helps a business identify and act on opportunities for faster growth while attracting and retaining customers through better decision making. A data lake offers:
- Unlimited Data Management: A data lake stores an unlimited amount of data in its original format and fidelity, and acts as an online system where data is always available for queries.
- Cost Reduction and Faster Data Preparation: High-performance data processing workloads can be migrated to a data lake at low cost and run in parallel much faster than before.
- Analytic Agility: A data lake provides a self-service environment in which analysts and data scientists can rapidly integrate, explore, and analyze any data they require, applying structure incrementally at the right time rather than necessarily up front.
- Not Limited to Standard SQL: A data lake opens up machine learning, full-text search, scripting, and connectivity to existing business intelligence and analytics platforms without being limited to standard SQL, which makes it a cost-effective way to run data-oriented experiments and analysis over unlimited amounts of data.
Components of Data Lake
A data lake involves three main operations: data ingestion, catalog building, and processing. AWS provides multiple services for each of these operations; the following are a few.
Ingestion
Amazon Kinesis is one of the ingestion options offered by AWS, and it makes streaming data into the lake straightforward. It helps you build custom applications that process or analyze streaming data, including with standard SQL queries. With Kinesis, records from sources such as mobile applications and websites are pushed into a stream, where they can be archived to the lake or analyzed over sliding windows, with results stored in a database such as Amazon DynamoDB.
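As a brief sketch, a producer pushing one event into a Kinesis data stream with boto3 might look like the following; the stream name and event fields are assumptions for illustration:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream; a website or mobile app would push events like this.
event = {"user": "u123", "page": "/home", "ts": "2020-02-01T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream-ingest",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user"],  # determines which shard receives the record
)
```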
Building Catalog
The catalog holds information about key aspects of the stored data, such as its format, classification, and the tags used to search metadata in the data lake; the catalog itself is kept in its own storage location.
Figure 2: Building a Catalog
You can build a catalog in the following steps:
- When an object is put into an Amazon S3 bucket, an event is triggered that invokes an AWS Lambda function: a piece of code that runs without you having to manage any infrastructure.
- The invoked function extracts metadata from the stored object and writes the extracted metadata to a NoSQL database such as Amazon DynamoDB.
- AWS Lambda then picks this metadata up and pushes it into Amazon Elasticsearch Service, which different teams can query to skim through the catalog; a minimal sketch of such a function follows this list.
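Below is a minimal sketch of such a catalog-building Lambda function, assuming a hypothetical DynamoDB table named data-lake-catalog with an objectKey partition key; it reads the new S3 object's metadata without downloading the object itself:

```python
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

# Hypothetical catalog table with partition key "objectKey" (string).
CATALOG_TABLE = "data-lake-catalog"


def handler(event, context):
    """Triggered by an S3 PUT event; catalogs the new object's metadata."""
    table = dynamodb.Table(CATALOG_TABLE)
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch the object's metadata without downloading its body.
        head = s3.head_object(Bucket=bucket, Key=key)

        table.put_item(
            Item={
                "objectKey": f"{bucket}/{key}",
                "size": head["ContentLength"],
                "contentType": head.get("ContentType", "unknown"),
                "lastModified": head["LastModified"].isoformat(),
            }
        )
```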
In AWS, users and teams access this metadata through APIs built on top of it, and the Amazon API Gateway service is used to build a website that supports searching through the data lake. These APIs connect to backend components such as AWS Lambda, EC2, or a public endpoint where the catalog is built.
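Purely as an illustration, querying such a search API from a client could look like the following; the endpoint URL and the query parameter are hypothetical and depend on how the API Gateway resources are actually defined:

```python
import requests

# Hypothetical API Gateway endpoint for the catalog search API.
API_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/search"

# Search the catalog for datasets matching the term "clickstream".
response = requests.get(API_URL, params={"terms": "clickstream"})
response.raise_for_status()

for dataset in response.json().get("results", []):
    print(dataset["objectKey"], dataset.get("contentType"))
```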
Processing
Processing unlimited data serves different purposes, and different techniques suit different workloads. AWS offers a variety of data processing services such as Amazon EMR, Athena, and Redshift. Amazon Athena is widely used for processing data in data lakes and offers the following benefits:
- Serverless (no ETL required)
- Pay per query (you pay only for the data you scan)
- Built on Presto (runs standard SQL)
- Fast, interactive performance for large datasets
- Highly available, and
- Secure
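As a sketch of what querying with Athena looks like, the following starts a standard SQL query with boto3; the database, table, and results-bucket names are assumptions for illustration:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database, table, and query-results bucket.
response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
# Queries run asynchronously; poll with get_query_execution using this id.
print("Query execution id:", response["QueryExecutionId"])
```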
Benefits of Data Lake Building on AWS
The following advantages make AWS a strong choice for building a data lake:
- Flexibility: Amazon S3 supports storing a large volume of data at scale, regardless of format.
- Most Comprehensive Platform: AWS provides the most comprehensive platform for building data lakes, combining security, agility, flexibility, and a lower total cost of ownership (TCO).
- Security and Compliance: All data in the lake can easily be encrypted, which helps meet regulatory compliance standards such as PCI DSS and HIPAA; a sketch of enabling default encryption follows this list.
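As an example of the encryption point above, default server-side encryption can be turned on for the underlying S3 bucket as follows; the bucket name is an assumption:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical data lake bucket; enable default SSE-S3 (AES-256) encryption
# so every object written to the lake is encrypted at rest.
s3.put_bucket_encryption(
    Bucket="my-data-lake-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```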
Building a Data Lake on AWS
To build a Data Lake on AWS, an AWS CloudFormation template is used to configure the solution, covering AWS services such as Amazon S3 for unlimited data storage, Amazon Cognito for authentication, Amazon Elasticsearch Service for strong search capabilities, AWS Lambda for microservices, AWS Glue for data transformation, and Amazon Athena for data analytics. The following figure shows the complete architecture for building a Data Lake on AWS using these services.
Figure 3: Data Lake Architecture on AWS
The AWS Data Lake architecture leverages the durability, security, and scalability of Amazon S3 for storing unlimited data, while Amazon DynamoDB manages a persistent catalog of business datasets and their metadata. Once data is cataloged, its attributes and tags can be used to search and browse datasets from the solution console.
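To illustrate tag-based searching, a scan against such a DynamoDB catalog table might look like the following; the table and attribute names are assumptions carried over from the earlier catalog sketch:

```python
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-catalog")  # hypothetical catalog table

# Find cataloged datasets whose "tags" attribute contains "clickstream".
response = table.scan(FilterExpression=Attr("tags").contains("clickstream"))
for item in response["Items"]:
    print(item["objectKey"])
```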
To build a data lake on AWS, you can follow the deployment guide.
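As a minimal sketch of that deployment done programmatically, the following launches a CloudFormation stack with boto3; the stack name and template URL are placeholders, and the real template location comes from the deployment guide:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Hypothetical stack name and template URL; substitute the template
# location given in the deployment guide.
response = cloudformation.create_stack(
    StackName="my-data-lake",
    TemplateURL="https://s3.amazonaws.com/my-bucket/data-lake.template",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles
)
print("Stack id:", response["StackId"])
```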
Author: Nisar Ahmad
Systems Engineer, vExpert 2017-19, and owner of My Virtual Journey, with experience in managing datacenter environments using VMware and Microsoft technologies. This blog mainly covers virtualization and cloud technologies, but also touches on other areas such as cyber security and quantum computing.