A data lake is an increasingly popular way to store and analyze data because it allows companies to store all of their data, structured and unstructured, in a centralized repository.
The design of the AWS Data Lake Solution references the 5 pillars of the AWS Well-Architected Framework:
• Operational Excellence
- Organizational Priorities
- Business Outcomes
- Workload & State Design
- Continuous Improvement
- Mitigate Deployment Risks
- Workload Support Readiness
- Workload/Operation Event Monitoring & Analysis
• Security
- Workload Operational Security
- User/Machine Authentication & Authorization
- Security Event Monitoring & Detection
- Network & Compute Resource Protection
- Data Classification & Protection
- Security Event Anticipation and Response/Recovery
• Reliability
- Service Quotas & Constraints Management
- Network Topology Planning
- Fault-Tolerant and/or Highly Available Workload Architecture Design
- Workload Resource Monitoring
- Change Management & Implementation
- Failure Management & Data Backup/Disaster Recovery
- Reliability Testing
• Performance Efficiency
- Performance Measurement
- Compute, Storage, Database & Networking Solution Planning
- Resource Performance Monitoring
- Performance Tradeoffs
• Cost Optimization
- Cloud Financial Management
- Usage Governance
- Usage & Cost Monitoring
- Resource Decommissioning
- Cost Evaluation
- Data Transfer & Pricing Model Cost Optimization
- Supply & Demand Management
- Continuous Cost Optimization
Reference Architecture – AWS Data Lake Solution Overview
Note:
- Download the AWS architecture icons toolkit (https://aws.amazon.com/architecture/icons/) and draw in PowerPoint
- An online tool to draw AWS reference architectures: https://app.creately.com/
The more detailed AWS data lake enterprise solution can be grouped into 6 categories of AWS services:
- Data Lake Storage – S3 is at the center of the AWS data lake solution. Data lake storage costs are optimized with S3 object lifecycle rules that transition objects among the S3 storage classes (S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 Glacier, and S3 Glacier Deep Archive). On top of that, S3 provides data protection for data in transit (SSL/TLS) and data at rest (SSE-KMS).
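The lifecycle tiering described above can be sketched in Python. This is a minimal sketch, assuming a hypothetical `raw/` prefix and illustrative transition ages; the actual boto3 call is shown commented out:

```python
# Minimal sketch of an S3 lifecycle configuration that tiers data lake
# objects down through cheaper storage classes as they age.
# The prefix and the day thresholds are illustrative assumptions.
import json


def build_lifecycle_configuration(prefix: str) -> dict:
    """Build an S3 lifecycle configuration that transitions objects
    under `prefix` to STANDARD_IA, GLACIER, and DEEP_ARCHIVE over time."""
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # S3 requires at least 30 days in Standard before STANDARD_IA.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration("raw/")
print(json.dumps(config, indent=2))

# To apply it (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

The thresholds would be tuned per workload; colder data can move to Glacier tiers sooner.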
- Data Ingestion:
- Snowball – AWS Snowball is a data transport solution that accelerates moving terabytes to petabytes of data into and out of AWS using storage appliances designed to be secure for physical transport. [Available in Singapore Region since Mar 3, 2018]
- Direct Connect & Storage Gateway – Use Direct Connect to create a public virtual interface, bypassing the ISP in the network path, for routing traffic to on-premises Storage Gateway endpoints. This establishes a dedicated network connection between your on-premises gateway and AWS.
- [Equinix SG2 & Global Switch offer Direct Connect service in the Singapore Region; refer to this link for pricing: https://aws.amazon.com/directconnect/pricing/?nc=sn&loc=3]
- AWS Database Migration Service – Supports database migration to the cloud for both homogeneous and heterogeneous migrations, with minimal downtime to the source database.
- [The AWS Schema Conversion Tool supports the following conversions – https://aws.amazon.com/dms/schema-conversion-tool/?nc=sn&loc=2]
| Source Database | Target Database on Amazon RDS |
|---|---|
| Oracle Database | Amazon Aurora, MySQL, PostgreSQL, Oracle |
| Oracle Data Warehouse | Amazon Redshift |
| Azure SQL | Amazon Aurora, MySQL, PostgreSQL |
| Microsoft SQL Server | Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL |
| Teradata | Amazon Redshift |
| IBM Netezza | Amazon Redshift |
| Greenplum | Amazon Redshift |
| HPE Vertica | Amazon Redshift |
| MySQL and MariaDB | PostgreSQL |
| PostgreSQL | Amazon Aurora, MySQL |
| Amazon Aurora | PostgreSQL |
| IBM DB2 LUW | Amazon Aurora, MySQL, PostgreSQL |
| Apache Cassandra | Amazon DynamoDB |
| SAP ASE | RDS for MySQL, Aurora MySQL, RDS for PostgreSQL, and Aurora PostgreSQL |
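As a sketch of how a DMS migration scopes the tables it moves, the service accepts a "table mappings" JSON document of selection rules. The schema name below is a placeholder:

```python
# Sketch of a DMS table-mappings document that selects every table in one
# source schema for migration. The schema name "sales" is a placeholder.
import json


def build_table_mappings(schema: str) -> dict:
    """Build a DMS selection rule that includes all tables in `schema`."""
    return {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                # "%" is the DMS wildcard for table names
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }


mappings = build_table_mappings("sales")
print(json.dumps(mappings))
# A real task would pass json.dumps(mappings) as the TableMappings argument
# to boto3.client("dms").create_replication_task(...).
```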
- AWS Kinesis & AWS MSK (Managed Streaming for Apache Kafka) – Both are managed services for real-time streaming data, handling the same tasks with their own pros and cons. Kinesis is a user-friendly and straightforward choice for ingesting real-time streaming data into the cloud, at the cost of higher vendor lock-in to AWS. MSK leverages open-source Apache Kafka (up to version 2.8.0 as of this writing) and suits applications that already use the Kafka producer and consumer API libraries, though it has a steeper learning curve for beginners.
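A minimal sketch of preparing a batch of records for Kinesis ingestion, assuming a hypothetical stream and event shape. Partitioning by a device id keeps records from the same device in order on the same shard:

```python
# Sketch of batching JSON events into Kinesis Data Streams records.
# The stream name and the event fields (device_id, temp) are hypothetical.
import json


def build_kinesis_records(events: list) -> list:
    """Encode events as Kinesis records, partitioned by device id so
    that each device's records preserve ordering within a shard."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["device_id"]),
        }
        for event in events
    ]


records = build_kinesis_records(
    [{"device_id": 7, "temp": 21.5}, {"device_id": 9, "temp": 19.8}]
)
# To ingest (requires boto3 and credentials):
# boto3.client("kinesis").put_records(StreamName="ingest-stream", Records=records)
```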
- User Management & Security, Audit
- IAM – helps securely control access to AWS resources. The relevant writings will focus on Authentication (who can sign in) and Authorization (who has permission to access what).
- Refer to the IAM Security Best Practices
- Roles & Permissions Matrix (prepare templates to map among Policies, User Groups, Roles, and Users)
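A roles/permissions matrix ultimately materializes as IAM policy documents. A minimal sketch of a least-privilege, read-only policy for one data lake prefix; the bucket name and prefix are placeholders:

```python
# Sketch of a least-privilege IAM policy granting read-only access to a
# single data lake prefix. BUCKET and PREFIX are hypothetical placeholders.
BUCKET = "my-data-lake-bucket"
PREFIX = "curated/"

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow listing the bucket, but only within the prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
        {
            # Allow reading objects under the prefix only.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
    ],
}
```

A policy like this would be attached to a group or role from the matrix rather than to individual users.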
- KMS (Key Management Service) – create and manage cryptographic keys and control their use across a wide range of AWS services and in your applications. It integrates with CloudTrail to log key usage, helping meet regulatory compliance requirements.
- CloudWatch – collects monitoring and operational data in the form of logs, metrics, and events, and visualizes AWS resources, applications, and services that run in AWS and on-premises in a dashboard.
- CloudTrail – log, continuously monitor, and retain account activity related to actions across the AWS infrastructure.
- Data Lake User Access Interfaces
- To better illustrate the Data Lake Solution access interfaces, two diagrams are included below.
- AWS AppSync – giving front-end developers the ability to query multiple databases, microservices, and APIs with a single GraphQL endpoint.
- AWS API Gateway – create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications
- Together with AWS Lambda serverless custom functions, microservices (a series of AWS Lambda functions that provide the business logic and data access layer) are built for the various data lake use cases.
- AWS Cognito /Userpool – add user sign-up, sign-in, and access control to the enterprise web and mobile apps. It supports sign-in with social identity providers, such as Apple, Facebook, Google, and Amazon, and enterprise identity providers via SAML 2.0 (e.g. Active Directory Federation Service) and OpenID Connect.
- As shown in the diagram below, the data lake console secures user access with Amazon Cognito and provides an administrative interface for managing data lake users through integration with Amazon Cognito user pools.
- Alternatively, use the Active Directory federated template; in that case, all administrative tasks should be done on the AD server.
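The Lambda-backed microservice pattern described above can be sketched with a single handler behind API Gateway's Lambda proxy integration. The route and response payload here are illustrative assumptions, not the solution's actual API:

```python
# Sketch of one data lake microservice: a Lambda handler invoked through
# API Gateway (Lambda proxy integration). The /datasets route and the
# dataset names are hypothetical.
import json


def handler(event, context):
    """Return a dataset listing for GET requests; reject other methods."""
    if event.get("httpMethod") != "GET":
        return {"statusCode": 405,
                "body": json.dumps({"error": "method not allowed"})}
    datasets = ["raw", "curated", "published"]  # placeholder data
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"datasets": datasets}),
    }


# Simulate an API Gateway proxy event locally:
response = handler({"httpMethod": "GET", "path": "/datasets"}, None)
print(response["body"])
```

In the real solution such handlers would be fronted by Cognito-authorized API Gateway routes.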
- Data Catalog & Search:
- AWS Glue Data Catalog – crawls and meta-tags multiple AWS data sets in the S3 data lake to build a data catalog (table definitions, inferred schema metadata) for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Below is an illustration of how the AWS Glue Data Catalog works behind the scenes.
- Elasticsearch – use Amazon Elasticsearch Service to easily index unstructured data and search the metadata and the content of the documents in data lake.
- DynamoDB – crawling and extraction features (via AWS Glue) simplify the task of moving DynamoDB NoSQL data to Amazon S3 for analysis.
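A sketch of the request parameters for a Glue crawler over one S3 prefix; the crawler name, IAM role ARN, database name, S3 path, and schedule are all placeholders:

```python
# Sketch of the parameters for creating a Glue crawler that builds catalog
# tables from an S3 prefix. Every name/ARN/path below is a placeholder.
crawler_params = {
    "Name": "datalake-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "datalake_catalog",
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
    # Re-crawl nightly at 02:00 UTC to pick up new partitions.
    "Schedule": "cron(0 2 * * ? *)",
}
# To create it (requires boto3 and credentials):
# boto3.client("glue").create_crawler(**crawler_params)
```

Once the crawler has run, the inferred tables are immediately queryable from Athena and Redshift Spectrum.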
- Data Analytics & Business Intelligence
- Once the data lake setup and configuration are ready for data analytics and business intelligence purposes, we can explore the common usage patterns available:
- Athena – query datasets in your S3 data lake with simple SQL expressions; it supports complex analyses such as large joins, window functions, and arrays.
- AWS EMR – analyze S3 data without node provisioning, cluster setup and tuning, or Hadoop setup. Users can run multiple clusters in parallel over the same data set.
- Redshift Spectrum – run fast, complex queries using SQL expressions across exabytes of S3 data without moving the data into Redshift.
- AWS Glue Services:
- Studio – Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows without coding.
- DataBrew – Data analysts and data scientists can visually enrich, clean, and normalize data without writing code.
- Elastic Views – application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores. (Not available in the Singapore Region as of this writing)
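As an example of the Athena usage pattern above, a query execution is parameterized roughly as follows. The database, table, and results bucket are hypothetical:

```python
# Sketch of the parameters for an Athena query over Glue-catalogued S3 data.
# The database ("datalake_catalog"), table ("web_logs"), and results bucket
# are hypothetical placeholders.
query_params = {
    "QueryString": (
        "SELECT event_date, COUNT(*) AS events "
        "FROM web_logs GROUP BY event_date ORDER BY event_date"
    ),
    "QueryExecutionContext": {"Database": "datalake_catalog"},
    # Athena writes its result files to this S3 location.
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}
# To run it (requires boto3 and credentials):
# execution = boto3.client("athena").start_query_execution(**query_params)
```

The caller would then poll the query execution status and read the result file from the output location.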
Data Analytics & Business Intelligence Design Principles:
Special additional notes on S3 Data lake:
1. AI and Machine Learning Services – launch AWS AI services such as Amazon Comprehend, Amazon Forecast, Amazon Personalize, and Amazon Rekognition to discover insights from your unstructured datasets, get accurate forecasts, create recommendation engines, and analyze images and videos stored in the S3 data lake. You can also deploy Amazon SageMaker to build, train, and deploy ML models quickly with the datasets stored in the S3 data lake.
2. S3 Select – retrieve a subset of an object's data using simple SQL expressions, without moving the object to another data store. This can improve query performance by up to 400% and reduce query costs by up to 80%.
3. Amazon FSx for Lustre – provides a high-performance file system that works natively with your S3 data lake and is optimized for fast processing of workloads such as machine learning, high-performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
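The S3 Select note above can be sketched as request parameters for `select_object_content`. The bucket, key, and CSV column references are illustrative:

```python
# Sketch of S3 Select parameters that pull only matching rows and columns
# from a CSV object, so the full object never leaves S3.
# The bucket, key, and column positions (_1, _3) are placeholders.
select_params = {
    "Bucket": "my-data-lake-bucket",
    "Key": "raw/sales.csv",
    "ExpressionType": "SQL",
    # s._1 and s._3 refer to the first and third CSV columns.
    "Expression": "SELECT s._1, s._3 FROM s3object s WHERE s._3 > '100'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "IGNORE"}},
    "OutputSerialization": {"CSV": {}},
}
# To run it (requires boto3 and credentials):
# boto3.client("s3").select_object_content(**select_params)
```

Because only the filtered bytes are returned, less data crosses the network, which is where the performance and cost savings come from.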