A data lake is an increasingly popular way to store and analyze data because it allows companies to store all of their data, structured and unstructured, in a centralized repository.
The design of the AWS Data Lake Solution references the 5 pillars of the AWS Well-Architected Framework:
• Operational Excellence
- Organizational Priorities
- Business Outcomes
- Workload & State Design
- Continuous Improvement
- Mitigate Deployment Risks
- Workload Support Readiness
- Workload/Operation Event Monitoring & Analysis
• Security
- Workload Operational Security
- User/Machine Authentication & Authorization
- Security Event Monitoring & Detection
- Network & Compute Resource Protection
- Data Classification & Protection
- Security Event Anticipation and Response/Recovery
• Reliability
- Service Quotas & Constraints Management
- Network Topology Planning
- Fault-Tolerant and/or Highly Available Workload Architecture Design
- Workload Resource Monitoring
- Change Management & Implementation
- Failure Management & Data Backup/Disaster Recovery
- Reliability Testing
• Performance Efficiency
- Performance Measurement
- Compute, Storage, Database & Networking Solution Planning
- Resource Performance Monitoring
- Performance Tradeoffs
• Cost Optimization
- Cloud Financial Management
- Usage Governance
- Usage & Cost Monitoring
- Resource Decommissioning
- Cost Evaluation
- Data Transfer & Pricing Model Cost Optimization
- Supply & Demand Management
- Continuous Cost Optimization
Reference Architecture – AWS Data Lake Solution Overview
Note:
- Download the AWS architecture icons toolkit (https://aws.amazon.com/architecture/icons/) and draw in PowerPoint
- An online tool to draw AWS reference architectures: https://app.creately.com/
The more detailed AWS data lake enterprise solution can be grouped into 6 categories of AWS services:
- Data Lake Storage – S3 is at the center of the AWS data lake solution. Data lake storage costs are optimized with S3 object lifecycle rules that transition objects among the S3 storage classes (S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 Glacier, and S3 Glacier Deep Archive). On top of that, S3 provides data protection for data in transit (SSL/TLS) and data at rest (SSE-KMS).
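The lifecycle tiering described above can be sketched in Python. This is a minimal sketch, assuming a hypothetical `raw/` prefix and illustrative transition ages; the actual boto3 call is shown commented out:

```python
# Minimal sketch of an S3 lifecycle configuration that tiers data lake
# objects down through cheaper storage classes as they age.
# The prefix and the day thresholds are illustrative assumptions.
import json


def build_lifecycle_configuration(prefix: str) -> dict:
    """Build an S3 lifecycle configuration that transitions objects
    under `prefix` to STANDARD_IA, GLACIER, and DEEP_ARCHIVE over time."""
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # S3 requires at least 30 days in Standard before STANDARD_IA.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration("raw/")
print(json.dumps(config, indent=2))

# To apply it (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

The thresholds would be tuned per workload; colder data can move to Glacier tiers sooner.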
- Data Ingestion:
- Snowball – AWS Snowball is a data transport solution that accelerates moving terabytes to petabytes of data into and out of AWS using storage appliances designed to be secure for physical transport. [Available in Singapore Region since Mar 3, 2018]
- Direct Connect & Storage Gateway – Use Direct Connect to create a public virtual interface, bypassing the ISP in the network path, for routing traffic to on-premises Storage Gateway endpoints. This establishes a dedicated network connection between your on-premises gateway and AWS.
- [Equinix SG2 & Global Switch offer Direct Connect service in the Singapore Region; refer to this link for pricing: https://aws.amazon.com/directconnect/pricing/?nc=sn&loc=3]
- AWS Database Migration Service – Supports database migration to the cloud for both homogeneous and heterogeneous migrations, with minimal downtime to the source database.
- [The AWS Schema Conversion Tool supports the following conversions – https://aws.amazon.com/dms/schema-conversion-tool/?nc=sn&loc=2]
| Source Database | Target Database on Amazon RDS |
|---|---|
| Oracle Database | Amazon Aurora, MySQL, PostgreSQL, Oracle |
| Oracle Data Warehouse | Amazon Redshift |
| Azure SQL | Amazon Aurora, MySQL, PostgreSQL |
| Microsoft SQL Server | Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL |
| Teradata | Amazon Redshift |
| IBM Netezza | Amazon Redshift |
| Greenplum | Amazon Redshift |
| HPE Vertica | Amazon Redshift |
| MySQL and MariaDB | PostgreSQL |
| PostgreSQL | Amazon Aurora, MySQL |
| Amazon Aurora | PostgreSQL |
| IBM DB2 LUW | Amazon Aurora, MySQL, PostgreSQL |
| Apache Cassandra | Amazon DynamoDB |
| SAP ASE | RDS for MySQL, Aurora MySQL, RDS for PostgreSQL, and Aurora PostgreSQL |
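As a sketch of how a DMS migration scopes the tables it moves, the service accepts a "table mappings" JSON document of selection rules. The schema name below is a placeholder:

```python
# Sketch of a DMS table-mappings document that selects every table in one
# source schema for migration. The schema name "sales" is a placeholder.
import json


def build_table_mappings(schema: str) -> dict:
    """Build a DMS selection rule that includes all tables in `schema`."""
    return {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                # "%" is the DMS wildcard for table names
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }


mappings = build_table_mappings("sales")
print(json.dumps(mappings))
# A real task would pass json.dumps(mappings) as the TableMappings argument
# to boto3.client("dms").create_replication_task(...).
```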
- AWS Kinesis & AWS MSK (Managed Streaming for Apache Kafka) – Both are managed services for real-time streaming data, handling the same tasks with their own pros and cons. Kinesis is a user-friendly and straightforward choice for ingesting real-time streaming data into the cloud, at the cost of higher vendor lock-in to AWS. MSK leverages open-source Apache Kafka (up to version 2.8.0 as of this writing) and suits applications that already use the Kafka producer and consumer API libraries, though it has a steeper learning curve for beginners.
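A minimal sketch of preparing a batch of records for Kinesis ingestion, assuming a hypothetical stream and event shape. Partitioning by a device id keeps records from the same device in order on the same shard:

```python
# Sketch of batching JSON events into Kinesis Data Streams records.
# The stream name and the event fields (device_id, temp) are hypothetical.
import json


def build_kinesis_records(events: list) -> list:
    """Encode events as Kinesis records, partitioned by device id so
    that each device's records preserve ordering within a shard."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["device_id"]),
        }
        for event in events
    ]


records = build_kinesis_records(
    [{"device_id": 7, "temp": 21.5}, {"device_id": 9, "temp": 19.8}]
)
# To ingest (requires boto3 and credentials):
# boto3.client("kinesis").put_records(StreamName="ingest-stream", Records=records)
```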
- User Management & Security, Audit
- IAM – helps securely control access to AWS resources. The relevant writings will focus on Authentication (who can sign in) and Authorization (who has permission to access what).
- Refer to the IAM Security Best Practices
- Roles & Permissions Matrix (prepare templates to map among Policies, User Groups, Roles, and Users)
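A roles/permissions matrix ultimately materializes as IAM policy documents. A minimal sketch of a least-privilege, read-only policy for one data lake prefix; the bucket name and prefix are placeholders:

```python
# Sketch of a least-privilege IAM policy granting read-only access to a
# single data lake prefix. BUCKET and PREFIX are hypothetical placeholders.
BUCKET = "my-data-lake-bucket"
PREFIX = "curated/"

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow listing the bucket, but only within the prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
        {
            # Allow reading objects under the prefix only.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
    ],
}
```

A policy like this would be attached to a group or role from the matrix rather than to individual users.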
- KMS (Key Management Service) – create and manage cryptographic keys and control their use across a wide range of AWS services and in your applications. It integrates with CloudTrail to log key usage, helping meet regulatory compliance requirements.
- CloudWatch – collects monitoring and operational data in the form of logs, metrics, and events, and visualizes AWS resources, applications, and services that run in AWS and on-premises in a dashboard.
- CloudTrail – log, continuously monitor, and retain account activity related to actions across the AWS infrastructure.
- Data Lake User Access Interfaces
- To better illustrate the Data Lake Solution access interfaces, two diagrams are included below.
- AWS AppSync – giving front-end developers the ability to query multiple databases, microservices, and APIs with a single GraphQL endpoint.
- AWS API Gateway – create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications
- Together with AWS Lambda serverless custom functions, microservices (a series of AWS Lambda functions that provide the business logic and data access layer) are built for the various data lake use cases.
- AWS Cognito /Userpool – add user sign-up, sign-in, and access control to the enterprise web and mobile apps. It supports sign-in with social identity providers, such as Apple, Facebook, Google, and Amazon, and enterprise identity providers via SAML 2.0 (e.g. Active Directory Federation Service) and OpenID Connect.
- As shown in the diagram below, the data lake console secures user access with Amazon Cognito and provides an administrative interface for managing data lake users through integration with Amazon Cognito user pools.
- Alternatively, use the Active Directory federated template; in that case, all administrative tasks should be done on the AD server.
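The Lambda-backed microservice pattern described above can be sketched with a single handler behind API Gateway's Lambda proxy integration. The route and response payload here are illustrative assumptions, not the solution's actual API:

```python
# Sketch of one data lake microservice: a Lambda handler invoked through
# API Gateway (Lambda proxy integration). The /datasets route and the
# dataset names are hypothetical.
import json


def handler(event, context):
    """Return a dataset listing for GET requests; reject other methods."""
    if event.get("httpMethod") != "GET":
        return {"statusCode": 405,
                "body": json.dumps({"error": "method not allowed"})}
    datasets = ["raw", "curated", "published"]  # placeholder data
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"datasets": datasets}),
    }


# Simulate an API Gateway proxy event locally:
response = handler({"httpMethod": "GET", "path": "/datasets"}, None)
print(response["body"])
```

In the real solution such handlers would be fronted by Cognito-authorized API Gateway routes.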
- Data Catalog & Search:
- AWS Glue Data Catalog – crawls and meta-tags multiple AWS data sets in the S3 data lake to build a data catalog (table definitions, inferred schema metadata) for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Below is an illustration of how the AWS Glue Data Catalog works behind the scenes.
- Elasticsearch – use Amazon Elasticsearch Service to easily index unstructured data and search the metadata and the content of the documents in data lake.
- DynamoDB – crawling and extraction features (via AWS Glue) simplify the task of moving DynamoDB NoSQL data to Amazon S3 for analysis.
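A sketch of the request parameters for a Glue crawler over one S3 prefix; the crawler name, IAM role ARN, database name, S3 path, and schedule are all placeholders:

```python
# Sketch of the parameters for creating a Glue crawler that builds catalog
# tables from an S3 prefix. Every name/ARN/path below is a placeholder.
crawler_params = {
    "Name": "datalake-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "datalake_catalog",
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
    # Re-crawl nightly at 02:00 UTC to pick up new partitions.
    "Schedule": "cron(0 2 * * ? *)",
}
# To create it (requires boto3 and credentials):
# boto3.client("glue").create_crawler(**crawler_params)
```

Once the crawler has run, the inferred tables are immediately queryable from Athena and Redshift Spectrum.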
- Data Analytics & Business Intelligence
- Once the data lake setup and configuration are ready for data analytics and business intelligence purposes, we can explore the common usage patterns available:
- Athena – query datasets in your S3 data lake with simple SQL expressions; it supports complex analyses such as large joins, window functions, and arrays.
- AWS EMR – analyze S3 data without node provisioning, cluster setup and tuning, or Hadoop setup. Users can run multiple clusters in parallel over the same data set.
- Redshift Spectrum – run fast, complex queries using SQL expressions across exabytes of S3 data without moving the data into Redshift.
- AWS Glue Services:
- Studio – Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows without coding.
- DataBrew – Data analysts and data scientists can visually enrich, clean, and normalize data without writing code.
- Elastic Views – application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores. (Not available in the Singapore Region as of this writing)
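As an example of the Athena usage pattern above, a query execution is parameterized roughly as follows. The database, table, and results bucket are hypothetical:

```python
# Sketch of the parameters for an Athena query over Glue-catalogued S3 data.
# The database ("datalake_catalog"), table ("web_logs"), and results bucket
# are hypothetical placeholders.
query_params = {
    "QueryString": (
        "SELECT event_date, COUNT(*) AS events "
        "FROM web_logs GROUP BY event_date ORDER BY event_date"
    ),
    "QueryExecutionContext": {"Database": "datalake_catalog"},
    # Athena writes its result files to this S3 location.
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}
# To run it (requires boto3 and credentials):
# execution = boto3.client("athena").start_query_execution(**query_params)
```

The caller would then poll the query execution status and read the result file from the output location.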
Data Analytics & Business Intelligence Design Principles:
Special additional notes on S3 Data lake:
1. AI and Machine Learning Services – launch AWS AI services such as Amazon Comprehend, Amazon Forecast, Amazon Personalize, and Amazon Rekognition to discover insights from your unstructured datasets, get accurate forecasts, create recommendation engines, and analyze images and videos stored in the S3 data lake. You can also deploy Amazon SageMaker to build, train, and deploy ML models quickly with the datasets stored in the S3 data lake.
2. S3 Select – retrieve a subset of an object's data using simple SQL expressions, without moving the object to another data store. This can improve query performance by up to 400% and reduce query costs by up to 80%.
3. Amazon FSx for Lustre – provides a high-performance file system that works natively with your S3 data lake and is optimized for fast processing of workloads such as machine learning, high-performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
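The S3 Select note above can be sketched as request parameters for `select_object_content`. The bucket, key, and CSV column references are illustrative:

```python
# Sketch of S3 Select parameters that pull only matching rows and columns
# from a CSV object, so the full object never leaves S3.
# The bucket, key, and column positions (_1, _3) are placeholders.
select_params = {
    "Bucket": "my-data-lake-bucket",
    "Key": "raw/sales.csv",
    "ExpressionType": "SQL",
    # s._1 and s._3 refer to the first and third CSV columns.
    "Expression": "SELECT s._1, s._3 FROM s3object s WHERE s._3 > '100'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "IGNORE"}},
    "OutputSerialization": {"CSV": {}},
}
# To run it (requires boto3 and credentials):
# boto3.client("s3").select_object_content(**select_params)
```

Because only the filtered bytes are returned, less data crosses the network, which is where the performance and cost savings come from.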