diff --git a/README.md b/README.md
index 934e54e..6942c57 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,8 @@
  - [EC2 Instance Storage](/ec2_storage.md)
  - [Elastic Load Balancing & Auto Scaling Groups](/elb_asg.md)
  - [Amazon S3](/s3.md)
+ - [Databases & Analytics](/databases.md)
+ - [Other Compute Section](/databases.md)
 
 ### Contributors
 
diff --git a/databases.md b/databases.md
new file mode 100644
index 0000000..8b76c83
--- /dev/null
+++ b/databases.md
@@ -0,0 +1,264 @@
+# Databases
+
+## Databases Intro
+
+* Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits
+* Sometimes, you want to store data in a database…
+* You can structure the data
+* You build indexes to efficiently query / search through the data
+* You define relationships between your datasets
+* Databases are optimized for a purpose and come with different features, shapes and constraints
+
+## Relational Databases
+
+* Looks just like Excel spreadsheets, with links between them!
+* Can use the SQL language to perform queries / lookups
+
+## NoSQL Databases
+
+* NoSQL = non-SQL = non-relational databases
+* NoSQL databases are purpose-built for specific data models and have flexible schemas for building modern applications.
+* Benefits:
+  * Flexibility: easy to evolve data model
+  * Scalability: designed to scale out by using distributed clusters
+  * High-performance: optimized for a specific data model
+  * Highly functional: types optimized for the data model
+* Examples: Key-value, document, graph, in-memory, search databases
+
+### NoSQL data example: JSON
+
+* JSON = JavaScript Object Notation
+* JSON is a common form of data that fits into a NoSQL model
+* Data can be nested
+* Fields can change over time
+* Support for new types: arrays, etc…
+
+```json
+{
+  "name": "John",
+  "age": 30,
+  "cars": [
+    "Ford",
+    "BMW",
+    "Fiat"
+  ],
+  "address": {
+    "type": "house",
+    "number": 23,
+    "street": "Dream Road"
+  }
+}
+```
+
+## Databases & Shared Responsibility on AWS
+
+* AWS offers to manage different databases for us
+* Benefits include:
+  * Quick Provisioning, High Availability, Vertical and Horizontal Scaling
+  * Automated Backup & Restore, Operations, Upgrades
+  * Operating System Patching is handled by AWS
+  * Monitoring, alerting
+* Note: many database technologies could be run on EC2, but you must handle the resiliency, backup, patching, high availability, fault tolerance and scaling yourself
+
+## AWS RDS Overview
+
+* RDS stands for Relational Database Service
+* It’s a managed DB service for DBs that use SQL as a query language.
+* It allows you to create databases in the cloud that are managed by AWS
+  * Postgres
+  * MySQL
+  * MariaDB
+  * Oracle
+  * Microsoft SQL Server
+  * **Aurora (AWS Proprietary database)**
+
+### Advantages of using RDS versus deploying a DB on EC2
+
+* RDS is a managed service:
+  * Automated provisioning, OS patching
+  * Continuous backups and restore to specific timestamp (Point in Time Restore)!
+  * Monitoring dashboards
+  * Read replicas for improved read performance
+  * Multi AZ setup for DR (Disaster Recovery)
+  * Maintenance windows for upgrades
+  * Scaling capability (vertical and horizontal)
+  * Storage backed by EBS (gp2 or io1)
+* BUT you can’t SSH into your instances
+
+## Amazon Aurora
+
+* Aurora is a proprietary technology from AWS (not open sourced)
+* PostgreSQL and MySQL are both supported as Aurora DB
+* Aurora is “AWS cloud optimized” and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
+* Aurora storage automatically grows in increments of 10GB, up to 64 TB.
+* Aurora costs more than RDS (20% more) – but is more efficient
+* Not in the free tier
+
+## RDS Deployments: Read Replicas, Multi-AZ
+
+Read Replicas | Multi-AZ
+---- | ----
+Scale the read workload of your DB | Failover in case of AZ outage (high availability)
+Can create up to 5 Read Replicas | Data is only read/written to the main database
+Data is only written to the main DB | Can only have 1 other AZ as failover
+
+![Read Replicas | Multi-AZ](/images/read_replicas_multi_AZ.png)
+
+## RDS Deployments: Multi-Region
+
+* Multi-Region (Read Replicas)
+  * Disaster recovery in case of region issue
+  * Local performance for global reads
+  * Replication cost
+
+![Multi-Region](/images/multi_region.png)
+
+## Amazon ElastiCache Overview
+
+* The same way RDS is to get managed Relational Databases…
+* ElastiCache is to get managed Redis or Memcached
+* Caches are in-memory databases with high performance, low latency
+* Helps reduce load on databases for read-intensive workloads
+* AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup
+
+## DynamoDB
+
+* Fully managed, highly available with replication across 3 AZs
+* NoSQL database - not a relational database
+* Scales to massive workloads, distributed “serverless” database
+* Millions of requests per second, trillions of rows, 100s of TB of storage
+* Fast and consistent in performance
+* Single-digit millisecond latency – low latency retrieval
+* Integrated with IAM for security, authorization and administration
+* Low cost and auto scaling capabilities
+* Standard & Infrequent Access (IA) Table Class
+
+### DynamoDB Accelerator - DAX
+
+* Fully Managed in-memory cache for DynamoDB
+* 10x performance improvement – single-digit millisecond latency to microseconds latency – when accessing your DynamoDB tables
+* Secure, highly scalable & highly available
+* Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases
+
+### DynamoDB – Global Tables
+
+* Make a DynamoDB table accessible with low latency in multiple regions
+* Active-Active replication (read/write to any AWS Region)
+
+## Redshift Overview
+
+* Redshift is based on PostgreSQL, but it’s not used for OLTP (Online Transactional Processing)
+* It’s OLAP – online analytical processing (analytics and data warehousing)
+* Load data once every hour, not every second
+* 10x better performance than other data warehouses, scale to PBs of data
+* Columnar storage of data (instead of row based)
+* Massively Parallel Query Execution (MPP), highly available
+* Pay as you go based on the instances provisioned
+* Has a SQL interface for performing the queries
+* BI tools such as Amazon QuickSight or Tableau integrate with it
+
+## Amazon EMR
+
+* EMR stands for “Elastic MapReduce”
+* EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
+* The clusters can be made of hundreds of EC2 instances
+* Also supports Apache Spark, HBase, Presto, Flink
+* EMR takes care of all the provisioning and configuration
+* Auto-scaling and integrated with Spot instances
+* Use cases: data processing, machine learning, web indexing, big data
+
+## Amazon Athena
+
+* Serverless query service to analyze data stored in Amazon S3
+* Uses standard SQL language to query the files
+* Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
+* Pricing: $5.00 per TB of data scanned
+* Use compressed or columnar data for cost-savings (less scan)
+* Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc...
+* **To analyze data in S3 using serverless SQL, use Athena**
+
+## Amazon QuickSight
+
+* Serverless machine learning-powered business intelligence service to create interactive dashboards
+* Fast, automatically scalable, embeddable, with per-session pricing
+* Use cases:
+  * Business analytics
+  * Building visualizations
+  * Perform ad-hoc analysis
+  * Get business insights using data
+* Integrated with RDS, Aurora, Athena, Redshift, S3…
+
+## DocumentDB
+
+* Aurora is an “AWS-implementation” of PostgreSQL / MySQL …
+* DocumentDB is the same for MongoDB (which is a NoSQL database)
+* MongoDB is used to store, query, and index JSON data
+* Similar “deployment concepts” as Aurora
+* Fully managed, highly available with replication across 3 AZs
+* DocumentDB storage automatically grows in increments of 10GB, up to 64 TB.
+* Automatically scales to workloads with millions of requests per second
+
+## Amazon Neptune
+
+* Fully managed graph database
+* A popular graph dataset would be a social network
+  * Users have friends
+  * Posts have comments
+  * Comments have likes from users
+  * Users share and like posts…
+* Highly available across 3 AZs, with up to 15 read replicas
+* Build and run applications working with highly connected datasets – optimized for these complex and hard queries
+* Can store up to billions of relations and query the graph with milliseconds latency
+* Highly available with replication across multiple AZs
+* Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking
+
+## Amazon QLDB
+
+* QLDB stands for “Quantum Ledger Database”
+* A ledger is a book **recording financial transactions**
+* Fully managed, serverless, highly available, replication across 3 AZs
+* Used to **review history of all the changes made to your application data** over time
+* **Immutable** system: no entry can be removed or modified, cryptographically verifiable
+* 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
+* Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules
+
+## Amazon Managed Blockchain
+
+* Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
+* Amazon Managed Blockchain is a managed service to:
+  * Join public blockchain networks
+  * Or create your own scalable private network
+* Compatible with the frameworks Hyperledger Fabric & Ethereum
+
+## AWS Glue
+
+* Managed extract, transform, and load (ETL) service
+* Useful to prepare and transform data for analytics
+* Fully serverless service
+* Glue Data Catalog: catalog of datasets
+  * can be used by Athena, Redshift, EMR
+
+## DMS – Database Migration Service
+
+* Quickly and securely migrate databases to AWS, resilient, self-healing
+* The source database remains available during the migration
+* Supports:
+  * Homogeneous migrations: e.g. Oracle to Oracle
+  * Heterogeneous migrations: e.g. Microsoft SQL Server to Aurora
+
+## Databases & Analytics Summary in AWS
+
+* Relational Databases - OLTP: RDS & Aurora (SQL)
+* Differences between Multi-AZ, Read Replicas, Multi-Region
+* In-memory Database: ElastiCache
+* Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
+* Warehouse - OLAP: Redshift (SQL)
+* Hadoop Cluster: EMR
+* Athena: query data on Amazon S3 (serverless & SQL)
+* QuickSight: dashboards on your data (serverless)
+* DocumentDB: “Aurora for MongoDB” (JSON – NoSQL database)
+* Amazon QLDB: Financial Transactions Ledger (immutable journal, cryptographically verifiable)
+* Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
+* Glue: Managed ETL (Extract Transform Load) and Data Catalog service
+* Database Migration: DMS
+* Neptune: graph database
\ No newline at end of file
diff --git a/images/multi_region.png b/images/multi_region.png
new file mode 100644
index 0000000..ebba387
Binary files /dev/null and b/images/multi_region.png differ
diff --git a/images/read_replicas_multi_AZ.png b/images/read_replicas_multi_AZ.png
new file mode 100644
index 0000000..1d3a867
Binary files /dev/null and b/images/read_replicas_multi_AZ.png differ
diff --git a/s3.md b/s3.md
index 65e1080..7e8d994 100644
--- a/s3.md
+++ b/s3.md
@@ -291,7 +291,7 @@ Data | 100 Mbps | 1Gbps | 10Gbps
 * High security: temperature controlled, GPS, 24/7 video surveillance
 * **Better than Snowball if you transfer more than 10 PB**
 
-Propertie | Snowcone | Snowball Edge Storage Optimized | Snowmobile
+Properties | Snowcone | Snowball Edge Storage Optimized | Snowmobile
 ---- | ---- | ---- | ----
 Storage Capacity | 8 TB usable | 80 TB usable | < 100 PB
 Migration Size | Up to 24 TB, online and offline | Up to petabytes, offline | Up to exabytes, offline