[Added] rds and analytics doc.

2022-08-14 22:16:57 +09:00
parent 6f328932d6
commit 621eaa1dd7
5 changed files with 267 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -11,6 +11,8 @@
  - [EC2 Instance Storage](/ec2_storage.md)
  - [Elastic Load Balancing & Auto Scaling Groups](/elb_asg.md)
  - [Amazon S3](/s3.md)
  - [Databases & Analytics](/databases.md)
  - [Other Compute Section](/databases.md)
 ### Contributors
--- a/databases.md
+++ b/databases.md
@@ -0,0 +1,264 @@
 # Databases
 ## Databases Intro
 * Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits
 * Sometimes, you want to store data in a database…
 * You can structure the data
 * You build indexes to efficiently query / search through the data
 * You define relationships between your datasets
 * Databases are optimized for a purpose and come with different features, shapes and constraint
 ## Relational Databases
 * Looks just like Excel spreadsheets, with links between them!
 * Can use the SQL language to perform queries / lookups
 ## NoSQL Databases
 * NoSQL = non-SQL = non relational databases
 * NoSQL databases are purpose built for specific data models and have flexible schemas for building modern applications.
 * Benefits:
  * Flexibility: easy to evolve data model
  * Scalability: designed to scale-out by using distributed clusters
  * High-performance: optimized for a specific data model
  * Highly functional: types optimized for the data model
 * Examples: Key-value, document, graph, in-memory, search databases
 ### NoSQL data example: JSON
 * JSON = JavaScript Object Notation
 * JSON is a common form of data that fits into a NoSQL model
 * Data can be nested
 * Fields can change over time
 * Support for new types: arrays, etc…
 ```json
 {
  "name": "John",
  "age": 30,
  "cars": [
    "Ford",
    "BMW",
    "Fiat"
  ],
  "address": {
    "type": "house",
    "number": 23,
    "street": "Dream Road"
  }
 }
 ```
 ## Databases & Shared Responsibility on AWS
 * AWS offers use to manage different databases
 * Benefits include:
  * Quick Provisioning, High Availability, Vertical and Horizontal Scaling
  * Automated Backup & Restore, Operations, Upgrades
  * Operating System Patching is handled by AWS
  * Monitoring, alerting
 * Note: many databases technologies could be run on EC2, but you must handle yourself the resiliency, backup, patching, high availability, fault tolerance, scaling
 ## AWS RDS Overview
 * RDS stands for Relational Database Service
 * It’s a managed DB service for DB use SQL as a query language.
 * It allows you to create databases in the cloud that are managed by AWS
  * Postgres
  * MySQL
  * MariaDB
  * Oracle
  * Microsoft SQL Server
  * **Aurora (AWS Proprietary database)**
 ### Advantage over using RDS versus deploying DB on EC2
 * RDS is a managed service:
  * Automated provisioning, OS patching
  * Continuous backups and restore to specific timestamp (Point in Time Restore)!
  * Monitoring dashboards
  * Read replicas for improved read performance
  * Multi AZ setup for DR (Disaster Recovery)
  * Maintenance windows for upgrades
  * Scaling capability (vertical and horizontal)
  * Storage backed by EBS (gp2 or io1)
 * BUT you can’t SSH into your instances
 ## Amazon Aurora
 * Aurora is a proprietary technology from AWS (not open sourced)
 * PostgreSQL and MySQL are both supported as Aurora DB
 * Aurora is “AWS cloud optimized” and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
 * Aurora storage automatically grows in increments of 10GB, up to 64 TB.
 * Aurora costs more than RDS (20% more) – but is more efficient
 * Not in the free tier
 ## RDS Deployments: Read Replicas, Multi-AZ
 Read Replicas | Multi-AZ
 ---- | ----
 Scale the read workload of your DB | Failover in case of AZ outage (high availability)
 Can create up to 5 Read Replicas | Data is only read/written to the main database
 Data is only written to the main DB | Can only have 1 other AZ as failover
 ![Read Replicas | Multi-AZ](/images/read_replicas_multi_AZ.png)
 ## RDS Deployments: Multi-Region
 * Multi-Region (Read Replicas)
  * Disaster recovery in case of region issue
  * Local performance for global reads
  * Replication cost
 ![Multi-Region](/images/multi_region.png)
 ## Amazon ElastiCache Overview
 * The same way RDS is to get managed Relational Databases…
 * ElastiCache is to get managed Redis or Memcached
 * Caches are in-memory databases with high performance, low latency
 * Helps reduce load off databases for read intensive workloads
 * AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup
 ## DynamoDB
 * Fully Managed Highly available with replication across 3 AZ
 * NoSQL database - not a relational database
 * Scales to massive workloads, distributed “serverless” database
 * Millions of requests per seconds, trillions of row, 100s of TB of storage
 * Fast and consistent in performance
 * Single-digit millisecond latency – low latency retrieval
 * Integrated with IAM for security, authorization and administration
 * Low cost and auto scaling capabilities
 * Standard & Infrequent Access (IA) Table Class
 ### DynamoDB Accelerator - DAX
 * Fully Managed in-memory cache for DynamoDB
 * 10x performance improvement – single- digit millisecond latency to microseconds latency – when accessing your DynamoDB tables
 * Secure, highly scalable & highly available
 * Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases
 ### DynamoDB – Global Tables
 * Make a DynamoDB table accessible with low latency in multiple-regions
 * Active-Active replication (read/write to any AWS Region)
 ## Redshift Overview
 * Redshift is based on PostgreSQL, but it’s not used for OLTP (Online Transactional Processing)
 * It’s OLAP – online analytical processing (analytics and data warehousing)
 * Load data once every hour, not every second
 * 10x better performance than other data warehouses, scale to PBs of data
 * Columnar storage of data (instead of row based)
 * Massively Parallel Query Execution (MPP), highly available
 * Pay as you go based on the instances provisioned
 * Has a SQL interface for performing the queries
 * BI tools such as AWS Quicksight or Tableau integrate with it
 ## Amazon EMR
 * EMR stands for “Elastic MapReduce”
 * EMR helps creating Hadoop clusters (Big Data) to analyze and process vast amount of data
 * The clusters can be made of hundreds of EC2 instances
 * Also supports Apache Spark, HBase, Presto, Flink
 * EMR takes care of all the provisioning and configuration
 * Auto-scaling and integrated with Spot instances
 * Use cases: data processing, machine learning, web indexing, big data
 ## Amazon Athena
 * Serverless query service to analyze data stored in Amazon S3
 * Uses standard SQL language to query the files
 * Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
 * Pricing: $5.00 per TB of data scanned
 * Use compressed or columnar data for cost-savings (less scan)
 * Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc...
 * **analyze data in S3 using serverless SQL, use Athena**
 ## Amazon QuickSight
 * Serverless machine learning-powered business intelligence service to create interactive dashboards
 * Fast, automatically scalable, embeddable, with per-session pricing
 * Use cases:
  * Business analytics
  * Building visualizations
  * Perform ad-hoc analysis
  * Get business insights using data
 * Integrated with RDS, Aurora, Athena, Redshift, S3…
 ## DocumentDB
 * Aurora is an “AWS-implementation” of PostgreSQL / MySQL …
 * DocumentDB is the same for MongoDB (which is a NoSQL database)
 * MongoDB is used to store, query, and index JSON data
 * Similar “deployment concepts” as Aurora
 * Fully Managed, highly available with replication across 3 AZ
 * Aurora storage automatically grows in increments of 10GB, up to 64 TB.
 * Automatically scales to workloads with millions of requests per seconds
 ## Amazon Neptune
 * Fully managed graph database
 * A popular graph dataset would be a social network
  * Users have friends
  * Posts have comments
  * Comments have likes from users
  * Users share and like posts…
 * Highly available across 3 AZ, with up to 15 read replicas
 * Build and run applications working with highly connected datasets – optimized for these complex and hard queries
 * Can store up to billions of relations and query the graph with milliseconds latency
 * Highly available with replications across multiple AZs
 * Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking
 ## Amazon QLDB
 * QLDB stands for ”Quantum Ledger Database”
 * A ledger is a book **recording financial transactions**
 * Fully Managed, Serverless, High available, Replication across 3 AZ
 * Used to **review history of all the changes made to your application data** over time
 * **Immutable** system: no entry can be removed or modified, cryptographically verifiable
 * 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
 * Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules
 ## Amazon Managed Blockchain
 * Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
 * Amazon Managed Blockchain is a managed service to:
  * Join public blockchain networks
  * Or create your own scalable private network
 * Compatible with the frameworks Hyperledger Fabric & Ethereum
 ## AWS Glue
 * Managed extract, transform, and load (ETL) service
 * Useful to prepare and transform data for analytics
 * Fully serverless service
 * Glue Data Catalog: catalog of datasets
  * can be used by Athena, Redshift, EMR
 ## DMS – Database Migration Service
 * Quickly and securely migrate databases to AWS, resilient, self healing
 * The source database remains available during the migration
 * Supports:
  * Homogeneous migrations: ex Oracle to Oracle
  * Heterogeneous migrations: ex Microsoft SQL Server to Aurora
 ## Databases & Analytics Summary in AWS
 * Relational Databases - OLTP: RDS & Aurora (SQL)
 * Differences between Multi-AZ, Read Replicas, Multi-Region
 * In-memory Database: ElastiCache
 * Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
 * Warehouse - OLAP: Redshift (SQL)
 * Hadoop Cluster: EMR
 * Athena: query data on Amazon S3 (serverless & SQL)
 * QuickSight: dashboards on your data (serverless)
 * DocumentDB: “Aurora for MongoDB” (JSON – NoSQL database)
 * Amazon QLDB: Financial Transactions Ledger (immutable journal, cryptographically verifiable)
 * Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
 * Glue: Managed ETL (Extract Transform Load) and Data Catalog service
 * Database Migration: DMS
 * Neptune: graph database
--- a/images/multi_region.png
+++ b/images/multi_region.png
--- a/images/read_replicas_multi_AZ.png
+++ b/images/read_replicas_multi_AZ.png
--- a/s3.md
+++ b/s3.md
@@ -291,7 +291,7 @@ Data | 100 Mbps | 1Gbps | 10Gbps
 * High security: temperature controlled, GPS, 24/7 video surveillance
 * **Better than Snowball if you transfer more than 10 PB**
-Propertie | Snowcone | Snowball Edge Storage Optimized | Snowmobile
+Properties | Snowcone | Snowball Edge Storage Optimized | Snowmobile
 ---- | ---- | ---- | ----
 Storage Capacity | 8 TB usable | 80 TB usable | < 100 PB
 Migration Size | Up to 24 TB, online and offline | Up to petabytes, offline | Up to exabytes, offline