From e21d5542e562234b47cf86005b6ef92dbdc0a634 Mon Sep 17 00:00:00 2001
From: kananinirav <30398499+kananinirav@users.noreply.github.com>
Date: Mon, 22 Aug 2022 23:13:04 +0900
Subject: [PATCH] [Modified / Added] Cloud Monitoring doc

---
 README.md                    |   1 +
 sections/cloud_monitoring.md | 215 +++++++++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 sections/cloud_monitoring.md

diff --git a/README.md b/README.md
index 35b8d73..d14392e 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
 - [Deploying and Managing Infrastructure at Scale](sections/deploying.md)
 - [Global Infrastructure](sections/global_infrastructure.md)
 - [Cloud Integration](sections/cloud_integration.md)
+- [Cloud Monitoring](sections/cloud_monitoring.md)
 
 ### Contributors
 
diff --git a/sections/cloud_monitoring.md b/sections/cloud_monitoring.md
new file mode 100644
index 0000000..9f86a84
--- /dev/null
+++ b/sections/cloud_monitoring.md
@@ -0,0 +1,215 @@
+# Cloud Monitoring
+
+- [Cloud Monitoring](#cloud-monitoring)
+  - [Amazon CloudWatch](#amazon-cloudwatch)
+    - [Important Metrics](#important-metrics)
+    - [Amazon CloudWatch Alarms](#amazon-cloudwatch-alarms)
+    - [Amazon CloudWatch Logs](#amazon-cloudwatch-logs)
+      - [CloudWatch Logs for EC2](#cloudwatch-logs-for-ec2)
+    - [Amazon CloudWatch Events](#amazon-cloudwatch-events)
+    - [Amazon EventBridge](#amazon-eventbridge)
+  - [AWS CloudTrail](#aws-cloudtrail)
+    - [CloudTrail Events](#cloudtrail-events)
+    - [CloudTrail Insights Events](#cloudtrail-insights-events)
+    - [CloudTrail Events Retention](#cloudtrail-events-retention)
+  - [AWS X-Ray](#aws-x-ray)
+    - [AWS X-Ray advantages](#aws-x-ray-advantages)
+  - [Amazon CodeGuru](#amazon-codeguru)
+    - [Amazon CodeGuru Reviewer](#amazon-codeguru-reviewer)
+    - [Amazon CodeGuru Profiler](#amazon-codeguru-profiler)
+  - [AWS Status - Service Health Dashboard](#aws-status---service-health-dashboard)
+  - [AWS Personal Health Dashboard](#aws-personal-health-dashboard)
+  - [Cloud Monitoring Summary](#cloud-monitoring-summary)
+
+## Amazon CloudWatch
+
+- CloudWatch provides metrics for every service in AWS
+- A metric is a variable to monitor (CPUUtilization, NetworkIn, etc.)
+- Metrics have timestamps
+- Can create CloudWatch dashboards of metrics
+
+### Important Metrics
+
+- EC2 instances: CPU Utilization, Status Checks, Network (not RAM)
+  - Default metrics every 5 minutes
+  - Option for Detailed Monitoring ($$$): metrics every 1 minute
+- EBS volumes: Disk Read/Writes
+- S3 buckets: BucketSizeBytes, NumberOfObjects, AllRequests
+- Billing: Total Estimated Charge (only in us-east-1)
+- Service Limits: how much you’ve been using a service API
+- Custom metrics: push your own metrics
+
+### Amazon CloudWatch Alarms
+
+- Alarms are used to trigger notifications for any metric
+- Alarm actions:
+  - Auto Scaling: increase or decrease EC2 instances “desired” count
+  - EC2 Actions: stop, terminate, reboot or recover an EC2 instance
+  - SNS notifications: send a notification to an SNS topic
+- Various options (sampling, %, max, min, etc.)
+- Can choose the period on which to evaluate an alarm
+- Example: create a billing alarm on the CloudWatch Billing metric
+- Alarm States: OK, INSUFFICIENT_DATA, ALARM
+
+### Amazon CloudWatch Logs
+
+- CloudWatch Logs can collect logs from:
+  - Elastic Beanstalk: collection of logs from application
+  - ECS: collection from containers
+  - AWS Lambda: collection from function logs
+  - CloudTrail based on filter
+  - CloudWatch log agents: on EC2 machines or on-premises servers
+  - Route 53: Log DNS queries
+- Enables real-time monitoring of logs
+- Adjustable CloudWatch Logs retention
+
+#### CloudWatch Logs for EC2
+
+- By default, no logs from your EC2 instance will go to CloudWatch
+- You need to run a CloudWatch agent on EC2 to push the log files you want
+- Make sure IAM permissions are correct
+- The CloudWatch log agent can be set up on-premises too
+
+### Amazon CloudWatch Events
+
+- Schedule: Cron jobs (scheduled scripts)
+  - Schedule Every hour => Trigger script on Lambda function
+- Event Pattern: Event rules to react to a service doing something
+  - IAM Root User Sign in Event => SNS Topic with Email Notification
+- Trigger Lambda functions, send SQS/SNS messages
+
+### Amazon EventBridge
+
+- EventBridge is the next evolution of CloudWatch Events
+- Default event bus: generated by AWS services (CloudWatch Events)
+- Partner event bus: receive events from SaaS services or applications (Zendesk, DataDog, Segment, Auth0…)
+- Custom Event buses: for your own applications
+- Schema Registry: model event schema
+- EventBridge has a different name to mark the new capabilities
+- The CloudWatch Events name will be replaced with EventBridge
+
+## AWS CloudTrail
+
+- Provides governance, compliance and audit for your AWS Account
+- CloudTrail is enabled by default!
+- Get a history of events / API calls made within your AWS Account by:
+  - Console
+  - SDK
+  - CLI
+  - AWS Services
+- Can put logs from CloudTrail into CloudWatch Logs or S3
+- A trail can be applied to All Regions (default) or a single Region.
+- If a resource is deleted in AWS, investigate CloudTrail first!
+
+### CloudTrail Events
+
+- Management Events:
+  - Operations that are performed on resources in your AWS account
+  - Examples:
+    - Configuring security (IAM AttachRolePolicy)
+    - Configuring rules for routing data (Amazon EC2 CreateSubnet)
+    - Setting up logging (AWS CloudTrail CreateTrail)
+  - By default, trails are configured to log management events.
+  - Can separate Read Events (that don’t modify resources) from Write Events (that may modify resources)
+- Data Events:
+  - By default, data events are not logged (because they are high-volume operations)
+  - Amazon S3 object-level activity (ex: GetObject, DeleteObject, PutObject): can separate Read and Write Events
+  - AWS Lambda function execution activity (the Invoke API)
+
+### CloudTrail Insights Events
+
+- Enable CloudTrail Insights to detect unusual activity in your account:
+  - Inaccurate resource provisioning
+  - Hitting service limits
+  - Bursts of AWS IAM actions
+  - Gaps in periodic maintenance activity
+- CloudTrail Insights analyzes normal management events to create a baseline
+- And then continuously analyzes write events to detect unusual patterns
+  - Anomalies appear in the CloudTrail console
+  - Event is sent to Amazon S3
+  - An EventBridge event is generated (for automation needs)
+
+### CloudTrail Events Retention
+
+- Events are stored for 90 days in CloudTrail
+- To keep events beyond this period, log them to S3 and use Athena to query them
+
+## AWS X-Ray
+
+- Debugging in Production, the good old way:
+  - Test locally
+  - Add log statements everywhere
+  - Re-deploy in production
+- Log formats differ across applications and log analysis is hard.
+- Debugging: one big monolith “easy”, distributed services “hard”
+- No common views of your entire architecture
+
+### AWS X-Ray advantages
+
+- Troubleshooting performance (bottlenecks)
+- Understand dependencies in a microservice architecture
+- Pinpoint service issues
+- Review request behavior
+- Find errors and exceptions
+- Are we meeting time SLA?
+- Where am I throttled?
+- Identify users that are impacted
+
+## Amazon CodeGuru
+
+- An ML-powered service for automated code reviews and application performance recommendations
+- Provides two functionalities:
+- CodeGuru Reviewer: automated code reviews for static code analysis (development)
+- CodeGuru Profiler: visibility/recommendations about application performance during runtime (production)
+
+### Amazon CodeGuru Reviewer
+
+- Identify critical issues, security vulnerabilities, and hard-to-find bugs
+- Example: common coding best practices, resource leaks, security detection, input validation
+- Uses Machine Learning and automated reasoning
+- Hard-learned lessons across millions of code reviews on 1000s of open-source and Amazon repositories
+- Supports Java and Python
+- Integrates with GitHub, Bitbucket, and AWS CodeCommit
+
+### Amazon CodeGuru Profiler
+
+- Helps understand the runtime behavior of your application
+- Example: identify if your application is consuming excessive CPU capacity on a logging routine
+- Features:
+  - Identify and remove code inefficiencies
+  - Improve application performance (e.g., reduce CPU utilization)
+  - Decrease compute costs
+  - Provides heap summary (identify which objects are using up memory)
+  - Anomaly Detection
+- Supports applications running on AWS or on-premises
+- Minimal overhead on application
+
+## AWS Status - Service Health Dashboard
+
+- Shows the health of all services in all Regions
+- Shows historical information for each day
+- Has an RSS feed you can subscribe to
+-
+
+## AWS Personal Health Dashboard
+
+- AWS Personal Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact you.
+- While the Service Health Dashboard displays the general status of AWS services, Personal Health Dashboard gives you a personalized view into the performance and availability of the AWS services underlying your AWS resources.
+- The dashboard displays relevant and timely information to help you manage events in progress and provides proactive notification to help you plan for scheduled activities.
+- Global service
+- Shows how AWS outages directly impact you & your AWS resources
+- Alert, remediation, proactive, scheduled activities
+
+## Cloud Monitoring Summary
+
+- CloudWatch:
+  - Metrics: monitor the performance of AWS services and billing metrics
+  - Alarms: automate notifications, perform EC2 actions, notify SNS based on a metric
+  - Logs: collect log files from EC2 instances, servers, Lambda functions…
+  - Events (or EventBridge): react to events in AWS, or trigger a rule on a schedule
+- CloudTrail: audit API calls made within your AWS account
+- CloudTrail Insights: automated analysis of your CloudTrail Events
+- X-Ray: trace requests made through your distributed applications
+- Service Health Dashboard: status of all AWS services across all regions
+- Personal Health Dashboard: AWS events that impact your infrastructure
+- Amazon CodeGuru: automated code reviews and application performance recommendations
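
Note on the CloudTrail entries in the added document ("audit API calls made within your AWS account", "If a resource is deleted in AWS, investigate CloudTrail first!"): the sketch below is not part of the patch. It is a minimal illustration, assuming CloudTrail log files have already been delivered to S3 and downloaded as JSON text with a top-level `Records` array. The sample records, the `find_destructive_events` helper, and the choice of `Delete`/`Terminate` name prefixes are all hypothetical and exist only for this example.

```python
import json

# Hypothetical sample shaped like the "Records" array of a CloudTrail log file.
# Account IDs, users, and timestamps are made up for illustration.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventTime": "2022-08-22T14:01:00Z", "eventSource": "ec2.amazonaws.com",
         "eventName": "DescribeInstances",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
        {"eventTime": "2022-08-22T14:05:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "DeleteBucket",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"}},
        {"eventTime": "2022-08-22T14:09:00Z", "eventSource": "ec2.amazonaws.com",
         "eventName": "TerminateInstances",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    ]
})

# Assumption: API calls whose names start with these prefixes remove resources.
# This is a heuristic for the example, not an official classification.
DESTRUCTIVE_PREFIXES = ("Delete", "Terminate")


def find_destructive_events(log_text):
    """Return who/what/when for records whose eventName looks destructive."""
    records = json.loads(log_text).get("Records", [])
    hits = []
    for record in records:
        if record.get("eventName", "").startswith(DESTRUCTIVE_PREFIXES):
            hits.append({
                "time": record.get("eventTime"),
                "service": record.get("eventSource"),
                "action": record.get("eventName"),
                "who": record.get("userIdentity", {}).get("arn"),
            })
    return hits


if __name__ == "__main__":
    for hit in find_destructive_events(SAMPLE_LOG):
        print(hit["time"], hit["service"], hit["action"], hit["who"])
```

On real data the same filter would be applied to each downloaded log file. For events older than the 90-day window mentioned in the document, the S3 copy is the only place to look.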