diff --git a/README.md b/README.md
index 80ff164..e587c27 100644
--- a/README.md
+++ b/README.md
@@ -31,6 +31,8 @@
   - Why make a global application?, Amazon Route 53 Overview, Route 53 Routing Policies, AWS CloudFront, AWS Global Accelerator, AWS Outposts, AWS WaveLength, AWS Local Zones
 - [Cloud Integration](sections/cloud_integration.md)
   - Amazon SQS - Simple Queue Service, Amazon Kinesis, Amazon SNS, Amazon MQ
+- [Cloud Monitoring](./sections/cloud_monitoring.md)
+  - Amazon CloudWatch, AWS CloudTrail, AWS X-Ray, Amazon CodeGuru, AWS Status - Service Health Dashboard, AWS Personal Health Dashboard
 
 ## Practice Exams ( dumps )
 
diff --git a/sections/cloud_monitoring.md b/sections/cloud_monitoring.md
new file mode 100644
index 0000000..0eea00d
--- /dev/null
+++ b/sections/cloud_monitoring.md
@@ -0,0 +1,243 @@
+# Cloud Monitoring
+
+- [Cloud Monitoring](#cloud-monitoring)
+  - [Amazon CloudWatch](#amazon-cloudwatch)
+    - [Important Metrics](#important-metrics)
+    - [Amazon CloudWatch Alarms](#amazon-cloudwatch-alarms)
+    - [Amazon CloudWatch Logs](#amazon-cloudwatch-logs)
+      - [CloudWatch Logs for EC2](#cloudwatch-logs-for-ec2)
+    - [Amazon CloudWatch Events](#amazon-cloudwatch-events)
+    - [Amazon EventBridge](#amazon-eventbridge)
+  - [AWS CloudTrail](#aws-cloudtrail)
+    - [CloudTrail Events](#cloudtrail-events)
+    - [CloudTrail Insights Events](#cloudtrail-insights-events)
+    - [CloudTrail Events Retention](#cloudtrail-events-retention)
+  - [AWS X-Ray](#aws-x-ray)
+    - [AWS X-Ray advantages](#aws-x-ray-advantages)
+  - [Amazon CodeGuru](#amazon-codeguru)
+    - [Amazon CodeGuru Reviewer](#amazon-codeguru-reviewer)
+    - [Amazon CodeGuru Profiler](#amazon-codeguru-profiler)
+  - [AWS Status - Service Health Dashboard](#aws-status---service-health-dashboard)
+  - [AWS Personal Health Dashboard](#aws-personal-health-dashboard)
+  - [Cloud Monitoring Summary](#cloud-monitoring-summary)
+
+## Amazon CloudWatch
+
+- A monitoring and observability service for AWS resources and applications.
+- Enables real-time monitoring of AWS resources, applications, and custom metrics.
+- A metric is a variable to monitor (CPUUtilization, NetworkIn, etc.)
+- Can create CloudWatch dashboards of metrics
+
+**Key Features:**
+
+- Collect and track metrics.
+- Set alarms and take automated actions.
+- Store and access logs for troubleshooting.
+
+### Important Metrics
+
+- **EC2 Instances:** CPU utilization, disk I/O, network I/O.
+  - Default metrics every 5 minutes
+  - Option for Detailed Monitoring ($$$): metrics every 1 minute
+- **EBS volumes**: Disk reads/writes
+- **RDS Databases:** CPU utilization, free storage space, read/write IOPS.
+- **S3 Buckets:** Number of requests (AllRequests), latency, and errors.
+- **Lambda Functions:** Invocation count, error count, duration.
+- **Billing**: Total Estimated Charge (only in us-east-1)
+- **Service Limits**: how much you’ve been using a service API
+- **Custom metrics**: push your own metrics
+
+### Amazon CloudWatch Alarms
+
+- Trigger notifications or automated actions when a metric exceeds a threshold.
+- Examples:
+  - Send an alert if EC2 CPU utilization exceeds 80%.
+  - Scale out EC2 instances based on demand.
+  - EC2 Actions: stop, terminate, reboot or recover an EC2 instance
+  - SNS notifications: send a notification into an SNS topic
+- Various options (sampling, %, max, min, etc…)
+- Example: create a billing alarm on the CloudWatch Billing metric
+- Alarm States: OK, INSUFFICIENT_DATA, ALARM
+
+### Amazon CloudWatch Logs
+
+- Centralized logging for AWS services and applications.
+- CloudWatch Logs can collect logs from:
+  - Elastic Beanstalk: collection of logs from application
+  - ECS: collection from containers
+  - AWS Lambda: collection from function logs
+  - CloudTrail based on filter
+  - CloudWatch log agents: on EC2 machines or on-premises servers
+  - Route 53: Log DNS queries
+- Enables real-time monitoring of logs
+- Adjustable CloudWatch Logs retention
+
+#### CloudWatch Logs for EC2
+
+- By default, no logs from your EC2 instance will go to CloudWatch
+- You need to run a CloudWatch agent on EC2 to push the log files you want
+- Make sure IAM permissions are correct
+- The CloudWatch log agent can be set up on-premises too
+
+### Amazon CloudWatch Events
+
+- Delivers a stream of system events describing changes in AWS resources.
+- Example: Trigger a Lambda function when an EC2 instance state changes.
+- Schedule: Cron jobs (scheduled scripts)
+  - Schedule Every hour => Trigger script on Lambda function
+- Event Pattern: Event rules to react to a service doing something
+  - IAM Root User Sign in Event => SNS Topic with Email Notification
+- Trigger Lambda functions, send SQS/SNS messages
+
+### Amazon EventBridge
+
+- EventBridge is the next evolution of CloudWatch Events
+- Default event bus: receives events generated by AWS services (CloudWatch Events)
+- Partner event bus: receive events from SaaS services or applications (Zendesk, DataDog, Segment, Auth0…)
+- Custom Event buses: for your own applications
+- Schema Registry: model event schema
+- EventBridge has a different name to mark the new capabilities
+- The CloudWatch Events name will be replaced with EventBridge
+
+## AWS CloudTrail
+
+- Tracks and logs API calls made in your AWS account for auditing and governance.
+- Useful for security analysis, compliance, and operational troubleshooting.
+- CloudTrail is enabled by default!
+- Get a history of events / API calls made within your AWS Account by:
+  - Console
+  - SDK
+  - CLI
+  - AWS Services
+- Can put logs from CloudTrail into CloudWatch Logs or S3
+- A trail can be applied to All Regions (default) or a single Region.
+- If a resource is deleted in AWS, investigate CloudTrail first!
+
+**Key Features:**
+
+- Logs API calls across AWS services, including CLI, SDK, and Management Console.
+- Tracks who made the call, when, and from where.
+
+### CloudTrail Events
+
+- Management Events:
+  - Operations that are performed on resources in your AWS account
+  - Examples:
+    - Configuring security (IAM AttachRolePolicy)
+    - Configuring rules for routing data (Amazon EC2 CreateSubnet)
+    - Setting up logging (AWS CloudTrail CreateTrail)
+  - By default, trails are configured to log management events.
+  - Can separate Read Events (that don’t modify resources) from Write Events (that may modify resources)
+- Data Events:
+  - By default, data events are not logged (because they are high-volume operations)
+  - Amazon S3 object-level activity (ex: GetObject, DeleteObject, PutObject): can separate Read and Write Events
+  - AWS Lambda function execution activity (the Invoke API)
+
+### CloudTrail Insights Events
+
+- Enable CloudTrail Insights to detect unusual activity in your account:
+  - Inaccurate resource provisioning
+  - Hitting service limits
+  - Bursts of AWS IAM actions
+  - Gaps in periodic maintenance activity
+- CloudTrail Insights analyzes normal management events to create a baseline
+- And then continuously analyzes write events to detect unusual patterns
+  - Anomalies appear in the CloudTrail console
+  - Event is sent to Amazon S3
+  - An EventBridge event is generated (for automation needs)
+
+### CloudTrail Events Retention
+
+- Events are stored for 90 days in CloudTrail
+- To keep events beyond this period, log them to S3 and use Athena to analyze them
+
+## AWS X-Ray
+
+- Helps analyze and debug distributed applications by providing request tracing.
+  - Test locally
+  - Add log statements everywhere
+  - Re-deploy in production
+
+**Key Features:**
+
+- Trace requests across AWS services and custom applications.
+- Identify performance bottlenecks and errors.
+- Visualize service maps to understand dependencies.
+
+### AWS X-Ray advantages
+
+- Troubleshooting performance (bottlenecks)
+- Understand dependencies in a microservice architecture
+- Pinpoint service issues
+- Review request behavior
+- Find errors and exceptions
+- Are we meeting our time SLA?
+- Where am I being throttled?
+- Identify users that are impacted
+
+## Amazon CodeGuru
+
+- Code review and performance profiling service.
+- Provides suggestions to improve the performance of applications.
+- Identifies the most expensive lines of code in an application.
+- It is based on machine learning models long used at Amazon.
+- Identifies code errors and risks with automatic code reviews.
+- CodeGuru Reviewer: automated code reviews for static code analysis (development)
+- CodeGuru Profiler: visibility/recommendations about application performance during runtime (production)
+
+### Amazon CodeGuru Reviewer
+
+- Uses machine learning to identify:
+  - Security vulnerabilities.
+  - Code inefficiencies.
+  - Best practices violations.
+- Provides recommendations to improve code quality.
+- Supports Java and Python
+- Integrates with GitHub, Bitbucket, and AWS CodeCommit
+
+### Amazon CodeGuru Profiler
+
+- Helps understand the runtime behavior of your application
+- Example: identify if your application is consuming excessive CPU capacity on a logging routine
+- Features:
+  - Identify and remove code inefficiencies
+  - Improve application performance (e.g., reduce CPU utilization)
+  - Decrease compute costs
+  - Provides heap summary (identify which objects are using up memory)
+  - Anomaly Detection
+- Supports applications running on AWS or on-premises
+- Minimal overhead on application
+
+## AWS Status - Service Health Dashboard
+
+- Service Health Dashboard is the single place to learn about the availability and operations of AWS services.
+- You can view the overall status of AWS services, and you can sign in to view personalized communications about your particular AWS account or organization.
+- Shows the health of all services in all Regions
+- Shows historical information for each day
+- Has an RSS feed you can subscribe to
+
+
+## AWS Personal Health Dashboard
+
+- AWS Personal Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact you.
+- While the Service Health Dashboard displays the general status of AWS services, Personal Health Dashboard gives you a personalized view into the performance and availability of the AWS services underlying your AWS resources.
+- The dashboard displays relevant and timely information to help you manage events in progress and provides proactive notification to help you plan for scheduled activities.
+- Global service
+- Shows how AWS outages directly impact you & your AWS resources
+- Covers alerts, remediation guidance, proactive notifications, and scheduled activities
+
+## Cloud Monitoring Summary
+
+| **Service**               | **Key Features**                                                                   |
+| ------------------------- | ---------------------------------------------------------------------------------- |
+| Amazon CloudWatch         | Metrics, Alarms, Logs, Events, EventBridge.                                        |
+|                           | - Metrics: monitor the performance of AWS services and billing metrics             |
+|                           | - Alarms: automate notification, perform EC2 action, notify to SNS based on metric |
+|                           | - Logs: collect log files from EC2 instances, servers, Lambda functions…           |
+|                           | - Events (or EventBridge): react to events in AWS, or trigger a rule on a schedule |
+| AWS CloudTrail            | Tracks API calls, detects unusual activity.                                        |
+| CloudTrail Insights       | Automated analysis of your CloudTrail Events                                       |
+| AWS X-Ray                 | Trace requests made through your distributed applications                          |
+| Amazon CodeGuru           | Automated code reviews and application performance recommendations                 |
+| Service Health Dashboard  | Status of all AWS services across all regions                                      |
+| Personal Health Dashboard | AWS events that impact your infrastructure                                         |
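
The three alarm states named in the new section (OK, INSUFFICIENT_DATA, ALARM) can be illustrated with a small Python sketch. This is a simplified model, not CloudWatch's actual evaluation logic: the function name `evaluate_alarm`, the `periods` parameter, and the rule "every recent datapoint must exceed the threshold" are assumptions made for illustration. Real alarms also support other statistics, comparison operators, and missing-data treatments.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    periods: int = 3,
) -> str:
    """Return an alarm state for the most recent `periods` datapoints.

    Hypothetical, simplified rule:
    - INSUFFICIENT_DATA if fewer than `periods` datapoints are available
      (a value of None stands for a missing datapoint),
    - ALARM if every recent datapoint is above the threshold,
    - OK otherwise.
    """
    recent = [d for d in datapoints[-periods:] if d is not None]
    if len(recent) < periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(d > threshold for d in recent) else "OK"


# CPU utilization samples checked against the 80% threshold used as an example above.
print(evaluate_alarm([50, 85, 90, 95], threshold=80))  # ALARM
```

Requiring several consecutive breaching datapoints, rather than one, is a common way to keep a short spike from triggering an action such as a scale-out.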