Amazon CloudWatch: A Comprehensive Guide

01 · Overview

What is CloudWatch?

Amazon CloudWatch is the AWS monitoring and observability service. It collects, tracks, and analyzes metrics, logs, and events from AWS resources as well as on-premises and hybrid applications. CloudWatch gives you a system-wide view, detects anomalies, and can respond to incidents automatically.

📊

Metrics

Collects metrics from 70+ AWS services automatically. Supports custom metrics at resolutions down to 1 second.

📝

Logs

Centralizes logs from EC2, Lambda, ECS, and API Gateway. Search, filter, and analyze them in near real time.

🔔

Alarms

Set alert thresholds and automatically trigger SNS notifications, Auto Scaling, or EC2 actions when a threshold is crossed.

📈

Dashboards

Build custom cross-account, cross-region dashboards, suitable for real-time display on large screens.

⚡

Events

React to state changes in AWS resources. Schedule cron jobs with EventBridge rules.

🕵️

Synthetics

Canary monitoring: continuously checks endpoints, APIs, and UIs around the clock to catch failures before users do.

The three pillars of observability

Metrics, logs, and traces.

Common use cases

Infrastructure Monitoring: Monitor EC2, RDS, ELB, and Lambda to detect bottlenecks and resource exhaustion
Application Performance: Track application latency, error rate, and throughput
Log Analytics: Centralize logs, search for errors, and create metric filters that count error patterns
Auto Scaling: Trigger scale-out/scale-in from CloudWatch alarms
Cost Optimization: Detect idle or under-utilized resources
Security Monitoring: Watch for unauthorized API calls and unusual traffic patterns
Compliance & Audit: Retain logs long term and produce compliance reports
DevOps Automation: Remediate automatically when an incident is detected (self-healing)

CloudWatch vs. other monitoring services

Criterion	CloudWatch	X-Ray	CloudTrail
Purpose	Monitoring metrics, logs, alarms	Distributed tracing	API audit logging
Data	Metrics + Logs + Events	Traces + Service Map	API call history
Question answered	"What is the CPU %? Are there errors?"	"Where is the request slow?"	"Who did what, and when?"
Integration	70+ AWS services automatically	SDK instrumentation	All AWS API calls
Retention	Metrics: 15 months; Logs: configurable	30 days (default)	90 days (Event history)

💡 Remember for the SAA exam

CloudWatch is the default AWS monitoring service, and most AWS services publish metrics to it. CloudWatch does not collect memory or disk-usage metrics from EC2 by default; you need to install the CloudWatch Agent.

02 · Metrics

CloudWatch Metrics

Metrics are the basic unit of data in CloudWatch: each metric is a time-ordered series of data points. Most AWS services (EC2, RDS, Lambda, ELB, ...) publish metrics automatically, with no configuration required.

Core concepts

Concept	Description	Example
Namespace	Isolated container for a service's metrics. AWS namespaces use the format `AWS/ServiceName`	`AWS/EC2`, `AWS/RDS`, `AWS/Lambda`
Metric Name	Name of a specific metric within a namespace	`CPUUtilization`, `NetworkIn`
Dimension	Name/value pair that identifies and filters a metric (up to 30 dimensions per metric)	`InstanceId=i-1234`, `AutoScalingGroupName=my-asg`
Statistic	Aggregation applied to data points	`Average`, `Sum`, `Min`, `Max`, `SampleCount`, `pNN`
Period	Aggregation interval (seconds)	`60` (1 minute), `300` (5 minutes), `3600` (1 hour)
Resolution	How often data points are stored	Standard (60s), High-Resolution (1s)
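To make the table concrete: a metric is identified by its namespace, its name, and its exact set of dimensions, so every distinct dimension combination is a separate metric. The sketch below only illustrates that rule; the helper `metric_key` and the sample data points are hypothetical and not part of any AWS SDK.

```python
from typing import Dict, Tuple


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> Tuple:
    """Return a hashable identity: namespace + metric name + sorted dimensions."""
    return (namespace, name, tuple(sorted(dimensions.items())))


# Hypothetical data points published by an application.
published = [
    ("MyApp/Production", "ActiveUsers", {"Environment": "Production", "Service": "WebApp"}),
    # Same dimensions in a different order: still the same metric.
    ("MyApp/Production", "ActiveUsers", {"Service": "WebApp", "Environment": "Production"}),
    # One dimension fewer: a different metric.
    ("MyApp/Production", "ActiveUsers", {"Environment": "Production"}),
]

unique_metrics = {metric_key(ns, name, dims) for ns, name, dims in published}
print(len(unique_metrics))
```

Because the dimension set is part of the identity, adding a high-cardinality dimension multiplies the number of metrics you store and pay for.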

EC2 metrics: default vs. Agent

✅ Default metrics (no Agent required)

CPUUtilization: % CPU in use
NetworkIn / NetworkOut: bytes
DiskReadOps / DiskWriteOps (instance store volumes only)
StatusCheckFailed: instance and system checks
CPUCreditBalance: burstable instances
Default period: 5 minutes (basic monitoring)

❌ Requires the CloudWatch Agent

mem_used_percent: % RAM in use
disk_used_percent: % disk in use
swap_used_percent: swap usage
netstat_tcp_established: TCP connections
Processes, custom app metrics
Can be published at 1-second resolution

⚠️ Common exam question

"An EC2 instance does not show memory utilization in CloudWatch" → Answer: install the CloudWatch Agent. Memory and disk-usage metrics are not collected by default because CloudWatch only sees hypervisor-level metrics.

Custom Metrics

You can publish custom metrics to CloudWatch with the PutMetricData API. Custom metrics live in a namespace that you name yourself (it must not start with the AWS/ prefix).

AWS CLI: put custom metric

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "ActiveUsers" \
  --value 142 \
  --unit Count \
  --dimensions Environment=Production,Service=WebApp

# High-resolution metric (storage resolution = 1 second)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "OrdersPerSecond" \
  --value 85 \
  --unit Count \
  --storage-resolution 1

Metric Resolution & Retention

Resolution	Minimum period	Retention	Cost
Standard (60 seconds)	1 minute	15 days (1-minute data) → 63 days (5-minute) → 455 days (1-hour)	Free (AWS metrics)
High-Resolution (1 second)	1 second	3 hours (1-second data) → then rolled up like standard	$0.30/metric/month
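The roll-up schedule in the table can be read as a lookup: given the age of a data point, what is the finest period you can still query? The following is a minimal sketch of that reading, using only the tiers listed above; the function name `finest_period` is an illustrative assumption.

```python
from datetime import timedelta

# Retention tiers from the table: (maximum data age, finest period in seconds).
TIERS = [
    (timedelta(hours=3), 1),      # high-resolution data points
    (timedelta(days=15), 60),     # 1-minute data points
    (timedelta(days=63), 300),    # 5-minute data points
    (timedelta(days=455), 3600),  # 1-hour data points
]


def finest_period(age: timedelta, high_resolution: bool = False):
    """Return the smallest period (seconds) still available for data of this age,
    or None once the data has aged out entirely."""
    for max_age, period in TIERS:
        if period == 1 and not high_resolution:
            continue  # standard-resolution metrics never have sub-minute data
        if age <= max_age:
            return period
    return None


print(finest_period(timedelta(days=30)))
```

For example, a graph of last month's CPU at a 1-minute period returns nothing older than 15 days; the same graph at a 5-minute period covers the whole month.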

💚 Cost-saving tip

Use high-resolution metrics only where you truly need real-time monitoring (for example, trading systems or game servers). For most workloads, standard 60-second resolution is enough.

Metric Math

CloudWatch lets you derive new time series by combining existing metrics with math expressions:

Metric Math Examples

# Error Rate (%) = Errors / Total Requests * 100
# m1 = AWS/ApplicationELB HTTPCode_Target_5XX_Count (statistic: Sum)
# m2 = AWS/ApplicationELB RequestCount (statistic: Sum)
Expression: (m1 / m2) * 100

# Anomaly Detection Band
ANOMALY_DETECTION_BAND(m1, 2)  # band width of 2 standard deviations

# SEARCH expression: find metrics dynamically
SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300)
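The error-rate expression above is evaluated period by period. As a rough illustration of what `(m1 / m2) * 100` computes, here is a small Python sketch over two aligned series; the sample values are hypothetical, and returning `None` for periods with zero requests is a choice made for this sketch, not a statement about how CloudWatch renders such periods.

```python
from typing import List, Optional


def error_rate_percent(errors: List[float], requests: List[float]) -> List[Optional[float]]:
    """Per-period error rate in percent, mirroring (m1 / m2) * 100.
    Periods with zero requests yield None instead of dividing by zero."""
    return [
        e * 100 / r if r else None
        for e, r in zip(errors, requests)
    ]


m1 = [2, 0, 5]      # hypothetical HTTPCode_Target_5XX_Count sums per period
m2 = [200, 0, 100]  # hypothetical RequestCount sums per period
rates = error_rate_percent(m1, m2)
print(rates)
```

Using `Sum` for both inputs matters here: averaging counts per period would give a ratio of averages rather than the actual share of failed requests.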

03 · Alarms

CloudWatch Alarms

A CloudWatch alarm watches a metric and performs actions when the metric crosses a threshold for a specified length of time. Alarms are the main mechanism for responding to incidents automatically on AWS.

Three alarm states

An alarm is always in one of three states: OK, ALARM, or INSUFFICIENT_DATA.

Alarm Actions

Action Type	Description	Example
SNS Notification	Send a notification by email, SMS, HTTP endpoint, or Lambda	Send an email when CPU > 90%
Auto Scaling	Trigger a scaling policy (scale out/in)	Add an instance when CPU > 70%
EC2 Action	Stop, Terminate, Reboot, or Recover an instance	Recover the instance when StatusCheckFailed_System fires
Systems Manager	Create an OpsItem in OpsCenter or an incident in Incident Manager	Open an OpsItem when an alarm fires
Lambda	Invoke a Lambda function (directly as an alarm action, or via SNS)	Custom remediation logic

Creating a basic alarm

AWS CLI: create alarm

# Alarm when average CPU > 80% for 2 consecutive evaluation periods (5 minutes each)
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-WebServer" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:AlertTeam \
  --dimensions Name=InstanceId,Value=i-0abcd1234ef567890 \
  --treat-missing-data notBreaching

Composite Alarms

A composite alarm combines several alarms with AND/OR/NOT logic. This cuts alarm noise, for example by triggering only when multiple conditions hold at the same time.

Composite alarm: example

# Trigger only when BOTH alarms are in the ALARM state
aws cloudwatch put-composite-alarm \
  --alarm-name "Critical-WebApp" \
  --alarm-rule 'ALARM("HighCPU-WebServer") AND ALARM("High5xxErrors")' \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:CriticalAlerts

Anomaly Detection

CloudWatch uses machine learning to build a model of a metric's expected ("normal") values. An anomaly detection alarm triggers when the metric moves outside the anomaly detection band.

💚 When to use Anomaly Detection?

Metrics with a seasonal pattern, for example traffic that peaks at midday and drops at night
You do not know what fixed threshold is appropriate
You want anomalies detected automatically without hand-tuning thresholds

⚠️ treat-missing-data

The treat-missing-data parameter matters a great deal:

missing: missing data points are left out of the evaluation; if every data point in the evaluation range is missing, the alarm goes to INSUFFICIENT_DATA (default)
notBreaching: missing data points count as within the threshold (good for scale-in alarms)
breaching: missing data points count as breaching the threshold
ignore: the current alarm state is maintained
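The interplay of `evaluation-periods`, `datapoints-to-alarm`, and `treat-missing-data` is easier to see in code. The sketch below is a deliberately simplified model, not CloudWatch's actual evaluator: it covers only a window with no gaps (the M-out-of-N rule) and a window where every data point is missing. Partially missing windows follow more involved rules and are intentionally left out. The function name `evaluate` and its parameters are illustrative.

```python
from typing import List, Optional


def evaluate(datapoints: List[Optional[float]], threshold: float,
             datapoints_to_alarm: int, treat_missing_data: str = "missing",
             current_state: str = "OK") -> str:
    """Simplified alarm evaluation for GreaterThanThreshold.

    datapoints holds the last N evaluation periods; None marks a missing point.
    """
    if all(value is None for value in datapoints):
        # Every point is missing: the outcome depends only on treat-missing-data.
        return {
            "missing": "INSUFFICIENT_DATA",
            "ignore": current_state,
            "breaching": "ALARM",
            "notBreaching": "OK",
        }[treat_missing_data]
    if any(value is None for value in datapoints):
        raise NotImplementedError("partially missing windows are not modeled here")
    # Complete window: ALARM when at least M of the N points breach the threshold.
    breaching = sum(1 for value in datapoints if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


print(evaluate([85, 70, 90], threshold=80, datapoints_to_alarm=2))
```

With 3 evaluation periods and 2 data points to alarm, a single spike no longer flips the alarm, which is the usual first remedy for flapping.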

04 · Logs

CloudWatch Logs

CloudWatch Logs collects, stores, and analyzes log data from EC2, Lambda, ECS, API Gateway, Route 53, CloudTrail, and many other sources. Logs are organized hierarchically: Log Group → Log Stream → Log Events.

Log structure

Component	Description	Example
Log Group	Group of log streams that share retention, permissions, and metric filter settings	`/aws/lambda/my-function`, `/aws/ecs/my-cluster`
Log Stream	Sequence of log events from the same source (instance, container, function execution environment)	`i-0abc123/var/log/messages`
Log Event	A single log record: timestamp + message	`2024-01-15T10:30:00Z ERROR Connection refused`

Retention Policy

By default, logs are kept forever (Never Expire). Set a retention period to keep costs under control:

Retention	Use Case
1 day	Temporary debugging, dev/test environments
7-14 days	Application logs that do not need long retention
30-90 days	Production logs, troubleshooting
1-2 years	Compliance requirements (PCI-DSS, HIPAA)
Never Expire	Audit logs, legal requirements (the default; watch the cost!)

Filter Patterns

Filter patterns let you search log events for a specific pattern:

Filter Pattern Examples

# Match every line containing "ERROR"
ERROR

# Match lines containing both "ERROR" and "timeout"
ERROR timeout

# Match lines that do NOT contain "INFO"
-INFO

# JSON filter: log events with statusCode = 500
{ $.statusCode = 500 }

# JSON filter: log events with latency > 1000 ms
{ $.latency > 1000 }

# JSON filter: combined conditions
{ $.statusCode = 5* && $.latency > 2000 }

# Space-delimited filter
[ip, user, timestamp, request, status_code = 5*, bytes]
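The JSON filters above only work if the application writes each log event as one JSON object per line. A minimal sketch of such a logger follows; the field names `statusCode` and `latency` are chosen to line up with the example patterns, while `log_request` and `path` are hypothetical names for illustration.

```python
import json
import time


def log_request(status_code: int, latency_ms: float, path: str) -> str:
    """Serialize one request as a single-line JSON log event and print it."""
    event = {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "statusCode": status_code,             # matched by { $.statusCode = 500 }
        "latency": latency_ms,                 # matched by { $.latency > 1000 }
        "path": path,
    }
    line = json.dumps(event)
    print(line)
    return line


line = log_request(500, 2345.0, "/checkout")
```

Keeping numeric fields as JSON numbers rather than strings is what allows comparisons such as `>` in a filter pattern to apply.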

Metric Filters

Metric filters create CloudWatch metrics from log data. For example, count occurrences of "ERROR" in the logs and publish the count as an ErrorCount metric.

AWS CLI: create a metric filter

# Create a metric filter that counts 5xx errors
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/my-api" \
  --filter-name "5xxErrors" \
  --filter-pattern '{ $.statusCode = 5* }' \
  --metric-transformations \
    metricName=Http5xxCount,metricNamespace=MyApp,metricValue=1,defaultValue=0

Log Destinations

CloudWatch Logs can export or stream to several destinations:

S3: Batch export (CreateExportTask), used for archiving and offline analysis
Kinesis Data Firehose: Near-real-time streaming → S3, Redshift, OpenSearch
Kinesis Data Streams: Real-time processing with Lambda consumers
Lambda: Subscription filter → real-time processing
OpenSearch: Subscription filter → full-text search & visualization

💡 Export vs Subscription

Export to S3 is a batch operation (log data can take up to 12 hours to become available for export). If you need near-real-time delivery, use a subscription filter with Kinesis Data Firehose or Lambda.

🔴 Cross-account Log Sharing

To send logs from account A to account B, create a subscription filter in account A that targets a destination in account B (backed by, for example, a Kinesis stream in account B). The destination needs an access policy that allows account A to send logs to it.

05 · Logs Insights

CloudWatch Logs Insights

CloudWatch Logs Insights is an interactive query tool for searching, analyzing, and visualizing log data with a purpose-built query language. It can query multiple log groups at once and automatically discovers fields in JSON logs.

Query syntax

Command	Description	Example
`fields`	Select the fields to display	`fields @timestamp, @message`
`filter`	Filter log events by a condition	`filter @message like /ERROR/`
`stats`	Compute aggregate statistics	`stats count(*) by bin(5m)`
`sort`	Sort the results	`sort @timestamp desc`
`limit`	Limit the number of results	`limit 50`
`parse`	Extract fields from a message	`parse @message "user=* action=*" as user, action`
`display`	Choose the fields shown in the final output	`display @timestamp, user, action`

Practical query examples

Logs Insights: queries

# 1. Find the 25 most recent log events
fields @timestamp, @message
| sort @timestamp desc
| limit 25

# 2. Count errors per 5-minute bin
filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)

# 3. Top 10 IPs by request count (Apache/Nginx logs)
parse @message "* - - [*] \"* * *\" * *" as ip, time, method, url, protocol, status, bytes
| stats count(*) as requestCount by ip
| sort requestCount desc
| limit 10

# 4. Latency percentiles (JSON logs)
filter ispresent(latency)
| stats avg(latency) as avgLatency,
        min(latency) as minLatency,
        max(latency) as maxLatency,
        pct(latency, 95) as p95,
        pct(latency, 99) as p99
  by bin(15m)

# 5. Find Lambda cold starts
filter @type = "REPORT"
| parse @message "Init Duration: * ms" as initDuration
| filter ispresent(initDuration)
| stats count(*) as coldStarts, avg(initDuration) as avgInitMs by bin(1h)

# 6. Find the most common exceptions
# (a regex with a named group is used because a glob pattern needs one alias per wildcard)
filter @message like /Exception/
| parse @message /(?<exType>\w+Exception)/
| stats count(*) as cnt by exType
| sort cnt desc
| limit 10
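Several of the queries above group results with `bin(5m)` or `bin(1h)`. Conceptually, `bin` rounds each timestamp down to the start of a fixed-width window and `stats count(*)` counts per window. The sketch below reproduces that idea locally in Python as an aid to reading the queries; it is not how Logs Insights is implemented, and the timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def count_by_bin(timestamps, bin_seconds=300):
    """Group epoch-second timestamps into fixed-width bins,
    similar in spirit to: stats count(*) by bin(5m)."""
    counts = Counter((ts // bin_seconds) * bin_seconds for ts in timestamps)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc).isoformat(): n
        for start, n in sorted(counts.items())
    }


# Hypothetical error timestamps (epoch seconds, UTC) starting at 2024-01-15T10:30:00Z.
errors = [1705314600, 1705314650, 1705314899, 1705314900, 1705315300]
bins = count_by_bin(errors)
print(bins)
```

A wider bin smooths short bursts, so pick the bin width to match the alerting granularity you care about.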

💚 Saved Queries

You can save frequently used queries as Saved Queries for reuse. Saved queries are stored per account and region, so team members working in the same account can share them.

💡 Limits to know

Up to 50 log groups per query
Query timeout: 60 minutes
Maximum results: 10,000 rows (use limit to control this)
Up to 30 concurrent queries per account

06 · Agent

CloudWatch Agent

The CloudWatch Agent is software that runs on EC2 instances or on-premises servers, collects system-level metrics (memory, disk, swap) and logs, and sends them to CloudWatch. It is the standard way to get memory and disk-usage metrics from EC2.

Installing the Agent

Installing the CloudWatch Agent on Amazon Linux 2

# Option 1: Use SSM (recommended)
aws ssm send-command \
  --document-name "AWS-ConfigureAWSPackage" \
  --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}' \
  --targets "Key=instanceids,Values=i-0abc123"

# Option 2: Install from the package repository
sudo yum install -y amazon-cloudwatch-agent

# Option 3: Download the package directly
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm

Agent configuration (config.json)

amazon-cloudwatch-agent.json

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyApp/EC2",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent"],
        "metrics_collection_interval": 30
      },
      "disk": {
        "measurement": ["disk_used_percent", "disk_free"],
        "resources": ["/", "/data"],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": ["swap_used_percent"]
      },
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "totalcpu": true,
        "metrics_collection_interval": 30
      },
      "netstat": {
        "measurement": ["tcp_established", "tcp_time_wait"],
        "metrics_collection_interval": 60
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages",
            "log_group_name": "/ec2/system/messages",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          },
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{instance_id}/{file_name}",
            "retention_in_days": 14,
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}"
          }
        ]
      }
    }
  }
}

Starting the Agent

Managing the CloudWatch Agent

# Create a config with the interactive wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Start the agent with a config file
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

# Check the status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a status

# Stop the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a stop

⚠️ Required IAM role

The EC2 instance needs an IAM role with the CloudWatchAgentServerPolicy policy to send metrics and logs. If you use SSM to manage the config, also attach AmazonSSMManagedInstanceCore.

Storing the config in SSM Parameter Store

SSM Parameter Store

# Store the config in SSM
aws ssm put-parameter \
  --name "AmazonCloudWatch-linux-config" \
  --type String \
  --value file://amazon-cloudwatch-agent.json

# Start the agent from the SSM config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c ssm:AmazonCloudWatch-linux-config

07 · Dashboards

CloudWatch Dashboards

CloudWatch dashboards are customizable pages where you build visual control panels from many widget types. Dashboards support cross-account and cross-region views, which makes them well suited to NOC (Network Operations Center) screens.

Widget types

Widget Type	Description	Use Case
Line	Line chart over time	CPU utilization, request count over time
Stacked Area	Stacked area chart	Network traffic in/out, memory breakdown
Number	Shows a single value (big number)	Current active users, error count
Gauge	Gauge with colored thresholds	CPU %, disk usage % with red/yellow/green zones
Bar	Bar chart	Comparing metrics across instances
Pie	Pie chart	Distribution of HTTP status codes
Text	Markdown text	Titles, notes, links, instructions
Log	Shows the results of a Logs Insights query	Recent errors, top slow requests
Alarm Status	Shows the state of multiple alarms	Health overview of all services
Explorer	Dynamic widget that discovers resources automatically	Automatically shows metrics for tagged resources

Cross-Account & Cross-Region

A single dashboard can show metrics from multiple AWS accounts and regions on one page. This requires:

Cross-account: Configure the CloudWatch-CrossAccountSharingRole IAM role in the source accounts
Cross-region: Choose the region for each widget when building the dashboard

AWS CLI: create a dashboard

# Create a dashboard from a JSON definition
aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "metrics": [
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0abc123"]
          ],
          "period": 300,
          "stat": "Average",
          "region": "ap-southeast-1",
          "title": "EC2 CPU Utilization"
        }
      },
      {
        "type": "text",
        "x": 0, "y": 6, "width": 12, "height": 2,
        "properties": {
          "markdown": "## 🟢 Production Dashboard\nUpdated in real time"
        }
      }
    ]
  }'
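Hand-writing the `x`, `y`, `width`, and `height` of every widget becomes tedious once a dashboard has more than a few graphs. The sketch below generates a dashboard body with one CPU widget per instance, laid out left to right on the 24-column dashboard grid. The function names, the extra instance IDs, and the fixed `period`/`stat` values are assumptions made for illustration; the resulting string would be passed to `--dashboard-body`.

```python
import json

GRID_WIDTH = 24  # dashboard grid is 24 columns wide


def metric_widget(title, metric, region, x, y, width=12, height=6):
    """Build one metric widget in the same shape as the JSON example above."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "metrics": [metric],
            "period": 300,
            "stat": "Average",
            "region": region,
            "title": title,
        },
    }


def build_dashboard(instance_ids, region="ap-southeast-1", width=12, height=6):
    """Return a dashboard body (JSON string) with one CPU widget per instance."""
    per_row = GRID_WIDTH // width
    widgets = [
        metric_widget(
            f"CPU {instance_id}",
            ["AWS/EC2", "CPUUtilization", "InstanceId", instance_id],
            region,
            x=(i % per_row) * width,   # column position within the row
            y=(i // per_row) * height,  # start a new row when the current one is full
            width=width, height=height,
        )
        for i, instance_id in enumerate(instance_ids)
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical instance IDs.
body = build_dashboard(["i-0abc123", "i-0def456", "i-0aaa789"])
print(body)
```

Generating the body this way also keeps the per-dashboard metric count visible in code, which helps when staying within the free-tier limit matters.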

💡 Dashboard cost

3 dashboards are free (each with up to 50 metrics)
Additional dashboards: $3/dashboard/month
Automatic dashboards (created by AWS) are completely free

08 · Events / EventBridge

CloudWatch Events / EventBridge

Amazon EventBridge (formerly CloudWatch Events) is a serverless event bus that connects applications with data from many sources. EventBridge receives events from AWS services, SaaS applications, and custom applications, then routes them to targets based on rules.

💡 CloudWatch Events → EventBridge

EventBridge is the successor to CloudWatch Events. They share the same API and infrastructure. AWS recommends EventBridge for all new use cases. Either name may appear on the SAA exam.

Event Pattern vs Schedule

📋 Event Pattern (Reactive)

Reacts when an event occurs
EC2 instance terminated
S3 object created
IAM policy changed
CodeBuild build failed

⏰ Schedule (Proactive)

Runs on a fixed schedule
rate(5 minutes)
rate(1 hour)
cron(0 8 * * ? *): 08:00 UTC every day
cron(0 0 1 * ? *): 00:00 UTC on the first day of each month
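Rate expressions have one easily missed rule: a value of 1 takes a singular unit (`rate(1 hour)`), and any larger value takes a plural unit (`rate(5 minutes)`). The sketch below validates a rate expression and converts it to seconds; `rate_to_seconds` is a hypothetical local helper, not an AWS API, and it does not handle `cron(...)` expressions.

```python
import re

_UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def rate_to_seconds(expression: str) -> int:
    """Convert a rate expression such as 'rate(5 minutes)' to seconds.
    Raises ValueError for malformed input or a singular/plural mismatch."""
    match = re.fullmatch(r"rate\((\d+) (minute|hour|day)(s?)\)", expression)
    if not match:
        raise ValueError(f"not a rate expression: {expression!r}")
    value, unit, plural = int(match.group(1)), match.group(2), match.group(3)
    # 1 requires the singular unit; any other value requires the plural unit.
    if value < 1 or (value == 1) == bool(plural):
        raise ValueError(f"unit does not agree with value: {expression!r}")
    return value * _UNIT_SECONDS[unit]


print(rate_to_seconds("rate(5 minutes)"))
```

Checking expressions locally like this can save a round trip when rules are generated from configuration files.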

Event pattern example

Event Pattern — EC2 Instance State Change

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated", "stopped"]
  }
}
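To see why this pattern matches some events and not others, it helps to spell out the basic matching rule: every field named in the pattern must be present in the event, and the event's value must be one of the values listed in the pattern; nested objects are matched recursively. The sketch below implements only that exact-value rule. It is an approximation for illustration and ignores the richer operators EventBridge supports (prefix, numeric, anything-but, and so on); the sample event is hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Exact-value matching only: each pattern key must exist in the event and the
    event's value must appear in the pattern's list (nested objects recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated", "stopped"]},
}

# Hypothetical event, trimmed to the fields relevant here.
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc123", "state": "terminated"},
}

print(matches(pattern, event))
```

Fields that the pattern does not mention, such as `instance-id` here, are ignored, which is why a short pattern can match a large event.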

AWS CLI: create an EventBridge rule

# Rule: notify SNS when an EC2 instance is terminated
aws events put-rule \
  --name "EC2TerminationAlert" \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]}
  }' \
  --state ENABLED

# Add a target (SNS topic)
aws events put-targets \
  --rule "EC2TerminationAlert" \
  --targets "Id"="1","Arn"="arn:aws:sns:ap-southeast-1:123456789012:Alerts"

# Schedule rule: fires every 5 minutes (attach a Lambda target with put-targets to run a function)
aws events put-rule \
  --name "HealthCheck-Every5Min" \
  --schedule-expression "rate(5 minutes)" \
  --state ENABLED

⚠️ EventBridge vs SNS

EventBridge: Rule-based event routing with pattern matching, content-based filtering, and many event sources (AWS + SaaS + custom)
SNS: Simple pub/sub messaging, fan-out pattern, basic message filtering
Use EventBridge when you need complex filtering on event content. Use SNS when you need simple fan-out to many subscribers.

09 · Synthetics

CloudWatch Synthetics

CloudWatch Synthetics lets you create canaries: scripts that run on a schedule to monitor endpoints, APIs, and workflows. By simulating real user behavior, canaries detect problems before users are affected.

What is a canary?

A canary is a script written in Node.js or Python that runs on Lambda and performs checks such as:

Heartbeat Monitoring: Check whether a URL returns 200 OK
API Canary: Call a REST API and check the response body, status code, and latency
Broken Link Checker: Crawl a website for broken links
Visual Monitoring: Take screenshots and compare them with a baseline (detects UI regressions)
GUI Workflow: Simulate a user flow (login → navigate → checkout)

Canary Blueprints

Blueprint	Description	Use Case
Heartbeat	Checks that a URL returns status 200	Uptime monitoring for a website
API Canary	Tests a REST API with request/response validation	API health and response format checks
Broken Link Checker	Crawls a web page for broken links	SEO monitoring, website quality
Visual Monitoring	Takes screenshots and compares them pixel by pixel	Detecting UI changes, defacement
Canary Recorder	Records user actions in Chrome → generates a script	E2E testing for complex workflows

Canary script: Node.js heartbeat

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const pageLoadBlueprint = async function () {
  const page = await synthetics.getPage();

  // Navigate to the URL
  const response = await page.goto('https://myapp.example.com', {
    waitUntil: 'domcontentloaded',
    timeout: 30000
  });

  // Check the status code
  if (response.status() !== 200) {
    throw new Error(`Expected 200, got ${response.status()}`);
  }

  // Take a screenshot
  await synthetics.takeScreenshot('homepage', 'loaded');

  // Check that the element exists
  await page.waitForSelector('#main-content', { timeout: 5000 });

  log.info('Page loaded successfully');
};

exports.handler = async () => {
  return await pageLoadBlueprint();
};

AWS CLI: create a canary

# Create a canary that runs every 5 minutes
aws synthetics create-canary \
  --name "api-health-check" \
  --artifact-s3-location "s3://my-canary-artifacts/api-health/" \
  --execution-role-arn "arn:aws:iam::123456789012:role/CanaryRole" \
  --schedule '{"Expression":"rate(5 minutes)"}' \
  --runtime-version "syn-nodejs-puppeteer-6.2" \
  --code '{"Handler":"index.handler","ZipFile":"..."}'

# Start the canary
aws synthetics start-canary --name "api-health-check"

💚 Integration with Alarms

Canaries automatically publish metrics in the CloudWatchSynthetics namespace. You can create a CloudWatch alarm on the SuccessPercent metric to be notified when a canary fails, and combine it with SNS to send alerts by email or Slack.

10 · Container Insights

CloudWatch Container Insights

Container Insights collects, aggregates, and displays metrics and logs from containerized applications running on Amazon ECS, Amazon EKS, and Kubernetes on EC2. It provides visibility at the cluster, node, pod, task, and service levels.

Metrics collected

Level	Metrics	Platform
Cluster	CPU/Memory utilization, running tasks/pods count, node count	ECS + EKS
Service	CPU/Memory per service, running task count, desired count	ECS
Task / Pod	CPU/Memory per task/pod, network rx/tx, storage read/write	ECS + EKS
Node	CPU/Memory per node, filesystem utilization, network	EKS
Container	CPU/Memory per container, restart count	ECS + EKS

Enabling Container Insights

# ECS: enable when creating a cluster
aws ecs create-cluster \
  --cluster-name my-cluster \
  --settings name=containerInsights,value=enabled

# ECS: enable on an existing cluster
aws ecs update-cluster-settings \
  --cluster my-cluster \
  --settings name=containerInsights,value=enabled

# EKS: install the CloudWatch Agent (DaemonSet)
# Use AWS Distro for OpenTelemetry (ADOT) or Fluent Bit
# Note: the quickstart manifest is a template; replace its {{cluster_name}} and
# {{region_name}} placeholders (for example with sed) before applying it.
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonSet/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml

Performance Log Events

Container Insights sends performance log events in Embedded Metric Format (EMF) to the log group /aws/containerinsights/{cluster-name}/performance. You can query them with Logs Insights:

Logs Insights: Container Insights

# Top 10 pods by CPU usage (EKS field names)
filter Type = "Pod"
| stats avg(pod_cpu_utilization) as avgCPU by PodName
| sort avgCPU desc
| limit 10

# Find pods that were OOMKilled
filter @message like /OOMKilled/
| fields @timestamp, PodName, @message
| sort @timestamp desc
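Embedded Metric Format, mentioned above, is ordinary JSON with an `_aws` metadata block that tells CloudWatch which top-level fields are metrics and which are dimensions. The sketch below builds one such record for a custom application metric. It is a minimal illustration assuming the record is written to a log stream that CloudWatch processes as EMF (for example, standard output of a Lambda function); `emf_record` and the sample names are hypothetical.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions):
    """Build one Embedded Metric Format log line as a JSON string."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set, listing the names of the dimension fields below.
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        **dimensions,         # dimension values live at the top level
        metric_name: value,   # so does the metric value itself
    })


record = emf_record("MyApp", "Latency", 42, "Milliseconds", {"Service": "WebApp"})
print(record)
```

Because the metric travels inside a log event, the same line remains queryable in Logs Insights alongside any extra context fields you add.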

⚠️ Container Insights cost

Container Insights produces custom metrics (billed at $0.30/metric/month) and log data (billed for ingestion and storage). For large clusters (100+ pods), the cost can be significant. Consider using metric filters to collect only the metrics you need.

11 · Application Insights

CloudWatch Application Insights

CloudWatch Application Insights automatically discovers applications running on AWS resources and configures monitoring for them. It uses machine learning to detect anomalies, correlate events, and generate automated insights that shorten troubleshooting time.

Supported applications

☕

Java

Tomcat, JBoss, Spring Boot on EC2. Automatically discovers JVM metrics, GC, heap.

🔷

.NET

IIS, .NET Framework/.NET Core on Windows EC2. CLR metrics, IIS logs.

🗄️

SQL Server

SQL Server on EC2 or RDS. Query performance, deadlocks, wait stats.

📊

SharePoint

SharePoint Server on EC2. Farm health, search, content databases.

🐧

Linux Workloads

MySQL, PostgreSQL, SAP HANA on EC2. Database metrics, OS metrics.

📦

Containerized

ECS, EKS workloads. Container-level metrics and application logs.

How it works

Discover: Application Insights scans the resource group (based on tags or a CloudFormation stack) to find components
Configure: Automatically sets up the CloudWatch Agent, metrics, logs, and alarms suited to each technology stack
Monitor: Continuously analyzes metrics and logs with ML models
Detect: Detects anomalies and correlates events from multiple sources
Insight: Creates a "Problem" with root cause analysis, related observations, and recommended actions

AWS CLI: set up Application Insights

# Create application monitoring
aws application-insights create-application \
  --resource-group-name "MyWebApp-Resources" \
  --ops-center-enabled \
  --auto-config-enabled

# List detected problems
aws application-insights list-problems \
  --resource-group-name "MyWebApp-Resources" \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-31T23:59:59Z

# Show the details of a problem
aws application-insights describe-problem \
  --problem-id "p-abc123def456"

💚 OpsCenter integration

Application Insights can automatically create OpsItems in AWS Systems Manager OpsCenter when it detects a problem. This lets the operations team track and resolve incidents in one central place.

Application Insights vs Container Insights

Criterion	Application Insights	Container Insights
Focus	Application-level (Java, .NET, SQL)	Container infrastructure (ECS, EKS)
Setup	Automatic discovery and configuration	Enable the feature flag + install the agent
ML/AI	Anomaly detection + root cause analysis	Metrics collection (no ML)
Output	Problems + Insights + Recommendations	Metrics + Logs + Dashboards

12 · Errors & Best Practices

Common errors & best practices

Common errors

Error / Issue	Cause	Solution
No memory metrics for EC2	By default CloudWatch collects only hypervisor-level metrics	Install the CloudWatch Agent and configure it to collect `mem_used_percent`
Alarm stuck in INSUFFICIENT_DATA	The metric has no data yet (new instance, wrong metric name, wrong dimensions)	Check the namespace, metric name, and dimensions. Wait at least 1-2 evaluation periods
Logs do not appear in CloudWatch	IAM role lacks permissions, Agent not running, wrong log file path	Check the IAM policy (`logs:PutLogEvents`), the agent status, and the file path in the config
Custom metric does not show up	Wrong namespace, mismatched dimensions, timestamp too old (>2 weeks)	Check the namespace (case-sensitive); dimensions must match exactly
Alarm flapping (toggling on and off)	Threshold too close to normal values, evaluation period too short	Increase `datapoints-to-alarm`, use longer `evaluation-periods`, or use Anomaly Detection
Dashboard shows "No data"	Wrong region, metric has no data yet, period too small	Check the widget's region, make sure the metric is active, widen the time range
Logs Insights query is slow	Scanning too much data, time range too wide	Narrow the time range, apply `filter` early in the query, reduce the number of log groups
EventBridge rule does not trigger	Wrong event pattern, missing target permissions, rule disabled	Test the event pattern with the `TestEventPattern` API, check the target's resource policy
Unexpectedly high CloudWatch cost	Too many custom metrics, logs retained forever, high-resolution metrics	Audit custom metrics, set a retention policy, use standard resolution where possible

Best Practices

📊 Metrics

Use Detailed Monitoring (1 minute) for production EC2 instances instead of Basic (5 minutes)
Give custom namespaces a structure: Company/App/Environment
Choose dimensions carefully: too many dimensions create too many metric streams (and cost money)
Use Metric Math instead of publishing custom metrics for simple calculations
Use high resolution (1s) only when it is truly needed

🔔 Alarms

Always set treat-missing-data appropriately, for example notBreaching for scale-in alarms
Use Composite Alarms to cut alarm noise and alert only when multiple conditions hold together
Set up a billing alarm (CloudWatch Billing Alarm) to keep costs under control
Use Anomaly Detection instead of a static threshold for metrics with a seasonal pattern
Create an alarm on StatusCheckFailed_System with the EC2 Recover action

📝 Logs

Always set a retention policy: the default is Never Expire, so storage costs keep growing
Use structured JSON logging so logs are easy to query with Logs Insights
Create Metric Filters for important error patterns
Export old logs to S3 (Glacier) for low-cost long-term archiving
Use Subscription Filters for real-time processing instead of polling

📈 Dashboards

Build dashboards in layers: Executive (high-level) → Operations (detailed) → Debug (deep-dive)
Use the Alarm Status widget for a quick health overview
Add Text widgets with runbook links and escalation procedures
Use Automatic Dashboards (free) as a starting point

💰 Cost Optimization

⚠️ Keeping CloudWatch costs under control

Audit custom metrics regularly: at $0.30 per metric per month, 1000 metrics = $300/month
Set an appropriate log retention: 1 GB of logs/day is about 365 GB/year; ingestion alone is roughly $180/year at $0.50/GB, and storage charges keep growing for as long as the data is retained
Use Embedded Metric Format (EMF) instead of individual PutMetricData API calls
Turn off Container Insights for dev/test clusters
Use CloudWatch Metric Streams + S3 for long-term metric storage instead of keeping high-resolution data
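The first item in the list is simple arithmetic, but the metric count itself is easy to underestimate because it multiplies across dimensions. The sketch below estimates both. It assumes the flat $0.30 per metric per month quoted in this guide; actual pricing is tiered and varies by region, so treat the result as a rough upper-bound estimate. Function names and sample cardinalities are hypothetical.

```python
def metric_count(dimension_cardinalities):
    """Number of distinct metrics for one metric name when every combination
    of dimension values is published."""
    total = 1
    for cardinality in dimension_cardinalities:
        total *= cardinality
    return total


def monthly_custom_metric_cost(count, price_per_metric=0.30):
    """Estimated monthly cost in USD at a flat per-metric price (assumption)."""
    return round(count * price_per_metric, 2)


# Hypothetical: one latency metric split by 50 instances x 4 endpoints x 3 status classes.
count = metric_count([50, 4, 3])
print(count, monthly_custom_metric_cost(count))
```

Dropping a single high-cardinality dimension (here, the per-instance split) often reduces the bill more than any other change.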

A comprehensive monitoring strategy

13 · CLI Cheat Sheet

AWS CLI Cheat Sheet

Tổng hợp các lệnh AWS CLI thường dùng nhất cho CloudWatch. Tất cả lệnh đều dùng AWS CLI v2.

📊 Metrics

CloudWatch Metrics CLI

# Liệt kê tất cả metrics trong namespace
aws cloudwatch list-metrics --namespace "AWS/EC2"

# Liệt kê metrics theo dimension
aws cloudwatch list-metrics \
  --namespace "AWS/EC2" \
  --dimensions Name=InstanceId,Value=i-0abc123

# Lấy metric statistics
aws cloudwatch get-metric-statistics \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T23:59:59Z \
  --period 3600 \
  --statistics Average Maximum

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp" \
  --metric-name "ActiveSessions" \
  --value 42 \
  --unit Count \
  --dimensions App=WebServer,Env=Prod

# Get metric data (newer API, supports Metric Math)
aws cloudwatch get-metric-data \
  --metric-data-queries '[
    {
      "Id": "cpu",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/EC2",
          "MetricName": "CPUUtilization",
          "Dimensions": [{"Name":"InstanceId","Value":"i-0abc123"}]
        },
        "Period": 300,
        "Stat": "Average"
      }
    }
  ]' \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T12:00:00Z

🔔 Alarms

CloudWatch Alarms CLI

# List all alarms
aws cloudwatch describe-alarms

# List alarms currently in the ALARM state
aws cloudwatch describe-alarms --state-value ALARM

# Create an alarm (ALARM when 2 of the last 3 five-minute periods exceed 85%)
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU" \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:Alerts \
  --ok-actions arn:aws:sns:ap-southeast-1:123456789012:Alerts \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --treat-missing-data notBreaching

# Test an alarm (set its state manually)
aws cloudwatch set-alarm-state \
  --alarm-name "HighCPU" \
  --state-value ALARM \
  --state-reason "Testing alarm notification"

# Delete an alarm
aws cloudwatch delete-alarms --alarm-names "HighCPU"

# Disable alarm actions (without deleting the alarm)
aws cloudwatch disable-alarm-actions --alarm-names "HighCPU"

# Re-enable alarm actions
aws cloudwatch enable-alarm-actions --alarm-names "HighCPU"
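
The `--evaluation-periods 3 --datapoints-to-alarm 2` pair in the put-metric-alarm example means the alarm fires when at least 2 of the 3 most recent periods breach the threshold. The sketch below simulates that "M out of N" rule locally; `would_alarm` is a hypothetical helper, and it deliberately ignores missing-data handling, so it is a simplification rather than CloudWatch's actual evaluation logic.

```shell
#!/usr/bin/env bash
# would_alarm THRESHOLD DATAPOINTS_TO_ALARM EVALUATION_PERIODS VALUE...
# Prints ALARM when at least DATAPOINTS_TO_ALARM of the last
# EVALUATION_PERIODS values are greater than THRESHOLD, otherwise OK.
would_alarm() {
  local threshold="$1" m="$2" n="$3"
  shift 3
  printf '%s\n' "$@" | tail -n "$n" | awk -v t="$threshold" -v m="$m" '
    $1 > t { breaching++ }
    END { if (breaching >= m) print "ALARM"; else print "OK" }'
}

would_alarm 85 2 3 40 90 70 88   # last three are 90 70 88: prints ALARM
would_alarm 85 2 3 90 91 60 70   # last three are 91 60 70: prints OK
```

Requiring 2 of 3 rather than 3 of 3 lets the alarm tolerate one quiet period in the middle of a sustained spike.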

📝 Logs

CloudWatch Logs CLI

# List log groups
aws logs describe-log-groups

# List log streams in a group
aws logs describe-log-streams \
  --log-group-name "/aws/lambda/my-function" \
  --order-by LastEventTime \
  --descending

# Get log events (single quotes stop the shell from expanding $LATEST)
aws logs get-log-events \
  --log-group-name "/aws/lambda/my-function" \
  --log-stream-name '2024/01/15/[$LATEST]abc123'

# Search logs with a filter pattern (times are epoch milliseconds)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/my-function" \
  --filter-pattern "ERROR" \
  --start-time 1705276800000 \
  --end-time 1705363200000

# Create a log group with a retention policy
aws logs create-log-group --log-group-name "/myapp/production"
aws logs put-retention-policy \
  --log-group-name "/myapp/production" \
  --retention-in-days 30

# Delete a log group
aws logs delete-log-group --log-group-name "/myapp/old-logs"

# Create a metric filter
aws logs put-metric-filter \
  --log-group-name "/myapp/production" \
  --filter-name "ErrorCount" \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=AppErrors,metricNamespace=MyApp,metricValue=1

# Export logs to S3
aws logs create-export-task \
  --log-group-name "/myapp/production" \
  --from 1705276800000 \
  --to 1705363200000 \
  --destination "my-log-archive-bucket" \
  --destination-prefix "exports/myapp"

# Tail logs in real time (requires AWS CLI v2)
aws logs tail "/aws/lambda/my-function" --follow --since 1h
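
The `--start-time`, `--end-time`, `--from`, and `--to` values in the filter-log-events and create-export-task examples above are epoch milliseconds, which are awkward to write by hand. A small conversion helper can be sketched as follows; `to_epoch_ms` is a hypothetical name, and the sketch assumes GNU `date` (Linux), since BSD/macOS `date` takes different flags.

```shell
#!/usr/bin/env bash
# Convert an ISO 8601 UTC timestamp to epoch milliseconds.
# Assumes GNU date; to_epoch_ms is an illustrative helper name.
to_epoch_ms() {
  echo $(( $(date -u -d "$1" +%s) * 1000 ))
}

to_epoch_ms 2024-01-15T00:00:00Z   # prints 1705276800000

# Possible usage with the command shown earlier:
# aws logs filter-log-events \
#   --log-group-name "/aws/lambda/my-function" \
#   --start-time "$(to_epoch_ms 2024-01-15T00:00:00Z)" \
#   --end-time "$(to_epoch_ms 2024-01-16T00:00:00Z)"
```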

📈 Dashboards

CloudWatch Dashboards CLI

# List dashboards
aws cloudwatch list-dashboards

# Get a dashboard definition
aws cloudwatch get-dashboard --dashboard-name "Production"

# Create or update a dashboard
aws cloudwatch put-dashboard \
  --dashboard-name "Production" \
  --dashboard-body file://dashboard.json

# Delete a dashboard
aws cloudwatch delete-dashboards --dashboard-names "OldDashboard"
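
The put-dashboard command above reads its body from `dashboard.json`, which this cheat sheet does not show. A minimal body with a single metric widget might look like the sketch below; the widget title, size, and region are assumptions, and the instance ID is the placeholder used elsewhere in this document.

```shell
#!/usr/bin/env bash
# Write a minimal dashboard body with one metric widget to dashboard.json.
# Title, layout, and region are illustrative assumptions.
cat > dashboard.json <<'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "EC2 CPU (example)",
        "region": "ap-southeast-1",
        "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0abc123"]],
        "stat": "Average",
        "period": 300
      }
    }
  ]
}
EOF
```

The resulting file can then be passed as `--dashboard-body file://dashboard.json`.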

⚡ EventBridge

EventBridge CLI

# List rules
aws events list-rules

# Create a rule with an event pattern (the bucket must have EventBridge notifications enabled)
aws events put-rule \
  --name "S3UploadAlert" \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
      "bucket": {"name": ["my-important-bucket"]}
    }
  }'

# Create a scheduled rule (cron expressions are evaluated in UTC)
aws events put-rule \
  --name "DailyCleanup" \
  --schedule-expression "cron(0 2 * * ? *)" \
  --description "Run cleanup at 02:00 UTC every day"

# Add a target to a rule
# (a Lambda target also needs a resource-based permission, see aws lambda add-permission)
aws events put-targets \
  --rule "S3UploadAlert" \
  --targets '[{
    "Id": "1",
    "Arn": "arn:aws:lambda:ap-southeast-1:123456789012:function:ProcessUpload"
  }]'

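# Test an event pattern
# (the event must include id, account, source, time, region, resources, and detail-type)
aws events test-event-pattern \
  --event-pattern '{"source":["aws.ec2"]}' \
  --event '{"id":"1","account":"123456789012","source":"aws.ec2","time":"2024-01-15T00:00:00Z","region":"ap-southeast-1","resources":[],"detail-type":"EC2 Instance State-change Notification","detail":{}}'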

# Send a custom event
aws events put-events \
  --entries '[{
    "Source": "myapp.orders",
    "DetailType": "Order Created",
    "Detail": "{\"orderId\":\"12345\",\"amount\":99.99}",
    "EventBusName": "default"
  }]'

# Delete a rule (remove its targets first)
aws events remove-targets --rule "OldRule" --ids "1"
aws events delete-rule --name "OldRule"

🕵️ Synthetics

CloudWatch Synthetics CLI

# List canaries
aws synthetics describe-canaries

# Get the most recent run results
aws synthetics get-canary-runs --name "my-canary" --max-results 5

# Start / stop a canary
aws synthetics start-canary --name "my-canary"
aws synthetics stop-canary --name "my-canary"

# Delete a canary
aws synthetics delete-canary --name "old-canary"

🔍 Logs Insights

Logs Insights CLI

# Run a query (times are epoch seconds)
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time 1705276800 \
  --end-time 1705363200 \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'

# Get query results (use the queryId returned by start-query)
aws logs get-query-results --query-id "abc123-def456-ghi789"

# List saved queries
aws logs describe-query-definitions
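
Unlike filter-log-events, start-query takes its time range in epoch seconds, and in practice the range is usually relative ("the last hour") rather than fixed. One way to compute such a window is sketched below; `query_window` is a hypothetical helper, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# query_window SECONDS_BACK
# Prints "START END" in epoch seconds, ending now.
# query_window is an illustrative helper name.
query_window() {
  local end
  end=$(date +%s)
  echo "$(( end - $1 )) $end"
}

read -r start end <<< "$(query_window 3600)"
echo "start=$start end=$end"

# Possible usage with the command shown earlier:
# aws logs start-query \
#   --log-group-name "/aws/lambda/my-function" \
#   --start-time "$start" --end-time "$end" \
#   --query-string 'fields @timestamp, @message | limit 20'
```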

💚 Handy CLI tips

Use --output table to display results as an easy-to-read table
Use --query (JMESPath) to filter output: --query 'MetricAlarms[?StateValue==`ALARM`].AlarmName'
Use aws logs tail --follow to watch logs in real time (like tail -f)
Set the AWS_DEFAULT_REGION environment variable so you do not have to add --region to every command