What is Amazon CloudWatch?
Amazon CloudWatch is AWS's native observability service. It collects metrics, logs, and events from virtually every AWS service — automatically, with no agent required for managed services. Before you can act on a problem, you need to know it exists. CloudWatch is that foundation: the place where EC2 CPU utilization, Lambda invocation errors, RDS query latency, and hundreds of other signals land by default.
This article walks through CloudWatch's core building blocks and how they fit together.
Core concepts
Namespaces
Metrics in CloudWatch are organized into namespaces — logical containers that prevent name collisions across services. AWS services publish into their own reserved namespaces: AWS/EC2, AWS/RDS, AWS/Lambda, AWS/ELB, and so on. When you publish custom metrics from your application code, you define your own namespace (e.g., MyApp/Orders).
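Publishing into a custom namespace is a single API call; CloudWatch creates the namespace and metric on first write, with no separate "create metric" step. A minimal sketch using the AWS CLI (the namespace, metric name, and dimension here are illustrative):

```shell
# Publish one data point to a custom namespace (names are placeholders)
aws cloudwatch put-metric-data \
  --namespace MyApp/Orders \
  --metric-name OrdersPlaced \
  --dimensions Environment=production \
  --value 1
```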
Metrics
A metric is a time-series of data points. Each metric is identified by three things:
- Namespace — which service or application it comes from
- Metric name — what is being measured (e.g., CPUUtilization, Errors, Duration)
- Dimensions — key-value pairs that scope the metric to a specific resource
For example, the CPUUtilization metric in the AWS/EC2 namespace with dimension InstanceId=i-0abc1234 gives you CPU usage for a specific instance. Without the dimension, it would be an aggregate across all instances.
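To see how namespace, metric name, and dimension combine in a query, here is a sketch retrieving that metric via the CLI (the time range is a placeholder):

```shell
# Average CPU for one instance, in 5-minute buckets (times are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc1234 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 300 \
  --statistics Average
```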
Dimensions
Dimensions are filters. EC2 uses InstanceId, AutoScalingGroupName, and ImageId. RDS uses DBInstanceIdentifier. Lambda uses FunctionName and Resource. You can query a metric with any combination of its supported dimensions, which is how you drill from an aggregate view down to a single resource.
Resolution
Standard metrics have a minimum granularity of one minute. High-resolution metrics, published via the PutMetricData API with a StorageResolution of 1, can go down to one second. High-resolution metrics cost more and are rarely needed outside of use cases like real-time gaming servers or financial systems. Most workloads run fine on the one-minute standard.
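For the rare case where one-second granularity matters, the same put-metric-data call takes a storage-resolution flag. A sketch (namespace, metric name, and value are placeholders):

```shell
# High-resolution custom metric: data points stored at 1-second granularity
aws cloudwatch put-metric-data \
  --namespace MyApp/Trading \
  --metric-name OrderLatencyMs \
  --value 42 \
  --storage-resolution 1
```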
CloudWatch Logs
CloudWatch Logs stores log output from Lambda, ECS, EC2 (via the CloudWatch agent), API Gateway, VPC Flow Logs, and more.
Log groups and log streams
Logs are organized into log groups (typically one per application or service) and log streams within each group (typically one per instance or container). When Lambda writes logs, it creates one log group named /aws/lambda/<function-name> and one log stream per execution environment — many invocations handled by the same container share a stream.
Retention policies
By default, log groups never expire. This is a cost trap. A Lambda function that runs thousands of times a day will accumulate gigabytes of logs at $0.03/GB/month indefinitely. Set a retention policy on every log group.
```shell
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
```
30 days is a reasonable default for most workloads. Compliance requirements may force longer retention — but route those logs to S3 via a subscription filter rather than keeping them in CloudWatch.
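To find offenders, you can audit for log groups that still have no retention policy. A sketch using a server-side JMESPath filter (a missing retentionInDays field means logs are kept forever):

```shell
# List log groups with no retention policy set
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' \
  --output text
```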
Metric filters
Metric filters parse log events and increment a CloudWatch metric when a pattern matches. This lets you create alarms based on log content — for example, count the number of times ERROR appears in your application logs and alert when it exceeds a threshold.
```shell
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=MyApp,metricValue=1
```
CloudWatch Logs Insights
Logs Insights is a query engine for log data stored in CloudWatch. It runs directly in the console or via API without any additional infrastructure. The query language is straightforward and optimized for log analysis.
Count errors in the last hour:
```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
| sort @timestamp desc
```
Find Lambda executions slower than 3 seconds:
```
fields @timestamp, @requestId, @duration
| filter @type = "REPORT" and @duration > 3000
| sort @duration desc
| limit 20
```
Track API Gateway 5xx errors by endpoint:
```
fields @timestamp, status, resourcePath
| filter status >= 500
| stats count() as errorCount by resourcePath
| sort errorCount desc
```
Logs Insights charges per GB of data scanned. Always add a time filter to limit the range — querying a 15-minute window instead of a full day scans up to 96x less data (and costs correspondingly less), assuming log volume is spread evenly across the day.
Subscription filters
Subscription filters stream logs in near real-time to a destination: a Lambda function, a Kinesis Data Stream, or an OpenSearch Service cluster. This is the standard pattern for shipping logs to a centralized logging system. You can have up to two subscription filters per log group.
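A sketch wiring a log group to a Kinesis stream (the stream and IAM role ARNs are placeholders; the role must grant CloudWatch Logs permission to write to the stream):

```shell
# Stream all events (empty filter pattern) from one log group to Kinesis
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ship-to-central-logging \
  --filter-pattern "" \
  --destination-arn arn:aws:kinesis:us-east-1:123456789:stream/central-logs \
  --role-arn arn:aws:iam::123456789:role/CWLtoKinesisRole
```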
CloudWatch Alarms
Alarms watch a metric over time and trigger actions when a threshold is crossed.
Alarm types
Threshold alarms compare a metric to a static value. CPU > 80% for two consecutive five-minute periods: alarm.
Anomaly detection alarms use a machine learning model trained on historical data to establish a band of expected values. Useful for metrics with predictable patterns (e.g., request count drops to near zero on weekends) where a static threshold would produce too many false positives.
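Anomaly detection alarms use the same put-metric-alarm call, but the threshold is an ANOMALY_DETECTION_BAND expression rather than a static number. A sketch, assuming an Application Load Balancer metric (the load balancer name and the band width of 2 standard deviations are illustrative):

```shell
# Alarm when request count falls outside the ML-predicted band
aws cloudwatch put-metric-alarm \
  --alarm-name "request-count-anomaly" \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
  --evaluation-periods 2 \
  --threshold-metric-id ad1 \
  --metrics '[
    {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
      "MetricName": "RequestCount",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]},
      "Period": 300, "Stat": "Sum"}},
    {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```

The second argument to ANOMALY_DETECTION_BAND controls the band width in standard deviations: wider bands mean fewer, higher-confidence alarms.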
Composite alarms combine multiple alarms with AND/OR logic. Use them to reduce noise — for example, alert only when both CPU is high AND network throughput is high, which is more indicative of a real problem than either signal alone.
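Composite alarms reference existing alarms by name in a rule expression. A sketch, assuming the two child alarms (the CPU and network alarms named here are placeholders) already exist:

```shell
# Fire only when BOTH child alarms are in ALARM state
aws cloudwatch put-composite-alarm \
  --alarm-name "i-1234-cpu-and-network" \
  --alarm-rule "ALARM(high-cpu-i-1234) AND ALARM(high-network-i-1234)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```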
Alarm states
Every alarm is in one of three states:
- OK — the metric is within the defined threshold
- ALARM — the metric has breached the threshold for the required evaluation periods
- INSUFFICIENT_DATA — not enough data points yet (common for new alarms or infrequently invoked resources)
Alarm actions
When an alarm transitions to ALARM, it can trigger:
- An SNS notification (→ email, SMS, Lambda, PagerDuty, Slack via webhook)
- An EC2 action: reboot, stop, terminate, or recover an instance
- An Auto Scaling policy: scale out or in
- A Systems Manager OpsItem
Example: create a CPU alarm for an EC2 instance via the CLI:
```shell
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-i-1234" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-1234 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```
This alarm fires if the average CPU across two consecutive five-minute periods is >= 80%, and sends a notification to the ops-alerts SNS topic.
CloudWatch Dashboards
Dashboards are shareable, customizable views of metrics and alarms. They support multiple widget types: line graphs, stacked area charts, single-value numbers, alarm status widgets, and free-text markdown.
Dashboards can span multiple AWS accounts and regions in a single view — useful for platform teams monitoring a fleet of services across accounts. You can also share a dashboard publicly (read-only) without requiring IAM credentials, which is practical for stakeholder visibility.
The first three dashboards per account are free. Additional dashboards cost $3/month each.
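Dashboards are defined as JSON, which makes them easy to version-control and deploy via the CLI. A minimal single-widget sketch (the instance ID and region are placeholders):

```shell
# Create or overwrite a dashboard with one EC2 CPU line graph
aws cloudwatch put-dashboard \
  --dashboard-name ops-overview \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-1234"]],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU"
      }
    }]
  }'
```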
CloudWatch Events and EventBridge
CloudWatch Events has been rebranded as Amazon EventBridge, though the underlying infrastructure is shared. EventBridge matches events against rules and routes them to targets.
Common event sources:
- EC2 state changes — instance stopped, terminated, or health check failed
- ECS task failures — task stopped unexpectedly
- CodePipeline state changes — pipeline failed, deployment succeeded
- Scheduled rules — cron expressions for periodic automation
Example uses: trigger a Lambda to clean up snapshots nightly, restart a service when ECS reports a task failure, notify a Slack channel when a CodePipeline deployment succeeds. EventBridge is the glue layer that lets AWS services react to each other without point-to-point integration.
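The nightly snapshot-cleanup example can be sketched as a scheduled rule plus a Lambda target (the function name and account ID are placeholders; the function also needs a resource-based policy allowing events.amazonaws.com to invoke it):

```shell
# Run every day at 03:00 UTC
aws events put-rule \
  --name nightly-snapshot-cleanup \
  --schedule-expression "cron(0 3 * * ? *)"

# Point the rule at the cleanup Lambda
aws events put-targets \
  --rule nightly-snapshot-cleanup \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789:function:cleanup-snapshots"
```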
Pricing
CloudWatch pricing has several independent components:
| Component | Free tier | Price |
|---|---|---|
| Metrics | 10 custom metrics/month | $0.30/metric/month |
| Logs ingestion | 5 GB/month | $0.50/GB |
| Logs storage | — | $0.03/GB/month |
| Logs Insights queries | 5 GB/month | $0.005/GB scanned |
| Standard alarms | 10/month | $0.10/alarm/month |
| High-resolution alarms | — | $0.30/alarm/month |
| Dashboards | 3/month | $3/dashboard/month |
The most common cost driver in practice is log ingestion, not metrics. Lambda functions invoked thousands of times per hour and API Gateway access logs can ingest significant data volumes. Mitigate this by:
- Setting retention policies on all log groups (prevents unbounded storage costs)
- Filtering out noisy INFO-level logs at the application layer (metric filters don't reduce ingestion — the data still lands in CloudWatch)
- Using Logs Insights with narrow time windows to limit query cost
- Considering AWS FireLens (Fluent Bit) for high-volume workloads where you need to route logs selectively
CloudWatch vs third-party tools
CloudWatch is deeply integrated with AWS and requires no additional infrastructure. For teams starting out or running AWS-only workloads, it covers the basics well.
Its limitations show up at scale:
- No distributed tracing — use AWS X-Ray for request traces across microservices
- No application performance monitoring — CloudWatch has no concept of service maps, error rates by deployment version, or latency percentiles across services. Tools like Datadog, New Relic, or Honeycomb fill this gap.
- Dashboards are functional but basic — Grafana with the CloudWatch datasource plugin gives significantly more flexibility (calculated metrics, cross-service correlation, variable interpolation) with minimal additional cost.
A common production pattern: CloudWatch as the collection layer (metrics, logs, alarms), Grafana for dashboards, and X-Ray for traces. This keeps data within AWS while using better visualization tooling.
Tying it together with infrastructure visibility
CloudWatch tells you how your services are performing. When an alarm fires on an EC2 instance or a Lambda function, the next question is almost always structural: what is connected to this thing, and could the network topology be a factor? VizCon answers that question directly — when a CloudWatch alarm fires on an EC2 instance, you can open VizCon to immediately see which VPCs, subnets, and security groups that instance sits in, making it faster to determine whether a network path or an overly permissive security group rule is part of the problem.