What is Amazon CloudWatch?
Amazon CloudWatch is AWS's native observability service. It collects metrics, logs, and events from virtually every AWS service — automatically, with no agent required for managed services. Before you can act on a problem, you need to know it exists. CloudWatch is that foundation: the place where EC2 CPU utilization, Lambda invocation errors, RDS query latency, and hundreds of other signals land by default.
This article walks through CloudWatch's core building blocks and how they fit together.
Core concepts
Namespaces
Metrics in CloudWatch are organized into namespaces — logical containers that prevent name collisions across services. AWS services publish into their own reserved namespaces: AWS/EC2, AWS/RDS, AWS/Lambda, AWS/ELB, and so on. When you publish custom metrics from your application code, you define your own namespace (e.g., MyApp/Orders).
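Publishing into a custom namespace is a single API call; CloudWatch creates the namespace and metric on first write, with no separate "create metric" step. A minimal sketch using the AWS CLI (the namespace, metric name, and dimension here are illustrative):

```shell
# Publish one data point to a custom namespace (names are placeholders)
aws cloudwatch put-metric-data \
  --namespace MyApp/Orders \
  --metric-name OrdersPlaced \
  --dimensions Environment=production \
  --value 1
```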
Metrics
A metric is a time-series of data points. Each metric is identified by three things:
- Namespace — which service or application it comes from
- Metric name — what is being measured (e.g., CPUUtilization, Errors, Duration)
- Dimensions — key-value pairs that scope the metric to a specific resource
For example, the CPUUtilization metric in the AWS/EC2 namespace with dimension InstanceId=i-0abc1234 gives you CPU usage for a specific instance. Without the dimension, it would be an aggregate across all instances.
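To see how namespace, metric name, and dimension combine in a query, here is a sketch retrieving that metric via the CLI (the time range is a placeholder):

```shell
# Average CPU for one instance, in 5-minute buckets (times are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc1234 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 300 \
  --statistics Average
```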
Dimensions
Dimensions are filters. EC2 uses InstanceId, AutoScalingGroupName, and ImageId. RDS uses DBInstanceIdentifier. Lambda uses FunctionName and Resource. You can query a metric with any combination of its supported dimensions, which is how you drill from an aggregate view down to a single resource.
Resolution
Standard metrics have a minimum granularity of one minute. High-resolution metrics, published via the PutMetricData API with a StorageResolution of 1, can go down to one second. High-resolution metrics cost more and are rarely needed outside of use cases like real-time gaming servers or financial systems. Most workloads run fine on the one-minute standard.
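For the rare case where one-second granularity matters, the same put-metric-data call takes a storage-resolution flag. A sketch (namespace, metric name, and value are placeholders):

```shell
# High-resolution custom metric: data points stored at 1-second granularity
aws cloudwatch put-metric-data \
  --namespace MyApp/Trading \
  --metric-name OrderLatencyMs \
  --value 42 \
  --storage-resolution 1
```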
CloudWatch Logs
CloudWatch Logs stores log output from Lambda, ECS, EC2 (via the CloudWatch agent), API Gateway, VPC Flow Logs, and more.
Log groups and log streams
Logs are organized into log groups (typically one per application or service) and log streams within each group (typically one per instance or container). When Lambda writes logs, it creates one log group named /aws/lambda/<function-name> and one log stream per execution environment — many invocations handled by the same container share a stream.
Retention policies
By default, log groups never expire. This is a cost trap. A Lambda function that runs thousands of times a day will accumulate gigabytes of logs at $0.03/GB/month indefinitely. Set a retention policy on every log group.
```shell
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
```
30 days is a reasonable default for most workloads. Compliance requirements may force longer retention — but route those logs to S3 via a subscription filter rather than keeping them in CloudWatch.
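To find offenders, you can audit for log groups that still have no retention policy. A sketch using a server-side JMESPath filter (a missing retentionInDays field means logs are kept forever):

```shell
# List log groups with no retention policy set
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' \
  --output text
```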
Metric filters
Metric filters parse log events and increment a CloudWatch metric when a pattern matches. This lets you create alarms based on log content — for example, count the number of times ERROR appears in your application logs and alert when it exceeds a threshold.
```shell
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=MyApp,metricValue=1
```
CloudWatch Logs Insights
Logs Insights is a query engine for log data stored in CloudWatch. It runs directly in the console or via API without any additional infrastructure. The query language is straightforward and optimized for log analysis.
Count errors in the last hour:
```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
| sort @timestamp desc
```
Find Lambda executions slower than 3 seconds:
```
fields @timestamp, @requestId, @duration
| filter @type = "REPORT" and @duration > 3000
| sort @duration desc
| limit 20
```
Track API Gateway 5xx errors by endpoint:
```
fields @timestamp, status, resourcePath
| filter status >= 500
| stats count() as errorCount by resourcePath
| sort errorCount desc
```
Logs Insights charges per GB of data scanned. Always add a time filter to limit the range — querying a 15-minute window instead of a full day scans up to 96x less data (and costs correspondingly less), assuming log volume is spread evenly across the day.
Subscription filters
Subscription filters stream logs in near real-time to a destination: a Lambda function, a Kinesis Data Stream, or an OpenSearch Service cluster. This is the standard pattern for shipping logs to a centralized logging system. You can have up to two subscription filters per log group.
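A sketch wiring a log group to a Kinesis stream (the stream and IAM role ARNs are placeholders; the role must grant CloudWatch Logs permission to write to the stream):

```shell
# Stream all events (empty filter pattern) from one log group to Kinesis
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ship-to-central-logging \
  --filter-pattern "" \
  --destination-arn arn:aws:kinesis:us-east-1:123456789:stream/central-logs \
  --role-arn arn:aws:iam::123456789:role/CWLtoKinesisRole
```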
CloudWatch Alarms
Alarms watch a metric over time and trigger actions when a threshold is crossed.
Alarm types
Threshold alarms compare a metric to a static value. CPU > 80% for two consecutive five-minute periods: alarm.
Anomaly detection alarms use a machine learning model trained on historical data to establish a band of expected values. Useful for metrics with predictable patterns (e.g., request count drops to near zero on weekends) where a static threshold would produce too many false positives.
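Anomaly detection alarms use the same put-metric-alarm call, but the threshold is an ANOMALY_DETECTION_BAND expression rather than a static number. A sketch, assuming an Application Load Balancer metric (the load balancer name and the band width of 2 standard deviations are illustrative):

```shell
# Alarm when request count falls outside the ML-predicted band
aws cloudwatch put-metric-alarm \
  --alarm-name "request-count-anomaly" \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
  --evaluation-periods 2 \
  --threshold-metric-id ad1 \
  --metrics '[
    {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
      "MetricName": "RequestCount",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]},
      "Period": 300, "Stat": "Sum"}},
    {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```

The second argument to ANOMALY_DETECTION_BAND controls the band width in standard deviations: wider bands mean fewer, higher-confidence alarms.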
Composite alarms combine multiple alarms with AND/OR logic. Use them to reduce noise — for example, alert only when both CPU is high AND network throughput is high, which is more indicative of a real problem than either signal alone.
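Composite alarms reference existing alarms by name in a rule expression. A sketch, assuming the two child alarms (the CPU and network alarms named here are placeholders) already exist:

```shell
# Fire only when BOTH child alarms are in ALARM state
aws cloudwatch put-composite-alarm \
  --alarm-name "i-1234-cpu-and-network" \
  --alarm-rule "ALARM(high-cpu-i-1234) AND ALARM(high-network-i-1234)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```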
Alarm states
Every alarm is in one of three states:
- OK — the metric is within the defined threshold
- ALARM — the metric has breached the threshold for the required evaluation periods
- INSUFFICIENT_DATA — not enough data points yet (common for new alarms or infrequently invoked resources)
Alarm actions
When an alarm transitions to ALARM, it can trigger:
- An SNS notification (→ email, SMS, Lambda, PagerDuty, Slack via webhook)
- An EC2 action: reboot, stop, terminate, or recover an instance
- An Auto Scaling policy: scale out or in
- A Systems Manager OpsItem
Example: create a CPU alarm for an EC2 instance via the CLI:
```shell
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-i-1234" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-1234 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
```
This alarm fires if the average CPU across two consecutive five-minute periods is >= 80%, and sends a notification to the ops-alerts SNS topic.
CloudWatch Dashboards
Dashboards are shareable, customizable views of metrics and alarms. They support multiple widget types: line graphs, stacked area charts, single-value numbers, alarm status widgets, and free-text markdown.
Dashboards can span multiple AWS accounts and regions in a single view — useful for platform teams monitoring a fleet of services across accounts. You can also share a dashboard publicly (read-only) without requiring IAM credentials, which is practical for stakeholder visibility.
The first three dashboards per account are free. Additional dashboards cost $3/month each.
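Dashboards are defined as JSON, which makes them easy to version-control and deploy via the CLI. A minimal single-widget sketch (the instance ID and region are placeholders):

```shell
# Create or overwrite a dashboard with one EC2 CPU line graph
aws cloudwatch put-dashboard \
  --dashboard-name ops-overview \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-1234"]],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU"
      }
    }]
  }'
```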
CloudWatch Events and EventBridge
CloudWatch Events has been rebranded as Amazon EventBridge, though the underlying infrastructure is shared. EventBridge matches events against rules and routes them to targets.
Common event sources:
- EC2 state changes — instance stopped, terminated, or health check failed
- ECS task failures — task stopped unexpectedly
- CodePipeline state changes — pipeline failed, deployment succeeded
- Scheduled rules — cron expressions for periodic automation
Example uses: trigger a Lambda to clean up snapshots nightly, restart a service when ECS reports a task failure, notify a Slack channel when a CodePipeline deployment succeeds. EventBridge is the glue layer that lets AWS services react to each other without point-to-point integration.
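The nightly snapshot-cleanup example can be sketched as a scheduled rule plus a Lambda target (the function name and account ID are placeholders; the function also needs a resource-based policy allowing events.amazonaws.com to invoke it):

```shell
# Run every day at 03:00 UTC
aws events put-rule \
  --name nightly-snapshot-cleanup \
  --schedule-expression "cron(0 3 * * ? *)"

# Point the rule at the cleanup Lambda
aws events put-targets \
  --rule nightly-snapshot-cleanup \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789:function:cleanup-snapshots"
```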
Pricing
CloudWatch pricing has several independent components:
| Component | Free tier | Price |
|---|---|---|
| Metrics | 10 custom metrics/month | $0.30/metric/month |
| Logs ingestion | 5 GB/month | $0.50/GB |
| Logs storage | — | $0.03/GB/month |
| Logs Insights queries | 5 GB/month | $0.005/GB scanned |
| Standard alarms | 10/month | $0.10/alarm/month |
| High-resolution alarms | — | $0.30/alarm/month |
| Dashboards | 3/month | $3/dashboard/month |
The most common cost driver in practice is log ingestion, not metrics. Lambda functions invoked thousands of times per hour and API Gateway access logs can ingest significant data volumes. Mitigate this by:
- Setting retention policies on all log groups (prevents unbounded storage costs)
- Filtering out noisy INFO-level logs at the application layer (metric filters don't reduce ingestion — the data still lands in CloudWatch)
- Using Logs Insights with narrow time windows to limit query cost
- Considering AWS FireLens (Fluent Bit) for high-volume workloads where you need to route logs selectively
CloudWatch vs third-party tools
CloudWatch is deeply integrated with AWS and requires no additional infrastructure. For teams starting out or running AWS-only workloads, it covers the basics well.
Its limitations show up at scale:
- No distributed tracing — use AWS X-Ray for request traces across microservices
- No application performance monitoring — CloudWatch has no concept of service maps, error rates by deployment version, or latency percentiles across services. Tools like Datadog, New Relic, or Honeycomb fill this gap.
- Dashboards are functional but basic — Grafana with the CloudWatch datasource plugin gives significantly more flexibility (calculated metrics, cross-service correlation, variable interpolation) with minimal additional cost.
A common production pattern: CloudWatch as the collection layer (metrics, logs, alarms), Grafana for dashboards, and X-Ray for traces. This keeps data within AWS while using better visualization tooling.
Tying it together with infrastructure visibility
CloudWatch tells you how your services are performing. When an alarm fires on an EC2 instance or a Lambda function, the next question is almost always structural: what is connected to this thing, and could the network topology be a factor? VizCon answers that question directly — when a CloudWatch alarm fires on an EC2 instance, you can open VizCon to immediately see which VPCs, subnets, and security groups that instance sits in, making it faster to determine whether a network path or an overly permissive security group rule is part of the problem.