Read metrics from AWS CloudWatch

Configure the agent to pull a single CloudWatch metric per cron tick using GetMetricData, with optional cross-account role assumption.

The agent runs a CloudWatch GetMetricData query on the configured cron interval and reports the most recent data point. One metric definition maps to one (region, namespace, metric_name, dimensions, statistic, period) tuple; create separate definitions for separate metrics or regions.
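As a rough illustration of how one metric definition maps onto a GetMetricData request, the sketch below builds the request parameters in plain Python. The function name and the query Id "m0" are hypothetical, and the agent's real implementation is not shown in this page; the field names (MetricDataQueries, MetricStat, ScanBy, and so on) follow the public GetMetricData API, and the 5-period lookback is the figure this page quotes.

```python
from datetime import datetime, timedelta, timezone


def build_get_metric_data_request(namespace, metric_name, dimensions,
                                  statistic, period, now=None,
                                  lookback_periods=5):
    """Build GetMetricData parameters for one metric definition.

    Hypothetical helper: the region is not part of the request body, it
    selects the client, e.g. boto3.client("cloudwatch", region_name=region).
    """
    now = now or datetime.now(timezone.utc)
    query = {
        "Id": "m0",  # arbitrary id; must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in sorted(dimensions.items())
                ],
            },
            "Period": period,   # seconds
            "Stat": statistic,  # e.g. "Average", "Sum", "p95"
        },
        "ReturnData": True,
    }
    return {
        "MetricDataQueries": [query],
        "StartTime": now - timedelta(seconds=period * lookback_periods),
        "EndTime": now,
        "ScanBy": "TimestampDescending",  # newest data point first
    }


request = build_get_metric_data_request(
    "AWS/RDS", "CPUUtilization", {"DBInstanceIdentifier": "prod-db"},
    statistic="Average", period=60,
)
```

With boto3 installed, the resulting dictionary could be passed as `client.get_metric_data(**request)`; a second metric or region would need its own request and, for a region, its own client.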

Pick this source when your workload already publishes to CloudWatch (AWS-managed services, custom EMF metrics, vendor agents pushing to CloudWatch) and you don't want to stand up a CloudWatch exporter. For everything else, the Prometheus source and OTLP receiver are cheaper.

AWS credentials come from the agent's environment, not from this configuration. Set them at the agent process level (env vars on a container, EC2 instance role, EKS IRSA, or ECS task role); the form only carries the target of the read.

When NOT to use this

  • The metric is already in Prometheus via a CloudWatch exporter (yace, cloudwatch_exporter). Use the prometheus source. You get one query against your Prometheus instead of one CloudWatch API call per tick, and the CloudWatch billing surface stays at your exporter.
  • You need sub-minute granularity. This source reads at periods of 60s, 300s, 900s, or 3600s. If you need 10-second resolution, send the metric over OTLP instead.
  • You need to alert on the absence of a metric. CloudWatch can take 1-3 periods to publish; the agent looks back 5 periods to absorb that lag. If a metric stops emitting, cloudwatch_no_data surfaces after ~5 periods, not immediately.
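The last point can be made concrete with a small sketch of how "most recent data point" and "no data" might be derived from a GetMetricData response. The helper name is hypothetical and this is not the agent's actual code; the response shape (MetricDataResults with parallel Timestamps and Values lists) is the one the CloudWatch API returns.

```python
from datetime import datetime, timezone


def latest_point(response):
    """Return (timestamp, value) for the newest data point, or None.

    None corresponds to the cloudwatch_no_data case: nothing was
    published inside the lookback window.
    """
    results = response.get("MetricDataResults", [])
    if not results:
        return None
    pairs = list(zip(results[0].get("Timestamps", []),
                     results[0].get("Values", [])))
    if not pairs:
        return None
    # Do not rely on response ordering; pick the newest timestamp explicitly.
    return max(pairs, key=lambda pair: pair[0])


# Example response with two data points, one minute apart.
sample = {
    "MetricDataResults": [{
        "Id": "m0",
        "Timestamps": [
            datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
            datetime(2024, 1, 1, 12, 1, tzinfo=timezone.utc),
        ],
        "Values": [41.5, 43.0],
    }]
}
newest = latest_point(sample)  # the 12:01 point with value 43.0
```

Because a single point anywhere in the window is enough to produce a value, a metric that stops emitting keeps reporting its last point until that point ages out of the lookback, which is why the gap shows up only after about 5 periods.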

AWS credentials

The agent uses the standard AWS SDK credential provider chain, which can resolve credentials from the sources below. Static environment variables take precedence over the role-based sources, and EC2 instance metadata is the last resort; the exact position of the entries in between varies between SDK implementations:

  1. Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optional AWS_SESSION_TOKEN.
  2. AWS shared credentials file (~/.aws/credentials) with an optional AWS_PROFILE env var to pick a profile.
  3. EKS pod identity / IRSA (the agent runs as a Kubernetes pod with an associated service account).
  4. ECS task role (the agent runs as an ECS task with a task role).
  5. EC2 instance metadata (the agent runs on an EC2 instance with an attached IAM role).

Pick whichever fits your deployment. For Kubernetes deployments, IRSA avoids handling access keys: bind the IAM role to a service account and the agent picks up credentials from the pod identity.
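When it is unclear which source the agent will end up using, a quick look at the process environment often narrows it down. The sketch below is a diagnostic heuristic only, not a reimplementation of the SDK chain: the function name is hypothetical and the check order is an assumption, while the environment variable names are the ones the AWS SDKs read.

```python
import os


def likely_credential_source(env=None):
    """Guess the credential source from environment variables alone.

    Heuristic sketch: the real SDK chain also consults config files and
    metadata endpoints, which this function does not inspect.
    """
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    if env.get("AWS_WEB_IDENTITY_TOKEN_FILE") and env.get("AWS_ROLE_ARN"):
        return "EKS IRSA (web identity token)"
    if (env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
            or env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI")):
        return "container credentials (ECS task role or EKS Pod Identity)"
    if env.get("AWS_PROFILE") or env.get("AWS_SHARED_CREDENTIALS_FILE"):
        return "shared credentials file"
    return "default profile or EC2 instance metadata"


print(likely_credential_source())
```

Running `aws sts get-caller-identity` in the same environment remains the authoritative check, since it reports the identity the chain actually resolved.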

Minimum-permissions IAM policy

Attach this policy to the role the agent assumes (or to the access key's user, if you're using static credentials):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ObserverAgentReadMetrics",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    }
  ]
}

GetMetricData and ListMetrics do not support resource-level constraints, so Resource must be *. Tighten the scope at the role's trust policy instead. ListMetrics is required for the console's Fetch from AWS affordance; omit it if you only want read access for probes and accept the curated catalog for discovery.
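Before attaching a policy, it can help to check that the document grants both actions. The following is a minimal sketch with a hypothetical helper name; it only looks at exact action names in Allow statements and deliberately ignores wildcards, Deny statements, and conditions, so treat a clean result as a sanity check rather than a full policy evaluation.

```python
import json

REQUIRED_ACTIONS = ("cloudwatch:GetMetricData", "cloudwatch:ListMetrics")


def missing_actions(policy_json, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement names explicitly."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM accepts a single statement object
        statements = [statements]
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM accepts a single action string
            actions = [actions]
        # IAM action names are case-insensitive.
        allowed.update(action.lower() for action in actions)
    return [action for action in required if action.lower() not in allowed]


probe_only_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "cloudwatch:GetMetricData",
        "Resource": "*",
    }],
})
print(missing_actions(probe_only_policy))  # ['cloudwatch:ListMetrics']
```

A policy that reports only cloudwatch:ListMetrics as missing is still sufficient for probes; it just disables Fetch from AWS.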

Cross-account access

When the metric lives in account B and the agent runs in account A:

  1. In account B, create a role (e.g. observer-cloudwatch-read) with the policy above. Add a trust policy permitting account A's role / user to assume it:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::AAAAAAAAAAAA:role/observer-agent"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": { "sts:ExternalId": "your-external-id" }
          }
        }
      ]
    }

  2. In Observer, set the metric definition's Role ARN to arn:aws:iam::BBBBBBBBBBBB:role/observer-cloudwatch-read and External ID to your-external-id. The agent will call sts:AssumeRole with its ambient credentials before each GetMetricData.

  3. In account A, attach this inline policy to the role the agent uses (the same identity named in the Principal of step 1's trust policy). That's usually the IRSA service-account role, the EC2 instance role, or the ECS task role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "sts:AssumeRole",
          "Resource": "arn:aws:iam::BBBBBBBBBBBB:role/observer-cloudwatch-read"
        }
      ]
    }

AWS does not require an external ID, but setting one is recommended. It guards against the "confused deputy" problem, in which a third party who learns the role ARN tricks the agent in account A into assuming a role in account B on that third party's behalf.
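The cross-account flow can be sketched as a single AssumeRole call whose temporary credentials are then used to build the CloudWatch client. This is an assumption about the general shape, not the agent's code: the helper name and the session name "observer-agent" are hypothetical, and the STS client is passed in so the function has no hard dependency on boto3. The parameter and response field names match the STS AssumeRole API.

```python
def assumed_role_client_kwargs(sts_client, role_arn, external_id=None,
                               session_name="observer-agent"):
    """Assume the target role and return keyword arguments for a client.

    sts_client is any object exposing assume_role(**kwargs), such as
    boto3.client("sts") created from the agent's ambient credentials.
    """
    kwargs = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id:
        # Must match the sts:ExternalId condition in the trust policy.
        kwargs["ExternalId"] = external_id
    credentials = sts_client.assume_role(**kwargs)["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Usage with boto3 (assumed to be installed; not imported here):
#   import boto3
#   creds = assumed_role_client_kwargs(
#       boto3.client("sts"),
#       "arn:aws:iam::BBBBBBBBBBBB:role/observer-cloudwatch-read",
#       external_id="your-external-id",
#   )
#   cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1", **creds)
```

If the external ID is configured on one side only, the AssumeRole call is rejected, so both the trust policy and the metric definition need the same value.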

Configure a metric in the Observer console

Create a metric, pick AWS CloudWatch, fill in:

  • Region: AWS region code (us-east-1, eu-west-2, etc.).
  • Period: 60s, 300s, 900s, or 3600s. A lower period means more API calls per hour (one per cron tick) and finer-grained alerting. See Cost considerations.
  • Namespace: AWS/RDS, AWS/Lambda, AWS/ApplicationELB, or your custom namespace. See Common namespaces and metrics.
  • Metric name: e.g. CPUUtilization. Case-sensitive.
  • Dimensions: Key=Value lines that scope to a single resource (e.g. DBInstanceIdentifier=prod-db).
  • Statistic: Average, Sum, Minimum, Maximum, SampleCount, or a percentile (p50, p95, p99.9). See Statistic reference.
  • Role ARN / External ID (optional): for cross-account reads. See Cross-account access.
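To show what well-formed Dimensions and Statistic values look like, here is a small validation sketch. Both function names are hypothetical, and the console's own validation rules are not documented here. The percentile pattern covers only the forms listed above (p50, p95, p99.9); CloudWatch accepts further statistics that this sketch does not model.

```python
import re

NAMED_STATISTICS = {"Average", "Sum", "Minimum", "Maximum", "SampleCount"}
# "p" followed by a number from 0 to 100, optionally with decimals.
PERCENTILE = re.compile(r"^p(100|\d{1,2}(\.\d{1,10})?)$")


def parse_dimensions(text):
    """Turn Key=Value lines into the Dimensions list CloudWatch expects."""
    dimensions, seen = [], set()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        name, separator, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if not separator or not name or not value:
            raise ValueError(f"line {number}: expected Key=Value, got {raw!r}")
        if name in seen:
            raise ValueError(f"line {number}: duplicate dimension {name!r}")
        seen.add(name)
        dimensions.append({"Name": name, "Value": value})
    return dimensions


def is_valid_statistic(statistic):
    """Accept the named statistics and pNN / pNN.N percentiles."""
    return statistic in NAMED_STATISTICS or bool(PERCENTILE.match(statistic))


dims = parse_dimensions("DBInstanceIdentifier=prod-db\n")
ok = is_valid_statistic("p99.9")
```

Statistic names are matched case-sensitively in this sketch, mirroring the case sensitivity noted for metric names.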

Statistic reference

Pick a statistic that maps your metric's meaning to a single number per period:

Metric shape | Recommended statistic | Example
--- | --- | ---
Gauge / utilization (CPU, memory, queue depth) | Average | AWS/RDS CPUUtilization Average
Counter (requests, errors, invocations) | Sum | AWS/Lambda Invocations Sum
Latency-style with skew | p95 or p99 | AWS/ApplicationELB TargetResponseTime p95
Spike detection | Maximum | AWS/ApplicationELB HTTPCode_Target_5XX_Count Maximum

If you pick Average for a metric CloudWatch only stores as a count, you get a cloudwatch_no_data result. The AWS Console under CloudWatch → Metrics shows which statistics each metric supports.

Common namespaces and metrics

A non-exhaustive starting list; the AWS Console is authoritative.

Service | Namespace | Useful metrics
--- | --- | ---
RDS | AWS/RDS | CPUUtilization, FreeableMemory, DatabaseConnections, ReadLatency, WriteLatency
Lambda | AWS/Lambda | Invocations, Errors, Duration, ConcurrentExecutions, Throttles
Application Load Balancer | AWS/ApplicationELB | RequestCount, HTTPCode_Target_5XX_Count, TargetResponseTime, HealthyHostCount
SQS | AWS/SQS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage
API Gateway | AWS/ApiGateway | Count, 4XXError, 5XXError, Latency
ECS | AWS/ECS | CPUUtilization, MemoryUtilization

For latency-sensitive services, prefer p95 / p99 statistics over Average. Average hides tail-latency regressions.

Cost considerations

CloudWatch GetMetricData is billed per metric retrieved (see the AWS CloudWatch pricing page for current rates; at the time of writing it's roughly $0.01 per 1,000 metrics retrieved). One Observer metric definition issues one GetMetricData call per cron tick, retrieving one metric.

Back-of-envelope at 60s period and 1 metric def: 43,200 calls per 30-day month. The agent batches dimensions inside one query but does not batch across metric defs; if you have 50 CloudWatch-backed metrics with 60s periods, you're at ~2.16M calls per month.
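The arithmetic above can be reproduced with a few lines of Python. This is a sketch under two assumptions: the cron tick equals the period, and the rate is the approximate $0.01 per 1,000 metrics quoted above, which may not match current AWS pricing. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def monthly_calls(period_seconds, metric_defs=1, days=30):
    """GetMetricData calls per month, one call per definition per tick."""
    ticks = SECONDS_PER_DAY * days // period_seconds
    return ticks * metric_defs


def monthly_cost_usd(period_seconds, metric_defs=1, days=30,
                     usd_per_1000_metrics=0.01):
    """Approximate charge, with each call retrieving exactly one metric."""
    calls = monthly_calls(period_seconds, metric_defs, days)
    return calls * usd_per_1000_metrics / 1000


print(monthly_calls(60))                       # 43200
print(monthly_calls(60, metric_defs=50))       # 2160000
print(monthly_calls(300, metric_defs=50))      # 432000
print(f"{monthly_cost_usd(60, 50):.2f}")       # 21.60
```

At the quoted rate, 50 definitions at a 60s period come to roughly $21.60 per 30-day month, and the same 50 at 300s to about a fifth of that.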

Two ways to keep the bill bounded:

  1. Raise the period for metrics that don't need 1-minute granularity. A 300s period cuts the call rate by 5×.
  2. Move stable metrics to Prometheus via a CloudWatch exporter. The exporter consolidates many CloudWatch metrics into one exporter scrape; Observer reads the exporter via Prometheus without per-metric CloudWatch billing.

Reason codes specific to CloudWatch

The reason field on no_data results carries one of:

  • cloudwatch_no_data: GetMetricData returned an empty value list. The metric is not publishing, or no data point exists in the lookback (5 periods).
  • cloudwatch_access_denied: the agent's credentials cannot call GetMetricData against this metric. Check the IAM policy on the role / user the agent assumes.
  • cloudwatch_throttled: AWS is rate-limiting the agent. Raise the period, or split the metric across multiple agents in different AWS accounts.
  • cloudwatch_invalid_parameter: the request was malformed. Common causes: a dimension name CloudWatch doesn't recognize, or a statistic the metric doesn't support.
  • cloudwatch_resource_not_found: the namespace, metric name, and dimension combination doesn't exist (and never has) in this region and account.
  • cloudwatch_expired_credentials: an STS session expired. The agent refreshes automatically; this should self-heal on the next tick.
  • cloudwatch_server_error: AWS returned 5xx. Transient; usually clears within minutes.
  • cloudwatch_error: an uncategorized error. Check the AWS service health dashboard.
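For readers who want to reason about which AWS failure leads to which reason code, the sketch below shows one plausible mapping. The mapping table and the function name are assumptions made for illustration; the agent's actual classification is not documented on this page. The keys are error codes that AWS services return, and the values are the reason codes listed above.

```python
# Hypothetical mapping from AWS error codes to the reason codes above.
REASON_BY_ERROR_CODE = {
    "AccessDenied": "cloudwatch_access_denied",
    "AccessDeniedException": "cloudwatch_access_denied",
    "Throttling": "cloudwatch_throttled",
    "ThrottlingException": "cloudwatch_throttled",
    "InvalidParameterValue": "cloudwatch_invalid_parameter",
    "InvalidParameterCombination": "cloudwatch_invalid_parameter",
    "ResourceNotFound": "cloudwatch_resource_not_found",
    "ExpiredToken": "cloudwatch_expired_credentials",
    "ExpiredTokenException": "cloudwatch_expired_credentials",
    "InternalServiceError": "cloudwatch_server_error",
}


def classify(error_code=None, http_status=None, values=None):
    """Return a reason code, or None when a data point is available."""
    if error_code in REASON_BY_ERROR_CODE:
        return REASON_BY_ERROR_CODE[error_code]
    if http_status is not None and http_status >= 500:
        return "cloudwatch_server_error"
    if error_code is not None:
        return "cloudwatch_error"  # uncategorized failure
    if not values:
        return "cloudwatch_no_data"  # call succeeded, value list empty
    return None


print(classify(error_code="Throttling", http_status=400))  # cloudwatch_throttled
print(classify(values=[]))                                 # cloudwatch_no_data
```

Note that cloudwatch_no_data is the only code in this sketch that arises from a successful call; every other code corresponds to an error response.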

Troubleshooting

Each entry leads with the symptom and the action to take.

  • cloudwatch_access_denied and the IAM policy looks right. Check the role's trust policy. The agent's identity must be a principal named in the target role's AssumeRolePolicyDocument. Run aws sts get-caller-identity from the agent's host to confirm which identity it's using.
  • cloudwatch_no_data but the metric is visible in the AWS Console. Check the dimensions exactly. CloudWatch matches on the full dimension set: a metric published with {DBInstanceIdentifier=prod-db} is not the same metric as {DBInstanceIdentifier=prod-db, EngineName=postgres}. The Console shows you the dimensions when you click a metric.
  • cloudwatch_throttled repeatedly. Multiple metric definitions on the same agent share an account-wide TPS limit. Either raise the period for non-critical metrics or split agents per account.
  • Wrong region. A metric in eu-west-1 is invisible from a us-east-1 query. Create a separate metric definition for each region.
  • Cross-account read returns cloudwatch_no_data but works for the IAM user directly. The assumed-role session inherits the role's policy, not the trust-policy principal's. Add cloudwatch:GetMetricData to the target role's policy, not the source role's.
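The dimension-mismatch case is easy to check mechanically once you have ListMetrics output for the namespace and metric name. The sketch below uses a hypothetical helper name and is not part of the agent; it compares the configured dimensions against each published metric as a full set, and the input shape follows the ListMetrics response (a Metrics list whose entries carry a Dimensions list).

```python
def match_dimensions(list_metrics_page, configured):
    """Split published metrics into exact and partial dimension matches.

    exact:   same dimension set as configured (this is what CloudWatch
             needs in order to return data).
    partial: shares at least one dimension but differs somewhere, the
             usual cause of cloudwatch_no_data for a "visible" metric.
    """
    wanted = {(d["Name"], d["Value"]) for d in configured}
    exact, partial = [], []
    for metric in list_metrics_page.get("Metrics", []):
        published = {(d["Name"], d["Value"])
                     for d in metric.get("Dimensions", [])}
        if published == wanted:
            exact.append(metric)
        elif published & wanted:
            partial.append(metric)
    return exact, partial


page = {"Metrics": [{
    "Namespace": "AWS/RDS",
    "MetricName": "CPUUtilization",
    "Dimensions": [
        {"Name": "DBInstanceIdentifier", "Value": "prod-db"},
        {"Name": "EngineName", "Value": "postgres"},
    ],
}]}
exact, partial = match_dimensions(
    page, [{"Name": "DBInstanceIdentifier", "Value": "prod-db"}]
)
# exact is empty; partial holds the metric published with the extra dimension
```

An empty exact list with a non-empty partial list means the definition needs the additional dimensions copied from the published metric.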

Known limits

  • One metric per definition. This source does not batch across metric defs. If you need 500 metrics in one API call, write a custom collector that calls GetMetricData directly and forward the aggregates as OTLP.
  • Static credentials are not stored. If you cannot use a role-based source (instance role, task role, or IRSA), set the access key env vars on the agent process. Per-metric access keys would require an encryption + rotation surface that v1 deliberately omits.
  • Two ways to discover metric names. The namespace + metric-name inputs are pre-populated from a curated catalog covering the AWS services most operators instrument (RDS, Lambda, ApplicationELB, EC2, SQS, ApiGateway, ECS, DynamoDB, S3, CloudFront, SNS, Step Functions, Kinesis, Network LB). For custom namespaces or AWS services outside the catalog, click Fetch from AWS next to the metric name input: the agent runs cloudwatch:ListMetrics against the configured region (using its own AWS credentials, including any cross-account role ARN you set) and returns the live list within ~5 seconds. Click a row to fill the metric name + dimensions in one step.