CloudWatch

Amazon CloudWatch is, at its core, a metrics repository. It is hosted in the AWS Public Zone, so on-premises resources can be monitored with no additional network configuration.

Namespaces: A namespace is a container for related CloudWatch metrics. Metrics in different namespaces are isolated from each other. There is no default namespace: you must specify one for every data point you publish.
AWS services use namespaces of the form AWS/<service_name> (e.g. "AWS/EC2").

Metrics: a time-ordered set of data points that are published to CloudWatch. Metrics are uniquely defined by:

A Name
A Namespace
0 or more dimensions

Timestamps: Each metric data point must be associated with a time stamp. The time stamp can be up to two weeks in the past and up to two hours into the future. If you do not provide a time stamp, CloudWatch creates a time stamp for you based on the time the data point was received. Time stamps are dateTime objects, with the complete date plus hours, minutes, and seconds (for example, 2016-10-31T23:59:59Z).
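The accepted time stamp window can be sketched in a few lines of Python. This is only an illustration of the limits stated above, not CloudWatch's own validation; the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Window limits as stated in the notes above.
MAX_PAST = timedelta(weeks=2)
MAX_FUTURE = timedelta(hours=2)

def timestamp_is_accepted(ts, now=None):
    """Return True if `ts` falls inside the publishable window around `now`."""
    now = now or datetime.now(timezone.utc)
    return now - MAX_PAST <= ts <= now + MAX_FUTURE

def format_timestamp(ts):
    """Render a datetime as UTC in the form shown in the example above."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Passing `now` explicitly keeps the check deterministic, which is convenient when testing.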

Dimensions: a name/value pair that is part of the identity of a metric. You can assign up to 30 dimensions to a metric. CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. You can only retrieve statistics using combinations of dimensions that you specifically published. The exception is the metric math SEARCH function, which can retrieve statistics for multiple metrics.
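To see why each combination of dimensions counts as a separate metric, here is a minimal in-memory sketch. The namespace "MyApp", the metric "Latency" and the dimension names are hypothetical, and the `store` dictionary is only a stand-in for the real repository.

```python
from collections import defaultdict

def metric_key(namespace, name, dimensions=None):
    """Build a hashable identity from namespace, name and dimensions."""
    return (namespace, name, frozenset((dimensions or {}).items()))

store = defaultdict(list)  # in-memory stand-in for the metrics repository

def publish(namespace, name, value, dimensions=None):
    """Append a value to the metric identified by the full key."""
    store[metric_key(namespace, name, dimensions)].append(value)

# Same namespace and name, but two different dimension combinations.
publish("MyApp", "Latency", 120, {"Server": "prod", "Domain": "Frankfurt"})
publish("MyApp", "Latency", 95, {"Server": "prod"})
publish("MyApp", "Latency", 110, {"Server": "prod", "Domain": "Frankfurt"})
```

After these three calls the store holds two metrics, not one. A lookup with only `{"Domain": "Frankfurt"}` finds nothing, because that combination was never published.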

Resolution:

standard resolution: one-minute granularity.
high resolution: one-second granularity.

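Period: the length of time, in seconds, associated with a statistic (in practice, the time between returned data points). Valid values are 1, 5, 10, 30, or any multiple of 60.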
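The rule for valid periods is simple enough to express as a one-line check. This is a sketch of the rule as stated in these notes; the function name is illustrative.

```python
def is_valid_period(seconds):
    """True for 1, 5, 10, 30 or any positive multiple of 60 seconds."""
    return seconds in (1, 5, 10, 30) or (seconds > 0 and seconds % 60 == 0)
```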

Metrics exist only in the Region in which they are created. Metrics cannot be deleted, but they automatically expire after 15 months if no new data is published to them. Data points older than 15 months expire on a rolling basis; as new data points come in, data older than 15 months is dropped.

Metrics

Retention of metrics depends on the period:

Data points with a period of less than 60 seconds are available for 3 hours. These data points are high-resolution custom metrics.
Data points with a period of 60 seconds (1 minute) are available for 15 days.
Data points with a period of 300 seconds (5 minutes) are available for 63 days.
Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months)

Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days this data is still available, but is aggregated and is retrievable only with a resolution of 5 minutes. After 63 days, the data is further aggregated and is available with a resolution of 1 hour.
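The aggregation schedule above can be summarised as a lookup from data age to the finest period still retrievable. This is a sketch that assumes the tiers listed in these notes; the names are illustrative and not part of any AWS API.

```python
DAY = 86400

RETENTION_TIERS = [
    # (maximum age in seconds, finest period kept at that age in seconds)
    (3 * 3600, 1),      # sub-minute data points: 3 hours
    (15 * DAY, 60),     # 1-minute data points: 15 days
    (63 * DAY, 300),    # 5-minute data points: 63 days
    (455 * DAY, 3600),  # 1-hour data points: 455 days
]

def finest_period(age_seconds, published_period=60):
    """Finest retrievable period for data of this age, or None if expired."""
    for max_age, period in RETENTION_TIERS:
        if age_seconds <= max_age:
            # Data can never be finer than the period it was published with.
            return max(period, published_period)
    return None
```

For example, data published every minute is retrievable at 60 seconds after 10 days, at 300 seconds after 20 days, and at 3600 seconds after 100 days.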

Alarms

Metric alarms: they watch a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods.
Composite alarms: they watch other alarms (metric or composite) and go into ALARM when the rule expression that combines those alarm states evaluates to true. They can reduce alarm noise. They’re not supported for cross-account scenarios.
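A composite alarm rule is a boolean expression over the states of other alarms, for example ALARM("HighCPU") AND ALARM("HighLatency") AND NOT ALARM("DeployInProgress"). The sketch below hard-codes that one rule in Python to show how it suppresses noise. The three alarm names are hypothetical, and this is not how CloudWatch parses rules.

```python
def composite_state(states):
    """Evaluate the example rule against a dict of alarm name -> state."""
    def alarm(name):
        return states[name] == "ALARM"

    # Fire only when both symptoms are present and no deployment is running.
    fires = alarm("HighCPU") and alarm("HighLatency") and not alarm("DeployInProgress")
    return "ALARM" if fires else "OK"
```

With this rule, a CPU spike alone does not notify anyone, and both alarms firing during a deployment are suppressed as well.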

An alarm normally invokes its actions only when it changes state. The exception is alarms with Auto Scaling actions, which continue to invoke the action once per minute for as long as the alarm remains in the new state.

Default resolution for alarms is 1 minute. High-resolution alarms can be set to 10 or 30 seconds for higher charges.

There’s no limit to the number of alarms you can create.

You can create alarms for custom metrics before creating those custom metrics.

With Amazon CloudWatch cross-account observability, you can monitor and troubleshoot applications that span multiple accounts within a Region.

States:

OK
ALARM
INSUFFICIENT_DATA

Datapoint states:

Not breaching (within the threshold)
Breaching (violating threshold)
Missing: no data was reported for the period. How the alarm treats it is configurable (see Evaluating Missing Data).

Alarm evaluation

When you create an alarm, you specify three settings:

Period: the length of time, in seconds, over which the metric or expression is evaluated to produce each data point.
Evaluation Periods: the number of most recent periods/data points to evaluate.
Datapoints to Alarm: how many breaching data points within the evaluation periods are needed to trigger the alarm. They don’t need to be consecutive, but they must all fall within the evaluation periods.

If the period is 1 minute or longer, the alarm is evaluated every minute.
For example, if the Period is 5 minutes (300 seconds) and Evaluation Periods is 1, then at the end of minute 5 the alarm evaluates based on data from minutes 1 to 5. Then at the end of minute 6, the alarm is evaluated based on the data from minutes 2 to 6.

When you configure Evaluation Periods and Datapoints to Alarm as different values, you’re setting an "M out of N" alarm.

The number of evaluation periods for an alarm multiplied by the length of each evaluation period can’t exceed one day.
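An "M out of N" evaluation can be sketched as follows. This is a simplified model that assumes a greater-than-threshold comparison and a complete list of data points; the function names are illustrative.

```python
def alarm_is_breached(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """True when at least M of the N most recent data points exceed the threshold.

    N is `evaluation_periods` (must be >= 1) and M is `datapoints_to_alarm`.
    """
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm

def evaluation_window_is_allowed(period_seconds, evaluation_periods):
    """Check the one-day limit on period length times evaluation periods."""
    return period_seconds * evaluation_periods <= 86400
```

With data points [10, 95, 20, 91, 30] and a threshold of 90, a "2 out of 5" alarm is breached even though the two breaching points are not consecutive. A "2 out of 3" alarm over the same data is not, because only one of the last three points breaches.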

Evaluating Missing Data

When a data point is missing, you can tell CloudWatch how the alarm should treat it:

notBreaching
breaching
ignore: keep the current alarm state.
missing: if all data points in the alarm evaluation range are missing, the alarm transitions to the INSUFFICIENT_DATA state.
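The four options can be compared with a small state function. This is a deliberately simplified model: the real evaluation looks at a wider evaluation range and has more edge cases. Here missing data points are represented as None, the comparison is assumed to be greater-than-threshold, and partially missing windows under "missing" or "ignore" simply skip the gaps.

```python
def next_state(current_state, datapoints, threshold, datapoints_to_alarm,
               treat_missing="missing"):
    """Return the alarm state after evaluating one window of data points."""
    present = [v for v in datapoints if v is not None]
    missing = len(datapoints) - len(present)
    breaching = sum(1 for v in present if v > threshold)

    if missing == len(datapoints):
        # The whole window is empty.
        if treat_missing == "ignore":
            return current_state
        if treat_missing == "missing":
            return "INSUFFICIENT_DATA"

    if treat_missing == "breaching":
        # Count every gap as a breaching data point.
        breaching += missing
    # "notBreaching" counts gaps as good data points, so nothing is added.
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```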

Alarm Actions

Invoke a Lambda Function
Trigger an EC2 action (for alarms based on EC2)
Scale an ASG
Create OpsItems in Systems Manager OpsCenter
Create incidents in AWS Systems Manager Incident Manager

CloudWatch Logs

CloudWatch is hosted in the AWS Public Zone, so CloudWatch Logs can be used from on-premises environments without any special networking configuration.

Concepts

Log Class:
- The Standard log class is a full-featured option
- The Infrequent Access log class is a lower-cost option for logs that you access less frequently. It supports a subset of the Standard log class capabilities.
Log events: a record of some activity recorded by the application or resource being monitored. E.g.: a line in the Apache access log.
Log streams: a sequence of log events that share the same source. E.g.: all the lines coming from the Apache access log of one instance.
Log Group: a group of log streams that share the same retention, monitoring, and access control settings. E.g.: all the streams from an ASG of EC2 instances running the Apache web server whose logs are being streamed. You can export the log group data to an S3 bucket (bucket encryption is supported, except for DSSE-KMS). Data can also be exported to OpenSearch, but you may incur high usage charges for large amounts of data.
Metric filters: filters applied to log events to extract data that is then published as a metric. You can give dimensions and a unit to the metric.
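The idea behind a metric filter can be illustrated with a small sketch that emits one metric value per matching log event. Real metric filters use CloudWatch's own filter pattern syntax; a Python regular expression is swapped in here purely for illustration, and the sample Apache access log lines are made up.

```python
import re

def apply_metric_filter(log_events, pattern, metric_value=1):
    """Return one metric value for each log event whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [metric_value for event in log_events if regex.search(event)]

# A log stream: events from one source (hypothetical Apache access log lines).
stream = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2023:13:55:41 +0000] "GET /gone HTTP/1.1" 404 209',
]

# Count responses with status 404, as a "NotFoundCount"-style metric would.
not_found_count = sum(apply_metric_filter(stream, r'" 404 '))
```

Summing the emitted values over a period gives the data point that would be published, which an alarm could then watch.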

Log Classes

Standard Log Class

Fully managed log ingestion and storage
Cross-account features
Encryption with AWS KMS
CloudWatch Logs Insights query commands
CloudWatch Logs Insights discovered fields
Natural language query assist
CloudWatch Logs Anomaly Detection
Compare to previous time range
Subscription filters
Export to Amazon S3
GetLogEvents and FilterLogEvents API operations
Metric filters
Container Insights log ingestion
Lambda Insights log ingestion
Sensitive data protection with masking
Embedded metric format

Infrequent Access Log Class

Fully managed log ingestion and storage
Cross-account features
Encryption with AWS KMS
CloudWatch Logs Insights query commands (not all commands)

Use Cases

Query your log data
Detect and debug issues using Live Tail
Monitor EC2 instances
Monitor AWS CloudTrail logged events
Audit and mask sensitive data
Log Route 53 DNS queries