Chapter 9
Monitoring and Metrics

THE AWS CERTIFIED SYSOPS ADMINISTRATOR - ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

  • Domain 1.0 Monitoring and Metrics
  • 1.1 Demonstrate ability to monitor availability and performance
  • 1.2 Demonstrate ability to monitor and manage billing and cost optimization processes
  • Domain 3.0 Analysis
  • 3.1 Optimize the environment to ensure maximum performance
  • 3.2 Identify performance bottlenecks and implement remedies
  • 3.3 Identify potential issues on a given application deployment
  • Domain 6.0 Security
  • 6.2 Ensure data integrity and access controls when using the AWS platform
  • 6.4 Demonstrate ability to prepare for security assessment use of AWS


Introduction to Monitoring and Metrics

This chapter covers the collection of metrics from Amazon CloudWatch which, in conjunction with other AWS Cloud services, can assist in the management, deployment, and optimization of workloads in the cloud.

In addition to performance monitoring, services such as Amazon CloudWatch Logs, AWS Config, AWS Trusted Advisor, and AWS CloudTrail can provide a detailed inventory of provisioned resources for security audits and financial accounting.

Amazon CloudWatch was designed to monitor cloud-based computing. Even so, when using Amazon CloudWatch Logs, systems in an existing, on-premises datacenter can send log information to Amazon CloudWatch for monitoring.

Sometimes, it is important to monitor and view the health of AWS in general. To do this, there are two tools: the AWS Service Health Dashboard and the AWS Personal Health Dashboard. These services display the general status of AWS and provide a personalized view of the performance and availability of provisioned resources.

This chapter covers these topics separately and also highlights how they can work together to maintain a robust environment on AWS.

An Overview of Monitoring

Everything fails, all the time.

—Werner Vogels

Computing systems are incredibly complex. Troubleshooting them effectively requires easily understood data delivered in real time. Service Level Agreements (SLAs) often require high levels of availability, and a lack of meaningful information can lead to lost time and revenue.

Why Monitor?

Monitoring provides several major benefits:

  • Monitoring enables systems operators to catch issues before they become problems. This, in turn, maintains high availability and delivers high-quality customer service.
  • Monitoring provides tools for making informed decisions about capacity planning.
  • Monitoring is an input mechanism for automation.
  • Monitoring provides visibility into the cost, utilization, and security of computing resources.

Monitoring is the process of observing and recording resource utilization in real time. Alarms are notifications, based on this information, that are sent in response to a predefined condition. Frequently, this condition involves failure. Alarms can also be configured to send notifications when resources are being underutilized and money is being wasted.

Traditional monitoring tools have been designed around on-premises datacenters with the idea that servers are going to be in place for an extended amount of time. Because of this, these tools have difficulty distinguishing between an Amazon Elastic Compute Cloud (Amazon EC2) instance that has failed and one that was terminated purposely. AWS created its own monitoring service, Amazon CloudWatch, that is integrated with other AWS Cloud services and uses AWS Identity and Access Management (IAM) to keep monitoring data secure.

Amazon CloudWatch

Amazon CloudWatch is a service that monitors the health and status of AWS resources in real time. It provides system-wide visibility into resource utilization, application performance, and operational health by tracking, measuring, reporting, alerting, and reacting to events that occur in an environment.

Amazon CloudWatch Logs collects and monitors log files, can set alarms, and automatically reacts to changes in AWS resources. Logs can be monitored in real time or stored for analysis.

Amazon CloudWatch Alarms monitor a single metric and perform one or more actions based on customer-defined criteria.

Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. These streams are delivered to Amazon EC2 instances, AWS Lambda functions, Amazon Kinesis Streams, Amazon EC2 Container Service (Amazon ECS) tasks, AWS Step Functions state machines, Amazon Simple Notification Service (Amazon SNS) topics, Amazon Simple Queue Service (Amazon SQS) queues, or built-in targets.

AWS CloudTrail

AWS CloudTrail records Application Programming Interface (API) calls made in an account, including calls made to the Amazon CloudWatch Events API. This includes calls made by the AWS Management Console, the AWS Command Line Interface (AWS CLI), and other AWS Cloud services. When AWS CloudTrail logging is turned on, these events are written as log files to an Amazon Simple Storage Service (Amazon S3) bucket.

AWS Config

AWS Config provides a detailed view of the configuration of AWS resources in an AWS account, including how the resources are related to one another. It also provides historical information to show how configurations and relationships have changed over time. Related to monitoring, AWS Config allows customers to create rules that check the configuration of their AWS resources and check for compliance with an organization’s policies. When an AWS Config rule is triggered, it generates an event that can be captured by Amazon CloudWatch Events.

Amazon CloudWatch can monitor AWS resources, such as Amazon EC2 instances, Amazon DynamoDB tables, Amazon Relational Database Service (Amazon RDS) DB instances, custom metrics generated by applications and services, and log files generated by applications and operating systems.

AWS Trusted Advisor

AWS Trusted Advisor is an online resource designed to help reduce cost, increase performance, and improve security by optimizing an AWS environment. It provides real-time guidance to help provision resources following AWS best practices.

AWS Trusted Advisor checks for best practices in four categories:

  • Cost Optimization
  • Security
  • Fault Tolerance
  • Performance Improvement

The following four AWS Trusted Advisor checks are available at no charge to all AWS customers to help improve security and performance:

  • Service Limits
  • Security Groups – Specific Ports Unrestricted
  • IAM Use
  • Multi-Factor Authentication (MFA) on the Root Account

AWS Service Health Dashboard

The AWS Service Health Dashboard provides access to current status and historical data about every AWS Cloud service. If there’s a problem with a service, it is possible to expand the appropriate line in the details section to get more information.

In addition to the dashboard, it is also possible to subscribe to the RSS feed for any service.

For anyone experiencing a real-time operational issue with one of the AWS Cloud services currently reporting as being healthy, there is a Contact Us link at the top of the page to report an issue.

The AWS Service Health Dashboard is available at http://status.aws.amazon.com/.


AWS Personal Health Dashboard

The AWS Personal Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that impact customers. While the AWS Service Health Dashboard displays the general status of AWS Cloud services, the AWS Personal Health Dashboard provides a personalized view into the performance and availability of the AWS Cloud services underlying provisioned AWS resources.

The dashboard displays relevant and timely information to help manage events in progress and provides proactive notification to help plan scheduled activities. Alerts are automatically triggered by changes in the health of AWS resources and provide event visibility and also guidance to help diagnose and resolve issues quickly.

Now let’s take a deep dive into each of these services.

Amazon CloudWatch

Amazon CloudWatch monitors, in real time, AWS resources and applications running on AWS. Amazon CloudWatch is used to collect and track metrics, which are variables used to measure resources and applications. Amazon CloudWatch Alarms send notifications and can automatically make changes to the resources being monitored based on user-defined rules. Amazon CloudWatch is basically a metrics repository. An AWS product such as Amazon EC2 puts metrics into the repository and customers retrieve statistics based on those metrics. Additionally, custom metrics can be placed into Amazon CloudWatch for reporting and statistical analysis.

For example, it is possible to monitor the CPU usage and disk reads and writes of Amazon EC2 instances. With this information, it is possible to determine when additional instances should be launched to handle the increased load. Additionally, these new instances can be launched automatically before there is a problem, eliminating the need for human intervention. Conversely, monitoring data can be used to stop underutilized instances automatically in order to save money.

In addition to monitoring the built-in metrics that come with AWS, it is possible to create, monitor, and trigger actions using custom metrics. Amazon CloudWatch provides system-wide visibility into resource utilization, application performance, and operational health. Figure 9.1 illustrates how Amazon CloudWatch connects to both AWS and on-premises environments.


FIGURE 9.1 Amazon CloudWatch integration

Amazon CloudWatch can be accessed using the following methods:

  • The Amazon CloudWatch console
  • The AWS Command Line Interface (AWS CLI)
  • The Amazon CloudWatch API
  • The AWS Software Development Kits (SDKs)

Metrics

At the core of Amazon CloudWatch are metrics, which are time-ordered sets of data points that contain information about the performance of resources. By default, several services provide some metrics at no additional charge. These metrics include information from Amazon EC2 instances, Amazon Elastic Block Store (Amazon EBS) volumes, and Amazon RDS DB instances. It is also possible to enable detailed monitoring for some resources, such as Amazon EC2 instances.



Custom Metrics

In addition to monitoring AWS resources, Amazon CloudWatch can be used to monitor data produced from applications, scripts, and services. A custom metric is any metric provided to Amazon CloudWatch via an agent or an API.

Custom metrics can be used to monitor the time it takes to load a web page, capture request error rates, monitor the number of processes or threads on an instance, or track the amount of work performed by an application.

Ways to create custom metrics include the following:

  • The PutMetricData API.
  • AWS-provided sample monitoring scripts for Windows and Linux.
  • The Amazon CloudWatch collectd plugin.
  • Applications and tools offered by AWS Partner Network (APN) partners.

Custom metrics come at an additional cost based on usage.
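
For example, the first option in the list above, the PutMetricData API, can be called from the AWS SDK for Python (boto3). The following is a minimal sketch; the namespace, metric name, dimension, and value are assumptions chosen purely for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a single data point for a hypothetical custom metric.
cloudwatch.put_metric_data(
    Namespace="MyApp/Performance",          # custom namespace (assumption)
    MetricData=[
        {
            "MetricName": "PageLoadTime",   # custom metric name (assumption)
            "Dimensions": [
                {"Name": "Environment", "Value": "Production"}
            ],
            "Value": 0.87,                  # seconds taken to load the page
            "Unit": "Seconds",
        }
    ],
)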


Amazon CloudWatch Metrics Retention

In November 2016, Amazon CloudWatch changed the length of time metrics are stored inside the service as follows:

  • One-minute data points are available for 15 days.
  • Five-minute data points are available for 63 days.
  • One-hour data points are available for 455 days.

If metrics need to be available for longer than those periods, they can be archived using the GetMetricStatistics API call.

Metrics cannot be deleted. They automatically expire after 15 months if no new data is published to them.
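
As a rough sketch of that archiving approach, the following boto3 call retrieves hourly data points for an instance so that they can be copied to longer-term storage before they expire; the instance ID is a placeholder.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Pull hourly CPU statistics for the last 14 days so they can be stored
# outside of Amazon CloudWatch (for example, in Amazon S3).
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,                 # one-hour data points
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])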



Namespaces

A namespace is a container for a collection of Amazon CloudWatch metrics. Each namespace is isolated from other namespaces. This isolation ensures that data collected is only from services specified and prevents different applications from mistakenly aggregating the same statistics.

There are no default namespaces. When creating a custom metric, a namespace is required. If the specified namespace does not exist, Amazon CloudWatch will create it.

Namespace names must contain valid XML characters and be fewer than 256 characters in length.

Allowed characters in namespace names are as follows:

  • Alphanumeric characters (0-9, A-Z, a-z)
  • Period (.)
  • Hyphen (-)
  • Underscore (_)
  • Forward Slash (/)
  • Hash (#)
  • Colon (:)

AWS namespaces use the following naming convention: AWS/service. For example, Amazon EC2 uses the AWS/EC2 namespace. Sample AWS Namespaces are shown in Table 9.1.


TABLE 9.1 A Small Sample of AWS Namespaces

AWS Product              Namespace
Auto Scaling             AWS/AutoScaling
Amazon EC2               AWS/EC2
Amazon EBS               AWS/EBS
Elastic Load Balancing   AWS/ELB (Classic Load Balancers)
Elastic Load Balancing   AWS/ApplicationELB (Application Load Balancers)

For a comprehensive list of namespaces, see http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-namespaces.html.

Dimensions

A dimension is a name/value pair that uniquely identifies a metric and further clarifies the metric data stored. A metric can have up to 10 dimensions.

Every metric has specific characteristics that describe it. Think of dimensions as categories or metadata for those characteristics. The categories can aid in the design of a structure for a statistics plan. Because dimensions are part of the unique identifier for a metric, whenever a unique name/value pair is added to a metric, a new metric is created.


Dimensions can be used to filter the results from Amazon CloudWatch. For example, it is possible to get statistics for a specific Amazon EC2 instance by specifying the instance ID dimension when doing a search.


Dimension Combinations

Amazon CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics use the same metric name. It is not possible to retrieve statistics using combinations of dimensions that have not been specifically published. When retrieving statistics, specify the same values for the namespace, metric name, and dimension parameters that were used when the metrics were created. The start and end times can be specified for Amazon CloudWatch to use for aggregation.

To illustrate, here are four distinct metrics named ServerStats in the DataCenterMetric namespace that have the following properties:

Dimensions: Server=Prod, Domain=Titusville, Unit: Count, Timestamp: 2017-05-18T12:30:00Z, Value: 105
 
Dimensions: Server=Test, Domain=Titusville, Unit: Count, Timestamp: 2017-05-18T12:31:00Z, Value: 115
 
Dimensions: Server=Prod, Domain=Rockets, Unit: Count, Timestamp: 2017-05-18T12:32:00Z, Value: 95
 
Dimensions: Server=Test, Domain=Rockets, Unit: Count, Timestamp: 2017-05-18T12:33:00Z, Value: 97

If those four metrics are the only ones that have been published, statistics can be retrieved for these combinations of dimensions:

  • Server=Prod,Domain=Titusville
  • Server=Prod,Domain=Rockets
  • Server=Test,Domain=Titusville
  • Server=Test,Domain=Rockets

It is not possible to retrieve statistics for the following dimensions, because those combinations were never published as metrics. Statistics can only be retrieved by specifying the same dimension combinations that were used when the data was published (see the sketch after this list):

  • Server=Prod
  • Server=Test
  • Domain=Titusville
  • Domain=Rockets
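
The following boto3 sketch mirrors the ServerStats example above: it publishes a data point with both the Server and Domain dimensions and then retrieves statistics by specifying that exact combination. A query on either dimension alone would return no data points.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a data point carrying both the Server and Domain dimensions.
cloudwatch.put_metric_data(
    Namespace="DataCenterMetric",
    MetricData=[{
        "MetricName": "ServerStats",
        "Dimensions": [
            {"Name": "Server", "Value": "Prod"},
            {"Name": "Domain", "Value": "Titusville"},
        ],
        "Value": 105,
        "Unit": "Count",
    }],
)

# Retrieve statistics using the same dimension combination.
stats = cloudwatch.get_metric_statistics(
    Namespace="DataCenterMetric",
    MetricName="ServerStats",
    Dimensions=[
        {"Name": "Server", "Value": "Prod"},
        {"Name": "Domain", "Value": "Titusville"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum", "SampleCount"],
)
print(stats["Datapoints"])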


Statistics

Statistics are metric data aggregations over specified periods of time. Amazon CloudWatch provides statistics based on the metric data points provided by custom data or by other services in AWS to Amazon CloudWatch. Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure within the time period specified. Available CloudWatch statistics are provided in Table 9.2.

TABLE 9.2 Available CloudWatch Statistics

Statistic     Description
Minimum       The lowest value observed during the specified period. Use this value to determine low volumes of activity for an application.
Maximum       The highest value observed during the specified period. Use this value to determine high volumes of activity for an application.
Sum           All values submitted for the matching metric added together. This statistic can be useful for determining the total volume of a metric.
Average       The value of Sum/SampleCount during the specified period. By comparing this statistic with Minimum and Maximum, the full scope of a metric can be determined, and it is possible to discover how close average use is to the Minimum and Maximum.
SampleCount   The number of data points used for the statistical calculation.
pNN.NN        The value of the specified percentile. You can specify any percentile, using up to two decimal places.


Pre-calculated statistics can be added to Amazon CloudWatch. Instead of data point values, specify values for SampleCount, Minimum, Maximum, and Sum. Amazon CloudWatch calculates the average for you. Values added in this way are aggregated with any other values associated with the matching metric.
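
A minimal sketch of publishing such a pre-calculated statistic set with boto3 follows; the namespace, metric name, and values are invented for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish an aggregate of many observations as a single statistic set.
# Amazon CloudWatch derives the Average from Sum / SampleCount.
cloudwatch.put_metric_data(
    Namespace="MyApp/Performance",     # custom namespace (assumption)
    MetricData=[{
        "MetricName": "RequestLatency",
        "StatisticValues": {
            "SampleCount": 250,        # number of observations aggregated
            "Sum": 212.5,
            "Minimum": 0.2,
            "Maximum": 3.1,
        },
        "Unit": "Seconds",
    }],
)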

Units

Each statistic has a unit of measure. Example units include bytes, seconds, count, and percent.

A unit can be specified when creating a custom metric. If one is not specified, Amazon CloudWatch uses None as the unit. Units provide conceptual meaning to data.


Metric data points that specify a unit of measure are aggregated separately. When getting statistics without specifying a unit, Amazon CloudWatch aggregates all data points of the same unit together. If there are two identical metrics that have different units, two separate data streams are returned—one for each unit.

Periods

A period is the length of time associated with a specific Amazon CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time. Although periods are expressed in seconds, the minimum granularity for a period is one minute.

Because of this minimum granularity, period values are expressed as multiples of 60. By varying the length of the period, the data aggregation can be adjusted.


When retrieving statistics, specify a period, start time, and an end time. These parameters determine the overall length of time associated with the collected statistic.

Default values for the start time and end time return statistics from the past hour.

The values specified for the start time and end time determine how many periods Amazon CloudWatch will return.


Periods are also important for Amazon CloudWatch Alarms. When creating an alarm to monitor a specific metric, Amazon CloudWatch will compare that metric to a specified threshold value. Customers have extensive control over how Amazon CloudWatch makes comparisons. In addition to the period length, the number of evaluation periods can be specified as well. For example, if three evaluation periods are specified, Amazon CloudWatch compares a window of three data points. Amazon CloudWatch only sends a notification if the oldest data point is breaching and the others are breaching or are missing.


Aggregation

Amazon CloudWatch aggregates statistics according to the period length specified when retrieving statistics. When multiple data points are published with the same or similar timestamps, Amazon CloudWatch aggregates them by period length.


Data points for a metric that share the same timestamp, namespace, and dimension can be published, and Amazon CloudWatch will return aggregated statistics for them. It is also possible to publish multiple data points for the same or different metrics with any timestamp.

For large datasets, a pre-aggregated dataset called a statistic set can be inserted. With statistic sets, Amazon CloudWatch is given the Minimum, Maximum, Sum, and SampleCount for a number of data points. This approach is commonly used for data that needs to be collected many times in a minute.


Amazon CloudWatch doesn’t differentiate the source of a metric. If a metric is published with the same namespace and dimensions from different sources, Amazon CloudWatch treats it as a single metric. This can be useful for service metrics in a distributed, scaled system.


Dashboards

Amazon CloudWatch dashboards are customizable pages in the Amazon CloudWatch console that can be used to monitor resources in a single view. Monitored resources can be in a single region or multiple regions. Use Amazon CloudWatch dashboards to create customized views of the metrics and alarms for AWS resources.

With dashboards, it is possible to create the following:

  • A single view for selected metrics and alarms to help assess the health of resources and applications across one or more regions
  • An operational playbook that provides guidance for team members during operational events about how to respond to specific incidents
  • A common view of critical resource and application measurements that can be shared by team members for faster communication flow during operational events
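
Dashboards can also be created or updated programmatically with the PutDashboard API. The sketch below builds a single-widget dashboard that graphs CPUUtilization for one instance; the dashboard name, instance ID, and region are placeholders, and the widget definition is kept deliberately minimal.

import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# One metric widget graphing CPU utilization for a single instance.
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "metrics": [
                    ["AWS/EC2", "CPUUtilization", "InstanceId", "i-1234567890abcdef0"]
                ],
                "period": 300,
                "stat": "Average",
                "region": "us-east-1",
                "title": "Web server CPU",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="OperationsOverview",          # placeholder name
    DashboardBody=json.dumps(dashboard_body),
)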

Percentiles

A percentile indicates the relative standing of a value in a dataset and is often used to isolate anomalies. For example, the 95th percentile means that 95 percent of the data is below this value and 5 percent of the data is above this value. Percentiles help in a better understanding of the distribution of metric data.

Percentiles can be used with the following services:

  • Amazon EC2
  • Amazon RDS
  • Amazon Kinesis
  • Application Load Balancer
  • Elastic Load Balancing
  • Amazon API Gateway
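
When retrieving statistics with the GetMetricStatistics API, percentiles are requested through the ExtendedStatistics parameter rather than the Statistics parameter. A minimal sketch, assuming a Classic Load Balancer named my-load-balancer, follows.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Request the 95th and 99th percentile of ELB latency for the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    ExtendedStatistics=["p95", "p99"],
)

for point in response["Datapoints"]:
    print(point["Timestamp"], point["ExtendedStatistics"])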

 


Monitoring Baselines

Monitoring is an important part of maintaining the reliability, availability, and performance of Amazon EC2 instances. Collect monitoring data from all of the parts of an AWS solution to be able to debug any multi-point failures easily.

In order to monitor an environment effectively, have a plan that answers the following questions:

  • What are the goals for monitoring?
  • What resources need to be monitored?
  • How often will these resources be monitored?
  • What monitoring tools will be used?
  • Who will perform the monitoring tasks?
  • Who should be notified when something goes wrong?

After the monitoring goals have been defined and the monitoring plan has been created, the next step is to establish a baseline for normal Amazon EC2 performance.

Measure Amazon EC2 performance at various times and under different load conditions. While monitoring Amazon EC2, store a history of collected monitoring data. Over time, the historical data can be compared to current data to identify normal performance patterns and also anomalies.

For example, if CPU utilization, disk I/O, and network utilization are being monitored for Amazon EC2 instances and performance falls outside of the established baseline, reconfigure or optimize the instance to reduce CPU utilization, improve disk I/O, or reduce network traffic (see Table 9.3).

Amazon EC2 Status Checks

Inside Amazon EC2, there are two types of status checks: a system status check and an instance status check.

TABLE 9.3 Establishing an Amazon EC2 Baseline

Item to Monitor                                  Amazon EC2 Metric
CPU utilization                                  CPUUtilization
Memory utilization                               Requires an agent
Memory used                                      Requires an agent
Memory available                                 Requires an agent
Network utilization                              NetworkIn, NetworkOut
Disk performance                                 DiskReadOps, DiskWriteOps
Swap utilization (Linux instances)               Requires an agent
Swap used (Linux instances)                      Requires an agent
Page File utilization (Windows instances only)   Requires an agent
Page File used (Windows instances only)          Requires an agent
Page File available (Windows instances only)     Requires an agent
Disk Reads/Writes                                DiskReadBytes, DiskWriteBytes
Disk Space utilization (Linux instances)         Requires an agent
Disk Space used (Linux instances)                Requires an agent
Disk Space available (Linux instances only)      Requires an agent

System Status Checks

System status checks monitor the AWS systems required to use your instance in order to ensure that they are working properly. These checks detect problems on the hardware that an instance is using.

When a system status check fails, there are three possible courses of action. One option is to wait for AWS to fix the issue. If the instance boots from an Amazon EBS volume, stopping and starting the instance will move it to new hardware. If the instance uses an instance store volume, terminate it and launch a replacement to put the workload on new hardware.

The following are examples of problems that can cause system status checks to fail:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host that impact network reachability

Instance Status Checks

Instance status checks monitor the software and network configuration of individual instances. These checks detect problems that require user involvement to repair. When an instance status check fails, it can often be fixed with a reboot or by reconfiguring the Amazon EC2 instance.

The following are examples of problems that can cause instance status checks to fail:

  • Failed system status checks
  • Incorrect networking or startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel
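
Both system and instance status checks can also be read programmatically. The following boto3 sketch prints the current check results for a single instance; the instance ID is a placeholder.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# IncludeAllInstances=True also returns instances that are not running.
response = ec2.describe_instance_status(
    InstanceIds=["i-1234567890abcdef0"],   # placeholder
    IncludeAllInstances=True,
)

for status in response["InstanceStatuses"]:
    print("Instance:", status["InstanceId"])
    print("  System status:  ", status["SystemStatus"]["Status"])
    print("  Instance status:", status["InstanceStatus"]["Status"])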

 


Authentication and Access Control

Access to Amazon CloudWatch requires credentials. Those credentials must have permissions to access AWS resources, such as viewing the Amazon CloudWatch console or retrieving Amazon CloudWatch metric data.

Every AWS resource is owned by an AWS account, and permissions to create or access a resource are governed by permissions policies. An account administrator can attach permissions policies to IAM identities: users, groups, and roles.

Permissions Required to Use the Amazon CloudWatch Console

For a user to work with the Amazon CloudWatch console, there is a minimum set of permissions required to allow that user to describe other AWS resources in the AWS account.

The Amazon CloudWatch management console requires permissions from the following services:

  • Auto Scaling
  • AWS CloudTrail
  • Amazon CloudWatch
  • Amazon CloudWatch Events
  • Amazon CloudWatch Logs
  • Amazon EC2
  • Amazon Elasticsearch Service
  • IAM
  • Amazon Kinesis
  • AWS Lambda
  • Amazon S3
  • Amazon SNS
  • Amazon SQS
  • Amazon Simple Workflow Service (Amazon SWF)

 


AWS Managed Policies for Amazon CloudWatch

AWS provides standalone IAM policies that cover many common use cases. These policies are created and administered by AWS and grant the required permissions for services without customers having to create and maintain their own.


Customers create their own IAM policies to allow permissions for Amazon CloudWatch actions and resources. Attach these custom policies to the IAM users or groups that require those permissions.


Amazon CloudWatch Resources and Operations

Amazon CloudWatch has no specific resources that can be controlled. As a result, there are no Amazon CloudWatch Amazon Resource Names (ARNs) to use in an IAM policy.

For example, it is not possible to give a user access to Amazon CloudWatch data for a specific set of Amazon EC2 instances or a specific load balancer.


When writing an IAM policy, use an asterisk (*) as the resource name to control access to Amazon CloudWatch actions.
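
A minimal sketch of such a policy, created with boto3, follows. It grants a small set of read-only CloudWatch actions against all resources; the policy name and the specific actions chosen are assumptions for illustration.

import json
import boto3

iam = boto3.client("iam")

# Because CloudWatch exposes no ARNs of its own, the Resource element is "*".
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics",
                "cloudwatch:DescribeAlarms",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="CloudWatchReadOnlyMetrics",          # placeholder name
    PolicyDocument=json.dumps(policy_document),
)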

AWS Cloud Services Integration

The following AWS Cloud services support Amazon CloudWatch without additional charges. Customers have the option to choose which of the preselected metrics they want to use.

Auto Scaling Groups Seven preselected metrics at a one-minute frequency

Elastic Load Balancing Thirteen preselected metrics at a one-minute frequency

Amazon Route 53 health checks One preselected metric at a one-minute frequency

Amazon EBS Provisioned IOPS (Solid State Drive [SSD]) volumes Ten preselected metrics at a one-minute frequency

Amazon EBS General Purpose (SSD) volumes Ten preselected metrics at a one-minute frequency

Amazon EBS Magnetic volumes Eight preselected metrics at a five-minute frequency

AWS Storage Gateway Eleven preselected gateway metrics and five preselected storage volume metrics at a five-minute frequency

Amazon CloudFront Six preselected metrics at a one-minute frequency

Amazon DynamoDB tables Seven preselected metrics at a five-minute frequency

Amazon ElastiCache nodes Thirty-nine preselected metrics at a one-minute frequency

Amazon RDS DB instances Fourteen preselected metrics at a one-minute frequency

Amazon EMR job flows Twenty-six preselected metrics at a five-minute frequency

Amazon Redshift Sixteen preselected metrics at a one-minute frequency

Amazon SNS topics Four preselected metrics at a five-minute frequency

Amazon SQS queues Eight preselected metrics at a five-minute frequency

AWS OpsWorks Fifteen preselected metrics at a one-minute frequency

Amazon CloudWatch Logs Six preselected metrics at one-minute frequency

Estimated charges on your AWS bill It is also possible to enable metrics to monitor AWS charges. The number of metrics depends on the AWS products and services used. These metrics are offered at no additional charge.

Amazon CloudWatch Limits

Table 9.4 lists Amazon CloudWatch limits.

TABLE 9.4 Amazon CloudWatch Limits

Resource                          Default Limit
Actions                           5/alarm. This limit cannot be changed.
Alarms                            10/month/customer for no additional charge; 5,000 per region per account.
API requests                      1,000,000/month/customer for no additional charge.
Custom metrics                    No limit.
DescribeAlarms                    3 Transactions per Second (TPS). This is the maximum number of operation requests that can be made per second without being throttled. A limit increase can be requested.
Dimensions                        10/metric. This limit cannot be changed.
GetMetricStatistics               400 TPS. This is the maximum number of operation requests per second before being throttled. A limit increase can be requested.
ListMetrics                       25 TPS. This is the maximum number of operation requests per second before being throttled. A limit increase can be requested.
Metric data                       15 months. This limit cannot be changed.
MetricDatum items                 20/PutMetricData request. A MetricDatum object can contain a single value or a StatisticSet object representing many values. This limit cannot be changed.
Metrics                           10/month/customer for no additional charge.
Period                            One day (86,400 seconds). This limit cannot be changed.
PutMetricAlarm request            3 TPS. The maximum number of operation requests you can make per second without being throttled. A limit increase can be requested.
PutMetricData request             40 KB for HTTP POST requests; 150 TPS. The maximum number of operation requests that you can make per second without being throttled. A limit increase can be requested.
Amazon SNS email notifications    1,000/month/customer for no additional charge.

Amazon CloudWatch Alarms

Amazon CloudWatch Alarms are used to initiate an action automatically in response to a predefined condition. An alarm watches a single metric over a specified time period and, based on the value of that metric relative to a threshold over time, performs one or more specified actions. Those actions include executing an Amazon EC2 action, triggering an Auto Scaling policy, and publishing a notification to an Amazon SNS topic.

Alarms only trigger actions after sustained state changes. Amazon CloudWatch Alarms are not generated simply because a metric is in a particular state. The state must change and be maintained for a user-specified number of periods.


An Amazon CloudWatch Alarm is always in one of three states: OK, ALARM, or INSUFFICIENT_DATA.

  • When the monitored metric is within the range that has been defined as acceptable, it is in the OK state.
  • When a metric breaches a user-defined threshold, it transitions to the ALARM state.
  • If the data needed to make a decision is missing or incomplete, it is in the INSUFFICIENT_DATA state.

Actions are set to respond to the alarm as it transitions into each of the three states. Actions only happen on state transitions and will not be re-executed if the condition persists.

Multiple actions are allowed for an alarm. If an alarm is triggered, Amazon SNS could be used to send a notification email, while at the same time an Auto Scaling policy is updated.
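
A minimal boto3 sketch of such an alarm follows. It watches average CPUUtilization on one instance and publishes to an Amazon SNS topic when the threshold is breached for three consecutive five-minute periods; the instance ID and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="HighCPUUtilization",
    AlarmDescription="Average CPU above 80% for 15 minutes",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,                 # five-minute periods
    EvaluationPeriods=3,        # three consecutive periods
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        "arn:aws:sns:us-east-1:111122223333:ops-alerts"   # placeholder topic ARN
    ],
)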

Alarms and Thresholds

Alarms are designed to be triggered when three things happen:

  • A monitored metric has reached a particular value or is within a defined range.
  • The metric's value stays at that value or within the specified range for a number of consecutive data points.
  • This condition persists for a user-defined number of periods.

In short, an Amazon CloudWatch Alarm is triggered when a monitored metric reaches a particular value, is reported at that value multiple times in a row, and stays at that value for a user-defined number of periods.


In Figure 9.2, the alarm threshold is set to three units and the alarm is evaluated over three periods. The alarm goes to ALARM state if the oldest of the three periods evaluated has matched the alarm criteria and the next two periods have met the criteria or are missing.

In the figure, this happens with the third through fifth time periods when the alarm’s state is set to ALARM. At period six, the value drops below the threshold and the state reverts to OK.

Later, during the ninth time period, the threshold is breached again, but for only one period. Because of this, the alarm state remains OK. Figure 9.2 shows this as a graph.


FIGURE 9.2 A threshold breach without a change in alarm state


Alarms can also be added to dashboards. When an alarm is on a dashboard, it turns red when it is in the ALARM state, making it easier to monitor its status.


Missing Data Points

Similar to how each alarm is always in one of three states, each specific data point reported to Amazon CloudWatch falls under one of three categories:

  • Good: Within the threshold
  • Bad: Violating the threshold
  • Missing: Data for the metric is not available, or not enough data is available for the metric to determine the alarm state.

Customers can specify how alarms handle missing data points. They can be treated as:

  • Missing: The alarm looks back further in time to find additional data points.
  • Good: Treated as a data point that is within the threshold
  • Bad: Treated as a data point that is breaching the threshold
  • Ignored: The current alarm state is maintained.

The best choice of how to treat missing data points depends on the type of metric. For a metric that continually reports data, such as CPUUtilization of an instance, it might be best to treat missing data points as bad because their absence indicates something is wrong. For a metric that generates data points only when an error occurs, such as ThrottledRequests in Amazon DynamoDB, missing data points should be treated as good.

Choosing the best option for an alarm prevents unnecessary and misleading alarm condition changes and more accurately indicates the health of a system.
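
The missing-data behavior is configured per alarm through the TreatMissingData setting. The following sketch treats missing data as breaching, which suits a continuously reporting metric such as CPUUtilization; for an error-count style metric, notBreaching would usually be the better choice. The instance ID and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# For a metric that reports continuously, treat missing data as breaching.
cloudwatch.put_metric_alarm(
    AlarmName="HighCPUUtilizationStrict",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # or "notBreaching", "ignore", "missing"
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder
)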

Common Amazon CloudWatch Metrics

There are hundreds of metrics available for monitoring on AWS. The common ones are listed here, broken down by service, with a brief explanation. As we have mentioned throughout, this book is designed to do more than just prepare you for an exam—it should serve you well as a day-to-day guide in working with AWS.

Amazon EC2

There are two types of EC2 status checks: system status checks and instance status checks.

System Status Checks

System Status Checks monitor AWS hardware to ensure that instances are working properly. These checks detect problems with an instance that requires AWS involvement to repair. When a system status check fails, customers can choose to wait for AWS to fix the issue, or they can resolve it by either stopping and starting an instance or by terminating and replacing it.


The following are examples of problems that can cause system status checks to fail:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host that impact network reachability

Instance Status Checks

Instance Status Checks monitor the software and network configuration of an individual instance. These checks detect problems that require customer involvement to repair. When an instance status check fails, typical solutions include rebooting or reconfiguring the instance.

The following are examples of problems that can cause instance status checks to fail:

  • Failed system status checks
  • Incorrect networking or startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel

The following Amazon CloudWatch metrics offer insight into the usage and utilization of Amazon EC2 instances. For Amazon EC2, common metrics include the following:

  • CPUUtilization
  • NetworkIn
  • NetworkOut
  • DiskReadOps
  • DiskWriteOps
  • DiskReadBytes
  • DiskWriteBytes

CPUUtilization This metric is the percentage of allocated Amazon EC2 compute units that are currently in use on an instance. This metric identifies the processing power required to run an application on a selected instance.

NetworkIn This metric represents the number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a single instance.

Similar to NetworkOut, the number reported is the number of bytes received during the period. When using Basic Monitoring, divide this number by 300 to find bytes/second. With Detailed Monitoring, divide it by 60.

NetworkOut This metric is the number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.

Similar to NetworkIn, the number reported is the number of bytes sent during the period. When using Basic Monitoring, divide this number by 300 to find bytes/second. With Detailed Monitoring, divide it by 60.

DiskReadOps This metric reports the completed read operations from all instance store volumes available to the instance in a specified period of time.

To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period.

DiskWriteOps This metric reports the completed write operations to all instance store volumes available to the instance in a specified period of time.

To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period.

DiskReadBytes This metric reports the number of bytes read from all instance store volumes available to the instance.

This metric is used to determine the volume of the data the application reads from the hard disk of the instance. It can be used to determine the speed of the application.

The number reported is the number of bytes read during the period. When using Basic Monitoring, divide this number by 300 to find bytes/second. With Detailed Monitoring, divide it by 60.

DiskWriteBytes This metric reports the number of bytes written to all instance store volumes available to the instance.

The metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application.

The number reported is the number of bytes written during the period. When using Basic Monitoring, divide this number by 300 to find bytes/second. With Detailed Monitoring, divide it by 60.

Amazon Elastic Block Store Volume Monitoring

Amazon Elastic Block Store (Amazon EBS) sends data points to Amazon CloudWatch for several metrics. Amazon EBS General Purpose SSD (gp2), Throughput Optimized HDD (st1), Cold HDD (sc1), and Magnetic (standard) volumes automatically send five-minute metrics to CloudWatch. Provisioned IOPS SSD (io1) volumes automatically send one-minute metrics to CloudWatch.

Common Amazon EBS metrics include the following:

  • VolumeReadBytes
  • VolumeWriteBytes
  • VolumeReadOps
  • VolumeWriteOps
  • VolumeTotalReadTime
  • VolumeTotalWriteTime
  • VolumeIdleTime
  • VolumeQueueLength
  • VolumeThroughputPercentage
  • VolumeConsumedReadWriteOps
  • BurstBalance

VolumeReadBytes and VolumeWriteBytes These metrics provide information on the I/O operations in a specified period of time. The Sum statistic reports the total number of bytes transferred during the period. The Average statistic reports the average size of each I/O operation during the period. The SampleCount statistic reports the total number of I/O operations during the period. The Minimum and Maximum statistics are not relevant for this metric.


VolumeReadOps and VolumeWriteOps These metrics report the total number of I/O operations in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period.

VolumeTotalReadTime and VolumeTotalWriteTime These metrics report the total number of seconds spent by all operations that completed in a specified period of time. If multiple requests are submitted at the same time, this total could be greater than the length of the period.

For example, for a period of 5 minutes (300 seconds): If 700 operations completed during that period, and each operation took 1 second, the value would be 700 seconds.

VolumeIdleTime This metric represents the total number of seconds in a specified period of time when no read or write operations were submitted.

VolumeQueueLength This metric is the number of read and write operation requests waiting to be completed in a specified period of time.


VolumeThroughputPercentage This metric is used with Provisioned IOPS SSD volumes only. It is the percentage of I/O operations per second (IOPS) delivered of the total IOPS provisioned for an Amazon EBS volume. Provisioned IOPS SSD volumes deliver within 10 percent of the provisioned IOPS performance 99.9 percent of the time over a given year.


VolumeConsumedReadWriteOps This metric is used with Provisioned IOPS SSD volumes only. It reports the total amount of read and write operations (normalized to 256K capacity units) consumed in a specified period of time.

I/O operations that are smaller than 256K each count as 1 consumed IOPS. I/O operations that are larger than 256K are counted in 256K capacity units. For example, a 1024K I/O would count as 4 consumed IOPS.

BurstBalance This metric is only used with General Purpose SSD (gp2), Throughput Optimized HDD (st1), and Cold HDD (sc1) volumes. It provides information about the percentage of I/O credits (for gp2) or throughput credits (for st1 and sc1) remaining in the burst bucket.

Data is reported to Amazon CloudWatch only when the volume is active. If the volume is not attached, no data is reported.

Amazon EBS Status Checks

Volume status checks help customers understand, track, and manage potential inconsistencies in the data on an Amazon EBS volume. They are designed to provide you with the information needed to determine if an Amazon EBS volume is impaired and to help customers control how a potentially inconsistent volume is handled.

Volume status checks are automated tests that run every five minutes and return a pass or fail status. If all checks pass, the status of the volume is ok. If a check fails, the status of the volume is impaired. If the status is insufficient-data, the checks may still be in progress on the volume.

There are four status types for Provisioned IOPS EBS Volumes: ok, warning, impaired, and insufficient-data.

ok This status means that the volume is performing as expected.

warning This status means that the volume is either Degraded or Severely Degraded.

Degraded means that the volume performance is below expectations. Severely Degraded means that the volume performance is well below expectations.

impaired Impaired means that a volume has either Stalled or is Not Available. Stalled means that the volume performance is severely impacted. Not Available means that it is unable to determine I/O performance because I/O is disabled.

insufficient-data Insufficient-data means that there have not been enough data points collected but that it is online.

Amazon ElastiCache

The following Amazon CloudWatch metrics offer insight into Amazon ElastiCache performance. In most cases, the recommendation is to set CloudWatch Alarms for these metrics to be able to take corrective action before performance issues occur. For Amazon ElastiCache, common metrics include the following:

  • CPUUtilization
  • SwapUsage
  • Evictions
  • CurrConnections

CPUUtilization This is a host-level metric reported as a percent.

  • Memcached Because Memcached is multi-threaded, this metric can be as high as 90 percent. If this threshold is exceeded, scale the cache cluster up by using a larger cache node type, or scale out by adding more cache nodes.
  • Redis Because Redis is single-threaded, the threshold is calculated as utilization/number of processor cores. For example, when using a cache.m1.xlarge node that has four cores, the threshold for a 90 percent CPUUtilization would be 90/4, or 22.5 percent.

Administrators have to determine their own threshold value based on the number of cores in the cache node being used.

If this threshold is exceeded and the main workload is from read requests, scale the cache cluster out by adding read replicas. If the main workload is from write requests, AWS recommends scaling up by using a larger cache instance type.

SwapUsage This is a host-level metric reported in bytes.

  • Memcached This metric should not exceed 50 MB. If it does, AWS recommends that the ConnectionOverhead parameter value be increased.
  • Redis At this time, AWS has no recommendation for this parameter; there is not a need to set an Amazon CloudWatch Alarm for it.

Evictions This is a metric published for both Memcached and Redis cache clusters. AWS recommends customers determine their own alarm thresholds for this metric based on application needs.

  • Memcached If the chosen threshold is exceeded, scale the cache cluster up by using a larger node type or scale out by adding more nodes.
  • Redis If you exceed your chosen threshold, scale your cluster up by using a larger node type.

CurrConnections This is a cache engine metric, published for both Memcached and Redis cache clusters. AWS recommends customers determine their own alarm thresholds for this metric based on application needs.

Whether running Memcached or Redis, an increasing number of CurrConnections might indicate a problem with an application.

Amazon RDS Metrics

When using Amazon RDS resources, Amazon RDS sends metrics and dimensions to Amazon CloudWatch every minute. Common metrics include the following:

  • DatabaseConnections
  • DiskQueueDepth
  • FreeStorageSpace
  • ReplicaLag
  • ReadIOPS
  • WriteIOPS
  • ReadLatency
  • WriteLatency

DatabaseConnections This metric is a count of the number of database connections in use.

DiskQueueDepth This metric is a count of the number of outstanding I/O operations waiting to access the disk.

FreeStorageSpace This metric, measured in bytes, is the amount of available storage space.


ReplicaLag This metric, measured in seconds, is the amount of time a Read Replica DB instance lags behind the source DB instance. It applies to MySQL, MariaDB, and PostgreSQL Read Replicas.

ReadIOPS This metric is the average number of disk read I/O operations per second.

WriteIOPS This metric is the average number of disk write I/O operations per second.

ReadLatency This metric, measured in seconds, is the average amount of time taken per disk read I/O operation.

WriteLatency This metric, measured in seconds, is the average amount of time taken per disk write I/O operation.

AWS Elastic Load Balancer

AWS ELB reports metrics to Amazon CloudWatch only when requests are flowing through the load balancer. When there are requests flowing through the load balancer, Elastic Load Balancing measures and sends its metrics in 60-second intervals. If there are no requests flowing through the load balancer or no data for a metric, the metric is not reported.

Common metrics reported to Amazon CloudWatch from an ELB include the following:

  • BackendConnectionErrors
  • HealthyHostCount
  • UnHealthyHostCount
  • RequestCount
  • Latency
  • HTTPCode_Backend_2XX
  • HTTPCode_Backend_3XX
  • HTTPCode_Backend_4XX
  • HTTPCode_Backend_5XX
  • HTTPCode_ELB_4XX
  • HTTPCode_ELB_5XX
  • SpilloverCount
  • SurgeQueueLength

BackendConnectionErrors This metric is the number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. This count also includes any connection errors related to health checks.

HealthyHostCount This metric is the number of healthy instances registered with a load balancer. A newly registered instance is considered healthy after it passes the first health check.

If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone.

UnHealthyHostCount This metric is the number of unhealthy instances registered with a load balancer. An instance is considered unhealthy after it exceeds the unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy threshold configured for health checks.

RequestCount This metric is the number of requests completed or connections made during the specified interval, which is either one or five minutes.

  • HTTP listener The number of requests received and routed, including HTTP error responses from the registered instances
  • TCP listener The number of connections made to the registered instances

Latency This metric represents the elapsed time, in seconds, between when a request has been sent to an instance and the reply received.

  • HTTP listener The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the instance started to send the response headers
  • TCP listener The total time elapsed, in seconds, for the load balancer to establish a connection to a registered instance successfully.

HTTPCode_Backend_2XX This metric is the number of HTTP 2XX response codes generated by registered instances. 2XX status codes report success. The action was successfully received, understood, and accepted. This count does not include any response codes generated by the load balancer.

HTTPCode_Backend_3XX This metric is the number of HTTP 3XX response codes generated by registered instances. 3XX status codes report redirection. Further action must be taken in order to complete the request.

HTTPCode_Backend_4XX This metric is the number of HTTP 4XX response codes generated by registered instances. 4XX status codes report client errors. The request contains bad syntax or cannot be fulfilled.

HTTPCode_Backend_5XX This metric is the number of HTTP 5XX response codes generated by registered instances. 5XX status codes report server errors. The server failed to fulfill an apparently valid request.

HTTPCode_ELB_4XX This metric is the number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a request is malformed or incomplete. This error is generated by the ELB.

HTTPCode_ELB_5XX This metric is the number of HTTP 5XX server error codes generated by the load balancer. This count does not include any response codes generated by the registered instances.

The metric is reported if there are no healthy instances registered to the load balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer. This error is generated by the ELB.

SpilloverCount This metric is the total number of requests that were rejected because the surge queue is full.

  • HTTP listener The load balancer returns an HTTP 503 error code.
  • TCP listener The load balancer closes the connection.

SurgeQueueLength This metric is the total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection with a healthy instance in order to route the request.

The maximum size of the queue is 1,024. Additional requests are rejected when the queue is full.


Amazon CloudWatch Events

Amazon CloudWatch Events delivers a near real-time stream of system events that describes changes in AWS resources. Using relatively simple rules, it is possible to route events to one or more targets for processing.

Think of Amazon CloudWatch Events as the central nervous system for an AWS environment. It is connected to supported AWS Cloud services, and it becomes aware of operational changes as they happen. Then, driven by rules, it sends messages and activates functions in response to the changes.

Events

An event is a change in an AWS environment, and it can be generated in four different ways:

  • They arise from within AWS when resources change state, like when an Amazon EC2 instance state changes from pending to running.
  • API calls and console sign-ins can generate events and deliver them to Amazon CloudWatch Events via AWS CloudTrail.
  • Code can be run to generate application-level events and publish them to Amazon CloudWatch Events for processing.
  • They can be issued on a scheduled basis, with options for periodic or Cron-style scheduling.

 


Remember, an event indicates there has been a change in an AWS environment. AWS resources can generate events when their state changes. AWS CloudTrail publishes events from API calls. Custom application-level events can be created and published to Amazon CloudWatch Events. Scheduled events are generated on a periodic basis.


Amazon CloudWatch Events can be used to schedule actions that trigger at certain times using cron or rate expressions. All scheduled events use the Universal Time (UTC) time zone and a minimum precision of one minute.

Rules

A rule matches incoming events and routes them to targets for processing. A single rule can route to multiple targets, all of which are processed in parallel. This enables different parts of an organization to look for and process the events that are of interest to them.

A rule can customize the JSON sent to the target by passing only certain parts or by overwriting it with a constant.
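
A minimal boto3 sketch of a rule follows. It matches Amazon EC2 instance state-change events and routes them to an AWS Lambda function; the function ARN is a placeholder, and the permission that allows Amazon CloudWatch Events to invoke the function is omitted for brevity.

import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Match EC2 instances entering the "running" state.
events.put_rule(
    Name="ec2-running-instances",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["running"]},
    }),
    State="ENABLED",
)

# Route matched events to a Lambda function target.
events.put_targets(
    Rule="ec2-running-instances",
    Targets=[{
        "Id": "notify-lambda",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:on-instance-running",  # placeholder
    }],
)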


Targets

A target processes data in JSON format that has been sent to it from Amazon CloudWatch Events. Amazon CloudWatch Events delivers a near real-time stream of system events to one or more target functions or streams for analysis.

These targets include the following:

  • Amazon EC2 instances
  • AWS Lambda functions
  • Amazon Kinesis Streams
  • Amazon ECS tasks
  • AWS Step Functions state machines
  • Amazon SNS topics
  • Amazon SQS queues
  • Built-in targets

Metrics and Dimensions

Amazon CloudWatch Events sends metrics to Amazon CloudWatch every minute.

The AWS/Events namespace includes the metrics shown in Table 9.5.

TABLE 9.5 Amazon CloudWatch Events Metrics

Metric Description
Invocations
  • Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations, but it does not include throttled or retried attempts until they fail permanently.
  • Amazon CloudWatch Events only sends this metric to Amazon CloudWatch if it has a non-zero value.
  • Valid Dimensions: RuleName
  • Units: Count
FailedInvocations
  • Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt.
  • Valid Dimensions: RuleName
  • Units: Count
TriggeredRules
  • Measures the number of triggered rules that matched with any event.
  • Valid Dimensions: RuleName
  • Units: Count
MatchedEvents
  • Measures the number of events that matched with any rule.
  • Valid Dimensions: None
  • Units: Count
ThrottledRules
  • Measures the number of triggered rules that are being throttled.
  • Valid Dimensions: RuleName
  • Units: Count

Amazon CloudWatch Events metrics use a single dimension: RuleName. As the name implies, it filters available metrics by rule name.

Amazon CloudWatch Logs

Amazon CloudWatch Logs can be used to monitor, store, and access log files from Amazon EC2 instances, AWS CloudTrail, and servers running in an on-premises datacenter. It is then possible to retrieve and report on the associated log data from Amazon CloudWatch Logs.

Amazon CloudWatch Logs can monitor and store application logs, system logs, web server logs, and other custom logs. By setting alarms on these metrics, notifications can be generated for application or web server issues, and the necessary actions can be taken.

Amazon CloudWatch Logs is made up of several components:

  • Log agents
  • Log events
  • Log streams
  • Log groups
  • Metric filters
  • Retention policies

Log Agents A log agent directs logs to Amazon CloudWatch. Under the shared responsibility model, AWS does not have visibility above the Hypervisor, so an agent must send the log data into Amazon CloudWatch.

Log Events A log event is an activity reported to the log file by the operating system or application along with a timestamp. Log events support only text format.

Log events contain two properties: the timestamp of when the event occurred and the raw log message.

By default, any line that begins with a non-whitespace character closes the previous log message and starts a new log message.

Log Streams A log stream is a group of log events reported by a single source, such as a web server.

Log Groups A log group is a group of log streams from multiple resources, such as a group of web servers managing the same content.

Retention policies and metric filters are set on log groups—not log streams.

Metric Filters Metric filters tell Amazon CloudWatch how to extract metric observations from ingested log events and turn them into Amazon CloudWatch metrics.

For example, a metric filter named 404_Error can search log events for 404 access errors. An alarm can then be created to monitor those 404 errors across different servers.
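Such a metric filter can be created from the AWS CLI (a sketch; the log group name, filter pattern, and metric namespace are illustrative and depend on the format of the log being ingested):

  aws logs put-metric-filter --log-group-name apache-access \
      --filter-name 404_Error \
      --filter-pattern '[ip, user, username, timestamp, request, status_code = 404, size]' \
      --metric-transformations metricName=404Count,metricNamespace=LogMetrics,metricValue=1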

Retention Policies Retention policies determine how long events are retained inside Amazon CloudWatch Logs. Policies are assigned to log groups and applied to all of the log streams in the group.

Retention time can be set from 1 day to 10 years. You can also opt for logs to never expire.
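For example, a 30-day retention policy can be applied to a log group with the AWS CLI (the log group name is illustrative):

  aws logs put-retention-policy --log-group-name apache-access --retention-in-days 30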

Archived Data

All log events uploaded to Amazon CloudWatch are retained. It is possible to specify the retention duration. Data is compressed, put into an archive, and stored. Charges are incurred for storage of the archived data.


Log Monitoring

Use Amazon CloudWatch Logs to monitor applications and systems using log data. Amazon CloudWatch Logs can track the number of errors that occur in application logs and send a notification whenever the rate of errors exceeds a threshold.

Because Amazon CloudWatch Logs uses existing log data for monitoring, no code changes are required. Each log event needs a timestamp: the current time is used if the datetime_format isn’t provided. If the provided datetime_format is invalid for a given log message, the timestamp from the last log event with a successfully parsed timestamp is used; if no previous log events exist, the current time is used. A warning message is logged whenever a log event falls back to the current time or the time of a previous log event.


Agents

An agent is required to publish log data to Amazon CloudWatch Logs because AWS has no visibility above the Hypervisor. There are agents available for Linux and Windows.

Agents have the following components:

  • A plugin to the AWS CLI that pushes log data to Amazon CloudWatch Logs
  • A script that runs the Amazon CloudWatch Logs aws logs push command to send data to Amazon CloudWatch Logs
  • A cron job that ensures that the daemon is always running

Amazon CloudWatch Logs Agent for Linux

The Amazon CloudWatch Logs agent requires Python version 2.6, 2.7, 3.0, or 3.3 and any of the following versions of Linux:

  • Amazon Linux version 2014.03.02 or later
  • Ubuntu Server version 12.04, 14.04, or 16.04
  • CentOS version 6, 6.3, 6.4, 6.5, or 7.0
  • Red Hat Enterprise Linux (RHEL) version 6.5 or 7.0
  • Debian 8.0

Amazon CloudWatch Logs: Agents and IAM

The Amazon CloudWatch Logs agent requires the CreateLogGroup, CreateLogStream, DescribeLogStreams, and PutLogEvents operations.


Here is a sample IAM policy for using an agent:
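(A minimal policy granting the four operations listed above; the wildcard Resource is illustrative and can be scoped to specific log groups.)

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:DescribeLogStreams",
          "logs:PutLogEvents"
        ],
        "Resource": "arn:aws:logs:*:*:*"
      }
    ]
  }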



Amazon CloudWatch Logs Agent for Windows

Starting with EC2Config version 2.2.5, it is possible to export all Windows Server log messages from the system log, security log, application log, and Internet Information Services (IIS) log and send them to Amazon CloudWatch Logs. EC2Config version 2.2.10 or later adds the ability to export any event log data, event tracing for Windows data, or text-based log files to Amazon CloudWatch Logs. Windows performance counter data can also be exported to Amazon CloudWatch.


TABLE 9.6 Microsoft Windows Agents

Operating System Agent Notes
Windows Server 2016 SSM Agent The EC2Config service is not supported on Windows Server 2016.
Windows Server 2008-2012 R2 EC2Config or SSM Agent If an instance is running EC2Config version 3.x or earlier, then the EC2Config service sends log data to Amazon CloudWatch. If an instance is running EC2Config version 4.x or later, then SSM Agent sends log data to Amazon CloudWatch.


It is also possible to install Amazon CloudWatch Logs agents and create log streams using AWS OpsWorks and Chef. Chef is a third-party systems and cloud infrastructure automation tool. Chef uses “recipes” to install and configure software and “cookbooks,” which are collections of recipes, to perform configuration and policy distribution tasks.


Searching and Filtering Log Data

After an agent begins publishing logs to Amazon CloudWatch, it is possible both to search for and filter log data by creating one or more filters. Metric filters define the terms and patterns to look for in log data as it is sent to Amazon CloudWatch Logs.

Amazon CloudWatch Logs uses these metric filters to turn log data into Amazon CloudWatch metrics that can be graphed or used to set an alarm condition.



Amazon CloudWatch Logs Metrics and Dimensions

Amazon CloudWatch Logs sends data to Amazon CloudWatch every minute.

Metrics

The AWS/Logs namespace includes the metrics shown in Table 9.7.

TABLE 9.7 AWS/Logs Namespace Metrics

Metric Description
IncomingBytes The volume of log events in uncompressed bytes uploaded to Amazon CloudWatch Logs. When used with the LogGroupName dimension, this is the volume of log events in uncompressed bytes uploaded to the log group. Valid Dimensions: LogGroupName Valid Statistic: Sum Units: Bytes
IncomingLogEvents The number of log events uploaded to Amazon CloudWatch Logs. When used with the LogGroupName dimension, this is the number of log events uploaded to the log group. Valid Dimensions: LogGroupName Valid Statistic: Sum Units: None
ForwardedBytes The volume of log events in compressed bytes forwarded to the subscription destination. Valid Dimensions: LogGroupName, DestinationType, FilterName Valid Statistic: Sum Units: Bytes
ForwardedLogEvents The number of log events forwarded to the subscription destination. Valid Dimensions: LogGroupName, DestinationType, FilterName Valid Statistic: Sum Units: None
DeliveryErrors The number of log events for which Amazon CloudWatch Logs received an error when forwarding data to the subscription destination. Valid Dimensions: LogGroupName, DestinationType, FilterName Valid Statistic: Sum Units: None
DeliveryThrottling The number of log events for which Amazon CloudWatch Logs was throttled when forwarding data to the subscription destination. Valid Dimensions: LogGroupName, DestinationType, FilterName Valid Statistic: Sum Units: None

Dimensions

Amazon CloudWatch Logs supports the filtering of metrics using the dimensions shown in Table 9.8.

TABLE 9.8 Amazon CloudWatch Logs Dimensions

Dimension Description
LogGroupName The name of the Amazon CloudWatch Logs log group from which to display metrics
DestinationType The subscription destination for the Amazon CloudWatch Logs data, which can be AWS Lambda, Amazon Kinesis Streams, or Amazon Kinesis Firehose
FilterName The name of the subscription filter that is forwarding data from the log group to the destination. The subscription filter name is automatically converted by Amazon CloudWatch to ASCII and any unsupported characters get replaced with a question mark (?).

Monitoring AWS Charges

Customers can monitor AWS costs using Amazon CloudWatch. With Amazon CloudWatch, it is possible to create billing alerts that send notifications when estimated charges for provisioned services exceed a customer-defined threshold.

When charges exceed these thresholds, AWS can send an email notification or publish a notification to an Amazon SNS topic. To create billing alerts and register for notifications, enable them in the AWS Billing and Cost Management console.
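For example, a billing alarm that notifies an Amazon SNS topic when estimated charges exceed $100 can be created with the AWS CLI (a sketch; the topic ARN and threshold are placeholders, and billing metric data is stored in the US East (N. Virginia) Region):

  aws cloudwatch put-metric-alarm --region us-east-1 \
      --alarm-name monthly-billing-alarm \
      --namespace AWS/Billing --metric-name EstimatedCharges \
      --dimensions Name=Currency,Value=USD \
      --statistic Maximum --period 21600 --evaluation-periods 1 \
      --threshold 100 --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts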



Detailed Billing

In December 2016, Amazon announced the addition of detailed billing to Amazon CloudWatch Logs. Reports can be generated based on usage and cost per log group.

Tags and Log Groups

You can use tags to classify log groups and give them categories such as purpose, owner, or environment. You can create a custom set of categories to meet specific needs, and you can also use tags to categorize and track AWS costs. When you apply tags to your AWS resources, including log groups, AWS cost allocation reports include usage and costs aggregated by tags.

  • Tags can be added to log groups to get a detailed view of costs across business dimensions.
  • Up to 50 tags can be added to each log group.
  • Tags are added to log groups using the AWS CLI or the Amazon CloudWatch Logs API, as shown in the example after this list.
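For example (illustrative log group name and tags):

  aws logs tag-log-group --log-group-name apache-access \
      --tags Environment=Production,Team=WebOps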

Log Group Tag Restrictions

The following restrictions apply to tags.

Basic Restrictions

  • The maximum number of tags per log group is 50.
  • The keys and values of a tag are case sensitive.
  • Tags in a deleted log group cannot be changed or edited.

Tag Key Restrictions

  • Each tag key must be unique. If a tag is added with a key that’s already in use, the new tag overwrites the existing key/value pair.
  • Tag keys cannot start with aws: because this prefix is reserved for use by AWS. AWS creates tags that begin with this prefix on customers’ behalf, but customers cannot edit or delete them.
  • Tag keys must be between 1 and 128 Unicode characters in length.
  • Tag keys must consist of the following characters:
    • Unicode letters and digits
    • Whitespace
    • Underscore (_)
    • Period (.)
    • Forward slash (/)
    • Equals sign (=)
    • Plus sign (+)
    • Hyphen (-)
    • At symbol (@)

Tag Value Restrictions

  • Tag values must be between 0 and 255 Unicode characters in length.

 


At the end of the billing cycle, the total charges (tagged and untagged) on the billing report with cost allocation tags reconcile with the total charges on the Bills page and on other billing reports for the same period.

Tags can also be used to filter views in Cost Explorer.



Cost Explorer

Cost Explorer is an AWS tool for viewing charts of your costs. Spend data can be viewed for up to the past 13 months and used to forecast spending for the next 3 months. Cost Explorer can also be used to see patterns in how much is spent on AWS resources over time, identify areas that need further inquiry, and see trends that assist in understanding costs.

Cost Explorer can reveal which service is being used the most and which Availability Zone gets the most network traffic.

With Cost Explorer, there are a variety of filters:

  • API operation
  • Availability Zone
  • AWS Cloud service
  • Custom cost allocation tags
  • Amazon EC2 instance type
  • Linked account(s)
  • Platform
  • Purchase option
  • Region
  • Tenancy
  • Usage type
  • Usage type group

 


Cost Explorer uses the same dataset used to generate the AWS Cost and Usage Reports and the detailed billing reports. The dataset can also be downloaded as a comma-separated value (CSV) file for detailed analysis.

AWS Billing and Cost Management Metrics and Dimensions

The AWS Billing and Cost Management service sends metrics to Amazon CloudWatch.

Metrics

The AWS/Billing namespace uses the metric in Table 9.9.

TABLE 9.9 AWS/Billing Namespace Metric

Metric Description
EstimatedCharges The estimated charges for AWS usage. This can be either estimated charges for one service or a roll-up of estimated charges for all services.

Dimensions

AWS Billing and Cost Management supports filtering metrics using the dimensions in Table 9.10.

TABLE 9.10 AWS Billing and Cost Management Dimensions

Dimension Description
ServiceName The name of the AWS Cloud service. This dimension is omitted for the total of estimated charges across all services.
LinkedAccount The linked account number. This is used for Consolidated Billing only. This dimension is included only for accounts that are linked to a separate paying account in a Consolidated Billing relationship. It is not included for accounts that are not linked to a Consolidated Billing paying account.
Currency The monetary currency to bill the account. This dimension is required. Unit: USD

AWS CloudTrail

AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of an AWS account. With AWS CloudTrail, it is possible to log, continuously monitor, and retain events related to API calls across an AWS infrastructure.

AWS CloudTrail provides a history of AWS API calls for an account. This includes API calls made through the AWS Management Console, AWS SDKs, command-line tools, and other AWS Cloud services. This history simplifies security analysis, resource change tracking, and troubleshooting.

AWS CloudTrail provides visibility into user activity by recording API calls made on an account. It records important information about each API call, including the name of the API, the identity of the caller, the time of the API call, the request parameters, and the response elements returned by the AWS Cloud service. This information can be used in the tracking of changes made to AWS resources and help troubleshoot operational issues. This makes it easier to ensure compliance with internal policies and regulatory standards.

What Are Trails?

A trail is a configuration that enables logging of the AWS API activity and related events in an account. AWS CloudTrail delivers the logs to an Amazon S3 bucket and, optionally, to an Amazon CloudWatch Logs log group.

It is possible to specify an Amazon SNS topic that receives notifications of log file deliveries. For a trail that applies to all regions, the trail configuration in each region is identical.

Types of Trails

You can create trails with the AWS CloudTrail console, the AWS CLI, or the AWS CloudTrail API. There are two types of trails: those that apply to all regions and those that apply to one region.

Trails that Apply to All Regions

When creating a trail that applies to all regions, AWS CloudTrail creates the same trail in each region. It then records the log files in each region and delivers the log files to an Amazon S3 bucket that is user-specified. This is the default option when you create a trail in the AWS CloudTrail console.

A trail that applies to all regions has the following advantages:

  • The configuration settings for the trail apply consistently across all regions.
  • Log files from all regions are sent to a single Amazon S3 bucket and, optionally, to an Amazon CloudWatch Logs log group.
  • Trail configurations for all regions are managed from one location.
  • Events are immediately received from new regions. When a new region launches, AWS CloudTrail automatically creates a trail in the new region with the same settings as your original trail.
  • Because trails are created even in regions that are used infrequently, unusual activity in those regions can still be monitored.

When applying a trail to all regions, AWS CloudTrail uses the trail created in a particular region to create trails with identical configurations in all other regions in an account. This has the following effects:

  • If an Amazon SNS topic has been configured for the trail, Amazon SNS notifications about log file deliveries in all regions are sent to that single Amazon SNS topic.
  • Global service events are delivered from a single region to the specified Amazon S3 bucket and, if one has been configured, to the Amazon CloudWatch Logs log group.
  • If log file integrity validation has been activated, log file integrity validation is enabled in all regions for the trail.

A Trail that Applies to One Region

When creating a trail that applies to one region, AWS CloudTrail records the log files in that region only and delivers log files to a user-specified Amazon S3 bucket. When creating additional individual trails that apply to specific regions, those trails can be set to deliver log files to a single Amazon S3 bucket regardless of region.


Multiple Trails per Region

If there are different but related user groups such as developers, security personnel, and IT auditors that need access to AWS CloudTrail, create multiple trails per region. This allows each group to receive its own copy of the log files.

AWS CloudTrail supports five trails per region. A trail that applies to all regions counts as one trail in every region. To see a list of the trails in all regions, open the Trails page of the AWS CloudTrail console.

Encryption

By default, log files are encrypted using Amazon S3 Server-Side Encryption (SSE). Log files can be stored in an Amazon S3 bucket for as long as they are needed. Amazon S3 lifecycle rules can be defined to archive or delete log files automatically.

AWS CloudTrail Log Delivery

AWS CloudTrail typically delivers log files within 15 minutes of an API call. In addition, AWS CloudTrail publishes log files multiple times an hour—approximately every five minutes. These log files contain API calls from services that support AWS CloudTrail.


Overview: Creating a Trail

Whether a trail is created or updated with the AWS CloudTrail console or the AWS CLI, the same basic steps are followed (a minimal AWS CLI sketch appears after the list).

  1. Turn on AWS CloudTrail by creating a trail. By default, when you create a trail in a region in the AWS CloudTrail console, the trail applies to all regions.
  2. Create an Amazon S3 bucket or specify an existing bucket where the log files are to be delivered. By default, log files from all regions in an account are delivered to the specified bucket.
  3. Configure the trail to log the types of events desired. The choices are read-only, write-only, or all management and data events. By default, trails log all management events.
  4. Create an Amazon SNS topic to receive notifications when log files are delivered. Delivery notifications from all regions are sent to the Amazon SNS topic specified.
  5. Configure Amazon CloudWatch Logs to receive logs from AWS CloudTrail so that they can be monitored for specific log events.
  6. Turn on log file encryption. This encrypts files for added security.
  7. Turn on integrity validation for log files. This enables the delivery of digest files that you can use to validate the integrity of log files after AWS CloudTrail has delivered them.
  8. Add tags to the trail.
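A minimal AWS CLI sketch of steps 1, 2, 4, and 7 (the trail, bucket, and topic names are placeholders; the bucket policy and topic policy must allow AWS CloudTrail to write and publish):

  # Create a trail that applies to all regions, delivers to an existing bucket,
  # notifies an existing topic, and enables log file integrity validation
  aws cloudtrail create-trail --name my-trail \
      --s3-bucket-name my-cloudtrail-bucket \
      --sns-topic-name cloudtrail-log-delivery \
      --is-multi-region-trail --enable-log-file-validation

  # Start recording API activity on the trail
  aws cloudtrail start-logging --name my-trail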

Monitoring with AWS CloudTrail

Amazon CloudWatch is a web service that collects and tracks metrics to monitor AWS resources and the applications that run on AWS. Amazon CloudWatch Logs is a feature of Amazon CloudWatch used specifically to monitor log data. Integration with Amazon CloudWatch Logs enables AWS CloudTrail to send events containing API activity in an AWS account to an Amazon CloudWatch Logs log group.

AWS CloudTrail events that are sent to Amazon CloudWatch Logs can trigger alarms according to the metric filters defined by customers. Optionally, you can configure Amazon CloudWatch Alarms to send notifications or make changes to the resources being monitored based on log stream events that metric filters extract.

Using Amazon CloudWatch Logs, you can track AWS CloudTrail events alongside events from the operating system, applications, or other AWS Cloud services that are sent to Amazon CloudWatch Logs.

AWS CloudTrail vs. Amazon CloudWatch

AWS CloudTrail adds depth to the monitoring capabilities already offered by AWS. Amazon CloudWatch focuses on performance monitoring and system health, and AWS CloudTrail focuses on API activity. While AWS CloudTrail does not report on system performance or health, you can use AWS CloudTrail in combination with Amazon CloudWatch Logs alarms to create notifications to gain a deeper understanding of AWS resources and their utilization.

AWS CloudTrail: Trail Naming Requirements

AWS CloudTrail trail names must meet the following requirements:

  • Contain only ASCII letters (a-z, A-Z), numbers (0-9), periods (.), underscores (_), and dashes (-)
  • Start with a letter or number, and end with a letter or number
  • Be between 3 and 128 characters long
  • Have no adjacent periods, underscores, or dashes; names like my-_namespace and my--namespace are invalid
  • Not be in IP address format; for example, 10.9.28.68 is invalid

Getting and Viewing AWS CloudTrail Log Files

AWS CloudTrail delivers log files to an Amazon S3 bucket specified during the creation of the trail. Typically, log files appear in the bucket within 15 minutes of the recorded AWS API call or other AWS event. Log files are generally published every five minutes.

Finding AWS CloudTrail Log Files

AWS CloudTrail publishes log files to the Amazon S3 bucket in a gzip archive. In the Amazon S3 bucket, the log file has a formatted name that includes the following elements:

  • The bucket name specified when you created the trail
  • The (optional) prefix that you specified when the trail was created
  • The string "AWSLogs"
  • The account number
  • The string "CloudTrail"
  • A region identifier
  • The year the log file was published in YYYY format
  • The month the log file was published in MM format
  • The day the log file was published in DD format
  • An alphanumeric string that separates the file from others that cover the same time period

This is what a complete log file object name looks like:
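(A representative example; the bucket name, account ID, region, date, and trailing unique string are placeholders, and no optional prefix is shown.)

  my-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/us-west-2/2017/06/01/123456789012_CloudTrail_us-west-2_20170601T2030Z_EXAMPLEa1b2c3d4.json.gz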

Retrieve Log Files

To retrieve a log file, use the Amazon S3 console, the AWS CLI, or the Amazon S3 API.

To find your log files with the Amazon S3 console, do the following:

  • Open the Amazon S3 console.
  • Choose the bucket specified for the trails.
  • Navigate through the object hierarchy to find the correct log.

All log files have a .gz extension.



Configuring Amazon SNS Notifications for AWS CloudTrail

It is possible to be notified when AWS CloudTrail publishes new log files to an Amazon S3 bucket. You manage notifications using Amazon SNS.

Notifications are optional. To activate them, configure AWS CloudTrail to send update information to an Amazon SNS topic whenever a new log file has been sent. To receive these notifications, subscribe to the topic. To handle notifications programmatically, subscribe an Amazon SQS queue to the topic.
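For example, an Amazon SQS queue can be subscribed to the topic with the AWS CLI (the topic and queue ARNs are placeholders; the queue’s access policy must also allow the topic to send messages to it):

  aws sns subscribe \
      --topic-arn arn:aws:sns:us-west-2:123456789012:cloudtrail-log-delivery \
      --protocol sqs \
      --notification-endpoint arn:aws:sqs:us-west-2:123456789012:cloudtrail-log-queue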

Controlling User Permissions for AWS CloudTrail

AWS CloudTrail integrates with IAM, which controls access to AWS CloudTrail and other AWS resources that AWS CloudTrail requires, including Amazon S3 buckets and Amazon SNS topics. Use IAM to control which AWS users can create, configure, or delete AWS CloudTrail trails, start and stop logging, and access the buckets that contain log information.

Granting Permissions for AWS CloudTrail Administration

To administer an AWS CloudTrail trail, grant explicit permissions to IAM users to perform the actions associated with the AWS CloudTrail tasks. For most scenarios, you can accomplish this by using an AWS managed policy that contains predefined permissions.

A typical approach is to create an IAM group that has the appropriate permissions and then add individual IAM users to that group. For example, you can create one IAM group for users that should have full access to AWS CloudTrail actions and a separate group for users who should be able to view trail information but not create or change trails.

These are the AWS Managed Policies for AWS CloudTrail:

AWSCloudTrailFullAccess This policy gives users in the group full access to AWS CloudTrail actions and permissions to manage the Amazon S3 bucket, the log group for Amazon CloudWatch Logs, and an Amazon SNS topic for a trail.

AWSCloudTrailReadOnlyAccess This policy lets users in the group view trails and buckets.
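For example, these managed policies can be attached to IAM groups with the AWS CLI (the group names are illustrative):

  aws iam attach-group-policy --group-name CloudTrailAdmins \
      --policy-arn arn:aws:iam::aws:policy/AWSCloudTrailFullAccess

  aws iam attach-group-policy --group-name CloudTrailAuditors \
      --policy-arn arn:aws:iam::aws:policy/AWSCloudTrailReadOnlyAccess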

Log Management and Data Events

When creating a trail, the trail logs read-only and write-only management events for your account. If desired, update the trail to specify whether or not the trail should log data events. Data events are object-level API operations that access Amazon S3 object resources, such as GetObject, DeleteObject, and PutObject. Only events that match the trail settings are delivered to the Amazon S3 bucket and Amazon CloudWatch Logs log group. If the event doesn’t match the settings for a trail, the trail doesn’t log the event.
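For example, Amazon S3 data events for a single bucket can be added to a trail with the AWS CLI (the trail and bucket names are placeholders):

  aws cloudtrail put-event-selectors --trail-name my-trail \
      --event-selectors '[{"ReadWriteType": "All", "IncludeManagementEvents": true, "DataResources": [{"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::my-data-bucket/"]}]}]'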

Amazon SNS Topic Policy for AWS CloudTrail

To send notifications to an Amazon SNS topic, AWS CloudTrail must have the required permissions. AWS CloudTrail automatically attaches the required permissions to the topic when the following occurs:

  • Create an Amazon SNS topic as part of creating or updating a trail in the AWS CloudTrail console.
  • Create an Amazon SNS topic with the AWS CLI create-subscription and update-subscription commands.

AWS CloudTrail adds the following fields in the policy automatically:

  • The allowed SIDs
  • The service principal name for AWS CloudTrail
  • The Amazon SNS topic, including region, account ID, and topic name

The following policy allows AWS CloudTrail to send notifications about log file delivery from supported regions:
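(A representative policy; the statement ID, region, account ID, and topic name are placeholders.)

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "AWSCloudTrailSNSPolicy",
        "Effect": "Allow",
        "Principal": { "Service": "cloudtrail.amazonaws.com" },
        "Action": "SNS:Publish",
        "Resource": "arn:aws:sns:us-west-2:123456789012:cloudtrail-log-delivery"
      }
    ]
  }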

AWS Config

AWS Config is a fully managed service that provides AWS resource inventory, configuration history, and configuration change notifications to enable security and governance. AWS Config can discover existing AWS resources, export a complete inventory of AWS resources with all configuration details, and determine how a resource was configured at any point in time. These capabilities enable compliance auditing, security analysis, resource change tracking, and troubleshooting.

AWS Config makes it easy to track resource configuration without the need for upfront investments and to avoid the complexity of installing and updating agents for data collection or maintaining large databases. After AWS Config is enabled, continuously updated details can be viewed of all configuration attributes associated with AWS resources. Amazon SNS can be configured to provide notifications of every configuration change.

AWS Config provides a detailed view of the configuration of AWS resources in an AWS account. This includes how resources are related to one another and how they were configured in the past to show how configurations and relationships change over time.

An AWS resource is an entity in AWS such as an Amazon EC2 instance, an Amazon EBS volume, a security group, or an Amazon Virtual Private Cloud (Amazon VPC).

With AWS Config, you can do the following:

  • Evaluate AWS resource configurations for desired settings.
  • Get a snapshot of the current configurations of the supported resources that are associated with an AWS account.
  • Retrieve configurations of one or more resources that exist in an account.
  • Retrieve historical configurations of one or more resources.
  • Receive a notification whenever a resource is created, modified, or deleted.
  • View relationships between resources, such as those that use a particular security group.

Ways to Use AWS Config

When running applications on AWS, resources must be created and managed collectively. As the demand for an application grows, so too does the need to keep track of the addition of AWS resources. AWS Config is designed to help oversee application resources in the following scenarios.

Resource Administration

To exercise better governance over resource configurations and to detect resource misconfigurations, fine-grained visibility is needed into what resources exist and how these resources are configured at any time. Use AWS Config to automatically send notifications whenever resources are created, modified, or deleted. There is no need to monitor these changes by polling calls made to each individual resource.

Use AWS Config rules to evaluate the configuration settings of AWS resources. When AWS Config detects that a resource violates the conditions in one of the established rules, AWS Config flags the resource as noncompliant and sends a notification. AWS Config continuously evaluates resources as they are created, changed, or deleted.

Auditing and Compliance

Some data requires frequent audits to ensure compliance with internal policies and best practices. To demonstrate compliance, access is needed to the historical configurations of the resources. This information is provided by AWS Config.

Managing and Troubleshooting Configuration Changes

When using multiple AWS resources that depend on one another, a change in the configuration of one resource might have unintended consequences on related resources. With AWS Config, it is possible to view how one resource is related to other resources and assess the impact of the proposed change.

The historical configurations of resources provided by AWS Config can assist in troubleshooting issues by providing access to the last known good configuration of a problem resource.

Security Analysis

To analyze potential security weaknesses, detailed historical information about AWS resource configurations is required. This information could include the IAM permissions that are granted to your users or the Amazon EC2 security group rules that control access to your resources.

Use AWS Config to view the IAM policy that was assigned to an IAM user, group, or role at any time in which AWS Config was recording. This information can help determine the permissions that belonged to a user at a specific time.

Use AWS Config to view the configuration of Amazon EC2 security groups and the port rules that were open at a specific time. This information can help determine whether a security group was blocking incoming TCP traffic to a specific port.

AWS Config Rules

An AWS Config rule represents desired configurations for a resource, and it is evaluated against configuration changes on the relevant resources, as recorded by AWS Config. The results of evaluating a rule against the configuration of a resource are available on a dashboard. Using AWS Config rules, customers can assess their overall compliance and risk status from a configuration perspective, view compliance trends over time, and pinpoint which configuration change caused a resource to drift out of compliance with a rule.

A rule represents desired Configuration Item (CI) attribute values for resources, which are evaluated by comparing those attribute values with CIs recorded by AWS Config. There are two types of rules: AWS managed rules and customer managed rules.

AWS Managed Rules

AWS managed rules are prebuilt and managed by AWS. Choose the rule to enable and then supply a few configuration parameters to get started.
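For example, the AWS managed rule that checks whether security groups disallow unrestricted incoming SSH traffic can be enabled with the AWS CLI (a sketch; the rule name is illustrative, and INCOMING_SSH_DISABLED is the managed rule’s source identifier):

  aws configservice put-config-rule --config-rule '{
      "ConfigRuleName": "restricted-ssh",
      "Scope": { "ComplianceResourceTypes": ["AWS::EC2::SecurityGroup"] },
      "Source": { "Owner": "AWS", "SourceIdentifier": "INCOMING_SSH_DISABLED" }
  }'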

Customer Managed Rules

It is possible to develop custom rules and add them to AWS Config. Associate each custom rule with an AWS Lambda function. This Lambda function contains the logic that evaluates whether AWS resources comply with the rule.

Associate this function with a rule, and the rule invokes the function either in response to configuration changes or periodic intervals. The function then evaluates whether resources comply with the rule, and it sends its evaluation results to AWS Config.



How Rules Are Evaluated

Any rule can be set up as a change-triggered rule or as a periodic rule.

A change-triggered rule is executed when AWS Config records a configuration change for any of the resources specified. Additionally, one of the following must be specified:

Tag Key Any configuration change recorded for a resource carrying the specified tag key:value triggers an evaluation of the rule.

Resource Type(s) Any configuration changes recorded for any resource within the specified resource type(s) will trigger an evaluation of the rule.

Resource ID Any changes recorded to the resource specified by the resource type and resource ID will trigger an evaluation of the rule.

A periodic rule is triggered at a specified frequency. Available frequencies are 1 hour, 3 hours, 6 hours, 12 hours, or 24 hours. A periodic rule has a full snapshot of current CIs for all resources available to the rule.

Configuration Items

A CI is the configuration of a resource at a given point in time. A CI consists of five sections:

  • Basic information about the resource that is common across different resource types (for example, ARNs, tags)
  • Configuration data specific to the resource (such as an Amazon EC2 instance type)
  • Map of relationships with other resources (for example, Amazon EC2::Volume vol-6886ff28 is “attached to instance” Amazon EC2 instance i-24601abc)
  • AWS CloudTrail event IDs that are related to this state
  • Metadata that helps identify information about the CI, such as the version of the CI and when the CI was captured

Rule Evaluation

Evaluation of a rule determines whether a resource complies with the rule at a particular point in time. It is the result of evaluating the rule against the configuration of the resource. AWS Config rules capture and store the result of each evaluation. This result includes the resource, the rule, the time of evaluation, and a link to the CI that caused noncompliance.

Rule Compliance

A resource is compliant if it conforms with all rules that apply to it. Otherwise, it is noncompliant. Similarly, a rule is compliant if all resources evaluated by the rule comply with the rule. Otherwise, it is noncompliant.

In some cases, such as when inadequate permissions are available to the rule, an evaluation may not exist for the resource, leading to a state of insufficient data. This state is excluded from determining the compliance status of a resource or rule.

AWS Config and AWS CloudTrail

AWS CloudTrail records user API activity on an account and allows access to information about this activity. You can use AWS CloudTrail to get full details about API actions, such as identity of the caller, the time of the API call, the request parameters, and the response elements returned by the AWS Cloud service.

AWS Config records point-in-time configuration details for AWS resources as CIs. You can use a CI to answer “What did my AWS resource look like?” at a point in time. You can use AWS CloudTrail to answer “Who made an API call to modify this resource?”

In practice, you can use the AWS Config console to detect that a security group was incorrectly configured in the past. With the integrated AWS CloudTrail information, you can find the user that misconfigured the security group and learn when it happened.

Each custom rule is simply an AWS Lambda function. When the function is invoked in order to evaluate a resource, it is provided with the resource’s CI. The function can inspect the item and make calls to other AWS API functions as desired. After the AWS Lambda function makes its decision about compliance, it calls the PutEvaluations function to record the decision.

Pricing

With AWS Config, customers are charged based on the number of CIs recorded for supported resources in an AWS account and are charged only once for recording the CI. There is no additional fee or any upfront commitment for retaining the CI. Users can stop recording CIs at any time and continue to access the CIs previously recorded. Charges per CI are rolled up into the monthly bill.

If you are using AWS Config rules, charges are based on active AWS Config rules in that month. When a rule is compared with an AWS resource, the result is recorded as an evaluation. A rule is active if it has one or more evaluations in a month.

Configuration snapshots and configuration history files are delivered to an Amazon S3 bucket. Configuration change notifications are delivered via Amazon SNS. Standard rates for Amazon S3 and Amazon SNS apply. Customer-managed rules are authored using AWS Lambda. Standard rates for AWS Lambda apply.

Summary

Amazon CloudWatch monitors AWS resources and the applications that run on AWS in real time. Use CloudWatch to collect and track metrics, which are variables used to measure resources and applications.

Amazon CloudWatch Alarms send notifications or automatically make changes to the resources being monitored based on customer-defined rules. This monitoring data can be used to determine whether additional instances should be launched in order to handle the increased load or stop under-utilized instances to save money.

In addition to monitoring the built-in metrics that come with AWS, custom metrics can be imported and monitored. Custom metrics can include detailed information about an Amazon EC2 Instance or data from servers running in an on-premises datacenter.

Amazon CloudWatch provides system-wide visibility into resource utilization, application performance, and operational health.

AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of an AWS account. With CloudTrail, customers can log, continuously monitor, and retain events related to API calls across an AWS infrastructure.

AWS CloudTrail provides a history of AWS API calls for an account. This includes API calls made through the AWS Management Console, AWS SDKs, command-line tools, and other AWS services. This history simplifies security analysis, resource change tracking, and troubleshooting.

AWS Config is a service that enables customers to assess, audit, and evaluate the configurations of their AWS resources. Config continuously monitors and records your AWS resource configurations and allows the automation of the evaluation of recorded configurations against desired configurations.

With AWS Config, review changes in configurations and relationships between AWS resources, dive into detailed resource configuration histories, and determine the overall compliance against the configurations specified in internal guidelines. This simplifies compliance auditing, security analysis, change management, and operational troubleshooting.

Resources to Review


Exam Essentials

Be familiar with Amazon CloudWatch. Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications that you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, and set alarms.

Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by your applications and services and any log files your applications generate.

You can use Amazon CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. You can use these insights to react and keep your application running smoothly.

Amazon CloudWatch Events has three components: Events, Rules, and Targets. Events indicate a change in an AWS environment. Targets process events. Rules match incoming events and route them to targets for processing.

Understand what Amazon CloudWatch Logs is and what it can do. Amazon CloudWatch Logs lets you monitor and troubleshoot systems and applications using existing system, application, and custom log files.

With Amazon CloudWatch Logs, monitor logs in near real time for specific phrases, values, or patterns. Log data can be stored and accessed indefinitely in highly durable, low-cost storage without filling hard drives.

Be able to create or edit an Amazon CloudWatch Alarm. You can choose specific metrics to trigger the alarm and specify thresholds for those metrics. You can then set your alarm to change state when a metric exceeds a threshold that you have defined.

Know how to create a monitoring plan. Creating a monitoring plan involves answering some basic questions. What are your goals for monitoring? What resources will you monitor? How often will you monitor these resources? What monitoring tools will you use? Who will perform the monitoring tasks? Who should be notified when something goes wrong?

Know and understand custom metrics. You can now store your business and application metrics in Amazon CloudWatch. You can view graphs, set alarms, and initiate automated actions based on these metrics, just as you can for the metrics that Amazon CloudWatch already stores for your AWS resources.

Visibility for metrics above the Hypervisor requires an agent. Amazon CloudWatch can tell you CPU utilization, disk I/O, and network I/O at the Hypervisor level, but it has no way of knowing memory utilization or which specific tasks or processes are affecting performance. CloudWatch can see disk I/O but cannot see disk usage. Collecting those metrics requires an agent.

Be familiar with what an Amazon CloudWatch Alarm is and how it works. You can create a CloudWatch Alarm that watches a single metric. The alarm performs one or more actions based on the value of the metric relative to a threshold over a number of time periods. The action can be an Amazon EC2 action, an Auto Scaling action, or a notification sent to an Amazon SNS topic.

Know the three states of an Amazon CloudWatch Alarm. These are OK, ALARM, and INSUFFICIENT_DATA. If an alarm is in the OK state, a monitored metric is within the range you have defined as acceptable. If the alarm is in the ALARM state, the metric has breached a threshold. If data is missing or incomplete, it is in the INSUFFICIENT_DATA state.

There are two levels of monitoring: Basic and Detailed. Basic Monitoring for Amazon EC2 sends CPU load, disk I/O, and network I/O metric data to Amazon CloudWatch in five-minute periods by default. To send metric data for an instance to CloudWatch in one-minute periods, enable Detailed Monitoring on the instance. Some services, like Amazon RDS, have Detailed Monitoring on by default.

Amazon CloudWatch performs two types of Amazon EC2 status checks: System and Instance. A System Status Check monitors the AWS systems required to use your instance to ensure that they are working properly. These checks detect problems with your instance that require AWS involvement to repair. An Instance Status Check monitors the software and network configuration of your individual instance. These checks detect problems that require your involvement to repair.

Be familiar with some common metrics used for monitoring. There are many metrics available, and not all of them are tested on the exam. Some are tested, however, and it is a good idea to know the common ones: VolumeQueueLength, DatabaseConnections, DiskQueueDepth, FreeStorageSpace, ReplicaLag, ReadIOPS, WriteIOPS, ReadLatency, WriteLatency, SurgeQueueLength, and SpilloverCount.

Know how to set up an Amazon CloudWatch Event subscription for Amazon RDS. Amazon RDS uses the Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint. Details can be found here: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Events.html.

Amazon ElastiCache has two engines available: Redis and Memcached. Memcached is multi-threaded, and Redis is single-threaded.

Know the Amazon EBS Volume Status Checks. Degraded and Severely Degraded performance means that the EBS Volume Status Check is in a Warning state and there is some I/O. Stalled or Not Available means that the EBS Volume is in the Impaired state, and there is no I/O.

Learn the metrics SurgeQueueLength and SpilloverCount. When the surge queue is full, spillover occurs: requests are dropped without notifying end users, and the customer experience is negatively impacted.


Exercises

By now you should have set up an account in AWS. If you haven’t, now would be the time to do so. It is important to note that these exercises run in your AWS account and may incur charges.

Use the Free Tier when launching resources. For more information, see https://aws.amazon.com/s/dm/optimization/server-side-test/free-tier/free_np/.

If you have not yet installed the AWS Command Line utilities, refer to Chapter 2, “Working with AWS Cloud Services,” Exercise 2.1 (Linux) or Exercise 2.2 (Windows).

The reference for the AWS CLI can be found at http://docs.aws.amazon.com/cli/latest/reference/.












Review Questions

  1. Which of the following requires a custom Amazon CloudWatch metric to monitor?

    1. Amazon EC2 CPU Utilization
    2. Amazon EC2 Disk IO
    3. Amazon EC2 Memory Utilization
    4. Amazon EC2 Network IO
  2. While using Auto Scaling with an ELB in front of several Amazon EC2 instances, you want to configure Auto Scaling to remove one instance when CPU Utilization is below 20 percent. How is this accomplished?

    1. Configure Amazon CloudWatch Logs to send a notification to the Auto Scaling Group when CPU utilization is less than 20 percent, and configure the Auto Scaling policy to remove the instance.
    2. Configure Amazon CloudWatch to send a notification to the Auto Scaling Group when the aggregated CPU Utilization is less than 20 percent, and configure the Auto Scaling policy to remove the instance.
    3. Monitor the Amazon EC2 instances with Amazon CloudWatch, and use Auto Scaling to remove an instance with scheduled actions.
    4. Configure Amazon CloudWatch to generate an email using Amazon SNS when CPU utilization is less than 20 percent. Log into the console, and lower the desired capacity number inside Auto Scaling to remove the instance.
  3. Your company has configured the custom metric upload with Amazon CloudWatch, and it has authorized employees to upload data using AWS CLI as well as AWS SDK. How can you track API calls made to CloudWatch?

    1. Use AWS CloudTrail to monitor the API calls.
    2. Create an IAM role to allow users who assume the role to view the data using an Amazon S3 bucket policy.
    3. Enable logging with Amazon CloudWatch to capture metrics for the API calls.
    4. Enable detailed monitoring with Amazon CloudWatch.
  4. Of the services listed here, which provide detailed monitoring without extra charges being incurred? (Choose two.)

    1. AWS Auto Scaling
    2. Amazon Route 53
    3. Amazon Elastic MapReduce
    4. Amazon Relational Database Service
    5. Amazon Simple Notification Service
  5. You have configured an ELB Classic Load Balancer to distribute traffic among multiple Amazon EC2 instances. Which of the following will aid troubleshooting efforts related to back-end servers?

    1. HTTPCode_Backend_2XX
    2. HTTPCode_Backend_3XX
    3. HTTPCode_Backend_4XX
    4. HTTPCode_Backend_5XX
  6. What is the minimum time interval for data that Amazon CloudWatch receives and aggregates?

    1. Fifteen seconds
    2. One minute
    3. Three minutes
    4. Five minutes
  7. Using the Free Tier, what is the frequency of updates received by Amazon CloudWatch?

    1. Fifteen seconds
    2. One minute
    3. Three minutes
    4. Five minutes
  8. The type of monitoring automatically available in five-minute periods is called what?

    1. Elastic
    2. Simple
    3. Basic
    4. Detailed
  9. You have created an Auto Scaling Group using the AWS CLI. You now want to enable Detailed Monitoring for this group. How is this accomplished?

    1. Enable Detailed Monitoring from the AWS console.
    2. When creating an alarm on the Auto Scaling Group, Detailed Monitoring is automatically activated.
    3. When creating Auto Scaling Groups using the AWS CLI or API, Detailed Monitoring is enabled for Auto Scaling by default.
    4. Auto Scaling Groups do not support Detailed Monitoring.
  10. There are 10 Amazon EC2 instances running in multiple regions using an internal memory management tool to capture log files and send them to Amazon CloudWatch in US-West-2. Additionally, you are using the AWS CLI to configure CloudWatch to use the same namespace and metric in all regions. Which of the following is true?

    1. Amazon CloudWatch will receive and aggregate statistical data based on the namespace and metric.
    2. Amazon CloudWatch will process the data only for the server that responds first and ignore the other regions.
    3. Amazon CloudWatch will process the statistical data for the most recent response regardless of region and overwrite other data.
    4. Amazon CloudWatch cannot receive data across regions.
  11. You have misconfigured an Amazon EC2 instance’s clock and are sending data to Amazon CloudWatch via the API. Because of the misconfiguration, logs are being sent 60 minutes in the future. Which of the following is true?

    1. Amazon CloudWatch will process the data.
    2. It is not possible to send data from the future.
    3. It is not possible to send data manually to Amazon CloudWatch.
    4. Agents cannot send data for more than 60 minutes in the future.
  12. You have a system that sends data to Amazon CloudWatch every five minutes for tracking/monitoring. Which of these parameters is required as part of the put-metric-data request?

    1. Key
    2. Namespace
    3. Metric Name
    4. Timestamp
  13. To monitor API calls against AWS, use _______________ to capture the history of API requests and use _______________ to respond to operational changes in real time.

    1. AWS Config; Amazon Inspector
    2. AWS CloudTrail; AWS Config
    3. AWS CloudTrail; Amazon CloudWatch Events
    4. AWS Config; AWS Lambda