THE AWS CERTIFIED ADVANCED NETWORKING – SPECIALTY EXAM OBJECTIVES COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:
AWS provides a number of networking features to connect within the AWS Cloud, outside the AWS Cloud to the Internet, and in a hybrid manner to an on-premises environment. This chapter discusses tools and techniques for troubleshooting networking issues that can arise with these connections. In addition, the chapter discusses a number of common troubleshooting scenarios. The exam requires both knowledge of AWS troubleshooting tools and the ability to troubleshoot common scenarios, and this chapter highlights each.
Troubleshooting can follow either a bottom-up approach (traversing the Open Systems Interconnection [OSI] model layer by layer) or a top-down approach (working through the areas most likely to cause issues). Each has its merits in different scenarios, and the two can be combined to help resolve issues quickly. The approach can also change based on the environment in which the troubleshooting is occurring.
Traversing through the OSI model systematically from Layer 1 through Layer 7 is often a useful way to pinpoint issues. Such a method can be optimized by taking into account the environment. For example, there is implicit routing for all subnets by default within a Virtual Private Cloud (VPC). This can rule out Layer 2 and Layer 3 communication issues within a VPC; thus, troubleshooting should start at Layer 4. In another example, when custom routing is set up through Amazon Elastic Compute Cloud (Amazon EC2) instances, Layer 3 troubleshooting may be required to ensure that routing is occurring as expected.
Stepping back and taking a top-down approach to pinpointing potential areas for network issues is also a valuable way to troubleshoot. Knowing service limits, for example, can help resolve issues that are otherwise difficult to fix. Being able to recognize security group and network Access Control List (ACL) issues without having to dig through the network stack layer by layer is another example of how this approach can be helpful.
AWS offers a rich set of tools that can be combined with traditional tools to help troubleshoot networking connectivity issues. Both traditional tools and AWS-native tools are discussed in this section.
In this section, we discuss traditional network troubleshooting tools, many of which you may already be familiar with.
For troubleshooting when deep packet inspection is necessary, packet captures can be useful. Packet capture tools like Wireshark (Windows/Linux) and tcpdump (Linux) can be run on an Amazon EC2 instance. By listening at the interface level, these tools are able to view the packets as they are sent to and received from the network, revealing both packet header and payload.
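To make the header-versus-payload distinction concrete, the following is a minimal Python sketch of the kind of decoding that capture tools perform on each packet. It assumes the input is the raw bytes of an IPv4 packet; the function name `parse_ipv4_header` and the sample addresses are illustrative and not part of any capture tool.

```python
import socket
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header (illustrative helper)."""
    if len(packet) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    # Network byte order: version/IHL, TOS, total length, ID, flags/fragment,
    # TTL, protocol, checksum, source address, destination address.
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_length = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    return {
        "version": version_ihl >> 4,
        "header_length": header_length,
        "ttl": ttl,
        "protocol": protocol,  # 6 = TCP, 17 = UDP, 1 = ICMP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "payload": packet[header_length:total_length],
    }


# Hand-built sample packet: 20-byte header followed by a 4-byte payload.
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 24, 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.5"), socket.inet_aton("10.0.1.9"),
) + b"data"
print(parse_ipv4_header(sample))
```

In practice, Wireshark and tcpdump do this decoding for every protocol layer; the sketch only shows why an interface-level capture exposes both the addressing fields and the data carried behind them.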
ping is a utility that measures network round trip times using the Internet Control Message Protocol (ICMP). It is commonly used to test if a host is up and responsive on a network. ping can be useful for troubleshooting within AWS. It is important to note that network ACLs, security groups, and operating system firewalls must all be configured to allow ICMP traffic for this tool to be useful. Note that ICMP traffic is not enabled by default on many network devices and operating systems.
traceroute is a utility that discovers the path to a destination IP or hostname. This tool can be helpful in verifying the route that traffic is following through a network. It works by sending probe packets with increasing Time-To-Live (TTL) values (ICMP echo requests with tracert on Windows, UDP datagrams by default with traceroute on Linux). Each router that decrements the TTL to zero returns an ICMP Time Exceeded message, which reveals that hop. Note that not all devices in a network path will send these ICMP responses, so there may not be a value for all hops in the route. In addition, this tool will not provide meaningful results within a VPC or across VPC peering links because each is only one network hop away.
Telnet is a text-based TCP utility. While the default telnet port is 23, telnet can be set to initiate a TCP connection on any user-specified port. This can be very helpful for troubleshooting if a service is running on a port and responding to traffic.
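Where a telnet client is not installed, the same check can be approximated with a few lines of Python. The sketch below assumes only the standard library; the function name `tcp_port_open` and the address in the usage line are hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical target: an instance's private IP and the HTTPS port.
    print(tcp_port_open("10.0.1.25", 443, timeout=1.0))
```

A result of False does not distinguish between a service that is not listening and a security group or network ACL that silently drops the traffic; a timeout rather than an immediate refusal usually suggests the latter.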
nslookup is a command-line utility that resolves hostnames into IP addresses. It can be useful in network troubleshooting to confirm your Domain Name System (DNS) server settings and determine to what IP address a hostname is being resolved.
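The same lookup can be scripted. The following sketch uses the operating system's resolver through Python's standard library, so it reflects the DNS server settings of the host on which it runs; the function name `resolve_ipv4` is illustrative.

```python
import socket
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for a hostname."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # the name did not resolve
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve_ipv4("aws.amazon.com"))
```

Unlike nslookup, this sketch cannot query a specific DNS server; it only shows what the local resolver currently returns.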
In this section, we discuss AWS-native tools for troubleshooting, which provide additional insight to augment traditional troubleshooting tools.
Amazon CloudWatch is a monitoring service for AWS Cloud resources and the applications that you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.
The services shown in Table 13.1 send network metrics to Amazon CloudWatch that can be useful in troubleshooting.
TABLE 13.1 Amazon CloudWatch Metrics
AWS Service | Description |
Amazon EC2 | Sends metrics to Amazon CloudWatch recording the number of bytes and packets in and out of each Amazon EC2 instance |
Amazon VPC Virtual Private Network (VPN) | Sends metrics to Amazon CloudWatch recording tunnel state and bytes in and out |
AWS Direct Connect | Sends metrics to Amazon CloudWatch recording connection state, bits per second egress and ingress, packets per second egress and ingress, Cyclic Redundancy Check (CRC) error count, and connection-light level egress and ingress (only for 10 Gbps port speeds) |
Amazon Route 53 | Sends metrics to Amazon CloudWatch recording health check count, connection time, health check percentage, health check status, Secure Sockets Layer (SSL) handshake time, and time to first byte |
Amazon CloudFront | Sends metrics to Amazon CloudWatch recording requests, bytes downloaded, bytes uploaded, total error rate, 4xx error rate, and 5xx error rate |
Elastic Load Balancing | Sends metrics to Amazon CloudWatch recording healthy host count, 4xx and 5xx load balancer error count, back-end/target response code count (2xx, 3xx, 4xx, and 5xx), and a number of additional metrics |
Amazon Relational Database Service (Amazon RDS) | Sends metrics to Amazon CloudWatch recording network receive and transmit throughput |
Amazon Redshift | Sends metrics to Amazon CloudWatch recording network receive and transmit throughput |
Note: There are many more Amazon CloudWatch metrics that are recorded.
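As an illustration of retrieving one of these metrics programmatically, the sketch below reads the Amazon EC2 NetworkIn metric. It assumes the caller passes in a CloudWatch client such as the one returned by `boto3.client("cloudwatch")`; the function name `network_in_bytes` and its default values are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def network_in_bytes(cloudwatch, instance_id: str, minutes: int = 60, period: int = 300):
    """Return (timestamp, bytes received) pairs for an instance, oldest first.

    `cloudwatch` is assumed to be a boto3 CloudWatch client or any object
    exposing a compatible get_metric_statistics method.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,          # seconds per datapoint
        Statistics=["Sum"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Sum"]) for p in points]
```

With boto3 installed and credentials configured, a call might look like `network_in_bytes(boto3.client("cloudwatch"), "i-0123456789abcdef0")`, where the instance ID is a placeholder. A sudden drop to zero in the returned series is often a quicker signal of a connectivity problem than logging in to the instance.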
Amazon VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC. Flow log data is stored using Amazon CloudWatch Logs. After you have created a flow log, you can view and retrieve its data in Amazon CloudWatch Logs.
Flow logs can help you with a number of tasks, such as troubleshooting why specific traffic is not reaching an instance. If you see that the traffic is logged in flow logs, then you know it is reaching the VPC. In turn, if you notice that the entry has a REJECT action, then it points to a permission issue and can help you diagnose overly restrictive security group rules or network ACLs. You can also use flow logs as a security tool to monitor the traffic that is reaching your instance. Flow logs can also be exported to other services like the Amazon Elasticsearch Service to gain insight into traffic and to enable visualizations.
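A short sketch can show how flow log records lend themselves to this kind of analysis. It assumes records in the default (version 2) space-separated format, whose action field is either ACCEPT or REJECT; the function names and the sample records are illustrative.

```python
# Field order of the default (version 2) flow log record format.
FLOW_LOG_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_log_record(line: str) -> dict:
    """Split one default-format record into a field-name -> string mapping."""
    parts = line.split()
    if len(parts) != len(FLOW_LOG_FIELDS):
        raise ValueError(f"expected {len(FLOW_LOG_FIELDS)} fields, got {len(parts)}")
    return dict(zip(FLOW_LOG_FIELDS, parts))


def rejected_flows(lines, dstport=None):
    """Return parsed records whose action is REJECT, optionally for one port."""
    records = (parse_flow_log_record(line) for line in lines if line.strip())
    return [
        r for r in records
        if r["action"] == "REJECT" and (dstport is None or r["dstport"] == str(dstport))
    ]


sample_records = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
for record in rejected_flows(sample_records):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
```

Filtering for REJECT records on the port in question is usually the fastest way to confirm that a security group or network ACL, rather than the application, is dropping the traffic.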
With AWS Config, you can capture a comprehensive history of your AWS resource configuration changes to simplify troubleshooting of your operational issues. AWS Config can be used to help identify AWS resource changes that may have caused operational issues. AWS Config leverages AWS CloudTrail records to correlate configuration changes to particular events in your account. You can obtain the details of the event Application Programming Interface (API) call that invoked the change (for example, who made the request, at what time, and from which IP address) from the AWS CloudTrail logs.
AWS Trusted Advisor is an online resource to help you reduce cost, increase performance, and improve security by optimizing your AWS environment. AWS Trusted Advisor provides real-time guidance to help you provision your resources following AWS best practices. Among the metrics and recommendations reported are network-related service limits for VPCs, Elastic IP addresses, and load balancers. These service limit metrics can help you quickly identify whether a service limit has been reached, and they allow you to request limit increases proactively.
The AWS Identity and Access Management (IAM) Policy Simulator is a very useful tool in troubleshooting IAM permissions. The simulator evaluates the policies that you choose and determines the effective permissions for each of the actions that you specify. The simulator uses the same policy evaluation engine that is used during real requests to AWS Cloud services.
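The simulator is also available through the IAM API, which makes it possible to script permission checks. The sketch below assumes the caller supplies an IAM client such as `boto3.client("iam")`; the function name `denied_actions` is illustrative, and pagination of large result sets is left out for brevity.

```python
def denied_actions(iam, principal_arn, actions, resources=("*",)):
    """Return {action: decision} for every simulated action that is not allowed.

    `iam` is assumed to be a boto3 IAM client or any object exposing a
    compatible simulate_principal_policy method.
    """
    response = iam.simulate_principal_policy(
        PolicySourceArn=principal_arn,   # user, group, or role to evaluate
        ActionNames=list(actions),       # for example "ec2:CreateVpcPeeringConnection"
        ResourceArns=list(resources),
    )
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }
```

An empty result means that every listed action would be allowed for that principal. An implicitDeny decision points to a missing permission, while explicitDeny points to a policy statement that actively blocks the call.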
There are some common scenarios and technologies that can experience network connectivity issues. A description of each, situations in which they occur, and key points to consider in troubleshooting are discussed in this section.
New VPCs do not have public Internet connectivity by default. User action is required to set up the appropriate requirements for Internet connectivity.
There are five requirements for connectivity to the Internet from an Amazon EC2 instance:
AWS provides a managed Virtual Private Network (VPN) service to allow for easy connectivity between on-premises environments and VPCs. After you have created your VPN, you can download the IP Security (IPsec) VPN configuration from the VPC console to configure the firewall or device in your local network that will connect to the VPN.
The following should be checked if there are issues with traffic over a VPN tunnel:
These steps can be optimized using a top-down approach. For example, if some traffic is traversing a VPN connection and other traffic is not, then you can skip the steps on troubleshooting the VPN tunnel connectivity and start looking at routing and security groups/network ACLs.
In the event that the VPN tunnels are not established, IKE phase 1 followed by IKE phase 2 of the IPsec tunnel should be investigated.
If there are issues establishing an IKE phase 1 connection, then the following should be checked:
If there are issues establishing an IKE phase 2 connection, then the following should be checked:
AWS Direct Connect lets you establish a dedicated network connection between your network and one of the AWS Direct Connect locations. Using industry standard 802.1q Virtual Local Area Networks (VLANs), this dedicated connection can be partitioned into multiple Virtual Interfaces (VIFs). This allows you to use the same connection to access public resources, such as objects stored in Amazon Simple Storage Service (Amazon S3) using public IP address space and private resources such as Amazon EC2 instances running within a VPC using private IP space, all while maintaining network separation between the public and private environments. VIFs can be reconfigured at any time to meet your changing needs.
An AWS Direct Connect connection can either be established directly within an AWS Direct Connect location or extended to your location through an AWS Partner Network (APN) partner. Some APN partners also offer hosted VIFs at sub-1 Gbps speeds. Note that these hosted VIFs are not full AWS Direct Connect connections and only support a single VIF.
The following are items to consider when troubleshooting an AWS Direct Connect connection:
Security groups have an implicit DENY. Unless a rule allowing incoming or outgoing traffic is created, traffic will not flow. This means that even if two instances are in the same subnet, they will not be able to communicate with each other unless a rule allows such traffic. Security groups are also stateful, meaning that when an inbound or outbound rule allows a connection, the return traffic for that connection is allowed automatically.
The following are items to consider when troubleshooting security groups:
Amazon VPC Flow Logs can be useful for troubleshooting security group-related issues. Traffic will be recorded as a rejected packet if there is not a rule in place to allow it.
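The implicit DENY and stateful behavior described in this section can be modeled in a few lines. The following is a simplified teaching model, not the actual AWS implementation: it tracks only inbound rules, deliberately has no outbound rules, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class InboundRule:
    protocol: str       # for example "tcp"
    from_port: int
    to_port: int
    source_cidr: str

    def matches(self, protocol: str, port: int, source_ip: str) -> bool:
        return (
            self.protocol == protocol
            and self.from_port <= port <= self.to_port
            and ip_address(source_ip) in ip_network(self.source_cidr)
        )


class SecurityGroupModel:
    """Toy model: allow rules only, implicit DENY, replies tracked statefully."""

    def __init__(self, inbound_rules):
        self.inbound_rules = list(inbound_rules)
        self._tracked = set()  # flows that an inbound rule has admitted

    def inbound(self, protocol, source_ip, source_port, dest_port) -> str:
        if any(r.matches(protocol, dest_port, source_ip) for r in self.inbound_rules):
            self._tracked.add((protocol, source_ip, source_port, dest_port))
            return "ACCEPT"
        return "REJECT"  # no matching rule: implicit DENY

    def reply(self, protocol, dest_ip, dest_port, source_port) -> str:
        # Return traffic for a tracked flow is allowed without an outbound rule.
        key = (protocol, dest_ip, dest_port, source_port)
        return "ACCEPT" if key in self._tracked else "REJECT"


sg = SecurityGroupModel([InboundRule("tcp", 443, 443, "10.0.0.0/16")])
print(sg.inbound("tcp", "10.0.1.5", 50000, 443))  # matches the HTTPS rule
print(sg.reply("tcp", "10.0.1.5", 50000, 443))    # reply to the tracked flow
print(sg.inbound("tcp", "10.0.1.5", 50001, 22))   # no SSH rule
```

The ACCEPT and REJECT strings mirror the action values that appear in flow log records, which is why a REJECT for expected traffic so often traces back to a missing security group rule.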
By default, the network access control list (ACL) on a VPC is set to allow all inbound and outbound traffic. If network ACLs are set to be more restrictive, care must be taken to allow all required traffic. Note that network ACLs are not stateful like security groups; return traffic must be explicitly permitted with an ALLOW rule. For example, restricting outbound traffic from a subnet to only port 80 and port 443 would also require an inbound ALLOW rule for ephemeral ports (1024-65535) so that responses can return. A good understanding of network traffic flows and of the ports and protocols in use should be established prior to implementing network ACLs on a subnet.
The following are items to consider when troubleshooting network ACLs:
Applications commonly communicate over a number of ports and often require an inbound rule for return traffic. Amazon VPC Flow Logs can be useful for troubleshooting network ACL-related issues. Traffic will be recorded as a rejected packet if there is not a rule to allow it or there is an explicit rule to block it.
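The stateless behavior of network ACLs can be illustrated with a simplified model. The sketch below evaluates rules in ascending rule-number order, returns the action of the first match, and falls back to DENY when nothing matches. It ignores CIDR matching for brevity, and the names `AclRule` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int
    protocol: str
    from_port: int
    to_port: int
    action: str  # "ALLOW" or "DENY"


def evaluate(rules, protocol: str, port: int) -> str:
    """Return the action of the lowest-numbered matching rule, else DENY."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.protocol == protocol and rule.from_port <= port <= rule.to_port:
            return rule.action
    return "DENY"  # the catch-all rule at the end of every network ACL


# Outbound web traffic is allowed, but nothing is allowed inbound yet.
outbound = [AclRule(100, "tcp", 80, 80, "ALLOW"), AclRule(110, "tcp", 443, 443, "ALLOW")]
inbound = []

print(evaluate(outbound, "tcp", 443))   # the request leaves the subnet
print(evaluate(inbound, "tcp", 49152))  # the response is dropped on an ephemeral port

# Adding the ephemeral port range lets the responses return.
inbound.append(AclRule(100, "tcp", 1024, 65535, "ALLOW"))
print(evaluate(inbound, "tcp", 49152))
```

Because each direction is evaluated independently, the request and its response must each find an ALLOW rule, which is exactly the situation the ephemeral port range addresses.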
Routing within a VPC is controlled by a route table attached to each subnet. Note that unless a route table is explicitly associated with a subnet, the main route table for the VPC will be used for each subnet. Knowing the caveats of routing and how routing within VPC works is beneficial to troubleshooting. The following are some common considerations when troubleshooting route tables:
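Route selection itself can be illustrated with a short sketch. When more than one route matches a destination, the most specific route (the longest prefix) is preferred. The example assumes IPv4 routes only, and the route table contents and target identifiers are made up for illustration.

```python
from ipaddress import ip_address, ip_network


def select_route(route_table: dict, destination: str):
    """Return (cidr, target) for the most specific matching route, or None."""
    dest = ip_address(destination)
    candidates = [
        (ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ip_network(cidr)
    ]
    if not candidates:
        return None  # no route: the traffic is dropped
    network, target = max(candidates, key=lambda c: c[0].prefixlen)
    return str(network), target


# Hypothetical route table for a VPC with CIDR 10.0.0.0/16.
routes = {
    "10.0.0.0/16": "local",          # implicit route for the VPC CIDR
    "192.168.0.0/16": "vgw-22222222",
    "0.0.0.0/0": "igw-11111111",
}
for address in ("10.0.3.7", "192.168.4.20", "8.8.8.8"):
    print(address, "->", select_route(routes, address))
```

Working through a destination address this way is a quick check when traffic appears to leave through an unexpected gateway.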
A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses. Instances in either VPC can communicate with each other as if they are within the same network. You can create a VPC peering connection between your own VPCs or with a VPC in another AWS account. The VPCs can be in the same AWS Region or, with inter-region VPC peering, in different Regions.
AWS uses the existing infrastructure of a VPC to create a VPC peering connection; it is neither a gateway nor a VPN connection, and it does not rely on a separate piece of physical hardware. There is no single point of failure for communication or a bandwidth bottleneck.
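One precondition worth checking before requesting a peering connection is that the CIDR blocks of the two VPCs do not overlap, because a peering connection cannot be established between VPCs with overlapping address ranges. The following sketch performs that check with Python's standard library; the function name and sample CIDR blocks are illustrative.

```python
from ipaddress import ip_network


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any addresses."""
    return ip_network(cidr_a).overlaps(ip_network(cidr_b))


print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/20"))  # second block sits inside the first
print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # adjacent, non-overlapping blocks
```

Running the check against every CIDR block associated with each VPC, not just the primary one, avoids a failed peering request later.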
The following are items to consider when troubleshooting VPC peering connections:
AWS Cloud services that reside outside of a VPC are reached through public IP addresses. Instances can access them through a NAT gateway, a public IP address of their own, a proxy server, or an endpoint set up on the VPC (if an endpoint is available for the service).
The following considerations should be checked when there are issues accessing AWS Cloud services:
Amazon CloudFront is a global Content Delivery Network (CDN) service that securely delivers data, videos, applications, and APIs to viewers with low latency and high transfer speeds. Amazon CloudFront is integrated with AWS in two ways: its physical locations are directly connected to the AWS global infrastructure, and its software works seamlessly with other services, including AWS Shield for Distributed Denial of Service (DDoS) mitigation; Amazon S3, Elastic Load Balancing, or Amazon EC2 as origins for your applications; and AWS Lambda to run custom code close to your viewers.
The following are some common items to consider when troubleshooting Amazon CloudFront connectivity problems:
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances. It enables you to achieve fault tolerance in your applications, seamlessly providing the required amount of load balancing capacity needed to route application traffic.
Due to the scalable nature of Elastic Load Balancing, the fleet of AWS-managed Elastic Load Balancing instances will grow and shrink to meet demand. This scaling requires a sufficient number of available IP addresses in the Elastic Load Balancing subnet. Failure to account for the scaling of an Elastic Load Balancing fleet can result in errors and the inability of the load balancer to balance traffic.
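A rough capacity check for a load balancer subnet can be sketched as follows. AWS reserves five IP addresses in every subnet, so the usable count is the subnet size minus five. The `min_free` threshold of eight free addresses is an assumption taken from common Elastic Load Balancing guidance and should be verified against the current documentation; the function names are illustrative.

```python
from ipaddress import ip_network

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one in each subnet


def free_addresses(subnet_cidr: str, addresses_in_use: int) -> int:
    """Return how many addresses remain available in the subnet."""
    usable = ip_network(subnet_cidr).num_addresses - AWS_RESERVED_PER_SUBNET
    return max(usable - addresses_in_use, 0)


def has_scaling_headroom(subnet_cidr: str, addresses_in_use: int, min_free: int = 8) -> bool:
    """Assumed rule of thumb: keep at least `min_free` addresses available."""
    return free_addresses(subnet_cidr, addresses_in_use) >= min_free


print(free_addresses("10.0.1.0/28", 6), has_scaling_headroom("10.0.1.0/28", 6))
print(free_addresses("10.0.2.0/24", 100), has_scaling_headroom("10.0.2.0/24", 100))
```

A /28 subnet has only 11 usable addresses, so a handful of instances can leave too little room for the load balancer to scale.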
The following are some common considerations to take into account when troubleshooting Elastic Load Balancing:
Domain Name System (DNS) provides hostname-to-IP address resolution. AWS provides a DNS server by default within a VPC. (This option can be disabled.) AWS has a managed DNS service called Amazon Route 53 that provides the ability to create public and private hosted zones. Private hosted zones are only available within a VPC, while public hosted zones are globally accessible.
DNS entries have a TTL value for each DNS record within a domain that specifies how long a client can cache the values of a DNS query. These TTL values are only a suggestion, and they can be ignored by caching at intermediary DNS servers, operating systems, and individual applications. For this reason, DNS queries may take some time to resolve to the correct values, even after the TTL period has expired.
The following are items to consider when troubleshooting DNS issues:
The nslookup command is a useful tool for troubleshooting DNS issues to determine to which IP address a hostname resolves.
Every AWS Cloud service has some type of limit. There are hard limits, which cannot be increased, and soft limits, which can be increased with an AWS Support ticket. It is very important to understand these limits. A lot of troubleshooting time can be saved by recognizing when a service limit is the root cause of an issue. Each service’s page on the AWS website lists the limits. A subset of network limits can also be seen in AWS Trusted Advisor:
Many services will show an error in the API response or AWS Management Console when trying to create or allocate resources once a limit has been reached.
In this chapter, you reviewed core concepts of troubleshooting connectivity within AWS and connectivity from AWS to on-premises networks.
Core troubleshooting tools consist of the following:
In this chapter, you also reviewed some common troubleshooting scenarios. The best way to get experience in troubleshooting is to use the tools and address common issues that may arise. It is recommended that you complete the exercises at the end of this chapter in order to gain hands-on experience with network troubleshooting in AWS.
Understand methodologies for troubleshooting. It is important to understand how to troubleshoot common network anomalies that occur and how doing so in a cloud or hybrid environment can be different from on-premises networking.
Understand tools for troubleshooting. In addition to traditional troubleshooting tools, there are a number of AWS tools discussed in this chapter with which you should be familiar.
Understand the conditions required for Internet connectivity. There are five conditions that must be met for connectivity to the Internet from an Amazon EC2 instance:
Understand network ACLs vs. security groups. Security groups are stateful, whereas network ACLs are not. There is an implicit DENY with security groups. Rules must be added to allow network traffic. If network ACLs are used, then care must be taken to ensure that return traffic (whether inbound or outbound) is allowed.
Understand how routing works with Amazon VPC. There is an implicit route within a VPC for its CIDR. All other routes to destinations outside of the CIDR need to be added to the route table. There is a main route table for all subnets when a VPC is initially created, and additional route tables can be added. Each subnet is associated with exactly one route table; however, multiple subnets can share the same route table. More specific routes take precedence.
Understand VPN IPsec and how to troubleshoot. There are two phases to IPsec to establish a VPN tunnel. You should know the requirements for each phase and how to troubleshoot when one or both fail to complete. You should also understand how routing works with VPN tunnels and how a VPN connection works as a standby if an AWS Direct Connect connection is also in use.
Understand AWS Direct Connect and how to troubleshoot. There are a number of requirements that must be completed before traffic can flow over an AWS Direct Connect connection. There is also a difference between a private VIF (connectivity to a VPC) and public VIF (connectivity to public AWS Cloud services). In the case of a hosted VIF, only a single VIF is supported.
Understand VPC peering and valid versus invalid configurations. VPC peering will not be established if there are overlapping or conflicting CIDR addresses. Peering connections are not transitive. Any traffic that is not in the CIDR range of the VPC peer will not flow over the peering connection.
Understand how DNS and Amazon Route 53 work and how to troubleshoot. DNS resolution is provided by default within a VPC by an AWS-managed endpoint. Amazon Route 53 can be used for hosting private zones within a VPC and public zones outside of a VPC. CNAMEs should be used to point to AWS-provided endpoint hostnames.
For further information, refer to the following pages on the AWS website:
The best way to become familiar with troubleshooting is to install and leverage the tools mentioned in this chapter. There is no substitute for the experience that comes from working within the AWS environment, becoming familiar with how networking works, and learning how to work through common troubleshooting situations.
You place an application load balancer in front of two web servers that are stateful. Users begin to report intermittent connectivity issues when accessing the website. Why is the site not responding?
You create a new instance, and you are able to connect over Secure Shell (SSH) to its private IP address from your corporate network. The instance does not have Internet access, however. Your internal policies forbid direct access to the Internet. What is required to enable access to the Internet?
You create a Network Address Translation (NAT) gateway in a private subnet. Your instances cannot communicate with the Internet. What action must you take?
What is not required for Internet connectivity from a public subnet?
You are trying to add two new Virtual Private Cloud (VPC) peering connections to a VPC with 24 existing peering connections. The first connection works fine, but the second connection returns an error message. What should you do?
You created a new endpoint for your Virtual Private Cloud (VPC) that does not have Internet connectivity. Your instance cannot connect to Amazon Simple Storage Service (Amazon S3). What could be the problem?
You recently set up Amazon Route 53 for a private hosted zone for a highly-available application hosted on AWS. After adding a few A records, you notice that the instance hostnames are not resolving within the Virtual Private Cloud (VPC). What actions should be taken? (Choose two.)
You discover that the default Virtual Private Cloud (VPC) has been deleted from region us-east-1 by a coworker in the morning. You will be deploying a lot of new services during the afternoon. What should you do?
You are responsible for your company’s AWS resources. You notice a significant amount of traffic from an IP address range in a foreign country where your company does not have customers. Further investigation of the traffic indicates that the source of the traffic is scanning for open ports on your Amazon Elastic Compute Cloud (Amazon EC2) instances. Which one of the following resources can prevent the IP address from reaching the instances?
Which of the following tools can be used to record the source and destination IP addresses of traffic? (Choose two.)