Networking
This chapter provides an overview of the networking components in IBM Cloud Private. It discusses how communication flows between pods and the external network, and how a pod is exposed to the network.
The chapter has the following sections:
7.1, “Introduction to container networking”
7.2, “Pod network”
7.3, “High availability”
7.4, “Service discovery (kube-dns)”
7.1 Introduction to container networking
A container needs at least one network interface, through which it communicates with other containers or endpoints. Kubernetes builds its own network on top of the physical network to enable communication between pods, nodes, and external systems.
A pod can consist of multiple containers that are co-located on the same host, share the same network stack, and share other resources (such as volumes). From an application developer's point of view, all containers in a Kubernetes cluster appear to be on a single flat subnet.
Figure 7-1 shows the communication flow from the physical interface (eth0) to the docker0 bridge and then to the virtual network.
Each Docker instance creates its own bridge network, which enables pod communication.
Figure 7-1 Overview of pod network
As shown in Figure 7-1, the communication is bidirectional.
The next sections demonstrate the networking concepts in Kubernetes, such as pod networking, load balancing, and ingress.
7.2 Pod network
The pod network makes the pods, and all containers that are associated with them, network addressable. This basic network is implemented by Kubernetes and allows a pod to communicate with other pods as though they were on their own dedicated hosts.
It is important to note that all containers in a single pod share the same port space. If container A uses port 80, you cannot have container B inside the same pod also using port 80. Using the same port only works if the containers are in different pods. An additional component, the pod network namespace, is provided by the Kubernetes pause container, which creates and owns the network namespace.
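As a minimal sketch of this behavior (the image names, commands, and ports are illustrative, not part of IBM Cloud Private), a pod with two containers that share one network namespace must give each container its own port:

apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-example
spec:
  containers:
  - name: web
    image: nginx               # serves on port 80 inside the shared network namespace
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox             # a second server in the same pod must use a different port
    command: ["httpd", "-f", "-p", "8080"]
    ports:
    - containerPort: 8080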
Another component that is present on the pod network is the container network interface (CNI). This consists of a plug-in that connects the pod network namespace to the rest of the pods in the Kubernetes cluster. The most widely used CNI plug-ins in IBM Cloud Private are Calico and NSX-T.
7.2.1 Calico
Calico enables networking and network policy in Kubernetes clusters across the cloud. It combines flexible networking capabilities with run-anywhere security enforcement, providing performance similar to native Linux kernel networking and enabling true cloud-native scalability. Calico is implemented without encapsulation or overlays, providing high-performance networking. It also provides a network security policy for Kubernetes pods through its distributed firewall.
When using Calico, its policy engine enforces the same policy model at the host networking layer (and at the service mesh layer if Istio is used), helping to protect the infrastructure from compromised workloads and the workloads from a compromised infrastructure.
Because Calico uses the Linux Kernel’s forwarding and access control, it provides a high-performance solution without the resources used by encapsulation and decapsulation.
Calico creates a flat layer 3 network and assigns a fully routable IP address to every pod. To do that, it divides a large CIDR (Classless Inter-Domain Routing) network into smaller blocks of IP addresses and assigns one or more of these smaller blocks to the nodes in the cluster. This configuration is specified at IBM Cloud Private installation time by using the network_cidr parameter in config.yaml, in CIDR notation.
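As a sketch, the relevant fragment of config.yaml might look like the following; the address ranges are illustrative and must not overlap with the underlay network:

# Fragment of cluster/config.yaml (CIDR values are illustrative)
network_cidr: 10.1.0.0/16             # pod network that Calico splits into per-node blocks
service_cluster_ip_range: 10.0.0.1/24 # ClusterIP service network, shown for context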
By default, Calico creates a BGP (Border Gateway Protocol) mesh between all nodes of the cluster and broadcasts the routes for container networks to all of the worker nodes. Each node is configured to act as a layer 3 gateway for the subnet assigned to it and serves connectivity to the pod subnets hosted on that node.
All nodes participate in the BGP mesh, which advertises the local routes that the worker node owns to the rest of the nodes. BGP peers external to the cluster can participate in this mesh as well, but the size of the cluster affects how many BGP advertisements these external peers will receive. Route reflectors can be required when the cluster scales past a certain size.
When routing pod traffic, Calico uses system capabilities such as the node's local route tables and iptables. All pod traffic traverses the iptables rules before it is routed to its destination.
Calico maintains its state using an etcd key/value store. By default, in IBM Cloud Private Calico uses the same etcd key/value store as Kubernetes to store the policy and network configuration states.
Calico can be configured to allow pods to communicate with each other with or without IP-in-IP tunneling. IP-in-IP adds an additional header to all packets as part of the encapsulation, but it allows containers to communicate over the overlay network across almost any non-overlapping underlay network. In environments where the underlay subnet address space is constrained and there is no access to add additional IP pools, such as on some public clouds, IP-in-IP can be a good fit.
However, in environments that do not require an overlay, IP-in-IP tunneling should be disabled to remove the packet encapsulation overhead and to enable any physical routing infrastructure to do packet inspection for compliance and audit. In such scenarios, the underlay network must be made aware of the additional pod subnets by adding the underlay network routers to the BGP mesh. See the discussion in “Calico network across different network segments”.
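At the Calico level, disabling encapsulation corresponds to an IP pool without IP-in-IP, roughly as sketched below. The CIDR value is illustrative; in IBM Cloud Private the setting is normally driven from config.yaml at installation time rather than by editing the pool directly.

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.1.0.0/16    # must match the cluster's network_cidr
  ipipMode: Never      # no encapsulation; the underlay must be able to route pod subnets
  natOutgoing: true    # SNAT pod traffic that leaves the pod network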
Calico components
The Calico network is created by the following three components: the calico/node agent, the calico/cni plug-in, and the calico/kube-controller.
Calico/node agent
This entity has three components: felix, bird, and confd.
Felix’s primary responsibility is to program the host’s iptables and routes to provide the desired connectivity to and from the pods on that host.
Bird is an open source BGP agent for Linux that is used to exchange routing information between the hosts. The routes that are programmed by felix are picked up by bird and distributed among the cluster hosts.
Confd monitors the etcd data store for changes to the BGP configuration, such as IPAM information and AS number, and accordingly changes the bird configuration files and triggers bird to reload these files on each host. The calico/node creates veth-pairs to connect the Pod network namespace with the host’s default network namespace.
Calico/cni
The CNI plug-in provides the IP address management (IPAM) functionality by provisioning IP addresses for the Pods hosted on the nodes.
Calico/kube-controller
The calico/kube-controller watches Kubernetes NetworkPolicy objects and keeps the Calico data store in sync with Kubernetes. The calico/node agent running on each node uses the information in the Calico etcd data store and programs the local iptables rules accordingly.
Calico network across different network segments
When nodes are on different network segments, they are connected using a router in the underlay infrastructure network. The traffic between two nodes on different subnets happens through the router, which is the gateway for the two subnets. If the router is not aware of the pod subnet, it will not be able to forward the packets between the hosts.
There are two approaches that can be used for Calico communication across network segments:
Calico can be configured to create IP-in-IP tunnel endpoints on each node for every subnet hosted on the node. Any packet that originates from a pod and egresses the node is encapsulated with an IP-in-IP header that uses the node IP address as the source. This way, the infrastructure router does not see the pod IP addresses.
IP-in-IP tunneling introduces extra overhead in terms of network throughput and latency, due to the additional packet headers and the processing at each endpoint to encapsulate and decapsulate packets.
On bare metal, the overhead is not significant, because certain network operations can be offloaded to the network interface cards. However, on virtual machines the overhead can be significant, and it is also affected by the number of CPU cores and the network I/O technologies configured and used by the hypervisors. The additional packet encapsulation overhead might also be significant when smaller MTU (maximum transmission unit) sizes are used, because it might introduce packet fragmentation; jumbo frames should be enabled whenever possible.
The second option is to make the infrastructure router aware of the pod network. This can be done by enabling BGP on the router and adding the nodes in the cluster as BGP peers to it. With this in place, the router and the hosts can exchange route information with each other. The size of the cluster might come into play in this scenario, because in the BGP mesh every node in the cluster is peering with the router.
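As a sketch of this second option, a global Calico BGPPeer resource similar to the following makes every node peer with the infrastructure router; the peer IP address and AS number are placeholders for the router's real values:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: infrastructure-router
spec:
  peerIP: 192.0.2.1    # placeholder address of the underlay router
  asNumber: 64512      # placeholder AS number configured on the router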
7.2.2 NSX-T
NSX-T is a network virtualization and security platform that automates the implementation of network policies, network objects, network isolation, and micro-segmentation.
Figure 7-2 shows an overview of NSX-T.
Figure 7-2 NSX-T configuration
NSX-T network virtualization for Kubernetes
The NSX-T network virtualization for Kubernetes consists of L2 and L3 segregation, micro segmentation, and NAT pools.
L2 and L3 segregation
NSX-T creates a separate L2 switch (virtual distributed switch, or VDS) and L3 router (distributed logical router, or DLR) for every namespace. The namespace level router is called a T1 router. All T1 routers are connected to the T0 router, which acts like an edge gateway to the IBM Cloud Private cluster, as well as an edge firewall and load balancer. Due to separate L2 switches, all the broadcast traffic is confined to the namespace. In addition, due to the separate L3 router, each namespace can host its own pod IP subnet.
Micro segmentation
NSX-T provides a distributed firewall (DFW) for managing east-west traffic. Kubernetes network policies are converted into NSX-T DFW rules. With L2 segmentation, dedicated L3 subnets for namespaces, and network policies, micro segmentation can be achieved within and across namespaces.
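For example, a standard Kubernetes network policy such as the following sketch (the namespace and labels are illustrative) is translated by NSX-T into DFW rules that allow the selected back-end pods to receive traffic only from front-end pods on port 8080:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: shop          # illustrative namespace
spec:
  podSelector:
    matchLabels:
      app: backend         # the pods this policy protects
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend    # only these pods may connect
    ports:
    - protocol: TCP
      port: 8080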
NAT pools
The edge appliance is an important component of the NSX-T management cluster. It offers routing, firewall, load balancing, and network address translation, among other features. By creating pods on the NSX-T pod network (and not relying on the host network), all traffic can traverse the edge appliance and use its firewall, load balancing, and network address translation capabilities.
The edge appliance assigns SNAT (Source Network Address Translation) IPs to the outbound traffic, and DNAT (Destination Network Address Translation) IPs to the inbound traffic from the NAT pool (created as part of the NSX-T deployment). By relying on the network address translation, the cluster node IPs are not exposed on the outbound traffic.
7.3 High availability
For high availability, master and proxy nodes should be deployed redundantly (at different physical locations, if possible) to tolerate hardware failures and network partitions. The following sections discuss options for network high availability in IBM Cloud Private.
7.3.1 External load balancer
If possible, a highly available external load balancer should be used to spread traffic among the separate master or proxy node instances in the cluster. The external load balancer can be specified either as a DNS name or as an IP address, using cluster_lb_address in config.yaml at installation time. The cluster_CA_domain and any TLS certificates should be configured as a CNAME (Canonical Name record) or an A record pointing at the external load balancer DNS name or IP address. In addition, all nodes in the cluster must be able to resolve this name for internal communication.
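A config.yaml sketch for this topology follows; the host names are illustrative placeholders for records defined in the site's DNS:

# Fragment of cluster/config.yaml (host names are illustrative)
cluster_CA_domain: icp.mydomain.com       # CNAME or A record that points at the load balancer
cluster_lb_address: icp-lb.mydomain.com   # external load balancer in front of the master nodes
proxy_lb_address: icp-proxy.mydomain.com  # external load balancer in front of the proxy nodes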
When using an external load balancer, the master load balancer should monitor the Kubernetes API server port 8001 for health on all master nodes, and the load balancer needs to be configured to accept connections on, and forward traffic to, the following ports:
8001 (Kubernetes API server)
8443 (platform UI)
9443 (authentication service)
8500 and 8600 (private image registry)
When using an external load balancer, each master node can be in different subnets if the round-trip network time between the master nodes is less than 33 ms for etcd. Figure 7-3 on page 263 illustrates the load balancer option.
Figure 7-3 Load balancer in an IBM Cloud Private environment
7.3.2 Virtual IP addresses
In case a load balancer is not available, high availability of the master and proxy nodes can be achieved using a virtual IP address, which is in a subnet shared by the master/proxy nodes. IBM Cloud Private supports three types of virtual IP management solutions:
etcd (default)
ucarp
keepalived
This setting is made once, as part of the IBM Cloud Private installation, using the vip_manager setting in config.yaml. For ucarp and keepalived, the advertisements happen on the management interface, and the virtual IP is held on the interface provided by cluster_vip_iface and proxy_vip_iface. In situations where the virtual IP will be accepting a high load of client traffic, the management network performing the advertisements for master election should be separate from the data network accepting client traffic.
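A config.yaml sketch of a virtual IP configuration follows; the addresses and interface names are illustrative:

# Fragment of cluster/config.yaml (addresses and interfaces are illustrative)
vip_manager: etcd            # etcd (default), ucarp, or keepalived
cluster_vip: 192.168.10.50   # virtual IP address for the master nodes
cluster_vip_iface: eth0      # interface that holds the master virtual IP
proxy_vip: 192.168.10.51     # virtual IP address for the proxy nodes
proxy_vip_iface: eth0        # interface that holds the proxy virtual IP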
 
Note: Considerations when using a virtual IP address are as follows:
At any point of time, only one master or proxy node holds the lease for the virtual IP address.
Using a virtual IP, traffic is not load balanced among all replicas.
Using a virtual IP requires that all candidate nodes use a cluster_vip_iface or proxy_vip_iface interface on the same subnet.
Any long-running and stateful TCP connections from clients will be broken during a failover and must be reestablished.
 
The etcd solution
Etcd is a distributed key/value store used internally by IBM Cloud Private to store state information. It uses a distributed consensus algorithm called Raft. The etcd-based VIP manager uses the distributed key/value store to control which master or proxy node holds the virtual IP address. The virtual IP address is leased to the leader, so all traffic is routed to that master or proxy node.
The etcd virtual IP manager is implemented as an etcd client that uses a key/value pair. The current master or proxy node holding the virtual IP address acquires a lease to this key/value pair with a TTL of 8 seconds. The other standby master or proxy nodes observe the lease key/value pair.
If the lease expires without being renewed, the standby nodes assume that the first master has failed and attempt to acquire their own lease to the key to become the new master node. The node that successfully writes the key brings up the virtual IP address. The algorithm uses a randomized election timeout to reduce the chance of a race condition where more than one node tries to become the leader of the cluster.
 
Note: Gratuitous ARP is not used with the etcd virtual IP manager when it fails over. Therefore, any existing client connections to the virtual IP address will fail after a failover until the client’s ARP cache has expired and the MAC address of the new holder of the virtual IP is learned. However, the etcd virtual IP manager avoids the use of multicast, which ucarp and keepalived require.
The ucarp solution
Ucarp is an implementation of the common address redundancy protocol (CARP) ported to Linux. Ucarp allows the master node to “advertise” that it owns a particular IP address using the multicast address 224.0.0.18.
Every few seconds, each node sends out an advertisement message on its network interface stating that it can hold the virtual IP address; this interval is called the advertisement base (advbase). Each master node also sends a skew value with that CARP message, the advertisement skew (advskew), which acts like a priority for holding the IP address. If two or more systems advertise at one-second intervals (advbase=1), the one with the lower advskew wins.
Any ties are broken by the node that has the lower IP address. For high availability, moving one address between several nodes in this manner enables you to survive the outage of a host, but remember that this only makes the service more available, not more scalable.
A node becomes the master node if one of the following conditions occurs:
No one else advertises for 3 times its own advertisement interval (advbase).
The --preempt option is specified by the user, and it “hears” a master with a longer (advertisement) interval (or the same advbase but a higher advskew).
An existing master node becomes a backup if one of the following conditions occurs:
Another master advertises a shorter interval (or the same advbase, but a lower advskew).
Another master advertises the same interval, and has a lower IP address.
After failover, ucarp sends a gratuitous ARP message to all of its neighbors so that they can update their ARP caches with the new master’s MAC address.
The keepalived solution
Keepalived provides simple and robust facilities for load balancing and high availability, and was originally used for high availability of virtual routers. Keepalived uses VRRP (Virtual Router Redundancy Protocol) as an election protocol to determine which master or proxy node holds the virtual IP. The keepalived virtual IP manager implements a set of checkers to dynamically and adaptively maintain and manage a load-balanced server pool according to its health.
VRRP is a fundamental building block for failover. The keepalived virtual IP manager implements a set of hooks into the VRRP finite state machine, providing low-level and high-speed protocol interactions.
To ensure stability, the keepalived daemon is split into the following parts:
A parent process, known as the watchdog, that is in charge of monitoring the forked child processes.
A child process for VRRP.
Another child process for health checking.
The keepalived configuration included with IBM Cloud Private uses the multicast address 224.0.0.18 and IP protocol number 112. This must be allowed in the network segment where the master advertisements are made. Keepalived also generates a password for authentication between the master candidates which is the MD5 sum of the virtual IP.
Keepalived by default uses the final octet of the virtual IP address as the virtual router ID (VRID). For example, for a virtual IP address of 192.168.10.50, it uses VRID 50. If there are any other devices using VRRP on the management layer 2 segment that are using this VRID, it might be necessary to change the virtual IP address to avoid conflicts.
7.3.3 Ingress controller
Ingress resources in Kubernetes are used to proxy layer 7 traffic to containers in the cluster. An ingress is a collection of rules to allow inbound connections to the Kubernetes cluster services. It can be configured to give Kubernetes services externally reachable URLs, terminate TLS connections, offer name-based virtual hosting, and more.
 
Note: Ingress controller in an IBM Cloud Private environment is also known as the proxy node.
Ingress resources require an ingress controller component to be running as a Layer 7 proxy service inside the cluster. In IBM Cloud Private, an nginx-based ingress controller is provided by default that is deployed on the proxy or master (in case master acts as proxy) nodes. The default ingress controller watches Kubernetes ingress objects on all namespaces through the Kubernetes API and dynamically programs the nginx proxy rules for upstream services based on the ingress resource. By default, the ingress controller is bootstrapped with some load balancing policies, such as load-balancing algorithms and a back-end weight scheme.
More than one ingress controller can also be deployed if isolation between namespaces is required. The ingress controller itself is a container deployment that can be scaled out and is exposed on a host port on the proxy nodes. The ingress controller can proxy all of the pod and service IP mesh running in the cluster.
The IBM Cloud Private installation defines a dedicated node role, the proxy node, for running the shared IBM Cloud Private ingress controller. These nodes serve as a layer 7 reverse proxy for the workloads running in the cluster. In situations where an external load balancer can be used, that is the suggested configuration, because it can be difficult to secure and scale proxy nodes, and using a load balancer avoids additional network hops through the proxy nodes to the pods running the actual application.
If you are planning to use an external load balancer, set up the cluster to label the master nodes as proxy nodes by using the hosts file before installation. This marks the master nodes with the additional proxy role, as shown in Example 7-1, and the shared ingress controller is started on the master nodes. This ingress controller can generally be ignored for “northbound” traffic, or used for lightweight applications exposed “southbound”, such as additional administrative consoles for some applications that are running in the cluster.
Example 7-1 The hosts file
[master]
172.21.13.110
172.21.13.111
172.21.13.112
 
[proxy]
172.21.13.110
172.21.13.111
172.21.13.112
If an ingress controller and ingress resources are required to aggregate several services using the built-in ingress resources, a good practice is to install additional isolated ingress controllers for the namespace by using the included Helm chart, and to expose these individually through the external load balancer.
Single service ingress
It is possible to expose a single service through ingress. In Example 7-2, a Node.js server was created with service name mynode-ibm-nodejs-sample on port 3000.
Example 7-2 Ingress controller configuration
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingressexample
spec:
  backend:
    serviceName: mynode-ibm-nodejs-sample
    servicePort: 3000
In this case, all traffic on the ingress controller’s address and port (80 or 443) will be forwarded to this service.
Simple fanout
With the simple fanout approach you can define multiple HTTP services at different paths and provide a single proxy that routes to the correct endpoints in the back end. When there is a highly available load balancer managing the traffic, this type of ingress resource will be helpful in reducing the number of load balancers to a minimum.
In Example 7-3, “/” is the rewrite target for two services: employee-api on port 4191 and manager-api on port 9090. The context root for both of these services is /; the ingress rewrites the paths /hr/employee/* and /hr/manager/* to / when proxying requests to the back ends.
Example 7-3 Example of rewriting
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
  name: api
spec:
  rules:
  - host: icpdemo.mydomain.com
    http:
      paths:
      - backend:
          serviceName: employee-api
          servicePort: 4191
        path: /hr/employee/*
      - backend:
          serviceName: manager-api
          servicePort: 9090
        path: /hr/manager/*
Example 7-3 demonstrates that it is possible to expose multiple services and rewrite the URI.
Name-based virtual hosting
Name-based virtual hosting provides the capability to host multiple applications using the same ingress controller address. This kind of ingress routes HTTP requests to different services based on the Host header.
In Example 7-4 and Example 7-5 on page 268, two Node.js servers are deployed. The console for the first service can be accessed using the host name myserver.mydomain.com and the second using superserver.mydomain.com. In DNS, myserver.mydomain.com and superserver.mydomain.com can either be an A record for the proxy node virtual IP 10.0.0.1, or a CNAME for the load balancer forwarding traffic to where the ingress controller is listening.
Example 7-4 First service
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
  name: myserver
spec:
  rules:
  - host: myserver.mydomain.com
    http:
      paths:
      - backend:
          serviceName: nodejs-myserver
          servicePort: 3000
Deploying the second service is shown in Example 7-5.
Example 7-5 Second service
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
  name: mysuperserver
spec:
  rules:
  - host: superserver.mydomain.com
    http:
      paths:
      - backend:
          serviceName: nodejs-superserver
          servicePort: 3000
It is usually a good practice to provide a value for the host, because the default is *, which forwards all requests to the back end.
Transport Layer Security (TLS)
An ingress can be secured by using a TLS private key and certificate. The TLS private key and certificate must be defined in a secret with the key names tls.key and tls.crt. The ingress assumes TLS termination at the ingress controller, and TLS traffic is proxied only on port 443. Example 7-6 shows how to configure this.
Example 7-6 TLS configuration
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
  name: api
spec:
  rules:
  - host: myserver.mydomain.com
    http:
      paths:
      - backend:
          serviceName: main-sales-api
          servicePort: 4191
        path: /api/sales/*
      - backend:
          serviceName: backorder-api
          servicePort: 9090
        path: /api/backorder/*
  tls:
  - hosts:
    - myserver.mydomain.com
    secretName: api-tls-secret
In Example 7-6, the TLS termination is added to the myserver.mydomain.com ingress resource.
The certificate’s subject name or subject alternative names (SANs) should match the host value in the ingress resource, and the certificate should not be expired. In addition, the full certificate chain (including any intermediate and root certificates) should be trusted by the client. Otherwise, the application gives a security warning during the TLS handshake. In the example, the tls.crt subject name either contains myserver.mydomain.com or is a wildcard certificate for *.mydomain.com. The DNS entry for myserver.mydomain.com is an A record for the proxy nodes’ virtual IP address.
The secret api-tls-secret is created in the same namespace as the ingress resource using the command:
kubectl create secret tls api-tls-secret --key=/path/to/tls.key --cert=/path/to/tls.crt
The secret can also be created declaratively in a YAML file if the TLS key and certificate payloads are base64-encoded, as shown in Example 7-7.
Example 7-7 TLS secret
apiVersion: v1
kind: Secret
type: kubernetes.io/tls
metadata:
  name: api-tls-secret
data:
  tls.crt: <base64-encoded cert>
  tls.key: <base64-encoded key>
Shared ingress controller
By default in IBM Cloud Private, a global ingress controller is installed and deployed on all proxy nodes. This provides the capability to define the ingress resources for applications across all namespaces. The global ingress controller runs in the kube-system namespace; if a NetworkPolicy is used to isolate namespace traffic, another one needs to be created to allow traffic from the ingress controller to any proxied back-end services in other namespaces.
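As a sketch of such a policy (the application namespace is illustrative, and the namespace selector assumes that the kube-system namespace has been labeled, for example with ns=kube-system), traffic from the shared ingress controller can be admitted as follows:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-shared-ingress
  namespace: myapp             # illustrative application namespace
spec:
  podSelector: {}              # applies to all pods in the namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          ns: kube-system      # assumes kube-system carries this label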
Advantages:
A common ingress controller reduces compute resources required to host applications.
Ready for use.
Disadvantages:
All client traffic passes through a shared ingress controller. One service’s client traffic can affect the other.
Limited ability to isolate northbound ingress resource traffic from southbound ingress traffic; for example, a public-facing API and an operations dashboard running in the same cluster would share the same ingress controller.
If an attacker were to gain access to the ingress controller, they would be able to observe unencrypted traffic for all proxied services.
Different ingress resource documents must be maintained for different stages; for example, multiple copies of the same ingress resource YAML file with different namespace fields.
The ingress controller needs access to read ingress, service, and pod resources in every namespace in the Kubernetes API to implement the ingress rules.
Isolated ingress controllers per namespace
An ingress controller can be installed as a Helm chart in an isolated namespace and perform ingress for services in that namespace. In this deployment type, the ingress controller is given a role that can access only the ingress and related resources in that namespace.
Advantages:
Delineation of ingress resources for the various stages of development and production.
Performance for each namespace can be scaled individually.
Traffic is isolated; when combined with isolated worker nodes on separate VLANs, true Layer 2 isolation can be achieved as the upstream traffic does not leave the VLAN.
Continuous integration and continuous delivery teams can use the same ingress resource document to deploy (assuming that the dev namespace is different from the production namespace) across various stages.
Disadvantages:
Additional ingress controllers must be deployed, using extra resources.
Ingress controllers in separate namespaces might require either a dedicated node or a dedicated external load balancer.
7.4 Service discovery (kube-dns)
Kubernetes expects that a service should be running within the pod network mesh that performs name resolution and should act as the primary name server within the cluster. In IBM Cloud Private, this is implemented using CoreDNS running on the master nodes, which resolves names for all services running in Kubernetes. In addition, the service forwards name lookups against upstream name servers on behalf of containers. The DNS service itself runs as a ClusterIP service that is backed by one or more containers for high availability. See Figure 7-4.
Figure 7-4 Kube-dns
Kubernetes service names are resolved to ClusterIPs representing one or more pods matching a label selector. The cluster is assigned a cluster domain that is specified at installation time that uses cluster_domain (this is cluster.local by default) to distinguish between names local to the cluster and external names.
Each Kubernetes cluster is logically separated into namespaces, and each namespace acts as a subdomain for name resolution. Upon examining a container’s /etc/resolv.conf, observe that the name server line points at an IP address internal to the cluster, and the search suffixes are generated in a particular order, as shown in Example 7-8.
Example 7-8 The /etc/resolv.conf
# cat /etc/resolv.conf
nameserver <kube-dns ClusterIP>
search <namespace>.svc.<cluster_domain> svc.<cluster_domain> <cluster_domain> <additional ...>
options ndots:5
The <additional ...> is a list of search suffixes obtained from the worker node’s /etc/resolv.conf file. By default, a short host name like account-service has <namespace>.svc.<cluster_domain> appended to it, so a pod matching the label selector that is running in the same namespace as the running pod will be selected. A pod can look up the ClusterIP of a pod in a different namespace by appending the namespace to the host name.
For example, account-service.prod will target account-service running in the prod namespace, because the search suffix svc.<cluster_domain> is appended to the end. Figure 7-5 shows how the naming segmentation works.
Figure 7-5 KubeDNS with namespace segmentation
Note the last line in /etc/resolv.conf, options ndots:5, which indicates to the container’s system resolver that any host names being resolved that have fewer than five dots in the name should have the search domain suffixes appended to them. This might affect performance, because lookups of external network resources with fewer than five dots in the name result in lookups for every entry in the search line.
For example, a lookup of www.ibm.com results in lookups of www.ibm.com.<namespace>.svc.<cluster_domain>, www.ibm.com.svc.<cluster_domain>, www.ibm.com.<cluster_domain>, and so on before finally trying www.ibm.com. To resolve this issue, adding an additional period (“.”) to the end of the fully qualified domain names used in the application configuration will prevent the system resolver from cycling through the list of suffixes in the name lookups (for example www.ibm.com.).
7.4.1 Headless services
In some cases, it is desirable not to create a ClusterIP at all; this can be achieved by specifying clusterIP: None in the service specification. This creates A records in DNS for each pod matching the label selector, but no ClusterIP. This is typically used with StatefulSets, where each of the pods needs to have a resolvable name for communication between all pods in the set (for example, a clustered database such as MongoDB). When a service without a ClusterIP is created, each pod’s A record is in the following format:
<pod-name>.<service-name>.<namespace>.svc.<cluster-domain>
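A minimal headless service sketch follows; the service name, label selector, and port are illustrative and would typically be referenced by a StatefulSet’s serviceName field:

apiVersion: v1
kind: Service
metadata:
  name: mongodb
spec:
  clusterIP: None      # headless: per-pod A records are created, no ClusterIP is allocated
  selector:
    app: mongodb
  ports:
  - port: 27017

With a StatefulSet named mongodb that uses this service, the first replica would be resolvable as mongodb-0.mongodb.<namespace>.svc.<cluster-domain>.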
7.4.2 External services
It is possible to have Kubernetes proxy endpoints outside of the cluster by creating a Service resource with no label selector, and by either creating Endpoints resources manually containing the IPs outside of the cluster to proxy, or creating a Service resource with the type ExternalName containing an external DNS name, which creates a CNAME record in the cluster’s DNS. By using these functions, the cluster DNS can be leveraged as service discovery for services both inside and outside of the cluster.
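As a sketch, an ExternalName service that maps a cluster-local name to an external host (the names are illustrative) looks like the following; pods in the same namespace can then reach the external system simply as billing-db:

apiVersion: v1
kind: Service
metadata:
  name: billing-db
spec:
  type: ExternalName
  externalName: db01.external.mydomain.com   # illustrative external DNS name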

1 Some of the content in this chapter is based on the following GitHub document created by Rachappa Goni, Eswara Kosaraju, Santosh Ananda, and Jeffrey Kwong.