The Google Cloud Professional Cloud Architect exam will test your ability to understand technical requirements that are explicitly stated, as well as implied, in case studies and questions. Technical requirements may specify a particular hardware or software constraint. For example, an application may need to use a MySQL 8.0 database or be able to transmit 1 GB of data between an on-premises data center and Google Cloud. Technical requirements do not necessarily specify all the details that you will need to know. If a question states that a virtual private cloud will have three subnets, then you will have to infer from that statement that the subnets will need to be configured with distinct, nonoverlapping address spaces. It is common for questions about technical requirements to ask you to choose among multiple solutions, and to do so you must understand some unstated implication of the requirement.
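The subnet inference described above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the CIDR ranges and the helper name `find_overlaps` are hypothetical and chosen only for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR ranges whose address spaces overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical address plans for a VPC with three subnets.
valid_plan = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
invalid_plan = ["10.0.0.0/24", "10.0.0.128/25", "10.0.2.0/24"]

print(find_overlaps(valid_plan))    # no overlapping pairs
print(find_overlaps(invalid_plan))  # the first two ranges overlap
```

An empty result means the plan satisfies the implied requirement; any returned pair identifies ranges that would have to be changed.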
In this chapter, we will consider three broad categories of technical requirements: high availability, scalability, and reliability.
We will use the case studies as jumping-off points for discussing these kinds of requirements. We will consider how each of these factors influences the choices we make about compute, storage, networking, and specialized services.
The most important piece of information to take away from this chapter is that availability, scalability, and reliability are not just important at the component or subsystem level but across the entire application infrastructure. Highly reliable storage systems will not confer high reliability on a system if the networking or compute services are not reliable.
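To see why one highly reliable component cannot compensate for weaker ones, consider a rough model in which a request succeeds only if storage, compute, and networking are all up. The sketch below multiplies the component availabilities together. It assumes failures are independent, which is a simplification, and the figures are hypothetical.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a system that needs every component to be up.

    Assumes component failures are independent, which is a simplification.
    """
    return prod(component_availabilities)


# Hypothetical figures: very reliable storage, weaker compute and network.
storage, compute, network = 0.99999, 0.999, 0.99
overall = serial_availability([storage, compute, network])
print(f"{overall:.5f}")
```

Under this model the overall figure is always lower than that of the weakest component, no matter how reliable the storage layer is.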
High availability is the continuous operation of a system at sufficient capacity to meet the demands of ongoing workloads. Availability is usually measured as the percentage of time that a system is available and responding to requests with latency not exceeding a specified threshold. Table 3.1 shows the amount of allowed downtime at various service-level agreement (SLA) levels. An application with a 99 percent availability SLA can be down for 14.4 minutes per day, while a system with a 99.999 percent availability SLA can be down for less than one second per day without violating the SLA.
TABLE 3.1 Example availability SLAs and corresponding downtimes
Percent Uptime | Downtime/Day | Downtime/Week | Downtime/Month |
---|---|---|---|
99.00 | 14.4 minutes | 1.68 hours | 7.31 hours |
99.90 | 1.44 minutes | 10.08 minutes | 43.83 minutes |
99.99 | 8.64 seconds | 1.01 minutes | 4.38 minutes |
99.999 | 864 milliseconds | 6.05 seconds | 26.3 seconds |
99.9999 | 86.4 milliseconds | 604.8 milliseconds | 2.63 seconds |
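The values in Table 3.1 follow from simple arithmetic: allowed downtime is the fraction of time not covered by the SLA multiplied by the length of the period. The sketch below reproduces them, assuming an average month of 365.25 / 12 days; that month length is an assumption that happens to match the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 365.25 / 12  # average month length, an assumption


def allowed_downtime_seconds(percent_uptime, period_seconds):
    """Seconds of downtime permitted in a period at a given uptime percent."""
    return (1 - percent_uptime / 100) * period_seconds


for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    day = allowed_downtime_seconds(pct, SECONDS_PER_DAY)
    week = allowed_downtime_seconds(pct, 7 * SECONDS_PER_DAY)
    month = allowed_downtime_seconds(pct, DAYS_PER_MONTH * SECONDS_PER_DAY)
    print(f"{pct}%: {day:.3f} s/day, {week:.3f} s/week, {month:.3f} s/month")
```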
High availability SLAs, such as these, must account for the fact that hardware and software fail. An individual physical component, such as a disk drive running in a particular disk array, may have a small probability of failing in a one-month period. If you are using thousands of drives, then it is much more likely that at least one of them will fail.
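The effect of fleet size can be made concrete. If each drive fails independently with probability p in a month, the chance that at least one of n drives fails is 1 - (1 - p)^n. The sketch below uses a hypothetical value of p; it is not a published failure rate.

```python
def prob_at_least_one_failure(p_single, n_components):
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p_single) ** n_components


# Hypothetical monthly failure probability for one drive.
p_drive = 0.001
for n in (1, 100, 1000, 5000):
    print(n, round(prob_at_least_one_failure(p_drive, n), 4))
```

Even with a small per-drive probability, the chance of seeing some failure grows quickly as the number of drives increases.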
When designing high availability applications, you have to plan for failures. Failures can occur at multiple points in an application stack:
We can mitigate the risk of hardware failures, in part, with redundancy. Instead of writing data to one disk, we write it to three disks. Rather than have a single server running an application, we create instance groups with multiple servers and load balance workload among them. We install two direct network connections between our data center and Google Cloud, preferably with two different telecommunication vendors. Redundancy is also a key element of ensuring scalability, but redundancy alone is not enough: continued availability also requires autohealing or other automated repair mechanisms.
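The benefit of redundancy can be estimated with the same kind of arithmetic. If any one of several independent copies is enough to serve requests, the service is unavailable only when all copies are down at once. The following sketch assumes independent failures and a hypothetical single-copy availability of 99 percent.

```python
def redundant_availability(single_availability, copies):
    """Availability when any one of several independent copies suffices."""
    return 1 - (1 - single_availability) ** copies


# Hypothetical single-server availability of 99 percent.
for copies in (1, 2, 3):
    print(copies, f"{redundant_availability(0.99, copies):.6f}")
```

In practice failures are often correlated, for example when all copies share a zone, which is one reason to spread redundant resources across zones or vendors.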
We compensate for software and configuration errors with software engineering and DevOps best practices. Code reviews, multiple levels of testing, and running new code in a staging environment can help identify bugs before code is released to production. Canary deployments, in which a small portion of a system's workload is routed to a new version of the software, allow us to test code under production conditions without exposing all users to new code. If there is a problem with the new version of software, it will affect only a portion of the users before it is rolled back. Automating infrastructure deployments, by treating infrastructure as code, reduces the need for manual procedures and the chance to make a mistake when entering commands.
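A canary deployment needs some rule for deciding which requests reach the new version. One common approach, sketched below, hashes a user identifier into a bucket so that each user consistently sees the same version. The function name, the user IDs, and the 5 percent split are all hypothetical, and real load balancers and service meshes provide their own traffic-splitting features.

```python
import hashlib


def route_version(user_id, canary_percent):
    """Deterministically assign a user to the 'canary' or 'stable' version.

    Hashing the user ID keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical user IDs; roughly 5 percent should land on the canary.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route_version(u, 5) == "canary" for u in users) / len(users)
print(f"canary share: {canary_share:.3f}")
```

Rolling back in this model amounts to setting the canary percentage to zero, which sends every user to the stable version again.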
As you design systems with an eye for high availability, keep in mind the role of redundancy and best practices for software development and DevOps.
GCP offers several compute services. We'll consider availability in four of these services: Compute Engine, Kubernetes Engine, App Engine, and Cloud Functions.
Each of these services can provide high availability compute resources, but they vary in the amount of effort required to achieve high availability.
High availability in Compute Engine is ensured by several different mechanisms and practices.
At the physical hardware level, the large number of physical servers in GCP provides redundancy for hardware failures. If a physical server fails, others are available to replace it.
Google also provides live migration, which moves VMs to other physical servers when there is a problem with a physical server or scheduled maintenance occurs. Live migration is also used when network or power systems are down, security patches need to be applied, or configurations need to be modified. Live migration is not available for preemptible VMs, but preemptible VMs are not designed to be highly available. At the time of this writing, VMs with GPUs attached cannot be live migrated. Constraints on live migration may change in the future. The descriptions of Google services here are illustrative and designed to help you learn how to reason about GCP services so you can answer exam questions. For up-to-date details on services, always consult Google Cloud documentation.
High availability also comes from the use of redundant VMs. Managed instance groups are the best way to create a cluster of VMs, all running the same services in the same configuration. A managed instance group uses an instance template to specify the configuration of each VM in the group. Instance templates specify machine type, boot disk image, and other VM configuration details. If a VM in the instance group fails, another one will be created using the instance template.
Managed instance groups (MIGs) provide other features that help improve availability. A VM may be operating correctly, but the application running on the VM may not be functioning as expected. Instance groups can detect this using an application-specific health check. If a VM instance fails the health check, the managed instance group will kill the failing instance and create a new instance. This feature is known as autohealing.
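The autohealing behavior described above can be pictured as a simple reconcile step: check each instance, and replace any that fail with a new one built from the template. The sketch below is a toy model in plain Python, not the Compute Engine API; the instance records, the template contents, and the health check are all hypothetical.

```python
import itertools

_ids = itertools.count(1)


def new_instance(template):
    """Create a toy instance record from a template dictionary."""
    return {"name": f"vm-{next(_ids)}", "healthy": True, **template}


def autoheal(group, template, health_check):
    """Replace every instance that fails the health check with a fresh one."""
    return [
        vm if health_check(vm) else new_instance(template)
        for vm in group
    ]


template = {"machine_type": "hypothetical-small"}
group = [new_instance(template) for _ in range(3)]
group[1]["healthy"] = False  # simulate an application failure on one VM

healed = autoheal(group, template, health_check=lambda vm: vm["healthy"])
print([vm["name"] for vm in healed])
```

The group size stays constant, and the failed instance is replaced by one with a new identity but the same configuration.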
Managed instance groups use load balancing to distribute workload across instances. If an instance is not available, traffic will be routed to other servers in the instance group. Instance groups can be configured as regional instance groups. This distributes instances across multiple zones. If there is a failure in a zone, the application can continue to run in the other zones.
Beyond the regional instance group level, you can further ensure high availability by running your application in multiple regions and using a global load balancer to distribute workload. This would have the added advantage of allowing users to connect to an application instance in the closest region, which could reduce latency. You would have the option of using the HTTP(S), SSL Proxy, or TCP Proxy load balancers for global load balancing.
Kubernetes Engine is a managed Kubernetes service that is used for container orchestration. Kubernetes is designed to provide highly available containerized services. High availability in GKE Kubernetes clusters comes both from Google's technical processes and from the design of Kubernetes itself.
VMs in a GKE Kubernetes cluster are members of a managed instance group, so they have all the high availability features described previously.
Kubernetes continually monitors the state of containers and pods. Pods are the smallest unit of deployment in Kubernetes; they usually have one container, but in some cases a pod may have two or more tightly coupled containers. If pods are not functioning correctly, they will be shut down and replaced. Kubernetes collects statistics, such as the number of desired pods and the number of available pods, which can be reported to Cloud Monitoring.
Kubernetes Engine clusters can be zonal or regional. To improve availability, you can create a regional cluster, in which GKE distributes the underlying VMs across multiple zones within a region. GKE replicates control plane servers and nodes across zones. Control plane servers run several services, including the API server, scheduler, and resource controller, and, when deployed to multiple zones, provide for continued availability in the event of a zone failure.
App Engine and Cloud Functions are fully managed compute services. Users of these services are not responsible for maintaining the availability of the computing resources. The Google Cloud Platform ensures the high availability of these services.
Of course, App Engine and Cloud Functions applications and functions may fail and leave the application unavailable. This is a case where the software engineering and DevOps best practices can help improve availability.
All four case studies have requirements for high availability computing.
Highly available storage is storage that is available and functional at nearly all times. The storage services can be grouped into the following categories:
Let's look at availability in each type of storage service.
Cloud Storage is a fully managed object storage service. Google maintains high availability of the service. As with other managed services, users do not have to do anything to ensure high availability.
Cloud Filestore is another managed storage service. It provides filesystem storage that is available across the network. High availability is ensured by Google.
Persistent disks (PDs) are SSDs and hard disk drives that can be attached to VMs. These disks provide block storage so that they can be used to implement filesystems and database storage. Persistent disks continue to exist even after the VMs shut down. One of the ways in which persistent disks enable high availability is by supporting online resizing. Also, GCP offers both zonal persistent disks and regional persistent disks. Regional persistent disks are replicated in two zones within a region. Persistent disks are further categorized by performance characteristics into several types:
The higher the performance of the persistent disk, the higher the cost. Durability also varies across persistent disk types. Zonal standard persistent disks have better than 99.99 percent durability, while zonal balanced PDs, zonal SSD PDs, and regional standard PDs have better than 99.999 percent durability. Zonal extreme PDs and regional SSD PDs have better than 99.9999 percent durability.
GCP users can choose between running database servers in VMs that they manage or using one of the managed database services.
When running and managing a database, you will need to consider how to maintain availability if the database server or underlying VM fails. Redundancy is the common approach to ensuring availability in databases. How you configure multiple database servers will depend on the database system you are using.
For example, PostgreSQL has several options for using combinations of primary servers, hot standby servers, and warm standby servers. A hot standby server can take over immediately in the event of a primary server failure. A warm standby may be slightly behind in reflecting all transactions. PostgreSQL employs several methods for enabling failover, including the following:
Other database management systems offer similar capabilities. The details are not important for taking the Professional Cloud Architect exam, but it is important to understand how difficult it is to configure and maintain highly available databases. In contrast, if you were using Cloud SQL, you could configure high availability in the console by opting for a high availability configuration.
GCP offers several managed databases. All have high availability features.
Fully managed and serverless databases, such as Cloud Firestore and BigQuery, are highly available, and Google attends to all of the deployment and configuration details to ensure high availability.
The database servers that require users to specify some server configuration options, such as Cloud SQL and Bigtable, can be made more or less highly available based on the use of regional replication. For example, in Bigtable, regional replication enables primary-primary replication among clusters in different zones. This means that both clusters can accept reads and writes, and changes are propagated to the other cluster. In addition to reads and writes, regional replication in Bigtable replicates other changes, such as updating data, adding or removing column families, and adding or removing tables.
In general, the availability of databases is based on the number of replicas and their distribution. The more replicas and the more they are dispersed across zones, the higher the availability. Keep in mind that as you increase the number of replicas, you will increase costs and possibly latency if all replicas must be updated before a write operation is considered successful. Also consider if the data storage system you choose is available within a zone or across regions.
Caching is the practice of storing data in low-latency storage to improve application or database performance. For example, if a particular query is frequently invoked in a database application, the query may respond faster if the data is in memory than if the data were retrieved from a standard persistent disk. Caches are typically optimized for low latency and often come with low durability. Snapshots of the state of a cache may be saved to persistent storage to provide a point of recovery, but such snapshots are not as general purpose as a database table saved to persistent disk.
Cloud Memorystore is a high availability cache service in Google Cloud that supports both Memcached and Redis. This managed cache service can be used to improve availability of data that requires low latency access. Instead of storing data in the memory of a virtual machine or container, which can fail and lose the state of memory, application designers can use Cloud Memorystore to provide high availability of data that requires low latency.
The Mountkirk Games case study notes that the company plans to offer a global leader board using Cloud Spanner, which provides for both high availability and multiregion to global strongly consistent transactions. Game player data, such as the state of play and possessions and attributes of players, could be stored in a NoSQL database such as Bigtable, which provides both low-latency reads/writes and scalability.
One of the technical requirements is to “store game activity logs in structured files for future analysis,” which is a good candidate for Cloud Storage, which can scale to store log files as needed. When it is time to analyze log data, files can be loaded into BigQuery, a fully managed analytical database, or accessed as external, federated tables over the files stored in Cloud Storage.
TerramEarth needs to store telemetry data ingested in real time. This data is time-series data, which means that each record has a time stamp, identifiers indicating the piece of equipment that generated the data, and a series of metrics. Bigtable is a good option when you need to write large volumes of data in real time at low latency. Bigtable has support for regional replication, which improves availability.
EHR Healthcare uses a combination of relational and NoSQL databases. If the company continues to manage their databases rather than use a managed service, such as Cloud SQL, then they should consider using regional or zonal persistent disks and choose based on their availability requirements.
Helicopter Racing League performs encoding and transcoding in the cloud. If low-latency persistent storage access is important for these processes, then the company should consider extreme PDs or local SSDs if their encoding and transcoding pipelines can tolerate the loss of a zonal or local disk. Object storage is used with Helicopter Racing League's current cloud provider to store content; Cloud Storage could provide the same function in Google Cloud. The focus on building predictive models means the company will need to store large volumes of content, such as all race recordings. Telemetry data from racing helicopters as well as from viewers watching races could be stored in Bigtable, which provides for scalability, low-latency reads and writes, as well as key lookup and range scan lookups. Bigtable could be the source of structured data, such as time-series data, for building machine learning models, while Cloud Storage could store unstructured contents, such as audio and video.
When network connectivity is down, applications are unavailable. There are two primary ways to improve network availability: using redundant network connections and using Premium Tier networking.
Redundant network connections can be used to increase the availability of the network between an on-premises data center and Google's data center. One type of connection is a Dedicated Interconnect, which can be used with a minimum of 10 Gbps throughput and does not traverse the public internet. A Dedicated Interconnect is possible when both your network and the Google Cloud network have a point of presence in a common location, such as a data center. When your network does not share a common point of presence with the Google Cloud network, you have the option of using a Partner Interconnect. When using a Partner Interconnect, you provision a network link between your data center and a Google network point of presence. Traffic flows through a telecommunication provider's network from your data center to Google Cloud's network. Traffic does not travel over the internet.
VPNs can also be used when sending data over the internet is not a problem. You should choose among these options based on cost, security, throughput, latency, and availability considerations. Google Cloud offers a high availability VPN, known as HA VPN, which uses redundant connections and offers a 99.99 percent SLA.
Data within the GCP can be transmitted among regions using the public internet or Google's internal network. The latter is available as the Premium Network Tier, which costs more than the Standard Network Tier, which uses the public internet. The internal Google network is designed for high availability and low latency, so the Premium Tier should be considered if global network availability is a concern. Note, if you plan to use global load balancing, you will need to use Premium Tier networking.
The case studies do not provide explicit networking requirements other than an implied expectation that the network is always available. An architect should inquire about additional requirements that might determine if Premium Tier networking is required or if multiple network connections among on-premises and Google data centers are needed.
Application availability builds on compute, storage, and networking availability. It also depends on the application itself. Designing software for high availability is beyond the scope of this book, and it is not a subject you will likely be tested on when taking the Professional Cloud Architect exam.
Architects should understand that they can use Cloud Monitoring and Cloud Logging to observe the state of applications so that they can detect problems as early as possible. Applications that are instrumented with custom metrics can provide application-specific details that could be helpful in diagnosing problems with an application.
Scalability is the process of adding and removing infrastructure resources to meet workload demands efficiently. Different kinds of resources have different scaling characteristics. Here are some examples:
As a general rule, scaling stateless applications horizontally is straightforward. Stateful applications are difficult to scale horizontally, and vertical scaling is often the first choice when stateful applications must scale. Alternatively, stateful applications can move state information out of the individual containers or VMs and store it in a cache, like Cloud Memorystore, or in a database. This makes scaling horizontally less challenging.
Remember that different kinds of resources will scale at different rates. Compute-intensive applications may need to scale compute resources at a faster rate than storage. Similarly, a database that stores large volumes of data but is not often queried may need to scale up storage faster than compute resources. To facilitate efficient scaling, it helps to decouple resources that scale at different rates.
For example, front-end applications often need to scale according to how many users are active on the system and how long requests take to process. Meanwhile, the database server may have enough resources to meet peak demand load without scaling up. When resources are difficult to scale, consider deploying for peak capacity. Relational databases, other than Cloud Spanner, and network interconnects are examples of resources that are difficult to scale. In the case of a non-Spanner relational database, you could scale by running the database on a server with more CPUs and memory. This is vertical scaling, which is limited to the size of available instances. For networks, you could add additional interconnects to add bandwidth between sites. Both of these are disruptive operations compared to scaling a stateless application by adding virtual machines to a cluster, which users might never notice.
Compute Engine and Kubernetes Engine support automatic scaling of compute resources. App Engine and Cloud Functions autoscale as well, but their scaling is managed by the Google Cloud Platform.
In Compute Engine, you can scale the number of VMs running your application using managed instance groups, which support autoscaling. Adding VMs to a managed instance group is known as scaling out or scaling up. Removing VMs from a managed instance group is known as scaling in or scaling down. Autoscaling is not available when a managed instance group has a stateful configuration. Unmanaged instance groups do not support autoscaling. Compute Engine autoscaling should not be used by managed instance groups owned by Kubernetes Engine; cluster autoscaling should be used in those cases.
Autoscaling can be configured to scale based on several attributes, including the following:
The autoscaler collects the appropriate performance data and compares it to targets set in an autoscaling policy. For instance, if you set the target CPU utilization to 80 percent, then the autoscaler will add or remove VMs from the managed instance group to keep the CPU utilization average for the group close to 80 percent.
Autoscalers can make decisions based on multiple metrics. An autoscaler will calculate a recommended number of VMs per metric and then choose the maximum number of VMs recommended.
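The multi-metric decision can be sketched as follows. The text states only that the autoscaler computes a recommendation per metric and takes the maximum; the proportional formula used here for each recommendation is an assumption for illustration, and the metric names and readings are hypothetical.

```python
import math


def recommended_vms(current_vms, observed, target):
    """VMs needed so the per-VM average of a metric falls to the target."""
    return max(1, math.ceil(current_vms * observed / target))


def autoscaler_decision(current_vms, metrics):
    """Pick the largest per-metric recommendation.

    `metrics` maps a metric name to an (observed, target) pair.
    """
    per_metric = {
        name: recommended_vms(current_vms, observed, target)
        for name, (observed, target) in metrics.items()
    }
    return max(per_metric.values()), per_metric


# Hypothetical readings for a group of 10 VMs.
size, detail = autoscaler_decision(
    10,
    {
        "cpu_utilization": (0.90, 0.80),
        "requests_per_second_per_vm": (150, 100),
    },
)
print(size, detail)
```

Taking the maximum means the group is sized for whichever metric is under the most pressure, so no single target is exceeded.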
In addition to autoscaling based on metrics, you can also schedule autoscaling based on time using a scaling schedule. A scaling schedule has a capacity, which is the minimum number of required VMs, and a schedule that includes a start time, duration, and recurrence frequency, such as daily or weekly. You can also enable predictive autoscaling to forecast future loads. This works best when an application has a long startup time and the workload varies predictably over days or weeks.
Keep in mind that autoscaling is independent of health checks. If you use autohealing and a VM fails a health check, the autohealer will try to re-create the instance that failed.
When adding a VM to a managed instance group, the application running on the VM will take some time to initialize. This is known as the cooldown period. Autoscalers will use data from VMs in a cooldown period for scale-in decisions but not scale-out decisions. By default, the cooldown period is 60 seconds, but that can be changed.
When scaling in, the autoscaler considers the peak load during the previous 10 minutes, which is known as the stabilization period. The autoscaler ensures there are enough VMs to meet the peak load during the stabilization period.
Abrupt scale-in events can increase application latency. You can control scale-in operations by specifying a maximum allowed reduction in VMs within a specified time period known as the trailing time window. The trailing time window is the time window the autoscaler monitors for making scaling decisions. The autoscaler does not resize below the peak size less the maximum allowed reduction in VMs.
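The scale-in control described above reduces to a floor on the group size: the peak size observed in the trailing time window minus the maximum allowed reduction. The sketch below applies that floor to a recommendation; the function name and the sample numbers are hypothetical.

```python
def limited_scale_in(recommended, recent_sizes, max_reduction):
    """Apply a scale-in control to an autoscaler recommendation.

    `recent_sizes` holds group sizes seen during the trailing time window.
    The group may not drop below the window's peak minus `max_reduction`.
    """
    floor = max(recent_sizes) - max_reduction
    return max(recommended, floor)


# Hypothetical window: the group peaked at 20 VMs, and at most 5 may be removed.
print(limited_scale_in(recommended=8, recent_sizes=[20, 18, 15], max_reduction=5))
print(limited_scale_in(recommended=17, recent_sizes=[20, 18, 15], max_reduction=5))
```

In the first call the recommendation of 8 is raised to the floor of 15; in the second the recommendation of 17 is already above the floor and passes through unchanged.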
Before a VM is removed from a group, it can optionally run a shutdown script to clean up. The shutdown script is run on a best-effort basis.
When an instance is added to the group, it is configured according to the configuration details in the instance template.
Kubernetes is designed to manage containers in a cluster environment. Recall that containers are an isolation mechanism that allows processes on the same operating system to run with isolated resources. Kubernetes does not scale containers directly; instead, autoscaling is based on Kubernetes abstractions.
The smallest computational resource in Kubernetes is a pod. Pods contain containers. Pods run on nodes, which are VMs in managed instance groups. Pods usually contain one container, but they can include more. When pods have more than one container, those containers are usually tightly coupled, such as one container running analysis code while the other container runs ancillary services, such as data cleansing services. Containers in the same pod should have the same scaling characteristics since they will be scaled up and down together.
A deployment specifies updates for pods and ReplicaSets, which are sets of identically configured pods running at some point in time. An application may be run in more than one deployment at a time. This is commonly done to roll out new versions of code. A new deployment can be run in a cluster, and a small amount of traffic can be sent to it to test the new code in a production environment without exposing all users to the new code. This is an example of a canary deployment.
Applications running on a set of pods can be exposed using a Service. A Service provides a stable abstraction for accessing an application running in a deployment, which can have pods and associated IP addresses that change.
Kubernetes can scale the number of nodes in a cluster, and it can scale the number of replicas and pods running a deployment. Kubernetes Engine automatically scales the size of the cluster based on load. If a new pod is created and there are not enough resources in the cluster to run the pod, then the autoscaler will add a node. Nodes exist within node pools, which are nodes with the same configuration. When a cluster is first created, the number and type of nodes created become the default node pool. Other node pools can be added later if needed.
When you deploy applications to Kubernetes clusters, you have to specify how many replicas of an application should run. A replica is implemented as a pod running application containers. Scaling an application is changing the number of replicas to meet the demand.
Kubernetes provides for autoscaling the number of replicas. When using autoscaling, you specify a minimum and maximum number of replicas for your application along with a target that specifies a resource, like CPU utilization, and a threshold, such as 80 percent. Since Kubernetes Engine 1.9, you can also specify custom metrics in Cloud Monitoring as a target.
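The interaction of the target with the minimum and maximum can be sketched with the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation: the desired count is the current count scaled by the ratio of the observed metric to the target, rounded up, then clamped to the configured bounds. This simplified version omits details such as tolerance and pod readiness, and the sample values are hypothetical.

```python
import math


def desired_replicas(current, observed, target, min_replicas, max_replicas):
    """Proportional replica count, clamped to the configured bounds."""
    raw = math.ceil(current * observed / target)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical deployment: 4 replicas, 80 percent CPU target, bounds 2..10.
print(desired_replicas(4, observed=95, target=80, min_replicas=2, max_replicas=10))
print(desired_replicas(4, observed=400, target=80, min_replicas=2, max_replicas=10))
print(desired_replicas(4, observed=10, target=80, min_replicas=2, max_replicas=10))
```

The second and third calls show the bounds taking effect: a spike cannot push the deployment past the maximum, and a quiet period cannot shrink it below the minimum.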
One of the advantages of containerizing applications is that they can be run in Kubernetes Engine, which can automatically scale the number of nodes or VMs in a cluster. It can also scale how those cloud resources are allocated to different services and their deployments.
Storage resources are virtualized in GCP, and some are fully managed services, so there are parallels between scaling storage and compute resources.
The least scalable storage system is locally attached SSDs on VMs. Locally attached storage is not considered a persistent storage option. Data will be retained during reboots and live migrations, but it is lost when the VM is terminated or stopped. Local data is lost from preemptible VMs when they are preempted.
Zonal and regional persistent disks and persistent SSDs can currently scale up to 64 TB per VM instance. You should also consider read and write performance when scaling persistent storage. Standard disks have a maximum sustained read rate of 0.75 IO operations per second (IOPS) per gigabyte and a maximum sustained write rate of 1.5 IOPS per gigabyte. Persistent SSDs have a maximum sustained read and write rate of 30 IOPS per gigabyte. As a general rule, standard persistent disks are well suited for large-volume batch processing when low cost and high storage volume are important. When performance is a consideration, such as when running a database on a VM, persistent SSDs are the better option.
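Because the sustained IOPS figures are quoted per gigabyte, disk size determines performance as well as capacity. The sketch below estimates IOPS from size using only the per-gigabyte rates given in the text. It ignores the per-disk and per-VM caps that also apply in practice, and current limits should be checked against Google Cloud documentation.

```python
# Per-gigabyte sustained IOPS figures quoted in the text above.
IOPS_PER_GB = {
    "standard": {"read": 0.75, "write": 1.5},
    "ssd": {"read": 30, "write": 30},
}


def sustained_iops(disk_type, size_gb):
    """Estimate sustained read/write IOPS from disk size alone.

    Ignores per-disk and per-VM caps, which also apply in practice.
    """
    rates = IOPS_PER_GB[disk_type]
    return {op: rate * size_gb for op, rate in rates.items()}


for disk_type in IOPS_PER_GB:
    print(disk_type, sustained_iops(disk_type, 1000))
```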
Adding storage to a VM is a two-step process. You will need to allocate persistent storage and then issue operating system commands to make the storage available to the filesystem. The commands are operating system specific.
Managed services, such as Cloud Storage and BigQuery, ensure that storage is available as needed. In the case of BigQuery, even if you do not scale storage directly, you may want to consider partitioning data to improve query performance. Partitioning organizes data in a way that allows the query processor to scan smaller amounts of data to answer a query. For example, assume that Mountkirk Games is storing summary data about user sessions in BigQuery. The data includes a date indicating the day that the session data was collected. Analysts typically analyze data at the week and month levels. If the data is partitioned by week or month, the query processor would scan only the partitions needed to answer the query. Data that is outside the date range of the query would not have to be scanned. Since BigQuery charges by the amount of data scanned, this can help reduce costs.
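The cost effect of partitioning can be illustrated with a toy model. The sketch below represents a table as monthly partitions with hypothetical sizes and compares the data read by a full scan with the data read when only the partitions inside the query's date range are scanned. It is plain Python, not the BigQuery API.

```python
from datetime import date

# Hypothetical table: one partition per month, with its size in gigabytes.
partitions = {
    date(2023, 1, 1): 40,
    date(2023, 2, 1): 35,
    date(2023, 3, 1): 50,
    date(2023, 4, 1): 45,
}


def gb_scanned(partitions, start, end, partitioned=True):
    """Gigabytes read for a query over months `start` through `end`.

    `start` and `end` are month-start dates matching the partition keys.
    """
    if not partitioned:
        return sum(partitions.values())  # full table scan
    return sum(
        size for month, size in partitions.items()
        if start <= month <= end
    )


query = (date(2023, 3, 1), date(2023, 3, 1))
print(gb_scanned(partitions, *query, partitioned=False))  # whole table
print(gb_scanned(partitions, *query))                     # March only
```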
Connectivity between on-premises data centers and Google data centers doesn't scale the way storage and compute scale. You need to plan ahead for the upper limit of the bandwidth that will be needed. You should plan for peak capacity, although you may only pay for bandwidth used, depending on your provider.
Reliability is a measure of the likelihood of a system being available and able to meet the needs of the load on the system. When analyzing technical requirements, it is important to look for reliability requirements. As with availability and scalability, these requirements may be explicit or implicit.
Designing for reliability requires that you consider how to minimize the chance of system failures. For example, we employ redundancy to mitigate the risk of a hardware failure leaving a crucial component unavailable. We also use DevOps best practices to manage risks with configuration changes and when managing infrastructure as code. These are the same practices that we employ to ensure availability.
You also need to consider how to respond when systems do fail. Distributed applications are complicated. A single application may depend on multiple microservices, each with a number of dependencies on other services, which may be developed and managed by another team within the organization or may be a third-party service.
There are different ways to measure reliability, but some are more informative than others.
Total system uptime is one measure. This sounds simple and straightforward, but it is not—at least when dealing with distributed systems. Specifically, what measure do you use to determine whether a system is up? If at least one server is available, is a system up? If there was a problem with an instance group in Compute Engine or the pods in a Kubernetes deployment, you may be able to respond to some requests but not others. If your definition of uptime is based on just having one or some percentage of desired VMs or pods running, then this may not accurately reflect user experience regarding reliability.
Rather than focus on the implementation metrics, such as the number of instances available, reliability is better measured as a function of the work performed by the service. The number of requests that are successfully responded to is a good basis for measuring reliability. Successful request rate is the percentage of all application requests that are successfully responded to. This measure has the advantage of being easy to calculate and of providing a good indicator for the user experience.
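Successful request rate is straightforward to compute from request logs. The sketch below treats any response without a 5xx status code as successful; that definition of success is an assumption, and a real service would choose its own criteria, possibly including latency. The sample data is hypothetical.

```python
def successful_request_rate(status_codes):
    """Percentage of requests that did not end in a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests observed")
    ok = sum(1 for code in status_codes if code < 500)
    return 100 * ok / len(status_codes)


# Hypothetical sample of HTTP response codes from a request log.
sample = [200] * 985 + [404] * 5 + [500] * 7 + [503] * 3
print(f"{successful_request_rate(sample):.1f}%")
```

Unlike a count of running instances, this figure drops as soon as users start receiving errors, even if every VM or pod is nominally up.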
As an architect, you should consider ways to support reliability early in the design stage. This should include the following:
Designing for reliability engineering requires an emphasis on organizational and management issues. This is different than designing for high availability and scalability, which is dominated by technical considerations. As an architect, it is important to remember that your responsibilities include both technical and management aspects of system design.
Architects are constantly working with technical requirements. Sometimes these requirements are explicitly stated, such as when a line-of-business manager states that the system will need to store 10 TB of data per day or that the data warehouse must support SQL. In other cases, you must infer technical requirements from other statements. If a streaming application must be able to accept late-arriving data, this implies the need to buffer data when it arrives and to specify how long to wait for late data.
Some technical requirements are statements of constraints, such as requiring that a database be implemented using MySQL 8.0. Other technical requirements require architects to analyze multiple business needs to identify specific requirements. Many of these fall into the categories of high availability, scalability, and reliability. Compute, storage, and networking services should be designed to support the levels of availability, scalability, and reliability that the business requires.