Chapter 3
Designing for Technical Requirements

The Professional Cloud Architect Certification Exam objectives covered in this chapter include the following:

  • ✓ 1.2 Designing a solution infrastructure that meets technical requirements

The Google Cloud Professional Architect exam will test your ability to understand technical requirements that are explicitly stated, as well as implied, in case studies and questions. Technical requirements may specify a particular hardware or software constraint. For example, an application may need to use a MySQL 5.7 database or be able to transmit 1 GB of data between an on-premises data center and the Google Cloud Platform. Technical requirements do not necessarily specify all of the details that you will need to know. If a question states that a virtual private cloud will have three subnets, you will have to infer from that statement that the subnets must be configured with distinct, nonoverlapping address spaces. Questions about technical requirements commonly require you to recognize some unstated implication of a requirement so that you can choose among several possible solutions.

In this chapter, we will consider three broad categories of technical requirements:

  • High availability
  • Scalability
  • Reliability

We will use the case studies as jumping-off points for discussing these kinds of requirements. We will consider how each of these factors influences the choices we make about compute, storage, networking, and specialized services.

The most important piece of information to take away from this chapter is that availability, scalability, and reliability are not just important at the component or subsystem level but across the entire application infrastructure. Highly reliable storage systems will not confer high reliability on a system if the networking or compute services are not reliable.

High Availability

High availability is the continuous operation of a system at sufficient capacity to meet the demands of ongoing workloads. Availability is usually measured as the percentage of time that a system is available and responding to requests with latency below some threshold. Table 3.1 shows the amount of allowed downtime at various service-level agreement (SLA) levels. An application with a 99 percent availability SLA can be down for 14.4 minutes per day, while a system with 99.999 percent availability can be down for less than 1 second per day without violating the SLA.

Table 3.1 Example availability SLAs and corresponding downtimes

Percent Uptime  Downtime/Day       Downtime/Week       Downtime/Month
99.00           14.4 minutes       1.68 hours          7.31 hours
99.90           1.44 minutes       10.08 minutes       43.83 minutes
99.99           8.64 seconds       1.01 minutes        4.38 minutes
99.999          864 milliseconds   6.05 seconds        26.3 seconds
99.9999         86.4 milliseconds  604.8 milliseconds  2.63 seconds
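The downtime allowances in Table 3.1 follow directly from the uptime percentage. As a quick check of the arithmetic (using a 30.44-day average month for the monthly column):

```python
def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Seconds of downtime permitted per period at a given uptime percentage."""
    return period_seconds * (1 - uptime_percent / 100)

DAY = 24 * 60 * 60       # 86,400 seconds
WEEK = 7 * DAY
MONTH = 30.44 * DAY      # average month length used for the monthly column

# 99 percent uptime allows 864 seconds (14.4 minutes) of downtime per day;
# 99.999 percent allows only 0.864 seconds.
per_day_99 = allowed_downtime_seconds(99.0, DAY)
per_day_five_nines = allowed_downtime_seconds(99.999, DAY)
```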

High availability SLAs such as these have to account for the fact that hardware and software fail. An individual physical component, such as a disk drive running in a particular disk array, may have a small probability of failing in any one-month period, but if you are using thousands of drives, it is much more likely that at least one of them will fail.

When designing high availability applications, you have to plan for failures. Failures can occur at multiple points in an application stack:

  • A bug in the application
  • An outage in a service that the application depends on
  • A full disk on the database server
  • A failed network interface card
  • A downed router
  • A misconfigured firewall rule

We can compensate for hardware failures with redundancy. Instead of writing data to one disk, we write it to three disks. Rather than have a single server running an application, we create instance groups with multiple servers and load balance workload among them. We install two direct network connections between our data center and the GCP—preferably with two different telecommunication vendors. Redundancy is also a key element of ensuring scalability.

We compensate for software and configuration errors with software engineering and DevOps best practices. Code reviews, multiple levels of testing, and running new code in a staging environment help identify bugs before code is released to production. Canary deployments, in which a small portion of a system’s workload is routed to a new version of the software, allow us to test code under production conditions without exposing all users to it. If there is a problem with the new version, it will affect only a portion of the users before it is rolled back. Automating infrastructure deployments, by treating infrastructure as code, reduces the need for manual procedures and the chance of making a mistake when entering commands.
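As a sketch of the canary idea, a deterministic routing rule can pin a fixed fraction of users to the new version. This is illustrative only; `route_to_canary` is a hypothetical helper, and real deployments would implement this in a load balancer or service mesh:

```python
import hashlib

def route_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically send a fixed percentage of users to the canary.

    Hashing the user ID keeps each user on the same version across
    requests, so a bad release affects only a stable subset of users
    before it is rolled back.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

# Roughly 5 percent of users see the new version; the rest stay on stable.
users = [f"user-{i}" for i in range(1000)]
canary_users = [u for u in users if route_to_canary(u, 5)]
```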

As you design systems with an eye for high availability, keep in mind the role of redundancy and best practices for software development and DevOps.

Compute Availability

The GCP offers several compute services. We’ll consider availability in four of these services:

  • Compute Engine
  • Kubernetes Engine
  • App Engine
  • Cloud Functions

Each of these services can provide high availability compute resources, but they vary in the amount of effort required to achieve high availability.

High Availability in Compute Engine

High availability in Compute Engine is ensured by several different mechanisms and practices.

Hardware Redundancy and Live Migration

At the physical hardware level, the large number of physical servers in the GCP provides redundancy in the event of hardware failures. If a physical server fails, others are available to replace it.

Google also provides live migration, which moves VMs to other physical servers when there is a problem with a physical server or scheduled maintenance has to occur. Live migration is also used when network or power systems require maintenance, security patches need to be applied, or configurations need to be modified. Live migration is not available for preemptible VMs, but preemptible VMs are not designed to be highly available. At the time of this writing, VMs with GPUs attached cannot be live migrated. Constraints on live migration may change in the future. The descriptions of Google services here are illustrative and designed to help you learn how to reason about GCP services in order to answer exam questions. This book should not be construed as documentation for GCP services.

Managed Instance Groups

High availability also comes from the use of redundant VMs. Managed instance groups are the best way to create a cluster of VMs, all running the same services in the same configuration. A managed instance group uses an instance template to specify the configuration of each VM in the group. Instance templates specify machine type, boot disk image, and other VM configuration details. If a VM in the instance group fails, another one will be created using the instance template.

Managed instance groups (MIGs) provide other features that help improve availability. A VM may be operating correctly, but the application running on the VM may not be functioning as expected. Instance groups can detect this using an application-specific health check. If an application fails the health check, the managed instance group will create a new instance. This feature is known as auto-healing.

Managed instance groups use load balancing to distribute workload across instances. If an instance is not available, traffic will be routed to other servers in the instance group. Instance groups can be configured as regional instance groups. This distributes instances across multiple zones. If there is a failure in a zone, the application can continue to run in the other zones.
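The auto-healing behavior can be sketched as a simple replace-on-failure loop. This is illustrative only, not the actual GCP implementation; `health_check` and `create_from_template` are hypothetical stand-ins for the MIG's health check and instance template:

```python
def autoheal(instances, health_check, create_from_template):
    """Replace instances that fail their health check; return the new list."""
    healed = []
    for instance in instances:
        if health_check(instance):
            healed.append(instance)
        else:
            # The MIG deletes the unhealthy VM and recreates it from
            # the instance template.
            healed.append(create_from_template())
    return healed

instances = ["vm-1", "vm-2", "vm-3"]
is_healthy = lambda name: name != "vm-2"   # pretend vm-2 fails its check
healed = autoheal(instances, is_healthy, lambda: "vm-new")
# healed == ["vm-1", "vm-new", "vm-3"]
```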

Multiple Regions and Global Load Balancing

Beyond the regional instance group level, you can further ensure high availability by running your application in multiple regions and using a global load balancer to distribute workload. This would have the added advantage of allowing users to connect to an application instance in the closest region, which could reduce latency. You would have the option of using the HTTP(S), SSL Proxy, or TCP Proxy load balancers for global load balancing.

High Availability in Kubernetes Engine

Kubernetes Engine is a managed Kubernetes service that is used for container orchestration. Kubernetes is designed to provide highly available containerized services. High availability in GKE Kubernetes clusters comes both from Google’s technical processes and from the design of Kubernetes itself.

VMs in a GKE Kubernetes cluster are members of a managed instance group, and so they have all of the high availability features described previously.

Kubernetes continually monitors the state of containers and pods. Pods are the smallest unit of deployment in Kubernetes; they usually have one container, but in some cases a pod may have two or more tightly coupled containers. If pods are not functioning correctly, they will be shut down and replaced. Kubernetes collects statistics, such as the number of desired pods and the number of available pods, which can be reported to Stackdriver.

By default, Kubernetes Engine creates a cluster in a single zone. To improve availability, you can create a regional cluster, which distributes the underlying VMs across multiple zones within a region. GKE replicates masters and nodes across zones. Masters are servers that run the Kubernetes control plane, which includes the API server, scheduler, and resource controllers. Distributing them provides continued availability in the event of a zone failure, and the redundant masters allow the cluster to continue to function even if one of the masters is down for maintenance or fails.

High Availability in App Engine and Cloud Functions

App Engine and Cloud Functions are fully managed compute services. Users of these services are not responsible for maintaining the availability of the computing resources. The Google Cloud Platform ensures the high availability of these services.

Of course, App Engine and Cloud Functions applications and functions may fail and leave the application unavailable. This is a case where the software engineering and DevOps best practices can help improve availability.

High Availability Computing Requirements in Case Studies

All three case studies have requirements for high availability computing.

  • In the Mountkirk Games case study, there is no mention of high availability. This does not mean that it is not required. Given the nature of online games, users expect to be able to continue to play once they start and until they decide to stop. The case study specifies that the backend game engine will run on Compute Engine so that the company can take advantage of autoscaling, a feature of managed instance groups. In addition, they plan to collect streaming metrics and perform intensive analytics so that they can report on KPIs. Without highly available servers to ingest, analyze, and store streaming metrics, data would be lost.
  • The Dress4Win case study notes that the first phase of their project will focus on moving the development and test environments. These generally do not require the same level of availability as production services. The company is also developing a disaster recovery site that, when in use, would have to meet the same availability SLAs as the production environment.

    There is more discussion about scalability than availability in the Dress4Win case study, but many of the architecture choices that you would make for scalability contribute to high availability.
  • In the TerramEarth case study, the need for high availability is implied in the business and technical requirements. One of the technical requirements states, “[u]se customer and equipment data to anticipate customer needs.” The 200,000 connected vehicles operating 22 hours a day stream a total of 9 TB of data each day. That data would be lost if it could not be ingested.

Storage Availability

Highly available storage is storage that is available and functional at nearly all times. The storage services can be grouped into the following categories:

  • Object storage
  • File and block storage
  • Database services

Let’s look at availability in each type of storage service.

Availability vs. Durability

Availability should not be confused with durability, which is a measure of the likelihood that a stored object will be lost or corrupted at some point in the future. A storage system can be highly available but not durable. For example, in Compute Engine, locally attached storage is highly available because of the way Google manages VMs: if there were a problem with the underlying physical server, VMs would be live migrated to other physical servers. Locally attached drives are not durable, though. If you need durable drives, you could use Persistent Disk or Cloud Filestore, the fully managed file storage service.

Availability of Object, File, and Block Storage

Cloud Storage is a fully managed object storage service. Google maintains high availability of the service. As with other managed services, users do not have to do anything to ensure high availability.

Cloud Filestore is another managed storage service. It provides filesystem storage that is available across the network. High availability is ensured by Google.

Persistent disks are SSDs and hard disk drives that can be attached to VMs. These disks provide block storage, so they can be used to implement filesystems and database storage. Persistent disks continue to exist even after the VMs that use them shut down. One of the ways in which persistent disks enable high availability is by supporting online resizing: if you find that you need additional storage, you can grow a persistent disk to as much as 64 TB without detaching it. GCP also offers both zonal persistent disks and regional persistent disks. Regional persistent disks are replicated in two zones within a region.

Availability of Databases

GCP users can choose between running database servers in VMs that they manage or using one of the managed database services.

Self-Managed Databases

When running and managing a database, you will need to consider how to maintain availability if the database server or underlying VM fails. Redundancy is the common approach to ensuring availability in databases. How you configure multiple database servers will depend on the database system you are using.

For example, PostgreSQL has several options for using combinations of master servers, hot standby servers, and warm standby servers. A hot standby server can take over immediately in the event of a master server failure, while a warm standby may be slightly behind in reflecting all transactions. PostgreSQL employs several methods for enabling failover, including the following:

  • Shared disk, in which multiple database servers share a disk. If the master server fails, the standby starts to use the shared disk.
  • Filesystem replication, in which changes in the master server filesystem are mirrored on the failover server’s filesystem.
  • Synchronous multimaster replication, in which each server accepts writes and propagates changes to other servers.

Other database management systems offer similar capabilities. The details are not important for taking the Professional Cloud Architect exam, but it is important to understand how difficult it is to configure and maintain highly available databases. In contrast, if you were using Cloud SQL, you could configure high availability in the console by checking a Create Failover Replica box.

Managed Databases

GCP offers several managed databases. All have high availability features.

Fully managed and serverless databases, such as Cloud Datastore and BigQuery, are highly available, and Google attends to all of the deployment and configuration details to ensure high availability.

Cloud Firestore is the next generation of Cloud Datastore. The exam may mention Cloud Datastore or Cloud Firestore. For the purposes of this book, we can consider the two names synonymous. Always refer to the Google documentation for the latest nomenclature.

The database servers that require users to specify some server configuration options, such as Cloud SQL and Bigtable, can be made more or less highly available based on the use of regional replication. For example, in Bigtable, regional replication enables primary-primary replication among clusters in different zones. This means that both clusters can accept reads and writes, and changes are propagated to the other cluster. In addition to reads and writes, regional replication in Bigtable replicates other changes, such as updating data, adding or removing column families, and adding or removing tables.

In general, the availability of databases is based on the number of replicas and their distribution. The more replicas and the more they are dispersed across zones, the higher the availability. Keep in mind that as you increase the number of replicas, you will increase costs and possibly latency if all replicas must be updated before a write operation is considered successful. Also consider if the data storage system you choose is available within a zone or globally.
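The effect of replica count, and of chained dependencies, on availability can be estimated with a simple independence assumption. Real failures are often correlated across components, so treat these figures as rough upper bounds:

```python
def replicated_availability(single: float, replicas: int) -> float:
    """Availability when any one of n independent replicas can serve."""
    return 1 - (1 - single) ** replicas

def serial_availability(*components: float) -> float:
    """Availability when every component (compute, network, storage) must work."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

one = replicated_availability(0.99, 1)     # 0.99
three = replicated_availability(0.99, 3)   # 0.999999
chain = serial_availability(0.999, 0.999, 0.999)   # about 0.997
```

The second function captures the point made earlier in this chapter: a highly reliable storage tier cannot confer high reliability on a system whose network or compute tiers are less reliable, because the availabilities multiply.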

High Availability Storage Requirements in Case Studies

The technical requirements for Mountkirk Games state that the company currently uses a MySQL database.

The executive statement notes, “replace MySQL and move to an environment that provides autoscaling, low latency load balancing, and that frees us up from managing physical servers.” Cloud SQL running MySQL meets the requirements to free the company from managing physical servers, but it does not provide autoscaling. The MySQL database is used for reporting and analytics, so this is a good candidate for BigQuery.

BigQuery is a fully managed database with load distribution and autoscaling. This meets the requirement to “dynamically scale up or down based on game activity.” It also uses SQL as a query language, so users would not have to learn a new query language, and existing reporting tools will likely work with BigQuery. BigQuery is also a good option for meeting the requirement to store at least 10 TB of historical data. (Currently, Cloud SQL storage is capped at 10 TB, which makes BigQuery the better choice.)

There is a requirement to “process data that arrives late because of slow mobile networks.” This requirement could be satisfied with a common ingestion and preprocessing pattern: data is written to a Cloud Pub/Sub topic, read by a Cloud Dataflow job that transforms it as needed, and then written to BigQuery.
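A toy sketch of the late-data handling in that pattern, assuming events carry their own timestamps. This stands in for Beam/Dataflow windowing with allowed lateness; it is not the actual API:

```python
from collections import defaultdict

WINDOW = 60             # one-minute windows, in seconds
ALLOWED_LATENESS = 300  # accept records up to five minutes late

def assign_to_windows(events, now):
    """Group (event_time, payload) pairs into windows by event time."""
    windows = defaultdict(list)
    for event_time, payload in events:
        if now - event_time > ALLOWED_LATENESS:
            continue    # beyond the allowed lateness; drop (or dead-letter)
        windows[event_time // WINDOW].append(payload)
    return dict(windows)

# "late" arrives last in the stream, but its event timestamp places it
# in the first window; "too_old" exceeds the allowed lateness.
events = [(600, "a"), (630, "b"), (665, "c"), (610, "late"), (100, "too_old")]
out = assign_to_windows(events, now=720)
# out == {10: ["a", "b", "late"], 11: ["c"]}
```

Because records are grouped by the time they were generated rather than the time they arrive, a record delayed by a slow mobile network still lands in the correct window.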

The Dress4Win storage requirements include specifications for database storage, network-attached storage, and storage for the Spark cluster. If they use Cloud SQL instead of a self-managed MySQL server, they can improve storage availability by using regional replicas. Using Cloud Storage will provide highly available storage for images, logs, and backups. If Dress4Win decides to use Cloud Dataproc instead of managing its own Spark cluster, then storage availability will be managed by the Google Cloud Platform.

TerramEarth needs to store 9 TB of equipment data per day. This data is time-series data, which means that each record has a time stamp, identifiers indicating the piece of equipment that generated the data, and a series of metrics. Bigtable is a good option when you need to write large volumes of data in real time at low latency. Bigtable has support for regional replication, which improves availability. Regional or zonal persistent disk can be used with Compute Engine VMs. Regional persistent disks provide higher availability if needed.

Network Availability

When network connectivity is down, applications are unavailable. There are two primary ways to improve network availability:

  • Use redundant network connections
  • Use Premium Tier networking

Redundant network connections can be used to increase the availability of the network between an on-premises data center and Google’s data centers. One type of connection is a dedicated interconnect, which provides a minimum of 10 Gbps of throughput and does not traverse the public Internet. A Partner Interconnect is another option; in this case, traffic flows through a telecommunication provider’s network, not the Internet. VPNs can also be used when sending data over the Internet is not a problem. You should choose among these options based on cost, security, throughput, latency, and availability considerations.

Data within the GCP can be transmitted among regions using the public Internet or Google’s internal network. The latter is available as the Premium Network Tier, which costs more than the Standard Network Tier, which uses the public Internet. The internal Google network is designed for high availability and low latency, so the Premium Tier should be considered if global network availability is a concern.

High Availability Network Requirements in Case Studies

The case studies do not provide explicit networking requirements other than an implied expectation that the network is always available. The TerramEarth case study notes that cellular networks may be slow or unavailable, and the applications will need to account for late-arriving data.

An architect could inquire about additional requirements that might determine whether Premium Tier networking is required or whether multiple network connections between on-premises and Google data centers are needed.

Application Availability

Application availability builds on compute, storage, and networking availability. It also depends on the application itself. Designing software for high availability is beyond the scope of this book, and it is not a subject you will likely be tested on when taking the Professional Cloud Architect exam.

Architects should understand that they can use Stackdriver monitoring to monitor the state of applications so that they can detect problems as early as possible. Applications that are instrumented with custom metrics can provide application-specific details that could be helpful in diagnosing problems with an application.

Scalability

Scalability is the process of adding and removing infrastructure resources to meet workload demands efficiently. Different kinds of resources have different scaling characteristics. Here are some examples:

  • VMs in a managed instance group scale by adding or removing instances from the group.
  • Kubernetes scales pods based on load and configuration parameters.
  • NoSQL databases scale horizontally, but this introduces issues around consistency.
  • Relational databases can scale horizontally, but that requires server clock synchronization if strong consistency is required among all nodes. Cloud Spanner uses the TrueTime service, which depends on atomic clocks and GPS signals to track time.

As a general rule, scaling stateless applications horizontally is straightforward. Stateful applications are difficult to scale horizontally, and vertical scaling is often the first choice when stateful applications must scale. Alternatively, stateful applications can move state information out of the individual containers or VMs and store it in a cache, like Cloud Memorystore, or in a database. This makes scaling horizontally less challenging.
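A minimal sketch of externalizing state, with an in-memory class standing in for a shared cache such as Cloud Memorystore (the class and handler names are hypothetical):

```python
class SessionStore:
    """In-memory stand-in for an external cache such as Cloud Memorystore."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def handle_request(store, session_id, item):
    """Stateless handler: all session state lives in the shared store,
    so any instance behind the load balancer can serve any request."""
    cart = store.get(session_id) or []
    cart.append(item)
    store.set(session_id, cart)
    return cart

store = SessionStore()
handle_request(store, "sess-1", "boots")          # handled by one instance
cart = handle_request(store, "sess-1", "jacket")  # another instance, same cart
# cart == ["boots", "jacket"]
```

Because no state lives in the handler itself, instances can be added or removed freely without losing session data.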

Remember that different kinds of resources will scale at different rates. Compute-intensive applications may need to scale compute resources at a faster rate than storage. Similarly, a database that stores large volumes of data but is not often queried may need to scale up storage faster than compute resources. To facilitate efficient scaling, it helps to decouple resources that scale at different rates.

For example, front-end applications often need to scale according to how many users are active on the system and how long requests take to process, while the database server may have enough resources to meet peak demand without scaling up. When resources are difficult to scale, consider deploying for peak capacity. Relational databases, other than Cloud Spanner, and network interconnects are examples of resources that are difficult to scale. In the case of a non-Spanner relational database, you could scale by running the database on a server with more CPUs and memory. This is vertical scaling, which is limited by the size of available instances. For networks, you could add additional interconnects to add bandwidth between sites. Both of these are disruptive operations compared to scaling a stateless application by adding virtual machines to a cluster, which users might never notice.

Scaling Compute Resources

Compute Engine and Kubernetes Engine support automatic scaling of compute resources. App Engine and Cloud Functions autoscale as well, but that scaling is managed by the Google Cloud Platform.

Scaling Compute in Compute Engine

In Compute Engine, you can scale the number of VMs running your application using managed instance groups, which support autoscaling. Unmanaged instance groups do not support autoscaling.

Autoscaling can be configured to scale based on several attributes, including the following:

  • Average CPU utilization
  • HTTP load balancing capacity
  • Stackdriver monitoring metrics

The autoscaler collects the appropriate performance data and compares it to targets set in an autoscaling policy. For instance, if you set the target CPU utilization to 80 percent, then the autoscaler will add or remove VMs from the managed instance group to keep the CPU utilization average for the group close to 80 percent.
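As an illustration of how a utilization target translates into group size, the required instance count can be approximated as the total load divided by the capacity each VM supplies at the target utilization. This is a simplification of the autoscaler's actual decision logic, and the capacity figures are hypothetical:

```python
import math

def instances_needed(total_load, per_vm_capacity, target_utilization):
    """VMs required to keep average utilization at or below the target."""
    return math.ceil(total_load / (per_vm_capacity * target_utilization))

# 10 vCPUs of demand, VMs that supply 2 vCPUs each, an 80 percent target:
# 10 / (2 * 0.8) = 6.25, so the group needs 7 instances.
n = instances_needed(total_load=10, per_vm_capacity=2, target_utilization=0.80)
```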

When a Stackdriver agent is running on VMs in an instance group, you can specify targets for metrics collected by the agent. Some of the metrics collected by the Stackdriver agent are as follows:

  • api_request_count
  • log_entry_count
  • memory_usage
  • uptime

By default, autoscalers use data from the previous 10 minutes when making decisions about scaling down. This is done to prevent frequent changes to the cluster based on short-term changes in utilization, which could lead to thrashing, or adding and removing VMs in rapid succession.

Before a VM is removed from a group, it can optionally run a shutdown script to clean up. The shutdown script is run on a best-effort basis.

When an instance is added to the group, it is configured according to the configuration details in the instance template.

Scaling Compute in Kubernetes Engine

Kubernetes is designed to manage containers in a cluster environment. Recall that containers are an isolation mechanism that allows processes on the same operating system to run with isolated resources. Kubernetes does not scale containers directly; instead, autoscaling is based on Kubernetes abstractions.

The smallest computational resource in Kubernetes is a pod. Pods contain containers. Pods run on nodes, which are VMs in managed instance groups. Pods usually contain one container, but they can include more. When pods have more than one container, those containers are usually tightly coupled; for example, one container might run analysis code while another runs ancillary services such as data cleansing. Containers in the same pod should have the same scaling characteristics, since they will be scaled up and down together.

Pods are organized into deployments. A deployment is a functioning version of an application. An application may run more than one deployment at a time; this is commonly done to roll out new versions of code. A new deployment can be run in a cluster, and a small amount of traffic can be sent to it to test the new code in a production environment without exposing all users to it. This is an example of a canary deployment. Groups of deployments constitute a service, which is the highest level of application abstraction.

Kubernetes can scale the number of nodes in a cluster, and it can scale the number of replicas and pods running a deployment. Kubernetes Engine automatically scales the size of the cluster based on load. If a new pod is created and there are not enough resources in the cluster to run the pod, then the autoscaler will add a node. Nodes exist within node pools, which are nodes with the same configuration. When a cluster is first created, the number and type of nodes created become the default node pool. Other node pools can be added later if needed.

When you deploy applications to Kubernetes clusters, you have to specify how many replicas of an application should run. A replica is implemented as a pod running application containers. Scaling an application is changing the number of replicas to meet the demand.

Kubernetes provides for autoscaling the number of replicas. When using autoscaling, you specify a minimum and maximum number of replicas for your application along with a target that specifies a resource, like CPU utilization, and a threshold, such as 80 percent. Since Kubernetes Engine 1.9, you can specify custom metrics in Stackdriver as a target.
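A sketch of the replica arithmetic, following the common autoscaling rule of thumb of scaling in proportion to the ratio of the observed metric to the target, clamped to the configured range (a simplification of the actual Kubernetes behavior):

```python
import math

def desired_replicas(current, observed, target, minimum, maximum):
    """Scale in proportion to observed/target, clamped to [minimum, maximum]."""
    desired = math.ceil(current * observed / target)
    return max(minimum, min(maximum, desired))

# Four replicas averaging 90 percent CPU against an 80 percent target:
# ceil(4 * 90 / 80) = 5, so one replica is added.
n = desired_replicas(current=4, observed=90, target=80, minimum=2, maximum=10)
```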

One of the advantages of containerizing applications is that they can be run in Kubernetes Engine, which can automatically scale the number of nodes or VMs in a cluster. It can also scale how those cloud resources are allocated to different services and their deployments.

Scaling Storage Resources

Storage resources are virtualized in GCP, and some are fully managed services, so there are parallels between scaling storage and compute resources.

The least scalable storage system is locally attached SSDs on VMs. Up to eight local SSDs can be attached to a VM. Locally attached storage is not considered a persistent storage option. Data will be retained during reboots and live migrations, but it is lost when the VM is terminated or stopped. Local data is lost from preemptible VMs when they are preempted.

Zonal and regional persistent disks and persistent SSDs can scale up to 64 TB per VM instance. You should also consider read and write performance when scaling persistent storage. Standard disks have a maximum sustained read rate of 0.75 IO operations per second (IOPS) per gigabyte and a write rate of 1.5 IOPS per gigabyte. Persistent SSDs have maximum sustained read and write rates of 30 IOPS per gigabyte. As a general rule, persistent disks are well suited for large-volume batch processing when low cost and high storage volume are important. When performance is a consideration, such as when running a database on a VM, persistent SSDs are the better option.
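Those per-gigabyte rates translate into per-disk figures as follows. The helper is hypothetical and ignores the per-disk and per-VM performance caps that also apply:

```python
# Per-gigabyte sustained IOPS rates quoted in the text.
RATES = {
    "pd-standard": {"read": 0.75, "write": 1.5},
    "pd-ssd": {"read": 30.0, "write": 30.0},
}

def max_sustained_iops(disk_type: str, size_gb: int, op: str) -> float:
    """Approximate sustained IOPS for a disk of the given type and size."""
    return RATES[disk_type][op] * size_gb

# A 1,000 GB standard disk sustains about 750 read IOPS; a persistent
# SSD of the same size sustains about 30,000.
standard = max_sustained_iops("pd-standard", 1000, "read")   # 750.0
ssd = max_sustained_iops("pd-ssd", 1000, "read")             # 30000.0
```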

Adding storage to a VM is a two-step process. You will need to allocate persistent storage and then issue operating system commands to make the storage available to the filesystem. The commands are operating system specific.

Managed services, such as Cloud Storage and BigQuery, ensure that storage is available as needed. In the case of BigQuery, even if you do not scale storage directly, you may want to consider partitioning data to improve query performance. Partitioning organizes data in a way that allows the query processor to scan smaller amounts of data to answer a query. For example, assume that Mountkirk Games is storing summary data about user sessions in BigQuery. The data includes a date indicating the day that the session data was collected. Analysts typically analyze data at the week and month levels. If the data is partitioned by week or month, the query processor would scan only the partitions needed to answer the query. Data that is outside the date range of the query would not have to be scanned. Since BigQuery charges by the amount of data scanned, this can help reduce costs.
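As a toy illustration of partition pruning (the partition sizes are hypothetical), a query restricted to one month scans only that month's partition:

```python
# Bytes stored per monthly partition (hypothetical figures).
partitions = {
    "2019-01": 500,
    "2019-02": 450,
    "2019-03": 600,
}

def bytes_scanned(months_queried):
    """With partition pruning, a query scans only the partitions it touches."""
    return sum(partitions[m] for m in months_queried)

full_scan = bytes_scanned(partitions)       # all partitions: 1550
one_month = bytes_scanned(["2019-03"])      # pruned to one partition: 600
```

Since BigQuery charges by bytes scanned, pruning directly reduces the cost of the weekly and monthly reports.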

Network Design for Scalability

Connectivity between on-premises data centers and Google data centers does not scale the way storage and compute scale. You need to plan ahead for the upper limit of the capacity you will need. Plan for peak capacity, although you may pay only for the bandwidth used, depending on your provider.

Reliability

Reliability is a measure of the likelihood of a system being available and able to meet the needs of the load on the system. When analyzing technical requirements, it is important to look for reliability requirements. As with availability and scalability, these requirements may be explicit or implicit.

Designing for reliability requires that you consider how to minimize the chance of system failures. For example, we employ redundancy to mitigate the risk of a hardware failure leaving a crucial component unavailable. We also use DevOps best practices to manage risks with configuration changes and when managing infrastructure as code. These are the same practices that we employ to ensure availability.

You also need to consider how to respond when systems do fail. Distributed applications are complicated. A single application may depend on multiple microservices, each with a number of dependencies on other services, which may be developed and managed by another team within the organization or may be a third-party service.

Measuring Reliability

There are different ways to measure reliability, but some are more informative than others.

Total system uptime is one measure. This sounds simple and straightforward, but it is not—at least when dealing with distributed systems. Specifically, what measure do you use to determine whether a system is up? If at least one server is available, is the system up? If there is a problem with an instance group in Compute Engine or the pods in a Kubernetes deployment, you may be able to respond to some requests but not others. If your definition of uptime is based on having just one, or some percentage, of the desired VMs or pods running, then it may not accurately reflect the user experience of reliability.

Rather than focus on the implementation metrics, such as the number of instances available, reliability is better measured as a function of the work performed by the service. The number of requests that are successfully responded to is a good basis for measuring reliability. Successful request rate is the percentage of all application requests that are successfully responded to. This measure has the advantage of being easy to calculate and of providing a good indicator for the user experience.
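The successful request rate described above is a one-line calculation. This minimal sketch treats a window with no traffic as fully successful, which is one reasonable convention, not a standard:

```python
# Minimal sketch: reliability measured as successful request rate.
def success_rate(successful: int, total: int) -> float:
    """Percentage of requests answered successfully; 100.0 if no traffic."""
    if total == 0:
        return 100.0
    return 100.0 * successful / total

print(success_rate(999_700, 1_000_000))  # 99.97
```

Because it is computed from work the service actually performed, this number tracks user experience directly, regardless of how many VMs or pods happened to be running.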

Reliability Engineering

As an architect, you should consider ways to support reliability early in the design stage. This should include the following:

  • Identifying how to monitor services. Will they require custom metrics?
  • Considering alerting conditions. How do you balance the need for early indication that a problem may be emerging with the need to avoid overloading DevOps teams with unactionable alerts?
  • Using existing incident response procedures with the new system. Does this system require any specialized procedures during an incident? For example, if this is the first application to store confidential, personally identifying information, you may need to add procedures to notify the information security team if an incident involves a failure in access controls.
  • Implementing a system for tracking outages and performing post-mortems to understand why a disruption occurred.
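The alerting-condition item in the list above can be made concrete with an error-budget check. The following is a hypothetical sketch, not a GCP feature: the SLO target and the burn-rate threshold are assumptions, chosen to show how a team might avoid unactionable alerts by firing only when failures consume the error budget faster than allowed.

```python
# Hypothetical alerting-condition sketch: alert when the observed failure
# rate burns the error budget faster than a chosen threshold. The SLO
# target and burn-rate threshold below are assumptions, not defaults.
def should_alert(failed: int, total: int, slo: float = 0.999,
                 burn_rate_threshold: float = 2.0) -> bool:
    """Fire when failures exceed burn_rate_threshold x the budgeted rate."""
    if total == 0:
        return False
    error_budget = 1.0 - slo            # e.g., 0.1% of requests may fail
    observed_failure_rate = failed / total
    return observed_failure_rate > burn_rate_threshold * error_budget

print(should_alert(failed=5, total=1000))  # True  (0.5% > 2 x 0.1%)
print(should_alert(failed=1, total=1000))  # False (0.1% <= 0.2%)
```

A condition like this gives DevOps teams early warning of an emerging problem while staying quiet when the failure rate is within budget.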

Designing for reliability engineering requires an emphasis on organizational and management issues. This is different from designing for high availability and scalability, which is dominated by technical considerations. As an architect, it is important to remember that your responsibilities include both the technical and the management aspects of system design.

Summary

Architects are constantly working with technical requirements. Sometimes these requirements are explicitly stated, such as when a line-of-business manager states that the system will need to store 10 TB of data per day or that the data warehouse must support SQL. In other cases, you have to infer technical requirements from other statements. If a streaming application must be able to accept late-arriving data, this implies the need to buffer data when it arrives and to specify how long to wait for late data.

Some technical requirements are statements of constraints, like requiring that a database be implemented using MySQL 5.7. Other technical requirements require architects to analyze multiple business needs and other constraints in order to identify requirements. Many of these fall into the categories of high availability, scalability, and reliability. Compute, storage, and networking services should be designed to support the levels of availability, scalability, and reliability that the business requires.

Exam Essentials

Understand the differences between availability, scalability, and reliability. High availability is the continuous operation of a system at sufficient capacity to meet the demands of ongoing workloads. Availability is usually measured as a percentage of time that a system is available. Scalability is the process of adding and removing infrastructure resources to meet workload demands efficiently. Reliability is a measure of how likely it is that a system will be available and capable of meeting the needs of the load on the system.

Understand how redundancy is used to improve availability. Compute, storage, and network services all use redundancy to improve availability. Clusters of identically configured VMs behind a load balancer is an example of using redundancy to improve availability. Making multiple copies of data is an example of redundancy used to improve storage availability. Using multiple direct connections between a data center and Google Cloud is an example of redundancy in networking.

Know that managed services relieve users of many responsibilities for availability and scalability. Managed services in GCP take care of most aspects of availability and scalability. For example, Cloud Storage is highly available and scalable, but users of the service do not have to do anything to enable these capabilities.

Understand how Compute Engine and Kubernetes Engine achieve high availability and scalability. Compute Engine uses managed instance groups, which include instance templates and autoscalers, to achieve high availability and scale to meet application load. Kubernetes is a container orchestration service that provides higher-level abstractions for deploying applications on containers. Pods scale as needed to meet demands on the cluster and on application services.

Understand that reliability engineering is about managing risk. Designing for reliability requires you to consider how to minimize the chance of system failures. For example, architects employ redundancy to mitigate the risk of a hardware failure leaving a crucial component unavailable. Rather than focus on implementation metrics, such as the number of instances available, reliability is better measured as a function of the work performed by the service. The number of requests that are successfully responded to is a good basis for measuring reliability.

Review Questions

  1. You are advising a customer on how to improve the availability of a data storage solution. Which of the following general strategies would you recommend?

    1. Keeping redundant copies of the data
    2. Lowering the network latency for disk writes
    3. Using a NoSQL database
    4. Using Cloud Spanner
  2. A team of data scientists is analyzing archived data sets. The model building procedures run in batches. If the model building system is down for up to 30 minutes per day, it does not adversely impact the data scientists’ work. What is the minimal percentage availability among the following options that would meet this requirement?

    1. 99.99 percent
    2. 99.90 percent
    3. 99.00 percent
    4. 99.999 percent
  3. Your development team has recently triggered three incidents that resulted in service disruptions. In one case, an engineer mistyped a number in a configuration file and in the other cases specified an incorrect disk configuration. What practices would you recommend to reduce the risk of these types of errors?

    1. Continuous integration/continuous deployment
    2. Code reviews of configuration files
    3. Vulnerability scanning
    4. Improved access controls
  4. Your company is running multiple VM instances that have not had any downtime in the past several weeks. Recently, several of the physical servers suffered disk failures. The applications running on the servers did not have any apparent service disruptions. What feature of Compute Engine enabled that?

    1. Preemptible VMs
    2. Live migration
    3. Canary deployments
    4. Redundant array of inexpensive disks
  5. You have deployed an application on an instance group. The application is not functioning correctly. What is a possible outcome?

    1. The application shuts down when the instance group time-to-live (TTL) threshold is reached.
    2. The application shuts down when the health check fails.
    3. The VM shuts down when the instance group TTL threshold is reached and a new VM is started.
    4. The VM shuts down when the health check fails and a new VM is started.
  6. Mountkirk Games is growing its user base in North America, Europe, and Asia. Executives are concerned that players in Europe and Asia will have a degraded experience if the game backend runs only in North America. What would you suggest as a way to improve latency and game experience for users in Europe and Asia?

    1. Use Cloud Spanner to have a globally consistent, horizontally scalable relational database.
    2. Create instance groups running the game backend in multiple regions across North America, Europe, and Asia. Use global load balancing to distribute the workload.
    3. Use Standard Tier networking to ensure that data sent between regions is routed over the public Internet.
    4. Use a Cloud Memorystore cache in front of the database to reduce database read latency.
  7. What configuration changes are required to ensure high availability when using Cloud Storage or Cloud Filestore?

    1. A sufficiently long TTL must be set.
    2. A health check must be specified.
    3. Both a TTL and health check must be specified.
    4. Nothing. Both are managed services. GCP manages high availability.
  8. The finance director in your company is frustrated with the poor availability of an on-premises finance data warehouse. The data warehouse uses a commercial relational database that only scales by buying larger and larger servers. The director asks for your advice about moving the data warehouse to the cloud and if the company can continue to use SQL to query the data warehouse. What GCP service would you recommend to replace the on-premises data warehouse?

    1. Bigtable
    2. BigQuery
    3. Cloud Datastore
    4. Cloud Storage
  9. TerramEarth has determined that it wants to use Cloud Bigtable to store equipment telemetry data transmitted over their cellular network. They have also concluded that they want two clusters in different regions. Both clusters should be able to respond to read and write requests. What kind of replication should be used?

    1. Primary–hot primary
    2. Primary–warm primary
    3. Primary–primary
    4. Primary read–primary write
  10. Your company is implementing a hybrid cloud computing model. Line-of-business owners are concerned that data stored in the cloud may not be available to on-premises applications. The current network connection is using a maximum of 40 percent of bandwidth. What would you suggest to mitigate the risk of that kind of service failure?

    1. Configure firewall rules to improve availability.
    2. Use redundant network connections between the on-premises data center and Google Cloud.
    3. Increase the number of VMs allowed in Compute Engine instance groups.
    4. Increase the bandwidth of the network connection between the data center and Google Cloud.
  11. A team of architects in your company is defining standards to improve availability. In addition to recommending redundancy and code reviews for configuration changes, what would you recommend to include in the standards?

    1. Use of access controls
    2. Use of managed services for all compute requirements
    3. Use of Stackdriver monitoring to alert on changes in application performance
    4. Use of Bigtable to collect performance monitoring data
  12. Why would you want to run long-running, compute-intensive backend computation in a different managed instance group than on web servers supporting a minimal user interface?

    1. Managed instance groups can run only a single application.
    2. Managed instance groups are optimized for either compute or HTTP connectivity.
    3. Compute-intensive applications have different scaling characteristics from those of lightweight user interface applications.
    4. There is no reason to run the applications in different managed instance groups.
  13. An instance group is adding more VMs than necessary and then shutting them down. This pattern is happening repeatedly. What would you do to try to stabilize the addition and removal of VMs?

    1. Increase the maximum number of VMs in the instance group.
    2. Decrease the minimum number of VMs in the instance group.
    3. Increase the time autoscalers consider when making decisions.
    4. Decrease the time autoscalers consider when making decisions.
  14. Dress4Win has just developed a new feature for its social networking service. Customers can upload images of their clothes, create montages from those images, and share them on social networking sites. Images are temporarily saved to locally attached drives as the customer works on the montage. When the montage is complete, the final version is copied to a Cloud Storage bucket. The services implementing this feature run in a managed instance group. Several users have noted that their final montages are not available even though they saved them in the application. No other problems have been reported with the service. What might be causing this problem?

    1. The Cloud Storage bucket is out of storage.
    2. The locally attached drive does not have a filesystem.
    3. The users experiencing the problem were using a VM that was shut down by an autoscaler, and a cleanup script did not run to copy the latest version of the montage to Cloud Storage.
    4. The network connectivity between the VMs and Cloud Storage has failed.
  15. Kubernetes uses several abstractions to model and manage computation and applications. What is the progression of abstractions from the lowest to the highest level?

    1. Pods → Deployments → Services
    2. Pods → Services → Deployments
    3. Deployments → Services → Pods
    4. Deployments → Pods → Services
  16. Your development team has implemented a new application using a microservices architecture. You would like to minimize DevOps overhead by deploying the services in a way that will autoscale. You would also like to run each microservice in containers. What is a good option for implementing these requirements in Google Cloud Platform?

    1. Run the containers in Cloud Functions.
    2. Run the containers in Kubernetes Engine.
    3. Run the containers in Cloud Dataproc.
    4. Run the containers in Cloud Dataflow.
  17. TerramEarth is considering building an analytics database and making it available to equipment designers. The designers require the ability to query the data with SQL. The analytics database manager wants to minimize the cost of the service. What would you recommend?

    1. Use BigQuery as the analytics database, and partition the data to minimize the amount of data scanned to answer queries.
    2. Use Bigtable as the analytics database, and partition the data to minimize the amount of data scanned to answer queries.
    3. Use BigQuery as the analytics database, and use data federation to minimize the amount of data scanned to answer queries.
    4. Use Bigtable as the analytics database, and use data federation to minimize the amount of data scanned to answer queries.
  18. Line-of-business owners have decided to move several applications to the cloud. They believe the cloud will be more reliable, but they want to collect data to test their hypothesis. What is a common measure of reliability that they can use?

    1. Mean time to recovery
    2. Mean time between failures
    3. Mean time between deployments
    4. Mean time between errors
  19. A group of business executives and software engineers are discussing the level of risk that is acceptable for a new application. Business executives want to minimize the risk that the service is not available. Software engineers note that the more developer time dedicated to reducing risk of disruption, the less time they have to implement new features. How can you formalize the group’s tolerance for risk of disruption?

    1. Request success rate
    2. Uptime of service
    3. Latency
    4. Throughput
  20. Your DevOps team recently determined that it needed to increase the size of persistent disks used by VMs running a business-critical application. When scaling up the size of available persistent storage for a VM, what other step may be required?

    1. Adjusting the filesystem size in the operating system
    2. Backing up the persistent disk before changing its size
    3. Changing the access controls on files on the disk
    4. Updating disk metadata, including labels