Chapter 5
Designing Storage Systems

Storage is an essential component of virtually any cloud-based system. Storage needs can range from long-term archival storage for rarely accessed data to highly volatile, frequently accessed data cached in memory. Google Cloud Platform (GCP) provides a full range of storage options, such as the following:

  • Object storage
  • Persistent local and attached storage
  • Relational and NoSQL databases

This chapter reviews the key concepts and design criteria that you will need to know to pass the Professional Cloud Architect exam, including data retention and lifecycle management considerations as well as addressing networking and latency issues with storage system design.

Overview of Storage Services

Cloud architects often must select one or more storage systems when designing an application. Several factors influence the choice of storage systems, such as the following:

  • Is the data structured or unstructured?
  • How frequently will the data be accessed?
  • What is the read/write pattern? What is the frequency of reads versus writes?
  • What are the consistency requirements?
  • Can Google-managed keys be used for encryption, or do you need to deploy customer-managed keys?
  • What are the most common query patterns?
  • Does your application require mobile support, such as synchronization?
  • For structured data, is the workload analytic or transactional?
  • Does your application require low latency writes?

The answer to these and similar questions will help you decide which storage services to use and how to configure them.

Object Storage with Google Cloud Storage

Google Cloud Storage is an object storage system. It is designed for persisting unstructured data, such as data files, images, videos, backup files, and any other data. It is unstructured in the sense that objects, that is, files stored in Cloud Storage, are treated as atomic. When you access a file in Cloud Storage, you access the entire file. You cannot treat it as a file on a block storage device that allows for seeking and reading specific blocks in the file. There is no presumed structure within the file that Cloud Storage can exploit.

Organizing Objects in a Namespace

Cloud Storage uses buckets to group objects. Objects are different from files in that they cannot be updated. They are versioned and accessed as a single unit; that is, you cannot access a subset of an object the way you can access a block of a file in a filesystem. A bucket is a group of objects that share access controls at the bucket level. For example, the service account assigned to a virtual machine may have permissions to write to one bucket and read from another bucket. Individual objects within buckets can have their own access controls as well.

Google Cloud Storage uses a global namespace for bucket names, so every bucket name must be globally unique. Object names do not have to be unique. A bucket is named when it is created and cannot be renamed. To simulate renaming a bucket, copy the contents of the bucket to a new bucket with the desired name and then delete the original bucket.
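
For example, a bucket "rename" could be simulated with gsutil commands along the following lines; the bucket names and location are hypothetical:

    $ gsutil mb -l us-west1 gs://new-bucket-name                      # create the new bucket
    $ gsutil -m cp -r gs://old-bucket-name/* gs://new-bucket-name     # copy all objects in parallel
    $ gsutil -m rm -r gs://old-bucket-name                            # delete the objects and the old bucket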

The following are best practice suggestions for bucket naming:

  • Do not use personally identifying information, such as names, email addresses, IP addresses, and so forth in bucket names. That kind of information could be useful to an attacker.
  • Follow DNS naming conventions because bucket names can appear in a CNAME record in DNS.
  • Use globally unique identifiers (GUIDs) if creating many buckets.
  • Do not use sequential names or timestamps if uploading files in parallel. Files with sequentially close names will likely be assigned to the same server. This can create a hotspot when writing files to Cloud Storage.
  • Bucket names can also be subdomain names, such as mybucket.example.com.

The Cloud Storage service does not provide a filesystem. This means that there is no ability to navigate a path through a hierarchy of directories and files. The object store does support a naming convention that allows for the naming of objects in a way that looks similar to the way that a hierarchical filesystem would structure a file path and filename. If you would like to use Google Cloud Storage as a filesystem, the Cloud Storage FUSE open source project provides a mechanism to map from object storage systems to filesystems.

Cloud Storage FUSE

Filesystem in Userspace (FUSE) is a framework for exposing a filesystem to the Linux kernel. FUSE uses a stand-alone application that runs on Linux and provides a filesystem API along with an adapter for implementing filesystem functions in the underlying storage system. Cloud Storage FUSE is an open source adapter that allows users to mount Cloud Storage buckets as filesystems on Linux and macOS platforms.

Cloud Storage FUSE is not a filesystem like NFS. It does not implement a filesystem or a hierarchical directory structure. It does interpret / characters in filenames as directory delimiters.

For example, when using Cloud Storage FUSE, a user could mount a Cloud Storage bucket at a mount point called gcs. The user could then use the local operating system to save a file named mydata.csv to /gcs/myproject/mydirectory/mysubdirectory. Executing the ls command on the simulated mysubdirectory would list mydata.csv along with any other files whose object names are prefixed by myproject/mydirectory/mysubdirectory/. Listing the contents of the mounted bucket with a gsutil command or the cloud console would show an object named myproject/mydirectory/mysubdirectory/mydata.csv.

Cloud Storage FUSE is useful when you want to move files easily back and forth between Cloud Storage and a Compute Engine VM, a local server, or your development device using filesystem commands instead of gsutil commands or the cloud console.
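
As a sketch of this workflow, the commands below mount a hypothetical bucket, copy a file into it using ordinary filesystem commands, and then unmount it; the bucket name, mount point, and file names are illustrative:

    $ mkdir /home/alex/gcs
    $ gcsfuse my-example-bucket /home/alex/gcs              # mount the bucket as a filesystem
    $ mkdir -p /home/alex/gcs/myproject/mydirectory
    $ cp mydata.csv /home/alex/gcs/myproject/mydirectory/   # stored as an object with a /-delimited name
    $ fusermount -u /home/alex/gcs                          # unmount when finished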

Cloud Storage FUSE is a Google-developed and community-supported open source project under the Apache license. Cloud Storage FUSE is available at github.com/GoogleCloudPlatform/gcsfuse.

Storage Tiers

Cloud Storage offers four tiers or types of storage. It is essential to understand the characteristics of each tier and when it should be used for the Professional Cloud Architect exam. The four types of Cloud Storage are as follows:

  • Standard
  • Nearline
  • Coldline
  • Archive

Standard storage is well suited for use with frequently accessed (“hot”) data or for data that is stored briefly. Standard storage has three location types: region, dual-region, and multiregion.

The region location type stores multiple copies of an object in multiple zones in one region. All Cloud Storage options provide high durability, which means the probability of losing an object during any period is extremely low. Cloud Storage provides 99.999999999 percent (eleven 9s) annual durability.

This level of durability is achieved by keeping redundant copies of the object. Availability is the ability to access an object when you want it. An object can be durably stored but unavailable. For example, a network outage in a region would prevent you from accessing an object stored in that region, although it would continue to be stored in multiple zones.

Dual-region and multiregion storage mitigate the risk of a regional outage by storing replicas of objects in two or multiple regions, respectively. This can also improve access time and latency by distributing copies of objects to locations that are closer to the users of those objects. Consider a user in California in the western United States accessing an object stored in us-west1, a region located in the northwestern U.S. state of Oregon. That user can expect latency under 5 ms, while a user in New York, in the northeastern United States, would likely experience latencies closer to 30 ms.

Dual-region and multiregion storage are also known as geo-redundant storage.

The latencies mentioned here are based on public internet network infrastructure. Google offers two network tiers: Standard and Premium. With Standard network tier, data is routed between regions using public internet infrastructure and is subject to network conditions and routing decisions beyond Google's control. The Premium network tier routes data over Google's global high-speed network. Users of Premium tier networking can expect lower latencies.

Nearline and Coldline storage are used for storing data that is not frequently accessed. Data that is accessed less than once in 30 days is a good candidate for Nearline storage. Data that is accessed less than once in 90 days is a good candidate for Coldline storage. All storage classes have the same latency to return the first byte of data. Archive storage is optimal for data accessed less than once per year.

Nearline, Coldline, and Archive storage have slightly lower availability than Standard storage, lower at-rest storage costs, and higher data access costs.
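
For example, the following gsutil commands create a Nearline bucket and later rewrite one of its objects to the Coldline class; the bucket, object, and location names are hypothetical:

    $ gsutil mb -c nearline -l us-west1 gs://example-backup-bucket
    $ gsutil cp backup-2023-06.tar gs://example-backup-bucket/
    $ gsutil rewrite -s coldline gs://example-backup-bucket/backup-2023-06.tar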

Cloud Storage also has three other classes: Multi-Regional storage, Regional storage, and Durable Reduced Availability (DRA) storage. Google Cloud recommends using Standard storage instead of these unless you are already using one of these additional classes.

Cloud Storage Use Cases

Cloud Storage is used for a few broad use cases.

  • Storage of data shared among multiple instances that does not need to be on persistent attached storage. For example, log files may be stored in Cloud Storage and analyzed by programs running in a Cloud Dataproc Spark cluster.
  • Backup and archival storage, such as persistent disk snapshots, backups of on-premises systems, and data kept for audit and compliance requirements but not likely to be accessed.
  • As a staging area for uploaded data. For example, a mobile app may allow users to upload images to a Cloud Storage bucket. When the file is created, a Cloud Function could trigger to initiate the next steps of processing.

Each of these examples fits well with Cloud Storage's treatment of objects as atomic units. If data within the file needs to be accessed and processed, that is done by another service or application, such as a Spark analytics program.

Different tiers are better suited for some use cases. For example, Archive class storage is best used for archival storage, but Multiregional storage may be the best option for uploading user data, especially if users are geographically dispersed.

As an architect, it is important to understand the characteristics of the four storage tiers, their relative costs, the assumption of atomicity of objects by Cloud Storage, and how Cloud Storage is used in larger workflows.

Network-Attached Storage with Google Cloud Filestore

Cloud Filestore is a network-attached storage service that provides a filesystem that is accessible from Compute Engine and Kubernetes Engine. Cloud Filestore is designed to provide low latency and IOPS, so it can be used for databases and other performance-sensitive services.

To use Cloud Filestore, you create a Filestore instance, which has a name, service tier, storage type, and capacity. The service tier determines the instance's capacity, scalability, and performance. The storage type is either HDD or SSD.
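
For example, an instance might be created with a command along these lines; the instance name, zone, tier, share name, and capacity are illustrative:

    $ gcloud filestore instances create shared-files \
        --zone=us-central1-a \
        --tier=BASIC_HDD \
        --file-share=name=vol1,capacity=2TB \
        --network=name=default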

You can create backups of an instance to preserve copies of all files and metadata associated with a Filestore instance. Backups are regional resources. You can also create snapshots which are copies of the state of the filesystem instance at a point in time. Snapshots are stored within the Filestore instance.

Cloud Filestore Service Tiers

Cloud Filestore is especially useful when organizations are lifting and shifting applications that require a filesystem and cannot use object storage systems, such as Cloud Storage. Some typical use cases for Cloud Filestore are home directories and shared directories, web server content, and migrated applications that require a filesystem.

Cloud Filestore has three service tiers.

  • Filestore Basic
  • Filestore High Scale
  • Filestore Enterprise

Filestore Basic is designed for file sharing, software development, and Google Kubernetes Engine workloads that benefit from a persistent, managed filesystem. Filestore Basic is a zonal resource.

Filestore High Scale is designed to meet the demands of high-performance computing workloads, such as genome analysis and financial services applications requiring low-latency file operations. Filestore High Scale is a zonal resource.

Filestore Enterprise is for mission-critical applications and Google Kubernetes Engine workloads. This version provides 99.99 percent regional availability by provisioning multiple NFS shares across multiple zones within a region.

Filestore provides for snapshots of the filesystem, which can be taken periodically. If you need to recover a filesystem from a snapshot, it would be available within 10 minutes.

Cloud Filestore Networking

Cloud Filestore connects to VPC networks using either VPC Network Peering or private services access. VPC Network Peering is used when creating a filesystem instance with a stand-alone VPC network, when creating an instance within the host project of a Shared VPC, or when accessing the filesystem from an on-premises network using Cloud VPN or Cloud Interconnect. Private service access is used when creating an instance on a Shared VPC network from a service project, not the host project, or when you are using centralized IP range management for multiple Google Cloud services. Note, Cloud Filestore does not support transitive peering.

Cloud Filestore Access Controls

Access controls to Cloud Filestore and to files on the filesystem are managed with a combination of IAM roles and POSIX file permissions.

The roles/file.editor and roles/file.viewer are used to grant permissions on the Cloud Filestore instance to users. The viewer role allows users to see details about an instance, its location, backups, and snapshots as well as the operational status of the instance. The editor role includes the viewer permissions as well as permissions to create and delete instances, backups, and snapshots. These roles do not provide access to the files in the Filestore instance.

When a Filestore instance is created, it has default POSIX permissions of rwxr-xr-x, which can be changed using operating system commands such as chmod. Access control lists (ACLs) are also supported and can be managed with operating system commands such as setfacl.
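
For example, after mounting the instance's NFS share on a client VM, the default permissions might be tightened and an ACL entry added; the IP address, share name, and user are hypothetical:

    $ sudo mkdir -p /mnt/filestore
    $ sudo mount 10.123.45.2:/vol1 /mnt/filestore
    $ sudo chmod 770 /mnt/filestore                       # restrict access to the owner and group
    $ sudo setfacl -m user:analyst1:rwx /mnt/filestore    # grant one additional user full access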

Databases

Google Cloud provides a variety of database storage systems. The Professional Cloud Architect exam may include questions that require you to choose an appropriate database solution when given a set of requirements. The databases can be broadly grouped as relational, analytical, and NoSQL databases.

Relational Database Overview

Relational databases are highly structured data stores that are designed to store data in ways that minimize the risk of data anomalies and to support a comprehensive query language. An example of a data anomaly is inserting an employee record that references a department ID that does not exist in the department table. Relational databases support data models that include constraints that help prevent anomalies.

Another common characteristic of relational databases is support for ACID transactions. (ACID stands for atomicity, consistency, isolation, and durability.) NoSQL databases may support ACID transactions as well, but not all do.

Atomicity

Atomic operations ensure that all steps in a transaction complete or no steps take effect. For example, a sales transaction might include reducing the number of products available in inventory and charging a customer's credit card. If there isn't sufficient inventory, the transaction will fail, and the customer's credit card will not be charged.

Consistency

Consistency, specifically transactional consistency, is a property that guarantees that when a transaction executes, the database is left in a state that complies with constraints, such as uniqueness requirements and referential integrity, which ensures foreign keys reference a valid primary key. When a database is distributed, consistency also refers to querying data from different servers in a database cluster and receiving the same data from each.

For example, some NoSQL databases replicate data on multiple servers to improve availability. If there is an update to a record, each copy must be updated. In the time between the first and last copies being updated, it is possible to have two instances of the same query receive different results. This is considered an inconsistent read. Eventually, all replicas will be updated, so this is referred to as eventual consistency.

There are other types of consistency; for those interested in the details of other consistency models, see “Consistency in Non-Transactional Distributed Storage Systems” by Paolo Viotti and Marko Vukolić at arxiv.org/pdf/1512.00168.pdf.

Isolation

Isolation refers to ensuring that the effects of transactions that run at the same time leave the database in the same state as if they ran one after the other. Let's consider an example.

Transaction 1 is as follows:

    A = 3
    B = 2
    C = A + B

Transaction 2 is as follows:

    A = 5
    B = 10
    C = A + B

When high isolation is in place, the value of C will be either 5 or 15, which are the results of either of the transactions. The data will appear as if the three operations of Transaction 1 executed first and the three operations of Transaction 2 executed next. In that case, the value of C is 15. If the operations of Transaction 2 execute first followed by the Transaction 1 operations, then C will have the value 5.

What will not occur is that some of the Transaction 1 operations will execute followed by some of the Transaction 2 operations and then the rest of the Transaction 1 operations.

Here is an execution sequence that cannot occur when isolation is in place:

    A = 5
    B = 2
    C = A + B

This sequence of operations would leave C with the assigned value of 7, which would be an incorrect state for the database to be in.

Durability

The durability property ensures that once a transaction is executed, the state of the database will always reflect or account for that change. This property usually requires databases to write data to persistent storage—even when the data is also stored in memory—so that in the event of a crash, the effects of the transactions are not lost.

Google Cloud Platform offers two managed relational database services: Cloud SQL and Cloud Spanner. Each is designed for distinct use cases. In addition to the two managed services, GCP customers can run their own databases on GCP virtual machines.

Cloud SQL

Cloud SQL is a managed service that provides MySQL, SQL Server, and PostgreSQL databases.

Cloud SQL allows users to deploy MySQL, SQL Server, and PostgreSQL on managed virtual servers. GCP manages patching database software, backups, and failovers. Key features of Cloud SQL include the following:

  • All data is encrypted at rest and in transit.
  • Data is replicated across multiple zones for high availability.
  • GCP manages failover to replicas.
  • Support for standard database connectors and tools is provided.
  • The service provides integrated monitoring and logging.

Cloud SQL is an appropriate choice for regional databases up to 30 TB.

Customers can choose the machine type and disk size for the virtual machine running the database server. MySQL, SQL Server, and PostgreSQL all support the SQL query language.

Relational databases, like MySQL, SQL Server, and PostgreSQL, are used with structured data. All of them require well-defined schemas, which can be specified using the data definition commands of SQL. These databases also support strongly consistent transactions, so there is no need to work around issues with eventual consistency. Cloud SQL is appropriate for transaction processing applications, such as e-commerce sales and inventory systems.

By default, Cloud SQL creates an instance in a single zone, but Cloud SQL provides for high availability by optionally maintaining a failover replica of the primary instance. Any changes to the primary instance are mirrored on the failover replica. If the primary instance fails, Cloud SQL will detect the failure and promote the failover replica to the primary instance.

Cloud SQL also supports read replicas. A read replica is a copy of the primary instance data stored in the same region or different region from the primary instance. When a read replica is in a different region, it is known as a cross-region replica. Cross-region replicas provide support for disaster recovery and migration of data between regions.
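
For instance, a highly available primary instance and a cross-region read replica might be created with commands along these lines; the instance names, regions, database version, and machine tier are illustrative:

    $ gcloud sql instances create orders-db \
        --database-version=POSTGRES_13 \
        --tier=db-custom-4-15360 \
        --region=us-central1 \
        --availability-type=REGIONAL          # primary with a failover replica
    $ gcloud sql instances create orders-db-replica \
        --master-instance-name=orders-db \
        --region=us-east1                     # cross-region read replica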

Database Migration Service is a managed service for migrating MySQL and PostgreSQL databases to Cloud SQL. Support for SQL Server migration is expected soon. The service uses change data capture mechanisms to create the initial snapshot and provide for ongoing replications. The Database Migration Service uses the replication mechanisms of the underlying database. The service is useful for both lift-and-shift migrations and for multicloud continuous replication.

A limiting factor of Cloud SQL is that databases can scale only vertically, that is, by moving the database to a larger machine. For use cases that require horizontal scalability or support for a globally accessed database, Cloud Spanner is an appropriate choice.

Cloud Spanner

Cloud Spanner is a managed database service that supports horizontal scalability across regions. This database supports common relational features, such as schemas for structured data and SQL for querying. Cloud Spanner supports both Google Standard SQL (ANSI 2011 with extensions) and Postgres dialects.

It supports strong consistency, so there is no risk of data anomalies caused by eventual consistency. Cloud Spanner also manages replication. Cloud Spanner is used for applications that require strong consistency on a global scale. Here are some examples:

  • Financial trading systems require a globally consistent view of markets to ensure that traders have a consistent view of the market when making trades.
  • Logistics applications managing a global fleet of vehicles need accurate data on the state of vehicles.
  • Global inventory tracking requires global-scale transactions to preserve the integrity of inventory data.

Cloud Spanner provides 99.999 percent availability, which guarantees less than 5 minutes of downtime per year. Like Cloud SQL, all patching, backing up, and failover management is performed by GCP.

Data is encrypted at rest and in transit. Cloud Spanner is integrated with Cloud Identity to support the use of user accounts across applications and with Cloud Identity and Access Management to control authorizations to perform operations on Cloud Spanner resources.

As with any distributed database, there is the potential for hotspotting, that is, skewing the database workload so that a small number of nodes do a disproportionate amount of work. When using Cloud Spanner, it is recommended that you choose primary keys that do not lead to hotspotting. Incremented values, timestamps, and other values that increase monotonically should not be used as the first part of a primary key, since they direct writes to a single server instead of distributing them evenly across all servers.
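
As an illustration, a table that leads with a UUID rather than a timestamp could be defined as follows; the instance, database, table, and column names are hypothetical:

    $ gcloud spanner databases ddl update telemetry-db \
        --instance=fleet-instance \
        --ddl='CREATE TABLE VehicleEvents (
                 VehicleId  STRING(36) NOT NULL,
                 EventTime  TIMESTAMP  NOT NULL,
                 EngineTemp FLOAT64
               ) PRIMARY KEY (VehicleId, EventTime)'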

Cloud Spanner supports secondary indexes in addition to primary key indexes.

Cloud Spanner stores data encrypted at rest and by default uses Google-managed encryption keys. If you need to manage your own encryption keys, you have the option to use Cloud Key Management Service (KMS) with a symmetric key, a Cloud HSM key, or a Cloud External Key Manager key.

Analytical Database: BigQuery

BigQuery is a managed data warehouse and analytics database solution. It is designed to support queries that scan and return large volumes of data, and it performs aggregations on that data. BigQuery uses SQL as a query language. Customers do not need to choose a machine instance type or storage system. BigQuery is a serverless application from the perspective of the user.

Analytics Features

BigQuery is a managed service built on several Google technologies, including Dremel, Colossus, Borg, and Jupiter. Dremel is the query execution engine that maps SQL statements into execution trees. The leaves of the tree are known as slots, which read data from storage and do some computation. Branches of the tree do aggregation. Colossus is Google's distributed filesystem, and it provides persistent storage, including replication and encryption at rest. Borg is a cluster management system that routes jobs to nodes of the cluster and works around any failed nodes. Jupiter is Google's 1 petabit/second networking infrastructure, which is fast enough to eliminate concerns such as rack-aware placement of data to reduce the need to copy data between racks.

In BigQuery, data is stored in a columnar format, known as Capacitor, which means that values from a single column in a database are stored together rather than storing data from the same row together. This is used in BigQuery because analytics and business intelligence queries often filter and group by values in a small number of columns and do not need to reference all columns in a row. Capacitor is designed to support nested and repeated fields as well.

BigQuery uses the concept of a job for executing tasks such as loading and exporting data, running queries, and copying data. Batch and streaming jobs are supported for loading data.

BigQuery uses the concept of a data set for organizing tables and views. A dataset is contained in a project. A dataset may have a regional or multiregional location. Regional datasets are stored in a single region such as us-west2 or europe-north1. Multiregional locations store data in multiple regions within either the United States or Europe.

BigQuery is billed based on the amount of data stored and the amount of data scanned when responding to queries or, when flat-rate billing is used, on the amount of slot capacity allocated. For this reason, it is best to craft queries that scan only the data that is needed, and filter criteria should be as specific as possible.

If you are interested in viewing the structure of a table or view or you want to see sample data, it is best to use the preview option in the console or use the bq head command from the command line. BigQuery also provides a --dry-run option for command-line queries, which returns an estimate of the number of bytes that would be scanned if the query were executed.
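
For example, the following commands preview rows from a table and estimate the bytes a query would scan without running it; the dataset, table, and column names are hypothetical:

    $ bq head -n 10 sales_ds.orders
    $ bq query --use_legacy_sql=false --dry_run \
        'SELECT customer_id, SUM(total) FROM sales_ds.orders GROUP BY customer_id'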

IAM Roles for BigQuery

BigQuery is integrated with Cloud IAM, which has several predefined roles for BigQuery. Access can be granted at the organization, project, dataset, and table/view levels. When access is granted at the organization or project level, that access applies to all of a project's BigQuery resources. Datasets are children of projects in the resource hierarchy, so access granted at the dataset level applies only to that dataset and its tables and views. You can also assign access at the table and view levels.

  • roles/bigquery.dataViewer: This role allows a user to list projects and tables and get table data and metadata.
  • roles/bigquery.dataEditor: This has the same permissions as dataViewer, plus permissions to create and modify tables and datasets.
  • roles/bigquery.dataOwner: This role is similar to dataEditor, but it can also create, modify, and delete datasets.
  • roles/bigquery.metadataViewer: This role gives permissions to list tables, projects, and datasets.
  • roles/bigquery.user: The user role gives permissions to list projects and tables, view metadata, create datasets, and create jobs.
  • roles/bigquery.jobUser: A jobUser can list projects and create jobs and queries.
  • roles/bigquery.admin: An admin can perform all operations on BigQuery resources.

Additional roles are available for controlling access to BigQuery ML, BigQuery Data Transfer Service, and BigQuery BI Engine.
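
For example, a project-level grant of the data viewer role might look like the following; the project ID and user are hypothetical:

    $ gcloud projects add-iam-policy-binding my-analytics-project \
        --member="user:analyst@example.com" \
        --role="roles/bigquery.dataViewer"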

Loading Data into BigQuery

The two patterns of data loading are batch loading and streaming. BigQuery supports both.

Batch Loading

Batch loading jobs are common in data warehousing and typically involve extracting data from a source system, transforming the data, and then loading it. This is known as extract, transform, and load (ETL), although it is increasingly common to extract, load, and then transform (ELT).

Load jobs are BigQuery jobs for loading data from Cloud Storage or local filesystems. Load jobs work with files in Avro, CSV, ORC, and Parquet format.
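
A batch load from Cloud Storage might be run with a command such as the following; the dataset, table, bucket, and schema are hypothetical:

    $ bq load --source_format=CSV --skip_leading_rows=1 \
        sales_ds.orders gs://example-bucket/exports/orders-2023-06.csv \
        customer_id:STRING,order_date:DATE,total:NUMERIC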

The BigQuery Data Transfer Service is a specialized service for loading data from other cloud services, such as Google Ads and Google Ad Manager. It also supports transferring data from Google software-as-a-service (SaaS) applications and third-party services.

The BigQuery Storage Write API is used to batch process and commit many records in one atomic operation. BigQuery is also able to load data from Cloud Firestore exports.

Streaming

There are two options for streaming data into BigQuery: the Storage Write API and Cloud Dataflow. The Storage Write API provides high-throughput ingestion and exactly-once delivery semantics. Cloud Dataflow implements an Apache Beam runner and can write data directly to BigQuery tables from a Dataflow job.

Choosing a Managed Relational or Analytical Database

GCP provides two managed relational databases and an analytics database with some relational features. Cloud SQL is used for transaction processing systems that do not need to scale beyond a single server. It supports SQL Server, MySQL, and PostgreSQL.

Cloud Spanner is a transaction processing relational database that scales horizontally, and it is used when a single server relational database is insufficient. Cloud Spanner is also used for applications that require a relational database with writable nodes in multiple regions.

BigQuery is designed for data warehousing and analytic querying of large datasets. BigQuery should not be used for transaction processing systems. If data is frequently updated after loading, then one of the other managed relational databases is a better option.

NoSQL Databases

GCP offers three NoSQL databases: Bigtable, Datastore, and Cloud Firestore. All three are well suited to storing data that requires flexible schemas. Cloud Bigtable is a wide-column NoSQL database. Cloud Firestore and Cloud Datastore are document NoSQL databases. Cloud Firestore is the next generation of Cloud Datastore, and in the future existing Datastore databases will be automatically upgraded to Firestore in Datastore mode.

Cloud Bigtable

Cloud Bigtable is a wide-column, sparsely populated multidimensional database designed to support petabyte-scale databases for analytic operations, such as storing data for machine learning model building, as well as operational use cases, such as streaming Internet of Things (IoT) data. It is also used for time series, marketing data, financial data, and graph data. Some of the most important features of Cloud Bigtable are as follows:

  • Sub 10 ms latency
  • Stores petabyte-scale data
  • Uses regional replication
  • Queried using a Cloud Bigtable–specific command, cbt
  • Supports use of Hadoop HBase interface
  • Runs on a cluster of servers that store metadata while data is stored in the Colossus filesystem

Bigtable stores data in tables organized by key-value maps. Each row contains data about a single entity and is indexed by a row key. Multiple cells with different time stamps can exist at the intersection of a row and column. Columns are grouped into column families, which are sets of related columns. A table may contain multiple column families.
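
For example, the cbt tool could be used to create a table with a column family and write a cell value; the instance, table, column family, and row key are hypothetical:

    $ cbt -instance=iot-instance createtable vehicle-telemetry
    $ cbt -instance=iot-instance createfamily vehicle-telemetry stats
    $ cbt -instance=iot-instance set vehicle-telemetry vehicle123#20230601 stats:engine_temp=87
    $ cbt -instance=iot-instance read vehicle-telemetry prefix=vehicle123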

Tables in Bigtable are partitioned into blocks of contiguous rows known as tablets. Tablets are stored in the Colossus scalable filesystem. Data is not stored on nodes in the cluster. Instead, nodes store pointers to tablets stored in Colossus. Distributing read and write load across nodes yields better performance than having hotspots where a small number of nodes are responding to most read and write requests.

Bigtable supports the HBase API, making it a good option for migrating from Hadoop HBase to a managed service in Google Cloud. Bigtable is also a logical choice for replacing Cassandra databases with a managed database service.

Bigtable supports creating more than one cluster in a Bigtable instance. Data is automatically replicated between clusters. This is useful when the instance is performing a large number of read and write operations at the same time. With multiple clusters, one can be dedicated to responding to read requests while the other receives write requests. Bigtable guarantees eventual consistency between the replicas.

Cloud Datastore

Cloud Datastore is a managed document database, which is a kind of NoSQL database that uses a flexible JSON-like data structure called a document.

The terminology used to describe the structure of a document is different than that for relational databases. A table in a relational database corresponds to a kind in Cloud Datastore, while a row is referred to as an entity. The equivalent of a relational column is a property, and a primary key in relational databases is simply called the key in Cloud Datastore.

Cloud Datastore is fully managed. GCP manages all data management operations including distributing data to maintain performance. The flexible data structure makes Cloud Datastore a good choice for applications like product catalogs or user profiles.

Cloud Firestore

Cloud Firestore is the next generation of the GCP-managed document database. Cloud Firestore:

  • Is strongly consistent
  • Supports document and collection data models
  • Supports real-time updates
  • Provides mobile and web client libraries

Cloud Firestore supports both a Datastore mode, which is backward compatible with Cloud Datastore, and a native Firestore mode. In Datastore mode, all queries are strongly consistent, unlike Cloud Datastore queries, which could be eventually consistent. Other Cloud Datastore limitations on querying and writing are removed in Firestore's Datastore mode.

It is a best practice to use Cloud Firestore in Datastore mode for new server-based projects; Datastore mode supports millions of writes per second. Cloud Firestore in its native mode is intended for new web and mobile applications; it provides client libraries and support for millions of concurrent connections.

Caching with Cloud Memorystore

Cloud Memorystore is a managed cache service supporting both Redis and Memcached. Memorystore caches are used for storing data in nonpersistent memory, particularly when low-latency access is important. Stream processing and database caching are both common use cases for Memorystore.

Cloud Memorystore for Redis

Redis is an open source, in-memory data store designed for submillisecond data access. It supports a variety of data types, including strings, lists, sets, sorted sets, bitmaps, and hyperloglogs. Cloud Memorystore for Redis supports instances of up to 300 GB and network throughput of 12 Gbps. Caches replicated across two zones provide 99.9 percent availability.

As with other managed storage services, GCP manages Cloud Memorystore patching, replication, and failover.

Cloud Memorystore is used for low-latency data retrieval, specifically lower latency than is available from databases that store data persistently to disk or SSD.

Cloud Memorystore for Redis is available in two service tiers: Basic and Standard. Basic Tier provides a simple Redis cache on a single server without replication. Standard Tier is a highly available instance with cross-zone replication and support for automatic failover.
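
For example, a Standard Tier instance might be created with a command along these lines; the instance name, size, and region are illustrative, and flag values such as the tier name may vary by gcloud version:

    $ gcloud redis instances create session-cache \
        --size=5 \
        --region=us-central1 \
        --tier=standard_ha       # Standard Tier with cross-zone replication and automatic failover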

Cloud Memorystore for Memcached

Memcached is an open source project and is used for caching in a wide range of use cases including database query caching, reference data caching, and storing session state. Memcached does not have the range of data types found in Redis and instead stores string values indexed by key strings.

Memcached is deployed on clusters, called instances, made up of nodes with the same amount of memory and virtual CPUs. Data is distributed across nodes in the cluster.

Instances can have between 1 and 20 nodes, each with 1 to 32 vCPUs and between 1 GB and 256 GB of memory. An instance can have a maximum combined memory of 5 TB.

Memcached, like Redis, can be accessed from Compute Engine, Google Kubernetes Engine, Cloud Functions, App Engine Flexible, and App Engine Standard. Access from Cloud Functions and App Engine Standard requires Serverless VPC Access.

Data Retention and Lifecycle Management

Data has something of a life as it moves through several stages, including creation, active use, infrequent access but kept online, archived, and deleted. Not all data goes through all of the stages, but it is important to consider lifecycle issues when planning storage systems.

The choice of storage system technology usually does not directly influence data lifecycles and retention policies, but it does impact how the policies are implemented. For example, Cloud Storage lifecycle policies can be used to move objects from Nearline storage to Coldline storage after some period of time. When partitioned tables are used in BigQuery, partitions can be deleted without affecting other partitions or running time-consuming jobs that scan full tables for data that should be deleted.

If you are required to store data, consider how frequently and how fast the data must be accessed.

  • If submillisecond access time is needed, use a cache such as Cloud Memorystore.
  • If data is frequently accessed, may be updated, and needs to be persistently stored, then use a database. Choose between relational and NoSQL based on the structure of the data. Data with flexible schemas can use NoSQL databases.
  • If data is less likely to be accessed the older it gets, store data in time-partitioned tables if the database supports partitions. Time-partitioned tables are frequently used in BigQuery, and Bigtable tables can be organized by time as well.
  • If data is infrequently accessed and does not require access through a query language, consider Cloud Storage. Infrequently used data can be exported from a database, and the export files can be stored in Cloud Storage. If the data is needed, it can be imported back into the database and queried from there.
  • When data is not likely to be accessed, but it must still be stored, use the Archive storage class in Cloud Storage. This is less expensive than other Cloud Storage options.

Cloud Storage provides object lifecycle management policies to make changes automatically to the way objects are stored in the object datastore. These policies contain rules for manipulating objects and are assigned to buckets. The rules apply to objects in those buckets. The rules implement lifecycle actions, including deleting an object and setting the storage class. Rules can be triggered based on the age of the object, when it was created, the number of newer versions, and the storage class of the object.
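
For example, a lifecycle configuration might move Nearline objects to Coldline after 90 days and delete objects after a year; the bucket name and age thresholds are hypothetical:

    $ cat lifecycle.json
    {
      "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}},
        {"action": {"type": "Delete"}, "condition": {"age": 365}}
      ]
    }
    $ gsutil lifecycle set lifecycle.json gs://example-log-bucket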

Another control for data management is retention policies. A retention policy uses the Bucket Lock feature of Cloud Storage buckets to enforce object retention. By setting a retention policy, you ensure that any object in the bucket or future objects in the bucket are not deleted until they reach the age specified in the retention policy. This feature is particularly useful for compliance with government or industry regulations. Once a retention policy is locked, it cannot be revoked.
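
For example, a seven-year retention policy could be applied and then locked with commands such as the following; the bucket name and retention period are hypothetical, and locking the policy is irreversible:

    $ gsutil retention set 7y gs://example-compliance-bucket
    $ gsutil retention lock gs://example-compliance-bucket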

Networking and Latency

Network latency is a consideration when designing storage systems, particularly when data is transmitted between regions within GCP or outside GCP to globally distributed devices. Three ways of addressing network latency concerns are as follows:

  • Replicating data in multiple regions and across continents
  • Distributing data using Cloud CDN
  • Using Google Cloud Premium Network tier

The reason to consider using these options is that the network latency without them would be too high to meet application or service requirements. For some points of reference, note the following:

  • Within Europe or Japan, expect 12–15 ms latency.
  • Within North America, 30–32 ms latency is typical.
  • Trans-Atlantic latency is about 70 ms.
  • Trans-Pacific latency is about 100 ms.
  • Latency between the Europe, Middle East, and Africa (EMEA) region and Asia Pacific is closer to 120 ms.

Data can be replicated in multiple regions under the control of a GCP service or under the control of a customer-managed service. For example, Cloud Storage multiregional storage replicates data to multiple regions. Cloud Spanner distributes data automatically among multiple regions. Cloud Firestore is designed to scale globally. Using GCP services that manage multiregional and global distribution of data is preferred to managing replication at the application level.

Another way to reduce latency is to use GCP's Cloud CDN. This is particularly effective and efficient when distributing relatively static content globally. Cloud CDN maintains a set of points of presence distributed around the world. Points of presence are where the Google network connects to the internet. Static content that is frequently accessed in an area can be cached at these edge nodes.

GCP offers two network service tiers. In the Standard Tier, network traffic between regions is routed over the public internet to the destination device. With the Premium Tier, all data is routed over the Google network up to a point of presence near the destination device. The Premium Tier should be used when high-performance routing, high availability, and low latency at multiregion scales are required.

Summary

GCP provides four types of storage systems: object storage using Cloud Storage, network-attached storage, databases, and caching. Cloud Storage is used for unstructured data that is accessed at the object level; there is no way to query or access subsets of data within an object. Object storage is useful for a wide array of use cases, from uploading data from client devices to storing long-term archives. Network-attached storage is used to store data that is actively processed. Cloud Filestore provides a network filesystem, which is used to share file-structured data across multiple servers.

Google Cloud offers several managed databases, including relational and NoSQL databases. The relational database services are Cloud SQL and Cloud Spanner. Cloud SQL is used for transaction processing systems that serve clients within a region and do not need to scale beyond a single server. Cloud Spanner provides a horizontally scalable, global, strongly consistent relational database. BigQuery is an analytical database designed for data warehousing and analytic database applications. The NoSQL managed databases in GCP are Bigtable, Datastore, and Firestore. Bigtable is a wide-column database designed for low-latency writes at petabyte scales. Datastore and Firestore are managed document databases that scale globally. Firestore is the next generation of document storage in GCP and has fewer restrictions than Cloud Datastore.

When designing storage systems, consider data lifecycle management and network latency. GCP provides services to help implement data lifecycle management policies and offers access to the Google global network through the Premium Tier network service.

Exam Essentials

  • Understand the major types of storage systems available in GCP. These include object storage, persistent local and attached storage, and relational and NoSQL databases. Object storage is often used to store unstructured data, archived data, and files that are treated as atomic units. Persistent local and attached storage provides storage to virtual machines. Relational databases are used for structured data, while NoSQL databases are used when it helps to have flexible schemas.
  • Cloud Storage has multiple classes: Standard, Nearline, Coldline, and Archive. Standard storage class is available in regional, dual regional, and multiregional services. Multiregional storage replicates objects across multiple regions, dual regional replicates data across two regions, while regional replicates data across zones within a region. Nearline storage is used for data that is accessed less than once in 30 days. Coldline storage is used for data that is accessed less than once in 90 days. Archive storage is used for data accessed not more than once per year.
  • Cloud Filestore is a network-attached storage service that provides a filesystem that is accessible from Compute Engine and Kubernetes Engine. Cloud Filestore is designed to provide low latency and IOPS, so it can be used for databases and other performance-sensitive services.
  • Cloud SQL is a managed relational database that can run on a single server. Cloud SQL allows users to deploy MySQL, SQL Server, and PostgreSQL on managed virtual servers. Database administration tasks, such as patching, backing up, and managing failover are managed by GCP.
  • Cloud Spanner is a managed database service that supports horizontal scalability across regions. Cloud Spanner is a relational database used for applications that require strong consistency on a global scale. Cloud Spanner provides 99.999 percent availability, which guarantees less than five minutes of downtime a year. Like Cloud SQL, all patching, backing up, and failover management tasks are performed by GCP.
  • BigQuery is a managed data warehouse and analytical database solution. BigQuery uses the concept of a dataset for organizing tables and views. A dataset is contained in a project. BigQuery provides its own command-line program called bq rather than using the gcloud command line. BigQuery is billed based on the amount of data stored and the amount of data scanned when responding to queries.
  • Cloud Bigtable is designed to support petabyte-scale databases for analytic operations. It is used for storing data for machine learning model building, as well as operational use cases, such as streaming Internet of Things (IoT) data. It is also used for time series, marketing data, financial data, and graph data.
  • Cloud Firestore and Cloud Datastore are managed document databases, which are a kind of NoSQL database that uses a flexible JSON-like data structure called a document. Cloud Firestore and Cloud Datastore are fully managed. GCP manages all data management operations, including distributing data to maintain performance. They are designed so that the response time to return query results is a function of the size of the data returned and not the size of the dataset that is queried. The flexible data structure makes Cloud Firestore and Cloud Datastore good choices for applications like product catalogs or user profiles. Cloud Firestore is the next generation of GCP-managed document database.
  • Cloud Memorystore is a managed cache service that provides Redis and Memcached options. Both are open source platforms that provide in-memory data stores with submillisecond data access. Cloud Memorystore for Redis supports up to 300 GB instances and 12 Gbps network throughput; caches replicated across two zones provide 99.9 percent availability. Cloud Memorystore for Memcached instances can have between 1 and 20 nodes, each with 1 to 32 vCPUs (even values only) and between 1 GB and 256 GB of memory.
  • Cloud Storage provides object lifecycle management policies to make changes automatically to the way that objects are stored in the object datastore. Another control for data management is retention policies. You can use the Bucket Lock feature of Cloud Storage buckets to enforce object retention.
  • Network latency is a consideration when designing storage systems, particularly when data is transmitted between regions within GCP or outside GCP to globally distributed devices. Three ways of addressing network latency concerns are replicating data in multiple regions and across continents, distributing data using Cloud CDN, and using the Google Cloud Premium Network tier.

Review Questions

  1. You need to store a set of files for an extended period. Anytime the data in the files needs to be accessed, it will be copied to a server first, and then the data will be accessed. Files will not be accessed more than once a year. The set of files will all have the same access controls. What storage solution would you use to store these files?
    1. Cloud Storage Archive
    2. Cloud Storage Nearline
    3. Cloud Filestore
    4. Bigtable
  2. You are uploading files in parallel to Cloud Storage and want to optimize load performance. What could you do to avoid creating hotspots when writing files to Cloud Storage?
    1. Use sequential names or time stamps for files.
    2. Do not use sequential names or time stamps for files.
    3. Configure retention policies to ensure that files are not deleted prematurely.
    4. Configure lifecycle policies to ensure that files are always using the most appropriate storage class.
  3. As a consultant on a cloud migration project, you have been asked to recommend a strategy for storing files that must be highly available even in the event of a regional failure. What would you recommend?
    1. BigQuery
    2. Cloud Datastore
    3. Multiregional Cloud Storage
    4. Regional Cloud Storage
  4. As part of a migration to Google Cloud Platform, your department will run a collaboration and document management application on Compute Engine virtual machines. The application requires a filesystem that can be mounted using operating system commands. All documents should be accessible from any instance. What storage solution would you recommend?
    1. Cloud Storage
    2. Cloud Filestore
    3. A document database
    4. A relational database
  5. Your team currently supports seven MySQL databases for transaction processing applications. Management wants to reduce the amount of staff time spent on database administration. What GCP service would you recommend to help reduce the database administration load on your teams?
    1. Bigtable
    2. BigQuery
    3. Cloud SQL
    4. Cloud Filestore
  6. Your company is developing a new service that will have a global customer base. The service will generate large volumes of structured data and require the support of a transaction processing database. All users, regardless of where they are on the globe, must have a consistent view of data. What storage system will meet these requirements?
    1. Cloud Spanner
    2. Cloud SQL
    3. Cloud Storage
    4. BigQuery
  7. Your company is required to comply with several government and industry regulations, which include encrypting data at rest. What GCP storage services can be used for applications subject to these regulations?
    1. Bigtable and BigQuery only
    2. Bigtable and Cloud Storage only
    3. Any of the managed databases, but no other storage services
    4. Any GCP storage service
  8. As part of your role as a data warehouse administrator, you occasionally need to export data from the data warehouse, which is implemented in BigQuery. What command-line tool would you use for that task?
    1. gsutil
    2. gcloud
    3. bq
    4. cbt
  9. Another task that you perform as data warehouse administrator is granting authorizations to perform tasks with the BigQuery data warehouse. A user has requested permission to view table data but not change it. What role would you grant to this user to provide the needed permissions but nothing more?
    1. dataViewer
    2. admin
    3. metadataViewer
    4. dataOwner
  10. A developer is creating a set of reports and is trying to minimize the amount of data each query returns while still meeting all requirements. What bq command-line option will help you understand the amount of data returned by a query without actually executing the query?
    1. --no-data
    2. --estimate-size
    3. --dry-run
    4. --size
  11. A team of developers is choosing between using NoSQL or a relational database. What is a feature of NoSQL databases that is not available in relational databases?
    1. Fixed schemas
    2. ACID transactions
    3. Indexes
    4. Flexible schemas
  12. A group of venture capital investors has hired you to review the technical design of a service that will be developed by a startup company seeking funding. The startup plans to collect data from sensors attached to vehicles. The data will be used to predict when a vehicle needs maintenance and before the vehicle breaks down. Thirty sensors will be on each vehicle. Each sensor will send up to 5 KB of data every second. The startup expects to start with hundreds of vehicles, but it plans to reach 1 million vehicles globally within 18 months. The data will be used to develop machine learning models to predict the need for maintenance. The startup is considering using a self-managed relational database to store the time-series data but wants your opinion. What would you recommend for a time-series database?
    1. Continue to plan to use a self-managed relational database.
    2. Use Cloud SQL.
    3. Use Cloud Spanner.
    4. Use Bigtable.
  13. A Bigtable instance increasingly needs to support simultaneous read and write operations. You'd like to separate the workload so that some nodes respond to read requests and others respond to write requests. How would you implement this to minimize the workload on developers and database administrators?
    1. Create two instances, and separate the workload at the application level.
    2. Create multiple clusters in the Bigtable instance, and use Bigtable replication to keep the clusters synchronized.
    3. Create multiple clusters in the Bigtable instance, and use your own replication program to keep the clusters synchronized.
    4. It is not possible to accomplish the partitioning of the workload as described.
  14. As a database architect, you've been asked to recommend a database service to support an application that will make extensive use of JSON documents. What would you recommend to minimize database administration overhead while minimizing the work required for developers to store JSON data in the database?
    1. Cloud Storage
    2. Cloud Firestore
    3. Cloud Spanner
    4. Cloud SQL
  15. Your Cloud SQL database is experiencing high query latency. You could vertically scale the database to use a larger instance, but you do not need additional write capacity. What else could you try to reduce the number of reads performed by the database?
    1. Switch to Cloud Spanner.
    2. Use Cloud Bigtable instead.
    3. Use Cloud Memorystore to create a database cache that stores the results of database queries. Before a query is sent to the database, the cache is checked for the answer to the query.
    4. Add read replicas to the Cloud SQL database.
  16. You would like to move objects stored in Cloud Storage automatically from regional storage to Nearline storage when the object is six months old. What feature of Cloud Storage would you use?
    1. Retention policies
    2. Lifecycle policies
    3. Bucket locks
    4. Multiregion replication
  17. A customer has asked for help with a web application. Static data served from a data center in Chicago in the United States loads slowly for users located in Australia, South Africa, and Southeast Asia. What would you recommend to reduce latency?
    1. Distribute data using Cloud CDN.
    2. Use Premium Network from the server in Chicago to client devices.
    3. Scale up the size of the web server.
    4. Move the server to a location closer to those users.
  18. A data pipeline ingests performance monitoring data about a fleet of vehicles using Cloud Pub/Sub. The data is written to Cloud Bigtable to enable queries about specific vehicles. The data will also be written to BigQuery and BigQuery ML will be used to build predictive models about failures in vehicle components. You would like to provide high throughput ingestion and exactly-once delivery semantics when writing data to BigQuery. How would you load that data into BigQuery?
    1. BigQuery Transfer Service
    2. Cloud Storage Transfer Service
    3. BigQuery Storage Write API
    4. BigQuery Load Jobs