Storage is an essential component of virtually any cloud-based system. Storage needs can range from long-term archival storage for rarely accessed data to highly volatile, frequently accessed data cached in memory. Google Cloud Platform (GCP) provides a full range of storage options, such as the following:
This chapter reviews the key concepts and design criteria that you will need to know to pass the Professional Cloud Architect exam, including data retention and lifecycle management considerations as well as addressing networking and latency issues with storage system design.
Cloud architects often must select one or more storage systems when designing an application. Several factors influence the choice of storage systems, such as the following:
The answer to these and similar questions will help you decide which storage services to use and how to configure them.
Google Cloud Storage is an object storage system. It is designed for persisting unstructured data, such as data files, images, videos, backup files, and any other data. It is unstructured in the sense that objects, that is, files stored in Cloud Storage, are treated as atomic. When you access a file in Cloud Storage, you access the entire file. You cannot treat it as a file on a block storage device that allows for seeking and reading specific blocks in the file. There is no presumed structure within the file that Cloud Storage can exploit.
Cloud Storage uses buckets to group objects. Objects are different from files in that they cannot be updated. They are versioned and accessed as a single unit; that is, you cannot access a subset of an object the way you can access a block of a file in a filesystem. A bucket is a group of objects that share access controls at the bucket level. For example, the service account assigned to a virtual machine may have permissions to write to one bucket and read from another bucket. Individual objects within buckets can have their own access controls as well.
Google Cloud Storage uses a global namespace for bucket names, so every bucket name must be globally unique. Object names do not have to be unique. A bucket is named when it is created and cannot be renamed. To simulate renaming a bucket, copy the contents of the bucket to a new bucket with the desired name and then delete the original bucket.
The following are best practice suggestions for bucket naming:
mybucket.example.com
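To make these constraints concrete, here is a minimal sketch of a validator for the core published naming rules (3 to 63 characters, lowercase letters, digits, hyphens, underscores, and dots, beginning and ending with a letter or digit, and no names starting with "goog"). The function name is our own, and longer dotted domain-style names that Cloud Storage also permits, as well as the rule against close misspellings of "google," are not handled here:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Check a bucket name against the core Cloud Storage naming rules.

    Simplified sketch: dotted names longer than 63 characters, which
    Cloud Storage allows up to 222 characters, are rejected here.
    """
    if not 3 <= len(name) <= 63:
        return False
    if name.startswith("goog") or "google" in name:
        return False
    # Must start and end with a letter or digit; dots, hyphens,
    # and underscores are allowed in between.
    return re.fullmatch(r"[a-z0-9][a-z0-9._-]*[a-z0-9]", name) is not None

print(is_valid_bucket_name("mybucket.example.com"))  # True
print(is_valid_bucket_name("MyBucket"))              # False: uppercase
```

A domain-style name such as mybucket.example.com passes, while mixed-case or "goog"-prefixed names do not.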
The Cloud Storage service does not provide a filesystem. This means that there is no ability to navigate a path through a hierarchy of directories and files. The object store does, however, support a naming convention that allows objects to be named in a way that looks similar to how a hierarchical filesystem structures a file path and filename. If you would like to use Google Cloud Storage as a filesystem, the Cloud Storage FUSE open source project provides a mechanism to map from object storage systems to filesystems.
Filesystem in Userspace (FUSE) is a framework for exposing a filesystem to the Linux kernel. FUSE uses a stand-alone application that runs on Linux and provides a filesystem API along with an adapter for implementing filesystem functions in the underlying storage system. Cloud Storage FUSE is an open source adapter that allows users to mount Cloud Storage buckets as filesystems on Linux and macOS platforms.
Cloud Storage FUSE is not a filesystem like NFS. It does not implement a filesystem or a hierarchical directory structure. It does, however, interpret / characters in object names as directory delimiters.
For example, when using Cloud Storage FUSE, a user could mount a Cloud Storage bucket to a mount point called gcs. The user could then interact with the local operating system to save a file named mydata.csv to /gcs/myproject/mydirectory/mysubdirectory. The user could execute the ls command at the command line to list the contents of the simulated mysubdirectory and see the listing for mydata.csv along with any other files whose names are prefixed by myproject/mydirectory. The user could then use a gsutil command or the cloud console to list the contents of the mounted bucket, and they would see an object named myproject/mydirectory/mysubdirectory/mydata.csv.
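The prefix-based mapping from flat object names to simulated directories can be sketched in a few lines. The helper below is our own illustration, not part of Cloud Storage FUSE, and mimics how a delimiter turns one level of name prefixes into "subdirectories":

```python
def list_directory(object_names, prefix, delimiter="/"):
    """Simulate a directory listing over flat object names, the way
    Cloud Storage FUSE (or gsutil with a delimiter) presents them."""
    files, subdirs = set(), set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a subdirectory.
            subdirs.add(rest.split(delimiter, 1)[0] + delimiter)
        else:
            rest and files.add(rest)
    return sorted(files), sorted(subdirs)

objects = [
    "myproject/mydirectory/mysubdirectory/mydata.csv",
    "myproject/mydirectory/notes.txt",
]
print(list_directory(objects, "myproject/mydirectory/"))
# (['notes.txt'], ['mysubdirectory/'])
```

The bucket still stores only flat object names; the hierarchy exists purely in how names are parsed.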
Cloud Storage FUSE is useful when you want to move files easily back and forth between Cloud Storage and a Compute Engine VM, a local server, or your development device using filesystem commands instead of gsutil commands or the cloud console.
Cloud Storage FUSE is a Google-developed and community-supported open source project under the Apache license. Cloud Storage FUSE is available at github.com/GoogleCloudPlatform/gcsfuse.
Cloud Storage offers four tiers or types of storage. For the Professional Cloud Architect exam, it is essential to understand the characteristics of each tier and when it should be used. The four types of Cloud Storage are as follows:
Standard storage is well suited for use with frequently accessed (“hot”) data or for data that is stored briefly. Standard storage has three location types: region, dual-region, and multiregion.
The region location type stores multiple copies of an object in multiple zones in one region. All Cloud Storage options provide high durability, which means the probability of losing an object during any period is extremely low. Cloud Storage provides 99.999999999 percent (eleven 9s) annual durability.
This level of durability is achieved by keeping redundant copies of the object. Availability is the ability to access an object when you want it. An object can be durably stored but unavailable. For example, a network outage in a region would prevent you from accessing an object stored in that region, although it would continue to be stored in multiple zones.
Dual-region and multiregion storage mitigate the risk of a regional outage by storing replicas of objects in two or multiple regions, respectively. This can also improve access time and latency by distributing copies of objects to locations that are closer to the users of those objects. Consider a user in California in the western United States accessing an object stored in us-west1, a region located in Oregon in the northwestern United States. That user can expect latencies under 5 ms, while a user in New York, in the northeastern United States, would likely experience latencies closer to 30 ms.
Dual-region and multiregion storage are also known as geo-redundant storage.
The latencies mentioned here are based on public internet network infrastructure. Google offers two network tiers: Standard and Premium. With the Standard network tier, data is routed between regions using public internet infrastructure and is subject to network conditions and routing decisions beyond Google's control. The Premium network tier routes data over Google's global high-speed network. Users of Premium tier networking can expect lower latencies.
Nearline and Coldline storage are used for storing data that is not frequently accessed. Data that is accessed less than once in 30 days is a good candidate for Nearline storage. Data that is accessed less than once in 90 days is a good candidate for Coldline storage. All storage classes have the same latency to return the first byte of data. Archive storage is optimal for data accessed less than once per year.
Nearline, Coldline, and Archive storage have slightly lower availability than Standard storage, lower at-rest storage costs, and higher data access costs.
Cloud Storage also has three other classes: Multi-Regional storage, Regional storage, and Durable Reduced Availability (DRA) storage. Google Cloud recommends using Standard storage unless you are already using one of these legacy classes.
Cloud Storage is used for a few broad use cases.
Each of these examples fits well with Cloud Storage's treatment of objects as atomic units. If data within the file needs to be accessed and processed, that is done by another service or application, such as a Spark analytics program.
Different tiers are better suited to different use cases. For example, Archive class storage is best used for archival storage, while multiregion storage may be the best option for uploading user data, especially if users are geographically dispersed.
As an architect, it is important to understand the characteristics of the four storage tiers, their relative costs, the assumption of atomicity of objects by Cloud Storage, and how Cloud Storage is used in larger workflows.
Cloud Filestore is a network-attached storage service that provides a filesystem accessible from Compute Engine and Kubernetes Engine. Cloud Filestore is designed to provide low latency and high IOPS, so it can be used for databases and other performance-sensitive services.
To use Cloud Filestore, you create a Filestore instance, which has a name, service tier, storage type, and capacity. The service tier determines the Filestore's capacity, scalability, and performance. Storage types are HDD or SSD.
You can create backups of an instance to preserve copies of all files and metadata associated with a Filestore instance. Backups are regional resources. You can also create snapshots, which are copies of the state of the filesystem instance at a point in time. Snapshots are stored within the Filestore instance.
Cloud Filestore is especially useful when organizations are lifting and shifting applications that require a filesystem and cannot use object storage systems, such as Cloud Storage. Some typical use cases for Cloud Filestore are home directories and shared directories, web server content, and migrated applications that require a filesystem.
Cloud Filestore has three service tiers.
Filestore Basic is designed for file sharing, software, and Google Kubernetes Engine workloads that benefit from a persistent, managed filesystem. Filestore Basic is a zonal resource.
Filestore High Scale is designed to meet the demands of high-performance computing workloads, such as genome analysis and financial services applications requiring low-latency file operations. Filestore High Scale is a zonal resource.
Filestore Enterprise is for mission-critical applications and Google Kubernetes Engine workloads. This version provides 99.99 percent regional availability by provisioning multiple NFS shares across multiple zones within a region.
Filestore provides for snapshots of the filesystem, which can be taken periodically. If you need to recover a filesystem from a snapshot, it would be available within 10 minutes.
Cloud Filestore connects to VPC networks using either VPC Network Peering or private services access. VPC Network Peering is used when creating a filesystem instance with a stand-alone VPC network, when creating an instance within the host project of a Shared VPC, or when accessing the filesystem from an on-premises network using Cloud VPN or Cloud Interconnect. Private services access is used when creating an instance on a Shared VPC network from a service project (not the host project) or when you are using centralized IP range management for multiple Google Cloud services. Note that Cloud Filestore does not support transitive peering.
Access controls to Cloud Filestore and to files on the filesystem are managed with a combination of IAM roles and POSIX file permissions.
The roles/file.editor and roles/file.viewer roles are used to grant users permissions on a Cloud Filestore instance. The viewer role allows users to see details about an instance, its location, backups, and snapshots, as well as the operational status of the instance. The editor role includes the viewer permissions as well as permissions to create and delete instances, backups, and snapshots. These roles do not provide access to the files in the Filestore instance.
When a Filestore instance is created, it has default POSIX permissions of rwxr-xr-x, which can be changed using operating system commands such as chmod. Access control lists are also supported, using operating system commands such as setfacl.
Google Cloud provides a variety of database storage systems. The Professional Cloud Architect exam may include questions that require you to choose an appropriate database solution when given a set of requirements. The databases can be broadly grouped as relational, analytical, and NoSQL databases.
Relational databases are highly structured data stores that are designed to store data in ways that minimize the risk of data anomalies and to support a comprehensive query language. An example of a data anomaly is inserting an employee record with a department ID that does not exist in the department table. Relational databases support data models that include constraints that help prevent anomalies.
Another common characteristic of relational databases is support for ACID transactions. (ACID stands for atomicity, consistency, isolation, and durability.) NoSQL databases may support ACID transactions as well, but not all do.
Atomic operations ensure that all steps in a transaction complete or no steps take effect. For example, a sales transaction might include reducing the number of products available in inventory and charging a customer's credit card. If there isn't sufficient inventory, the transaction will fail, and the customer's credit card will not be charged.
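The all-or-nothing behavior can be sketched in a few lines of Python. This is a toy illustration only; real databases implement atomicity with transaction logs and rollback, not application-level checks. The function and exception names are our own:

```python
class InsufficientInventory(Exception):
    pass

def purchase(inventory, card_charges, product, quantity, price):
    """All-or-nothing sketch of the sale transaction: decrement
    inventory and charge the card, or do neither."""
    if inventory.get(product, 0) < quantity:
        # Fail before any state is changed, so no partial effects occur.
        raise InsufficientInventory(product)
    inventory[product] -= quantity          # step 1: reduce inventory
    card_charges.append(quantity * price)   # step 2: charge the card

inventory = {"widget": 1}
charges = []
try:
    purchase(inventory, charges, "widget", 5, 9.99)
except InsufficientInventory:
    pass
print(inventory, charges)  # {'widget': 1} [] -- neither step took effect
```

Because the check happens before either step, a failed transaction leaves both the inventory and the card charges untouched.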
Consistency, specifically transactional consistency, is a property that guarantees that when a transaction executes, the database is left in a state that complies with constraints, such as uniqueness requirements and referential integrity, which ensures foreign keys reference a valid primary key. When a database is distributed, consistency also refers to querying data from different servers in a database cluster and receiving the same data from each.
For example, some NoSQL databases replicate data on multiple servers to improve availability. If there is an update to a record, each copy must be updated. In the time between the first and last copies being updated, it is possible to have two instances of the same query receive different results. This is considered an inconsistent read. Eventually, all replicas will be updated, so this is referred to as eventual consistency.
There are other types of consistency; for those interested in the details of other consistency models, see “Consistency in Non-Transactional Distributed Storage Systems” by Paolo Viotti and Marko Vukolić at arxiv.org/pdf/1512.00168.pdf
.
Isolation refers to ensuring that the effects of transactions that run at the same time leave the database in the same state as if they ran one after the other. Let's consider an example.
Transaction 1 is as follows:
Transaction 2 is as follows:
When high isolation is in place, the value of C will be either 5 or 15, which are the results of either of the transactions. The data will appear as if the three operations of Transaction 1 executed first and the three operations of Transaction 2 executed next. In that case, the value of C is 15. If the operations of Transaction 2 execute first followed by the Transaction 1 operations, then C will have the value 5.
What will not occur is that some of the Transaction 1 operations will execute followed by some of the Transaction 2 operations and then the rest of the Transaction 1 operations.
Here is an execution sequence that cannot occur when isolation is in place:
This sequence of operations would leave C with the assigned value of 7, which would be an incorrect state for the database to be in.
The durability property ensures that once a transaction is executed, the state of the database will always reflect or account for that change. This property usually requires databases to write data to persistent storage—even when the data is also stored in memory—so that in the event of a crash, the effects of the transactions are not lost.
Google Cloud Platform offers two managed relational database services: Cloud SQL and Cloud Spanner. Each is designed for distinct use cases. In addition to the two managed services, GCP customers can run their own databases on GCP virtual machines.
Cloud SQL is a managed service that provides MySQL, SQL Server, and PostgreSQL databases.
Cloud SQL allows users to deploy MySQL, SQL Server, and PostgreSQL on managed virtual servers. GCP manages patching database software, backups, and failovers. Key features of Cloud SQL include the following:
Cloud SQL is an appropriate choice for regional databases up to 30 TB.
Customers can choose the machine type and disk size for the virtual machine running the database server. MySQL, SQL Server, and PostgreSQL all support the SQL query language.
Relational databases, like MySQL, SQL Server, and PostgreSQL, are used with structured data. Both require well-defined schemas, which can be specified using the data definition commands of SQL. These databases also support strongly consistent transactions, so there is no need to work around issues with eventual consistency. Cloud SQL is appropriate for transaction processing applications, such as e-commerce sales and inventory systems.
By default, Cloud SQL creates an instance in a single zone, but Cloud SQL provides for high availability by optionally maintaining a failover replica of the primary instance. Any changes to the primary instance are mirrored on the failover replica. If the primary instance fails, Cloud SQL will detect the failure and promote the failover replica to the primary instance.
Cloud SQL also supports read replicas. A read replica is a copy of the primary instance data stored in the same region or different region from the primary instance. When a read replica is in a different region, it is known as a cross-region replica. Cross-region replicas provide support for disaster recovery and migration of data between regions.
Database Migration Service is a managed service for migrating MySQL and PostgreSQL databases to Cloud SQL. Support for SQL Server migration is expected soon. The service uses change data capture mechanisms to create the initial snapshot and provide for ongoing replications. The Database Migration Service uses the replication mechanisms of the underlying database. The service is useful for both lift-and-shift migrations and for multicloud continuous replication.
A limiting factor of Cloud SQL is that databases can scale only vertically, that is, by moving the database to a larger machine. For use cases that require horizontal scalability or support, a globally accessed database, Cloud Spanner, is an appropriate choice.
Cloud Spanner is a managed database service that supports horizontal scalability across regions. This database supports common relational features, such as schemas for structured data and SQL for querying. Cloud Spanner supports both Google Standard SQL (ANSI 2011 with extensions) and Postgres dialects.
It supports strong consistency, so there is no risk of data anomalies caused by eventual consistency. Cloud Spanner also manages replication. Cloud Spanner is used for applications that require strong consistency on a global scale. Here are some examples:
Cloud Spanner provides 99.999 percent availability, which guarantees less than 5 minutes of downtime per year. Like Cloud SQL, all patching, backing up, and failover management is performed by GCP.
Data is encrypted at rest and in transit. Cloud Spanner is integrated with Cloud Identity to support the use of user accounts across applications and with Cloud Identity and Access Management to control authorizations to perform operations on Cloud Spanner resources.
As with any distributed database, there is the potential for hot spotting. That is skewing the database workload so that a small number of nodes are doing a disproportionate amount of work. When using Cloud Spanner, it is recommended that you use primary keys that do not lead to hotspotting. Incremented values, time stamps, and other values that monotonically increase in the first part of the key should not be used as a primary key since it will lead to writes being directed to a single server instead of more evenly distributed across all servers.
Cloud Spanner supports secondary indexes in addition to primary key indexes.
Cloud Spanner stores data encrypted at rest and by default uses Google-managed encryption. If you need to manage your encryption, you have the option to use Cloud Key Management Service (KMS) with a symmetric key, a Cloud HSM key, or a Cloud External Key Manager key.
BigQuery is a managed data warehouse and analytics database solution. It is designed to support queries that scan and return large volumes of data, and it performs aggregations on that data. BigQuery uses SQL as a query language. Customers do not need to choose a machine instance type or storage system. BigQuery is a serverless application from the perspective of the user.
BigQuery is a managed service built on several Google technologies, including Dremel, Colossus, Borg, and Jupiter. Dremel is the query execution engine that maps SQL statements into execution trees. The leaves of the tree are known as slots, which read data from storage and do some computation. Branches of the tree do aggregation. Colossus is Google's distributed filesystem, and it provides persistent storage, including replication and encryption at rest. Borg is a cluster management system that routes jobs to nodes of the cluster and works around any failed nodes. Jupiter is Google's 1 petabit/second networking infrastructure, which is fast enough to eliminate concerns such as rack-aware placement of data to reduce the need to copy data between racks.
In BigQuery, data is stored in a columnar format, known as Capacitor, which means that values from a single column in a database are stored together rather than storing data from the same row together. This is used in BigQuery because analytics and business intelligence queries often filter and group by values in a small number of columns and do not need to reference all columns in a row. Capacitor is designed to support nested and repeated fields as well.
BigQuery uses the concept of a job for executing tasks such as loading and exporting data, running queries, and copying data. Batch and streaming jobs are supported for loading data.
BigQuery uses the concept of a data set for organizing tables and views. A dataset is contained in a project. A dataset may have a regional or multiregional location. Regional datasets are stored in a single region such as us-west2 or europe-north1. Multiregional locations store data in multiple regions within either the United States or Europe.
BigQuery is billed based on the amount of data stored and the amount of data scanned when responding to queries, or in the case of flat-rate query billing, the allocation is used based on the size of the query. For this reason, it is best to craft queries that return only the data that is needed, and filter criteria should be as specific as possible.
If you are interested in viewing the structure of a table or view or you want to see sample data, it is best to use the Preview Option in the console or use the bq head
command from the command line. BigQuery also provides a --dry-run
option for command-line queries. It returns an estimate of the number of bytes that would be returned if the query were executed.
BigQuery is integrated with Cloud IAM, which has several predefined roles for BigQuery. Access can be granted at the organization, project, dataset, and table/view levels. When access is provided at the organization or project level, that access applies to all of a project's BigQuery resources. Datasets are children of projects in the resource hierarchy, so access granted at the dataset level apply only to that dataset and its tables and views. You can also assign access at the table and view levels.
Additional roles are available for controlling access to BigQuery ML, BigQuery Data Transfer Service, and BigQuery BI Engine.
The two patterns of data loading are batch loading and streaming. BigQuery supports both.
Batch loading jobs are common in data warehousing and typically involve extracting data from a source system, transforming the data, and then loading it. This is known as extraction, transformation, and load (ETL) although it is increasingly common to extract, load, and then transform (ELT).
Load jobs are BigQuery jobs for loading data from Cloud Storage or local filesystems. Load jobs work with files in Avro, CSV, ORC, and Parquet format.
The BigQuery Data Transfer Service is a specialized service for loading data from other cloud services, such as Google Ads and Google Ad Managers. It also supports transferring data from Google software-as-a-service (SaaS) applications or third-party services.
The BigQuery Storage Write API is used to batch process and commit many records in one atomic operation. BigQuery is also able to load data from Cloud Firestore exports.
There are two options for streaming data into BigQuery; they are Storage Write API and Cloud Dataflow. The Storage Write API provides high-throughput ingestion and exactly-once delivery semantics. Cloud Dataflow implements an Apache Beam runner and can write data directly to BigQuery tables from a Dataflow job.
GCP provides two managed relational databases and an analytics database with some relational features. Cloud SQL is used for transaction processing systems that do not need to scale beyond a single server. It supports SQL Server, MySQL, and PostgreSQL.
Cloud Spanner is a transaction processing relational database that scales horizontally, and it is used when a single server relational database is insufficient. Cloud Spanner is also used for applications that require a relational database with writable nodes in multiple regions.
BigQuery is designed for data warehousing and analytic querying of large datasets. BigQuery should not be used for transaction processing systems. If data is frequently updated after loading, then one of the other managed relational databases is a better option.
GCP offers three NoSQL databases: Bigtable, Datastore, and Cloud Firestore. All three are well suited to storing data that requires flexible schemas. Cloud Bigtable is a wide-column NoSQL database. Cloud Firestore and Cloud Datastore are document NoSQL databases. Cloud Firebase is the next generation of Cloud Datastore, and in the future existing Datastore databases will be automatically upgraded to Firestore using Datastore mode.
Cloud Bigtable is a wide-column, sparsely populated multidimensional database designed to support petabyte-scale databases for analytic operations, such as storing data for machine learning model building, as well as operational use cases, such as streaming Internet of Things (IoT) data. It is also used for time series, marketing data, financial data, and graph data. Some of the most important features of Cloud Bigtable are as follows:
cbt
Bigtable stores data in tables organized by key-value maps. Each row contains data about a single entity and is indexed by a row key. Multiple cells with different time stamps can exist at the intersection of a row and column. Columns are grouped into column families, which are sets of related columns. A table may contain multiple column families.
Tables in Bigtable are partitioned into blocks of contiguous rows known as tablets. Tablets are stored in the Colossus scalable filesystem. Data is not stored on nodes in the cluster. Instead, nodes store pointers to tablets stored in Colossus. Distributing read and write load across nodes yields better performance than having hotspots where a small number of nodes are responding to most read and write requests.
Bigtable supports the HBase API, making it a good option for migrating from Hadoop HBase to a managed service such as Google Cloud. Bigtable is also a logical choice for replacing Cassandra databases with a managed database service.
Bigtable supports creating more than one cluster in a Bigtable instance. Data is automatically replicated between clusters. This is useful when the instance is performing a large number of read and write operations at the same time. With multiple clusters, one can be dedicated to responding to read requests while the other receives write requests. Bigtable guarantees eventual consistency between the replicas.
Cloud Datastore is a managed document database, which is a kind of NoSQL database that uses a flexible JSON-like data structure called a document.
The terminology used to describe the structure of a document is different than that for relational databases. A table in a relational database corresponds to a kind in Cloud Datastore, while a row is referred to as an entity. The equivalent of a relational column is a property, and a primary key in relational databases is simply called the key in Cloud Datastore.
Cloud Datastore is fully managed. GCP manages all data management operations including distributing data to maintain performance. The flexible data structure makes Cloud Datastore a good choice for applications like product catalogs or user profiles.
Cloud Firestore is the next generation of the GCP-managed document database. Cloud Firestore:
Cloud Firestore supports both a Datastore mode, which is backward compatible with Cloud Datastore, and a Firestore mode. In Datastore mode, all transactions are strongly consistent, unlike Cloud Datastore transactions, which are eventually consistent. Other Cloud Datastore limitations on querying and writing are removed in Firebase Datastore mode.
It's best practice for customers to use Cloud Firestore in Datastore mode for new server-based projects. This supports up to millions of writes per second. Cloud Firestore in its native mode is for new web and mobile applications. This provides client libraries and support for millions of concurrent connections.
Cloud Memorystore is a managed cache service supporting both Redis and Memcached. Memorystore caches are used for storing data in nonpersistent memory, particularly when low-latency access is important. Stream processing and database caching are both common use cases for Memorystore.
Redis is an open source, in-memory data store, which is designed for submillisecond data access that has a variety of data types including strings, lists, sets, sorted sets, bitmaps, and hyperloglogs. Cloud Memorystore for Redis supports up to 300 GB instances and 12 Gbps network throughput. Caches replicated across two zones provide 99.9 percent availability.
As with other managed storage services, GCP manages Cloud Memorystore patching, replication, and failover.
Cloud Memorystore is used for low-latency data retrieval, specifically lower latency than is available from databases that store data persistently to disk or SSD.
Cloud Memorystore for Redis is available in two service tiers: Basic and Standard. Basic Tier provides a simple Redis cache on a single server without replication. Standard Tier is a highly available instance with cross-zone replication and support for automatic failover.
Memcached is an open source project and is used for caching in a wide range of use cases including database query caching, reference data caching, and storing session state. Memcached does not have the range of data types found in Redis and instead stores string values indexed by key strings.
Memcached is deployed on clusters, called instances, made up of nodes with the same amount of memory and virtual CPUs. Data is distributed across nodes in the cluster.
Instances can have between one and 20 nodes, which each have one to 32 vCPUs and between 1 GB and 256 GB of memory. The instance can have a maximum combined memory of 5 TB.
Memcached, like Redis, can be accessed from Compute Engine, Google Kubernetes Engine, Cloud Functions, App Engine Flexible, and App Engine Standard. Access to Cloud Functions and App Engine Standard requires Serverless VPC access.
Data has something of a life as it moves through several stages, including creation, active use, infrequent access but kept online, archived, and deleted. Not all data goes through all of the stages, but it is important to consider lifecycle issues when planning storage systems.
The choice of storage system technology usually does not directly influence data lifecycles and retention policies, but it does impact how the policies are implemented. For example, Cloud Storage lifecycle policies can be used to move objects from Nearline storage to Coldline storage after some period of time. When partitioned tables are used in BigQuery, partitions can be deleted without affecting other partitions or running time-consuming jobs that scan full tables for data that should be deleted.
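The partition-deletion point can be sketched as follows: with daily partitions named by a YYYYMMDD suffix, expired partitions can be identified from their dates alone, without scanning any table contents. The decorator naming below mirrors BigQuery's table$YYYYMMDD partition decorators, but the table name and selection logic are illustrative.

```python
from datetime import date, timedelta

def expired_partitions(partition_dates, retention_days, today):
    """Return partition decorators for daily partitions older than the
    retention window, e.g. 'events$20230101'. No table data is scanned."""
    cutoff = today - timedelta(days=retention_days)
    return [f"events${d:%Y%m%d}" for d in partition_dates if d < cutoff]

parts = [date(2023, 1, 1), date(2023, 6, 1), date(2023, 12, 1)]
print(expired_partitions(parts, 90, today=date(2023, 12, 15)))
# -> ['events$20230101', 'events$20230601']
```

Each returned decorator identifies a single partition that could then be dropped on its own, leaving the other partitions untouched.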
If you are required to store data, consider how frequently and how fast the data must be accessed.
Cloud Storage provides object lifecycle management policies that automatically change the way objects are stored in the object datastore. Policies are assigned to buckets and contain rules for manipulating the objects in those buckets. The rules implement lifecycle actions, including deleting an object and setting its storage class. Rules can be triggered based on the age of the object, the date it was created, the number of newer versions, and the object's storage class.
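A lifecycle policy is expressed as a set of rules, each pairing an action with one or more conditions. The structure built below follows the JSON format accepted by the gsutil lifecycle command; the bucket name and the ages chosen (30 and 365 days) are arbitrary examples.

```python
import json

# Two rules: move objects to Coldline after 30 days, delete them after 365 days.
lifecycle_policy = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 30},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

# Written to a file, this could be applied to a bucket with:
#   gsutil lifecycle set policy.json gs://example-bucket
print(json.dumps(lifecycle_policy, indent=2))
```

Because the policy lives on the bucket, every current and future object in that bucket is evaluated against these rules with no application code involved.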
Another control for data management is a retention policy. A retention policy uses the Bucket Lock feature of Cloud Storage to enforce object retention. By setting a retention policy, you ensure that no object in the bucket, including objects added later, can be deleted until it reaches the age specified in the policy. This feature is particularly useful for compliance with government or industry regulations. Once a retention policy is locked, it cannot be revoked.
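The retention rule just described reduces to a simple age check: an object may be deleted only once its age meets or exceeds the bucket's retention period. The function below is illustrative only; Cloud Storage enforces this check server-side, and the dates and 90-day period are made-up examples.

```python
from datetime import datetime, timedelta, timezone

def deletion_allowed(object_created, retention_period, now=None):
    """Return True only if the object has aged past the retention period."""
    now = now or datetime.now(timezone.utc)
    return now - object_created >= retention_period

created = datetime(2024, 1, 1, tzinfo=timezone.utc)
policy = timedelta(days=90)   # hypothetical 90-day retention policy

# 31 days old: deletion is refused.
print(deletion_allowed(created, policy, now=datetime(2024, 2, 1, tzinfo=timezone.utc)))  # False
# 152 days old: deletion is allowed.
print(deletion_allowed(created, policy, now=datetime(2024, 6, 1, tzinfo=timezone.utc)))  # True
```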
Network latency is a consideration when designing storage systems, particularly when data is transmitted between regions within GCP or outside GCP to globally distributed devices. Three ways of addressing network latency concerns are as follows:
The reason to consider using these options is that the network latency without them would be too high to meet application or service requirements. For some points of reference, note the following:
Data can be replicated in multiple regions under the control of a GCP service or under the control of a customer-managed service. For example, Cloud Storage multiregional storage replicates data to multiple regions. Cloud Spanner distributes data automatically among multiple regions. Cloud Firestore is designed to scale globally. Using GCP services that manage multiregional and global distribution of data is preferred to managing replication at the application level.
Another way to reduce latency is to use GCP's Cloud CDN, which is particularly effective and efficient for distributing relatively static content globally. Cloud CDN maintains a set of points of presence around the world; points of presence are where the Google network connects to the internet. Static content that is frequently accessed in an area can be cached at these edge nodes.
GCP offers two network service tiers. In the Standard Tier, network traffic between regions is routed over the public internet to the destination device. With the Premium Tier, all data is routed over the Google network up to a point of presence near the destination device. The Premium Tier should be used when high-performance routing, high availability, and low latency at multiregion scales are required.
GCP provides four types of storage systems: object storage using Cloud Storage, network-attached storage, databases, and caching. Cloud Storage is used for unstructured data that is accessed at the object level; there is no way to query or access subsets of data within an object. Object storage is useful for a wide array of use cases, from uploading data from client devices to storing long-term archives. Network-attached storage is used to store data that is actively processed. Cloud Filestore provides a network filesystem, which is used to share file-structured data across multiple servers.
Google Cloud offers several managed databases, including relational and NoSQL databases. The relational database services are Cloud SQL and Cloud Spanner. Cloud SQL is used for transaction processing systems that serve clients within a region and do not need to scale beyond a single server. Cloud Spanner provides a horizontally scalable, global, strongly consistent relational database. BigQuery is an analytical database designed for data warehousing and analytic database applications. The NoSQL managed databases in GCP are Bigtable, Datastore, and Firestore. Bigtable is a wide-column database designed for low-latency writes at petabyte scales. Datastore and Firestore are managed document databases that scale globally. Firestore is the next generation of document storage in GCP and has fewer restrictions than Cloud Datastore.
When designing storage systems, consider data lifecycle management and network latency. GCP provides services to help implement data lifecycle management policies and offers access to the Google global network through the Premium Tier network service.
BigQuery is accessed from the command line with the bq command rather than the gcloud command. BigQuery is billed based on the amount of data stored and the amount of data scanned when responding to queries.

Know the command-line tools for working with storage services: gsutil for Cloud Storage, gcloud for most other services, bq for BigQuery, and cbt for Bigtable.

Which bq command-line option will help you understand the amount of data returned by a query without actually executing the query?

A. --no-data
B. --estimate-size
C. --dry-run
D. --size