Chapter 10. Administering and Securing BigQuery

One of the reasons to use a fully managed serverless product like BigQuery is to take advantage of the security infrastructure of public cloud services. In Google Cloud Platform (GCP), data is encrypted at rest and in transit, and the API-serving infrastructure is accessible only over encrypted channels. To access BigQuery resources, users and applications must be authenticated and authorized using Identity and Access Management. You can perform this administration (of users, tables, jobs, views, etc.) by using the BigQuery web user interface (UI), using the bq command-line tool, or using the REST API.

In this chapter, we discuss how BigQuery’s infrastructure is secured, how to configure Cloud IAM, and a range of administration tools that you use to monitor jobs and authorize users. We end this chapter with a discussion of BigQuery support for a variety of tools that you might be able to use to help fulfill your regulatory and compliance needs based on the strong foundations established by the infrastructure security measures, Identity and Access Management, and administrative tools. It is always your responsibility to work with your legal counsel to determine whether implementing any of these tools and capabilities will satisfy your regulatory or compliance requirements.

Infrastructure Security

The security infrastructure that BigQuery relies on is end to end—starting with the people and continuing through the datacenter, server hardware, software stack, logging, encryption, and intrusion detection, and finally to the cloud platform itself.

Google’s Information Security Team develops security review processes, builds security infrastructure, and implements Google’s security processes. This team consists of top security experts and was responsible for discovering, and developing fixes for, vulnerabilities such as Heartbleed and the SSL 3.0 exploit. Google’s datacenters employ a layered physical security model with custom-designed safeguards, high-resolution cameras capable of tracking intruders, access logs, and routine patrols.

Server security is enhanced by using tens of thousands of identical, custom-built servers. This homogeneity, along with having built the entire stack including hardware, networking, and custom Linux software, reduces the security footprint and promotes agile responses to security threats. The servers themselves include a custom chip called Titan to provide verification of system firmware and software components, thus providing a strong, hardware-rooted system identity.

The security of customer information is protected through a variety of controls and practices. Every layer of the Google application and storage stack authenticates and verifies the authorization of requests coming from other components. Engineers’ access to production services and production environments is defined and controlled by a centralized group and role management system. The practices include using a security protocol that authenticates engineers through the use of short-lived personal public key certificates, the issuance of which is in turn guarded by two-factor authentication. Hard disks that are retired from Google’s systems are subjected to a data destruction process to safeguard customer information before leaving Google’s premises. The disks are wiped, checked, and tracked by multiple individuals before being released.

BigQuery, like other Google services, is managed through a secured global API gateway infrastructure that is accessible only over encrypted Secure Sockets Layer (SSL)/Transport Layer Security (TLS) channels. Every request must include a time-limited authentication token generated via human login or private key-based secrets in order to be serviced. All API requests are logged, and using GCP tools, a project administrator can read operations and access logs for BigQuery.

Any new data stored in persistent disks is encrypted under the 256-bit Advanced Encryption Standard (AES-256), and each encryption key is itself encrypted with a regularly rotated set of master keys. These are the same encryption and key management policies, cryptographic libraries, and root of trust used by many of Google’s production services, including Gmail. This sharing of infrastructure extends to network infrastructure. Google’s global network helps to improve the security of data in transit by limiting hops across the public internet. By using Cloud Interconnect and a managed Virtual Private Network (VPN), it is possible to create encrypted channels between an on-premises private IP environment and Google’s network.

BigQuery builds on these capabilities. However, it is still your responsibility to enforce appropriate access to data and analyze request logs. This includes preventing your end users from sharing critical information outside of your corporate network/public cloud infrastructure (i.e., data loss prevention) and ensuring that you keep safe any data that could identify a specific individual—that is, Personally Identifiable Information (PII). In the rest of this chapter, we discuss the tools that GCP and BigQuery provide to accomplish these goals.

Identity and Access Management

Cloud Identity and Access Management (IAM) allows users of BigQuery to manage access control by defining three things: identity, role, and resource. Essentially, we need to specify who (identity) has what access (role) to what resource.

Identity

The identity specifies who has access. This could be an end user who is identified by a Google account (such as a @gmail.com account or an @example.com account, where example.com is a G Suite domain) or an application identified by a service account. Service accounts are essentially GCP-assigned email addresses and can be created (for example, using the Cloud Console)1 to have a subset of the permissions held by the creator of the service account in that project. Typically, we create them to embody the (limited) set of permissions required by applications that are run on our behalf.
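
For example, service accounts and their role bindings can also be managed from the command line. The following is a minimal sketch, assuming a reporting application (the account name, display name, and role here are illustrative, not prescriptive):

PROJECT=$(gcloud config get-value project)
gcloud iam service-accounts create bq-reporting \
    --display-name "BigQuery reporting app"
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:bq-reporting@$PROJECT.iam.gserviceaccount.com \
    --role roles/bigquery.jobUser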

Members to whom access is granted can also include virtual groups of Google accounts such as Google groups, G Suite domains, or Cloud Identity2 domains. You should prefer providing access to Google groups over providing access to individuals because it is easier to add members to and remove members from a Google group instead of updating multiple Cloud IAM policies to onboard or remove users. Even if access control is provided to a virtual group, you don’t lose auditability: logging and auditing will resolve to the actual Google account or service account that is accessing BigQuery.

It is also possible to provide access to allAuthenticatedUsers (a special identifier for anyone who is authenticated with a Google account or service account).3 A common use is to publish a public dataset—the london_bicycles dataset that we used throughout this book was published in this manner. You should note that allAuthenticatedUsers allows any authenticated user, not just users in your domain, to have access.

Role

The role determines what access is allowed to the identity in question. A role consists of a set of permissions. It is possible to create a custom role to provide granular access to a custom list of permissions. However, most commonly, you will use predefined roles.

Predefined roles

Roles, such as the BigQuery dataViewer role, are predefined and consist of a combination of permissions that are frequently required together. For example, the dataViewer role provides, among others, the bigquery.datasets.get permission to get metadata about a dataset and the bigquery.tables.getData permission to read table data, but not the bigquery.datasets.delete permission, which would allow any identity holding it to delete the dataset.

As of this writing, there are eight predefined roles: four associated with access to datasets and the tables and views they contain, and four associated with read sessions, jobs, and project-wide administration. Loosely, in order of increasing capability, these are:

  1. metadataViewer (the fully qualified name is roles/bigquery.metadataViewer) provides metadata-only access to datasets, tables, and/or views.

  2. dataViewer provides permissions to read data as well as metadata.

  3. dataEditor provides the ability to read a dataset and list, create, update, read, and delete tables in that dataset.

  4. dataOwner adds the ability to also delete the dataset.

  5. readSessionUser provides access to the BigQuery Storage API sessions that are billed to a project.

  6. jobUser can run jobs (including queries) that are billed to the project.

  7. user can run jobs and create datasets whose storage is billed to the project.

  8. admin can manage all data within the project and cancel jobs by other users.

Treat these two sets of roles as being independent and orthogonal. Users granted the bigquery.readSessionUser role do not have access to table data—it might be that they need to read data from datasets belonging to a different project! To read data, you must also grant them bigquery.tables.getData permissions. Similarly, having the jobUser role does not grant the ability to create, modify, or delete tables (only to do a SELECT from them); you need to specifically assign the dataEditor role in addition to the jobUser role if you want the user to be able to run Data Definition Language (DDL)/Data Manipulation Language (DML) queries.

It is also quite conceivable that you might want to provide only the dataViewer role without providing the user role. This will be the situation when you want users to pay for their own queries (i.e., to create query jobs in their own projects, but be able to query datasets that belong to you). For the specific capabilities of each role, the full set of permissions, and which REST API methods require what permissions, refer to the BigQuery documentation.
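
For example, here is a sketch of that arrangement using a project-level binding (the group address is a placeholder, and [PROJECT_ID] is the project containing your datasets). Members of the group can read your data, but because they are not granted jobUser here, their query jobs must be created in, and billed to, their own projects:

gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member group:analysts@example.com \
    --role roles/bigquery.dataViewer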

Primitive roles

In addition to predefined roles and custom roles, BigQuery supports primitive roles, which date back to before GCP had support for Cloud IAM. Mostly, you will use the aforementioned predefined roles, but in some cases, it can be more convenient to assign users a role in the project (viewer, editor, or owner) and have the permissions for all BigQuery datasets and jobs in the project be inherited from this project role.

Identities that have viewing rights on the project get dataViewer on all datasets in the project as well as the ability to create jobs (i.e., run queries) that are billed to the project. Project editors get the dataEditor role in addition to project viewer privileges, and project owners get the dataOwner role in addition to project editor privileges. One exception is that only the user who runs a query has access to the cached results table (because of the implications of sharing access to the results of queries that join against datasets to which other project owners might not have access). To grant or revoke primitive roles for projects, use the GCP console.

The primitive roles that provide reading, writing, or ownership access to datasets translate neatly to dataViewer, dataEditor, or dataOwner roles, respectively. As such, they can be granted in all the ways that predefined roles can be granted, but the simplest way is to click the link to share a dataset from the BigQuery web UI.

Custom roles

If the predefined roles don’t meet your specific needs, consider creating a custom role, but note that needing to grant multiple roles (such as jobUser as well as dataViewer) to groups of people is not a good reason to create a custom role.

Tip

If you need to grant multiple roles to allow a particular task, create a Google group, grant the roles to that group, and then add users or other groups to that group. You might find it helpful to create Google groups for different job functions within your organization and give everyone in those groups a set of predefined roles. For example, all members of your data science team might be given the BigQuery dataViewer role on your data warehousing datasets and the jobUser role on the project. This way, if people change jobs, you need only update their membership in the appropriate groups instead of updating their access to datasets and projects one dataset or project at a time.

One reason to create a custom role is to subtract permissions from the predefined roles. For example, the predefined role dataEditor allows the possessor to create, modify, and delete tables. Suppose that you want to allow your data suppliers to create tables but not to modify or delete any existing tables. In that case, you could create a new role named dataSupplier and provide it with the specific list of permissions. You would start by creating a YAML file (called, for instance, dataSupplier.yaml) with the following contents:

title: "Data Supplier"
description: "Can create, but not delete tables"
stage: "ALPHA"
includedPermissions:
- bigquery.datasets.get
- bigquery.tables.list
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.tables.export
- bigquery.datasets.create
- bigquery.tables.create
- bigquery.tables.updateData

Then you would run the following gcloud command to create the custom role:

PROJECT=$(gcloud config get-value project)
gcloud iam roles create dataSupplier --project $PROJECT \
       --file dataSupplier.yaml

You can verify the permissions associated with this role by obtaining its current definition using:

gcloud iam roles describe dataSupplier --project $PROJECT

This works for predefined roles also.

It’s a good idea when creating a new role to set its stage to be ALPHA and try it out on a smaller set of users before upgrading the stage to BETA or GA. This way, you can fine-tune the set of permissions (start with the most restrictive set) before rolling it out widely.
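
For instance, once you are satisfied that the permission set of the dataSupplier role created earlier is correct, you could promote it (the stage value here is just one step of the progression described above):

gcloud iam roles update dataSupplier --project $PROJECT --stage BETA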

Resource

Access to resources is managed individually, resource by resource. An identity does not get the dataViewer role or the bigquery.tables.getData permission on all resources in BigQuery; rather, the permission is granted on specific datasets or tables.

Because the dataViewer role is provided on tables or datasets, it is not possible for someone with just the dataViewer role to obtain information about jobs, for example; jobs are a separate resource and require a different set of permissions. An identity can, of course, have both the role dataViewer and the role jobUser, which would give them the ability to create jobs (including running queries) and cancel self-created jobs in addition to being able to view table data.

As much as possible, avoid permission/role creep; err on the side of providing the least amount of privileges to identities. This includes restricting both the roles and the resources on which they are provided. Balance this against the burden of updating permissions on new resources as they are created. One reasonable compromise is to set trust boundaries that map projects to your organizational structure and set roles at the project level—IAM policies can then propagate down from projects to resources within the project, thus automatically applying to new datasets in the project.
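
As a sketch of what a resource-level grant looks like in practice (the dataset and group names are illustrative), you can edit a dataset’s access entries with bq rather than granting a project-wide role:

bq show --format=prettyjson ch10eu > dataset.json
# Edit dataset.json and add an entry such as
#   {"role": "READER", "groupByEmail": "analysts@example.com"}
# to the "access" array, then apply the change:
bq update --source dataset.json ch10eu

The READER, WRITER, and OWNER entries correspond to the dataViewer, dataEditor, and dataOwner roles discussed in the next section on primitive roles.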

Administering BigQuery

It is possible to administer BigQuery from the BigQuery web UI, using the REST API, or using the bq command-line tool. In this section, we assume that you have the admin role in BigQuery, either by virtue of being the BigQuery point person in your company or by being an admin on the project with the resources (either jobs or datasets) in question. Let’s look at common tasks that administrators might need to do, focusing on the bq command line.

Job Management

When a job has been submitted to BigQuery, it goes through three states in succession: PENDING, meaning that it is scheduled but not yet started; RUNNING, meaning that it has started; and DONE, meaning that it has finished, with the job’s final status indicating whether it succeeded or failed.

You can list all jobs created within the project in the past 24 hours by using the following:

NOW=$(date +%s)
START_TIME=$(echo "($NOW - 24*60*60)*1000" | bc)
bq --location=US ls -j -all --min_creation_time $START_TIME

bq requires a Unix timestamp in milliseconds, so we obtain the min_creation_time by subtracting one day (24*60*60 seconds) from the current timestamp and converting it into milliseconds using the command-line calculator tool, bc.
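
If you want to examine the state and statistics of one of the listed jobs before acting on it, you can display its details (the job ID here is illustrative):

bq --location=US show --format=prettyjson -j bquxjob_180ae24c_16b04a8d28d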

As soon as you know the job ID, it is possible to cancel4 a running job:

bq --location=US cancel bquxjob_180ae24c_16b04a8d28d

Note that you will sometimes have a fully qualified job ID that includes both the project name and the location,5 for example, from logs (or from the BigQuery web UI). In that case, you can omit the location flag from the cancel call:

bq cancel someproject:US.bquxjob_180ae24c_16b04a8d28d

Anyone with the jobUser or user role has the ability to run and cancel their own jobs; they don’t need admin access unless it is to list or cancel jobs started by other users.

Authorizing Users

We recommend creating Google groups and adding members to those groups instead of granting individual users access to BigQuery resources. If you follow this recommendation, you will be authorizing access to resources only for Google groups, and you can do this from the Cloud Console. There are several convenient ways to add multiple users to, and remove them from, a Google group. For details, see the G Suite help page.

On a one-off basis, use the Cloud Console IAM page to provide individual users, service accounts, or Google groups permissions to BigQuery. To share specific resources, in the BigQuery web UI, select the dataset and then click “Share dataset.”

Restoring Deleted Records and Tables

If a user has messed up the contents of a table by loading in duplicate data or by deleting necessary records, it is possible to recover as long as it is within seven days. Deleted tables (as opposed to deleted records within existing tables) can be recovered for up to two days only.

To recover the state of a table as it existed 24 hours earlier, for example, you can use the FOR SYSTEM_TIME AS OF clause in a DDL statement:6

CREATE OR REPLACE TABLE ch10eu.restored_cycle_stations AS
SELECT 
  * 
FROM `bigquery-public-data`.london_bicycles.cycle_stations
FOR SYSTEM_TIME AS OF 
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)

For up to two days, you can also recover a deleted table. For example, let’s delete the table you created a moment ago:

bq rm ch10eu.restored_cycle_stations

Now recover it from a snapshot as of 120 seconds ago:

NOW=$(date +%s)
SNAPSHOT=$(echo "($NOW - 120)*1000" | bc) 
bq --location=EU cp \
   ch10eu.restored_cycle_stations@$SNAPSHOT \
   ch10eu.restored_table

Note

You can recover a deleted table only if another table with the same ID in the dataset has not been created in the meantime.7 In particular, this means that you cannot recover a deleted table if it is receiving streamed data and the create-disposition is to create the table if it doesn’t exist. Chances are that the streaming pipeline would have created an empty table and started to push rows into it. This is also why you should be careful about using CREATE OR REPLACE TABLE: it makes the table irrecoverable.

Continuous Integration/Continuous Deployment

It might be important to have SQL queries under version control so as to be able to obtain the version of a script as of a certain time and track changes in the script over time. If this is the case, consider using Cloud Source Repositories and Cloud Functions (or Cloud Run if you have more complex dependencies) to execute the queries.

Invoking BigQuery from a Cloud Function

In your Cloud Source Repository, create an .sql file containing the BigQuery SQL query and a Python file implementing the Cloud Function and place both under version control. The Cloud Function could then use the BigQuery client library to submit the query to BigQuery and export results to Google Cloud Storage as long as the query time is less than the timeout of the Cloud Function.8

You can create a Cloud Function from the GCP Cloud Console. In the text window, type this code:

from google.cloud import bigquery

# Entry point; assumes the function is deployed with an HTTP trigger
def query_to_gcs(request):
    client = bigquery.Client()

    # Run query and wait for it to complete
    query_job = client.query("""
        ...
        """)
    query_job.result()

    # Extract to GCS, and wait for it to complete
    extract_job = client.extract_table(
        query_job.destination, "gs://bucket/file.csv")
    extract_job.result()

Now, instead of scheduling the query, you will schedule the Cloud Function using Cloud Scheduler.

Tip

Note that the preceding code is invoking extract_table on the temporary table created as a result of executing the query. This is a quick way to export the result of a query into a comma-separated values (CSV) file.
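
As a sketch of the scheduling step, assuming that the Cloud Function is deployed with an HTTP trigger (the job name, region, project, and schedule below are placeholders), you could create a Cloud Scheduler job that invokes the function every morning:

gcloud scheduler jobs create http daily_query_to_gcs \
    --schedule="0 5 * * *" \
    --uri="https://us-central1-[PROJECT_ID].cloudfunctions.net/query_to_gcs" \
    --http-method=GET

If the function does not allow unauthenticated invocations, you would also configure the scheduler job to send credentials (for example, an OIDC token for a service account that is allowed to invoke the function).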

Putting table, view, and function creation under version control

Having version control and repeatability is important not just for queries but also for tables, views, models, stored procedures, and functions. It is preferable, therefore, to put all creation code into a script that you can invoke every time you want to re-create the table, view, model, or function in question.

To create a table from a query result, you can use the BigQuery client library and set the job destination to be the desired table:

from google.cloud import bigquery
client = bigquery.Client()
sql = """
WITH stations AS (
   SELECT [300, 314, 287] AS closed
)
SELECT
  station_id
  , (SELECT name FROM `bigquery-public-data`.london_bicycles.cycle_stations WHERE
id=station_id) AS name
FROM
  stations, UNNEST(closed) AS station_id
"""
job_config = bigquery.QueryJobConfig()
job_config.destination = (
client.dataset('ch10eu').table('stations_under_construction'))
query_job = client.query(sql, location='EU', job_config=job_config)
query_job.result() # Waits for the query to finish

Here’s the equivalent using the DDL approach:

CREATE OR REPLACE TABLE -- or VIEW/MODEL/FUNCTION
ch10eu.stations_under_construction
(
  station_id INT64 OPTIONS(description = 'Station ID'),
  name string OPTIONS(description = 'Official station name')
)
OPTIONS(
    description = 'Stations in London.',
    labels=[("pii", "none")] -- Must be lowercase.
)
AS
 
WITH stations AS (
   SELECT [300, 314, 287] AS closed
)
 
SELECT
  station_id
  , (SELECT name FROM `bigquery-public-data`.london_bicycles.cycle_stations WHERE
id=station_id) AS name
FROM
  stations, UNNEST(closed) AS station_id

Note in this example how the table and column descriptions are stored directly within the CREATE statements in version control. If you already have existing BigQuery tables, you can query the list of tables and columns in your metadata and programmatically create those SQL DDL statements through functions.
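
For example, here is a minimal sketch of pulling that metadata from the INFORMATION_SCHEMA views (the dataset name is simply the one used above); you could feed the result into a script that emits the corresponding CREATE statements:

bq query --nouse_legacy_sql \
'SELECT table_name, column_name, data_type
 FROM ch10eu.INFORMATION_SCHEMA.COLUMNS
 ORDER BY table_name, ordinal_position'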

Note

Be careful about using scheduled Cloud Functions that create or replace tables: if a Cloud Function is scheduled to rerun the previous statement and REPLACE the table, any changes made to the table in the meantime (including updated rows and schema descriptions) will be overwritten.

Cost/Billing Exports

It is possible to export your daily usage of GCP services as well as cost estimates automatically throughout the day to a BigQuery dataset. Watch out, though: this billing export will include your usage of BigQuery as well!

To enable billing exports, in the GCP Cloud Console, start at the Billing section, select your billing account and project, select the BigQuery dataset9 to which to export the billing data, and then enable BigQuery export.

The billing data is loaded into BigQuery at regular intervals, so it could be a few hours before you see anything. The frequency of updates in BigQuery varies depending on the GCP services you’re using.

As with any BigQuery table, you can examine the schema of the exported billing data in BigQuery and figure out what queries you can run and what dashboards you can populate. Here are a few places to get started.

Costs by month by product

To get the monthly invoice amount by product, use this query:

SELECT
  invoice.month
  , product
  , ROUND(SUM(cost)
          + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c),
                0))
          , 2) AS monthly_cost
FROM ch10eu.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX
GROUP BY 1, 2
ORDER BY 1 ASC, 2 ASC

The monthly cost is the sum of the costs for that product corrected by the sum of the credits.

Visualizing the billing report

As illustrated in Figure 10-1, a starter Data Studio dashboard for visualizing the billing data is available.

For details on how to make a copy and start working in Data Studio, see https://cloud.google.com/billing/docs/how-to/visualize-data.

Figure 10-1. Example Data Studio dashboard of the billing export data in BigQuery

Labels

Although obtaining a cost breakdown by product is useful, what you often want is a cost breakdown by cost center within your organization. To enable this level of detail in your billing reports, you need to apply labels (which are key/value pairs) to your GCP resources. Then each row in the billing export will contain values for two columns, labels.key and labels.value, that correspond to the label applied to the GCP resource usage for which you’re being billed.

If the labels are based on team or cost center, the key could be team and the value could be marketing or research. Labels could also be based on environment (e.g., key=environment, value=production, or value=test), application, or component.

You can assign labels to GCP resources like Compute Engine virtual machines (VMs), Dataproc clusters, or Dataflow jobs. Of course, you can also assign labels to BigQuery datasets, tables, models, and even query jobs.

Here’s how to apply the label “environment:learning” to the dataset ch10eu:

bq update --set_label environment:learning ch10eu

You can also apply labels to tables and views in a similar way, but (as of this writing) table/view labels do not show up in the billing data:

bq update --set_label environment:learning ch10eu.restored_table

It is also possible to assign a label when you submit a job through the command line:

bq query --label environment:learning --nouse_legacy_sql 'SELECT 17'

When submitting through the REST API, populate the labels property for the job resource.

Then the billing export will reflect query costs, and you can aggregate costs by label for the purpose of apportioning costs between environments, or cost centers, or any other label key:

SELECT
  invoice.month
  , label.value
  , ROUND(SUM(cost)
          + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c),
                0))
          , 2) AS monthly_cost
FROM 
  ch10eu.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX
  , UNNEST(labels) AS label
WHERE
  label.key = 'environment'
GROUP BY 1, 2
ORDER BY 1 ASC, 2 ASC

Dashboards, Monitoring, and Audit Logging

A key aspect of security is being able to verify that your security measures are effective, so observability of all the deployed resources is very important.

Cloud Security Command Center

The Cloud Security Command Center (SCC) provides a comprehensive security management and data-risk platform for GCP. By providing visibility into what assets you have and what security state they are in, Cloud SCC makes it easier to prevent, detect, and respond to threats. There are built-in threat detectors that can alert you to suspicious activity.

You can access Cloud SCC from the GCP Cloud Console Security Command Center Marketplace page and launch asset discovery. After your projects have been scanned, you can use the Cloud SCC dashboard to look for common problems like an open port 22 (for Secure Shell [SSH]). After this, asset discovery runs at least once per day.

Stackdriver monitoring and audit logging

You can use Stackdriver to monitor BigQuery resources. These include visualizations of metrics such as overall query times, the number of slots available, and more. BigQuery also automatically sends audit logs to Stackdriver Logging. Stackdriver Logging allows users to filter and export messages to other services, including Cloud Pub/Sub, Cloud Storage, and BigQuery.
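
Here is a hedged sketch of creating such an export, or log sink, from the command line (the sink name and destination dataset are illustrative; the dataset must already exist, and the sink’s writer identity must be granted permission to write to it):

gcloud logging sinks create bq_audit_sink \
    bigquery.googleapis.com/projects/[PROJECT_ID]/datasets/ch10 \
    --log-filter='resource.type="bigquery_resource"'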

In addition to providing long-term log retention, exporting logs to BigQuery makes it possible to do aggregated analysis on log data. Here is a query that estimates costs (before any discounts are applied) by user identity:

WITH data as
  (
   SELECT
        protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
        protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS
jobCompletedEvent
     FROM
       `ch10.cloudaudit_googleapis_com_data_access_2019*`
  )
  SELECT
    principalEmail,
    -- assuming on-demand pricing of $5 per TB scanned
    5.0 * SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40)
      AS Estimated_USD_Cost
  FROM
    data
  WHERE
    jobCompletedEvent.eventName = 'query_job_completed'
  GROUP BY principalEmail
  ORDER BY Estimated_USD_Cost DESC

BigQuery audit logs are reported as the protoPayload.metadata within a LogEntry message. They are organized into three streams: admin activity, system events, and data access. The admin activity includes events such as job insertions and completions. System events are events such as the TableDeletion event logged when a table or partition expires and is removed. The data access stream contains information about new jobs, jobs that changed state, table data changes, and table data reading.

Availability, Disaster Recovery, and Encryption

BigQuery architecture contributes significantly to its reliability. For instance, the serverless aspect of the service means that virtually any hardware component can fail and it will have little to no impact on BigQuery’s ability to run queries. Unlike many systems that are tied to particular virtual machines (VMs) or nodes, BigQuery runs in a giant shared pool of servers and can redirect traffic nearly instantaneously from one location to another.

Zones, Regions, and Multiregions

In GCP, there are three different types of service locations. Zones are compute clusters, generally located within a single building. Zones have fairly high availability, but if there is a major hardware failure (there’s a fire, or a transformer gets fried, for example), the zone can go offline. Sometimes services have problems within a single zone that aren’t related to hardware. A high rate of requests could cause some service to crash, and that will cause problems with dependent services. Resilient services, such as BigQuery, are designed to withstand any zonal problem seamlessly.

Regions, on the other hand, are metro-wide locations that consist of multiple zones with uncorrelated failures. Regions tend to be spread across multiple buildings within a large campus. In general, it is very rare for an entire region to be taken offline. Natural disasters can, however, cause regions to go offline. Predictable natural disasters, like hurricanes, can generally be preceded by an ordered shutdown of the region so that no data is lost. Unpredictable disasters, like earthquakes, could cause services to lose data in a region as well as prevent the region from starting up again when the disaster has passed.

Multiregions are the most resilient; they generally imply some flexibility in their location, being spread over multiple datacenters separated by hundreds of miles. For example, the EU multiregion consists only of datacenters that are physically within the European Union, but it does not tell you which ones. Is your data in the Frankfurt datacenter? In the Finland one? Some services, like Google Cloud Storage, are explicit about which regions comprise their multiregions. Others, like BigQuery, give only vague guidelines as to where multiregional data is located in order to preserve flexibility in both compute and data placement.

BigQuery and Failure Handling

One way of predicting how well a service will handle a particular type of failure is to note how often the failure happens; if failures happen very infrequently, the service is usually untested and liable to have bugs or other problems. Services at Google are designed to be able to handle virtually any type of hardware and even software failure and keep running. Not only are they designed this way, but they are tested rigorously to make sure the failure paths are exercised routinely. Let’s discuss some different types of failures and how BigQuery responds.

Disk failures

Spinning disks have moving parts, and like many things with moving parts, they fail pretty frequently. Because of the way data is encoded in Colossus (the storage infrastructure that underlies BigQuery; see Chapter 6) using erasure encoding, lots of disks can fail without any loss of data. If you have 100,000 disks, you can expect that dozens will fail every day. If you consider that routine maintenance can cause some disks to be unavailable, the number is probably in the hundreds. If there is a power outage, when disks start up again you can lose a lot more. Even with all of these factors, losing data due to hardware failure in BigQuery would be extremely rare.10

When a disk fails, Colossus detects it and replicates the data to another disk. Google Datacenter Hardware Operations staff will remove the disk and securely destroy it. Software and services using that disk will not notice any disruption, other than certain requests perhaps taking a few milliseconds longer to complete.

Machine failures

Despite all attempts to keep servers up and running for long periods of time, they crash. Operating systems have bugs, cosmic rays corrupt memory, CPUs die, power supplies fail, software has memory leaks, and lots of other things can go wrong. The common approach in server software is to add redundancy and hardening to prevent these types of failures. Expensive servers have backup power supplies and hot-swappable memory and CPUs, so the server never needs to go down.

Google, as a philosophy, takes a different approach. Google datacenters are set up so that they expect any machine to die at any time. Software must be written in a way that handles this. In order to allow any machine to die at any time, Google invented a number of scale-out distributed systems, like MapReduce, Google File System (GFS, the precursor to Colossus), and Dremel.

BigQuery servers crash all the time. When you have hundreds of thousands or millions of independent workers running, some of those are going to hit problems. Virtually any BigQuery server can crash at any time and users will not see more than a small hiccup in their queries. The cluster management software that runs Google datacenters, Borg, will restart any tasks that don’t respond to a health check within a few seconds, sometimes on a different machine. For the most part, even the software running in the query engine doesn’t notice these problems; it just retries on a different task and keeps going.

Even larger problems, like rack failures or network switch failures, are handled transparently. Because of the scale of the datacenter clusters, if a rack or switch fails, it will affect only a few of the tasks running the service, so the service can route around the problem. Nobody gets paged, except perhaps the hardware operations personnel in the datacenter who will need to address the issue directly. But the site reliability engineering (SRE) staff sleep soundly.

Zonal failures

So what happens when there are larger failures in which the built-in self-healing can no longer automatically handle the problem? To be clear, these failures are pretty rare. Zones are designed for high resilience to hardware and network failures. Anything that can cause the zone to go offline, like network switches or transformers, usually has a redundant backup. But there have been cases in which someone cuts the wrong fiber cable with a backhoe, or transformers catch fire, and it causes a zone-wide outage.

Some services, like Google Compute Engine, are, as of this writing, tied to a single availability zone. If that zone goes down, the VM instances in that zone go down. BigQuery, on the other hand, is designed to be able to handle almost all zonal failures. In BigQuery, all Cloud Projects have a primary and a secondary location. If there is an outage in the primary location, BigQuery will seamlessly fail over to the secondary location.

There are two types of zonal failures: soft and hard. Soft failures mean that there are problems in the zone but things are still progressing, perhaps with degraded capacity. Soft failures are often the result of a software failure, rather than a hardware issue. Perhaps quota servers are failing, or Bigtable is stuck, or the BigQuery scheduler is taking too long to schedule. Hard failures mean that the zone is down. Maybe it had a power failure or some sort of unrecoverable hardware issue.

BigQuery reacts to soft failures by proactively draining the zone. A drain means that new queries are sent somewhere else, but existing queries are allowed to continue. New requests are routed to the secondary zone. Queries that are in progress might be allowed to continue; if, however, the outage is severe enough, they will be restarted in the new secondary zone.

Soft failures happen fairly frequently. BigQuery operates in dozens of availability zones around the world, and the odds of some service behaving badly in some zone somewhere are pretty high. The good news is that this allows BigQuery to exercise its failover code, and users almost never actually notice any hiccup.

Hard failures are much rarer; they generally mean that the entire zone is undergoing severe problems. In the event of a hardware zonal failure, users might notice disruption; existing queries will be cancelled and restarted in the new zone, for example, so they might take up to twice as long to complete. And particularly bad zonal failures could mean that recent data has not been replicated to the secondary zone, and will be unavailable until the zone is brought back online. In the event this happens, queries to the affected tables will fail. BigQuery would rather fail queries than return inconsistent data.

Regional failures

The next level of failure type is a failure in which an entire region goes offline. This is much rarer than a zonal outage. Like zones, regions can have soft failures in which the entire region needs to be neatly shut down. Regions that are in the path of a hurricane might be shut down cleanly before the hurricane hits. By going into shutdown before the arrival of the storm, Google can minimize the chance of data loss when the region is restarted, or in the rare event that the region is damaged. A power outage shouldn’t cause a region to go down because there are backup systems in place, but if the power outage lasts too long, the backup systems can run out of capacity, and so an orderly shutdown of the region would be initiated.

Regions can have hard failures, too, but these are even more rare. Like many “black swan”–type events,11 it is difficult to estimate how rare they are in practice. A catastrophic earthquake could take down a region with little or no warning. Other extreme weather events or natural disasters could also occur unexpectedly. A hard region failure could cause damage to hardware and could mean that data not replicated offsite could be lost.

As of this writing, single-region BigQuery locations (asia-east1, or europe-north1, for example) do not store a physical copy in another site. In general, the reason is that there might not be a place to store a backup copy in another location without violating customers’ requirements for their data. For example, Singapore is an island only about 30 miles across; if a customer requires that their data is not stored outside of Singapore for regulatory purposes, there isn’t much opportunity to store the data elsewhere. However, before making any durability assumptions about your data, you should check the up-to-date documentation provided by Google.

Multiregional BigQuery locations, like the US and the EU, store a backup copy offsite in another region. In the event of a catastrophic failure of a region, the data would be safe. However, it might take some time before that backup becomes available.

Durability, Backups, and Disaster Recovery

To summarize the replication story for BigQuery:

  • Multiregional data is replicated to at least two regions (single-region data is only in one).

  • All data is replicated to two availability zones.

  • Within an availability zone, data is encoded using erasure encoding.

The offsite backups are also protected by a secondary mechanism to prevent accidental deletion; they use a feature of disk firmware that prevents deletion until a certain period of time has passed. This means that if there were a code bug in BigQuery that overzealously deleted data, low-level firmware systems on the disk would prevent the data from being physically deleted immediately.

If data is accidentally deleted by a customer, BigQuery’s time-travel feature can come in handy. Users can query the table as of a time before the data was deleted, using the SYSTEM_TIME AS OF syntax (see Chapter 8). Moreover, you can copy the table as of a particular time by using tablename@timestamp in the copy job.

This technique of copying an old snapshot of the data is useful if you need to undelete a table. To undelete, copy the old table, addressed with a snapshot decorator at a timestamp when the table still existed, to a new destination table. Note that if you delete a table and then re-create one with the same name, the deleted table becomes unrecoverable, so you should be careful. As of this writing, time travel for deleted tables is available for only 48 hours after the table was deleted, which is shorter than the normal seven-day time-travel period.

Privacy and Encryption

Google takes security and privacy very seriously. All data in BigQuery is encrypted at rest and encrypted when it is transferred over the network. Encryption on disk is done transparently through Colossus file encryption. Streaming data is encrypted in Bigtable or in log files. Metadata is encrypted in Spanner. Network traffic is encrypted transparently through the use of Google’s internal Remote Procedure Call (RPC) protocols. Someone with physical disk access or with a network tap wouldn’t be able to access data in the clear.

Access transparency

Google takes a number of steps to safeguard access to the data. To ensure the safety and reliable running of the system, only a small number of on-call engineers can get access to user data. The Access Transparency program in GCP means that whenever someone at Google accesses your data, you are notified through audit log records. It is generally as simple as that; if someone reads your data, you can find out about it.

Virtual Private Cloud Service Controls

Virtual Private Cloud Service Controls (VPC-SC) is a mouthful to pronounce, but it is a mechanism that gives you fine-grained control over how services can be accessed and where data can flow within GCP. For example, you can limit BigQuery access to a narrow range of IP addresses from your company’s network. Or you can ensure control over how data flows between services by preventing data export from BigQuery to Google Cloud Storage. Alternatively, you might allow export to Google Cloud Storage, but only to Cloud Storage buckets that are owned by your organization.

VPC-SC is not merely a BigQuery feature; it is a feature that works across a number of GCP products. This lets you create one overall policy describing data exfiltration and movement rules. You might decide that you don’t want people to access BigQuery at all (that seems like a shame, though). For more information on VPC Service Controls, check out the Google Cloud documentation.

Customer-Managed Encryption Keys

All data is encrypted at rest in BigQuery, but what if you want to make sure that your data is encrypted with your own keys? In these cases, you can use Customer-Managed Encryption Keys (CMEK) in BigQuery. You can manage your keys in Cloud KMS, GCP’s central key management service. You can then designate datasets or tables that you want to be encrypted using those keys.

BigQuery uses multiple layers of key wrapping; that is, the master keys aren’t exposed outside of KMS. Every CMEK-protected table has a wrapped key as part of the table metadata. When BigQuery accesses the table, it sends a request to Cloud KMS to unwrap the key. The unwrapped table key is then used to unwrap separate keys for each file. There are a number of advantages to this key-wrapping protocol that reduce the risk should an unwrapped key be leaked. If you have an unwrapped file key, you can’t read any other files. If you have an unwrapped table key, you can only unwrap file keys after you pass access control checks. And Cloud KMS never discloses the master key. If you delete the key from KMS, the other keys can never be unwrapped (so be careful with your keys!).

Regulatory Compliance

Most organizations are subject to government regulations of one form or another, and your organization probably defines compliance requirements for software and analysis teams so as to follow those regulations and stay on the right side of the law. In this section, we look at BigQuery features that can help you provide support for such regulatory compliance. But remember: it is always your responsibility to work with your legal counsel to determine whether implementing any of these tools and capabilities will satisfy your regulatory or compliance requirements.

Data Locality

Many governments around the world regulate where data can be stored, and BigQuery enforces that queries on any dataset are run only in a datacenter where that dataset is available. Hence, controlling data locality is done at the time a dataset is created. For example, here’s how to create a dataset in the asia-east2 region (located in Hong Kong):

bq --location=asia-east2 mk --dataset ch10hk

Two types of locations are supported by BigQuery: regional and multiregional. The Hong Kong region is an example of a regional location and represents a specific geographic place. The other type of location is a multiregional location (like the US or the EU12) that contains two or more regional locations. For an up-to-date list of supported locations, refer to the BigQuery documentation.

As explained in Chapter 6, BigQuery determines the location to run a job based on the project defaults, reservations, and datasets referenced in the request. It is also possible to explicitly specify the location in which to run a job, whether the job is submitted through the BigQuery web UI (set Processing Location in Query Settings), the REST API (specify the location property in the jobReference section), or the bq command-line tool (specify --location). If a query cannot be run in the location specified (such as if the location is specified as US but the data is in the EU), BigQuery returns an error.

Moving data directly between regions is not currently possible other than through the BigQuery Data Transfer Service, with one exception: you can move data from a US multiregional Cloud Storage bucket to a BigQuery dataset in any region or multiregional location. If you are loading data into BigQuery from a regional Cloud Storage bucket, the bucket must be colocated with the BigQuery dataset (for example, both need to be in asia-east2) unless the bucket in question is in the US multiregion.

If you cannot use the BigQuery Data Transfer Service, moving BigQuery data between locations will involve a few hops: export the data from BigQuery to Google Cloud Storage in the same region, transfer the data to Cloud Storage in the target region, and load it into a BigQuery dataset. Note that you will incur extra storage costs for the time period that you have data in Cloud Storage, and network egress charges for transferring data between regions in Cloud Storage.
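
Here is a minimal sketch of those hops, using the ch10eu.restored_table and ch10hk dataset created earlier in this chapter (the bucket names are illustrative, and each bucket must reside in the corresponding region):

# Export from the EU dataset to a bucket in the EU
bq extract --destination_format=AVRO \
    ch10eu.restored_table 'gs://example-eu-bucket/restored-*.avro'
# Copy the files to a bucket located in asia-east2
gsutil -m cp 'gs://example-eu-bucket/restored-*.avro' gs://example-asia-bucket/
# Load into the dataset located in asia-east2
bq --location=asia-east2 load --source_format=AVRO \
    ch10hk.restored_table 'gs://example-asia-bucket/restored-*.avro'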

Restricting Access to Subsets of Data

To restrict access to an entire dataset, you can use IAM. But in many cases, tables might contain sensitive data and what you want is to restrict access to parts of a table. You can do that with authorized views or dynamic filters, or through fine-grained access control.

Authorized views

An authorized view allows you to use an SQL query to restrict the columns or rows that the users are able to query. For example, suppose that you have a set of users who are allowed to view only a specific subset of columns and rows from our london_bicycles dataset. You could do this by sharing with them, not the original dataset, but the dataset ch10eu, which contains this view:

CREATE OR REPLACE VIEW ch10eu.authorized_view_300 AS 
SELECT 
  * EXCEPT (bike_id, end_station_priority_id)
FROM 
  [PROJECTID].ch07eu.cycle_hire_clustered
WHERE
  start_station_id = 300 OR end_station_id = 300

Now users granted access to this view will not be able to access the bike_id column or information from any stations other than the one whose ID is 300. Grant users access to this view by sharing the dataset that it is part of. From the BigQuery web UI, select the target dataset (ch10eu) and click Share Dataset, and then share it with the desired user or Google group. To try it out, share this dataset with a second Google Account you have access to with the BigQuery User role, as demonstrated in Figure 10-2.

Figure 10-2. Providing access to the dataset (ch10eu) containing the authorized view

However, the authorized view itself needs to be able to access the original dataset. You can do that by selecting the source dataset (ch07eu) in the BigQuery web UI, clicking Share Dataset, and then, in the Dataset permissions panel, choosing the authorized view to allow, as shown in Figure 10-3.

Figure 10-3. Providing the authorized view in ch10eu access to the source dataset in ch07eu

Now when you visit the following URL (replace [PROJECT] by your project name):

https://console.cloud.google.com/bigquery?p=[PROJECT]&d=ch10eu&page=dataset

in a window where you are logged in as the other Google account, you will be able to view the dataset ch10eu, but not the dataset ch07eu. You will also be able to query the view:

SELECT AVG(duration)
FROM [PROJECT].ch10eu.authorized_view_300

Note that this average duration will be over trips that started or ended at station ID 300 because of the way that the view is constructed.

Note

Authorized views are simple only as long as you don’t have layered views (views calling views calling views...). When you have layered views, the SQL effectively contains references to many tables, and those tables will often reside in different datasets. This can become challenging to administer because the Access Control Lists (ACLs) now need to be chained across all the views/datasets.

Dynamic filtering based on user

In the previous section, we built a view that filters the full dataset to show a subset of columns and rows to anyone who has access to that view. But what if you want to filter the rows in the table based on who the logged-in user is? To accomplish that, use the built-in function SESSION_USER, as shown in the example that follows.

To illustrate, suppose that you want to create a view to flag the top 10 transactions over $1,000 for manual review, but you want to restrict it to the transactions from the same company as the viewer:13

CREATE OR REPLACE VIEW ecommerce.vw_large_transactions
OPTIONS(
  description="large transactions for review",
  labels=[('org_unit','loss_prevention')],
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
AS
 
SELECT 
  visitorId,
  REGEXP_EXTRACT(SESSION_USER(), r'@(.+)') AS user_domain,
  REGEXP_EXTRACT(visitorEmailAddress, r'@(.+)') AS customer_domain,
  date,
  totals.transactions, 
  totals.transactionRevenue,
  totals.totalTransactionRevenue,
  totals.timeOnScreen
  
FROM `bigquery-public-data`.google_analytics_sample.ga_sessions_20170801 
  
WHERE
  (totals.totalTransactionRevenue / 1000000) > 1000 
  AND REGEXP_EXTRACT(visitorEmailAddress, r'@(.+)') = 
      REGEXP_EXTRACT(SESSION_USER(), r'@(.+)')
ORDER BY totals.totalTransactionRevenue DESC
LIMIT 10

Notice that the view filters the transactions to reflect only transactions from the same domain as the viewer, so that someone from @example.com can review large transactions from @example.com but not those from @acme.com.

Data Loss Prevention

In many cases, you might not even know where sensitive data exists. It can therefore be helpful to scan BigQuery tables looking for known patterns such as credit card numbers, company confidential project codes, and medical information. The result of a scan can be used as a first step to ensure that such sensitive data is properly secured and managed, thus reducing the risk of exposing sensitive details. It can also be important to carry out such scans periodically to keep up with growth in data and changes in use.

You can use Cloud Data Loss Prevention (Cloud DLP)15 to scan your BigQuery tables and to protect your sensitive data. Cloud DLP is a fully managed service that uses more than 90 built-in information type detectors to identify patterns, formats, and checksums. It also provides the ability to define custom information type detectors using dictionaries, regular expressions, and contextual elements. Cloud DLP includes a set of tools to de-identify your data, including masking, tokenization, pseudonymization, date shifting, and more, all without replicating customer data. Besides its use in BigQuery, Cloud DLP can also be used on streams of data and files in Google Cloud Storage, and within images. Finally, it can be used to analyze structured data to help understand the risk of reidentification,16 including computation of metrics like k-anonymity.

To scan a BigQuery table, select it in the Cloud Console and then choose Export > Scan with DLP, configuring it to look for specific forms of data, as depicted in Figure 10-4.

Figure 10-4. Scanning a BigQuery table using Cloud DLP

To redact or otherwise de-identify sensitive data that the Cloud DLP scan found, protect the data with Cloud KMS keys, which we discuss in the next section.

CMEK

BigQuery employs envelope encryption to encrypt table data without any additional actions on your part. In envelope encryption, the data in a BigQuery table is first encrypted using a data encryption key (DEK), and the DEKs are in turn encrypted by key encryption keys (KEKs). For each GCP customer, any nonshared resources are split into data chunks and encrypted with keys separate from keys used for other customers. These DEKs are even separate from those that protect other pieces of the same data owned by that same customer. The KEKs are managed centrally in the Google Key Management Service (KMS), as shown in Figure 10-5.

Figure 10-5. Envelope encryption with DEKs and KEKs; the KEKs are managed centrally in a KMS, which rotates keys through the use of a key ring

Encryption helps to ensure that if the data accidentally falls into an attacker’s hands, they cannot access the data without also having access to the encryption keys. Even if an attacker obtains the storage devices containing your data, they won’t be able to understand or decrypt it. Encryption also acts as a “chokepoint”—centrally managed encryption keys create a single place where access to data is enforced and can be audited. Finally, encryption contributes to the privacy of customer data; it allows systems to manipulate data—for backup, for example—and engineers to support the infrastructure without providing access to content.

If regulations require that you control the keys used to encrypt your data, you may find CMEK a useful tool. Recall that key encryption keys are used to encrypt the DEKs, which in turn encrypt the data chunks, and that the KEKs are stored and managed centrally. You’ll typically want to do the same if you are managing the key encryption keys yourself; you’ll run Cloud KMS in a central project and use those keys to encrypt table data in all your organization’s projects.

In Cloud KMS, a key belongs to a key ring, which resides in a particular location. In the central project where you are running Cloud KMS, create a key ring and a key, specifying a rotation period of the keys:

gcloud kms keyrings create acmecorp --location US
gcloud kms keys create xyz --location US \
  --keyring acmecorp --purpose encryption \
  --rotation-period 30d \
  --next-rotation-time 2019-07-01T12:00:00Z

The key ring should be created in a location that matches the location of your BigQuery datasets. For example, a dataset in region US should be protected with a key ring from region US, and a dataset in asia-northeast1 should be protected with a key ring from asia-northeast1.
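
If you are unsure where a dataset resides, you can check its location before choosing the key ring; for example, assuming a dataset named mydataset in your default project:

bq show --format=prettyjson mydataset | grep location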

The Cloud KMS keys are used as key encryption keys in BigQuery; that is, they encrypt the DEKs that encrypt your data. So, having created the key, you need to allow the BigQuery service account of every project that holds CMEK-protected data (not the KMS project itself) to use the key to encrypt and decrypt data:

SVC=$(bq show --encryption_service_account)
gcloud kms keys add-iam-policy-binding \
  --project=[KMS_PROJECT_ID] \
  --member serviceAccount:$SVC \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter \
  --location=US \
  --keyring=acmecorp \
  xyz
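
To confirm that the binding took effect, you can inspect the key's IAM policy and verify that the BigQuery service account appears with the Encrypter/Decrypter role:

gcloud kms keys get-iam-policy xyz \
  --project=[KMS_PROJECT_ID] \
  --location=US \
  --keyring=acmecorp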

When creating tables, specify the key to be used:

bq mk … --destination_kms_key \
projects/[PROJECT_ID]/locations/US/keyRings/acmecorp/cryptoKeys/xyz \
mydataset.transactions

Beyond this configuration of tables at creation time, no special arrangements are required to query a table protected by Cloud KMS. BigQuery stores the name of the key used to encrypt the table content and will use that key when a table protected by Cloud KMS is queried. All existing tools, the BigQuery web UI, and the bq command-line interface run the same way as with default-encrypted tables, as long as BigQuery has access to the Cloud KMS key used to encrypt the table content.
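
If you have an existing table that is protected only by default encryption, one way to bring it under a Cloud KMS key (a sketch based on the copy approach; verify against the current bq documentation) is to copy the table onto itself while specifying the destination key:

bq cp -f --destination_kms_key \
projects/[PROJECT_ID]/locations/US/keyRings/acmecorp/cryptoKeys/xyz \
mydataset.transactions mydataset.transactions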

Data Exfiltration Protection

Virtual Private Cloud (VPC) Service Controls allow users to define a security perimeter around Google Cloud Platform resources such as Cloud Storage buckets and BigQuery datasets, helping to mitigate data exfiltration risks by constraining data to stay within the perimeter. Combined with Private Google Access, you can set up a hybrid environment of cloud and on-premises deployments to help keep sensitive data private.

VPC Service Controls provide an additional layer of context-based perimeter security beyond the identity-based access control offered by Cloud IAM. Using VPC Service Controls, you can mitigate security risks such as access from unauthorized networks using stolen credentials, data exfiltration by disgruntled insiders, and inadvertent exposure of private data due to misconfigured IAM policies. You can also use VPC Service Controls to prevent tools such as gsutil and bq from reading data from, or copying data to, resources outside the perimeter.

To set up VPC Service Controls, go to the VPC Service Controls section of the GCP console and add a new perimeter. You can then specify the projects and services within those projects that are allowed to communicate with each other within the perimeter. For example, suppose you choose only Project A and two services: BigQuery and Cloud Storage. Then it will be possible to load data from GCS into BigQuery and export data from BigQuery to GCS but only to buckets owned by the same project. It will not be possible to load data into BigQuery from buckets owned by other projects or export data from BigQuery into buckets owned by other projects. The same goes for copying data between buckets or querying datasets that are part of other projects. You can, of course, have two projects within the perimeter to allow interproject communication (but only between those two projects).
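
You can also create the perimeter from the command line by using the Access Context Manager tooling. The following is a minimal sketch; the perimeter name, project number, and access policy ID are placeholders:

gcloud access-context-manager perimeters create bq_perimeter \
  --title="BigQuery perimeter" \
  --resources=projects/[PROJECT_NUMBER] \
  --restricted-services=bigquery.googleapis.com,storage.googleapis.com \
  --policy=[POLICY_ID]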

Summary

In this chapter, we described the infrastructure security that underlies BigQuery and discussed how users and applications can be authenticated and authorized using IAM. We then discussed a variety of tools that can help you fulfill your regulatory and compliance needs. We’d like to reiterate, however, that it is always your responsibility to work with your legal counsel to determine whether implementing any of these tools and capabilities will satisfy your regulatory or compliance requirements.

Thank you for sticking with us through a wide-ranging tour of BigQuery. We began with an introduction to the service in Chapter 1, delved into the SQL syntax in Chapter 2, described data types in Chapter 3, covered ingest in Chapter 4, and examined the development environment in Chapter 5. We started to go beyond the basics in Chapter 6 with a description of BigQuery’s architecture. In Chapter 7, we covered a variety of performance tips, and in Chapter 8, we poked around in some of the nooks and crannies of BigQuery capabilities. Chapter 9 was devoted to machine learning in BigQuery, and in this chapter we focused on security. Because BigQuery is a serverless SQL enterprise data warehouse that was designed to be knob free, we were able to use this book to focus on analyzing data to find meaningful insights. We hope that you will enjoy working with BigQuery and be exceedingly successful with it!

1 For information, see https://console.cloud.google.com/apis/credentials/serviceaccountkey and https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating_a_service_account.

2 These are essentially like G Suite domains, but they don’t have access to G Suite applications. To manage users who don’t need G Suite or premium features like mobile device management, you can create free Cloud Identity accounts for them—see https://support.google.com/cloudidentity/answer/7319251.

3 Note that allUsers, although allowed by GCP, doesn’t have any effect in BigQuery because all BigQuery users must be authenticated.

4 Cancelled jobs are still charged.

5 Recall that BigQuery data is stored in a specific region or multiregion. Queries will need to run where the data is located, and the job metadata is also stored regionally.

6 Create the output dataset ch10eu in the EU location first.

7 Also, the encompassing dataset should not have been deleted or re-created. See https://cloud.google.com/bigquery/docs/managing-tables#undeletetable.

8 Cloud Functions have a configurable timeout, but the maximum value as of this writing is nine minutes.

9 Create a dataset if necessary.

10 Of course, extenuating circumstances, code bugs, natural disasters, and so on could cause data loss. But as designed, the probability of data loss is very low. If you had 1 PB of data stored and every day 1,000 disks in a single zone crashed, after a million years there would be less than a 0.01% chance that you’d lose data.

11 These are unexpected and very rare events of large magnitude and consequence; the name refers to a theory put forward by Nassim Nicholas Taleb in his book, The Black Swan: The Impact of the Highly Improbable (Random House).

12 As of this writing, data stored in the EU multiregion is not stored in Zurich or London.

13 The referenced dataset is public but, for PII reasons, doesn’t include visitor email addresses. So the query won’t work as-is. However, it is illustrative.

14 For example, suppose that a programmer uses an authorized application that displays a bike’s location at the current time. If they also had the ability to modify the application, they would be able to swap the bike’s encrypted current location with the encrypted location from an earlier time and get the application to map it. This way, they would obtain the bike’s location history without ever knowing the encryption keys. If we use start_date as an additional field, however, this sort of attack is not possible, because the earlier location was encrypted using the start_date of that trip as additional data and cannot be decrypted using the latest date. Any metadata that provides this sort of context can be used as extra data to the AEAD encryption function.

15 See https://cloud.google.com/dlp/docs/. As of this writing, Cloud DLP is a global service. If you have data location requirements, check whether this is still the case.

16 See https://en.wikipedia.org/wiki/Data_re-identification. This is the risk that anonymized data can be matched with auxiliary data to reidentify the individual with whom the data is associated.
