Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 2. Query Essentials

BigQuery is first and foremost a data warehouse, by which we mean that it provides persistent storage for structured and semi-structured data (like JSON objects). The four basic CRUD operations are supported on this persistent storage:

Create: To insert new records. This is implemented through load operations, by the SQL INSERT statement, and through a streaming insert API. You can also use SQL to create database objects like tables, views, and machine learning models as part of BigQuery’s support of the Data Definition Language (DDL). We go into examples of each later.
Read: To retrieve records. This is implemented by the SQL SELECT statement as well as the bulk read API.
Update: To modify existing records. This is implemented by the SQL UPDATE and MERGE statements, which are part of BigQuery’s support of the Data Manipulation Language (DML). Note that, as we discussed in Chapter 1, BigQuery is an analytics tool and is not meant to be used for frequent updates.
Delete: To remove existing records. This is implemented by SQL DELETE, which is also a DML operation.

BigQuery is a tool for data analysis, and the majority of queries you can expect to write will be the aforementioned Read operations. Reading and analyzing your data is accomplished by the SELECT statement, which is the focus of this chapter. We cover creating, updating, and deleting data in later chapters.

Simple Queries

BigQuery supports a dialect of SQL that is compliant with SQL:2011. When the specification is ambiguous or otherwise lacking, BigQuery follows the conventions set by existing SQL engines. There are other areas in which there is no specification at all, such as with machine learning; in these cases, BigQuery defines its own syntax and semantics.

What’s Legacy SQL?

For a long time, BigQuery supported only a limited subset of SQL with some Google enhancements. This was because BigQuery was based on an internal SQL query engine at Google (called Dremel) that was originally built to process log data held in Protocol Buffers (Protobufs).¹ Because it was not built as a general-purpose SQL engine, Dremel could use a dialect of SQL (now referred to as legacy SQL) that was well suited to Protobufs, which are used to hold hierarchical structures. For example, the legacy SQL dialect distinguished between records (the complete hierarchical structure pertaining to a log message) and rows (slices through the structure).² Therefore, COUNT(*) in Dremel counts the number of non-NULL values in the most repeated field. Even though such features made certain types of queries much easier to write, Dremel took some getting used to because it was not standard SQL.

In this book, we focus exclusively on standard SQL. The BigQuery user interface (UI) in the Google Cloud Platform (GCP) Cloud Console defaults to standard SQL, and new features are not being backported to legacy SQL. However, some tools and user interfaces still default to legacy SQL. If that is the case for any tool that you are using, preface the query with #standardsql on the first line, as shown in the following example:

#standardsql
SELECT DISTINCT gender
FROM `bigquery-public-data`.new_york_citibike.citibike_trips

If the BigQuery service receives a query string whose first line consists of #standardsql, the query engine will treat what follows as standard SQL even if the client itself does not know about standard SQL.

Retrieving Rows by Using SELECT

The SELECT statement allows you to retrieve the values of specified columns from a table. For example, consider the New York bicycle rentals dataset—it contains several columns relating to bicycle rentals, including the trip duration and the gender of the person renting the bicycle. We can pull out the values of these columns by using the SELECT statement (lines beginning with double dashes or # are comments):

-- simple select
SELECT
  gender, tripduration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
LIMIT 5

The result looks something like this:

Row	gender	tripduration
1	male	371
2	male	1330
3	male	830
4	male	555
5	male	328

The result set has two columns (gender and tripduration) in the order specified in the SELECT. There are five rows in the result set because we limited it to five in the final line of the query. BigQuery distributes the task of fetching rows to multiple workers, each of which can read a different shard (or part) of the dataset, so if you run the previous query, you might get a different set of five rows.

Note that using a LIMIT constrains only the amount of data displayed to you and not the amount of data the query engine needs to process. You are typically charged based on the amount of data processed by your queries, and this usually implies that the more columns your query reads, the higher your cost will be. The number of rows processed will usually be the total size of the table that you are reading, although there are ways to optimize this (which we cover in Chapter 7). We examine performance and pricing considerations in later chapters.

The values are being retrieved from the following:

bigquery-public-data.new_york_citibike.citibike_trips

Here, bigquery-public-data is the project ID, new_york_citibike is the dataset, and citibike_trips is the table. The project ID indicates ownership of the persistent storage associated with the dataset and its tables. The owner of bigquery-public-data is paying the storage costs associated with the new_york dataset. The cost of the query is paid by the project within which the query is issued. If you run the preceding query, you pay the query costs. Datasets provide for Identity and Access Management (IAM). The person who created the new_york_citibike dataset³ in BigQuery made it public, which is why we were able to list the tables in the dataset and query one of those tables. The citibike_trips table contains all of the bicycle trips. The project, dataset, and table are separated by dots. The backtick is needed as an escape character in this case because the hyphen (-) in the project name (bigquery-public-data) would otherwise be interpreted as subtraction. Most developers simply enclose the entire string within backticks, as shown here:

-- simple select
SELECT
  gender, tripduration
FROM
  `bigquery-public-data.new_york_citibike.citibike_trips`
LIMIT 5

Although this is simpler, you lose the ability to use the table name (citibike_trips) as an alias. So it is worth developing the habit of putting the backticks only around the project name and avoiding the use of hyphens when you create your own datasets and tables.

For a long time, our recommendation was that tables in BigQuery be stored in denormalized form (i.e., a single table often containing all of the data you’d need without the need for joining multiple tables). However, with improvements in the service, this is no longer necessary. It is possible now to achieve good performance even with a star schema.

Table 2-1 reviews the three key components of the name `bigquery-public-data`.new_york_citibike.citibike_trips.

Table 2-1. Summary of BigQuery objects and descriptions
BigQuery object	Name	Description
Project	`bigquery-public-data`	Owner of the persistent storage associated with the dataset and its tables. The project also governs the use of all other GCP products as well.
Dataset	`new_york_citibike`	Datasets are top-level containers that are used to organize and control access to tables and views. A user can own multiple datasets.
Table/View	`citibike_trips`	A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.^a
^a See https://cloud.google.com/bigquery/docs/datasets-intro.

Distinguishing between each of these three will be critical later when we discuss geographic location, data access, and sharing of data.

Aliasing Column Names with AS

By default, the names of the columns in the result set match those of the table from which the data is retrieved. It is possible to alias the column names by using AS:

-- Aliasing column names
SELECT
  gender, tripduration AS rental_duration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
LIMIT 5

This now yields the following (your results might be a different set of five):

Row	gender	rental_duration
1	male	432
2	female	1186
3	male	799
4	female	238
5	male	668

Aliasing is useful when you are transforming data. For example, without the alias, a statement such as

SELECT
  gender, tripduration/60
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
LIMIT 5

would result in an automatically assigned column name for the second column in the result set:

Row	gender	f0_
1	male	6.183333333333334
2	male	22.166666666666668
3	male	13.833333333333334
4	male	9.25
5	male	5.466666666666667

You can assign the second column a more descriptive name by adding an alias to your query:

SELECT
  gender, tripduration/60 AS duration_minutes
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
LIMIT 5

This yields a result similar to the following:

Row	gender	duration_minutes
1	male	6.183333333333334
2	male	22.166666666666668
3	male	13.833333333333334
4	male	9.25
5	male	5.466666666666667

Filtering with WHERE

To find rentals of less than 10 minutes, we could filter the results returned by SELECT by using a WHERE clause:

SELECT
  gender, tripduration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE tripduration < 600
LIMIT 5

As expected, the result set now consists only of rows for which the trip duration is less than 600 seconds:

Row	gender	tripduration
1	male	178
2	male	518
3	male	376
4	male	326
5	male	516

The WHERE clause can include Boolean expressions. For example, to find trips rented by females and lasting between 5 and 10 minutes, you could use this:

SELECT
  gender, tripduration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE tripduration >= 300 AND tripduration < 600 AND gender = 'female'
LIMIT 5

The OR keyword also works, as does NOT. For example, to find nonfemale riders (i.e., male riders and those whose gender is unknown), the WHERE clause could be as follows:

WHERE tripduration < 600 AND NOT gender = 'female'

It is also possible to use parentheses to control the order of evaluation. To find female riders who take short trips as well as all male riders, you could use this:

WHERE (tripduration < 600 AND gender = 'female') OR gender = 'male'

The WHERE clause operates on the columns in the FROM clause; thus, it is not possible to reference aliases from the SELECT list in the WHERE clause. In other words, to retain only trips shorter than 10 minutes, it is not possible to use the following:

SELECT
  gender, tripduration/60 AS minutes
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE minutes < 10 -- CAN NOT REFERENCE ALIAS IN WHERE
LIMIT 5

Instead, you need to repeat the transformation in the WHERE clause (we explore better alternatives later):

SELECT
  gender, tripduration / 60 AS minutes
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE (tripduration / 60) < 10
LIMIT 5

SELECT *, EXCEPT, REPLACE

For cost and performance reasons (which we cover in detail in Chapter 7), it is better to select only the columns that you want. If, however, you do want to select all of the columns in the table, you can use SELECT *:

SELECT
  *
FROM
  `bigquery-public-data`.new_york_citibike.citibike_stations
WHERE name LIKE '%Riverside%'

The WHERE clause uses the LIKE operator to look for stations that have Riverside anywhere in their name.

To select all except for a few columns, use SELECT EXCEPT:

SELECT
  * EXCEPT(short_name, last_reported)
FROM
  `bigquery-public-data`.new_york_citibike.citibike_stations
WHERE name LIKE '%Riverside%'

This query returns the same result as the previous one except that two of the columns (short_name and last_reported) are omitted.

To select all of the columns but replace a column with another, you can use SELECT REPLACE. For example, you can add 5 to the number of bikes reported to be available using the following:

SELECT
  * REPLACE(num_bikes_available + 5 AS num_bikes_available)
FROM
  `bigquery-public-data`.new_york_citibike.citibike_stations

Subqueries with WITH

You can reduce the repetitiveness and retain the use of the alias by using a subquery:

SELECT * FROM (
  SELECT
     gender, tripduration / 60 AS minutes
  FROM
    `bigquery-public-data`.new_york_citibike.citibike_trips
)
WHERE minutes < 10
LIMIT 5

The outer SELECT operates on the inner subquery that is enclosed within parentheses. Because the alias happens in the inner query, the outer query can use the alias in its WHERE clause.

Queries with parentheses can become quite difficult to read. A better approach is to use a WITH clause to provide names to what would otherwise have been subqueries:

WITH all_trips AS (
  SELECT
     gender, tripduration / 60 AS minutes
   FROM
     `bigquery-public-data`.new_york_citibike.citibike_trips
)

SELECT * from all_trips
WHERE minutes < 10
LIMIT 5

In BigQuery, the WITH clause behaves like a named subquery and does not create temporary tables. We will refer to all_trips as a “from_item”—it’s not a table, but you can select from it.

Sorting with ORDER BY

To control the order of rows in the result set, use ORDER BY:

SELECT
  gender, tripduration/60 AS minutes
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE gender = 'female'
ORDER BY minutes DESC
LIMIT 5

By default, rows in results are not ordered. If an order column is specified, the default is ascending order. By asking for the rows to be listed in descending order and limiting to 5, we get the five longest trips by women in the dataset:

Row	gender	minutes
1	female	250348.9
2	female	226437.93333333332
3	female	207988.71666666667
4	female	159712.05
5	female	154239.0

Note that we are ordering by minutes, which is an alias—because the ORDER BY is carried out after the SELECT, it is possible to use aliases in ORDER BY.

Aggregates

In the example in the previous section, when we converted seconds to minutes by dividing by 60, we operated on every row in the table and transformed it. It is also possible to apply a function to aggregate all of the rows so that the result set contains only one row.

Computing Aggregates by Using GROUP BY

To find the average duration of trips by male riders, you could do the following:

SELECT
  AVG(tripduration / 60) AS avg_trip_duration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE
  gender = 'male'

This yields the following:

Row	avg_trip_duration
1	13.415553172043886

This indicates that the average bicycle trip taken by male riders in New York is about 13.4 minutes. Because the dataset is continuously updated, though, your result might be different.

How about female riders? Although you could run the previous query twice, once for male riders and the next for females, it seems wasteful to traverse through the dataset a second time, changing the WHERE clause. Instead, you can use a GROUP BY:

SELECT
  gender, AVG(tripduration / 60) AS avg_trip_duration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE
  tripduration is not NULL
GROUP BY
  gender
ORDER BY
  avg_trip_duration

This yields the following result:

Row	gender	avg_trip_duration
1	male	13.415553172043886
2	female	15.977472148805207
3	unknown	31.4395230232542

The aggregates have now been computed on each group separately. The SELECT expression can include the thing being grouped by (gender) and aggregates (AVG). Note that there are actually three genders in the dataset: male, female, and unknown.

Counting Records by Using COUNT

To see how many rides went into the previous averages, you can simply add a COUNT():

SELECT
  gender,
  COUNT(*) AS rides,
  AVG(tripduration / 60) AS avg_trip_duration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE
  tripduration IS NOT NULL
GROUP BY
  gender
ORDER BY
  avg_trip_duration

This gives us the result shown here:

Row	gender	rides	avg_trip_duration
1	male	35611787	13.415553172043888
2	female	11376412	15.97747214880521
3	unknown	6120522	31.439523023254207

Filtering Grouped Items by Using HAVING

It is possible to post-filter the grouped operations via the HAVING clause. To learn which genders take trips that, on average, last longer than 14 minutes, you can use this:

SELECT
  gender, AVG(tripduration / 60) AS avg_trip_duration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE tripduration IS NOT NULL
GROUP BY
  gender
HAVING avg_trip_duration > 14
ORDER BY
  avg_trip_duration

This yields the following:

Row	gender	avg_trip_duration
1	female	15.977472148805209
2	unknown	31.439523023254203

Note that, even though it is possible to filter the gender or trip duration with a WHERE clause, it is not possible to use it to filter by average duration, because the average duration is computed only after the items have been grouped (try it!).

Finding Unique Values by Using DISTINCT

What values of gender are present in the dataset? Although you could use GROUP BY, a simpler way to get a list of distinct values of a column is to use SELECT DISTINCT:

SELECT DISTINCT 
   gender
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips

This yields a result set with just four rows:

Row	gender
1	male
2	female
3	unknown
4

Four rows? What is the fourth row? Let’s explore:

SELECT
    bikeid,
    tripduration,
    gender
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE gender = ""
LIMIT 100

This yields the result shown here:

Row	bikeid	tripduration
1	null	null
2	null	null
3	null	null
...

In this particular case, a blank gender value seems to indicate missing or poor-quality data. We discuss missing data (NULL values) and how you can account for and transform them in Chapter 3, but briefly: if you want to filter for NULLs in a WHERE clause, use the IS NULL or IS NOT NULL operators because other comparison operators (=, !=, <, >) applied to a NULL return NULL and therefore will never match the WHERE condition.

Going back to our original query for DISTINCT genders, it’s important to note that the DISTINCT modifies the entire SELECT, not just the gender column. To see what we mean, add a second column to the query’s SELECT list:

SELECT DISTINCT
  gender,
  usertype
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE gender != ''

This results in six rows; that is, you get a row for every combination of unique gender and user type (subscriber or customer) that exists in the dataset:

Row	gender	usertype
1	male	Subscriber
2	unknown	Customer
3	female	Subscriber
4	female	Customer
5	male	Customer
6	unknown	Subscriber

A Brief Primer on Arrays and Structs

In this section, we provide a brief primer on arrays so that we can illustrate many of the data types and functions in the next chapter on small, illustrative datasets. The combination of ARRAY (the square brackets in the query that follows) and UNNEST gives us a quick way to experiment with queries, functions, and data types.

For example, if you want to know how the SPLIT function of a string behaves, simply try it out:

SELECT
  city, SPLIT(city, ' ') AS parts
FROM (
  SELECT * from UNNEST([
      'Seattle WA', 'New York', 'Singapore'
  ]) AS city
)

Here’s the result of this quick query:

Row	city	parts
1	Seattle WA	Seattle
		WA
2	New York	New
		York
3	Singapore	Singapore

This ability to hardcode an array of values in the SQL query itself allows you to play with arrays and data types without the need to find an appropriate dataset or wait for long queries to finish. Even better, this processes 0 bytes and therefore does not incur BigQuery charges.⁴

Another way to quickly experiment with a set of values employs UNION ALL to combine single row SELECT statements:

WITH example AS (
  SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways
  UNION ALL SELECT 'Sun', 2376, 936
  UNION ALL SELECT 'Mon', 1476, 736
)

SELECT * from example
WHERE numrides < 2000

This yields the two rows in the small inline dataset that have fewer than 2,000 rides:

Row	day	numrides	oneways
1	Sat	1451	1018
2	Mon	1476	736

In the next chapter, we use such inline datasets with hardcoded numbers to illustrate various aspects of the way different data types and functions behave.

The purpose of this section is to quickly introduce arrays and structs so that we can use them in illustrative examples. We review these concepts in greater detail in Chapter 8, so feel free to quickly skim the remainder of this section for now.

Creating Arrays by Using ARRAY_AGG

Consider finding the number of trips by gender and year:

SELECT
  gender
  , EXTRACT(YEAR from starttime) AS year --
  , COUNT(*) AS numtrips
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE gender != 'unknown' and starttime IS NOT NULL
GROUP BY gender, year
HAVING year > 2016

This returns the following:

Row	gender	year	numtrips
1	male	2017	9306602
2	male	2018	3955871
3	female	2018	1260893
4	female	2017	3236735

Tip

What’s with the leading commas in the SELECT clause? Standard SQL (at least as of this writing) does not support a trailing comma, and so moving the comma to the next line allows you to easily reorder or comment lines and still have a working query:

SELECT
  gender
  , EXTRACT(YEAR from starttime) AS year
  -- comment out this line , COUNT(1) AS numtrips
FROM etc.

Trust us, the leading comma will become second nature after a while and will greatly speed up your development.⁵

What would be required, though, if we want to get a time-series of the number of trips associated with each gender over the years—in other words, the following result?

Row	gender	numtrips
1	male	9306602
		3955871
2	female	3236735
		1260893

To get this, you would need to create an array of the numbers of trips. You can represent that array in SQL using the ARRAY type and create such an array by using ARRAY_AGG:

SELECT 
  gender
  , ARRAY_AGG(numtrips order by year) AS numtrips
FROM (
  SELECT
    gender
    , EXTRACT(YEAR from starttime) AS year
    , COUNT(1) AS numtrips
  FROM
    `bigquery-public-data`.new_york_citibike.citibike_trips
  WHERE gender != 'unknown' and starttime IS NOT NULL
  GROUP BY gender, year
  HAVING year > 2016
)
GROUP BY gender

Normally, when you group by gender, you compute a single scalar value for the group, such as the AVG(numtrips) to find the average number of trips across all years. ARRAY_AGG allows you to collect the individual values and put them into an ordered list, or ARRAY.

The ARRAY type is not limited to the results of queries. Because BigQuery can ingest hierarchical formats such as JSON, it is possible that the input data contains JSON arrays—for example:

[
 {
    "gender": "male",
    "numtrips": [
      "9306602",
      "3955871"
    ]
 },
 {
    "gender": "female",
    "numtrips": [
      "3236735",
      "1260893"
    ]
 }
]

Creating a table by ingesting such a JSON file will result in a table whose numtrips column is an ARRAY type. An array is an ordered list of non-NULL elements; for instance, ARRAY<INT64> is an array of integers.

Note

Technically, NULL elements in arrays are permissible as long as you don’t try to save them to a table. Thus, for example, the following will not work, because you are trying to save the array [1, NULL, 2] to the temporary table that holds the results:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT ARRAY_AGG(IF(position = 2, NULL, position)) as
positions from example

However, the following will work because the intermediate array with a NULL element is not being saved:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT ARRAY_LENGTH(ARRAY_AGG(IF(position = 2, NULL,
position))) from example

Array of STRUCT

A STRUCT is a group of fields in order. The fields can be named (if omitted, BigQuery will assign them names), which we recommend for readability:

SELECT 
  [
     STRUCT('male' as gender, [9306602, 3955871] as numtrips)
     , STRUCT('female' as gender, [3236735, 1260893] as numtrips)
  ] AS bikerides

This results in the following:

Row	bikerides.gender	bikerides.numtrips
1	male	9306602
		3955871
	female	3236735
		1260893

TUPLE

We could have left out the STRUCT keyword and the names of the fields, in which case we would have ended up with a tuple or anonymous struct. BigQuery assigns arbitrary names for unnamed columns and struct fields in the result of a query; thus

SELECT 
  [
     ('male', [9306602, 3955871])
     , ('female', [3236735, 1260893])
  ]

yields this result:

Row	f0_._field_1	f0_._field_2
1	male	9306602
		3955871
	female	3236735
		1260893

Obviously, leaving out aliases for the field names makes subsequent queries unreadable and unmaintainable. Do not do this except for throwaway experimentation.

Working with Arrays

Given an array, we can find the length of the array and retrieve individual items:

SELECT 
  ARRAY_LENGTH(bikerides) as num_items
  , bikerides[ OFFSET(0) ].gender as first_gender
FROM
(SELECT 
  [
     STRUCT('male' as gender, [9306602, 3955871] as numtrips)
     , STRUCT('female' as gender, [3236735, 1260893] as numtrips)
  ]  AS bikerides)

This yields the following:

Row	num_items	first_gender
1	2	male

Offsets are numbered starting at zero, which is why OFFSET(0) gives us the first item in the array.⁶

UNNEST an Array

In the query

SELECT
  [
     STRUCT('male' as gender, [9306602, 3955871] as numtrips)
     , STRUCT('female' as gender, [3236735, 1260893] as numtrips)
  ]

the SELECT returns exactly one row containing an array, and so both genders are part of the same row (look at the Row column):

Row	f0_.gender	f0_.numtrips
1	male	9306602
		3955871
	female	3236735
		1260893

UNNEST is a function that returns the elements of an array as rows, so you can UNNEST the result array to get a row corresponding to each item in the array:

SELECT * from UNNEST(
  [
     STRUCT('male' as gender, [9306602, 3955871] as numtrips)
     , STRUCT('female' as gender, [3236735, 1260893] as numtrips)
  ])

This yields the following:

Row	gender	numtrips
1	male	9306602
		3955871
2	female	3236735
		1260893

Notice that UNNEST is actually a from_item—you can SELECT from it. You can select just parts of the array as well. For example, we can get only the numtrips column by using this:

SELECT numtrips from UNNEST(
  [
     STRUCT('male' as gender, [9306602, 3955871] as numtrips)
     , STRUCT('female' as gender, [3236735, 1260893] as numtrips)
  ])

This gives us the following results:

Row	numtrips
1	9306602
	3955871
2	3236735
	1260893

Joining Tables

Data warehouse schemas often rely on a primary “fact” table that contains events, and satellite “dimension” tables that contain extended, slowly changing information. For example, a retail schema might have a “Sales” table as the fact table and then “Products” and “Customers” tables as dimensions. When using this type of schema, the majority of queries will require a JOIN operation, such as to return the names of all the products purchased by a particular customer.

BigQuery supports all of the common join types from relational algebra: inner joins, outer joins, cross joins, anti-joins, semi-joins, and anti-semi-joins. Although it can sometimes be faster to avoid a JOIN, BigQuery can efficiently join tables of almost any size. Chapter 7 discusses more about how to optimize JOIN performance, but for now, we describe only the basic JOIN operation.

The JOIN Explained

In Chapter 1, we looked at an example of a JOIN across tables in two different datasets produced by two different organizations. Let’s revisit that for a refresher:

WITH bicycle_rentals AS (
  SELECT
    COUNT(starttime) as num_trips,
    EXTRACT(DATE from starttime) as trip_date
    FROM `bigquery-public-data`.new_york_citibike.citibike_trips
  GROUP BY trip_date
),

rainy_days AS
(
SELECT
  date,
  (MAX(prcp) > 5) AS rainy
FROM (
  SELECT
    wx.date AS date,
    IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
  FROM
    `bigquery-public-data`.ghcn_d.ghcnd_2016 AS wx
  WHERE
    wx.id = 'USW00094728'
)
GROUP BY
  date
)

SELECT
  ROUND(AVG(bk.num_trips)) AS num_trips,
  wx.rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
GROUP BY wx.rainy

In Chapter 1, we asked you to ignore the syntax, but let’s parse it now.

The first WITH pulls out the number of trips by day from the citibike_trips table into a from_item called bicycle_rentals. This is not a table, but it is something from which we can select. Hence, we will refer to it as a “from_item.” The second from_item is called rainy_days and is created from the Global Historical Climate Network (GHCN) observation in each day. This from_item marks each day as being rainy or not depending on whether at least five mm of precipitation was observed at weather station 'USW00094728', which happens to be in New York.

So now we have two from_items. Let’s visualize them separately:

WITH bicycle_rentals AS (
  SELECT
     COUNT(starttime) as num_trips,
     EXTRACT(DATE from starttime) as trip_date
  FROM `bigquery-public-data`.new_york_citibike.citibike_trips
  GROUP BY trip_date
)
SELECT * from bicycle_rentals LIMIT 5

The bicycle_rentals from_item looks like this:

Row	num_trips	trip_date
1	31287	2013-09-16
2	22477	2015-12-30
3	37812	2017-09-02
4	54230	2017-11-15
5	25719	2013-11-07

Similarly, the rainy_days from_item looks like this:

Row	date	rainy
1	2016-10-11	false
2	2016-12-13	false
3	2016-09-28	false
4	2016-01-25	false
5	2016-05-24	false

We can now join these from_items using the join condition that the trip_date in one is the same as the date in the second:

SELECT
  bk.trip_date,
  bk.num_trips,
  wx.rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
LIMIT 5

This creates a table in which columns from the two tables are joined by date:

Row	trip_date	num_trips	rainy
1	2016-07-13	55486	false
2	2016-04-25	42308	false
3	2016-09-27	61346	true
4	2016-07-15	48572	false
5	2016-05-20	52543	false

Given this, finding the average number of trips on rainy and nonrainy dates is straightforward.

What we have illustrated is called an inner join, and it is the type of JOIN used if no join type is specified.

Here’s how the JOIN works:

Create two from_items. These can be anything: any two of a table, a subquery, an array, or a WITH statement from which you can select.
Identify a join condition. The join condition does not need to be an equality condition; any Boolean condition that uses the two from_items will do.
Select the columns that you want. If identically named columns exist in both from_items, use aliases (bk, wx in the previous example query) to clearly specify from which from_item the column needs to come.
If not using an inner join, specify a join type.

The only requirement for carrying out such a join is that all the datasets used to create the from_items are in the same BigQuery region (all BigQuery public datasets are in the US region).

INNER JOIN

There are several types of joins. The INNER JOIN (or simply JOIN), to which the previous example defaulted, creates a common set of rows to select from:

WITH from_item_a AS (
  SELECT 'Dalles' as city, 'OR' as state
  UNION ALL SELECT 'Tokyo', 'Tokyo'
  UNION ALL SELECT 'Mumbai', 'Maharashtra'
),
 
from_item_b AS (
  SELECT 'OR' as state, 'USA' as country
  UNION ALL SELECT 'Tokyo', 'Japan'
  UNION ALL SELECT 'Maharashtra', 'India'
)
 
SELECT from_item_a.*, country
FROM from_item_a
JOIN from_item_b
ON from_item_a.state = from_item_b.state

The first from_item has a list of cities, and the second from_item identifies the country each of the states belongs to. Joining the two yields a dataset with three columns:

Row	city	state	country
1	Dalles	OR	USA
2	Tokyo	Tokyo	Japan
3	Mumbai	Maharashtra	India

Again, the join condition does not need to be an equality check. Any Boolean condition will do, although it’s best to use an equality condition if possible because BigQuery will return an error if the JOIN cannot be executed efficiently.

For example, we might have a business rule that shipping from one country to another involves a surcharge. To get a list of countries for which there will be a surcharge from a given location, we could have specified this:

SELECT from_item_a.*, country AS surcharge
FROM from_item_a
JOIN from_item_b
ON from_item_a.state != from_item_b.state

We would obtain the following:

Row	city	state	surcharge
1	Dalles	OR	Japan
2	Dalles	OR	India
3	Tokyo	Tokyo	USA
4	Tokyo	Tokyo	India
5	Mumbai	Maharashtra	USA
6	Mumbai	Maharashtra	Japan

Notice that we get a row for each time that the join condition is met. Because there are two rows for which the state doesn’t match, we get two rows for each row in the original from_item_a. If the join condition is not met for some row, that row’s data items will not make it to the output.

CROSS JOIN

The CROSS JOIN, or cartesian product, is a join with no join condition. All rows from both from_items are joined. This is the join that we would get if the join condition of an INNER JOIN always evaluated to true.

For example, suppose that you organized a tournament and have a table of the winners of each event in the tournament, and another table containing the gifts for each event. You can give each winner the gift corresponding to their event only by doing an INNER JOIN:

WITH winners AS (
  SELECT 'John' as person, '100m' as event
  UNION ALL SELECT 'Hiroshi', '200m'
  UNION ALL SELECT 'Sita', '400m'
),
gifts AS (
  SELECT 'Google Home' as gift, '100m' as event
  UNION ALL SELECT 'Google Hub', '200m'
  UNION ALL SELECT 'Pixel3', '400m'
)
SELECT winners.*, gifts.gift
FROM winners
JOIN gifts

This would provide the following result:

Row	person	event	gift
1	John	100m	Google Home
2	Hiroshi	200m	Google Hub
3	Sita	400m	Pixel3

On the other hand, if you want to give each gift to each winner (i.e., each winner gets all three gifts), you could do a CROSS JOIN:

WITH winners AS (
  SELECT 'John' as person, '100m' as event
  UNION ALL SELECT 'Hiroshi', '200m'
  UNION ALL SELECT 'Sita', '400m'
),
gifts AS (
  SELECT 'Google Home' as gift
  UNION ALL SELECT 'Google Hub'
  UNION ALL SELECT 'Pixel3'
)
SELECT person, gift
FROM winners
CROSS JOIN gifts

This yields a row for each potential combination:

Row	person	gift
1	John	Google Home
2	John	Google Hub
3	John	Pixel3
4	Hiroshi	Google Home
5	Hiroshi	Google Hub
6	Hiroshi	Pixel3
7	Sita	Google Home
8	Sita	Google Hub
9	Sita	Pixel3

Even though we wrote

SELECT from_item_a.*, from_item_b.*
FROM from_item_a
CROSS JOIN from_item_b

we could also have written this:

SELECT from_item_a.*, from_item_b.*
FROM from_item_a, from_item_b

Therefore, a CROSS JOIN is also termed a comma cross join.

OUTER JOIN

Suppose that we have winners in events for which there is no gift, and gifts for events that didn’t take place in our tournament:

WITH winners AS (
  SELECT 'John' as person, '100m' as event
  UNION ALL SELECT 'Hiroshi', '200m'
  UNION ALL SELECT 'Sita', '400m'
  UNION ALL SELECT 'Kwame', '50m'
),
gifts AS (
  SELECT 'Google Home' as gift, '100m' as event
  UNION ALL SELECT 'Google Hub', '200m'
  UNION ALL SELECT 'Pixel3', '400m'
  UNION ALL SELECT 'Google Mini', '5000m'
)

In an INNER JOIN (on the event column), the winner of the 50-meter dash doesn’t receive a gift, and the gift for the 5,000-meter event goes unclaimed. In a CROSS JOIN, as we noted, every winner receives every gift. OUTER JOINs control what happens if the join condition is not met. Table 2-2 summarizes the various types of joins and the resulting output.

Table 2-2. Summary of types of joins and their outputs
Syntax	What happens	Output
`SELECT person, gift FROM winners INNER JOIN gifts ON winners.event = gifts.event`	Only rows that meet the join condition are retained
`SELECT person, gift FROM winners FULL OUTER JOIN gifts ON winners.event = gifts.event`	All rows are retained even if the join condition is not met
`SELECT person, gift FROM winners LEFT OUTER JOIN gifts ON winners.event = gifts.event`	All the winners are retained, but some gifts are discarded
`SELECT person, gift FROM winners RIGHT OUTER JOIN gifts ON winners.event = gifts.event`	All the gifts are retained, but some winners aren’t

Saving and Sharing

The BigQuery web UI offers the ability to save and share queries. This is handy for collaboration because you can send colleagues a link to the query text that enables them to execute the query immediately. Be aware, though, that if someone has your query, they might not be able to execute it if they don’t have access to your data. We discuss how to share and limit access to your datasets in Chapter 10.

Query History and Caching

We should note that BigQuery retains, for audit and caching purposes, a history of the queries that you submitted to the service (regardless of whether the queries succeeded), as illustrated in Figure 2-1.

The history of queries submitted to the BigQuery service is available via the “Query history” tab in the web UI.

This history includes all queries submitted by you to the service, not just those submitted via the web UI. Clicking any of the queries provides the text of the query and the ability to open the query in the editor so that you can modify and rerun it. In addition, the historical information includes the amount of data processed by the query and the execution time. As of this writing, the history is limited to 1,000 queries and six months.

The actual results of the query are stored in a temporary table that expires after about 24 hours. If you are within that expiry window, you will also be able to browse the results of the query from the web UI. Your personal history is available only to you. Administrators of the project to which your query was billed will also see your query text in the project’s history.

This temporary table is also used as a cache if the exact same query text is submitted to the service and the query does not involve dynamic elements such as CURRENT_TIMESTAMP() or RAND(). Cached query results incur no charges, but note that the algorithm to determine whether a query is a duplicate simply does a string match—even an extra whitespace can result in the query being reexecuted.

Saved Queries

You can save any query by loading it into the query editor, clicking the “Save query” button, and then giving the query a name, as shown in Figure 2-2. BigQuery then provides a URL to the query text.

Save a query by clicking the “Save query” button in the web UI.

You can also choose to make the saved query shareable, in which case anyone who has the URL will be directed to a page with the query text prepopulated.

When you share a query, all that you share is the text of the query; you do not share access to any data. Dataset permissions to execute the query must be provided independently using the IAM controls (we discuss these in Chapter 10). Also, unlike most BigQuery features, the ability to save and share queries is available only from the web UI. As of this writing, there is no REST API or client library available for this.

The list of saved queries is available from the UI. You can turn off link sharing at any time to make the query text private again, as illustrated in Figure 2-3.

Views Versus Shared Queries

One of the advantages of sharing a query link (as opposed to simply copying the text of the query into an email) is that you can continue to edit the query so that your collaborators always get the latest version of the query. This is useful when the envisioned use case is that they might want to examine the query text, modify it, and then run the query.

The query text does not need to be syntactically correct; you can save and share incomplete query text or template queries that need to be completed by the end user. These capabilities are helpful when you’re collaborating with colleagues.

If you expect the person to whom you are sending the query to subset or query the results of your query, it is better to save your query as a view and send your colleague a link to the view. Another advantage of views over shared queries is that views are placed into datasets and offer fine-grained IAM controls. Views can also be materialized.

We look at authorized views and at dynamically filtering them based on the user in Chapter 10.

Summary

In this chapter, you saw how BigQuery supports SQL:2011: selecting records (SELECT), aliasing column names (AS), filtering (WHERE), using subqueries (parentheses and WITH), sorting (ORDER), aggregating (GROUP, AVG, COUNT, MIN, MAX, etc.), filtering grouped items (HAVING), filtering unique values (DISTINCT), and joining (INNER/CROSS/OUTER JOIN). There is also support for arrays (ARRAY_AGG, UNNEST) and structs (STRUCT). You also looked at how to review the history of queries (the text of the query, not the results) submitted to the service. This history is available, you learned, to the user who submitted the query, and to project administrators. And you learned that it is possible to share query text through a link.

¹ This is a data format that is very popular within Google because it provides efficient storage in a programming-language-neutral way. It is now open source; see https://developers.google.com/protocol-buffers/.

² For more details on Dremel, see https://ai.google/research/pubs/pub36632.

³ The “person” in this case is one of the members of the Google Cloud Platform public datasets team. See Google Cloud Public Datasets for what else is available.

⁴ We believe all mentions of price to be correct as of the writing of this book, but please do refer to the relevant policy and pricing sheets because these are subject to change.

⁵ For an entertaining data-driven examination of the correlation between project success and the use of leading commas, see https://oreil.ly/mFZKh.

⁶ You can also use ORDINAL(1) to work with 1-based indexing. We look at arrays in more detail in Chapter 8.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
2. Query Essentials

Chapter 2. Query Essentials

Simple Queries

Retrieving Rows by Using SELECT

Aliasing Column Names with AS

Filtering with WHERE

SELECT *, EXCEPT, REPLACE

Subqueries with WITH

Sorting with ORDER BY

Aggregates

Computing Aggregates by Using GROUP BY

Counting Records by Using COUNT

Filtering Grouped Items by Using HAVING

Finding Unique Values by Using DISTINCT

A Brief Primer on Arrays and Structs

Creating Arrays by Using ARRAY_AGG

Tip

Note

Array of STRUCT

TUPLE

Working with Arrays

UNNEST an Array

Joining Tables

The JOIN Explained

INNER JOIN

CROSS JOIN

OUTER JOIN

Saving and Sharing

Query History and Caching

Figure 2-1. The history of queries submitted to the BigQuery service is available via the “Query history” tab in the web UI

Saved Queries

Figure 2-2. Save a query by clicking the “Save query” button in the web UI

Figure 2-3. You can turn off link sharing at any time to make the query text private again

Views Versus Shared Queries

Summary

Table of Contents for 2. Query Essentials

Create new playlist

Sign In

Sign Up

Chapter 2. Query Essentials

Simple Queries

Retrieving Rows by Using SELECT

Aliasing Column Names with AS

Filtering with WHERE

SELECT *, EXCEPT, REPLACE

Subqueries with WITH

Sorting with ORDER BY

Aggregates

Computing Aggregates by Using GROUP BY

Counting Records by Using COUNT

Filtering Grouped Items by Using HAVING

Finding Unique Values by Using DISTINCT

A Brief Primer on Arrays and Structs

Creating Arrays by Using ARRAY_AGG

Tip

Note

Array of STRUCT

TUPLE

Working with Arrays

UNNEST an Array

Joining Tables

The JOIN Explained

INNER JOIN

CROSS JOIN

OUTER JOIN

Saving and Sharing

Query History and Caching

Figure 2-1. The history of queries submitted to the BigQuery service is available via the “Query history” tab in the web UI

Saved Queries

Figure 2-2. Save a query by clicking the “Save query” button in the web UI

Figure 2-3. You can turn off link sharing at any time to make the query text private again

Views Versus Shared Queries

Summary

Table of Contents for
2. Query Essentials