BigQuery is first and foremost a data warehouse, by which we mean that it provides persistent storage for structured and semi-structured data (like JSON objects). The four basic CRUD operations are supported on this persistent storage:
SQL INSERT
statement, and through a streaming insert API. You can also use SQL to create database objects like tables, views, and machine learning models as part of BigQuery’s support of the Data Definition Language (DDL). We go into examples of each later.SQL SELECT
statement as well as the bulk read API.SQL UPDATE
and MERGE
statements, which are part of BigQuery’s support of the Data Manipulation Language (DML). Note that, as we discussed in Chapter 1, BigQuery is an analytics tool and is not meant to be used for frequent updates.SQL DELETE
, which is also a DML operation.BigQuery is a tool for data analysis, and the majority of queries you can expect to write will be the aforementioned Read operations. Reading and analyzing your data is accomplished by the SELECT
statement, which is the focus of this chapter. We cover creating, updating, and deleting data in later chapters.
BigQuery supports a dialect of SQL that is compliant with SQL:2011. When the specification is ambiguous or otherwise lacking, BigQuery follows the conventions set by existing SQL engines. There are other areas in which there is no specification at all, such as with machine learning; in these cases, BigQuery defines its own syntax and semantics.
The SELECT
statement allows you to retrieve the values of specified columns from a table. For example, consider the New York bicycle rentals dataset—it contains several columns relating to bicycle rentals, including the trip duration and the gender of the person renting the bicycle. We can pull out the values of these columns by using the SELECT
statement (lines beginning with double dashes or #
are comments):
-- simple select SELECT gender, tripduration FROM `bigquery-public-data`.new_york_citibike.citibike_trips LIMIT 5
The result looks something like this:
Row | gender | tripduration |
---|---|---|
1 | male | 371 |
2 | male | 1330 |
3 | male | 830 |
4 | male | 555 |
5 | male | 328 |
The result set has two columns (gender
and tripduration
) in the order specified in the SELECT
. There are five rows in the result set because we limited it to five in the final line of the query. BigQuery distributes the task of fetching rows to multiple workers, each of which can read a different shard (or part) of the dataset, so if you run the previous query, you might get a different set of five rows.
Note that using a LIMIT
constrains only the amount of data displayed to you and not the amount of data the query engine needs to process. You are typically charged based on the amount of data processed by your queries, and this usually implies that the more columns your query reads, the higher your cost will be. The number of rows processed will usually be the total size of the table that you are reading, although there are ways to optimize this (which we cover in Chapter 7). We examine performance and pricing considerations in later chapters.
The values are being retrieved from the following:
bigquery-public-data.new_york_citibike.citibike_trips
Here, bigquery-public-data
is the project ID, new_york_citibike
is the dataset, and citibike_trips
is the table. The project ID indicates ownership of the persistent storage associated with the dataset and its tables. The owner of bigquery-public-data
is paying the storage costs associated with the new_york
dataset. The cost of the query is paid by the project within which the query is issued. If you run the preceding query, you pay the query costs. Datasets provide for Identity and Access Management (IAM). The person who created the new_york_citibike
dataset3 in BigQuery made it public, which is why we were able to list the tables in the dataset and query one of those tables. The citibike_trips
table contains all of the bicycle trips. The project, dataset, and table are separated by dots. The backtick is needed as an escape character in this case because the hyphen (-) in the project name (bigquery-public-data
) would otherwise be interpreted as subtraction. Most developers simply enclose the entire string within backticks, as shown here:
-- simple select SELECT gender, tripduration FROM `bigquery-public-data.new_york_citibike.citibike_trips` LIMIT 5
Although this is simpler, you lose the ability to use the table name (citibike_trips
) as an alias. So it is worth developing the habit of putting the backticks only around the project name and avoiding the use of hyphens when you create your own datasets and tables.
For a long time, our recommendation was that tables in BigQuery be stored in denormalized form (i.e., a single table often containing all of the data you’d need without the need for joining multiple tables). However, with improvements in the service, this is no longer necessary. It is possible now to achieve good performance even with a star schema.
Table 2-1 reviews the three key components of the name `bigquery-public-data`.new_york_citibike.citibike_trips
.
BigQuery object | Name | Description |
---|---|---|
Project | bigquery-public-data |
Owner of the persistent storage associated with the dataset and its tables. The project also governs the use of all other GCP products as well. |
Dataset | new_york_citibike |
Datasets are top-level containers that are used to organize and control access to tables and views. A user can own multiple datasets. |
Table/View | citibike_trips |
A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.a |
a See https://cloud.google.com/bigquery/docs/datasets-intro. |
Distinguishing between each of these three will be critical later when we discuss geographic location, data access, and sharing of data.
By default, the names of the columns in the result set match those of the table from which the data is retrieved. It is possible to alias the column names by using AS
:
-- Aliasing column names SELECT gender, tripduration AS rental_duration FROM `bigquery-public-data`.new_york_citibike.citibike_trips LIMIT 5
This now yields the following (your results might be a different set of five):
Row | gender | rental_duration |
---|---|---|
1 | male | 432 |
2 | female | 1186 |
3 | male | 799 |
4 | female | 238 |
5 | male | 668 |
Aliasing is useful when you are transforming data. For example, without the alias, a statement such as
SELECT gender, tripduration/60 FROM `bigquery-public-data`.new_york_citibike.citibike_trips LIMIT 5
would result in an automatically assigned column name for the second column in the result set:
Row | gender | f0_ |
---|---|---|
1 | male | 6.183333333333334 |
2 | male | 22.166666666666668 |
3 | male | 13.833333333333334 |
4 | male | 9.25 |
5 | male | 5.466666666666667 |
You can assign the second column a more descriptive name by adding an alias to your query:
SELECT gender, tripduration/60 AS duration_minutes FROM `bigquery-public-data`.new_york_citibike.citibike_trips LIMIT 5
This yields a result similar to the following:
Row | gender | duration_minutes |
---|---|---|
1 | male | 6.183333333333334 |
2 | male | 22.166666666666668 |
3 | male | 13.833333333333334 |
4 | male | 9.25 |
5 | male | 5.466666666666667 |
To find rentals of less than 10 minutes, we could filter the results returned by SELECT
by using a WHERE
clause:
SELECT gender, tripduration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE tripduration < 600 LIMIT 5
As expected, the result set now consists only of rows for which the trip duration is less than 600 seconds:
Row | gender | tripduration |
---|---|---|
1 | male | 178 |
2 | male | 518 |
3 | male | 376 |
4 | male | 326 |
5 | male | 516 |
The WHERE
clause can include Boolean expressions. For example, to find trips rented by females and lasting between 5 and 10 minutes, you could use this:
SELECT gender, tripduration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE tripduration >= 300 AND tripduration < 600 AND gender = 'female' LIMIT 5
The OR
keyword also works, as does NOT
. For example, to find nonfemale riders (i.e., male riders and those whose gender is unknown), the WHERE
clause could be as follows:
WHERE tripduration < 600 AND NOT gender = 'female'
It is also possible to use parentheses to control the order of evaluation. To find female riders who take short trips as well as all male riders, you could use this:
WHERE (tripduration < 600 AND gender = 'female') OR gender = 'male'
The WHERE
clause operates on the columns in the FROM
clause; thus, it is not possible to reference aliases from the SELECT
list in the WHERE
clause. In other words, to retain only trips shorter than 10 minutes, it is not possible to use the following:
SELECT
gender, tripduration/60 AS minutes
FROM
`bigquery-public-data`.new_york_citibike.citibike_trips
WHERE minutes < 10 -- CAN NOT REFERENCE ALIAS IN WHERE
LIMIT 5
Instead, you need to repeat the transformation in the WHERE
clause (we explore better alternatives later):
SELECT gender, tripduration / 60 AS minutes FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE (tripduration / 60) < 10 LIMIT 5
For cost and performance reasons (which we cover in detail in Chapter 7), it is better to select only the columns that you want. If, however, you do want to select all of the columns in the table, you can use SELECT *
:
SELECT * FROM `bigquery-public-data`.new_york_citibike.citibike_stations WHERE name LIKE '%Riverside%'
The WHERE
clause uses the LIKE
operator to look for stations that have Riverside
anywhere in their name.
To select all except for a few columns, use SELECT EXCEPT
:
SELECT * EXCEPT(short_name, last_reported) FROM `bigquery-public-data`.new_york_citibike.citibike_stations WHERE name LIKE '%Riverside%'
This query returns the same result as the previous one except that two of the columns (short_name
and last_reported
) are omitted.
To select all of the columns but replace a column with another, you can use SELECT
REPLACE
. For example, you can add 5
to the number of bikes reported to be available using the following:
SELECT * REPLACE(num_bikes_available + 5 AS num_bikes_available) FROM `bigquery-public-data`.new_york_citibike.citibike_stations
You can reduce the repetitiveness and retain the use of the alias by using a subquery:
SELECT * FROM ( SELECT gender, tripduration / 60 AS minutes FROM `bigquery-public-data`.new_york_citibike.citibike_trips ) WHERE minutes < 10 LIMIT 5
The outer SELECT
operates on the inner subquery that is enclosed within parentheses. Because the alias happens in the inner query, the outer query can use the alias in its WHERE
clause.
Queries with parentheses can become quite difficult to read. A better approach is to use a WITH
clause to provide names to what would otherwise have been subqueries:
WITH all_trips AS ( SELECT gender, tripduration / 60 AS minutes FROM `bigquery-public-data`.new_york_citibike.citibike_trips ) SELECT * from all_trips WHERE minutes < 10 LIMIT 5
In BigQuery, the WITH
clause behaves like a named subquery and does not create temporary tables. We will refer to all_trips
as a “from_item”—it’s not a table, but you can select from it.
To control the order of rows in the result set, use ORDER BY
:
SELECT gender, tripduration/60 AS minutes FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender = 'female' ORDER BY minutes DESC LIMIT 5
By default, rows in results are not ordered. If an order column is specified, the default is ascending order. By asking for the rows to be listed in descending order and limiting to 5
, we get the five longest trips by women in the dataset:
Row | gender | minutes |
---|---|---|
1 | female | 250348.9 |
2 | female | 226437.93333333332 |
3 | female | 207988.71666666667 |
4 | female | 159712.05 |
5 | female | 154239.0 |
Note that we are ordering by minutes, which is an alias—because the ORDER BY
is carried out after the SELECT
, it is possible to use aliases in ORDER BY
.
In the example in the previous section, when we converted seconds to minutes by dividing by 60, we operated on every row in the table and transformed it. It is also possible to apply a function to aggregate all of the rows so that the result set contains only one row.
To find the average duration of trips by male riders, you could do the following:
SELECT AVG(tripduration / 60) AS avg_trip_duration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender = 'male'
This yields the following:
Row | avg_trip_duration |
---|---|
1 | 13.415553172043886 |
This indicates that the average bicycle trip taken by male riders in New York is about 13.4 minutes. Because the dataset is continuously updated, though, your result might be different.
How about female riders? Although you could run the previous query twice, once for male riders and the next for females, it seems wasteful to traverse through the dataset a second time, changing the WHERE
clause. Instead, you can use a GROUP BY
:
SELECT gender, AVG(tripduration / 60) AS avg_trip_duration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE tripduration is not NULL GROUP BY gender ORDER BY avg_trip_duration
This yields the following result:
Row | gender | avg_trip_duration |
---|---|---|
1 | male | 13.415553172043886 |
2 | female | 15.977472148805207 |
3 | unknown | 31.4395230232542 |
The aggregates have now been computed on each group separately. The SELECT
expression can include the thing being grouped by (gender) and aggregates (AVG
). Note that there are actually three genders in the dataset: male, female, and unknown.
To see how many rides went into the previous averages, you can simply add a COUNT()
:
SELECT gender, COUNT(*) AS rides, AVG(tripduration / 60) AS avg_trip_duration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE tripduration IS NOT NULL GROUP BY gender ORDER BY avg_trip_duration
This gives us the result shown here:
Row | gender | rides | avg_trip_duration |
---|---|---|---|
1 | male | 35611787 | 13.415553172043888 |
2 | female | 11376412 | 15.97747214880521 |
3 | unknown | 6120522 | 31.439523023254207 |
It is possible to post-filter the grouped operations via the HAVING
clause. To learn which genders take trips that, on average, last longer than 14 minutes, you can use this:
SELECT gender,AVG(
tripduration / 60) AS avg_trip_duration
FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE tripduration IS NOT NULL GROUP BY gender HAVING avg_trip_duration > 14 ORDER BY avg_trip_duration
This yields the following:
Row | gender | avg_trip_duration |
---|---|---|
1 | female | 15.977472148805209 |
2 | unknown | 31.439523023254203 |
Note that, even though it is possible to filter the gender or trip duration with a WHERE
clause, it is not possible to use it to filter by average duration, because the average duration is computed only after the items have been grouped (try it!).
What values of gender are present in the dataset? Although you could use GROUP BY
, a simpler way to get a list of distinct values of a column is to use SELECT DISTINCT
:
SELECT DISTINCT gender FROM `bigquery-public-data`.new_york_citibike.citibike_trips
This yields a result set with just four rows:
Row | gender |
---|---|
1 | male |
2 | female |
3 | unknown |
4 |
Four rows? What is the fourth row? Let’s explore:
SELECT bikeid, tripduration, gender FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender = "" LIMIT 100
This yields the result shown here:
Row | bikeid | tripduration | gender |
---|---|---|---|
1 | null | null | |
2 | null | null | |
3 | null | null | |
... |
In this particular case, a blank gender value seems to indicate missing or poor-quality data. We discuss missing data (NULL
values) and how you can account for and transform them in Chapter 3, but briefly: if you want to filter for NULL
s in a WHERE
clause, use the IS NULL
or IS NOT NULL
operators because other comparison operators (=
, !=
, <
, >
) applied to a NULL
return NULL
and therefore will never match the WHERE
condition.
Going back to our original query for DISTINCT
genders, it’s important to note that the DISTINCT
modifies the entire SELECT
, not just the gender column. To see what we mean, add a second column to the query’s SELECT
list:
SELECT DISTINCT gender, usertype FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender != ''
This results in six rows; that is, you get a row for every combination of unique gender and user type (subscriber or customer) that exists in the dataset:
Row | gender | usertype |
---|---|---|
1 | male | Subscriber |
2 | unknown | Customer |
3 | female | Subscriber |
4 | female | Customer |
5 | male | Customer |
6 | unknown | Subscriber |
In this section, we provide a brief primer on arrays so that we can illustrate many of the data types and functions in the next chapter on small, illustrative datasets. The combination of ARRAY
(the square brackets in the query that follows) and UNNEST
gives us a quick way to experiment with queries, functions, and data types.
For example, if you want to know how the SPLIT
function of a string behaves, simply try it out:
SELECT city, SPLIT(city, ' ') AS parts FROM ( SELECT * from UNNEST([ 'Seattle WA', 'New York', 'Singapore' ]) AS city )
Here’s the result of this quick query:
Row | city | parts |
---|---|---|
1 | Seattle WA | Seattle |
WA | ||
2 | New York | New |
York | ||
3 | Singapore | Singapore |
This ability to hardcode an array of values in the SQL query itself allows you to play with arrays and data types without the need to find an appropriate dataset or wait for long queries to finish. Even better, this processes 0 bytes and therefore does not incur BigQuery charges.4
Another way to quickly experiment with a set of values employs UNION ALL
to combine single row SELECT
statements:
WITH example AS ( SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways UNION ALL SELECT 'Sun', 2376, 936 UNION ALL SELECT 'Mon', 1476, 736 ) SELECT * from example WHERE numrides < 2000
This yields the two rows in the small inline dataset that have fewer than 2,000 rides:
Row | day | numrides | oneways |
---|---|---|---|
1 | Sat | 1451 | 1018 |
2 | Mon | 1476 | 736 |
In the next chapter, we use such inline datasets with hardcoded numbers to illustrate various aspects of the way different data types and functions behave.
The purpose of this section is to quickly introduce arrays and structs so that we can use them in illustrative examples. We review these concepts in greater detail in Chapter 8, so feel free to quickly skim the remainder of this section for now.
Consider finding the number of trips by gender and year:
SELECT gender , EXTRACT(YEAR from starttime) AS year -- , COUNT(*) AS numtrips FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender != 'unknown' and starttime IS NOT NULL GROUP BY gender, year HAVING year > 2016
This returns the following:
Row | gender | year | numtrips |
---|---|---|---|
1 | male | 2017 | 9306602 |
2 | male | 2018 | 3955871 |
3 | female | 2018 | 1260893 |
4 | female | 2017 | 3236735 |
What’s with the leading commas in the SELECT
clause? Standard SQL (at least as of this writing) does not support a trailing comma, and so moving the comma to the next line allows you to easily reorder or comment lines and still have a working query:
SELECT gender , EXTRACT(YEAR from starttime) AS year -- comment out this line , COUNT(1) AS numtrips FROM etc.
Trust us, the leading comma will become second nature after a while and will greatly speed up your development.5
What would be required, though, if we want to get a time-series of the number of trips associated with each gender over the years—in other words, the following result?
Row | gender | numtrips |
---|---|---|
1 | male | 9306602 |
3955871 | ||
2 | female | 3236735 |
1260893 |
To get this, you would need to create an array of the numbers of trips. You can represent that array in SQL using the ARRAY
type and create such an array by using ARRAY_AGG
:
SELECT gender , ARRAY_AGG(numtrips order by year) AS numtrips FROM ( SELECT gender , EXTRACT(YEAR from starttime) AS year , COUNT(1) AS numtrips FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE gender != 'unknown' and starttime IS NOT NULL GROUP BY gender, year HAVING year > 2016 ) GROUP BY gender
Normally, when you group by gender, you compute a single scalar value for the group, such as the AVG(numtrips)
to find the average number of trips across all years. ARRAY_AGG
allows you to collect the individual values and put them into an ordered list, or ARRAY
.
The ARRAY
type is not limited to the results of queries. Because BigQuery can ingest hierarchical formats such as JSON, it is possible that the input data contains JSON arrays—for example:
[ { "gender": "male", "numtrips": [ "9306602", "3955871" ] }, { "gender": "female", "numtrips": [ "3236735", "1260893" ] } ]
Creating a table by ingesting such a JSON file will result in a table whose numtrips
column is an ARRAY
type. An array is an ordered list of non-NULL
elements; for instance, ARRAY<INT64>
is an array of integers.
Technically, NULL elements in arrays are permissible as long as you don’t try to save them to a table. Thus, for example, the following will not work, because you are trying to save the array [1, NULL, 2]
to the temporary table that holds the results:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT ARRAY_AGG(IF(position = 2, NULL, position)) as positions from example
However, the following will work because the intermediate array with a NULL element is not being saved:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT ARRAY_LENGTH(ARRAY_AGG(IF(position = 2, NULL, position))) from example
A STRUCT
is a group of fields in order. The fields can be named (if omitted, BigQuery will assign them names), which we recommend for readability:
SELECT [ STRUCT('male' as gender, [9306602, 3955871] as numtrips) , STRUCT('female' as gender, [3236735, 1260893] as numtrips) ] AS bikerides
This results in the following:
Row | bikerides.gender | bikerides.numtrips |
---|---|---|
1 | male | 9306602 |
3955871 | ||
female | 3236735 | |
1260893 |
We could have left out the STRUCT
keyword and the names of the fields, in which case we would have ended up with a tuple or anonymous struct. BigQuery assigns arbitrary names for unnamed columns and struct fields in the result of a query; thus
SELECT [ ('male', [9306602, 3955871]) , ('female', [3236735, 1260893]) ]
yields this result:
Row | f0_._field_1 | f0_._field_2 |
---|---|---|
1 | male | 9306602 |
3955871 | ||
female | 3236735 | |
1260893 |
Obviously, leaving out aliases for the field names makes subsequent queries unreadable and unmaintainable. Do not do this except for throwaway experimentation.
Given an array, we can find the length of the array and retrieve individual items:
SELECT ARRAY_LENGTH(bikerides) as num_items , bikerides[ OFFSET(0) ].gender as first_gender FROM (SELECT [ STRUCT('male' as gender, [9306602, 3955871] as numtrips) , STRUCT('female' as gender, [3236735, 1260893] as numtrips) ] AS bikerides)
This yields the following:
Row | num_items | first_gender |
---|---|---|
1 | 2 | male |
Offsets are numbered starting at zero, which is why OFFSET(0)
gives us the first item in the array.6
In the query
SELECT [ STRUCT('male' as gender, [9306602, 3955871] as numtrips) , STRUCT('female' as gender, [3236735, 1260893] as numtrips) ]
the SELECT
returns exactly one row containing an array, and so both genders are part of the same row (look at the Row column):
Row | f0_.gender | f0_.numtrips |
---|---|---|
1 | male | 9306602 |
3955871 | ||
female | 3236735 | |
1260893 |
UNNEST
is a function that returns the elements of an array as rows, so you can UNNEST
the result array to get a row corresponding to each item in the array:
SELECT * from UNNEST( [ STRUCT('male' as gender, [9306602, 3955871] as numtrips) , STRUCT('female' as gender, [3236735, 1260893] as numtrips) ])
This yields the following:
Row | gender | numtrips |
---|---|---|
1 | male | 9306602 |
3955871 | ||
2 | female | 3236735 |
1260893 |
Notice that UNNEST
is actually a from_item—you can SELECT
from it. You can select just parts of the array as well. For example, we can get only the numtrips
column by using this:
SELECT numtrips from UNNEST( [ STRUCT('male' as gender, [9306602, 3955871] as numtrips) , STRUCT('female' as gender, [3236735, 1260893] as numtrips) ])
This gives us the following results:
Row | numtrips |
---|---|
1 | 9306602 |
3955871 | |
2 | 3236735 |
1260893 |
Data warehouse schemas often rely on a primary “fact” table that contains events, and satellite “dimension” tables that contain extended, slowly changing information. For example, a retail schema might have a “Sales” table as the fact table and then “Products” and “Customers” tables as dimensions. When using this type of schema, the majority of queries will require a JOIN
operation, such as to return the names of all the products purchased by a particular customer.
BigQuery supports all of the common join types from relational algebra: inner joins, outer joins, cross joins, anti-joins, semi-joins, and anti-semi-joins. Although it can sometimes be faster to avoid a JOIN
, BigQuery can efficiently join tables of almost any size. Chapter 7 discusses more about how to optimize JOIN
performance, but for now, we describe only the basic JOIN
operation.
In Chapter 1, we looked at an example of a JOIN
across tables in two different datasets produced by two different organizations. Let’s revisit that for a refresher:
WITH bicycle_rentals AS ( SELECT COUNT(starttime) as num_trips, EXTRACT(DATE from starttime) as trip_date FROM `bigquery-public-data`.new_york_citibike.citibike_trips GROUP BY trip_date ), rainy_days AS ( SELECT date, (MAX(prcp) > 5) AS rainy FROM ( SELECT wx.date AS date, IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp FROM `bigquery-public-data`.ghcn_d.ghcnd_2016 AS wx WHERE wx.id = 'USW00094728' ) GROUP BY date ) SELECT ROUND(AVG(bk.num_trips)) AS num_trips, wx.rainy FROM bicycle_rentals AS bk JOIN rainy_days AS wx ON wx.date = bk.trip_date GROUP BY wx.rainy
In Chapter 1, we asked you to ignore the syntax, but let’s parse it now.
The first WITH
pulls out the number of trips by day from the citibike_trips
table into a from_item called bicycle_rentals
. This is not a table, but it is something from which we can select. Hence, we will refer to it as a “from_item.” The second from_item is called rainy_days
and is created from the Global Historical Climate Network (GHCN) observation in each day. This from_item marks each day as being rainy or not depending on whether at least five mm of precipitation was observed at weather station 'USW00094728'
, which happens to be in New York.
So now we have two from_items. Let’s visualize them separately:
WITH bicycle_rentals AS ( SELECT COUNT(starttime) as num_trips, EXTRACT(DATE from starttime) as trip_date FROM `bigquery-public-data`.new_york_citibike.citibike_trips GROUP BY trip_date ) SELECT * from bicycle_rentals LIMIT 5
The bicycle_rentals
from_item looks like this:
Row | num_trips | trip_date |
---|---|---|
1 | 31287 | 2013-09-16 |
2 | 22477 | 2015-12-30 |
3 | 37812 | 2017-09-02 |
4 | 54230 | 2017-11-15 |
5 | 25719 | 2013-11-07 |
Similarly, the rainy_days
from_item looks like this:
Row | date | rainy |
---|---|---|
1 | 2016-10-11 | false |
2 | 2016-12-13 | false |
3 | 2016-09-28 | false |
4 | 2016-01-25 | false |
5 | 2016-05-24 | false |
We can now join these from_items using the join condition that the trip_date
in one is the same as the date in the second:
SELECT bk.trip_date, bk.num_trips, wx.rainy FROM bicycle_rentals AS bk JOIN rainy_days AS wx ON wx.date = bk.trip_date LIMIT 5
This creates a table in which columns from the two tables are joined by date:
Row | trip_date | num_trips | rainy |
---|---|---|---|
1 | 2016-07-13 | 55486 | false |
2 | 2016-04-25 | 42308 | false |
3 | 2016-09-27 | 61346 | true |
4 | 2016-07-15 | 48572 | false |
5 | 2016-05-20 | 52543 | false |
Given this, finding the average number of trips on rainy and nonrainy dates is straightforward.
What we have illustrated is called an inner join, and it is the type of JOIN
used if no join type is specified.
Create two from_items. These can be anything: any two of a table, a subquery, an array, or a WITH statement from which you can select.
Identify a join condition. The join condition does not need to be an equality condition; any Boolean condition that uses the two from_items will do.
Select the columns that you want. If identically named columns exist in both from_items, use aliases (bk
, wx
in the previous example query) to clearly specify from which from_item the column needs to come.
If not using an inner join, specify a join type.
The only requirement for carrying out such a join is that all the datasets used to create the from_items are in the same BigQuery region (all BigQuery public datasets are in the US region).
There are several types of joins. The INNER JOIN
(or simply JOIN
), to which the previous example defaulted, creates a common set of rows to select from:
WITH from_item_a AS ( SELECT 'Dalles' as city, 'OR' as state UNION ALL SELECT 'Tokyo', 'Tokyo' UNION ALL SELECT 'Mumbai', 'Maharashtra' ), from_item_b AS ( SELECT 'OR' as state, 'USA' as country UNION ALL SELECT 'Tokyo', 'Japan' UNION ALL SELECT 'Maharashtra', 'India' ) SELECT from_item_a.*, country FROM from_item_a JOIN from_item_b ON from_item_a.state = from_item_b.state
The first from_item has a list of cities, and the second from_item identifies the country each of the states belongs to. Joining the two yields a dataset with three columns:
Row | city | state | country |
---|---|---|---|
1 | Dalles | OR | USA |
2 | Tokyo | Tokyo | Japan |
3 | Mumbai | Maharashtra | India |
Again, the join condition does not need to be an equality check. Any Boolean condition will do, although it’s best to use an equality condition if possible because BigQuery will return an error if the JOIN
cannot be executed efficiently.
For example, we might have a business rule that shipping from one country to another involves a surcharge. To get a list of countries for which there will be a surcharge from a given location, we could have specified this:
SELECT from_item_a.*, country AS surcharge FROM from_item_a JOIN from_item_b ON from_item_a.state != from_item_b.state
We would obtain the following:
Row | city | state | surcharge |
---|---|---|---|
1 | Dalles | OR | Japan |
2 | Dalles | OR | India |
3 | Tokyo | Tokyo | USA |
4 | Tokyo | Tokyo | India |
5 | Mumbai | Maharashtra | USA |
6 | Mumbai | Maharashtra | Japan |
Notice that we get a row for each time that the join condition is met. Because there are two rows for which the state doesn’t match, we get two rows for each row in the original from_item_a. If the join condition is not met for some row, that row’s data items will not make it to the output.
The CROSS JOIN
, or cartesian product, is a join with no join condition. All rows from both from_items are joined. This is the join that we would get if the join condition of an INNER JOIN
always evaluated to true.
For example, suppose that you organized a tournament and have a table of the winners of each event in the tournament, and another table containing the gifts for each event. You can give each winner the gift corresponding to their event only by doing an INNER JOIN
:
WITH winners AS ( SELECT 'John' as person, '100m' as event UNION ALL SELECT 'Hiroshi', '200m' UNION ALL SELECT 'Sita', '400m' ), gifts AS ( SELECT 'Google Home' as gift, '100m' as event UNION ALL SELECT 'Google Hub', '200m' UNION ALL SELECT 'Pixel3', '400m' ) SELECT winners.*, gifts.gift FROM winners JOIN gifts
This would provide the following result:
Row | person | event | gift |
---|---|---|---|
1 | John | 100m | Google Home |
2 | Hiroshi | 200m | Google Hub |
3 | Sita | 400m | Pixel3 |
On the other hand, if you want to give each gift to each winner (i.e., each winner gets all three gifts), you could do a CROSS JOIN
:
WITH winners AS ( SELECT 'John' as person, '100m' as event UNION ALL SELECT 'Hiroshi', '200m' UNION ALL SELECT 'Sita', '400m' ), gifts AS ( SELECT 'Google Home' as gift UNION ALL SELECT 'Google Hub' UNION ALL SELECT 'Pixel3' ) SELECT person, gift FROM winners CROSS JOIN gifts
This yields a row for each potential combination:
Row | person | gift |
---|---|---|
1 | John | Google Home |
2 | John | Google Hub |
3 | John | Pixel3 |
4 | Hiroshi | Google Home |
5 | Hiroshi | Google Hub |
6 | Hiroshi | Pixel3 |
7 | Sita | Google Home |
8 | Sita | Google Hub |
9 | Sita | Pixel3 |
Even though we wrote
SELECT from_item_a.*, from_item_b.* FROM from_item_a CROSS JOIN from_item_b
we could also have written this:
SELECT from_item_a.*, from_item_b.* FROM from_item_a, from_item_b
Suppose that we have winners in events for which there is no gift, and gifts for events that didn’t take place in our tournament:
WITH winners AS ( SELECT 'John' as person, '100m' as event UNION ALL SELECT 'Hiroshi', '200m' UNION ALL SELECT 'Sita', '400m' UNION ALL SELECT 'Kwame', '50m' ), gifts AS ( SELECT 'Google Home' as gift, '100m' as event UNION ALL SELECT 'Google Hub', '200m' UNION ALL SELECT 'Pixel3', '400m' UNION ALL SELECT 'Google Mini', '5000m' )
In an INNER JOIN
(on the event column), the winner of the 50-meter dash doesn’t receive a gift, and the gift for the 5,000-meter event goes unclaimed. In a CROSS JOIN
, as we noted, every winner receives every gift. OUTER JOIN
s control what happens if the join condition is not met. Table 2-2 summarizes the various types of joins and the resulting output.
Syntax | What happens | Output |
---|---|---|
SELECT person, gift FROM winners INNER JOIN gifts ON winners.event = gifts.event |
Only rows that meet the join condition are retained | |
SELECT person, gift FROM winners FULL OUTER JOIN gifts ON winners.event = gifts.event |
All rows are retained even if the join condition is not met | |
SELECT person, gift FROM winners LEFT OUTER JOIN gifts ON winners.event = gifts.event |
All the winners are retained, but some gifts are discarded | |
SELECT person, gift FROM winners RIGHT OUTER JOIN gifts ON winners.event = gifts.event |
All the gifts are retained, but some winners aren’t |
The BigQuery web UI offers the ability to save and share queries. This is handy for collaboration because you can send colleagues a link to the query text that enables them to execute the query immediately. Be aware, though, that if someone has your query, they might not be able to execute it if they don’t have access to your data. We discuss how to share and limit access to your datasets in Chapter 10.
We should note that BigQuery retains, for audit and caching purposes, a history of the queries that you submitted to the service (regardless of whether the queries succeeded), as illustrated in Figure 2-1.
This history includes all queries submitted by you to the service, not just those submitted via the web UI. Clicking any of the queries provides the text of the query and the ability to open the query in the editor so that you can modify and rerun it. In addition, the historical information includes the amount of data processed by the query and the execution time. As of this writing, the history is limited to 1,000 queries and six months.
The actual results of the query are stored in a temporary table that expires after about 24 hours. If you are within that expiry window, you will also be able to browse the results of the query from the web UI. Your personal history is available only to you. Administrators of the project to which your query was billed will also see your query text in the project’s history.
This temporary table is also used as a cache if the exact same query text is submitted to the service and the query does not involve dynamic elements such as CURRENT_TIMESTAMP()
or RAND()
. Cached query results incur no charges, but note that the algorithm to determine whether a query is a duplicate simply does a string match—even an extra whitespace can result in the query being reexecuted.
You can save any query by loading it into the query editor, clicking the “Save query” button, and then giving the query a name, as shown in Figure 2-2. BigQuery then provides a URL to the query text.
You can also choose to make the saved query shareable, in which case anyone who has the URL will be directed to a page with the query text prepopulated.
When you share a query, all that you share is the text of the query; you do not share access to any data. Dataset permissions to execute the query must be provided independently using the IAM controls (we discuss these in Chapter 10). Also, unlike most BigQuery features, the ability to save and share queries is available only from the web UI. As of this writing, there is no REST API or client library available for this.
The list of saved queries is available from the UI. You can turn off link sharing at any time to make the query text private again, as illustrated in Figure 2-3.
In this chapter, you saw how BigQuery supports SQL:2011: selecting records (SELECT
), aliasing column names (AS
), filtering (WHERE
), using subqueries (parentheses and WITH
), sorting (ORDER
), aggregating (GROUP
, AVG
, COUNT
, MIN
, MAX
, etc.), filtering grouped items (HAVING
), filtering unique values (DISTINCT
), and joining (INNER
/CROSS
/OUTER JOIN
). There is also support for arrays (ARRAY_AGG
, UNNEST
) and structs (STRUCT
). You also looked at how to review the history of queries (the text of the query, not the results) submitted to the service. This history is available, you learned, to the user who submitted the query, and to project administrators. And you learned that it is possible to share query text through a link.
1 This is a data format that is very popular within Google because it provides efficient storage in a programming-language-neutral way. It is now open source; see https://developers.google.com/protocol-buffers/.
2 For more details on Dremel, see https://ai.google/research/pubs/pub36632.
3 The “person” in this case is one of the members of the Google Cloud Platform public datasets team. See Google Cloud Public Datasets for what else is available.
4 We believe all mentions of price to be correct as of the writing of this book, but please do refer to the relevant policy and pricing sheets because these are subject to change.
5 For an entertaining data-driven examination of the correlation between project success and the use of leading commas, see https://oreil.ly/mFZKh.
6 You can also use ORDINAL(1)
to work with 1-based indexing. We look at arrays in more detail in Chapter 8.