In the bike rental queries in the previous chapters, when we divided the trip duration by 60, we were able to do so because trip duration was a numeric type. Trying to divide the gender by 60 would not have worked because gender is a string. The functions and operations you have at your disposal might be restricted based on the type of data to which you are applying them.
BigQuery supports several data types to store numeric, string, time, geographic, structured, and semi-structured data:
INT64
FLOAT64
, and for Booleans, use BOOL
.NUMERIC
NUMERIC
offers 38 digits of precision and 9 decimal digits of scale and is suitable for exact calculations, such as in finance.STRING
BYTES
are variable-length sequences of characters (not Unicode).TIMESTAMP
DATETIME
DATE
and TIME
are also available separately.GEOGRAPHY
GEOGRAPHY
represents points, lines, and polygons on the surface of the Earth.STRUCT
and ARRAY
As just stated, there is only one integer type (INT64
) and only one floating-point type (FLOAT64
). Both of these types support the typical arithmetic operations (+, –, /, *—for add, subtract, divide, and multiply, respectively). Thus, we can find the fraction of bike rentals that are one-way by simply dividing one column by the other:
WITH example AS ( SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways UNION ALL SELECT 'Sun', 2376, 936 ) SELECT *, (oneways/numrides) AS frac_oneway from example
This yields the following:
Row | day | numrides | oneways | frac_oneway |
---|---|---|---|---|
1 | Sat | 1451 | 1018 | 0.7015851137146796 |
2 | Sun | 2376 | 936 | 0.3939393939393939 |
Besides the arithmetic operators, bitwise operations (<<
and >>
for shifting, &
and |
for bitwise AND and OR, etc.) are also supported on integer types.
To operate on data types, we can use functions. Functions perform operations on the values that are input to them. As with other programming languages, functions in SQL encapsulate reusable logic and abstract away the complexity of their implementation. Table 3-1 presents the various types of functions.
Type of function | Description | Example |
---|---|---|
Scalar | A function that operates on one or more input parameters and returns a single value. A scalar function can be used wherever its return data type is allowed. |
ROUND(3.14) returns 3, which is a FLOAT64 , and so the ROUND function can be used wherever a FLOAT64 is allowed.SUBSTR(“hello”, 1, 2) returns “he” and is an example of a scalar function that takes three input parameters. |
Aggregate | A function that performs a calculation on a collection of values and returns a single value. Aggregate functions are often used with a GROUP BY to perform a computation over a group of rows. |
MAX(tripduration) computes the maximum value within the tripduration column.Other aggregate functions include SUM() , COUNT() , AVG() , etc. |
Analytic | Analytic functions operate on a collection of values but return an output for each value in the collection. A window frame is used to specify the set of rows to which the analytic function applies. |
row_number() , rank() , etc. are analytic functions. We look at these in Chapter 8. |
Table-valued | A function that returns a result set and can therefore be used in FROM clauses. |
You can call UNNEST on an array and then select from it. |
User-defined | A function that is not built in, but whose implementation is specified by the user. User-defined functions can be written in SQL (or JavaScript) and can themselves return any of the aforementioned types. |
CREATE TEMP FUNCTION lastElement(arr ANY TYPE) AS ( arr[ORDINAL(ARRAY_LENGTH(arr))] ); |
Had we wanted to round off the end-result of the query that computed the fraction of bike rentals that were one-way, we would have used one of the many built-in mathematical functions that work on integer and floating-point types:
WITH example AS ( SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways UNION ALL SELECT 'Sun', 2376, 936 ) SELECT *, ROUND(oneways/numrides, 2) AS frac_oneway from example
This returns the following:
Row | day | numrides | oneways | frac_oneway |
---|---|---|---|---|
1 | Sat | 1451 | 1018 | 0.7 |
2 | Sun | 2376 | 936 | 0.39 |
The division operator fails if the denominator is zero or if the result overflows. Rather than protect the division by checking for zero values beforehand, it is better to use a special function for division whenever the denominator could be zero, as is the case in the previous example. A better form of that query would be this:
WITH example AS ( SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways UNION ALL SELECT 'Sun', 2376, 936 UNION ALL SELECT 'Wed', 0, 0 ) SELECT *, ROUND(IEEE_Divide(oneways, numrides), 2) AS frac_oneway from example
The IEEE_Divide
function follows the standard set by the Institute of Electrical and Electronics Engineers (IEEE) and returns a special floating-point number called Not-a-Number (NaN
) when a division by zero is attempted.
Also try the previous query using the standard division operator and using SAFE_DIVIDE
(discussed shortly).1 Recall that, for your copy-pasting convenience, all the queries in this book are available in the book’s GitHub repository.
You can make any scalar function return NULL
instead of raising an error by prefixing it with SAFE
. For example, the following query will raise an error because the logarithm of a negative number is undefined:
SELECT LOG(10, -3), LOG(10, 3)
However, by prefixing the LOG
with SAFE,
like so:
SELECT SAFE.LOG(10, -3), SAFE.LOG(10, 3)
you will get NULL
for the result of LOG(10, -3)
:
Row | f0_ | f1_ |
---|---|---|
1 | null | 2.095903274289385 |
The SAFE
prefix works for mathematical functions, string functions (for example, the SUBSTR
function would normally raise an error if the starting index is negative, but it returns NULL
if invoked as SAFE.SUBSTR
), and time functions. It is, however, restricted to scalar functions and will not work for aggregate functions, analytic functions, or user-defined functions.
Comparisons are carried out using operators. The operators <
, <=
, >
, >=,
and !=
(or <>
) are used to obtain the results of comparison. NULL
, followed by NaN
, is assumed to be smaller than valid numbers (including -inf
) for the purposes of ordering. However, comparisons with NaN
always return false and comparisons with NULL
always return NULL
. This can lead to seemingly paradoxical results:
WITH example AS ( SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways UNION ALL SELECT 'Sun', 2376, 936 UNION ALL SELECT 'Mon', NULL, NULL UNION ALL SELECT 'Tue', IEEE_Divide(-3,0), 0 -- this is -inf,0 ) SELECT * from example ORDER BY numrides
This query returns the following:
Row | day | numrides | oneways |
---|---|---|---|
1 | Mon | null | null |
2 | Tue | -Infinity | 0 |
3 | Sat | 1451.0 | 1018 |
4 | Sun | 2376.0 | 936 |
However, filtering for fewer than 2000
rides with
SELECT * from example WHERE numrides < 2000
yields only two results, not three:
Row | day | numrides | oneways |
---|---|---|---|
1 | Sat | 1451.0 | 1018 |
2 | Tue | -Infinity | 345 |
This is because the WHERE
clause returns only those rows for which the result is true, and when NULL
is compared to 2000
, the result is NULL
and not true.
Note that the operators &
and |
exist in BigQuery but are used only for bitwise operations. The !
symbol, as in !=
, means NOT, but it does not work as a standalone—you cannot say !gender
to compute the logical negative of gender, as you can in other languages. An alternate way to specify not-equals is to write <>
, but be consistent on whether you use !=
or <>
.
INT64
and FLOAT64
are designed to be flexible and fast, but they are limited by the fact that they are stored in a base-2 (0s and 1s) form in a 64-bit area of computer memory when being used for calculations. This is a trade-off well worth making in most applications, but financial and accounting applications often require exact calculations for numbers represented in decimal (base-10).
The NUMERIC
data type in BigQuery provides 38 digits to represent numbers, with 9 of those digits appearing after the decimal point. It uses 16 bytes for storage and can represent decimal fractions exactly, thus making it suitable for financial calculations.
For example, imagine that you needed to compute the sum of three payments. You’d want the results to be exact. When using FLOAT64
values, however, the tiny differences between how the number is represented in memory and how the number is represented in decimals can add up:
WITH example AS ( SELECT 1.23 AS payment UNION ALL SELECT 7.89 UNION ALL SELECT 12.43 ) SELECT SUM(payment) AS total_paid, AVG(payment) AS average_paid FROM example
Look at what we get:
Row | total_paid | average_paid |
---|---|---|
1 | 21.549999999999997 | 7.183333333333334 |
In financial and accounting applications, these imprecisions can add up and make balancing the books tricky.
Watch what happens when we change the data type of payment to be NUMERIC
:
WITH example AS ( SELECT NUMERIC '1.23' AS payment UNION ALL SELECT NUMERIC '7.89' UNION ALL SELECT NUMERIC '12.43' ) SELECT SUM(payment) AS total_paid, AVG(payment) AS average_paid FROM example
The problem goes away. The sum of the payments is now precise (the average cannot be represented precisely even in NUMERIC
because it is a repeating decimal):
Row | total_paid | average_paid |
---|---|---|
1 | 21.55 | 7.183333333 |
Note that NUMERIC
types need to be directly ingested into BigQuery as strings (NUMERIC '1.23'
); otherwise, the floating-point representation will obviate any of the precision gains to be had.
Boolean variables are those that can be either True
or False
. Because SQL is case insensitive, TRUE
, true
, and so on also work.
Recall from the section on filtering within the WHERE
clause that the WHERE
clause can include Boolean expressions that include AND
, OR
, and NOT
, as well as parentheses to control the order of execution. We used this query to illustrate these options:
SELECT gender, tripduration FROM `bigquery-public-data`.new_york_citibike.citibike_trips WHERE (tripduration < 600 AND gender = 'female') OR gender = 'male'
You could use comparison operators with Boolean variables, as in the following:
WITH example AS ( SELECT NULL AS is_vowel, NULL as letter, -1 as position UNION ALL SELECT true, 'a', 1 UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT * from example WHERE is_vowel != false
This gives us the following:
Row | is_vowel | letter | position |
---|---|---|---|
1 | true | a | 1 |
However, it is often simpler to use the IS
operator when comparing against built-in constants, as shown in this example:
WITH example AS ( SELECT NULL AS is_vowel, NULL as letter, -1 as position UNION ALL SELECT true, 'a', 1 UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT * from example WHERE is_vowel IS NOT false
This yields the following:
Row | is_vowel | letter | position |
---|---|---|---|
1 | null | null | -1 |
2 | true | a | 1 |
Note that the two queries yield different results. The comparators (=
, !=
, <
, etc.) return NULL
for comparisons against NULL
, whereas the IS
operator doesn’t.
NULL
s typically represent missing values or values that were not collected. They have no value and are not zero, empty strings, or blanks. If your dataset has NULL
s, you must tread carefully since comparisons with NULL
always return NULL
, and so the WHERE clause will filter out NULL
values. Use the IS
operator to check where a value is NULL
.
It is simpler and more readable to use Boolean variables directly:
WITH example AS ( SELECT NULL AS is_vowel, NULL as letter, -1 as position UNION ALL SELECT true, 'a', 1 UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT * from example WHERE is_vowel
The result here is like is_vowel
IS TRUE
:
Row | is_vowel | letter | position |
---|---|---|---|
1 | true | a | 1 |
Of course, such readability depends on naming the Boolean variables well!
It is not just in the WHERE
clause that Booleans are useful. It is possible to simplify many queries by using conditional expressions in the SELECT
. For example, suppose that you need to compute the sales price of each item in a catalog based on the desired markup and tax rate corresponding to the item. If your catalog is missing values for some of the necessary information, you might want to impute a default markup or default tax rate. You can achieve this with the IF
function:
WITH catalog AS ( SELECT 30.0 AS costPrice, 0.15 AS markup, 0.1 AS taxRate UNION ALL SELECT NULL, 0.21, 0.15 UNION ALL SELECT 30.0, NULL, 0.09 UNION ALL SELECT 30.0, 0.30, NULL UNION ALL SELECT 30.0, NULL, NULL ) SELECT *, ROUND( costPrice * IF(markup IS NULL, 1.05, 1+markup) * IF(taxRate IS NULL, 1.10, 1+taxRate) , 2) AS salesPrice FROM catalog
This yields a valid salesPrice
for all items except those for which we don’t know the cost:
Row | costPrice | markup | taxRate | salesPrice |
---|---|---|---|---|
1 | 30.0 | 0.15 | 0.1 | 37.95 |
2 | null | 0.21 | 0.15 | null |
3 | 30.0 | null | 0.09 | 34.34 |
4 | 30.0 | 0.3 | null | 42.9 |
5 | 30.0 | null | null | 34.65 |
The way the IF
function works is that the first parameter is the condition to be evaluated. If the condition is true, the second parameter is used, or else the third parameter is used. Because this function occurs in the SELECT
, it is carried out row by row.
What if you want to do the imputation if a single value is missing, but not if more than one value is missing? In other words, if you have no tax rate, you are willing to impute a 10% tax rate, but not if you also don’t know the markup on the item.
A convenient way to keep evaluating expressions until we get to a non-NULL
value is to use COALESCE
:
WITH catalog AS ( SELECT 30.0 AS costPrice, 0.15 AS markup, 0.1 AS taxRate UNION ALL SELECT NULL, 0.21, 0.15 UNION ALL SELECT 30.0, NULL, 0.09 UNION ALL SELECT 30.0, 0.30, NULL UNION ALL SELECT 30.0, NULL, NULL ) SELECT *, ROUND(COALESCE( costPrice * (1+markup) * (1+taxrate), costPrice * 1.05 * (1+taxrate), costPrice * (1+markup) * 1.10, NULL ),2) AS salesPrice FROM catalog
This yields the following (only the last row is different from the previous computation):
Row | costPrice | markup | taxRate | salesPrice |
---|---|---|---|---|
1 | 30.0 | 0.15 | 0.1 | 37.95 |
2 | null | 0.21 | 0.15 | null |
3 | 30.0 | null | 0.09 | 34.34 |
4 | 30.0 | 0.3 | null | 42.9 |
5 | 30.0 | null | null | null |
The COALESCE
short-circuits the calculation whenever possible—that is, later expressions are not evaluated after a non-NULL result is obtained. Therefore, the final NULL
in the COALESCE
is not required, but it makes the intent clearer.
BigQuery supports the IFNULL
function as a simplification of COALESCE
when you have only two inputs. IFNULL(a, b)
is the same as COALESCE(a, b)
and yields b
if a
is NULL
. In other words, IFNULL(a, b)
is the same as IF(a IS NULL, b, a)
.
The very first query in this section on conditional expressions could have been simplified as follows:
SELECT *, ROUND( costPrice * (1 + IFNULL(markup, 0.05)) * (1 + IFNULL(taxrate,0.10)) , 2) AS salesPrice FROM catalog
Consider this example dataset in which the number of hours worked by an employee is stored as a string in order to accommodate reasons for a leave of absence (this is a bad schema design, but bear with us):
WITH example AS ( SELECT 'John' as employee, 'Paternity Leave' AS hours_worked UNION ALL SELECT 'Janaki', '35' UNION ALL SELECT 'Jian', 'Vacation' UNION ALL SELECT 'Jose', '40' )
Now suppose that you want to find the total number of hours worked. This won’t work because the hours_worked
is a string, not a numeric type:
WITH example AS ( SELECT 'John' as employee, 'Paternity Leave' AS hours_worked UNION ALL SELECT 'Janaki', '35' UNION ALL SELECT 'Jian', 'Vacation' UNION ALL SELECT 'Jose', '40' ) SELECT SUM(hours_worked) from example
We need to explicitly convert the hours_worked
to an INT64
before doing any aggregation. Explicit conversion is called casting, and it requires the explicit use of the CAST()
function. If casting fails, BigQuery raises an error. To have it return NULL
instead, use SAFE_CAST
. For example, the following raises an error:
SELECT CAST("true" AS bool), CAST("invalid" AS bool)
Now try using SAFE_CAST
:
SELECT CAST("true" AS bool), SAFE_CAST("invalid" AS bool)
You should see the following:
Row | f0_ | f1_ |
---|---|---|
1 | true | null |
Implicit conversion is called coercion, and this happens automatically when a data type is used in a situation for which another data type is required. For example, when we use an INT64
in a situation when a FLOAT64
is needed, the integer will be coerced into a floating-point number. The only coercions done by BigQuery are to convert INT64
to FLOAT64
and NUMERIC
, and NUMERIC
to FLOAT64
. Every other conversion is explicit and requires a CAST
.
With the problem of the total number of hours worked, not all of the hours_worked
strings can be converted to integers, so you should use a SAFE_CAST
:
WITH example AS ( SELECT 'John' as employee, 'Paternity Leave' AS hours_worked UNION ALL SELECT 'Janaki', '35' UNION ALL SELECT 'Jian', 'Vacation' UNION ALL SELECT 'Jose', '40' ) SELECT SUM(SAFE_CAST(hours_worked AS INT64)) from example
This yields the following:
Row | f0_ |
---|---|
1 | 75 |
Had it simply been a schema problem and all the rows contained numbers but were stored as strings, you could have used a simple CAST
:
WITH example AS ( SELECT 'John' as employee, '0' AS hours_worked UNION ALL SELECT 'Janaki', '35' UNION ALL SELECT 'Jian', '0' UNION ALL SELECT 'Jose', '40' ) SELECT SUM(CAST(hours_worked AS INT64)) from example
Consider this example dataset:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT * from example
Here’s the result of the query:
Row | is_vowel | letter | position |
---|---|---|---|
1 | true | a | 1 |
2 | false | b | 2 |
3 | false | c | 3 |
Now suppose that you want to find the total number of vowels. You might be tempted to do something simple, such as the following:
SELECT SUM(is_vowel) as num_vowels from example
This won’t work, however (try it!), because SUM
, AVG
, and others are not defined on Booleans. You could cast the Booleans to an INT64
before doing the aggregation, like so:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT SUM(CAST (is_vowel AS INT64)) as num_vowels from example
This would yield the following:
Row | num_vowels |
---|---|
1 | 1 |
However, you should try to avoid casting as much as possible. In this case, a cleaner approach is to use the IF
statement on the Booleans:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT SUM(IF(is_vowel, 1, 0)) as num_vowels from example
An even cleaner approach is to use COUNTIF
:
WITH example AS ( SELECT true AS is_vowel, 'a' as letter, 1 as position UNION ALL SELECT false, 'b', 2 UNION ALL SELECT false, 'c', 3 ) SELECT COUNTIF(is_vowel) as num_vowels from example
String manipulation is a common requirement for data wrangling, so BigQuery provides a library of built-in string functions—for example:
WITH example AS ( SELECT * from unnest([ 'Seattle', 'New York', 'Singapore' ]) AS city ) SELECT city , LENGTH(city) AS len , LOWER(city) AS lower , STRPOS(city, 'or') AS orpos FROM example
This example computes the length of the string, makes the string lowercase, and finds the location of a substring in the “city” column, which gives us the following result:
Row | city | len | lower | orpos |
---|---|---|---|---|
1 | Seattle | 7 | seattle | 0 |
2 | New York | 8 | new york | 6 |
3 | Singapore | 9 | singapore | 7 |
The substring “or” occurs in “New York” and in “Singapore,” but not in “Seattle.”
Two particularly useful functions for string manipulation are SUBSTR
and CONCAT
. SUBSTR
extracts a substring, and CONCAT
concatenates the input values. The following query finds the position of the @
symbol in an email address, extracts the username, and concatenates the city in which the individual lives:
WITH example AS ( SELECT '[email protected]' AS email, 'Annapolis, MD' as city UNION ALL SELECT '[email protected]', 'Boulder, CO' UNION ALL SELECT '[email protected]', 'Chicago, IL' ) SELECT CONCAT( SUBSTR(email, 1, STRPOS(email, '@') - 1), -- username ' from ', city) AS callers FROM example
Here’s what the result looks like:
Row | callers |
---|---|
1 | armin from Annapolis, MD |
2 | boyan from Boulder, CO |
3 | carrie from Chicago, IL |
Strings in BigQuery are Unicode, so avoid assumptions that rely on English. For example, the “upper” case is a no-op in Japanese, and the default UTF-8 encoding that is carried out by the cast as bytes is insufficient for languages such as Tamil, as demonstrated here:
WITH example AS ( SELECT * from unnest([ 'Seattle', 'New York', 'சிங்கப்பூர்', '東京' ]) AS city ) SELECT city , UPPER(city) AS allcaps , CAST(city AS BYTES) as bytes FROM example
As you can see, this simply doesn’t work as presumably intended:
Row | city | allcaps | bytes |
---|---|---|---|
1 | Seattle | SEATTLE | U2VhdHRsZQ== |
2 | New York | NEW YORK | TmV3IFlvcms= |
3 | சிங்கப்பூர் | சிங்கப்பூர் | 4K6a4K6/4K6Z4K+N4K6V4K6q4K+N4K6q4K+C4K6w4K+N |
4 | 東京 | 東京 | 5p2x5Lqs |
BigQuery supports three different ways to represent strings—as an array of Unicode characters, as an array of bytes, and as an array of Unicode code points (INT64
):
WITH example AS ( SELECT * from unnest([ 'Seattle', 'New York', 'சிங்கப்பூர்', '東京' ]) AS city ) SELECT city , CHAR_LENGTH(city) as char_len , TO_CODE_POINTS(city)[ORDINAL(1)] as first_code_point , ARRAY_LENGTH(TO_CODE_POINTS(city)) as num_code_points , CAST (city AS BYTES) as bytes , BYTE_LENGTH(city) as byte_len FROM example
Note the difference between the results for CHAR_LENGTH
and BYTE_LENGTH
on the same strings, and how the number of code points is the same as the number of characters:
Row | city | char_len | first_code_point | num_code_points | bytes | byte_len |
---|---|---|---|---|---|---|
1 | Seattle | 7 | 83 | 7 | U2VhdHRsZQ== | 7 |
2 | New York | 8 | 78 | 8 | TmV3IFlvcms= | 8 |
3 | சிங்கப்பூர் | 11 | 2970 | 11 | 4K6a4K6/4K6Z4K+N4K6V4K6q4K+N4K6q4K+C4K6w4K+N | 33 |
4 | 東京 | 2 | 26481 | 2 | 5p2x5Lqs | 6 |
Because of these differences, you need to recognize which columns might contain text in different languages, and be aware of language differences when using string manipulation functions.
You can simply cast a string as an INT64
or a FLOAT64
in order to parse it, but customizing the string representation will require the use of FORMAT
:
SELECT CAST(42 AS STRING) , CAST('42' AS INT64) , FORMAT('%03d', 42) , FORMAT('%5.3f', 32.457842) , FORMAT('%5.3f', 32.4) , FORMAT('**%s**', 'H') , FORMAT('%s-%03d', 'Agent', 7)
Here is the result of that query:
Row | f0_ | f1_ | f2_ | f3_ | f4_ | f5_ | f6_ |
---|---|---|---|---|---|---|---|
1 | 42 | 42 | 042 | 32.458 | 32.400 | **H** | Agent-007 |
FORMAT
works similarly to C’s printf
, and it accepts the same format specifiers. A few of the more useful specifiers are demonstrated in the preceding example. Although FORMAT
also accepts dates and timestamps, it is better to use FORMAT_DATE
and FORMAT_TIMESTAMP
so that the display formats can be locale-aware.
Manipulating strings is such a common need in Extract, Transform, and Load (ETL) pipelines that these BigQuery convenience functions are worth having on speed dial:
SELECT ENDS_WITH('Hello', 'o') -- true , ENDS_WITH('Hello', 'h') -- false , STARTS_WITH('Hello', 'h') -- false , STRPOS('Hello', 'e') -- 2 , STRPOS('Hello', 'f') -- 0 for not-found , SUBSTR('Hello', 2, 4) -- 1-based , CONCAT('Hello', 'World')
The result of this query is as follows:
Row | f0_ | f1_ | f2_ | f3_ | f4_ | f5_ | f6_ |
---|---|---|---|---|---|---|---|
1 | true | false | false | 2 | 0 | ello | HelloWorld |
Note how SUBSTR()
behaves. The first parameter is the starting position (it is 1-based), and the second parameter is the desired number of characters in the substring.
Another set of functions that is worth becoming familiar with are those that allow you to manipulate the string:
SELECT LPAD('Hello', 10, '*') -- left pad with * , RPAD('Hello', 10, '*') -- right pad , LPAD('Hello', 10) -- left pad with spaces , LTRIM(' Hello ') -- trim whitespace on left , RTRIM(' Hello ') -- trim whitespace on right , TRIM (' Hello ') -- trim whitespace both ends , TRIM ('***Hello***', '*') -- trim * both ends , REVERSE('Hello') -- reverse the string
Let’s look at the result of this query:
Row | f0_ | f1_ | f2_ | f3_ | f4_ | f5_ | f6_ | f7_ |
---|---|---|---|---|---|---|---|---|
1 | *****Hello | Hello***** | Hello | Hello | Hello | Hello | Hello | olleH |
Regular expressions provide much more powerful semantics than the convenience functions. For instance, STRPOS
and others can find only specific characters, whereas you can use REGEXP_CONTAINS
for more powerful searches.
For example, you could do the following to determine whether a column contains a US zip code (the short form of which is a five-digit number and the long form of which has an additional four digits separated by either a hyphen or a space):
SELECT column , REGEXP_CONTAINS(column, r'd{5}(?:[-s]d{4})?') has_zipcode , REGEXP_CONTAINS(column, r'^d{5}(?:[-s]d{4})?$') is_zipcode , REGEXP_EXTRACT(column, r'd{5}(?:[-s]d{4})?') the_zipcode , REGEXP_EXTRACT_ALL(column, r'd{5}(?:[-s]d{4})?') all_zipcodes , REGEXP_REPLACE(column, r'd{5}(?:[-s]d{4})?', '*****') masked FROM ( SELECT * from unnest([ '12345', '1234', '12345-9876', 'abc 12345 def', 'abcde-fghi', '12345 ab 34567', '12345 9876' ]) AS column )
Here’s what this query yields:
Row | column | has_zipcode | is_zipcode | the_zipcode | all_zipcodes | masked |
---|---|---|---|---|---|---|
1 | 12345 | true | true | 12345 | 12345 | ***** |
2 | 1234 | false | false | null | 1234 | |
3 | 12345-9876 | true | true | 12345-9876 | 12345-9876 | ***** |
4 | abc 12345 def | true | false | 12345 | 12345 | abc ***** def |
5 | abcde-fghi | false | false | null | abcde-fghi | |
6 | 12345 ab 34567 | true | false | 12345 | 12345 | ***** ab ***** |
34567 | ||||||
7 | 12345 9876 | true | true | 12345 9876 | 12345 9876 | ***** |
There are a few things to note:
The regular expression d{5}
matches any string consisting of five decimal numbers.
The second part of the expression, in parentheses, looks for an optional (note the ?
at the end of the parentheses) group (?:
) of four decimal numbers (d{4}
), which is separated from the first five numbers by either a hyphen or by a space (s
).
The presence of d
, s
, and others in the string could cause problems, so we prefix the string with an r
(for raw), which makes it a string literal.
The second expression illustrates how to find an exact match: simply insist that the string in question must start (^
) and end ($
) with the specified string.
To extract the part of the string matched by the regular expression, use REGEXP_EXTRACT
. This returns NULL
if the expression is not matched, and only the first match if there are multiple matches.
REGEXP_EXTRACT_ALL
returns all the matches. If there is no match, it returns an empty array.
REGEXP_REPLACE
replaces every match with the replacement string.
The regular expression support in BigQuery follows that of Google’s open source RE2 library. To see the syntax accepted by this library, visit https://github.com/google/re2/wiki/Syntax. Regular expressions can be cryptic, but they are a rich topic that is well worth mastering.2
Because strings are so common in data analysis, it is worth learning the broad contours of the available functions. You can always refer to the BigQuery documentation for the exact syntax. Table 3-2 separates them into their respective categories.
Category | Functions | Notes |
---|---|---|
Representations | CHAR_LENGTH , BYTE_LENGTH , TO_CODE_POINTS ,CODE_POINTS_TO_STRING ,SAFE_CONVERT_BYTES_TO_STRING ,TO_HEX , TO_BASE32 , TO_BASE64 , FROM_HEX , FROM_BASE32 , FROM_BASE64 , NORMALIZE |
Normalize allows, for example, different Unicode space characters to be made equivalent. |
Printing and parsing | FORMAT , REPEAT , SPLIT |
The syntax of FORMAT is similar to C’s printf: format( "%03d", 12 ) yields 012. For locale-aware conversions, use FORMAT_DATE , etc. |
Convenience | ENDS_WITH , LENGTH , STARTS_WITH , STRPOS , SUBSTR , CONCAT |
The LENGTH function is equivalent to CHAR_LENGTH for Strings and to BYTE_LENGTH for Bytes. |
Transformations | LPAD , LOWER , LTRIM , REPLACE , REVERSE , RPAD , RTRIM , TRIM , UPPER |
The default trim characters are Unicode whitespace, but it is possible to specify a different set of trim characters. |
Regular expressions | REGEXP_CONTAINS ,REGEXP_EXTRACT , REGEXP_EXTRACT_ALL , REGEXP_REPLACE |
See https://github.com/google/re2/wiki/Syntax for the syntax accepted by BigQuery. |
A timestamp represents an absolute point in time regardless of location. Thus a timestamp of 2017-09-27 12:30:00.45 (Sep 27, 2017, at 12:30 UTC) represents the same time as 2017-09-27 13:30:00.45+1:00 (1:30 p.m. at a time zone that is an hour behind):
SELECT t1, t2, TIMESTAMP_DIFF(t1, t2, MICROSECOND) FROM (SELECT TIMESTAMP "2017-09-27 12:30:00.45" AS t1, TIMESTAMP "2017-09-27 13:30:00.45+1" AS t2 )
This returns the following:
Row | t1 | t2 | f0_ |
---|---|---|---|
1 | 2017-09-27 12:30:00.450 UTC | 2017-09-27 12:30:00.450 UTC | 0 |
BigQuery is somewhat forgiving when it comes to parsing the timestamp. The date and time parts of this string representation can be separated either by a T or by a space in accordance with ISO 8601. Similarly, the month, day, hour, and so on might or might not have leading zeros. However, best practice is to use the canonical representation shown in the previous paragraph. As that string representation would indicate, this timestamp can represent only four-digit years; years before the common era cannot be represented using TIMESTAMP
.
You can use PARSE_TIMESTAMP
to parse a string that is not in the canonical format:
SELECT fmt, input, zone , PARSE_TIMESTAMP(fmt, input, zone) AS ts FROM ( SELECT '%Y%m%d-%H%M%S' AS fmt, '20181118-220800' AS input, '+0' as zone UNION ALL SELECT '%c', 'Sat Nov 24 21:26:00 2018', 'America/Los_Angeles' UNION ALL SELECT '%x %X', '11/18/18 22:08:00', 'UTC' )
Here is what this would yield:
Row | fmt | input | zone | ts |
---|---|---|---|---|
1 | %Y%m%d%-H%M%S | 20181118-220800 | +0 | 2018-11-18 22:08:00 UTC |
2 | %c | Sat Nov 24 21:26:00 2018 | America/Los_Angeles | 2018-11-25 05:26:00 UTC |
3 | %x %X | 11/18/18 22:08:00 | UTC | 2018-11-18 22:08:00 UTC |
The first example uses format specifiers for the year, month, day, and so on to create a timestamp from the provided string. The second and third examples use preexisting specifiers for commonly encountered date-time formats.3
Conversely, you can use FORMAT_TIMESTAMP
to print out a timestamp in any desired format:
SELECT ts, fmt , FORMAT_TIMESTAMP(fmt, ts, '+6') AS ts_output FROM ( SELECT CURRENT_TIMESTAMP() AS ts, '%Y%m%d-%H%M%S' AS fmt UNION ALL SELECT CURRENT_TIMESTAMP() AS ts, '%c' AS fmt UNION ALL SELECT CURRENT_TIMESTAMP() AS ts, '%x %X' AS fmt )
This results in the following:
Row | ts | fmt | ts_output |
---|---|---|---|
1 | 2018-11-25 05:42:13.939840 UTC | %Y%m%d-%H%M%S | 20181125-114213 |
2 | 2018-11-25 05:42:13.939840 UTC | %c | Sun Nov 25 11:42:13 2018 |
3 | 2018-11-25 05:42:13.939840 UTC | %x %X | 11/25/18 11:42:13 |
The preceding example uses the function CURRENT_TIMESTAMP()
to retrieve the system time at the time the query is executed. In both PARSE_TIMESTAMP
and FORMAT_TIMESTAMP
, the time zone is optional; if omitted, the time zone is assumed to be UTC.
Given a timestamp, it is possible to extract information about the Gregorian calendar corresponding to the timestamp. For example, we can extract information about Armistice Day4 using this:
SELECT ts , FORMAT_TIMESTAMP('%c', ts) AS repr , EXTRACT(DAYOFWEEK FROM ts) AS dayofweek , EXTRACT(YEAR FROM ts) AS year , EXTRACT(WEEK FROM ts) AS weekno FROM ( SELECT PARSE_TIMESTAMP('%Y%m%d-%H%M%S', '19181111-054500') AS ts )
Here is the result:
Row | ts | repr | dayofweek | year | weekno |
---|---|---|---|---|---|
1 | 1918-11-11 05:45:00 UTC | Mon Nov 11 05:45:00 1918 | 2 | 1918 | 45 |
The week is assumed to begin on Sunday, and days prior to the first Sunday of the year are assigned to week 0. This is not internationally safe. Hence, if you’re in a country (such as Israel) where the week begins on Saturday, it is possible to specify a different day for the start of the week:
EXTRACT(WEEK('SATURDAY') FROM ts)
The number of seconds from the Unix epoch (January 1, 1970) is not available through EXTRACT
. Instead, special functions exist to convert to and from the Unix epoch:
SELECT UNIX_MILLIS(TIMESTAMP "2018-11-25 22:30:00 UTC") , UNIX_MILLIS(TIMESTAMP "1918-11-11 22:30:00 UTC") --invalid , TIMESTAMP_MILLIS(1543185000000)
This yields the following:
Row | f0_ | f1_ | f2_ |
---|---|---|---|
1 | 1543185000000 | -1613784600000 | 2018-11-25 22:30:00 UTC |
Note that the second one overflows and yields a negative number, but no error is raised.
It is possible to add or subtract time durations from timestamps. It is also possible to find the time difference between two timestamps. In all of these functions, you need to specify the units in which the durations are expressed:
SELECT EXTRACT(TIME FROM TIMESTAMP_ADD(t1, INTERVAL 1 HOUR)) AS plus_1h , EXTRACT(TIME FROM TIMESTAMP_SUB(t1, INTERVAL 10 MINUTE)) AS minus_10min , TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE), SECOND) AS plus_1min , TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE), SECOND) AS minus_1min FROM (SELECT TIMESTAMP "2017-09-27 12:30:00.45" AS t1 )
This returns the timestamps an hour from now, 10 minutes ago, and the time difference in seconds corresponding to one minute from now and one minute earlier:
Row | plus_1h | minus_10min | plus_1min | minus_1min |
---|---|---|---|---|
1 | 13:30:00.450000 | 12:20:00.450000 | 60 | -60 |
BigQuery has three other functions for representing time: DATE
, TIME
, and DATETIME
. DATE
is useful for when you are tracking only the day in which something happens, and any more precision is unnecessary. TIME
is useful to represent the time of day that things happen, and to perform mathematical operations with those times. With TIME
, you can answer questions like, “What time will it be eight hours from the starting time?” DATETIME
is a TIMESTAMP
rendered in a specific time zone, so it is useful when you have an unambiguous time zone in which an event occurred and you don’t need to do time zone conversions.
Counterparts to most of the TIMESTAMP
functions are available for DATETIME
. Thus, you can call DATETIME_ADD
, DATETIME_SUB
, and DATETIME_DIFF
, as well as PARSE_DATETIME
and FORMAT_DATETIME
. You can also EXTRACT
calendar parts from a DATETIME
. The two types are quite interoperable—it is possible to extract a DATETIME
from a TIMESTAMP
and cast a DATETIME
to a TIMESTAMP
:
SELECT EXTRACT(DATETIME FROM CURRENT_TIMESTAMP()) as dt , CAST(CURRENT_DATETIME() AS TIMESTAMP) as ts
The following shows the result:
Row | dt | ts |
---|---|---|
1 | 2018-11-25T07:03:15.055141 | 2018-11-25 07:03:15.055141 UTC |
Note that the canonical representation of a DATETIME
has the letter T separating the date part and the time part, whereas the representation of a TIMESTAMP
uses a space. The TIMESTAMP
also explicitly includes the time zone, whereas the time zone is implicit in the DATETIME
. But for the most part, you can use DATETIME
and TIMESTAMP
interchangeably in BigQuery.
DATE
is just the date part of a DATETIME
(or a TIMESTAMP
, interpreted in some time zone), and TIME
is the time part. Because many real-world scenarios might happen on a certain date (i.e., at multiple times throughout that day), many database tables contain just a DATE
. So there is some benefit to being able to directly parse and format dates. On the other hand, there is very little need for the TIME
type other than as the “missing” part of a DATETIME
.
For the most part, therefore, our advice is to just use TIMESTAMP
and DATE
. There is, however, one practical wrinkle to using TIMESTAMP
. Timestamps in BigQuery are stored using eight bytes with microsecond resolution. This means that you can store years 0 through 9999, and any microsecond in between. In some other databases (e.g., MySQL), TIMESTAMP
is stored using four bytes and DATETIME
using eight bytes. In those systems, the range of a TIMESTAMP
is within the limits of the Unix epoch time (years 1970 to 2038), which means that you cannot even store the birthdays of 60-year-old people or the expiry dates of 30-year mortgages. So, whereas a TIMESTAMP
might work in BigQuery, you might not be able to use the same schema in MySQL, and this might make moving queries and data between BigQuery and MySQL challenging.
We look at geography functions in much more detail in Chapter 8, which looks at advanced features. In this section, we provide only a brief introduction.
The GEOGRAPHY
type can be used to represent points, lines, and polygons on the surface of the Earth (i.e., there is no height associated with them). Because the Earth is a lumpy mass, points on its surface can be represented only on spherical and ellipsoidal approximations of the surface. In BigQuery, the geographic positions of the points and the vertices of the lines and polygons are represented in the WGS84 ellipsoid
. Practically speaking, this is the same ellipsoid as used by the Global Positioning System (GPS), so you will be able to take the longitude and latitude positions reported by most sensors and use them directly in BigQuery.
The simplest geography is a point specified by its longitude and latitude. So, for example,
ST_GeogPoint(-122.33, 47.61)
represents a point at 47.61N and 122.33W—Seattle, Washington.
The BigQuery public datasets include a table that contains polygons corresponding to each of the US states and territories. We can therefore write a query to find out which state the geographic point is in:
SELECT state_name FROM `bigquery-public-data`.utility_us.us_states_area WHERE ST_Contains( state_geom, ST_GeogPoint(-122.33, 47.61))
As anticipated, this returns the following:
Row | state_name |
---|---|
1 | Washington |
The query uses the ST_Contains
function to determine whether the state’s geometry (stored as the state_geom
column in the BigQuery dataset) contains the point we are interested in. The spatial functions that BigQuery supports follow the SQL/MM 3 specification and are similar to what the PostGIS library provides for Postgres.
To summarize what we’ve covered in this chapter, Table 3-3 presents the data types that BigQuery supports.
Data type | Sample functions and operators supported | Notes |
---|---|---|
INT64 |
Arithmetic operations (+, –, /, *, for add, subtract, divide, and multiply, respectively). | Approximately 10–19 to 1019. |
NUMERIC |
Arithmetic operations. | 38 digits of precision and 9 decimal digits of scale; this is suitable for financial calculations. |
FLOAT64 |
Arithmetic operations. Also: IEEE_DIVIDE . |
IEEE-754 behavior if one of the values is NaN or ± inf. |
BOOL |
Conditional statements.MIN , MAX .However, SUM , AVG , etc. are not supported (you’d need to cast the Booleans to INT64 first). |
Is either True and False. SQL is case insensitive, so TRUE, true, and so on also work. |
STRING |
Use special String functions such as CONCAT , LENGTH , etc. to operate on strings. |
Strings are Unicode characters and are variable length. |
BYTES |
Variable length characters. Many String operations are also defined on BYTES . |
|
TIMESTAMP |
CURRENT_TIMESTAMP() represents “now.”You can extract month, year, dayofweek, etc. from a timestamp. Arithmetic on timestamps is supported via special functions, not through arithmetic operators. |
Absolute point in time, to microsecond precision, represented in a subset of ISO 8601. This is the recommended way to store times in BigQuery. |
DATE |
CURRENT_DATE() represents the current date in the UTC time zone, whereas CURRENT_DATE("America/Los_Angeles") represents the current date in the Los Angeles time zone.Like TIMESTAMP , arithmetic on dates is supported via special functions. |
2018-3-14 (or 2018-03-14) is March 14, 2018, independent of time zone. Because this represents different 24-hour blocks in different time zones, use TIMESTAMP to represent an absolute point in time. You can then construct a DATE from a TIMESTAMP relative to a particular time zone. |
DATETIME |
As with DATE . |
2018-03-14 3:14:57 or 2018-03-14T03:14:57.000000 is, like DATE , independent of time zone. Most applications will want to use TIMESTAMP . |
TIME |
As with DATETIME , except that the DATE part is absent. |
Independent of a specific date or time zone. This ranges from 00:00:00 to 23:59:59.999999. |
GEOGRAPHY |
Topological functions on geographies are supported via special functions. | Points, lines, and polygons on the surface of the Earth (i.e., there is no height). The representations are in the WGS84 ellipsoid; this is the same ellipsoid as used by the Global Positioning System (GPS). The simplest geography is a point specified by its longitude and latitude. |
STRUCT |
You can deference the fields by name. | A collection of fields in order. The field name is optional; that is, you could have either: STRUCT<INT64, STRING> or STRUCT<id INT64, name STRING> . |
ARRAY |
You can deference the items by offset, aggregate the items in the array, or unnest them to get the items one by one. | Ordered list of non-null elements; e.g., ARRAY<INT64> . Arrays of arrays are not allowed, but you can get around this by creating an array of STRUCT in which the struct itself contains an array; i.e., ARRAY<STRUCT<ARRAY<INT64>>> (We cover arrays in Chapter 2). |
You can use all data types, except for arrays and structs, in ORDER BY
and GROUP BY
.
1 The standard division operator raises a division-by-zero error. SAFE_DIVIDE
returns NULL
for the entry when division by zero is attempted.
2 Start with Mastering Regular Expressions by Jeffrey Friedl (O’Reilly).
3 For the full list of specifiers, consult the documentation.
4 According to https://en.wikipedia.org/wiki/Armistice_Day, the agreement was signed at 5:45 a.m. on November 11, 1918. In Winter 1918, unlike now, France was in the UTC time zone; see https://www.timeanddate.com/time/zone/france/paris.