Chapter 3. Data Types, Functions, and Operators

In the bike rental queries in the previous chapters, when we divided the trip duration by 60, we were able to do so because trip duration was a numeric type. Trying to divide the gender by 60 would not have worked because gender is a string. The functions and operations you have at your disposal might be restricted based on the type of data to which you are applying them.

BigQuery supports several data types to store numeric, string, time, geographic, structured, and semi-structured data:

INT64
This is the only integer type. It can represent numbers ranging from approximately 10–19 to 1019. For real-valued numbers, use FLOAT64, and for Booleans, use BOOL.
NUMERIC
NUMERIC offers 38 digits of precision and 9 decimal digits of scale and is suitable for exact calculations, such as in finance.
STRING
This is a first-class type and represents variable-length sequences of Unicode characters. BYTES are variable-length sequences of characters (not Unicode).
TIMESTAMP
This represents an absolute point in time.
DATETIME
This represents a calendar date and time. DATE and TIME are also available separately.
GEOGRAPHY
GEOGRAPHY represents points, lines, and polygons on the surface of the Earth.
STRUCT and ARRAY
See the description for each of these in Chapter 2.

Numeric Types and Functions

As just stated, there is only one integer type (INT64) and only one floating-point type (FLOAT64). Both of these types support the typical arithmetic operations (+, –, /, *—for add, subtract, divide, and multiply, respectively). Thus, we can find the fraction of bike rentals that are one-way by simply dividing one column by the other:

WITH example AS (
  SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways
  UNION ALL SELECT 'Sun', 2376, 936
)
SELECT *, (oneways/numrides) AS frac_oneway from example

This yields the following:

Row day numrides oneways frac_oneway
1 Sat 1451 1018 0.7015851137146796
2 Sun 2376 936 0.3939393939393939

Besides the arithmetic operators, bitwise operations (<< and >> for shifting, & and | for bitwise AND and OR, etc.) are also supported on integer types.

To operate on data types, we can use functions. Functions perform operations on the values that are input to them. As with other programming languages, functions in SQL encapsulate reusable logic and abstract away the complexity of their implementation. Table 3-1 presents the various types of functions.

Table 3-1. Types of functions
Type of function Description Example
Scalar A function that operates on one or more input parameters and returns a single value.
A scalar function can be used wherever its return data type is allowed.
ROUND(3.14) returns 3, which is a FLOAT64, and so the ROUND function can be used wherever a FLOAT64 is allowed.
SUBSTR(“hello”, 1, 2) returns “he” and is an example of a scalar function that takes three input parameters.
Aggregate A function that performs a calculation on a collection of values and returns a single value.
Aggregate functions are often used with a GROUP BY to perform a computation over a group of rows.
MAX(tripduration) computes the maximum value within the tripduration column.
Other aggregate functions include SUM(), COUNT(), AVG(), etc.
Analytic Analytic functions operate on a collection of values but return an output for each value in the collection.
A window frame is used to specify the set of rows to which the analytic function applies.
row_number(), rank(), etc. are analytic functions. We look at these in Chapter 8.
Table-valued A function that returns a result set and can therefore be used in FROM clauses. You can call UNNEST on an array and then select from it.
User-defined A function that is not built in, but whose implementation is specified by the user.
User-defined functions can be written in SQL (or JavaScript) and can themselves return any of the aforementioned types.
CREATE TEMP FUNCTION lastElement(arr ANY TYPE) AS (
  arr[ORDINAL(ARRAY_LENGTH(arr))] );

Mathematical Functions

Had we wanted to round off the end-result of the query that computed the fraction of bike rentals that were one-way, we would have used one of the many built-in mathematical functions that work on integer and floating-point types:

WITH example AS (
  SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways
  UNION ALL SELECT 'Sun', 2376, 936
)
SELECT *, ROUND(oneways/numrides, 2) AS frac_oneway from example

This returns the following:

Row day numrides oneways frac_oneway
1 Sat 1451 1018 0.7
2 Sun 2376 936 0.39

Standard-Compliant Floating-Point Division

The division operator fails if the denominator is zero or if the result overflows. Rather than protect the division by checking for zero values beforehand, it is better to use a special function for division whenever the denominator could be zero, as is the case in the previous example. A better form of that query would be this:

WITH example AS (
  SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways
  UNION ALL SELECT 'Sun', 2376, 936
  UNION ALL SELECT 'Wed', 0, 0
)
SELECT 
   *, ROUND(IEEE_Divide(oneways, numrides), 2)
AS frac_oneway from example

The IEEE_Divide function follows the standard set by the Institute of Electrical and Electronics Engineers (IEEE) and returns a special floating-point number called Not-a-Number (NaN) when a division by zero is attempted.

Also try the previous query using the standard division operator and using SAFE_DIVIDE (discussed shortly).1 Recall that, for your copy-pasting convenience, all the queries in this book are available in the book’s GitHub repository.

SAFE Functions

You can make any scalar function return NULL instead of raising an error by prefixing it with SAFE. For example, the following query will raise an error because the logarithm of a negative number is undefined:

SELECT LOG(10, -3), LOG(10, 3)

However, by prefixing the LOG with SAFE, like so:

SELECT SAFE.LOG(10, -3), SAFE.LOG(10, 3)

you will get NULL for the result of LOG(10, -3):

Row f0_ f1_
1 null 2.095903274289385

The SAFE prefix works for mathematical functions, string functions (for example, the SUBSTR function would normally raise an error if the starting index is negative, but it returns NULL if invoked as SAFE.SUBSTR), and time functions.  It is, however, restricted to scalar functions and will not work for aggregate functions, analytic functions, or user-defined functions.

Comparisons

Comparisons are carried out using operators. The operators <, <=, >, >=, and != (or <>) are used to obtain the results of comparison. NULL, followed by NaN, is assumed to be smaller than valid numbers (including -inf) for the purposes of ordering. However, comparisons with NaN always return false and comparisons with NULL always return NULL. This can lead to seemingly paradoxical results:

WITH example AS (
  SELECT 'Sat' AS day, 1451 AS numrides, 1018 AS oneways
  UNION ALL SELECT 'Sun', 2376, 936
  UNION ALL SELECT 'Mon', NULL, NULL
  UNION ALL SELECT 'Tue', IEEE_Divide(-3,0), 0 -- this is -inf,0
)
SELECT * from example
ORDER BY numrides

This query returns the following:

Row day numrides oneways
1 Mon null null
2 Tue -Infinity 0
3 Sat 1451.0 1018
4 Sun 2376.0 936

However, filtering for fewer than 2000 rides with

SELECT * from example
WHERE numrides < 2000

yields only two results, not three:

Row day numrides oneways
1 Sat 1451.0 1018
2 Tue -Infinity 345

This is because the WHERE clause returns only those rows for which the result is true, and when NULL is compared to 2000, the result is NULL and not true.

Note that the operators & and | exist in BigQuery but are used only for bitwise operations. The ! symbol, as in !=, means NOT, but it does not work as a standalone—you cannot say !gender to compute the logical negative of gender, as you can in other languages. An alternate way to specify not-equals is to write <>, but be consistent on whether you use != or <>.

Precise Decimal Calculations with NUMERIC

INT64 and FLOAT64 are designed to be flexible and fast, but they are limited by the fact that they are stored in a base-2 (0s and 1s) form in a 64-bit area of computer memory when being used for calculations. This is a trade-off well worth making in most applications, but financial and accounting applications often require exact calculations for numbers represented in decimal (base-10).

The NUMERIC data type in BigQuery provides 38 digits to represent numbers, with 9 of those digits appearing after the decimal point. It uses 16 bytes for storage and can represent decimal fractions exactly, thus making it suitable for financial calculations.

For example, imagine that you needed to compute the sum of three payments. You’d want the results to be exact. When using FLOAT64 values, however, the tiny differences between how the number is represented in memory and how the number is represented in decimals can add up:

WITH example AS (
  SELECT 1.23 AS payment
  UNION ALL SELECT 7.89
  UNION ALL SELECT 12.43
)
SELECT 
  SUM(payment) AS total_paid,
  AVG(payment) AS average_paid
FROM example

Look at what we get:

Row total_paid average_paid
1 21.549999999999997 7.183333333333334

In financial and accounting applications, these imprecisions can add up and make balancing the books tricky.

Watch what happens when we change the data type of payment to be NUMERIC:

WITH example AS (
  SELECT NUMERIC '1.23' AS payment
  UNION ALL SELECT NUMERIC '7.89'
  UNION ALL SELECT NUMERIC '12.43'
)
SELECT 
  SUM(payment) AS total_paid,
  AVG(payment) AS average_paid
FROM example

The problem goes away. The sum of the payments is now precise (the average cannot be represented precisely even in NUMERIC because it is a repeating decimal):

Row total_paid average_paid
1 21.55 7.183333333

Note that NUMERIC types need to be directly ingested into BigQuery as strings (NUMERIC '1.23'); otherwise, the floating-point representation will obviate any of the precision gains to be had.

Working with BOOL

Boolean variables are those that can be either True or False. Because SQL is case insensitive, TRUE, true, and so on also work.

Logical Operations

Recall from the section on filtering within the WHERE clause that the WHERE clause can include Boolean expressions that include AND, OR, and NOT, as well as parentheses to control the order of execution. We used this query to illustrate these options:

SELECT
  gender, tripduration
FROM
  `bigquery-public-data`.new_york_citibike.citibike_trips
WHERE (tripduration < 600 AND gender = 'female') OR gender = 'male'

You could use comparison operators with Boolean variables, as in the following:

WITH example AS (
  SELECT NULL AS is_vowel, NULL as letter, -1 as position
  UNION ALL SELECT true, 'a', 1
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT * from example WHERE is_vowel != false

This gives us the following:

Row is_vowel letter position
1 true a 1

However, it is often simpler to use the IS operator when comparing against built-in constants, as shown in this example:

WITH example AS (
  SELECT NULL AS is_vowel, NULL as letter, -1 as position
  UNION ALL SELECT true, 'a', 1
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT * from example WHERE is_vowel IS NOT false

This yields the following:

Row is_vowel letter position
1 null null -1
2 true a 1

Note that the two queries yield different results. The comparators (=, !=, <, etc.) return NULL for comparisons against NULL, whereas the IS operator doesn’t.

Tip

NULLs typically represent missing values or values that were not collected. They have no value and are not zero, empty strings, or blanks. If your dataset has NULLs, you must tread carefully since comparisons with NULL always return NULL, and so the WHERE clause will filter out NULL values. Use the IS operator to check where a value is NULL.

It is simpler and more readable to use Boolean variables directly:

WITH example AS (
  SELECT NULL AS is_vowel, NULL as letter, -1 as
position
  UNION ALL SELECT true, 'a', 1
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)

SELECT * from example WHERE is_vowel

The result here is like is_vowel IS TRUE:

Row is_vowel letter position
1 true a 1

Of course, such readability depends on naming the Boolean variables well!

Conditional Expressions

It is not just in the WHERE clause that Booleans are useful. It is possible to simplify many queries by using conditional expressions in the SELECT. For example, suppose that you need to compute the sales price of each item in a catalog based on the desired markup and tax rate corresponding to the item. If your catalog is missing values for some of the necessary information, you might want to impute a default markup or default tax rate. You can achieve this with the IF function:

WITH catalog AS (
   SELECT 30.0 AS costPrice, 0.15 AS markup, 0.1 AS taxRate
   UNION ALL SELECT NULL, 0.21, 0.15
   UNION ALL SELECT 30.0, NULL, 0.09
   UNION ALL SELECT 30.0, 0.30, NULL
   UNION ALL SELECT 30.0, NULL, NULL
)
SELECT
  *, ROUND(
    costPrice * 
    IF(markup IS NULL, 1.05, 1+markup) * 
    IF(taxRate IS NULL, 1.10, 1+taxRate) 
  , 2) AS salesPrice
FROM catalog

This yields a valid salesPrice for all items except those for which we don’t know the cost:

Row costPrice markup taxRate salesPrice
1 30.0 0.15 0.1 37.95
2 null 0.21 0.15 null
3 30.0 null 0.09 34.34
4 30.0 0.3 null 42.9
5 30.0 null null 34.65

The way the IF function works is that the first parameter is the condition to be evaluated. If the condition is true, the second parameter is used, or else the third parameter is used. Because this function occurs in the SELECT, it is carried out row by row.

Cleaner NULL-Handling with COALESCE

What if you want to do the imputation if a single value is missing, but not if more than one value is missing? In other words, if you have no tax rate, you are willing to impute a 10% tax rate, but not if you also don’t know the markup on the item.

A convenient way to keep evaluating expressions until we get to a non-NULL value is to use COALESCE:

WITH catalog AS (
    SELECT 30.0 AS costPrice, 0.15 AS markup, 0.1 AS taxRate
    UNION ALL SELECT NULL, 0.21, 0.15
    UNION ALL SELECT 30.0, NULL, 0.09
    UNION ALL SELECT 30.0, 0.30, NULL
    UNION ALL SELECT 30.0, NULL, NULL
)
SELECT
  *, ROUND(COALESCE(
  costPrice * (1+markup) * (1+taxrate),
  costPrice * 1.05 * (1+taxrate),
  costPrice * (1+markup) * 1.10,
  NULL
  ),2) AS salesPrice
FROM catalog

This yields the following (only the last row is different from the previous computation):

Row costPrice markup taxRate salesPrice
1 30.0 0.15 0.1 37.95
2 null 0.21 0.15 null
3 30.0 null 0.09 34.34
4 30.0 0.3 null 42.9
5 30.0 null null null

The COALESCE short-circuits the calculation whenever possible—that is, later expressions are not evaluated after a non-NULL result is obtained. Therefore, the final NULL in the COALESCE is not required, but it makes the intent clearer.

BigQuery supports the IFNULL function as a simplification of COALESCE when you have only two inputs. IFNULL(a, b) is the same as COALESCE(a, b) and yields b if a is NULL. In other words, IFNULL(a, b) is the same as IF(a IS NULL, b, a).

The very first query in this section on conditional expressions could have been simplified as follows:

SELECT
  *, ROUND(
     costPrice * 
     (1 + IFNULL(markup, 0.05)) * 
     (1 + IFNULL(taxrate,0.10)) 
  , 2) AS salesPrice
FROM catalog

Casting and Coercion

Consider this example dataset in which the number of hours worked by an employee is stored as a string in order to accommodate reasons for a leave of absence (this is a bad schema design, but bear with us):

WITH example AS (
  SELECT 'John' as employee, 'Paternity Leave' AS hours_worked
  UNION ALL SELECT 'Janaki', '35'
  UNION ALL SELECT 'Jian', 'Vacation'
  UNION ALL SELECT 'Jose', '40'
)

Now suppose that you want to find the total number of hours worked. This won’t work because the hours_worked is a string, not a numeric type:

WITH example AS (
  SELECT 'John' as employee, 'Paternity Leave' AS hours_worked
  UNION ALL SELECT 'Janaki', '35'
  UNION ALL SELECT 'Jian', 'Vacation'
  UNION ALL SELECT 'Jose', '40'
)
SELECT SUM(hours_worked) from example

We need to explicitly convert the hours_worked to an INT64 before doing any aggregation. Explicit conversion is called casting, and it requires the explicit use of the CAST() function. If casting fails, BigQuery raises an error. To have it return NULL instead, use SAFE_CAST. For example, the following raises an error:

SELECT CAST("true" AS bool), CAST("invalid" AS bool)

Now try using SAFE_CAST:

SELECT CAST("true" AS bool), SAFE_CAST("invalid" AS bool)

You should see the following:

Row f0_ f1_
1 true null

Implicit conversion is called coercion, and this happens automatically when a data type is used in a situation for which another data type is required. For example, when we use an INT64 in a situation when a FLOAT64 is needed, the integer will be coerced into a floating-point number. The only coercions done by BigQuery are to convert INT64 to FLOAT64 and NUMERIC, and NUMERIC to FLOAT64. Every other conversion is explicit and requires a CAST.

With the problem of the total number of hours worked, not all of the hours_worked strings can be converted to integers, so you should use a SAFE_CAST:

WITH example AS (
  SELECT 'John' as employee, 'Paternity Leave' AS hours_worked
  UNION ALL SELECT 'Janaki', '35'
  UNION ALL SELECT 'Jian', 'Vacation'
  UNION ALL SELECT 'Jose', '40'
)
SELECT SUM(SAFE_CAST(hours_worked AS INT64)) from example

This yields the following:

Row f0_
1 75

Had it simply been a schema problem and all the rows contained numbers but were stored as strings, you could have used a simple CAST:

WITH example AS (
  SELECT 'John' as employee, '0' AS hours_worked
  UNION ALL SELECT 'Janaki', '35'
  UNION ALL SELECT 'Jian', '0'
  UNION ALL SELECT 'Jose', '40'
)
SELECT SUM(CAST(hours_worked AS INT64)) from example

Using COUNTIF to Avoid Casting Booleans

Consider this example dataset:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT * from example

Here’s the result of the query:

Row is_vowel letter position
1 true a 1
2 false b 2
3 false c 3

Now suppose that you want to find the total number of vowels. You might be tempted to do something simple, such as the following:

SELECT SUM(is_vowel) as num_vowels from example

This won’t work, however (try it!), because SUM, AVG, and others are not defined on Booleans. You could cast the Booleans to an INT64 before doing the aggregation, like so:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT SUM(CAST (is_vowel AS INT64)) as num_vowels from example

This would yield the following:

Row num_vowels
1 1

However, you should try to avoid casting as much as possible. In this case, a cleaner approach is to use the IF statement on the Booleans:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT SUM(IF(is_vowel, 1, 0)) as num_vowels from example

An even cleaner approach is to use COUNTIF:

WITH example AS (
  SELECT true AS is_vowel, 'a' as letter, 1 as position
  UNION ALL SELECT false, 'b', 2
  UNION ALL SELECT false, 'c', 3
)
SELECT COUNTIF(is_vowel) as num_vowels from example

String Functions

String manipulation is a common requirement for data wrangling, so BigQuery provides a library of built-in string functions—for example:

WITH example AS (
  SELECT * from unnest([
     'Seattle', 'New York', 'Singapore'
  ]) AS city
)
SELECT 
  city
  , LENGTH(city) AS len
  , LOWER(city) AS lower
  , STRPOS(city, 'or') AS orpos
FROM example

This example computes the length of the string, makes the string lowercase, and finds the location of a substring in the “city” column, which gives us the following result:

Row city len lower orpos
1 Seattle 7 seattle 0
2 New York 8 new york 6
3 Singapore 9 singapore 7

The substring “or” occurs in “New York” and in “Singapore,” but not in “Seattle.”

Two particularly useful functions for string manipulation are SUBSTR and CONCAT. SUBSTR extracts a substring,  and CONCAT concatenates the input values. The following query finds the position of the @ symbol in an email address, extracts the username, and concatenates the city in which the individual lives:

WITH example AS (
  SELECT '[email protected]' AS email, 'Annapolis, MD' as city
  UNION ALL SELECT '[email protected]', 'Boulder, CO'
  UNION ALL SELECT '[email protected]', 'Chicago, IL'
)
 
SELECT 
  CONCAT(
      SUBSTR(email, 1, STRPOS(email, '@') - 1), -- username
      ' from ', city) AS callers
FROM example

Here’s what the result looks like:

Row callers
1 armin from Annapolis, MD
2 boyan from Boulder, CO
3 carrie from Chicago, IL

Internationalization

Strings in BigQuery are Unicode, so avoid assumptions that rely on English.  For example, the “upper” case is a no-op in Japanese, and the default UTF-8 encoding that is carried out by the cast as bytes is insufficient for languages such as Tamil, as demonstrated here:

WITH example AS (
  SELECT * from unnest([
     'Seattle', 'New York', 'சிங்கப்பூர்', '東京'
  ]) AS city
)
SELECT 
  city
  , UPPER(city) AS allcaps
  , CAST(city AS BYTES) as bytes
FROM example

As you can see, this simply doesn’t work as presumably intended:

Row city allcaps bytes
1 Seattle SEATTLE U2VhdHRsZQ==
2 New York NEW YORK TmV3IFlvcms=
3 சிங்கப்பூர் சிங்கப்பூர் 4K6a4K6/4K6Z4K+N4K6V4K6q4K+N4K6q4K+C4K6w4K+N
4 東京 東京 5p2x5Lqs

BigQuery supports three different ways to represent strings—as an array of Unicode characters, as an array of bytes, and as an array of Unicode code points (INT64):

WITH example AS (
  SELECT * from unnest([
     'Seattle', 'New York', 'சிங்கப்பூர்', '東京'
  ]) AS city
)
SELECT 
  city
  , CHAR_LENGTH(city) as char_len
  , TO_CODE_POINTS(city)[ORDINAL(1)] as first_code_point
  , ARRAY_LENGTH(TO_CODE_POINTS(city)) as num_code_points
  , CAST (city AS BYTES) as bytes
  , BYTE_LENGTH(city) as byte_len
FROM example

Note the difference between the results for CHAR_LENGTH and BYTE_LENGTH on the same strings, and how the number of code points is the same as the number of characters:

Row city char_len first_code_point num_code_points bytes byte_len
1 Seattle 7 83 7 U2VhdHRsZQ== 7
2 New York 8 78 8 TmV3IFlvcms= 8
3 சிங்கப்பூர் 11 2970 11 4K6a4K6/4K6Z4K+N4K6V4K6q4K+N4K6q4K+C4K6w4K+N 33
4 東京 2 26481 2 5p2x5Lqs 6

Because of these differences, you need to recognize which columns might contain text in different languages, and be aware of language differences when using string manipulation functions.

Printing and Parsing

You can simply cast a string as an INT64 or a FLOAT64 in order to parse it, but customizing the string representation will require the use of FORMAT:

SELECT
  CAST(42 AS STRING)
  , CAST('42' AS INT64)
  , FORMAT('%03d', 42)
  , FORMAT('%5.3f', 32.457842)
  , FORMAT('%5.3f', 32.4)
  , FORMAT('**%s**', 'H')
  , FORMAT('%s-%03d', 'Agent', 7)

Here is the result of that query:

Row f0_ f1_ f2_ f3_ f4_ f5_ f6_
1 42 42 042 32.458 32.400 **H** Agent-007

FORMAT works similarly to C’s printf, and it accepts the same format specifiers. A few of the more useful specifiers are demonstrated in the preceding example. Although FORMAT also accepts dates and timestamps, it is better to use FORMAT_DATE and FORMAT_TIMESTAMP so that the display formats can be locale-aware.

String Manipulation Functions

Manipulating strings is such a common need in Extract, Transform, and Load (ETL) pipelines that these BigQuery convenience functions are worth having on speed dial:

SELECT
  ENDS_WITH('Hello', 'o') -- true
  , ENDS_WITH('Hello', 'h') -- false
  , STARTS_WITH('Hello', 'h') -- false
  , STRPOS('Hello', 'e') -- 2
  , STRPOS('Hello', 'f') -- 0 for not-found
  , SUBSTR('Hello', 2, 4) -- 1-based
  , CONCAT('Hello', 'World')

The result of this query is as follows:

Row f0_ f1_ f2_ f3_ f4_ f5_ f6_
1 true false false 2 0 ello HelloWorld

Note how SUBSTR() behaves. The first parameter is the starting position (it is 1-based), and the second parameter is the desired number of characters in the substring.

Transformation Functions

Another set of functions that is worth becoming familiar with are those that allow you to manipulate the string:

SELECT
  LPAD('Hello', 10, '*') -- left pad with *
  , RPAD('Hello', 10, '*') -- right pad
  , LPAD('Hello', 10) -- left pad with spaces
  , LTRIM('   Hello   ') -- trim whitespace on left
  , RTRIM('   Hello   ') -- trim whitespace on right
  , TRIM ('   Hello   ') -- trim whitespace both ends
  , TRIM ('***Hello***', '*') -- trim * both ends
  , REVERSE('Hello') -- reverse the string

Let’s look at the result of this query:

Row f0_ f1_ f2_ f3_ f4_ f5_ f6_ f7_
1 *****Hello Hello*****     Hello Hello    Hello Hello Hello olleH

Regular Expressions

Regular expressions provide much more powerful semantics than the convenience functions. For instance, STRPOS and others can find only specific characters, whereas you can use REGEXP_CONTAINS for more powerful searches.

For example, you could do the following to determine whether a column contains a US zip code (the short form of which is a five-digit number and the long form of which has an additional four digits separated by either a hyphen or a space):

SELECT
  column
  , REGEXP_CONTAINS(column, r'd{5}(?:[-s]d{4})?') has_zipcode
  , REGEXP_CONTAINS(column, r'^d{5}(?:[-s]d{4})?$') is_zipcode
  , REGEXP_EXTRACT(column, r'd{5}(?:[-s]d{4})?') the_zipcode 
  , REGEXP_EXTRACT_ALL(column, r'd{5}(?:[-s]d{4})?') all_zipcodes 
  , REGEXP_REPLACE(column, r'd{5}(?:[-s]d{4})?', '*****') masked
FROM (
  SELECT * from unnest([
      '12345', '1234', '12345-9876', 
      'abc 12345 def', 'abcde-fghi',
      '12345 ab 34567', '12345 9876'
  ]) AS column
)

Here’s what this query yields:

Row column has_zipcode is_zipcode the_zipcode all_zipcodes masked
1 12345 true true 12345 12345 *****
2 1234 false false null   1234
3 12345-9876 true true 12345-9876 12345-9876 *****
4 abc 12345 def true false 12345 12345 abc ***** def
5 abcde-fghi false false null   abcde-fghi
6 12345 ab 34567 true false 12345 12345 ***** ab *****
          34567  
7 12345 9876 true true 12345 9876 12345 9876 *****

There are a few things to note:

  • The regular expression d{5} matches any string consisting of five decimal numbers.

  • The second part of the  expression, in parentheses, looks for an optional (note the ? at the end of the parentheses) group (?:) of four decimal numbers (d{4}), which is separated from the first five numbers by either a hyphen or by a space (s).

  • The presence of d, s, and others in the string could cause problems, so we prefix the string with an r (for raw), which makes it a string literal.

  • The second expression illustrates how to find an exact match: simply insist that the string in question must start (^) and end ($) with the specified string.

  • To extract the part of the string matched by the regular expression, use REGEXP_EXTRACT. This returns NULL if the expression is not matched, and only the first match if there are multiple matches.

  • REGEXP_EXTRACT_ALL returns all the matches. If there is no match, it returns an empty array.

  • REGEXP_REPLACE replaces every match with the replacement string.

The regular expression support in BigQuery follows that of Google’s open source RE2 library. To see the syntax accepted by this library, visit https://github.com/google/re2/wiki/Syntax. Regular expressions can be cryptic, but they are a rich topic that is well worth mastering.2

Summary of String Functions

Because strings are so common in data analysis, it is worth learning the broad contours of the available functions. You can always refer to the BigQuery documentation for the exact syntax. Table 3-2 separates them into their respective categories.

Table 3-2. Categories of string functions
Category Functions Notes
Representations CHAR_LENGTH, BYTE_LENGTH, TO_CODE_POINTS,
CODE_POINTS_TO_STRING,
SAFE_CONVERT_BYTES_TO_STRING,
TO_HEX, TO_BASE32, TO_BASE64, FROM_HEX, FROM_BASE32, FROM_BASE64, NORMALIZE
Normalize allows, for example, different Unicode space characters to be made equivalent.
Printing and parsing FORMAT, REPEAT, SPLIT The syntax of FORMAT is similar to C’s printf: format("%03d", 12) yields 012. For locale-aware conversions, use FORMAT_DATE, etc.
Convenience ENDS_WITH, LENGTH, STARTS_WITH, STRPOS, SUBSTR, CONCAT The LENGTH function is equivalent to CHAR_LENGTH for Strings and to BYTE_LENGTH for Bytes.
Transformations LPAD, LOWER, LTRIM, REPLACE, REVERSE, RPAD, RTRIM, TRIM, UPPER The default trim characters are Unicode whitespace, but it is possible to specify a different set of trim characters.
Regular expressions REGEXP_CONTAINS,
REGEXP_EXTRACT, REGEXP_EXTRACT_ALL, REGEXP_REPLACE
See https://github.com/google/re2/wiki/Syntax for the syntax accepted by BigQuery.

Working with TIMESTAMP

A timestamp represents an absolute point in time regardless of location. Thus a timestamp of 2017-09-27 12:30:00.45 (Sep 27, 2017, at 12:30 UTC) represents the same time as 2017-09-27 13:30:00.45+1:00 (1:30 p.m. at a time zone that is an hour behind):

SELECT t1, t2, TIMESTAMP_DIFF(t1, t2, MICROSECOND)
FROM (SELECT
  TIMESTAMP "2017-09-27 12:30:00.45" AS t1,
  TIMESTAMP "2017-09-27 13:30:00.45+1" AS t2
)

This returns the following:

Row t1 t2 f0_
1 2017-09-27 12:30:00.450 UTC 2017-09-27 12:30:00.450 UTC 0

Parsing and Formatting Timestamps

BigQuery is somewhat forgiving when it comes to parsing the timestamp. The date and time parts of this string representation can be separated either by a T or by a space in accordance with ISO 8601. Similarly, the month, day, hour, and so on might or might not have leading zeros. However, best practice is to use the canonical representation shown in the previous paragraph. As that string representation would indicate, this timestamp can represent only four-digit years; years before the common era cannot be represented using TIMESTAMP.

You can use PARSE_TIMESTAMP to parse a string that is not in the canonical format:

SELECT
  fmt, input, zone
  , PARSE_TIMESTAMP(fmt, input, zone) AS ts
FROM (
  SELECT '%Y%m%d-%H%M%S' AS fmt, '20181118-220800' AS input, '+0' as zone
  UNION ALL SELECT '%c', 'Sat Nov 24 21:26:00 2018', 'America/Los_Angeles'
  UNION ALL SELECT '%x %X', '11/18/18 22:08:00', 'UTC'
)

Here is what this would yield:

Row fmt input zone ts
1 %Y%m%d%-H%M%S 20181118-220800 +0 2018-11-18 22:08:00 UTC
2 %c Sat Nov 24 21:26:00 2018 America/Los_Angeles 2018-11-25 05:26:00 UTC
3 %x %X 11/18/18 22:08:00 UTC 2018-11-18 22:08:00 UTC

The first example uses format specifiers for the year, month, day, and so on to create a timestamp from the provided string. The second and third examples use preexisting specifiers for commonly encountered date-time formats.3

Conversely, you can use FORMAT_TIMESTAMP to print out a timestamp in any desired format:

SELECT
  ts, fmt
  , FORMAT_TIMESTAMP(fmt, ts, '+6') AS ts_output
FROM (
  SELECT CURRENT_TIMESTAMP() AS ts, '%Y%m%d-%H%M%S' AS fmt
  UNION ALL SELECT CURRENT_TIMESTAMP() AS ts, '%c' AS fmt
  UNION ALL SELECT CURRENT_TIMESTAMP() AS ts, '%x %X' AS fmt
)

This results in the following:

Row ts fmt ts_output
1 2018-11-25 05:42:13.939840 UTC %Y%m%d-%H%M%S 20181125-114213
2 2018-11-25 05:42:13.939840 UTC %c Sun Nov 25 11:42:13 2018
3 2018-11-25 05:42:13.939840 UTC %x %X 11/25/18 11:42:13

The preceding example uses the function CURRENT_TIMESTAMP() to retrieve the system time at the time the query is executed. In both PARSE_TIMESTAMP and FORMAT_TIMESTAMP, the time zone is optional; if omitted, the time zone is assumed to be UTC.

Extracting Calendar Parts

Given a timestamp, it is possible to extract information about the Gregorian calendar corresponding to the timestamp. For example, we can extract information about Armistice Day4 using this:

SELECT
  ts
  , FORMAT_TIMESTAMP('%c', ts) AS repr
  , EXTRACT(DAYOFWEEK FROM ts) AS dayofweek
  , EXTRACT(YEAR FROM ts) AS year
  , EXTRACT(WEEK FROM ts) AS weekno
FROM (
  SELECT PARSE_TIMESTAMP('%Y%m%d-%H%M%S', '19181111-054500') AS ts
)

Here is the result:

Row ts repr dayofweek year weekno
1 1918-11-11 05:45:00 UTC Mon Nov 11 05:45:00 1918 2 1918 45

The week is assumed to begin on Sunday, and days prior to the first Sunday of the year are assigned to week 0. This is not internationally safe. Hence, if you’re in a country (such as Israel) where the week begins on Saturday, it is possible to specify a different day for the start of the week:

EXTRACT(WEEK('SATURDAY') FROM ts)

The number of seconds from the Unix epoch (January 1, 1970) is not available through EXTRACT. Instead, special functions exist to convert to and from the Unix epoch:

SELECT
  UNIX_MILLIS(TIMESTAMP "2018-11-25 22:30:00 UTC")
  , UNIX_MILLIS(TIMESTAMP "1918-11-11 22:30:00 UTC") --invalid
  , TIMESTAMP_MILLIS(1543185000000)

This yields the following:

Row f0_ f1_ f2_
1 1543185000000 -1613784600000 2018-11-25 22:30:00 UTC

Note that the second one overflows and yields a negative number, but no error is raised.

Arithmetic with Timestamps

It is possible to add or subtract time durations from timestamps. It is also possible to find the time difference between two timestamps. In all of these functions, you need to specify the units in which the durations are expressed:

SELECT 
  EXTRACT(TIME FROM TIMESTAMP_ADD(t1, INTERVAL 1 HOUR)) AS plus_1h
  , EXTRACT(TIME FROM TIMESTAMP_SUB(t1, INTERVAL 10 MINUTE)) AS minus_10min
  , TIMESTAMP_DIFF(CURRENT_TIMESTAMP(),
              TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE),
              SECOND) AS plus_1min
  , TIMESTAMP_DIFF(CURRENT_TIMESTAMP(),
              TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE),
              SECOND) AS minus_1min
FROM (SELECT
  TIMESTAMP "2017-09-27 12:30:00.45" AS t1
)

This returns the timestamps an hour from now, 10 minutes ago, and the time difference in seconds corresponding to one minute from now and one minute earlier:

Row plus_1h minus_10min plus_1min minus_1min
1 13:30:00.450000 12:20:00.450000 60 -60

Date, Time, and DateTime

BigQuery has three other functions for representing time: DATE, TIME, and DATETIME. DATE is useful for when you are tracking only the day in which something happens, and any more precision is unnecessary. TIME is useful to represent the time of day that things happen, and to perform mathematical operations with those times. With TIME, you can answer questions like, “What time will it be eight hours from the starting time?” DATETIME is a TIMESTAMP rendered in a specific time zone, so it is useful when you have an unambiguous time zone in which an event occurred and you don’t need to do time zone conversions.

Counterparts to most of the TIMESTAMP functions are available for DATETIME. Thus, you can call DATETIME_ADD, DATETIME_SUB, and DATETIME_DIFF, as well as PARSE_DATETIME and FORMAT_DATETIME. You can also EXTRACT calendar parts from a DATETIME. The two types are quite interoperable—it is possible to extract a DATETIME from a TIMESTAMP and cast a DATETIME to a TIMESTAMP:

SELECT 
  EXTRACT(DATETIME FROM CURRENT_TIMESTAMP()) as dt
  , CAST(CURRENT_DATETIME() AS TIMESTAMP) as ts

The following shows the result:

Row dt ts
1 2018-11-25T07:03:15.055141 2018-11-25 07:03:15.055141 UTC

Note that the canonical representation of a DATETIME has the letter T separating the date part and the time part, whereas the representation of a TIMESTAMP uses a space. The TIMESTAMP also explicitly includes the time zone, whereas the time zone is implicit in the DATETIME. But for the most part, you can use DATETIME and TIMESTAMP interchangeably in BigQuery.

DATE is just the date part of a DATETIME (or a TIMESTAMP, interpreted in some time zone), and TIME is the time part. Because many real-world scenarios might happen on a certain date (i.e., at multiple times throughout that day), many database tables contain just a DATE. So there is some benefit to being able to directly parse and format dates. On the other hand, there is very little need for the TIME type other than as the “missing” part of a DATETIME.

For the most part, therefore, our advice is to just use TIMESTAMP and DATE. There is, however, one practical wrinkle to using TIMESTAMP. Timestamps in BigQuery are stored using eight bytes with microsecond resolution. This means that you can store years 0 through 9999, and any microsecond in between. In some other databases (e.g., MySQL), TIMESTAMP is stored using four bytes and DATETIME using eight bytes. In those systems, the range of a TIMESTAMP is within the limits of the Unix epoch time (years 1970 to 2038), which means that you cannot even store the birthdays of 60-year-old people or the expiry dates of 30-year mortgages. So, whereas a TIMESTAMP might work in BigQuery, you might not be able to use the same schema in MySQL, and this might make moving queries and data between BigQuery and MySQL challenging.

Working with GIS Functions

We look at geography functions in much more detail in Chapter 8, which looks at advanced features. In this section, we provide only a brief introduction.

The GEOGRAPHY type can be used to represent points, lines, and polygons on the surface of the Earth (i.e., there is no height associated with them). Because the Earth is a lumpy mass, points on its surface can be represented only on spherical and ellipsoidal approximations of the surface. In BigQuery, the geographic positions of the points and the vertices of the lines and polygons are represented in the WGS84 ellipsoid. Practically speaking, this is the same ellipsoid as used by the Global Positioning System (GPS), so you will be able to take the longitude and latitude positions reported by most sensors and use them directly in BigQuery.

The simplest geography is a point specified by its longitude and latitude. So, for example,

ST_GeogPoint(-122.33, 47.61)

represents a point at 47.61N and 122.33W—Seattle, Washington.

The BigQuery public datasets include a table that contains polygons corresponding to each of the US states and territories. We can therefore write a query to find out which state the geographic point is in:

SELECT 
  state_name
FROM `bigquery-public-data`.utility_us.us_states_area
WHERE
   ST_Contains(
     state_geom,
     ST_GeogPoint(-122.33, 47.61))

As anticipated, this returns the following:

Row state_name
1 Washington

The query uses the ST_Contains function to determine whether the state’s geometry (stored as the state_geom column in the BigQuery dataset) contains the point we are interested in. The spatial functions that BigQuery supports follow the SQL/MM 3 specification and are similar to what the PostGIS library provides for Postgres.

Summary

To summarize what we’ve covered in this chapter, Table 3-3 presents the data types that BigQuery supports.

Table 3-3. Data types supported by BigQuery
Data type Sample functions and operators supported Notes
INT64 Arithmetic operations (+, –, /, *, for add, subtract, divide, and multiply, respectively). Approximately 10–19 to 1019.
NUMERIC Arithmetic operations. 38 digits of precision and 9 decimal digits of scale; this is suitable for financial calculations.
FLOAT64 Arithmetic operations.
Also: IEEE_DIVIDE.
IEEE-754 behavior if one of the values is NaN or ± inf.
BOOL Conditional statements.
MIN, MAX.
However, SUM, AVG, etc. are not supported (you’d need to cast the Booleans to INT64 first).
Is either True and False.
SQL is case insensitive, so TRUE, true, and so on also work.
STRING Use special String functions such as CONCAT, LENGTH, etc. to operate on strings. Strings are Unicode characters and are variable length.
BYTES   Variable length characters.
Many String operations are also defined on BYTES.
TIMESTAMP CURRENT_TIMESTAMP() represents “now.”
You can extract month, year, dayofweek, etc. from a timestamp.
Arithmetic on timestamps is supported via special functions, not through arithmetic operators.
Absolute point in time, to microsecond precision, represented in a subset of ISO 8601. This is the recommended way to store times in BigQuery.
DATE CURRENT_DATE() represents the current date in the UTC time zone, whereas CURRENT_DATE("America/Los_Angeles") represents the current date in the Los Angeles time zone.
Like TIMESTAMP, arithmetic on dates is supported via special functions.
2018-3-14 (or 2018-03-14) is March 14, 2018, independent of time zone. Because this represents different 24-hour blocks in different time zones, use TIMESTAMP to represent an absolute point in time. You can then construct a DATE from a TIMESTAMP relative to a particular time zone.
DATETIME As with DATE. 2018-03-14 3:14:57 or 2018-03-14T03:14:57.000000 is, like DATE, independent of time zone. Most applications will want to use TIMESTAMP.
TIME As with DATETIME, except that the DATE part is absent. Independent of a specific date or time zone. This ranges from 00:00:00 to 23:59:59.999999.
GEOGRAPHY Topological functions on geographies are supported via special functions. Points, lines, and polygons on the surface of the Earth (i.e., there is no height).
The representations are in the WGS84 ellipsoid; this is the same ellipsoid as used by the Global Positioning System (GPS).
The simplest geography is a point specified by its longitude and latitude.
STRUCT You can deference the fields by name. A collection of fields in order.
The field name is optional; that is, you could have either:
STRUCT<INT64, STRING>
or
STRUCT<id INT64, name STRING>.
ARRAY You can deference the items by offset, aggregate the items in the array, or unnest them to get the items one by one. Ordered list of non-null elements; e.g., ARRAY<INT64>. Arrays of arrays are not allowed, but you can get around this by creating an array of STRUCT in which the struct itself contains an array; i.e., ARRAY<STRUCT<ARRAY<INT64>>>
(We cover arrays in Chapter 2).

You can use all data types, except for arrays and structs, in ORDER BY and GROUP BY.

1 The standard division operator raises a division-by-zero error. SAFE_DIVIDE returns NULL for the entry when division by zero is attempted.

2 Start with Mastering Regular Expressions by Jeffrey Friedl (O’Reilly).

3 For the full list of specifiers, consult the documentation.

4 According to https://en.wikipedia.org/wiki/Armistice_Day, the agreement was signed at 5:45 a.m. on November 11, 1918. In Winter 1918, unlike now, France was in the UTC time zone; see https://www.timeanddate.com/time/zone/france/paris.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset