Chapter 16 – Statistical Aggregate Functions

“You can make more friends in two months by becoming interested in other people than you will in two years by trying to get other people interested in you."

- Dale Carnegie

Numeric Manipulation Functions

image

The functions above are often used for algebraic, trigonometric or geometric calculations.

Finding the Cube Root

image

Find the cube root with the cbrt function.

Ceiling Gets the Smallest Integer Not Smaller Than X

image

Find the smallest integer not smaller than x by using the ceil command. This stands for a number’s integer ceiling. Notice that a ceil (3.333) returns a 4. The CEIL command did not return a number lower than 3.333.

Floor Finds the Largest Integer Not Greater Than X

image

Find the largest integer not greater than x by using the floor command. This stands for a number’s integer floor. Notice that a ceil (3.333) returns a 3. The CEIL command did not return a number higher than 3.333.

The Round Function and Precision

image

Use the round function to round to the precision you need.

The Conv Function

image

The Conv Function converts a number from one numerical base to another. The syntax is: conv (num, from_base, to_base). If the to_base argument is negative, then treat num as a signed integer, otherwise, treat it as an unsigned integer.

The Stats Table

image

Above is the Stats_Table data in which we will use in our statistical examples.

Compute_Stats Function

Col1 Numbers

1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Col2 Numbers

1  1  3   3   3   4   5  5   5   5  7   7   9  9  9  9  10 10 10 10 10 10 13 13 13 14 15 15 16 16

SELECT compute_stats(col1, col2)
FROM Stats_Table

{"columntype":"Long","min":1,"max":30,"countnulls":0,"numdistinctvalues":23}

The compute_stats function returns the statistical summary of a set of primitive type values.

The STDDEV_POP Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

Syntax for using STDDEV_POP:

STDDEV_POP(<column-name>)

SELECT STDDEV_POP(col1) AS SDPCol1
FROM Stats_Table;

SDPCol1

8.66

The standard deviation function is a statistical measure of spread or dispersion of values. It is the root’s square of the difference of the mean (average). This measure is to compare the amount by which a set of values differs from the arithmetical mean. The STDDEV_POP function is one of two that calculates the standard deviation. The population is of all the rows included based on the comparison in the WHERE clause.

A STDDEV_POP Example

image

The STDDEV_SAMP Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax for using STDDEV_SAMP:

STDDEV_SAMP(<column-name>)

SELECT STDDEV_SAMP(col1) AS SDSCol1
FROM Stats_Table;

SDSCol1 

8.8

The standard deviation function is a statistical measure of spread or dispersion of values. It is the root’s square of the difference of the mean (average). This measure is to compare the amount by which a set of values differs from the arithmetical mean. The STDDEV_SAMP function is one of two that calculates the standard deviation. The sample is a random selection of all rows returned based on the comparisons in the WHERE clause. The population is for all of the rows based on the WHERE clause.

A STDDEV_SAMP Example

image

The VAR_POP Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax for using VAR_POP:

VAR_POP(<column-name>)

SELECT VAR_POP(col1) AS VPCol1
FROM Stats_Table;

 VPCol1 

74.92

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Hadoop. VAR_POP is for the entire population of data rows allowed by the WHERE clause. Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often, variance is used in theoretical work where a variance of the sample is needed. There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

A VAR_POP Example

image

The VAR_SAMP Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax for using VAR_SAMP:

VAR_SAMP(<column-name>)

SELECT VAR_SAMP(col1) AS VSCol1
FROM Stats_Table;

 VSCol1 

77.50

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Hadoop. VAR_SAMP is used for a random sampling of the data rows allowed through by the WHERE clause. Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed to look for consistency. There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

A VAR_SAMP Example

image

The VARIANCE Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax for using VARIANCE:

VARIANCE(<column-name>)

SELECT VARIANCE (col1) AS VSCol1
FROM Stats_Table;

 VSCol1 

74.92

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Hadoop. VAR_SAMP is used for a random sampling of the data rows allowed through by the WHERE clause. Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed to look for consistency. There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

A VARIANCE Example

image

The CORR Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax for using CORR:

CORR(<column-name>, <column-name>)

SELECT CORR(col1, col2) AS CCol1and2
FROM Stats_Table;

CCol1and2

0.99

The correlation coefficient is a number between -1 and 1. It is calculated from a number of pairs of observations or linear points (X,Y) Where:

1 = perfect positive correlation

0 = no correlation

-1 = perfect negative correlation

The CORR function is a binary function, meaning that two variables are used as input to it. It measures the association between 2 random variables. If the variables are such that when one changes the other does so in a related manner, they are correlated. Independent variables are not correlated because the change in one does not necessarily cause the other to change.

A CORR Example

image

Another CORR Example so you can Compare

image

The COVAR_POP Function

Col1 Numbers

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax:

COVAR(<column-name>, <column-name>)

SELECT COVAR_POP(col1, col2) AS CCol1_2
FROM Stats_Table;

CCol1_2

37.50

The covariance is a statistical measure of the tendency of two variables to change in conjunction with each other. It is equal to the product of their standard deviations and correlation coefficients. The covariance is a statistic used for bivariate samples or bivariate distribution. It is used for working out the equations for regression lines and the product-moment correlation coefficient.

A COVAR_POP Example

image

Another COVAR_POP Example so you can Compare

image

The COVAR_SAMP Function

Col1 Numbers

123456789 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Syntax:

COVAR_SAMP (expression1,expression2)

SELECT COVAR_SAMP(col1, col2)AS CCol1_2

FROM Stats_Table;

CCol1_2

38.79

The COVAR_SAMP function returns the sample covariance for a set of expression pairs deemed (expression1 and expression2). The actual return value data type by default is that of DOUBLE PRECISION. The function eliminates all expression pairs when either expression in the pair is NULL. If no rows remain, the function returns a NULL value.

A COVAR_SAMP Example

image

Another COVAR_SAMP Example so you can Compare

image

Using GROUP BY

SELECT

col3

,count(*)

AS Cnt

,avg(col1)

AS Avg1

,stddev_pop(col1)

AS SD1

,var_pop(col1)

AS VP1

,avg(col4)

AS Avg4

,stddev_pop(col4)

AS SD4

,var_pop(col4)

AS VP4

,avg(col6)

AS Avg6

,stddev_pop(col6)

AS SD6

,var_pop(col6)

AS VP6

FROM     Stats_Table GROUP BY Col3 ORDER BY 1;

image

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset