Chapter 10 - Collect Statistics

“Science is a collection of successful recipes”

-- Paul Valéry

The Teradata Parsing Engine (Optimizer) is Cost Based

The Parsing Engine (PE) is often referred to as the Teradata Optimizer, and it will actually generate several plans to choose from and ultimately choose the one with the lowest cost of resources. This is critical to performance in supporting mixed workloads ranging from OLTP to large joins and Decision Support (DS). All cost based optimizers require statistical information about the data and the machine resources (CPU, disk, memory, processors, etc.).

The other type of optimizer is a rules-based optimizer, which is designed for On-Line Transaction Processing (OLTP) workloads where queries are well known and the data has been logically and physically structured to support them.


A cost based optimizer is much better than a rules based optimizer for data warehouses.

The Purpose of Collect Statistics

The Teradata Parsing Engine (PE) is in charge of creating the PLAN for the AMPs to follow. The PE works best when Statistics have been collected on a table. Then it knows:

  1. The number of rows in the table
  2. The average row size
  3. Information on all Indexes for which statistics were collected
  4. The range of values for the column(s) for which statistics were collected
  5. The number of rows per value for the column(s) for which statistics were collected
  6. The number of NULLs for the column(s) for which statistics were collected

 

 

The purpose of the COLLECT STATISTICS command is to gather and store demographic data for one or more columns or indices of a table or join index. This process computes a statistical profile of the collected data, and stores the synopsis in the Data Dictionary (DD) inside USER DBC for use during the PE's optimizing phase of SQL statement parsing. The optimizer uses this synopsis data to generate efficient table access and join plans. Do NOT COLLECT Statistics on all columns in the table.

When Teradata Collects Statistics it creates a Histogram

[Figure: Collect Statistics Histogram]

When statistics are collected, Teradata does a full table scan, sorts the column or index values, places them into a default of 250 intervals, and then provides the above Histogram.

 

1. Highest Sorted Value in the Interval

2. Most Frequent Value in the Interval

3. Rows with the Most Frequent Value

4. Other Values in the Interval

5. Number of Rows of other Values

This is what is stored in statistics. This is tricky to understand at first, but note that there are 55 people with a Last_Name of Anderson, 150 Bells, 50 Canes, and 160 with the name Davis. Each interval shows its most popular value and row count.

The Interval of the Collect Statistics Histogram

[Figure: Interval 1 of the Histogram]

The PE now knows there are 55 Andersons in the table, and it assumes, for any other name falling between ‘A’ and ‘Axelrod’, that there are 9 rows for each value (900 / 100 = 9).

When statistics are collected, Teradata sorts the column or index into a default of 250 intervals, as shown above. This histogram is what the PE uses to build a plan. Above you see only interval one of the 250.
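The interval arithmetic can be sketched in a few lines of Python. This is a hypothetical illustration of the estimation logic, not Teradata's actual code; the numbers mirror the example interval, with ‘Anderson’ as the most frequent value at 55 rows plus 100 other values sharing 900 rows.

```python
# Sketch of how a cost-based optimizer estimates rows per value
# from one histogram interval. All names and figures follow the
# book's example interval 1 and are otherwise made up.

def estimate_rows(value, interval):
    """Estimate the row count for `value` using one interval's stats."""
    if value == interval["mode_value"]:
        return interval["mode_rows"]          # exact count for the mode
    # Any other value is assumed to hold an equal share of the rest.
    return interval["other_rows"] // interval["other_values"]

interval_1 = {
    "max_value": "Axelrod",   # highest sorted value in the interval
    "mode_value": "Anderson", # most frequent value in the interval
    "mode_rows": 55,          # rows with the most frequent value
    "other_values": 100,      # other distinct values in the interval
    "other_rows": 900,        # number of rows of those other values
}

print(estimate_rows("Anderson", interval_1))  # 55
print(estimate_rows("Abbott", interval_1))    # 9  (900 / 100)
```

For the mode value the PE has an exact count; for every other value it assumes an even share of the remaining rows, which is exactly how the 9-rows-per-name estimate above is derived.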

Histogram Quiz

[Figure: Histogram intervals for the quiz]

 

1. Highest Sorted Value in the Interval

2. Most Frequent Value in the Interval

3. Rows with the Most Frequent Value

4. Other Values in the Interval

5. Number of Rows of other Values

  1. Which Interval would the PE look to find the Last_Name of ‘Apple’? _______
  2. How many people are in the Employee_Table with a Last_Name of ‘Davis’? _____
  3. In Interval 2 how many other names are there other than ‘Bell’? __________
  4. How many people named ‘Baker’ would Teradata estimate? ____________
  5. How many people named ‘Donaldson’ would Teradata estimate? ___________
  6. How many people named ‘Cooper’ would Teradata estimate? ___________

Answers to Histogram Quiz

[Figure: Histogram intervals with quiz answers]

 

1. Highest Sorted Value in the Interval

2. Most Frequent Value in the Interval

3. Rows with the Most Frequent Value

4. Other Values in the Interval

5. Number of Rows of other Values

  1. Which Interval would the PE look to find the Last_Name of ‘Apple’? Interval 1
  2. How many people are in the Employee_Table with a Last_Name of ‘Davis’? 160
  3. In Interval 2 how many other names are there other than ‘Bell’? 200 other names
  4. How many people named ‘Baker’ would Teradata estimate? 4 (800 / 200)
  5. How many people named ‘Donaldson’ would Teradata estimate? 16 (800 / 50)
  6. How many people named ‘Cooper’ would Teradata estimate? 3 (900 / 300)

What to COLLECT STATISTICS On?

You don't COLLECT STATISTICS on all columns and indexes because the collections consume space and processing time unnecessarily, but you do collect on:

  • All Non-Unique Primary Indexes and All Non-Unique Secondary Indexes
  • Non-indexed columns used in joins
  • The Unique Primary Index of small tables (less than 1,000 rows per AMP)
  • Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins
  • Primary Index of a Join Index
  • Secondary Indexes defined on any join index
  • Join index columns that frequently appear in WHERE search conditions

The first time you collect statistics, you collect them at the index or column level. After that, you just collect statistics at the table level, and all columns and indexes previously collected on are collected again. It is a mistake to collect statistics only once and then never do it again. COLLECT STATISTICS each time a table's data changes by 10%.

Why Collect Statistics?

What does collect statistics do to help the PE come up with a better plan?

•   Access Path – The PE will easily choose and use any Primary Index access (UPI or NUPI) and it will also easily choose a Unique Secondary Index (USI), but statistics really help the PE decide whether or not to do a Full Table Scan or use a Non-Unique Secondary Index (NUSI) or if it can use multiple NUSI's ANDed together to perform a NUSI bitmap.

•   Join Method – When you collect statistics it gives Teradata a better idea whether or not to do a merge join, product join, hash join or nested join.

•   Join Geography – When two rows are joined together they must physically be located on the same AMP. The only way that this happens naturally is if the join column (PK/FK) is the Primary Index of both tables. Most of the time this is not the case and Teradata must decide the Join Geography of how it will relocate the rows to co-locate them on the same AMP. Will it redistribute (rehash by the join column) one or both of the tables, or will it duplicate the smaller table across all AMPs? A redistribution or duplication are the paths to co-location.

•   Join Order – All joins are performed two tables at a time. What will be the best order to join the tables together? When two or more tables are involved this becomes very important.

It is the access path, the join method, the join geography, and the join order that make statistics collection so vital to all Teradata systems.

How do you know if Statistics were collected on a Table?

Syntax: HELP Statistics <Table Name>

[Figure: HELP STATISTICS output]

 

The HELP Statistics command will show you what statistics have been collected or specifically tell you that no statistics were collected on the table.

A Huge Hint that No Statistics Have Been Collected

EXPLAIN SELECT * FROM New_Employee_Table ;

3) We do an all-AMPs RETRIEVE step from SQL_CLASS.New_Employee_Table by way of an all-rows scan with no residual conditions into Spool1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 12 rows (684 bytes). The estimated time for this step is 0.03 seconds.

 

COLLECT STATISTICS ON New_Employee_Table
Column Employee_No ;
 
EXPLAIN SELECT * FROM New_Employee_Table ;

3) We do an all-AMPs RETRIEVE step from SQL_CLASS.New_Employee_Table by way of an all-rows scan with no residual conditions into Spool1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 9 rows (513 bytes). The estimated time for this step is 0.03 seconds.

If you run an Explain on a query and the row estimate has No Confidence or Low Confidence, then that is a sign that no statistics were collected. Notice how the Explain above changed to High Confidence after we collected statistics on the table.

The Basic Syntax for COLLECT STATISTICS

Here is the syntax for collecting on columns and indexes.

COLLECT STATISTICS on <Tablename>
COLUMN <Column Name> ;

COLLECT STATISTICS on <Tablename>
INDEX (<Column Name(s)>) ;

Here are three actual examples

[Figure: three actual COLLECT STATISTICS examples]

The example commands above provide good fundamentals and concepts to follow.

COLLECT STATISTICS Examples for a better Understanding

[Figure: COLLECT STATISTICS examples]

The New Teradata V14 Way to Collect Statistics

In previous versions, Teradata required you to Collect Statistics for each column separately, thus performing a full table scan each time. Those days are over!

Old Way:

COLLECT STATISTICS COLUMN
(First_Name, Last_Name)
ON Employee_Table ;

COLLECT STATISTICS COLUMN
(First_Name)
ON Employee_Table ;

COLLECT STATISTICS COLUMN
(Dept_No)
ON Employee_Table ;

New Teradata V14 Way:

COLLECT STATISTICS
  COLUMN(First_Name, Last_Name)
, COLUMN(First_Name)
, COLUMN(Dept_No)
ON Employee_Table ;
 

 

The new way to collect statistics in Teradata V14 is to do it all at the same time. This is a much better strategy. Only a single table scan is required, instead of 3 table scans using the old approach. This is an incredible improvement.

COLLECT STATISTICS Directly From another Table

 

CREATE TABLE Stats_Test AS
(Select * from Employee_Table)
WITH DATA AND STATISTICS ;

Data and Statistics have been copied


 

In Teradata V13 and above, you can Collect Statistics directly from another table.

Where Does Teradata Keep the Collected Statistics?

USER DBC in the Data Dictionary Tables


The Collect Statistics information is kept in user DBC in the Data Dictionary. The rows are spread evenly across all AMPs in three tables:

  1. DBC.Indexes (for multi-column indexes only)
  2. DBC.TVFields (for all columns and single-column indexes)
  3. DBC.StatsTbl (Teradata V14 and beyond)

DBC is the most powerful user and DBC owns the Data Dictionary (DD) so it makes sense that DBC will house the statistics in DBC tables. In V14 the statistics are housed in the DBC.StatsTbl relieving the contention for the DBC.Indexes and TVFields tables.

The Official Syntax for COLLECT STATISTICS

Syntax 1

 

COLLECT STATISTICS [ USING SAMPLE ]
ON [ TEMPORARY ] { <table-name> | <join-index-name> | <hash-index-name> }

  [ COLUMN { <column-name> | (<column-list>) }

| [ UNIQUE ] INDEX { <index-name> [ ALL ] | (<column-list>) } [ ORDER BY { HASH | VALUES } [ <column-name> ] ]   ] ;

 

Syntax 2

 

COLLECT STATISTICS [ USING SAMPLE ]
[ COLUMN { <column-name> | (<column-list>) }

| [ UNIQUE ] INDEX { <index-name> [ ALL ] | (<column-list>) } [ ORDER BY { HASH | VALUES } [ <column-name> ] ]   ]

ON [ TEMPORARY ] { <table-name> | <join-index-name> | <hash-index-name> } ;

How to Recollect STATISTICS on a Table

Here is the syntax for re-collecting statistics on a table:

COLLECT STATISTICS ON <Tablename> ;

 

Below is an actual example

[Figure: COLLECT STATISTICS re-collection example]

 

The first time you collect statistics, you do it for each individual column or index that you want to collect on. When a table changes its data by 10% due to Inserts, Updates, or Deletes you merely use the command above, and it re-collects on the same columns and indexes previously collected on.

Teradata Always Does a Random AMP Sample

 

SELECT *
FROM Sales_Table
WHERE Product_ID = 1000


The Parsing Engine will hash the Table ID for a table being queried and then use the Hash Map to determine which AMP will be assigned to do a Random AMP Sample for this table.

Remember that a Random AMP sample only applies to indexed columns and table row counts.

 

In the old days, Teradata did a Random AMP Sample only when statistics were not collected, but these days Teradata always does a Random AMP Sample before placing the Table Header inside each AMP's FSG Cache. This allows Teradata to compare the sampled statistics with collected statistics to determine if the collected statistics are old and stale. If the statistics are determined to be out of date, then the Random AMP Sample is used.

Random Sample is Kept in the Table Header in FSG Cache

Parsing Engine

If the Sales_Table Header is not in FSG cache, I need one AMP to do a Random Sample of the table's indexes and then share it.

Then, check the Data Dictionary to see if statistics were collected.

If NO statistics were collected, then use the random sample.

If statistics were collected, then compare the random AMP sample to the collected statistics to determine if statistics are stale.

If statistics are stale, use the random sample.


Teradata compares the collected statistics to a Random AMP Sample (obtained by sampling a single AMP before placing the Table Header in FSG Cache). This compare determines if the statistics will be used or if they should be replaced by the sample.

Multiple Random AMP Samplings

The PE does Random AMP Sampling based on the Table ID. The Table ID is hashed, and that AMP is always selected as the sampled AMP for that table. This assures that no single AMP will be tasked for too many tables, but if the table is badly skewed this can confuse the PE.

So now, more than one AMP can be sampled when generating row counts for a query plan for much better estimations on row count, row size, and rows per value estimates per table.

In the DBS Control area, field 65 can now set the standard for how AMPs are sampled.

65. RandomAmpSampling – this field determines the number of AMPs to be sampled for getting the row estimates of a table. The valid values are D, L, M, N or A.

  • D - The default is one AMP sampling (D is the default unless changed.)
  • L - Maximum of two AMPs sampling
  • M - Maximum of five AMPs sampling
  • N - Node Level sampling (all the AMPs in a node are sampled).
  • A - System Level sampling (all the AMPs in the system are sampled).

Multiple AMPs can now be used for the random AMP sample, so a higher number of AMPs sampled will provide better estimates to counter skewed results. But it can cause short-running queries to run slower just so long-running queries can run faster.

How a Random AMP gets a Table Row count

For row counts, read 1 or 2 cylinders from 1 (or more) AMPs. Calculate the number of rows in the table by taking the:

  1. Average Number of Rows per Block in the sampled Cylinder(s)
  2. Multiply this by the Number of Data Blocks in the sampled Cylinder(s)
  3. Multiply this by the Number of Cylinders for this table on the AMP(s)
  4. Multiply this by the Number of AMPs in the system


If a table (or index subtable) spans more than 1 cylinder, it will sample the first and the last cylinder. If it fits into 1 cylinder, it will only sample that one cylinder. This way, the AMP can then do some simple math to estimate the total row count for a table.
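The four-step arithmetic above can be sketched as follows. All of the sample figures are hypothetical, chosen only to show the multiplication.

```python
# Sketch of the random-AMP row-count projection described above:
# average rows per block x blocks per cylinder x cylinders on the
# sampled AMP x number of AMPs in the system.

def estimate_table_rows(avg_rows_per_block, blocks_per_cylinder,
                        cylinders_on_amp, total_amps):
    """Multiply the four sampled factors to project a total row count."""
    return (avg_rows_per_block * blocks_per_cylinder
            * cylinders_on_amp * total_amps)

# Hypothetical sample: 500 rows/block, 20 blocks/cylinder,
# 4 cylinders for this table on the sampled AMP, 10 AMPs in the system.
print(estimate_table_rows(500, 20, 4, 10))  # 400000
```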

Random AMP Estimates for NUSI Secondary Indexes

For Non-Unique Secondary Indexes (NUSI) estimates, read 1 or 2 cylinders from the NUSI subtable and then count the number of NUSI values in the cylinder(s).

The table row count is divided by the NUSI row counts to get a Rows Per NUSI Value.

It also assumes the number of distinct values on one AMP = total distinct values.


The Random AMP reads a couple of cylinders and some NUSI data blocks and then does some simple math to estimate the Rows Per NUSI Value. The PE then knows how strong or weak the WHERE clause is using the NUSI and if it should even use the NUSI. This is the most important decision for the Parsing Engine. Should it just do a Full Table Scan or use the NUSI? That is the biggest reason the PE needs statistics. That is why you should always collect statistics on all NUSI indexes.
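That division can be sketched in one line; the figures below are hypothetical.

```python
# Sketch of the rows-per-NUSI-value estimate: the projected table row
# count divided by the projected number of distinct NUSI values.

def rows_per_nusi_value(table_row_estimate, distinct_nusi_values):
    """High result -> weak NUSI (favors a full table scan);
    low result -> strong NUSI (favors using the index)."""
    return table_row_estimate / distinct_nusi_values

# 400,000 estimated rows and 50 distinct values seen in the sampled
# NUSI cylinder(s) -> about 8,000 rows per value: a weak NUSI.
print(rows_per_nusi_value(400_000, 50))  # 8000.0
```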

USI Random AMP Samples are Not Considered

Random AMP sampling assumes that the number of distinct values in a USI equals its cardinality, so it does not read the index subtable for USI equality conditions. Because equality conditions on a unique index return only one row, the PE automatically uses the USI without considering statistics. However, if a USI will frequently be specified in non-equality statements, such as range constraints, then you should collect statistics on the Unique Secondary Index.

 

SELECT *
FROM Sales_Data_Mart
WHERE Sale_Date BETWEEN
‘2013-01-01’ AND ‘2013-01-31’ ;


You really only need to collect statistics on a Unique Secondary Index column if there are a lot of SQL statements on non-equality conditions such as range queries.

There's No Random AMP Estimate for Non-Indexed Columns

For non-indexed columns without statistics, the optimizer uses a fixed formula to estimate the number of rows. This is sometimes referred to as heuristics.

Teradata assumes 10% for one column in an equality condition in the WHERE clause.

Teradata assumes 7.5% for two columns, each in an equality condition and ANDed together.


Teradata does not do a Random AMP Sample for non-indexed columns that are used in the WHERE clause of the SQL. It uses the above for a quick and dirty estimate.
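These fixed percentages (10% for one equality condition, 7.5% for two, and, as the later summary notes, 5% for three) can be sketched as a lookup. The function name and table size are made up for illustration.

```python
# Sketch of the fixed heuristics used for non-indexed columns with no
# statistics: assumed selectivity by number of ANDed equality
# conditions in the WHERE clause (per the text).

HEURISTIC_SELECTIVITY = {1: 0.10, 2: 0.075, 3: 0.05}

def heuristic_row_estimate(table_rows, num_anded_equality_conditions):
    """Return the quick-and-dirty row estimate for the WHERE clause."""
    pct = HEURISTIC_SELECTIVITY[num_anded_equality_conditions]
    return round(table_rows * pct)

print(heuristic_row_estimate(1_000_000, 1))  # 100000
print(heuristic_row_estimate(1_000_000, 2))  # 75000
print(heuristic_row_estimate(1_000_000, 3))  # 50000
```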

A Summary of the PE Plan if No Statistics Were Collected

Today's Teradata systems always perform a random AMP sample even if tables have statistics. Then, they compare the random AMP sample with the statistics to determine if the statistics are stale.

A random AMP is selected for a random sample. Two things happen:

1) Indexes are sampled on the random AMP, and the PE estimates based on the total number of AMPs in the system.

2) If a column in the WHERE clause of the SQL is not an Index, the PE assumes that 10% of the rows will come back. If two columns are in the WHERE clause, then it assumes 7.5% of the rows will come back. If three columns are in the WHERE Clause, it assumes 5%.

 

A random AMP sample is selected by the PE, so if it finds there are no statistics on the table, or if the statistics are old and stale, it has options.

Stale Statistics Detection and Extrapolation

The PE estimates the table row count based on the Primary Index statistics collected for a table (the histogram), but it also does a Random AMP Sample for comparison. If these two metrics differ by more than a 10% threshold (or, for small tables, by 10,000 new rows), then statistics are considered stale.
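One plausible reading of that 10% staleness test, sketched in Python (the small-table 10,000-row variant is omitted for simplicity; the numbers are hypothetical):

```python
# Sketch of the staleness check: statistics are treated as stale when
# the random-AMP row estimate drifts more than 10% away from the row
# count recorded in the collected-statistics histogram.

def stats_are_stale(histogram_rows, sampled_rows, threshold=0.10):
    """True when the sample disagrees with the histogram by > 10%."""
    return abs(sampled_rows - histogram_rows) > histogram_rows * threshold

print(stats_are_stale(100_000, 105_000))  # False - within 10%
print(stats_are_stale(100_000, 120_000))  # True  - 20% drift, stale
```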

Any cardinality estimations will now use extrapolation. What is extrapolation? It means the PE derives its own new estimate based on past statistics history and the random sample, and goes with that estimation.

For small tables with fewer than 25 rows per AMP, or for skewed tables, no extrapolation will be done.


When statistics are determined to be stale, the PE will use the Random AMP sample and also extrapolate which means to estimate new statistics based on historical data.

Extrapolation for Future Dates


EXPLAIN
SELECT *
FROM Order_Table
WHERE Order_Date BETWEEN
'2015-01-01' AND '2015-03-31' ;

Parsing Engine

  1. I only have statistics for 2014.
  2. During the first quarter of 2014 I had 1,000,000 orders.
  3. Orders have been increasing by 10%.
  4. I think I can extrapolate and estimate that there will be 1,100,000 orders in the first quarter of 2015.

 

Above, you can see the PE only has collected statistics for the year 2014, but the Explain is asking about 2015 data. Teradata will extrapolate new estimates for future dates based on history and growth estimates. This allows for better estimates and less concern about statistics collection.
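The PE's reasoning above boils down to simple growth arithmetic, sketched here with the numbers from the example:

```python
# Sketch of date extrapolation: no 2015 statistics exist, so project
# the Q1-2015 count from the Q1-2014 count plus the observed growth.

def extrapolate(past_count, growth_rate):
    """Estimate a future period's count from history plus growth."""
    return round(past_count * (1 + growth_rate))

# 1,000,000 Q1-2014 orders growing 10% -> 1,100,000 estimated Q1-2015.
print(extrapolate(1_000_000, 0.10))  # 1100000
```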

How to Copy a Table with Data and the Statistics

This next example is pretty amazing. Assume that the original Employee_Table had COLLECT STATISTICS on the columns Employee_No and Dept_No. The new table we have called Employee_Table_New will have DDL exactly like the Employee_Table plus data plus the statistics. Yes, the exact same statistics will be copied to the new table. Below is the actual example!

CREATE TABLE Employee_Table_New AS
(SELECT * FROM Employee_Table)
WITH DATA AND STATISTICS ;

The example above will CREATE a new table called Employee_Table_New, and it will have the exact same DDL as the Employee_Table, the exact same data, and the exact same statistics.

How to Copy a Table with NO Data and the Statistics

This next example is clever. Assume that the original Employee_Table had COLLECT STATISTICS on the columns Employee_No and Dept_No. The new table we have called Employee_Table_99 will have DDL exactly like the Employee_Table but NO data. It will have the Statistics, but they will be Zeroed Statistics.

CREATE TABLE Employee_Table_99 AS
(SELECT * FROM Employee_Table)
WITH NO DATA AND STATISTICS ;

You have just Re-Collected Statistics on Employee_Table_99 for the columns Employee_No and Dept_No. The easy re-collection on the columns previously collected on (after the data is loaded) was the entire purpose of getting the Zeroed Statistics in the first place. Make sure you recollect after your data is loaded though!

When to COLLECT STATISTICS Using only a SAMPLE

You might consider Collecting Statistics with SAMPLE if:

  • You are collecting statistics on a very large table.
  • Collecting statistics has become a problem for system performance or cost because the system is so busy.

Don't consider Collecting Statistics with SAMPLE if:

  • The tables are small.
  • You want to replace all existing full-scan Collect Statistics.
  • The column's data is badly skewed.

COLLECT STATISTICS can be very time consuming because it performs a full table scan and then performs a lot of statistical calculations. Because Collect Statistics runs infrequently and benefits query optimization, it is considered a necessary task. Without statistics, query performance will suffer. The bad news about sampled statistics is that they may not be as accurate, which could negatively affect the PE's plans. In most cases, sampled statistics are better than no statistics. Don't use Sample unless necessary!

Examples of COLLECT STATISTICS Using only a SAMPLE

[Figure: COLLECT STATISTICS USING SAMPLE examples]

If you recollect statistics on a sample, it recollects with the same sample amount!

Sampled statistics are generally more accurate for data that is not skewed. For example, columns or indexes that are unique or nearly unique are not skewed. Because the PE needs to be aware of skewed data, you should not collect with sample on skewed data. That is why sampling is generally more appropriate for indexes than non-indexed column(s).

Examples of COLLECT STATISTICS for V14

To collect sample statistics using the system default sample:

COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (Product_ID) ON Sales_Table;

To collect sample statistics by scanning 15 percent of the rows and use 100 intervals:

COLLECT STATISTICS USING SAMPLE 15 PERCENT AND MAXINTERVALS 100 COLUMN (Product_ID) AS Product_Stats ON Sales_Table;

To change sample statistics to 20 percent (for Product_ID) and use 250 intervals:

COLLECT STATISTICS USING SAMPLE 20 PERCENT AND MAXINTERVALS 250 COLUMN (Product_ID) AS Prod_Stats ON Sales_Table;

To display the COLLECT STATISTICS statements for a table:

SHOW STATISTICS ON Sales_Table;

To display statistics details – summary section, high bias values, and intervals:

SHOW STATISTICS VALUES COLUMN Product_ID ON Sales_Table;

How to Collect Statistics on a PPI Table on the Partition

Here is the syntax for collecting statistics on a PPI table.

COLLECT STATISTICS on <Tablename> COLUMN PARTITION;

 

Here is an actual example

COLLECT STATISTICS on Order_Table_PPI COLUMN PARTITION;

 

Three reasons to Collect on the Partition:

  • The Parsing Engine will have a better plan for PPI Tables.
  • This helps the most with Partition Elimination on Range Queries.
  • This is especially helpful when a table has a lot of empty partitions.

 

The Parsing Engine can use this information to better estimate the query cost when there are a significant number of empty partitions. If PARTITION statistics are not collected, empty partitions may cause the Parsing Engine to underestimate the number of rows in a partition. You shouldn't use WITH SAMPLE to collect on Partitions.

Teradata V12 and V13 Statistics Enhancements

In V12, Extrapolate Statistics is designed to more accurately provide a statistical estimate for date range-based queries that specify a “future” date outside the bounds of the current statistics. This results in fewer re-collections.

In V12, Stale Statistics Detection compares the Random AMP Sample with the statistics collected and determines if they are stale, and should not be used.

In V13, Statistics can now be collected on Volatile Tables.

In V13, PARTITION statistic capabilities have been added to Global Temporary Tables.

In V13, Multi-Column statistics are now available on Hash and Join Indexes.

In V13, Sample Statistics are available on Tables, Volatile Tables, Global Temporary Tables, Hash Indexes and Join Indexes, including the Partition Columns.

Teradata V14 Statistics Enhancements

•  There is now a SUMMARY option to collect table-level statistics.

•  SYSTEM SAMPLE option allows the system to determine the sampled system percentage.

•  Sampling options have been enhanced (e.g., SAMPLE n PERCENT).

•  Statistics are stored in DBC.StatsTbl to reduce access contention and improve performance.

•  New views DBC.StatsV, DBC.ColumnStatsV, DBC.MultiColumnStatsV, and IndexStatsV.

•  SHOW STATISTICS statement reports detailed statistics in plain text or XML formatting.

•  Internal PE enhancements for histogram structure and use, including:

•  Storing statistics data in their native Teradata data types without losing precision

•  Enhanced extrapolation methods for stale statistics

•  Maintaining statistics history

 

Teradata V14 now allows you to determine a sampling percentage for sampled statistics. You can even collect/recollect either summary statistics or both full and summary statistics combined. You can now collect statistics on global temporary tables, and you can provide a name for statistics collected while also being able to specify the column ordering for multicolumn statistics. There is also a dedicated statistics cache that is designed to improve query optimization time.

Teradata V14 Summary Statistics

New in Teradata 14.0, table-level statistics known as “summary statistics” are collected whenever column or index statistics are collected. Summary statistics do not cause their own histogram to be built, but rather they create important facts about the table undergoing collection that are held in the new DBC.StatsTbl. Here are some of the items in “summary statistics”:

  • Row count
  • Average block size
  • Block-level compression metrics
  • Temperature

One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count from the table in order to provide more accurate extrapolations.

Here is how you can see the Summary Statistics.

SHOW SUMMARY STATISTICS VALUES ON Employee_Table;

You can specifically request summary statistics for a table, but you never need to do that because each individual statistics collection statement causes summary stats to be gathered (it is a quick process). It is best in V14 to write your collection scripts to do it all in one scan vs. multiple statements.

Teradata V14 MaxValueLength

COLLECT STATISTICS
     USING MAXVALUELENGTH 50
     COLUMN ( Product_Name)
ON Product_Table ;

 

Before V14, whenever you collected statistics, Teradata only placed the first 16 bytes in the statistics. Long names were cut off. Now, the default is 25 bytes, but you can use the MaxValueLength keyword (example above) to specify the length you want.

 

 

MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic. The new default length is 25 bytes, where previously it was 16. If needed, you can specify well over 1000 bytes for a maximum value length. In earlier releases, value sizes were always padded out to the 16-byte limit; now the length can be longer, and no padding is done.

Teradata V14 MaxIntervals

COLLECT STATISTICS
     USING MaxIntervals 500
     COLUMN ( Last_Name)
ON Employee_Table ;

 

Before V14, whenever you collected statistics, Teradata did a full table scan on the values, sorted them, and then placed them into 200 intervals. Now, the default is 250 intervals, but you can specify (example above) the number of intervals you desire.

 

 

Each statistics interval highlights its single most popular value, and the number of rows that carry that value is recorded. The rest of the values in the interval are estimated. By increasing the number of intervals, the optimizer can accurately get a better row count for a greater number of the most popular values. A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on and you want more individual high-row-count values to be represented in the histogram. The range for MaxIntervals is 0 – 500.

Teradata V14 Sample N Percent

COLLECT STATISTICS
     USING Sample 20 Percent
     COLUMN ( Last_Name)
ON Employee_Table ;

USING SAMPLE previously defaulted each time to the system parameter, but now you can specifically state the percent you want for each column or index.

 

 

SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level. Now, different levels of statistics sampling to different columns and indexes can be performed. The better you get at knowing your data and the queries upon them, the more you can specifically use the sampling to better help the PE.

Teradata V14.10 Statistics Collection Improvements

Expressions can now be specified in COLLECT STATS statements.

Expressions include: Substr(), MOD, CONCATENATION, Format Conversion expressions like Uppercase, Lowercase, Data type Conversions, CASE-Expressions, BEGIN and END expressions of PERIOD types.

COLLECT STATISTICS USING SYSTEM SAMPLE
COLUMN (EXTRACT(MONTH FROM Order_Date), Order_Total) AS Month_Plus_Total
ON Order_Table ;

It provides automatic downgrading of Full Stats to Sample Stats when a column is eligible and the system determines that a full collection is better off avoided.

Teradata will detect the spool space required before attempting the statistics collection. If it is determined that there will not be enough space, then Teradata will:

  • Build a new layer of local statistics on one of the AMPs, and then base the global statistics on those local statistics.

It provides statistics re-collection performance improvements to reduce statistics re-collection overhead by automatically skipping the recollection process when not needed.

Above are the features of Teradata V14.10 Statistics Collection Improvements.

Teradata V14.10 Statistics Collection Threshold Examples

The TIME- and CHANGE-based THRESHOLD options can be used to reduce statistics re-collection overhead by automatically skipping the recollection process when not necessary.

For example, if the user sets a TIME THRESHOLD for specific statistics collection to be 5 days, and issues a collect stats after 3 days, the system will not collect statistics.

For example, if a user sets a CHANGE-based THRESHOLD of 10%, and the data has not changed by 10%, the system will not collect statistics. Inserts and Updates are taken into consideration to determine data growth.

 

COLLECT STATISTICS USING THRESHOLD 10 PERCENT AND THRESHOLD 5 DAYS

COLUMN (Customer_Number) ON Customer_Table ;

If a COLLECT STATISTICS statement is issued less than 5 days after the above statement, or if the data has not changed by 10%, the collection will not take place.
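The skip logic described above can be sketched as follows. The function and parameter names are made up; per the text's reading, a re-collection runs only when both thresholds have been met.

```python
# Sketch of the TIME/CHANGE threshold decision: skip the re-collection
# if it is issued under 5 days after the last one OR the data has not
# yet changed by 10% (matching the statement above).

def should_recollect(days_since_last, pct_data_changed,
                     time_threshold_days=5, change_threshold_pct=10):
    """True only when both the time and change thresholds are met."""
    return (days_since_last >= time_threshold_days
            and pct_data_changed >= change_threshold_pct)

print(should_recollect(3, 12))  # False - under the 5-day threshold
print(should_recollect(6, 2))   # False - under the 10% change threshold
print(should_recollect(6, 12))  # True  - both thresholds met, recollect
```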

Here is SQL that will override the above statement:

COLLECT STATISTICS USING NO THRESHOLD
FOR CURRENT COLUMN (Customer_Number) ON Customer_Table ;

Teradata V14.10 AutoStats feature

Teradata V14.10 AutoStats identifies and collects missing statistics that are needed and detects stale statistics for refreshing. If statistics are not being used by the optimizer, they will be removed.

A new repository, a system-supplied database named TDSTATS, stores metadata for all stats collections. It is created by a new process called DIPSTATS.

External stored procedures (XSPs) perform the stats management process.

DBQL has been enhanced to log Optimizer statistics recommendations and usage with a dedicated DBQL logging option. DBQL will analyze query plans that have been logged to recommend new stats or identify unused stats.

Teradata recommends that customers should NOT replace all of their existing stats management procedures, but instead begin experimenting with AutoStats and build upon the experience.

Above are the features of Teradata V14.10 AutoStats.

Teradata Statistics Wizard

The Teradata Statistics Wizard is a graphical tool that automates the collection and re-collection of statistics which helps the DBA to better manage the statistics process.

Now the DBA can specify a workload to be analyzed and receive statistics recommendations.

  • Any database or collection of tables, indexes, or columns can be selected for the collection or re-collection of statistics.
  • Recommendations can be based on table demographics and general heuristics.
  • Execution can be deferred, and collections/re-collections can be scheduled for later.
  • The interval statistics for a column or index can be displayed and modified.
  • Numerous reports are provided on statistics recommendations, update cost analysis, and table usage.

 

 

The Statistics Wizard can be used to help the DBA with statistics.
