One of the key mistakes people make is labeling a database as a data warehouse based solely on its size. Over the past decade, three phenomena have combined to drive major increases in average database size:
The cost of storage relative to the value of the data has decreased.
Companies now value the data as a critical business asset.
Companies have merged into large multi-national entities.
In other words, the cost of keeping data online is cheap, the perceived value of that data is now very high, and the size of companies and their data needs have grown. As such, many of today's OLTP and ODS databases routinely grow into the 100–800 gigabyte (GB) range. But that does not make them data warehouses. For example, SAP and PeopleSoft enterprise resource planning (ERP) databases of 400 GB or more are not uncommon, yet they are not data warehouses, even at these extremely large sizes. Remember, size alone does not a data warehouse make.
The simplest way to avoid mislabeling a large database as a data warehouse is to add some DBA-centric questions and answers to the description of that database. For each subject area in the database, simply ask the physical DBA to provide estimates for the following seven items:
The number of tables
Average big table row count
Average big table size in GB
Largest table's row count
Largest table's size in GB
Largest transaction rollback needed in GB
Largest temporary segment needed in GB
Data warehouses generally have fewer, larger tables, whereas non-data warehouse databases usually possess more, smaller tables. Of additional interest are the temporary and rollback segment needs of the database. Data warehouses tend to need them as large as the largest object (for rebuilds), whereas non-data warehouse databases only need them large enough for the largest transaction.
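These seven estimates are easy to capture in a small structure for comparison. The sketch below is purely illustrative: the class and field names are my own invention, not a standard API, and the final check encodes the rule of thumb above (rollback and temp segments sized for a rebuild of the largest object suggest a warehouse, not an OLTP system):

```python
from dataclasses import dataclass

@dataclass
class SubjectAreaProfile:
    """Hypothetical container for the seven DBA estimates (names are mine)."""
    table_count: int
    avg_big_table_rows: int
    avg_big_table_gb: float
    largest_table_rows: int
    largest_table_gb: float
    rollback_needed_gb: float   # largest rollback needed (e.g., table rebuild)
    temp_needed_gb: float       # largest temp segment needed (e.g., index rebuild)

    def segments_sized_for_rebuilds(self) -> bool:
        # Warehouses need rollback roughly as large as the largest object;
        # OLTP/ODS databases only need room for the largest transaction.
        return self.rollback_needed_gb >= self.largest_table_gb
```

With warehouse-sized figures such as a 160 GB rollback need against a 160 GB largest table, the check returns `True`; with an OLTP profile of a few hundred MB of rollback against tens of GB of data, it returns `False`.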
Use the criteria outlined in Table 1-2 for your evaluation.
| | OLTP | ODS | OLAP | DM/DW |
|---|---|---|---|---|
| Number of Tables | 100s–1,000s | 100s–1,000s | 10s–100s | 10s–100s |
| Average Table's Row Count | 10s of thousands | 10s of thousands | 10s–100s of millions | 100s–1,000s of millions |
| Average Table's Size | 10s of MB | 10s of MB | 10s of GB | 10s–100s of GB |
| Largest Table's Row Count | 10s–100s of millions | 10s–100s of millions | 10s–100s of millions | 100s–10,000s of millions |
| Largest Table's Size | 10s of GB | 10s of GB | 10s of GB | 10s–100s of GB |
| Rollback Segment's Size | 100s of MB | 100s of MB | N/A | 10s–100s of GB |
| Temp Segment's Size | 100s of MB | 100s of MB | N/A | 10s–100s of GB |
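Table 1-2's bands can even be turned into a rough classifier. This is only a sketch under my own simplifying assumptions: the function name and the exact cutoffs are mine, chosen to approximate the table's order-of-magnitude ranges, and a real evaluation would weigh all seven items for each subject area.

```python
def looks_like_data_warehouse(table_count: int,
                              avg_big_table_rows: int,
                              largest_table_rows: int,
                              rollback_gb: float,
                              largest_table_gb: float) -> bool:
    """Rough classifier approximating Table 1-2 (cutoffs are my own)."""
    signals = 0
    if table_count < 100:                    # DW/DM: 10s of tables, not 1,000s
        signals += 1
    if avg_big_table_rows >= 100_000_000:    # 100s of millions of rows on average
        signals += 1
    if largest_table_rows >= 1_000_000_000:  # billions of rows in the largest table
        signals += 1
    if rollback_gb >= largest_table_gb:      # segments sized for rebuilds, not transactions
        signals += 1
    return signals >= 3                      # most indicators point at a warehouse

# The example figures from this section: 8 tables, 500 million rows in the
# average big table, 2 billion in the largest, 160 GB rollback vs. a 160 GB table.
print(looks_like_data_warehouse(8, 500_000_000, 2_000_000_000, 160.0, 160.0))  # True
```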
Continuing with our previous example, suppose your requirements are as follows:
8 tables
500 million rows per big table
50 GB per big table
2 billion rows for largest table
160 GB for largest table
160 GB to rebuild largest table
60 GB to rebuild largest index
From this example, we can again discern that we have a data mart or data warehouse. First, we have very few tables; a typical OLTP or ERP database would have hundreds or even thousands of tables. Second, the row counts of our average big table and our largest table have the right order of magnitude: row counts expressed with lots of zeros, or as powers of ten with exponents of ten or more (e.g., 10^10), are more likely to be found in data warehouses. Finally, look at our rollback and temporary segment needs: they're as big as some entire databases!
While it may seem that I've once again tailored the example to fit the conclusion, I've actually found the process to be this straightforward in most real cases. Unfortunately, these days people tend to call any very large database a data warehouse. Once again, it's fine for people to call their projects whatever they like. But as pointed out, the techniques in this book apply only to the DM/DW column of Table 1-2.