Chapter 6
Dimensional Modeling Fundamentals

Building on the bus architecture foundation established in Chapter 5, it's time to delve into the basics of dimensional modeling. This chapter begins with an overview of fact and dimension tables, along with the fundamental activities of drilling down, drilling across, and handling time in the data warehouse. Graceful modifications to existing dimensional models are also described.

We then turn our attention to dimensional modeling Dos and Don'ts. Finally, we discuss common myths and misunderstandings about dimensional modeling.

Basics of Dimensional Modeling

This first set of articles describes the fundamental constructs of a dimensional model.

6.1 Fact Tables and Dimension Tables

Ralph Kimball, Intelligent Enterprise, Jan 1, 2003

Dimensional modeling is a design discipline that straddles the formal relational model and the engineering realities of text and number data. Compared to third normal form entity-relationship modeling, it's less rigorous (allowing the designer more discretion in organizing the tables), but more practical because it accommodates database complexity and improves performance. Dimensional modeling has an extensive portfolio of techniques for handling real-world situations.

Measurements and Context

Dimensional modeling begins by dividing the world into measurements and context. Measurements are usually numeric and taken repeatedly. Numeric measurements are facts. Facts are always surrounded by mostly textual context that's true at the moment the fact is recorded. Facts are very specific, well-defined numeric attributes. By contrast, the context surrounding the facts is open ended and verbose. It's not uncommon for the designer to add context to a set of facts partway through the implementation.

Although you could lump all context into a wide, logical record associated with each measured fact, you'll usually find it convenient and intuitive to divide the context into independent logical clumps. When you record facts—dollar sales for a grocery store purchase of an individual product, for example—you naturally divide the context into clumps named product, store, time, customer, clerk, and several others. We call these logical clumps dimensions and assume informally that these dimensions are independent. Figure 6.1 shows the dimensional model for a typical grocery store fact.

Figure 6.1 A dimensional model for grocery store sales: a retail sales fact table linked to date, store, customer, product, promotion, and clerk dimension tables.

In truth, dimensions rarely are completely independent in a strong statistical sense. In the grocery store example, customer and store clearly will show a statistical correlation. But it's usually the right decision to model customer and store as separate dimensions. A single, combined dimension would likely be unwieldy with tens of millions of rows. And the record of when a given customer shopped in a given store would be expressed more naturally in a fact table that also showed the date dimension.

The assumption of dimension independence would mean that all the dimensions, such as product, store, and customer, are independent of time. But you have to account for the slow, episodic change of these dimensions in the way you handle them. In effect, as keepers of the data warehouse, we have taken a pledge to faithfully represent these changes. This predicament gives rise to the technique of slowly changing dimensions.

Dimensional Keys

If the facts are truly measures taken repeatedly, you find that fact tables always create a characteristic many-to-many relationship among the dimensions. Many customers buy many products in many stores at many times.

Therefore, you logically model measurements as fact tables with multiple foreign keys referring to the contextual entities. Each contextual entity becomes a dimension table with a single primary key, as shown for the customer dimension in Figure 6.1. Although you can separate the logical design from the physical design, in a relational database fact tables and dimension tables are most often explicit tables.

Actually, a real relational database has two levels of physical design. At the higher level, tables are explicitly declared together with their fields and keys. The lower level of physical design describes the way the bits are organized on the disk and in memory. Not only is this design highly dependent on the particular database, but some implementations may even “invert” the database beneath the level of table declarations and store the bits in ways that are not directly related to the higher level physical records. What follows is a discussion of the higher level physical design only.

A fact table in a dimensional star schema consists of multiple foreign keys, each paired with a primary key in a dimension, together with the facts containing the measurements. In Figure 6.1, the foreign keys in the fact table are labeled FK, and the primary keys in the dimension tables are labeled PK, as shown for the customer dimension. (The field labeled DD, special degenerate dimension key, is discussed later in this article.)

I insist that the foreign keys in the fact table obey referential integrity with respect to the primary keys in their respective dimensions. In other words, every foreign key in the fact table has a match to a unique primary key in the respective dimension. Note that this design allows the dimension table to possess primary keys that aren't found in the fact table. Therefore, a product dimension table might be paired with a sales fact table in which some of the products are never sold. This situation is perfectly consistent with referential integrity and proper dimensional modeling.

In the real world, there are many compelling reasons to build the FK-PK pairs as surrogate keys that are just sequentially assigned integers. It's a major mistake to build data warehouse keys out of the natural keys that come from the underlying operational data sources.

Occasionally a perfectly legitimate measurement will involve a missing dimension. Perhaps in some situations a product can be sold to a customer in a transaction without a store defined. In this case, rather than attempting to store a null value in the store FK, you build a special record in the store dimension representing “No Store.” Now this condition has a perfectly normal FK-PK representation in the fact table.
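
To make the FK-PK structure concrete, here is a minimal DDL sketch of part of the Figure 6.1 schema, using hypothetical table and column names. The surrogate keys are sequentially assigned integers, and a special row in the store dimension represents the "No Store" condition. The other dimension tables follow the same pattern, and referential integrity between the fact table and its dimensions can be enforced either with declared foreign key constraints or by the ETL system.

  -- Hypothetical store dimension with a sequentially assigned surrogate key
  CREATE TABLE store_dim (
    store_key  INTEGER     NOT NULL PRIMARY KEY,   -- surrogate key (PK)
    store_name VARCHAR(50) NOT NULL,
    district   VARCHAR(30),
    region     VARCHAR(30)
  );

  -- Special row standing in for measurements that have no store defined
  INSERT INTO store_dim (store_key, store_name, district, region)
  VALUES (0, 'No Store', 'Not Applicable', 'Not Applicable');

  -- Fact table at the grain of one line item on a sales ticket
  CREATE TABLE retail_sales_facts (
    date_key      INTEGER      NOT NULL,   -- FK to date_dim
    store_key     INTEGER      NOT NULL,   -- FK to store_dim; 0 when no store applies
    customer_key  INTEGER      NOT NULL,   -- FK to customer_dim
    product_key   INTEGER      NOT NULL,   -- FK to product_dim
    promotion_key INTEGER      NOT NULL,   -- FK to promotion_dim
    clerk_key     INTEGER      NOT NULL,   -- FK to clerk_dim
    ticket_number VARCHAR(20)  NOT NULL,   -- degenerate dimension (DD)
    unit_sales    INTEGER,
    dollar_sales  DECIMAL(9,2)
  );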

Theoretically, a fact table doesn't need a primary key because, depending on the information available, two different legitimate observations could be represented identically. Practically speaking, this is a terrible idea because normal SQL makes it very hard to select one of the records without selecting the other. It would also be hard to check data quality if multiple records were indistinguishable from each other.

Relating the Two Modeling Worlds

Dimensional models are full-fledged relational models, where the fact table is in third normal form and the dimension tables are in second normal form, confusingly referred to as denormalized. Remember that the chief difference between second and third normal forms is that repeated entries are removed from a second normal form table and placed in their own “snowflake.” Thus the act of removing the context from a fact record and creating dimension tables places the fact table in third normal form.

I resist the urge to further snowflake the dimension tables and am content to leave them in flat second normal form because the flat tables are much more efficient to query. In particular, dimension attributes with many repeated values are perfect targets for bitmap indexes. Snowflaking a dimension into third normal form, while not incorrect, destroys the ability to use bitmap indexes and increases the user-perceived complexity of the design. Remember that in the presentation area of the data warehouse, you don't have to worry about enforcing many-to-one data rules in the physical table design by demanding snowflaked dimensions. The ETL staging system has already enforced those rules.
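
As a small illustration (Oracle-style syntax; other databases expose bitmap indexes differently, if at all), a low cardinality attribute in a flat dimension can be indexed directly:

  -- Brand repeats across many product rows, making it a good bitmap index candidate
  CREATE BITMAP INDEX product_brand_bmx ON product_dim (brand);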

Declaring the Grain

Although theoretically any mixture of measured facts could be shoehorned into a single table, a proper dimensional design allows only facts of a uniform grain (the same dimensionality) to coexist in a single fact table. Uniform grain guarantees that all the dimensions are used with all the fact records (keeping in mind the “No Store” example) and greatly reduces the possibility of application errors due to combining data at different grains. For example, it's usually meaningless to blithely add daily data to yearly data. When you have facts at two different grains, place the facts in separate tables.

Additive Facts

At the heart of every fact table is the list of facts that represent the measurements. Because most fact tables are huge, with millions or even billions of rows, you almost never fetch a single record into your answer set. Rather, you fetch a very large number of records, which you compress into digestible form by adding, counting, averaging, or taking the min or max. But for practical purposes, the most common choice, by far, is adding. Applications are simpler if they store facts in an additive format as often as possible. Thus, in the grocery example, you don't need to store the unit price. You merely compute the unit price by dividing the dollar sales by the unit sales whenever necessary.
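
For example, using the hypothetical grocery tables sketched earlier, the unit price never needs to be stored; it is derived at report time from the two additive facts:

  -- Additive facts are summed; unit price is computed from the sums
  SELECT p.brand,
         SUM(f.dollar_sales)                      AS dollar_sales,
         SUM(f.unit_sales)                        AS unit_sales,
         SUM(f.dollar_sales) / SUM(f.unit_sales)  AS average_unit_price
  FROM retail_sales_facts f
  JOIN product_dim p ON f.product_key = p.product_key
  GROUP BY p.brand;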

Some facts, like bank balances and inventory levels, represent intensities that are awkward to express in an additive format. You can treat these semi-additive facts as if they were additive—but just before presenting the results to the business user, divide the answer by the number of time periods to get the right result. This technique is called averaging over time.
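
A minimal sketch of averaging over time, assuming a hypothetical daily account balance fact table: the semi-additive balance is summed across the period and then divided by the number of days in that period.

  -- Average daily balance for January; date keys shown as readable integers
  SELECT a.account_number,
         SUM(f.balance) / COUNT(DISTINCT f.date_key) AS average_daily_balance
  FROM account_balance_facts f
  JOIN account_dim a ON f.account_key = a.account_key
  WHERE f.date_key BETWEEN 20030101 AND 20030131
  GROUP BY a.account_number;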

Some perfectly good fact tables represent measurement events with no facts, so we call them factless fact tables. The classic example of a factless fact table is a record representing a student attending a class on a specific day. The dimensions are day, student, professor, course, and location, but there are no obvious numeric facts. The tuition paid and grade received are good facts, but not at the grain of the daily attendance.

Degenerate Dimensions

In many modeling situations where the grain is a child, the natural key of the parent header winds up as an orphan in the design. In the Figure 6.1 grocery example, the grain is the line item on a sales ticket, but the ticket number is the natural key of the parent ticket. Because you have systematically stripped off the ticket context as dimensions, the ticket number is left exposed without any attributes of its own. You model this reality by placing the ticket number by itself right in the fact table. We call this key a degenerate dimension. The ticket number is useful because it's the glue that holds the child records together.

6.2 Drilling Down, Up, and Across

Ralph Kimball, DBMS, Mar 1996

In data warehouse applications we often talk about drilling down, and occasionally we talk about reversing the process and drilling up. It is time for us as an industry to be more consistent and more precise with our vocabulary concerning drilling.

Drilling Down

Drilling down is the oldest and most venerable kind of drilling in a data warehouse. Drilling down means nothing more than “give me more detail.” In our standard dimensional schema, the attributes in the dimension tables play a crucial role. These attributes are textual (or behave like text), take on discrete values, and are the source of application constraints and grouping columns in the final reports. In fact, you can always imagine creating a grouping column in a report by opportunistically dragging a dimension attribute from any of the dimension tables down into the report, thereby making it a grouping column, as shown in Figure 6.2. The beauty of the dimensional model is that all dimension attributes can become grouping columns. The process of adding grouping columns can be compounded with as many grouping columns from as many dimension tables as the user wishes. The great strength of SQL is that these grouping columns simply get added to the SELECT list and the GROUP BY clause and the right thing happens. Usually you add these grouping columns to the ORDER BY clause also so that you get the grouping in a prescribed order.
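
In SQL terms, the drag-and-drop in Figure 6.2 amounts to the following query (table and column names are hypothetical); dragging in the district attribute later simply adds it to the select list, the GROUP BY clause, and the ORDER BY clause:

  -- Report grouped by brand and district, as assembled in Figure 6.2
  SELECT p.brand, s.district, SUM(f.dollar_sales) AS sales_dollars
  FROM sales_facts f
  JOIN product_dim p ON f.product_key = p.product_key
  JOIN store_dim   s ON f.store_key   = s.store_key
  GROUP BY p.brand, s.district
  ORDER BY p.brand, s.district;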

Figure 6.2 Dragging dimension attributes (brand from the product dimension, district from the store dimension) and facts (sales dollars) into a report.

From this discussion, you can see that the precise definition of drilling down is “add a grouping column.” A few query tool vendors have tried to be overly helpful and implemented a drill-down command in their user interfaces that adds specific grouping columns, usually from the “product hierarchy.” For instance, the first time you press the drill-down button, you add a category attribute. The next time you use the button, you add the subcategory attribute, and then the brand attribute. Finally, you add the detailed product description attribute to the bottom of the product hierarchy. This is very limiting and often not what the user wants. Not only does real drill down mix both hierarchical and nonhierarchical attributes from all the available dimensions, but there is no such thing as a single obvious hierarchy in a business in any case.

It may happen that there is more than one well-defined hierarchy in a given dimension. In some companies, marketing and finance have incompatible and different views of the product hierarchy. Although you might wish that there were only a single product hierarchy, all the marketing-defined attributes and all the financially defined attributes are included in the detailed master product table illustrated in Figure 6.3. The user must be allowed to traverse any hierarchy and choose unrelated attributes that are not part of the hierarchy.

Figure 6.3 A product dimension table with both marketing and finance attributes.

Large customer dimension tables often have three simultaneous hierarchies. If the grain of the customer table is the ship-to location, you automatically have a geographic hierarchy defined by the customer's address. You probably also have a hierarchy that is defined by the customer's organization, such as division and corporation. Finally, you may have your own sales hierarchy based on the sales rep assignment to the customer ship-to address. This sales hierarchy could be organized by sales territory, sales zone, and sales region. A richly defined customer table could have all three hierarchies happily coexisting, and awaiting all possible flavors of user drill down.

Drilling Up

If drilling down is adding grouping columns from the dimension tables, then drilling up is subtracting grouping columns. Of course, it is not necessary to subtract the grouping columns in the same order that they were added. In general, each time the user adds or subtracts a grouping column, a new multi-table join query must be launched. If you have an aggregate navigator, as described in article 12.17, The Aggregate Navigator, then each multi-table join query smoothly seeks its proper level in the space of explicitly stored aggregates. In a properly tuned data warehouse, there is little difference in performance between bringing back 1000 answer set rows at a high level of aggregation and bringing back 1000 answer set rows at a low level of aggregation.

Drilling Across

If drilling down is requesting ever finer and more granular data from the same fact table, then drilling across is the process of linking two or more fact tables at the same granularity, or, in other words, tables with the same set of grouping columns and dimensional constraints. Drilling across is a valuable technique whenever a business has several fundamental business processes that can be arranged in a value chain. Each business process gets its own separate fact table. For example, almost all manufacturers have an obvious value chain representing the demand side of their businesses consisting of finished goods inventory, orders, shipments, customer inventory, and customer sales, as shown in Figure 6.4. The product and time dimensions thread through all of these fact tables. Some dimensions, such as customer, thread through some, but not all of the fact tables. For instance, customer does not apply to finished goods inventory.

Figure 6.4 Separate fact tables for each business process share common dimensions.

A drill-across report can be created by using grouping columns that apply to all the fact tables used in the report. Thus in our manufacturing value chain example, attributes may be freely chosen from the product and time dimension tables because they make sense for every fact table. Attributes from customer can only be used as grouping columns if we avoid touching the finished goods inventory fact table. When a dimension is used with multiple fact tables, each of those fact tables should link to the same dimension table (or to identical copies of it). When dimensions share common fields with each of the fact tables in this way, we say that these dimensions are conformed across the fact tables in our value chain.

After building the grouping columns and additive fact columns, you must launch the report's query one fact table at a time, and assemble the report by performing an outer join of the separate answer sets on the grouping columns. This outer join must be performed by the requesting client tool, not the database. You must never try to launch a single SQL SELECT statement that refers to more than one fact table. You will lose control of performance to our friend, the cost-based optimizer. Note that the necessary outer join assembles the final report, column by column. You cannot solve this with SQL UNION, which assembles reports row by row.
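
A minimal sketch of the multi-pass approach, with hypothetical value chain tables: each request touches exactly one fact table and returns the same row headers, and the client tool then outer joins the answer sets on those row headers.

  -- Pass 1: shipments fact table
  SELECT p.brand, d.month_name, SUM(s.quantity_shipped) AS quantity_shipped
  FROM shipments_facts s
  JOIN product_dim p ON s.product_key = p.product_key
  JOIN date_dim    d ON s.date_key    = d.date_key
  GROUP BY p.brand, d.month_name;

  -- Pass 2: customer sales fact table, using exactly the same row headers
  SELECT p.brand, d.month_name, SUM(c.dollar_sales) AS dollar_sales
  FROM customer_sales_facts c
  JOIN product_dim p ON c.product_key = p.product_key
  JOIN date_dim    d ON c.date_key    = d.date_key
  GROUP BY p.brand, d.month_name;

  -- The requesting client tool outer joins the two answer sets on (brand, month_name),
  -- producing one row per brand and month carrying both quantity_shipped and dollar_sales.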

Some of you may be wondering why each business process is modeled with its own separate fact table. Why not combine all of the processes together into a single fact table? Unfortunately, this is impossible for several reasons. Most important, the separate fact tables in the value chain do not share all the dimensions. You simply can't put the customer ship-to dimension on the finished goods inventory data. A second reason is that each fact table possesses different facts, and the fact table records are recorded at different times along the value chain.

Once you have set up multiple fact tables for drilling across, you can certainly drill up and down at the same time. In this case, you take the whole value chain and simultaneously ask all the fact tables for more granular data (drill down) or less granular data (drill up).

6.3 The Soul of the Data Warehouse, Part One: Drilling Down

Ralph Kimball, Intelligent Enterprise, Mar 20, 2003

Although data warehouses come in many shapes and sizes and deal with many different subject areas, every data warehouse must embody a few fundamental themes. The three most important are drilling down, drilling across, and handling time. Modern data warehouses so deeply embed these three themes that I think an “if-and-only-if” relationship has developed between them and a real data warehouse. If a system supports drilling down, drilling across, and handling time, then as long as it's easy to use and runs fast, it automatically qualifies as a data warehouse. But as simple as these three themes might seem, they give rise to a set of detailed and powerful architectural guidelines that should not be compromised.

In this article, I drill down into drilling down, starting with a precise operational definition. Then, as a good engineer should, I lay out practical guidelines for building systems that do a good job of drilling down.

Drilling down in a relational database means “adding a row header” to an existing SELECT statement. For instance, if you're analyzing the sales of products at a manufacturer level, the select list of the query reads SELECT MANUFACTURER, SUM(SALES). Of course the rest of the query contains join specifications and constraints on other parts of the database, such as time and geography. If you wish to drill down on the list of manufacturers to show the brands sold, you add the brand row header: SELECT MANUFACTURER, BRAND, SUM(SALES). Now each manufacturer row expands into multiple rows listing all the brands sold. This is the essence of drilling down.

Incidentally, we often call a row header a “grouping column” because everything in the select list that's not aggregated with an operator such as SUM must be mentioned in the SQL GROUP BY clause. So the GROUP BY clause in the second query reads GROUP BY MANUFACTURER, BRAND. Row headers and grouping columns are the same thing.

This example is particularly simple because it's very likely that, in a dimensional schema, both the manufacturer and brand attributes exist in the same product dimension table. So, after running the first query at the manufacturer level, you look at the list of attributes in the product dimension and opportunistically drag the brand attribute into the query. Then you run it again, thereby drilling down in an ad hoc way. If the brand attribute is indeed in the same dimension table as the manufacturer attribute, then the only adjustments to the original SQL are to add brand to the select list and the GROUP BY clause.

You could just as well have selected the color attribute for drilling down rather than the brand attribute. In fact, if you substitute the color attribute for brand in the preceding paragraphs, they would be just as valid. This exercise powerfully illustrates the fact that drilling down has nothing to do with descending a predetermined hierarchy. In fact, once you understand this concept, you see that you can drill down using any attribute drawn from any dimension! You could just as well have drilled down on the weekday from the time dimension; the preceding discussion of the select list and the GROUP BY clause would still be identical.

The idea that you can expand any report row to show more detail simply by adding a new row header is one of the powerful ideas that form the soul of a data warehouse. A good data warehouse designer should always be thinking of additional drill-down paths to add to an existing environment. An example of this out-of-the-box thinking is to add an audit dimension to a fact table, as described in article 11.20, Indicators of Quality: The Audit Dimension. The audit dimension contains indicators of data quality in the fact table, such as “data element out of bounds.” But this audit dimension can be part of the drill-down process! Now you can devise a standard report to drill down to issues of data quality, including the proportion of questionable data. By drilling down on data quality, each row of the original report would appear as multiple rows, each with a different data quality indicator. Hopefully, most of the report results should cluster under the “normal” row headers.

Finally, it's even possible to drill down with a calculation, as long as you're careful not to use an aggregate operator such as SUM in the calculated quantity. You could drill down on the price point of the manufacturer's sales by adding SALES/QUANTITY to the select list, where this price point calculation contains no aggregate operator. SALES and QUANTITY are both numeric facts in the fact table. The select list now reads SELECT MANUFACTURER, SALES/QUANTITY, SUM(SALES). You must also add SALES/QUANTITY to the GROUP BY clause. This expression replaces each original manufacturer row with the sales of each manufacturer's products at each observed price in the marketplace. Each row shows the price point as well as the total sales at that price point.
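
Here is a hedged sketch of the full query, with hypothetical table names; note that the price point calculation appears in both the select list and the GROUP BY clause, and contains no aggregate operator:

  -- Each manufacturer row expands into one row per observed price point
  SELECT p.manufacturer,
         f.sales / f.quantity  AS price_point,
         SUM(f.sales)          AS total_sales
  FROM sales_facts f
  JOIN product_dim p ON f.product_key = p.product_key
  GROUP BY p.manufacturer, f.sales / f.quantity;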

I can now make some precise technical comments about drilling down:

  1. Drilling down is the most basic user maneuver in the data warehouse, which must support it in as general and flexible a manner as possible because there's no way to predict the user's drill-down path. In other words, every drill-down path must be available and supported with the same user interface gestures because the users see little conceptual difference between the various forms of drilling down described in the preceding examples.
  2. The data warehouse must therefore support drilling down at the user interface level, at all times, with the most atomic data possible because the most atomic data is the most dimensional. The most atomic data is the most expressive; more dimensions are attached to atomic data than to any form of aggregated or rolled-up data.
  3. Combining the first two points means that for all practical purposes, the atomic data must be in the same schema format as any aggregated form of the data; the atomic data must be a smoothly accessible target for drill-down paths using standard ad hoc query tools. Failure in this area is the showstopper for an architecture in which atomic data is hidden in the back room in a normalized physical format and somehow is accessed after “drilling through” aggregated dimensional data marts. The proponents of this architecture have never explained how this magic occurs. Fortunately, this crisis evaporates if you use the same data structures at all levels of aggregation including the atomic level.
  4. To build a practical system for drilling down, you want standard ad hoc query tools to present the drill-down choices without special schema-dependent programming, and you want these tools to emit the correct resulting SQL without schema-dependent programming. Schema-dependent programming is the kiss of death for a data warehouse shop because it means that each schema requires custom-built applications. This problem was a crisis in the 1980s; there's no excuse for it to remain a problem now. Avoiding schema-dependent programming means choosing a standard schema methodology for all user-facing data sets in the presentation layer of the data warehouse.
  5. Only one standard schema methodology exists that's capable of expressing data in a single, uniform format that looks the same at the atomic layers as in all aggregated layers, and at the same time requires no schema dependent programming: the star schema, otherwise known as the dimensional model. Dimensional models support all forms of drilling down described in this column. All possible many-to-one and many-to-many data relationships are capable of representation in a dimensional model; thus, the dimensional model is the ideal platform for ad hoc querying.
  6. The dimensional design in the presentation layer smoothly supports prebuilt aggregation tables. An aggregated fact table is a mechanically derived table of summary records. The most common reason to build aggregated fact tables is that they offer immense performance advantages compared to using the large, atomic fact tables. But you get this performance boost only when the user asks for an aggregated result! The first example query asking for the manufacturer sales of products was a good example of an aggregated result.

A modern data warehouse environment uses a query rewrite facility called an aggregate navigator to choose a prebuilt aggregate table whenever possible. Oracle's materialized views and IBM DB2's automatic summary tables are examples of aggregate navigators. Each time the user asks for a new drill-down path, the aggregate navigator decides in real time which aggregate fact table will support the query most efficiently. Whenever the user asks for a sufficiently precise and unexpected drill down, the aggregate navigator gracefully defaults to the atomic data layer.
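
As a hedged illustration of a prebuilt aggregate that a query rewrite facility can use (Oracle-style syntax; refresh options omitted), a brand-by-date summary of the atomic sales fact table might be declared as follows:

  -- Brand-level aggregate fact table maintained as a materialized view
  CREATE MATERIALIZED VIEW brand_sales_agg
  ENABLE QUERY REWRITE
  AS
  SELECT p.brand,
         f.date_key,
         SUM(f.dollar_sales) AS dollar_sales,
         SUM(f.unit_sales)   AS unit_sales
  FROM retail_sales_facts f
  JOIN product_dim p ON f.product_key = p.product_key
  GROUP BY p.brand, f.date_key;

Queries that group by brand, or by anything that rolls up from brand, can be rewritten against this much smaller table; a sufficiently detailed drill down falls back to the atomic fact table.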

Drilling down is probably the most basic capability that a data warehouse needs to support. Drilling down most directly addresses the natural business user need to see more detail in an interesting result.

6.4 The Soul of the Data Warehouse, Part Two: Drilling Across

Ralph Kimball, Intelligent Enterprise, Apr 5, 2003

If drilling down is the most fundamental maneuver in a data warehouse, drilling across is a close second. From the perspective of an answer set, drilling across adds more data to an existing row. Note that this result isn't what you get from a UNION of rows from separate queries. It's better described as the column accretion from separate queries.

Drilling across by adding another measured fact to the SELECT list from the existing fact table mentioned in the query is a trivial accomplishment. What's more interesting and important is adding another measured fact from a new fact table. The issues raised by this simple view of drilling across are at the heart of data warehouse architecture. These issues boil down to an observation and a choice.

The observation about drill across is that the new fact table called for in the drill-across operation must share certain dimensions with the fact table in the original query. Certain dimensions will be named in the original query because they contribute row headers. Remember that these row headers are the basis of the grouping that creates the answer set row. These dimensions will appear in the FROM clause of the SQL code and will be joined to the fact table through the relationship of a foreign key to the primary key. The new fact table must also support exactly these same row headers, or the context of the answer set row is meaningless.

The drill-across choice is that you can either send a single, simultaneous SQL request to the two fact tables or send two separate requests. Although sending a single SQL request to the two fact tables seems cleaner, this choice can become a showstopper. Sending a single request means mentioning both fact tables in the FROM clause of the SQL code and joining both fact tables in some way to the common dimension tables I just discussed. This commingling of two fact tables in the same SQL statement causes these problems:

  • Because the two fact tables will be joined together either directly or through the common dimensions, the query must specify whether the many-to-many relationship between the two fact tables is handled with inner or outer joins. This fundamental challenge arises from the relational model. It's effectively impossible to get this right, even if you're an expert SQL programmer. Depending on the relative cardinality of the two fact tables, your aggregated numeric totals can either be too low or too high, or both! Even if you don't believe me, you have to deal with the next bullet point.
  • The vast majority of queries the relational database receives are generated by powerful query tools and report writers, and you have no direct control over the SQL they emit. You don't want control over the SQL. Some of these tools generate mind-boggling reams of SQL, and you can't effectively intervene.
  • Emitting a single SQL statement precludes you from requesting data from separate table spaces, separate machines, or separate database vendors. You're stuck in the same table space on the same machine talking to one database vendor. If you can easily avoid this problem, why take on these restrictions?
  • Finally, if you emit a single SQL statement involving both fact tables, you'll almost certainly be unable to use any of the powerful query-rewrite tools that perform aggregate navigation. Aggregate navigation is the most cost-effective way to make dramatic gains in data warehouse performance. For more on aggregate navigation, see article 12.18, Aggregate Navigation with (Almost) No Metadata.

Implementing Drill Across

If you've followed the logic of the observation and the choice, the architecture to support drill across begins to emerge.

  1. All fact tables in a drill-across query must use conformed dimensions.
  2. The actual drill-across query consists of a multi-pass set of separate requests to the target fact tables followed by a simple sort-merge on the identical row headers returned from each request.

The simplest definition of conformed dimensions is that two instances of a conformed dimension are identical. So if two fact tables have a customer dimension, then customer is conformed if the two dimensions are exactly the same. But this definition is unnecessarily restrictive. Here's the precise definition of conformed dimensions: Two dimensions are conformed if the fields that you use as common row headers have the same domains.

When you bring two separate queries together in a drill-across operation, both queries must have the same number of row headers, arranged from left to right in the same order. All the rest of the columns (the computed facts) in the two queries, by definition, are not row headers. In other words, an independent examination of both queries shows that neither query has rows that duplicate the same row headers. To put it another way, the row headers form a unique key for each row of the answer set.

To sort-merge (also called merge-sort) the two queries, you must sort them the same way. At this point, it becomes possible to merge the rows of the two queries together in a single pass. The resulting merged answer set has a single set of row headers plus the combined set of computed facts returned from both queries. Because traditional sort-merge is the same as an outer join, it is possible for a row in the final merged answer set to have nulls for either the first set of computed facts or the second set, but not both!

Once you've visualized the sort-merge step in drilling across, you really understand conformed dimensions. With conformed dimensions, the only thing you care about is matching row headers. If the contents of the respective fields you're using for the sort-merge are drawn from the same domains, then the match makes sense. If you try to match row headers from two dimensions that aren't conformed, you're guaranteed to get garbage. The sort-merge will fail, and the SQL engine will post the results from the two queries on separate lines—and probably in separate sorting locations in the merged answer set.

Amazing Magic

In my classes, I sometimes describe conformed dimensions as either dimensions that are exactly equal (the trivial case) or dimensions where “one is a subset of the other.” For example, a brand dimension may be a subset of a more detailed product dimension. In this case, you can drill across two fact tables, one at the brand level with a brand dimension (such as a forecast), and the other at a detailed product level with a product dimension (such as sales transactions). Assume that the product dimension is a nice flat table containing the low cardinality brand attribute.

If the row header of the two queries is simply “brand,” then some amazing magic takes place. The engine automatically aggregates both data sets to the brand level, which is exactly the right level for the drill across. If the names of the brands are drawn from the same domain, you can complete the drill across (of forecast versus actual) by confidently merging the rows with the same brand names. Many commercial query and report writing tools perform this drill-across operation. Please see the rather technical articles 13.21, Simple Drill-Across in SQL, and 13.22, An Excel Macro for Drilling Across, for deeper dives into implementing drill-across.
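
To make the magic concrete, here is a hedged sketch of what this forecast-versus-actual drill across computes, with hypothetical table and column names. It is written as one statement for readability, with each subquery touching a single fact table and aggregating to the brand row header; in practice the two inner queries are typically run as separate passes and merged by the client tool or by drill-across SQL, as described in articles 13.21 and 13.22.

  SELECT COALESCE(fc.brand, ac.brand) AS brand,
         fc.forecast_dollars,
         ac.actual_dollars
  FROM  (SELECT b.brand, SUM(f.forecast_dollars) AS forecast_dollars
         FROM forecast_facts f
         JOIN brand_dim b ON f.brand_key = b.brand_key
         GROUP BY b.brand) fc
  FULL OUTER JOIN
        (SELECT p.brand, SUM(s.dollar_sales) AS actual_dollars
         FROM sales_facts s
         JOIN product_dim p ON s.product_key = p.product_key
         GROUP BY p.brand) ac
  ON fc.brand = ac.brand;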

You can see that it's possible to conform two dimensions even when they have some incompatible fields. You just need to be careful to avoid using these incompatible fields as row headers in drill-across queries. This flexibility lets a dimension contain some private fields that are meaningful only to a local user group.

Finally, it's worth pointing out that the decision to physically centralize a data warehouse has very little to do with conformed dimensions. If you combine two data sets in a drill-across operation, you have to label them the same way, whether the data sets are tightly administered on a single piece of hardware by one DBA or are loosely administered by remote IT organizations that merely agree to create a set of overlapping labels.

6.5 The Soul of the Data Warehouse, Part Three: Handling Time

Ralph Kimball, Intelligent Enterprise, Apr 22, 2003

The three most fundamental maneuvers in every data warehouse are drilling down, drilling across, and handling time. The third, handling time, makes good on a pledge that every data warehouse provider implicitly takes: The data warehouse shall preserve history. In practice, this pledge generates three main requirements for the data warehouse:

First, every piece of data in the data warehouse must have clearly understood time validity. In other words, when did the data become valid, and when did it cease to be valid?

Second, if the detailed description of a data warehouse entity has changed over time, you must correctly associate each version of that entity with the contemporary versions of other measurements and entities in the data warehouse. In other words, if a customer made a purchase a year ago, the description of the customer attached to that purchase must be correct for that time frame.

Last, the data warehouse must support the natural ways people have of viewing data over time. These natural ways include seeing instantaneous events, regular periodic reports, and latest status.

Dimensional modeling provides a convenient framework for dealing with each of these requirements. Remember that dimensional models are organized around measurements. Measurements, which are usually numeric, occupy fact tables in a dimensional model. The contexts of the measurements are in dimension tables, which surround the fact tables and connect to them through a series of simple relationships between foreign keys and primary keys.

Time Validity

A measurement is usually a physical act that takes place at a specific time, so it's natural, even irresistible, to attach a time stamp to each fact table record. Every fact table has a time dimension.

Time stamps are commonly recorded at a daily grain because many legacy systems don't record time of day when posting a measurement. In a dimensional schema, the daily time stamp consists of a surrogate integer foreign key in the fact table joined to a corresponding primary key in the daily time dimension table. You want the time stamp in the fact table to be a surrogate key rather than a real date for three reasons: First, the rare time stamp that is inapplicable, corrupted, or hasn't happened yet needs a value that cannot be a real date. Second, most user calendar navigation constraints, such as fiscal periods, end-of-periods, holidays, day numbers, and week numbers, aren't supported by database time stamps. Therefore, they need to come from a verbose time dimension table rather than being computed in the requesting query tool. Third, integer time keys take up less disk space than full dates.
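
A minimal sketch of such a verbose daily grain time dimension, with hypothetical names; row 0 serves the rare time stamp that is inapplicable, corrupted, or hasn't happened yet:

  -- Hypothetical daily grain time dimension with a surrogate key
  CREATE TABLE date_dim (
    date_key           INTEGER NOT NULL PRIMARY KEY,  -- surrogate key referenced by fact tables
    full_date          DATE,                          -- NULL only for the special row below
    day_of_week        VARCHAR(10),
    day_number_in_year SMALLINT,
    week_number        SMALLINT,
    fiscal_period      VARCHAR(10),
    end_of_period_flag CHAR(1),
    holiday_indicator  CHAR(1)
    -- further calendar navigation attributes as needed
  );

  -- Special row for time stamps that are inapplicable, corrupted, or haven't happened yet
  INSERT INTO date_dim (date_key, full_date, day_of_week, fiscal_period, end_of_period_flag, holiday_indicator)
  VALUES (0, NULL, 'Unknown', 'Unknown', 'N', 'N');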

When the source system provides a detailed time stamp for the measurement down to the minute or the second, the time of day needs to be a separate dimension or a full date/time stamp. Otherwise, a combined day and time-of-day dimension would be impractically large.

In multinational applications, there are often two time stamp perspectives: the remote party's and the home office's. Certainly, when recording the time of day, it usually makes sense to include two pairs of time stamps (calendar day and full date/time stamp for both remote and local) rather than leaving it up to the query tool to work out time zones. Mainly because of daylight saving time rules, the calculation of time zone differences is horrendously complicated.

Correct Association

Dimensional modeling assumes that dimensions are largely independent. This assumption, combined with the fact that time is its own dimension, implies that other entities, such as customer and product, are independent of time. In the real world, this inference isn't quite true. Entities such as customer and product slowly change over time, usually in episodic, unpredictable ways.

When the data warehouse encounters a legitimate revised description of, for example, a customer, there are three fundamental choices for handling this slowly changing dimension (SCD):

  1. Overwrite the changed attribute, thereby destroying previous history. This approach is justifiable when correcting an error, if you're living by the pledge.
  2. Issue a new record for the customer, keeping the customer natural key, but creating (by necessity) a new surrogate primary key.
  3. Create an additional field in the existing customer record, and store the old value of the attribute in the additional field. Overwrite the original attribute field. This strategy is called for when the attribute can have simultaneous “alternate realities.”

All three choices need at least one embedded time stamp stating when the record was updated, as well as a companion field describing that change. For the primary type 2 SCD, where a new record is created, you need a pair of time stamps as well as a change description field. The pair of time stamps define a span of time from the begin effective time to end effective time when the complete customer description remains valid. The most sophisticated treatment of a type 2 SCD record involves five fields:

  • Begin effective date/time stamp (not a surrogate key pointer)
  • End effective date/time stamp
  • Effective date surrogate key (daily grain) connecting to date dimension as a snowflake
  • Change description field (text)
  • Most recent flag

The first two fields are what conventional BETWEEN constraints use to profile the dimension at specific points in time. They need to be single fields with full date and time stamps in order to make the BETWEEN machinery work. The third field constrains specific records in the dimension that changed on days that can be described only via the organization's calendar date table (such as employee demotions that occurred the day before payday). The fourth field lets you find all changes in the dimension meeting a particular description. The fifth field is a quick way to find all the current records in a dimension without using BETWEEN.
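
Here is a minimal sketch, with hypothetical names, of a type 2 customer dimension carrying these five administrative fields, followed by the kind of BETWEEN constraint the first two fields support:

  CREATE TABLE customer_dim (
    customer_key        INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key; new value per type 2 change
    customer_natural_id VARCHAR(20)  NOT NULL,              -- natural key retained across versions
    customer_name       VARCHAR(100),
    -- ... other descriptive attributes ...
    row_effective_ts    TIMESTAMP    NOT NULL,              -- begin effective date/time stamp
    row_expiration_ts   TIMESTAMP    NOT NULL,              -- end effective date/time stamp
    effective_date_key  INTEGER      NOT NULL,              -- daily grain key to the date dimension
    change_description  VARCHAR(50),                        -- reason the new row was created
    current_row_flag    CHAR(1)      NOT NULL               -- 'Y' only on the most recent version
  );

  -- Profile the customer base exactly as it looked on a particular date
  SELECT *
  FROM customer_dim
  WHERE TIMESTAMP '2003-04-22 00:00:00' BETWEEN row_effective_ts AND row_expiration_ts;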

Natural Grains

In 30 years of analyzing and modeling data, I've found that fact table measurements all fall into just three classes. These types correspond to instantaneous events, regular periodic reports, and latest status. In dimensional modeling, these three fact table types are the transaction grain, the periodic snapshot grain, and the accumulating snapshot grain.

The transaction grain represents a point in space and time, and is meant for a measurement event defined at a particular instant. A scanner event at a grocery store is the classic example of a transaction event. In this case, the time stamp in the fact table is very simple. It's either a single daily grain foreign key or a pair consisting of a daily grain foreign key together with a time-of-day date/time stamp, depending on what the source system provides. The facts in this transaction grain table must be true to the grain and should describe only what took place in that instant.

The periodic snapshot grain represents a regular repeating measurement, like a bank account monthly statement. This fact table also has a single time stamp, representing the overall period. Usually the time stamp is the end of the period, and often is expressed at the daily grain, even if it's understood to represent a month or a fiscal period. The facts in this periodic snapshot grain table must be true to the grain and should describe only measures appropriate to the specific period.

The accumulating snapshot grain represents the current evolving status of a process that has a finite beginning and end. Usually these processes are of short duration and therefore don't lend themselves to the periodic snapshot. Order processing is the classic example of an accumulating snapshot.

The design and administration of the accumulating snapshot is quite different from that of the first two fact table types. All accumulating snapshot fact tables have a set of dates, typically four to 12 of them, describing the typical scenario of the process being modeled. For instance, an order has a set of characteristic dates: original order date, actual ship date, delivery date, final payment date, and return date. In this example, these five dates appear as five separate foreign (surrogate) keys. When the order record is first created, the first of these dates is well defined, but perhaps none of the others have yet happened. This same fact record is subsequently revisited as the order wends its way through the pipeline. Each time something happens, the accumulating snapshot fact record is destructively modified. The date foreign keys are overwritten, and various facts are updated. Often the first date remains inviolate because that describes when the record was created, but all the other dates may well be overwritten, sometimes more than once.
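
A hedged sketch of the destructive revisit, using hypothetical names: when the order finally ships, the placeholder date key and the shipment facts in the existing row are simply overwritten.

  -- Revisit the accumulating snapshot row when the actual ship date becomes known
  UPDATE order_fulfillment_facts
     SET actual_ship_date_key = 20031015,   -- date key shown as a readable integer; was pointing at a 'Not Yet Shipped' row
         shipped_quantity     = 48,
         days_order_to_ship   = 6
   WHERE order_number = 'ORD-10078';        -- degenerate dimension identifying the order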

Have You Lived Up to Your Pledges?

This article has been a brief overview of the central techniques of handling time in a data warehouse. If you've systematically employed surrogate keys for all your connections to your master time dimensions, faithfully tracked changes in dimension entities with the three types of SCDs, and supported your users' reporting needs with transaction grain, periodic snapshot grain, and accumulating snapshot grain fact tables, then you have indeed lived up to your pledges.

6.6 Graceful Modifications to Existing Fact and Dimension Tables

Ralph Kimball, Design Tip #29, Oct 15, 2001

Despite the best plans and intentions, the data warehouse designer must often face the problem of adding new data types or altering the relationships among data after the data warehouse is up and running. In an ideal world we would like such changes to be graceful so that existing query and reporting applications continue to run without being recoded, and existing user interfaces “wake up” to the new data and allow the data to be added to queries and reports.

Obviously, there are some changes that can never be handled gracefully. If a data source ceases to be available and there is no compatible substitute, then the applications depending on this source will stop working. But can we describe a class of situations where changes to our data environment can be handled gracefully?

The predictable symmetry of our dimensional models comes to our rescue. Dimensional models are able to absorb some significant changes in the source data without invalidating existing applications. Let's list as many of these changes as we can, starting with the simplest.

  1. New dimensional attributes. If, for example, we discover new textual descriptors of a product or customer, we add these attributes to the dimension as new fields. All existing applications will be oblivious to the new attributes and will continue to function. Most user interfaces should notice the new attributes at query time. Conceptually, the list of attributes available for constraining and grouping should be displayed in a query or reporting tool via an underlying query of the form SELECT COLUMN_NAME FROM SYS_TABLES WHERE TABLE_NAME = 'PRODUCT'. This kind of user interface will continuously adjust when new dimension attributes are added to the schema. In a slowly changing dimension environment, where slightly changed versions of the dimension are being maintained, care must be taken to assign the values of the new attributes correctly to the various versions of the dimension records. If the new attributes are available only after a specific point in time, then “NA” (not available) or its equivalent must be supplied for old dimension records.
  2. New types of measured facts. Similarly, if new measured facts become available, we can add them to the fact table gracefully. The simplest case is when the new facts are available in the same measurement event and at the same grain as the existing facts. In this case, the fact table is altered to add the new fact fields, and the values are populated into the table. In an ideal world, an ALTER TABLE statement can be issued against the existing fact table to add the new fields (a minimal sketch of these graceful changes appears after this list). If that is not possible, then a second fact table must be defined with the new fields and the records copied from the first. If the new facts are only available from a point in time forward, then true null values need to be placed in the older fact records. If we have done all of this, old applications will continue to run undisturbed. New applications using the new facts should behave reasonably even if the null values are encountered. The users may have to be trained that the new facts are only available from a specific point in time forward.

    A more complex situation arises when new measured facts are not available in the same measurement event as the old facts, or if the new facts occur naturally at a different grain. If the new facts cannot be allocated or assigned to the original grain of the fact table, it is very likely that the new facts belong in their own fact table. It is a mistake to mix grains of measurements or disjoint kinds of measurements in the same fact table. If you have this situation, you need to bite the bullet and find a query tool or report writer that is capable of drill-across SQL so that it can access multiple fact tables in the same user request.

  3. New dimensions. A dimension can be added to an existing fact table by adding a new foreign key field and populating it correctly with values of the primary key from the new dimension. For example, a weather dimension can be added to a retail sales fact table if a source describing the weather is available at each selling location each day. Note that we are not changing the grain of the fact table. If the weather information is only available from a point in time forward, then the foreign key value for the weather dimension must point to a record in the weather dimension whose description is “weather unavailable.”
  4. More granular dimension. Sometimes it is desirable to increase the granularity of a dimension. For instance, a retail sales fact table with a store dimension could be modified to replace the store dimension with an individual cash register dimension. If we had 100 stores, each with an average of 10 cash registers, the new cash register dimension would have 1000 records. All of the original store attributes would be included in the cash register dimension because cash registers roll up perfectly in a many-to-one relationship to stores.
  5. Addition of a completely new data source involving new and existing dimensions. Almost always, a new source of data has its own granularity and its own dimensions. All of you dimensional designers know the answer to this one. We sprout a brand new fact table. Because any existing fact tables and dimension tables are untouched, by definition, all the existing applications keep chugging along. Although this case seems almost trivial, the point here is to avoid cramming the new measurements into the existing fact tables. A single fact table always owns a single kind of measurement expressed with a uniform grain.
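
A minimal sketch of the first three graceful changes above, with hypothetical names; ALTER TABLE syntax varies slightly by database:

  -- 1. New dimension attribute; old dimension rows get 'NA' rather than NULL
  ALTER TABLE product_dim ADD COLUMN package_type VARCHAR(30);
  UPDATE product_dim SET package_type = 'NA' WHERE package_type IS NULL;

  -- 2. New measured fact at the same grain; older fact rows keep true nulls
  ALTER TABLE retail_sales_facts ADD COLUMN coupon_discount_dollars DECIMAL(9,2);

  -- 3. New dimension; historical fact rows point at a 'Weather Unavailable' record (key 0)
  ALTER TABLE retail_sales_facts ADD COLUMN weather_key INTEGER;
  UPDATE retail_sales_facts SET weather_key = 0 WHERE weather_key IS NULL;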

This article has tried to define a taxonomy of unexpected changes to your data environment to give you a way to sort through various responses, and recognize those situations where a graceful change is possible. Because redoing queries and reports is hugely expensive and probably disruptive to the business users, our goal is to stay on the graceful side of the line.

Dos and Don'ts

In this section, we provide dimensional modeling guidelines to follow, as well as common traps to avoid.

6.7 Kimball's Ten Essential Rules of Dimensional Modeling

Margy Ross, Intelligent Enterprise, May 29, 2009

A student recently asked me for a list of “Kimball's Commandments” for dimensional modeling. Due to the religious connotations, we'll refrain from calling these commandments, but the following is a checklist of not-to-be-broken Kimball rules, along with less stringent rule-of-thumb recommendations.

Rule #1: Load detailed atomic data into dimensional structures. Dimensional models should be populated with bedrock atomic details to support the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single record at a time, but you can't predict the somewhat arbitrary ways they'll want to screen and roll up the details. If only summarized data is available, then you've already made assumptions about data usage patterns that will cause users to run into a brick wall when they want to dig deeper into the details. Of course, the atomic details can be complemented by summary dimensional models that provide performance advantages for common queries of aggregated data, but the business users cannot live on summary data alone; they need the gory details to answer their ever-changing questions.

Rule #2: Structure dimensional models around business processes. Business processes are the activities performed by your organization; they represent measurement events, like taking an order or billing a customer. Business processes typically capture or generate unique performance metrics associated with each event. These metrics translate into facts, with each business process represented by a single atomic fact table. In addition to single process fact tables, consolidated fact tables are sometimes created that combine metrics from multiple processes into one fact table at a common level of detail; consolidated fact tables are a complement to the detailed single process fact tables, not a substitute for them.

Rule #3: Ensure every fact table has a date dimension table associated with it. The measurement events described in rule #2 always have a date stamp of some variety associated with them, whether it's a monthly balance snapshot or a monetary transaction captured to the hundredth of a second. Every fact table should have at least one foreign key to an associated date dimension table, whose grain is a single day, with calendar attributes and nonstandard characteristics about the measurement event date, such as the fiscal month and corporate holiday indicator. Sometimes multiple date foreign keys are represented in a fact table.

Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail. There are three fundamental grains to categorize all fact tables: transactional, periodic snapshot, or accumulating snapshot. Regardless of its grain type, every measurement within a fact table must be at the exact same level of detail. When you mix facts representing multiple levels of granularity in the same fact table, you are setting yourself up for business user confusion and making the BI applications vulnerable to overstated or otherwise erroneous results.

Rule #5: Resolve many-to-many relationships in fact tables. Because a fact table stores the results of a business process event, there's inherently a many-to-many (M:M) relationship between its foreign keys, such as multiple products being sold in multiple stores on multiple days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, such as the multiple diagnoses associated with a health care encounter or multiple customers with a bank account; in these cases, it's unreasonable to resolve the many-valued dimensions directly in the fact table because this would violate the natural grain of the measurement event, so we use a many-to-many dual keyed bridge table in conjunction with the fact table.
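
A minimal sketch of such a bridge, with hypothetical health care names: the fact table carries a diagnosis group key, and the bridge resolves each group to its individual diagnoses with an allocation factor.

  CREATE TABLE diagnosis_group_bridge (
    diagnosis_group_key INTEGER      NOT NULL,  -- carried as a foreign key in the fact table
    diagnosis_key       INTEGER      NOT NULL,  -- foreign key to the diagnosis dimension
    weighting_factor    DECIMAL(5,4) NOT NULL   -- allocations within one group sum to 1.0
  );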

Rule #6: Resolve many-to-one relationships in dimension tables. Hierarchical, fixed depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you've spent most of your career designing normalized entity-relationship models for transaction processing systems, you'll need to resist your instinctive tendencies to normalize or snowflake the M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling. It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll up attributes are frequently changing; however, using the fact table to resolve M:1 relationships should be done sparingly.

Rule #7: Store report labels and filter domain values in dimension tables. The codes and, more importantly, associated decodes and descriptors used for labeling and query filtering should be captured in dimension tables. Avoid storing cryptic code fields or bulky descriptive fields in the fact table itself; likewise, don't just store the code in the dimension table and assume that users don't need descriptive decodes or that they'll be handled in the BI application. If it's a row/column label or pull-down menu filter, it should be handled as a dimension attribute. While we stated in rule #5 that fact table foreign keys should never be null, it's also advisable to avoid nulls in the dimension tables' attribute fields by replacing the null value with “NA” (not applicable) or another default value determined by the data steward to reduce user confusion, if possible.

Rule #8: Make sure dimension tables use a surrogate key. Meaningless, sequentially assigned surrogate keys (except for the date dimension, where chronologically assigned and even meaningful keys are acceptable) deliver a number of operational benefits, including smaller keys, which mean smaller fact tables, smaller indexes, and improved performance. Surrogate keys are absolutely required if you're tracking dimension attribute changes with a new dimension record for each profile change; even if your business users don't initially visualize the value of tracking attribute changes, using surrogates will make a downstream policy change less onerous. The surrogates also allow you to map multiple operational keys to a common profile, plus buffer you from unexpected operational activities, like the recycling of an obsolete product number or acquisition of another company with its own coding schemes.
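To make the change-tracking point concrete, here is a small hypothetical illustration using the product_dim sketch above: when a product's profile changes, a second row with a new surrogate key is added while the operational product code repeats, so fact rows continue to point at the profile that was in effect when they were loaded:

-- Type 2 change: same operational code, new surrogate key for the new profile
INSERT INTO product_dim (product_key, product_code, product_description, brand_name, category_name)
VALUES (1042, 'A4911', 'Spring Water 16 oz', 'ClearBrook', 'Beverages');

INSERT INTO product_dim (product_key, product_code, product_description, brand_name, category_name)
VALUES (2387, 'A4911', 'Spring Water 20 oz', 'ClearBrook', 'Beverages');

-- Fact rows loaded before the repackaging reference product_key 1042;
-- fact rows loaded afterward reference 2387, so history is preserved.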

Rule #9: Create conformed dimensions to integrate data across the enterprise. Conformed dimensions (otherwise known as common, master, standard, or reference dimensions) are essential for enterprise data warehousing. Managed once in the ETL system and then reused across multiple fact tables, conformed dimensions deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes. The enterprise data warehouse bus matrix is the key architecture blueprint for representing the organization's core business processes and associated dimensionality. Reusing conformed dimensions ultimately shortens the DW/BI system's time-to-market by eliminating redundant design and development efforts; however, conformed dimensions require a commitment and investment in data stewardship and governance, even if you don't need everyone to agree on every dimension attribute to leverage conformity.
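As a small sketch of reuse (again with hypothetical names), two different business process fact tables simply reference the same physical date and product dimensions, so an attribute such as brand_name means exactly the same thing in both stars:

CREATE TABLE orders_fact (
  order_date_key INTEGER       NOT NULL,  -- conformed date_dim
  product_key    INTEGER       NOT NULL,  -- conformed product_dim
  order_dollars  DECIMAL(12,2) NOT NULL
);

CREATE TABLE inventory_snapshot_fact (
  snapshot_date_key INTEGER NOT NULL,     -- same conformed date_dim
  product_key       INTEGER NOT NULL,     -- same conformed product_dim
  quantity_on_hand  INTEGER NOT NULL
);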

Rule #10: Continuously balance the requirements and realities to deliver a DW/BI solution that's accepted by the business users and supports their decision making. Dimensional modelers must constantly straddle the business users' requirements and the underlying realities of the associated source data to deliver a design that is reasonably implementable and, more importantly, stands a reasonable chance of business adoption. The requirements versus realities balancing act is a fact of life for DW/BI practitioners, whether you're focused on the dimensional model, project strategy, technical/ETL/BI architectures, or deployment/maintenance plan.

If you've read our Intelligent Enterprise articles regularly or our Toolkit books and monthly Design Tips (all of which are included in this Kimball Reader!), the rules discussed in this article shouldn't be news to you; however, we've tried to consolidate them into a single rulebook that you can easily reference as you're gathered to design (or review) your own models. Good luck!

6.8 What Not to Do

Ralph Kimball, Intelligent Enterprise, Oct 24, 2001

In nearly all the articles I have written for Intelligent Enterprise and its predecessor, DBMS, I described design techniques needed to build a data warehouse. But despite all those columns, something is missing. The tone of the columns has almost always been imperative: “In situation A, use design techniques X, Y, and Z.” I realize that data warehouse designers also need boundaries, so this column is devoted to dimensional modeling design techniques NOT to use.

The 12 dimensional modeling techniques to avoid are listed in reverse order of importance. But even the first few mistakes I list can be enough to seriously compromise your data warehouse.

Mistake 12: Place text attributes in a fact table if you mean to use them as the basis of constraining and grouping. Creating a dimensional model is a kind of triage. Start by identifying the numeric measurements delivered from an operational source; those go in the fact table. Then identify the descriptive textual attributes from the context of the measurements; these go in the dimensions. Finally, make a case-by-case decision about the leftover codes and pseudo numeric items, placing them in the fact table if they are more like measurements, and in the dimension table if they are more like physical descriptions of something. But don't lose your nerve and leave true text in the fact table, especially comment fields. Get these text attributes off the main runway of your data warehouse and into dimension tables.

Mistake 11: Limit the use of verbose descriptive attributes in dimensions to save space. You might think that you're being a good, conservative designer by keeping the size of your dimensions under control. But in virtually every data warehouse, the dimension tables are geometrically smaller than the fact tables. So what if you have a 100MB product dimension table, if the fact table is 100 times as large! To design an easy-to-use data warehouse, you must supply as much verbose descriptive context in each dimension as you can. Make sure every code is augmented with readable descriptive text. Remember that the textual attributes in the dimensions “implement” the user interface for browsing your data, provide the entry points for constraining, and supply the content for the row and column headers in the final reports.

Mistake 10: Split hierarchies and hierarchy levels into multiple dimensions. A hierarchy is a cascaded series of many-to-one relationships. Many products roll up to a single brand. Many brands roll up to a single category, and so on. If your dimension is expressed at the lowest level of granularity (such as product), then all the higher levels of the hierarchy can be expressed as unique values in the product record. Users understand hierarchies, and your job is to present them in the most natural and efficient way. A hierarchy belongs in a single, physical flat dimension table. Resist the urge to “snowflake” a hierarchy by generating a set of progressively smaller subdimension tables. Don't confuse back room data cleaning with front room data presenting! And finally, if you have more than one hierarchy existing simultaneously, in most cases it makes sense to include all the hierarchies in the same dimension, if the dimension has been defined at the lowest possible grain.

Mistake 9: Delay dealing with a slowly changing dimension (SCD). Too many data warehouses are designed to regularly overwrite the most important dimensions, such as customer and product, from the underlying data sources. This goes against a basic data warehouse oath: The data warehouse will represent history accurately, even if the underlying data source does not. SCDs are an essential design element of every data warehouse.

Mistake 8: Use smart keys to join a dimension table to a fact table. Beginning data warehouse designers tend to be somewhat too literal minded when designing the primary keys in dimension tables that must necessarily connect to the foreign keys of the fact table. It is counterproductive to declare a whole suite of dimension attributes as the dimension table key and then use them all as the basis of the physical join to the fact table. All sorts of ugly problems eventually arise. Replace the smart physical key with a simple integer surrogate key that is numbered sequentially from 1 to N (the number of records in the dimension table).

Mistake 7: Add dimensions to a fact table before declaring its grain. All dimensional designs should start with the numeric measurements and work outward. First, identify the source of the measurements. Second, specify the exact granularity and meaning of the measurements. Third, surround these measurements with dimensions that are true to that grain. Staying true to the grain is a crucial step in the design of a dimensional data model.

Mistake 6: Declare that a dimensional model is “based on a specific report.” A dimensional model has nothing to do with an intended report! A dimensional model is a model of a measurement process. A numeric measurement is a solid physical reality. Numeric measurements form the basis of fact tables. The dimensions appropriate for a given fact table are the physical context that describe the circumstances of the measurements. A dimensional model is solidly based on the physics of a measurement process and is quite independent from how a business user chooses to define a report.

Mistake 5: Mix facts of differing grain in the same fact table. A serious error in a dimensional design is to add “helpful” facts to a fact table, such as records that describe totals for an extended time span or rolled up geographic area. Although these extra facts are well known at the time of the individual measurement and would seem to make some applications simpler, they cause havoc because all the automatic summations across dimensions double and triple count these higher level facts, producing incorrect results. Each different measurement grain demands its own fact table.

Mistake 4: Leave lowest level atomic data in normalized format. The lowest level data is the most dimensional and should be the physical foundation of your dimensional design. Aggregated data has been deprived of some of its dimensions. If you build a dimensional model from aggregated data and expect your user query tools to drill down to normalized atomic data that you have left in your staging area, then you're dreaming. Build all your dimensional models on the most atomic data. Make all atomic data part of the presentation portion of your data warehouse. Then your user tools will gracefully resist the “ad hoc attack.”

Mistake 3: Eschew aggregate fact tables and shrunken dimension tables when faced with query performance concerns; solve performance problems by adding more parallel processing hardware. Aggregates (such as Oracle's materialized views and IBM DB2's automatic summary tables) are the single most cost-effective way to improve query performance. Most query tool vendors explicitly support aggregates, and all of these depend on dimensional modeling constructs. The addition of parallel processing hardware, which is expensive, should be done as part of a balanced program that consists also of building aggregates, choosing query-efficient DBMS software, building lots of indexes, increasing real memory size, and increasing CPU speed.
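As one hedged illustration (the syntax follows the Oracle materialized view feature mentioned above; the table names echo the hypothetical sketches earlier in this chapter), a month-by-brand aggregate is simply derived from the atomic fact table and its dimensions, and the optimizer can rewrite user queries against it transparently:

CREATE MATERIALIZED VIEW monthly_brand_sales
ENABLE QUERY REWRITE
AS
SELECT d.calendar_year,
       d.calendar_month,
       p.brand_name,
       SUM(f.sales_dollars) AS sales_dollars
FROM   retail_sales_fact f
JOIN   date_dim    d ON d.date_key    = f.date_key
JOIN   product_dim p ON p.product_key = f.product_key
GROUP BY d.calendar_year, d.calendar_month, p.brand_name;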

Mistake 2: Fail to conform facts across separate fact tables. It would be a shame to get this far and then build stovepipes. This is called snatching defeat from the jaws of victory. If you have a numeric measured fact called, for example, revenue, in two or more of your dimensional models sourced from different underlying systems, then you need to take special care to make sure that the technical definitions of these facts match exactly. You want to be able to add and divide these separate revenue facts freely in your applications. This act is called conforming the facts.

Mistake 1: Fail to conform dimensions across separate fact tables. This is the biggest mistake because the single most important design technique in the dimensional modeling arsenal is conforming your dimensions. If two or more fact tables have the same dimension, you must be a fanatic about making these dimensions identical or carefully chosen subsets of each other. When you conform your dimensions across fact tables, you will be able to drill across separate data sources because the constraints and row headers will mean the same thing and match at the data level. Conformed dimensions are the secret sauce needed for building distributed data warehouses, adding unexpected new data sources to an existing warehouse, and making multiple incompatible technologies function together harmoniously.

Myths about Dimensional Modeling

The final articles in this chapter will help you differentiate dimensional modeling truths from fiction.

6.9 Dangerous Preconceptions

Ralph Kimball, DBMS, Aug 1996

We've retained the references to data marts in this early article rather than substituting the more current business process dimensional model nomenclature because Ralph addressed common misunderstandings surrounding the Kimball definition of a data mart.

This month I look at some preconceptions about data warehouses that are not only false, but are dangerous to the success of your projects. By eliminating these preconceptions, you simplify the design of your data warehouse and reduce the implementation time. The first set of dangerous preconceptions concerns a current hot topic: data marts.

Dangerous Preconceptions: The data mart is a quick and dirty data warehouse. You can bring up a data mart without going to the trouble of developing an overall architectural plan for the enterprise. It's too much trouble to develop an overall architecture, and there is no way that you have the perspective to try that now.

The Liberating Truth: The data mart must not be a quick and dirty data warehouse; rather it should focus on a business process subject area implemented within the framework of an overall plan. A data mart can be loaded with data extracted directly from legacy sources. A data mart does not have to be downloaded formally from a larger centralized enterprise data warehouse.

The key to a successful data mart strategy is simple. For any two data marts or business process dimension models in an enterprise, the common dimensions must be conformed. Dimensions are conformed when they share attributes (fields) that are based on the same underlying values. Thus in a grocery store chain, if the “back door” purchase orders database is one data mart and the “front door” retail sales database is another data mart, the two dimensional models will form a coherent part of an overall enterprise data warehouse if their common dimensions (say, time and product) conform.

The beauty of conformed dimensions is that the two data marts don't have to be on the same machine and don't need to be created at the same time. Once both data marts are running, an overarching application can request data simultaneously from both (in separate queries) and the answer set is likely to make sense. Logically, the only valid “row headers” in a joint report must come from common dimensions such as time and product in our grocery store example. But we have guaranteed that at least some of the row headers from these two data marts will be in common because the dimensions are conformed. Any of these common row headers can produce a valid report, as shown in Figure 6.5. In this example, the purchase order data is at the individual day level, but the sales forecast data is at the week level. Because days roll up to weeks, the two date dimensions are conformed since they share all the attributes that roll upward from individual weeks.

Purchase Order and Sales Forecasts fact tables link to Daily and Weekly Date dimension tables. A downward curved arrow from the Daily Date to the Weekly Date dimension table indicates that days roll up to weeks.

Figure 6.5 Two fact tables with conformed dimensions.
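A hedged sketch of what drilling across these two conformed data marts might look like (table and column names are hypothetical): each fact table is queried separately, grouped by a row header from the conformed dimension, and the two answer sets are then merged on that shared label, here the week:

-- Query 1: purchase orders, rolled up from day grain to week
SELECT d.week_label, SUM(f.po_dollars) AS po_dollars
FROM   purchase_order_fact f
JOIN   daily_date_dim d ON d.date_key = f.date_key
GROUP BY d.week_label;

-- Query 2: sales forecasts, already stored at week grain
SELECT w.week_label, SUM(f.forecast_dollars) AS forecast_dollars
FROM   sales_forecast_fact f
JOIN   weekly_date_dim w ON w.week_key = f.week_key
GROUP BY w.week_label;

-- The requesting application then merges the two result sets on week_label,
-- which is meaningful in both answer sets because the dimensions conform.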

The idea of developing an overall data warehouse architecture is daunting, but the key step in that architecture plan is simple: Identify the common dimensions. In virtually every company, the most important common dimensions are customers, products, geographies, and time frames.

Once the common dimensions have been identified, the development of separate data marts must be managed under this common dimensional framework. When two data marts use the same dimension (for example, customer), they must share a set of attributes that are based on the same values.

The second set of dangerous preconceptions concerns dimensional models and whether they are “robust.”

Dangerous Preconceptions: The dimensional data model is a specific high level summary type of design that is not extensible and cannot readily accommodate changes in database design requirements.

The Liberating Truth: The dimensional data model is extremely robust. It can withstand serious changes to the content of the database without requiring existing applications to be rewritten. New dimensions can be added to the design. Existing dimensions can be made more granular. New unanticipated facts and dimensional attributes can also be added, as shown in Figure 6.6.

Retail Sales fact table and Daily Date dimension table with arrows labeling the new dimension, possibly more granular dimension, new dimension attribute, and new fact.

Figure 6.6 Potential changes to an existing dimensional model.

The secret of an extensible dimensional database is building the fact table at the most granular level. In Figure 6.6, the fact table represents daily sales of individual products in individual stores. Because all three primary dimensions (date, product, and store) are expressed in low level atomic units, they can roll up to any conceivable grouping requested by a user in the future. If you have already aggregated the database up to weeks, you cannot provide reliable monthly data without returning to a primary database extract and building a new database incompatible with the old. With daily data, however, you can accommodate both the weekly view and the monthly view without compatibility problems and without re-extracting the original data.
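For instance (hypothetical names again, and assuming the daily date dimension carries a week label alongside its month and year attributes), the same atomic fact table supports both views with nothing more than a different GROUP BY:

-- Weekly view
SELECT d.week_label, SUM(f.sales_dollars) AS sales_dollars
FROM   retail_sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
GROUP BY d.week_label;

-- Monthly view, from the very same atomic rows
SELECT d.calendar_year, d.calendar_month, SUM(f.sales_dollars) AS sales_dollars
FROM   retail_sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
GROUP BY d.calendar_year, d.calendar_month;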

Figure 6.6 shows how the dimensional schema provides standard, convenient hooks for extending the database to meet new requirements. You can add new dimensions, add new facts, add new attributes to a dimension, and even make a dimension more granular. All of the extensions shown can be implemented without changing any previous application. No SQL needs to be rewritten. No applications need to be rebuilt. This is the essence of extensibility.

6.10 Fables and Facts

Margy Ross, Intelligent Enterprise, Oct 16, 2004

According to Merriam-Webster, fables are fictitious statements. Unfortunately, fables about dimensional modeling circulate throughout our industry. These false claims and assertions are a distraction, especially if you're trying to align a team. In this article, we describe the root misunderstandings that perpetuate these myths so you understand why they're as unfounded as fairy tales about two-headed creatures.

Not All Dimensional Models Are Created Equal

We review a lot of dimensional models in our work. They often illustrate best practice design principles from our Toolkits and articles. However, not all supposed dimensional models are designed appropriately. Some blatantly violate core dimensional modeling tenets. Given the abysmal sample star schemas found in seemingly authoritative books and training presentations, this comes as no surprise. However, you shouldn't lump all dimensional models into a “bad” category based on misguided imposters.

Most of the fabled assertions are rooted in several basic mistakes regarding dimensional modeling best practices. Dimensional modeling can't be blamed if its fundamental concepts aren't embraced. Likewise, criticisms lobbed by individuals who don't understand its key premises need to be taken with a grain of salt. Once we clarify these misunderstandings, you'll be prepared to distinguish fables from facts for yourself.

Focus on Measurement Processes, Not Departmental Reports

We advocate a four-step approach for designing dimensional models. The first step is to identify the business process, followed by declaring the grain, then selecting the dimensions and facts. Nowhere do we recommend specifying the business's top 10 reports or queries.

If requirements are gathered by focusing exclusively on report or query templates, you're susceptible to modeling data to produce a specific report, rather than capturing the key metrics and related dimensions for analysis. Obviously, it's important to consider business usage when designing dimensional models. The dimension attributes must support the BI environment's filtering and labeling requirements. Robust dimension attributes translate into nearly endless analytic slicing and dicing combinations. However, don't blindly focus on a top 10 list in isolation because priorities and hot reports will inevitably evolve.

Instead of concentrating on specific reports or departmental needs in a vacuum, we suggest focusing the dimensional design on the most critical performance measurement process. In doing so, you can put the following fables to rest.

FABLE: Dimensional databases are built to address a specific business report or application. When the business needs a new report, another dimensional star schema is built.

FACT: Dimensional models should be built around physical measurement processes or events. A fact table row is created when a measurement occurs. The associated dimension attributes reflect contextual characteristics and hierarchies. If the business identifies a new report based on the same measurement process, there's no need to build a new mart, model, or schema. Measurement processes are relatively stable in most organizations; the analytics performed against these metrics are more fluid.

FABLE: Dimensional models are departmental solutions. When a different department needs access to the data, a new model is built and labeled with the department's vocabulary. Dimensional models require multiple extracts from the same source data repeatedly.

FACT: Dimensional models shouldn't be departmentally bound. A fact table representing a fundamental measurement process need only have one physical instance that's shared across business functions or departments. There's no reason to create multiple extracts from the same source. Metrics resulting from the invoicing business process, for example, are made available in a single dimensional model for access across the enterprise; there's no reason to replicate invoice performance metrics in separate departmental solutions for finance, marketing, and sales. Even if these departmental solutions were sourced from the same repository, they likely use similar, but slightly different naming conventions, definitions, and business rules, defeating the promise of a single version of the truth. The departmental approach is highly vulnerable to inconsistent, nonintegrated point solutions. We've never advocated this approach.

FABLE: You can't incorporate new data sources without rebuilding the original star schema or creating separate fact tables or data marts.

FACT: If the new data source is another capture system for an existing measurement process in the BI environment, then the new data can be gracefully combined with the original data without altering any existing reporting applications, presuming the granularity is the same. If the new data source is at a different grain representing a new measurement process, then a new fact table must be created. This has nothing to do with dimensional modeling. Any data representation would create a new entity when a new table with different keys is introduced.

FABLE: With dimensional modeling, the fact table is forced to a single grain that is inflexible.

FACT: Having the discipline to create fact tables with a single level of detail assures that measurements aren't inappropriately double-counted. A table with mixed grain facts can only be queried by a custom application knowledgeable about the varying levels of detail, effectively ruling out ad hoc exploration. If measurements naturally exist at different grains, then the most foolproof design establishes a fact table for each level. Far from being inflexible, this approach protects existing applications from breaking or recoding as changes occur.

Begin with Atomic Details, Not Summarized Data

Some claim that dimensional models are intended for managerial, strategic analysis and, therefore, should be populated with summarized data, not operational details. We strongly disagree. Dimensional models should be populated with atomic data so business users can ask very precise questions. Even if users don't care about the details of a single transaction, their question of the moment involves summarizing the details in unpredictable ways. Database administrators may presummarize some information, either physically or via materialized views, to avoid on-the-fly summarization with every query. However, these aggregates are performance tuning complements to the atomic level, not replacements. If you create dimensional models with atomic details, the following fables are nonissues.

FABLE: Star schemas and dimensional models presuppose the business question. When the requirements change, the model must be modified.

FACT: When you presummarize information, you've presupposed the business question. However, dimensional models with atomic data are independent of the business question because users can roll up or drill down ad infinitum. They answer new, previously unspecified questions without database changes.

FABLE: Star schemas and dimensional models are only appropriate when there's a predictable pattern of usage. Dimensional models aren't appropriate for exploratory queries.

FACT: Both normalized and dimensional models contain the same information and data relationships; both are capable of answering exactly the same questions, albeit with varying difficulty. Dimensional models naturally represent the physics of a measurement event; fact tables contain the measurements, and dimension tables contain the context. A single dimensional model based on the most atomic data is capable of answering all possible questions against that data.

FABLE: Dimensional models aren't scalable. If detailed data is stored in a dimensional data mart, performance will be degraded. Data marts only contain recent information and are restricted from storing history.

FACT: Dimensional star schemas are extremely scalable. It isn't unusual for modern fact tables to have billions of rows corresponding to the billions of measurement transactions captured. Million row dimension tables are common. Dimensional models should contain as much history as required to address the business requirements. There's nothing about dimensional modeling that prohibits the storage of substantial history.

FABLE: Dimensional models aren't extensible and are unable to address future needs of the data warehouse.

FACT: Dimensional models that express data at the lowest level of detail deliver maximum flexibility and extensibility. Users can summarize the atomic data any which way. Likewise, atomic data can be extended with additional attributes, measures, or dimensions without disrupting existing reports and queries.

FABLE: A dimensional model can't support complex data. It eliminates many-to-many relationships between entities, allowing only many-to-one relationships.

FACT: The logical content of dimensional models and normalized models is identical. Every data relationship expressed in one model can be accurately expressed in the other model. Dimensional models are always based on fact tables, which are completely general many-to-many relationships. A dimensional model is a form of an entity-relationship model with unnecessary snowflaking (normalization of dimension attributes) suppressed.

Integration Is the Goal, Not Normalization

Some people believe normalization solves the data integration challenge. Normalizing data contributes nothing to integration, except forcing data analysts to confront the inconsistencies across data sources.

Data integration is a process apart from any specific modeling approach. It requires identifying incompatible labels and measures used by the organization, then reaching consensus to establish and administer common labels and measures enterprise-wide. In dimensional modeling, these labels and measures reside in conformed dimensions and conformed facts, respectively. As represented in the enterprise data warehouse bus architecture, conformed dimensions are the integration glue across measurement business processes. Conformed dimensions are typically built and maintained as centralized persistent master data during ETL, then reused across dimensional models to enable data integration and ensure semantic consistency.

FABLE: Dimensional modeling concepts like conformed dimensions put an undue burden on the ETL effort.

FACT: Data integration depends on standardized labels, values, and definitions. It's hard work to reach organizational consensus and implement the corresponding ETL system rules, but you can't dodge the effort, regardless of whether you're dealing with a normalized or dimensional model.

FABLE: Dimensional modeling isn't appropriate when there are more than two unique source systems due to the complexities of integrating data from multiple sources.

FACT: The challenges of data integration have nothing to do with the modeling approach. Paradoxically, dimensional modeling and the bus architecture reveal the labels and measures of a business so clearly that an organization has no choice but to address the integration problems directly.

FABLE: Changes to dimension attributes are only an issue for dimensional models.

FACT: Every data warehouse must deal with time variance. When a characteristic of an entity like customer or product changes, we need a systematic approach for recording the change. Dimensional modeling uses a standard technique known as slowly changing dimensions (SCDs). When normalized models step up to the issue of time variance, they typically add time stamps to the entities. These time stamps capture every entity change (just like a type 2 SCD does), but without a surrogate key for each new row, the query interface must issue a double-barreled join that constrains both the natural key and the time stamp between every pair of joined tables, putting an unnecessary, unfriendly burden on every reporting application or query.
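A hedged comparison in SQL (hypothetical names) shows the difference: with type 2 surrogate keys the fact row already points at the correct customer profile, so one equijoin suffices, whereas the time-stamped natural key forces every join to constrain both the key and the validity window:

-- Type 2 SCD: simple equijoin on the surrogate key
SELECT c.customer_segment, SUM(f.sales_dollars) AS sales_dollars
FROM   sales_fact f
JOIN   customer_dim c ON c.customer_key = f.customer_key
GROUP BY c.customer_segment;

-- Time-stamped normalized entity: the double-barreled join
SELECT c.customer_segment, SUM(f.sales_dollars) AS sales_dollars
FROM   sales_fact f
JOIN   customer c
  ON  c.customer_natural_key = f.customer_natural_key
  AND f.transaction_date >= c.row_effective_date
  AND f.transaction_date <  c.row_expiration_date
GROUP BY c.customer_segment;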

FABLE: Multiple dimensional models can't be integrated. They're built bottom-up, catering to the needs of an individual department, not the needs of an enterprise. Data mart chaos is the inevitable outcome.

FACT: It's definitely a struggle to integrate databases of any flavor that have been built as departmental, standalone solutions that haven't been architected with conformed dimensions. That's precisely why we advise against this approach! Chaos won't result if you use the bus architecture for the enterprise framework of conformed dimensions, then tackle incremental development based on business measurement processes. Organizational and cultural obstacles are inevitable as consistent definitions, business rules, and practices are established across the enterprise. The technology is the easy part.
