Chapter 7
Dimensional Modeling Tasks and Responsibilities

With a basic understanding of dimensional modeling concepts in hand, we turn our attention to focus on the dimensional modeling process itself—how do we go about tackling a design, who's involved, and what do we need to worry about during the design activities? This chapter is divided into two sections; the first deals with issues surrounding an initial design starting from a blank sheet of paper, and the second half deals with the tasks surrounding the review of an existing dimensional model or system implementation.

Design Activities

The articles in this section focus on the tasks and players involved during the design of a new dimensional model, beginning with a historical perspective that is still remarkably valid.

7.1 Letting the Users Sleep

Ralph Kimball, DBMS, Dec 1996 and Jan 1997

This content was originally published as two consecutive articles.

The job of a data warehouse designer is a daunting one. Often the newly appointed data warehouse designer is drawn to the job because of the visibility and importance of the data warehouse function. In effect, management says to the designer: “Take all of the enterprise data and make it available to management so that they can answer all of their questions and sleep at night. And please do it very quickly, and we're sorry but we can't add any more staff until the proof of concept is successful.”

This responsibility is exciting and very visible, but most designers feel overwhelmed by the sheer enormity of the task. Something real needs to be accomplished, and fast. Where do you start? Which data should be brought up first? Which management needs are most important? Does the design depend on the details of the most recent interview, or are there some underlying and more constant design guidelines that you can depend on? How do you scope the project down to something manageable, yet at the same time build an extensible architecture that will gradually let you build a comprehensive data warehouse environment?

These questions are close to causing a crisis in the data warehouse industry. Much of the recent surge in the industry toward “data marts” is a reaction to these very issues. Designers want to do something simple and achievable. No one is willing to sign up for a galactic design. Everyone hopes that in the rush to simplification, the long term coherence and extendibility of the design will not be compromised.

Fortunately, a pathway through this design challenge achieves an implementable immediate result, and at the same time, continuously augments the design so that eventually a true enterprise data warehouse is built. This divide-and-conquer approach is based on business process subject areas (originally referred to as data marts). Thus, an enterprise data warehouse is revealed as the union of a set of separate business process subject areas implemented over a period of time, possibly by different design teams, and possibly on different hardware and software platforms.

Each business process subject area is designed using a nine-step design methodology:

  1. Choose the process.
  2. Choose the grain.
  3. Identify and conform the dimensions.
  4. Choose the facts.
  5. Store precalculations in the fact table.
  6. Round out the dimension tables.
  7. Choose the duration of the database.
  8. Determine the need to track slowly changing dimensions.
  9. Decide the physical design.

As a result of interviewing marketing users, finance users, sales force users, operations users, middle management, and senior management, a picture emerges of what is keeping these people awake at night. You can list and prioritize the primary business issues facing the enterprise. At the same time, you should conduct a set of interviews with the operational systems' DBAs who will reveal which data sources are clean, which contain valid and consistent data, and which will remain supported over the next few years.

Preparing for the design with a proper set of interviews is crucial. Interviewing is also one of the hardest things to teach people. I find it helpful to reduce the interviewing process to a tactic and an objective. Crudely put, the tactic is to make the business users talk about what they do, and the objective is to gain insights that feed the nine design decisions. The tricky part is that the interviewer can't pose the design questions directly to the users. Business users don't have opinions about data warehouse system design issues; they have opinions about what is important in their business lives. Business users are intimidated by system design questions, and they are quite right when they insist that system design is IT's province, not theirs. Thus, the challenge of the data warehouse designer is to meet the users far more than halfway.

In any event, armed with the top-down view (what keeps management awake) and the bottom-up view (which data sources are available), the data warehouse designer is ready to tackle the nine design decisions:

  1. Choose the process. By process, I mean the content of a particular operational activity. The first business process subject area you build should be the one with the most bang for the buck. It should simultaneously answer the most important business questions and be the most accessible from a data extraction point of view. A great place to start in most enterprises is to model the process consisting of customer invoices or monthly statements, as shown in Figure 7.1. This data source is probably fairly accessible and of fairly high quality. One of Kimball's laws is that the best data source in any enterprise is the record of “how much money they owe us.” Unless costs and profitability metrics are readily available, it's best to avoid adding these items to this first project; nothing drags down a data warehouse implementation faster than a heroic or impossible mission to provide activity-based costing as part of the first deliverable.
    Customer Invoice Line Item fact table with Time, Product, Shipper, Customer, Promotion, and Status shrunken dimension tables linking to Key items.

    Figure 7.1 Customer invoice line item fact table.

  2. Choose the grain. This second step seems like a technical detail at this early point, but it is actually the secret to making progress on the design. Choosing the grain means deciding exactly what a fact table record represents. Recall that the fact table is the large central table in the dimensional design that has a multipart key. Each of the components of the multipart key is a foreign key to an individual dimension table. In the customer invoices example in Figure 7.1, the grain of the fact table is the individual line item on the customer invoice. In other words, a line item on an invoice is a single fact table record, and vice versa. Only when you have chosen the grain can you have a coherent discussion of the business process's corresponding dimensions.
  3. Identify and conform the dimensions. The dimensions are the source of both the query constraints and row headers in the user's reports; they carry the enterprise's vocabulary to the users. A well-architected set of dimensions makes the model understandable and easy to use. A poorly presented or incomplete set of dimensions robs the model of its usefulness.

    Dimensions should be chosen with the enterprise data warehouse in mind. This choice presents the primary moment at which the data warehouse architect must lift up his or her gaze from the project details and consider the longer range plans. If any dimension occurs in two business processes, they must contain an overlapping set of attributes (fields) that are drawn from the same domains. Only in this way can two processes share such a dimension in the same application. When a dimension is used in two business process subject areas, this dimension is said to be conformed. Good examples of dimensions that absolutely must be conformed are the customer and product dimensions in an enterprise. If these dimensions are allowed to drift out of synchronization across processes, the overall data warehouse will fail because the business process subject areas will not be able to be used together.

    The requirement to conform dimensions across business processes is very strong. Careful thought must be given to this requirement before the first business process is implemented. The DW/BI team must figure out what an enterprise customer ID is and what an enterprise product ID is. If this task is done correctly, successive business processes can be built at different times, on different machines, and by different development teams, and these process subject areas will merge coherently into an overall data warehouse. In particular, if the dimensions of two processes are conformed, it is easy to implement drill across by sending separate queries to the two subject areas, and then sort-merging the two answer sets on a set of common row headers. The row headers can be made to be common only if they are drawn from a conformed dimension common to the two processes, as shown in Figure 7.2.

    Invoices fact table depicts 1 record per invoice line item while Customer Communications fact table, 1 record per promotions response. Bottom: Table comparing responses and invoices by 2 promotions of Product A.

    Figure 7.2 Two subject areas with conformed dimensions.

  4. Choose the facts. The grain of the fact table determines which facts you can use for the given business process. All of the facts must be expressed at the uniform level implied by the grain. In other words, if the grain of the fact table is an individual line item on a customer bill, then all of the numerical facts must refer to this particular line item. Also, as I have said many times before, the facts should be as additive as possible.

    Figure 7.3 shows a “bad” fact table with a horrible mixture of non-numeric facts, non-additive facts, and facts at the wrong grain. This design is unusable.

    A “bad” fact table labeled Poorly Designed Sales listing items Unit Price and Gross Margin as non-additive facts and Year-to-Date Sales and Last Year Year-to-Date as wrong granularity facts.

    Figure 7.3 An example of a “bad” fact table.

    Figure 7.4 shows a “good” fact table in which all of the data elements in the first design have been correctly recast so that they are as numeric and additive as possible.

    Notice that additional facts can be added to the fact table at any time, as long as they are consistent with the grain of the table. These additional facts do not invalidate any previously functioning applications.

    A “good” fact table labeled Correctly Designed Sales listing items Daily Sales as correct grain and aggregates in separate tables, Extended Unit Price and Extended Cost as fully additive.

    Figure 7.4 An example of a “good” fact table addressing the issues in Figure 7.3.

  5. Store precalculations in the fact table. A common example of the need to store precalculations occurs when the facts comprise a profit and loss statement. This situation will often arise when the fact table is based on a customer bill. Figure 7.5 shows the fact table starting off with the quantity sold and extended list price. Both are beautiful additive quantities from which you can always derive the average unit list price after you have added up some number of fact table records. Of course, the customer doesn't usually pay the list price. You need to subtract any allowances and discounts to arrive at the extended net price. Because the extended net price can always be derived from the extended list price minus the allowances and discounts, do you need to store the extended net price explicitly?

    Image described by surrounding text.

    Figure 7.5 A customer billing fact table.

    The answer is a resounding YES! This example is complicated enough that if there is even the smallest chance that a user will derive the extended net price incorrectly, you should put it into the underlying physical schema, even though it takes up space. The cost of a user incorrectly representing the extended net price, which after all is the primary revenue number of your whole enterprise, overwhelms the minor cost of a little redundant data storage. Note that a view that simply calculates the extended net price is somewhat dangerous if there is any chance that a user with an ad hoc query tool can sneak around the view to get at the physical table. In the long run, views are a good way to achieve a balance between eliminating user blunders and saving on storage, but the DBA must allow no exceptions to the users always accessing the data through the view.

  6. Round out the dimension tables. At this point the fact table is complete, and you understand the roles of the dimension tables in providing the entry points into the fact table through constraints on the dimensional attributes. The grain decision in step 2 also determined the grain of each dimension table. For instance, if the grain is an individual customer bill line item, then the customer dimension grain is probably the customer bill-to address, and the product/service dimension grain is the lowest level of the product/service hierarchy at which you do billing. In step 3, you should have identified the dimensions in sufficient detail to describe such things as customers and products at the correct grain.

    In this step, you can return to the dimension tables and exhaustively add as many text-like descriptors to the dimensions as possible. In my data warehouse design consulting, I insist that clients identify a minimum of 50 text-like attributes for such important dimensions as customer and product. Even inherently small dimensions such as transaction type should be graced with good text descriptions of what each transaction type means. Chances are these transactions can also be arranged in groups. The transaction group should be another text attribute.

    All of the text attributes should consist of real words. Cryptic abbreviations are extremely undesirable. Remember that these text attributes are both the user interface to the application and the row and column headers in printed reports. IT shops must do a very professional job of quality assurance on the dimension table attributes.

  7. Choose the duration of the database. The duration measures how far back in time the fact table goes. In many businesses, there is a natural need to look at the same time period a year ago. This need usually mandates at least five calendar quarters of data. Near the end of a calendar year, this implies two full years' worth of data. These arguments can be repeated with less intensity for two-year-old data, three-year-old data, and so on.

    Insurance companies and organizations with regulatory reporting requirements may have very long fact table durations, often extending back seven or more years. These long duration fact tables raise at least two very significant data warehouse design issues. First, it is often increasingly difficult to source old data as you go backward in time. The older the data, the more likely there will be problems in reading and interpreting the old files. Second, it is mandatory that the old versions of the important dimensions be used, not the most current versions.

  8. Determine the need to track slowly changing dimensions. The proper description of the old product and old customer must be used with the old transaction history. The data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of customers and products over a span of time. I discussed these design issues in article 1.8, Slowly Changing Dimensions.
  9. Decide the physical design. After the first eight steps, you have a complete logical design of your business process subject area. You are ready to turn your attention to physical design issues. In this step, I restrict attention to the biggest physical design issues affecting the user's perception of the subject: the physical sort order of the fact table on the disk and the presence of pre-stored summaries or aggregations. Beyond these issues, there are a host of additional physical design issues affecting administration, backup, indexing performance, and security.

    The physical sort order on the disk can be a significant design tool in a data warehouse. In my data warehouse design classes, I discuss a “headquarters sort” and a “field sales sort.” The data warehouse designer usually must choose one of these sorts at the expense of the other. In extreme circumstances, the designer can affect performance by a factor of four or more, depending on this choice.

If you systematically perform the nine steps, you will not only end up with a complete and detailed design that drives the implementation, but you will understand how to tie your separate business process subject areas together with conformed dimensions so that over time you will end up with an enterprise data warehouse.

7.2 Practical Steps for Designing a Dimensional Model

Margy Ross, Intelligent Enterprise, Sep 29, 2008

The Kimball Group has written hundreds of articles and Design Tips about dimensional modeling techniques, but we haven't written much about the dimensional modeling process itself. What are the tasks and deliverables required to create a robust design?

Before embarking on a dimensional design project, you need a solid understanding of the business's requirements along with a reasonable assessment of the underlying source data. It's tempting to skip the requirements review, but resist this urge because doing so increases the risk of developing a source-driven model that falls short of business needs in numerous small but significant ways. Optimally, the requirements have been thoroughly researched and documented in a user-approved requirements finding document with the top priority needs clearly identified. This deliverable often also includes a preliminary data warehouse bus matrix, as described in article 5.6, The Matrix: Revisited.

In addition to the business requirements and bus matrix, you'll also want to review any source data profiling insights uncovered by the project team to date. The final item on your required reading list is your organization's naming convention standards document. If you don't already have naming conventions to adapt for data warehousing and business intelligence, you'll need to establish them as you develop the dimensional model (see article 7.7, The Naming Game).

Join the Party

Everyone knows that the key to a successful party is inviting the right people. The same holds true for the dimensional modeling festivities. Unfortunately, too many data modelers view the development of a dimensional model as a solo, independent activity; they retreat to their ivory cubicle to ponder the pros and cons of modeling alternatives on their own, emerging weeks later to unfurl their masterpiece for others to review. While the data modeler should be leading the charge and retain primary responsibility for the deliverables, the best way to arrive at a sound dimensional design is through a collaborative team effort because no one person likely has detailed knowledge of both the business requirements and source system idiosyncrasies.

The modeling team should include people who can accurately represent the business users' analytic needs, such as the business analyst, power user, BI application developer, or all of the above. In particular, power user involvement is valuable as they've likely already identified business rules to convert source data into more meaningful information for decision making; their insights often result in a richer, more analytically complete design. The designated data stewards should be involved in the modeling process to drive to organizational agreement on names, definitions, and business rules. If you haven't already established a data stewardship program, there's no time like the present to do so.

It's also beneficial to have some source experts at least intermittently involved in the process to quickly answer questions and resolve issues regarding data timing and content nuances. Finally, you should invite members of the ETL team to the party to gain early insights about the model and its source-to-target mapping. ETL team members typically have more to gain than to contribute to the modeling process, but getting their buy-in to the design saves time and avoids tire spinning down the road. For the same reason, you should also invite the DBAs who will implement the physical design.

With the team assembled and prerequisite reading completed, you'll need to briefly introduce dimensional modeling so everyone on the team understands its core principles (and appreciates that denormalization is not always evil).

The team's first objective is to reach agreement on a high level model diagram, or bubble chart, as illustrated in Figure 7.6. The bubble chart represents a fact table corresponding to a single business process; it includes a clearly articulated grain declaration and identification of core dimensions. The bubble chart can often be derived largely from the preliminary bus matrix. It serves to get everyone on the same page regarding the scope of the model without getting unnecessarily mired in the details. Because it's easy to understand, the high level diagram also facilitates discussion between the design team and business partners.

A high-level bubble chart diagram wherein a box labeled Orders links to circles labeled Currency, Promotion, Channel, Customer, Sales Person, Order Indicators, Product, and Order Date.

Figure 7.6 Example high level bubble chart diagram.

Dive into the Details

With consensus on the bubble chart, the team then launches into table-by-table and column-by-column discussions, drilling into more detail about the requisite attributes and metrics, including definitions, transformation rules, and data quality concerns. The dimensional model will unfold through a series of design sessions, with each pass producing a more detailed design that's been repeatedly tested against your understanding of the business needs.

Don't attempt to schedule all-day design meetings; plan for a number of two hour morning and afternoon sessions so the lead modeler has time to update the documentation before the next meeting (and so team members can deal with the demands of their day jobs). Your initial sessions should focus on a more straightforward dimension table so the team experiences a quick win before dealing with the more controversial dimensions.

As you're iteratively fleshing out the design, team members will need to further interrogate the data. Thus, ongoing access to a profiling tool (or less sophisticated method) will be critical. Ferreting out the good, bad, and ugly realities of your source data during the design process will minimize the surprises and corresponding overruns during the ETL design and development activity.

Throughout the design sessions, the data modeler or designated scribe should be filling in a detailed worksheet for each table with information such as the attribute or fact name, column description, sample values and decodes, change tracking rules for dimension attributes, and preliminary transformation business rules. These worksheets form the basis for the source-to-target mapping, which is further embellished by the physical designer and then ultimately handed off to the ETL team. In addition to the detailed worksheets, a member of the design team should also be logging open issues so they're captured in a single document to facilitate review and assignment at the end of every session.

Review the Results

The last phase of the modeling process involves reviewing and validating the model with interested parties, starting with the project team, extending to those in IT with intimate knowledge of the source systems, and concluding with the broader business community. Approval from this last group is critical before time and money are invested in the data implementation. Plan to devote much of this review to illustrating how the model will address sample questions from the requirements findings.

A typical design effort usually takes three to four weeks for a single business process dimensional model, but the time required will vary depending on the complexity of the business process, availability of preexisting conformed dimensions, experience of the modeling team, existence of well documented business requirements, and the difficulty of reaching consensus.

Designing a dimensional model with interested parties representing diverse skills requires commitment and cooperation, but the end result is a robust model that has been rigorously tested against both the business needs and data realities. That's exactly what you want before you move ahead with the implementation.

7.3 Staffing the Dimensional Modeling Team

Bob Becker, Design Tip #103, Jul 1, 2008

It's surprising the number of DW/BI teams that confine the responsibility for designing dimensional models to a single data modeler or perhaps a small team of dedicated data modelers. This is clearly shortsighted. The best dimensional models result from a collaborative team effort. No single individual is likely to have the detailed knowledge of the business requirements and the idiosyncrasies of all the source systems to effectively create the model themselves.

In the most effective teams, a core modeling team of two or three people does most of the detailed work with help from an extended team. The core modeling team is led by a data modeler with strong dimensional modeling skills and excellent facilitation skills. The core team should also include a business analyst who brings a solid understanding of the business requirements, the types of analysis to be supported, and an appreciation for making data more useful and accessible. Ideally, the core team will include at least one representative from the ETL team with extensive source systems development experience and an interest in learning.

We recommend including one or more analytic business users as part of the core modeling team. These power users add significant insight and help speed the design process. They are particularly valuable to the modeling process because they understand the source systems from a business point of view.

The core modeling team needs to work closely with source system developers to understand the contents, meaning, business rules, timing, and other intricacies of the source systems involved in populating the dimensional model. If you're lucky, the people who actually built or originally designed the source systems are still around. For any given dimensional model, there are usually several source system people you need to pull into the modeling process. There might be a DBA, a developer, and someone who works with the data input process. Each of these folks does things to the data that the other two don't know about.

The DBA implementing the physical database, the ETL architect/developer, and the BI architect/developer should also be included in the modeling process. Being actively engaged in the design process will help these individuals better understand the business reasons behind the model and facilitate their buy-in to the final design. Often the DBA comes from a transaction system background and may not understand the rationale for dimensional modeling. The DBA may naturally want to model the data using more familiar normalization design rules, physically defeating your dimensional design. ETL designers often have a similar tendency. Without a solid appreciation of the business requirements and justification for the dimensional design, the ETL designer will want to streamline the ETL process by shifting responsibility for calculations to the BI tool, skipping a description lookup step, or taking other shortcuts. Though these changes may save ETL development time, the trade-off may be an increase in effort or decrease in query performance for hundreds of business users. BI designers can often provide important input into the models that improve the effectiveness of the final BI applications.

Before jumping into the modeling process, take time to consider the ongoing management and stewardship implications of the DW/BI environment. If your organization has an active data stewardship initiative, it is time to tap into that function. If there is no stewardship program, it's time to initiate the process. An enterprise DW/BI effort committed to dimensional modeling as an implementation approach must also be committed to a conformed dimension strategy to assure consistency across business processes. An active data stewardship program can help an organization achieve its conformed dimension strategy.

Although involving more people in the design process increases the risk of slowing down the process, the improved richness and completeness of the design is well worth the additional overhead.

7.4 Involve Business Representatives in Dimensional Modeling

Bob Becker, Design Tip #157, Jun 28, 2013

The Kimball Group has always stressed the importance of keeping a keen eye on the business requirements when designing dimensional data models for the data warehouse/business intelligence (DW/BI) environment. Gathering business requirements is typically undertaken just prior to beginning the dimensional data model design process. Article 4.2, More Business Requirements Gathering Dos and Don'ts, is a reminder of requirements gathering dos and don'ts. We also believe it's extremely important to include key business representatives in the design process itself. Unfortunately, many organizations balk at including business representatives in the design activities. They view dimensional modeling as a technical exercise focused on modeling data elements and fail to appreciate the value of involving business subject matter experts.

My manta regarding the dimensional data modeling process is “Remember—data warehousing is not about the data. It's about serving the business requirements!” Okay, I'll admit that data warehousing is about the data to a significant extent, but when it comes to the design process, I'll stand by my mantra. Worrying primarily about the data and failing to focus on the business requirements is a critical mistake. Including business representatives in the design process and keeping them engaged will result in a significantly better design. Often it's hard to appreciate the value of business involvement until you are deep into the design process. But at the end of the day, the resultant data model must support the business requirements or the DW/BI initiative will fail. Article 4.12, Using the Dimensional Model to Validate Business Requirements, describes how to leverage the dimensional data model to validate and capture business requirements.

To effectively leverage the participation of business users, it is important to constantly keep them engaged. If discussions turn away from modeling the requirements and turn to detailed ETL design topics or other technical issues, the business users will tune out and ultimately stop participating. To keep them involved, maintain the focus on the design and how it supports the business requirements. The technical discussions can wait for another time.

Business users will actively participate in the design process if given half a chance. At the beginning they won't be sure about how to participate as this is new to them. Fortunately, there are countless opportunities during a design session to engage the business. Every key dimensional attribute or metric is ripe for discussion. Key questions include: “Why is this data element important? How do you use it? What will you do with it in a report or analysis? What other attributes or metrics do you combine it with? Why?” Typically, these discussions lead to a deeper understanding of the business requirements that may not have surfaced earlier.

For example, many business processes (such as tracking call segments in a call center) result in transaction fact tables that include Begin Time and End Time dimensions. An inexperienced data-oriented design team that does not include business representation will be satisfied with a fact table with both these dimensions. A more experienced design team might query their business partners regarding the use of these time dimensions and discover they are used to calculate the call's duration. With this newly understood requirement, the design team will enhance the fact table to include call duration as a metric. This enhancement results in a more compelling design for the business users who can now measure average call durations by any of the associated dimension attributes, enabling them to understand service levels without performing the duration calculations at query time. From a business perspective, this represents a significant usage improvement over the initial design.

Continued discussions with the business users regarding this metric might uncover the existence of standard service levels that have been defined by the business. For example, you may discover the business has recently established a new baseline for classifying call durations: a call duration of 2 to 5 minutes is considered “Normal,” 1 to 2 minutes is considered “Normal – Short,” less than 1 minute is considered “Abnormal – Short,” and likewise for “Normal – Long” and “Abnormal – Long.” The users explain that this classification will serve as the basis for several planned dashboards and scorecards based on the new data warehouse environment. Aha! Here is another opportunity to embellish the design to better support the business requirements. Clearly, you will want to create a Duration Type dimension that includes this classification, plus a rollup attribute to aggregate all “Normal” durations and “Abnormal” durations.

Due to the active participation of your business partners, the resulting design will better support their underlying business requirements. While the initial design would have included all the data elements required to support the requirements, it would have fallen short on ease of use and richness from a business perspective. Thus, the mantra that data warehousing is not just about the data; it's about serving the business requirements. The active, engaged participation of your business partners in the design process will result in a design that is far more effective than solely focusing on modeling the data.

7.5 Managing Large Dimensional Design Teams

Bob Becker, Design Tip #161, Nov 5, 2013

Regular readers know we stress the importance of focusing on business requirements when designing dimensional data models to support the data warehouse/business intelligence (DW/BI) environment. It is critical to include business partners in the dimensional design process.

But including business representatives on the design team obviously increases the size of the group. In many organizations, the resulting team will be a small group of four to eight individuals. In these situations, managing the design process is relatively straightforward. The team needs to gather on a regular basis, focus on the effort at hand, and follow a defined process to complete the modeling effort.

However, in larger organizations, especially when the scope includes tackling enterprise-wide conformed dimensions, the design team may be considerably larger. In recent years we've participated in design projects with over 20 participants representing different departments. Large design teams introduce several additional complications that need to be overcome.

The first obstacle is ensuring the consistent participation of team members in all the design sessions. Everyone involved has normal day-to-day job responsibilities, in addition to their design team involvement. Inevitably participants will face pressing issues outside the design process that require their presence. The larger the group, the more frequently these absences will occur. When an individual misses important deliberations regarding key issues, the team will need to circle back with the individual, revisit the discussion and design options, and then perhaps reconsider earlier decisions. This discourse may be important to the overall design, but it negatively impacts the team's productivity. Excessive backtracking and rehashing is frustrating and draining to the group.

With large design teams, you should avoid overly aggressive scheduling to ensure the highest level of consistent participation. Don't schedule full weeks or even full days for design sessions. We suggest limiting design sessions to no more than three days in a week; Tuesday through Thursday seems to work best. Instead of full-day sessions, schedule two sessions per day, each two and a half hours in duration. Start the morning session a little later than the normal start time, take a two-hour midday break, and finish before the normal end time. In addition to Mondays and Fridays, this schedule gives participants time at the beginning, middle, and end of each day to schedule meetings, deal with email, and other daily responsibilities. Each participant only needs to allocate fifteen hours per week to the design sessions. In exchange, each participant is expected to firmly commit this time to the design team. The goal is full participation in all design sessions resulting in greater overall productivity and minimal backtracking.

When the focus of the design effort is on core conformed dimensions, it is important that all the business representatives participate since the goal is enterprise agreement on the key attributes that must be conformed across the organization. However, once the team's attention turns to specific business processes and the associated fact table designs, it is often possible to excuse some of the business representatives not involved/interested in a particular business process for several design sessions.

Remember that any effort to define core conformed dimensions across business processes requires a clear and urgent message from senior management that they expect the effort to produce results. IT by itself cannot “herd the cats through the door.” Make sure you have clear and visible guidance from senior management before you start the dimension conforming process or you will be wasting your time.

Occasionally, gnarly design challenges will arise. Often these issues are relevant or only thoroughly understood by a small group of participants. Trying to work through these very specific issues may be counterproductive for most members of the larger design team. In these situations, it makes more sense to table the discussions during the general design sessions and assign a smaller workgroup to work through the issues and then bring the conclusions/recommendations back to the larger group.

Effective facilitation is often another large design team challenge. Ideally, the lead dimensional data modeler has the required skills to facilitate the group. However, it is sometimes necessary to team a skilled facilitator alongside the dimensional data modeler. In either case, make sure the facilitator and/or modeler possess the key skills required to guide a large team effort:

  • Deep knowledge of dimensional data modeling concepts and techniques, including the pros and cons of design alternatives
  • Understanding of the organization's business processes and the associated business requirements surrounding those processes within the design effort's scope
  • Self-confidence to appreciate when to remain neutral on an issue and when to push back. Occasionally, the facilitator/modeler needs to take a contrary position to help participants clearly articulate their requirements and concerns.
  • Keen listening skills. Some participants will not be well versed in dimensional modeling, yet will be communicating key requirements which they're unable to express in modeling terms.
  • Strong facilitation skills to draw out participants, adequately debate key issues, control wandering discussions, retain focus on the goal, and ultimately, ensure success

We also suggest that one team member be assigned to documenting the design and outstanding issues during the sessions. In large group designs, the facilitator/modeler is primarily focused on understanding the requirements and translating those requirements into an optimal dimensional model. Their work requires considerable discussion and evaluation of design options. It's a productivity gain if the facilitator/modeler doesn't need to slow down the process to capture design decisions.

7.6 Use a Design Charter to Keep Dimensional Modeling Activities on Track

Bob Becker, Design Tip #138, Sep 7, 2011

The logical dimensional model should be developed jointly by representatives from all interested groups: business users, reporting teams, and the DW/BI project team. It is important that the appropriate individuals are represented on the dimensional data model design team as described in article 7.3, Staffing the Dimensional Modeling Team, in order to achieve an effective design. The best dimensional models result from a collaborative team effort. No single individual is likely to have the detailed knowledge of the business requirements and the idiosyncrasies of all the source systems to effectively create the model themselves.

However, involving more people in the design process increases the risk of slowing down the process. With so many individuals involved, it's important that the lead designer/facilitator keeps the group on track. The team may find itself spiraling into deep, complex discussions of data elements only to determine that the data element in question isn't within the design scope or perhaps isn't a reasonable candidate to be included in the design.

A helpful strategy for limiting long resource-draining discussions is to establish a “design charter” early in the design process. The goal of the design charter is to keep the team focused on the key issues and avoid runaway discussion on ancillary topics. The design charter is simply the mantra the design team looks to when in doubt to help guide the process. As data elements are discussed and it is not readily apparent whether the item is in scope or out of scope, the design team can run the data element through the project's design charter. While each design team should develop its own specific charter, the following are examples of charters we've used in the past:

  • Does the data element help support the business requirements that are in scope?

    For example, one insurance client's core business requirements included supporting the organizations' financial, management, and regulatory reporting requirements. All proposed data elements were passed through that filter.

    Another client's goal was to build an analytic platform enabling the business to support world class analytics. In that case the filter became: Is the data element analytically interesting? Will it be used to support analytic requests? Or does it exist only to support a need of the operational system?

  • What does the data element describe? Is it a dimension attribute used for slicing/dicing/grouping/constraining? Or is it a metric? If the data element is an amount, is it a metric or dimensional attribute? If it is an amount that behaves like an attribute, can it be range valued?
  • If it is a dimension attribute, can it change? If it can change, will analysis based on the value at the time of the measurement be required? If it can change, will analysis based on the current value be required? Or both?

It is easy for the design team to get bogged down trying to incorporate the seemingly endless data elements available. To stay on track, keep the design charter in mind at all times during the design process. The first bullet is critical: Does the data element support the business requirements? Always keep in mind that the goal is not to “model the data.” The goal is to create a data model that will support the business need. It's okay to leave data elements behind. Be especially leery of data elements that primarily support operational capabilities. Constantly ask “How will the data element be used analytically? What value does it provide?” in an effort to fully understand the data in question. Often it's not enough to merely include the data element in the design; it may need to be tweaked, enriched, or ranged to make it as valuable as possible to the business community.

7.7 The Naming Game

Warren Thornthwaite, Design Tip #71, Sep 2, 2005

The issue of field naming rears its ugly head while you're creating the dimensional data model. Naming is complex because different people have different meanings for the same name, and different names with the same meaning. The difficulty comes from human nature: Most of us don't want to give up what we know and learn a new way. The unenviable task of determining names typically falls on the data steward. If you are responsible for dealing with this political beast, you will find the following three-step approach helpful. Steps 1 and 2 generally happen before the model is presented to the business users. Step 3 usually happens after business users have seen and understand the model.

Step 1: Preparation

Begin by developing skills at thinking up succinct, descriptive, unique names for data elements. Learn your organization's (and team's) naming conventions. Study the table and column names in the various systems. If you don't have established naming conventions, now's a good time to create them. A common approach is to use a column name standard with three parts: PrimeWord_ZeroOrMoreQualifiers_ClassWord. The prime word is a categorization word that often corresponds to the entity the column is from, and in some cases, qualifiers may not be necessary. So the field in the sales fact table that represents the amount sold might be Sales_Dollar_Amount. You can research different data naming conventions and standards on the internet.

Step 2: Creating an Initial Name Set

During the modeling process, work with the modeling team (including a representative or two from the business) to draft an initial set of names and the rationale. Once the model is near completion, hold a review session with the modeling team to make sure the names make sense.

In addition to the review session, it helps to have meetings with the key stakeholders. This typically includes the core business users and any senior managers whom you have a sense might have an opinion. If their preferred name for any given column is different from your suggested name, try to figure out why. Help them be clear on their definition of the data element by asking them to explain what the term means to them. Look for missing qualifiers and class words to clarify the meaning. For example, a sales analyst would be interested in sales numbers, but it turns out that this sales number is really Sales_Commissionable_Amount, which is different from Sales_Gross_Amount and Sales_Net_Amount.

The resulting name set should be used by the data modeling team to update the current version of the data model. Keep track of the alternative names for each field and the reasons people offered for their preferred choices. This will be helpful in explaining the derivations of the final name set.

Step 3: Building Consensus

Once you have a solid, tested name set, and the core users have seen the data model presentation, gather all the stakeholders in a conference room for at least half a day (count on more if you have a lot of columns or a contentious culture) and work through it. Start from the high level model and progress through all the columns, table by table. Generally, there have been enough model reviews and naming discussions so that many of the issues have already been resolved and the remaining issues are reasonably well understood.

The goal of this session is to reach consensus on the final name set. Often this means someone has to accept the will of the majority and let go of their favorite name for a given column. It is surprising how emotional this can be. These names represent how we view the business, and people feel pretty strongly about getting them “right.” Don't let people get out of the room without reaching agreement if it is at all possible. If you have to reconvene on the same issues, it will take extra time to rehash the various arguments.

Once you have reached agreement on the final name set, document it carefully and take it back to the data modelers so they can work it into the final data model.

7.8 What's in a Name?

Joy Mundy, Design Tip #168, Jul 21, 2014

It seems like a small thing, but names are important. Good names for tables and columns are particularly important for ad hoc users of the DW/BI system who need to find the objects they're looking for. Object names should be oriented to the business users, not the technical staff.

As much as possible, strive to have the names in the DW/BI system be unchanged in the semantic layer and unchanged by report designers. More challenging, your users should be discouraged from changing object names once they've pulled the information to their desktop. We usually can't prevent them from doing that, but attractive and sensible names will reduce the temptation.

Here are my top ten suggestions for naming objects in your DW/BI system:

  1. Follow naming conventions. If you don't have them, create (and document!) naming conventions that follow the rules in this Design Tip. If your organization already has naming conventions, you may be faced with a problem: Most existing naming conventions were developed for technical people. But names in the DW/BI environment should be oriented to the business users. They become row and column names in ad hoc analyses and predefined reports. We will return to this issue later.
  2. Each object has one name. Let's not perpetuate the confusion around data definitions that already exists in our organizations. It is not OK to say that the sales team can call a column Geography and the marketing group calls the same entity Region. If it's the same column, with the same content, it has to have the same name. If you can't get agreement organization-wide on object names, enlist the help of your business sponsor.
  3. Object names are descriptive. Users should not need 20 years' tenure at your organization to decipher what a name means. This rule forbids a lot of silliness, like RBVLSPCD (we have more than 8 characters to work with!). It also forbids column names like NAME, which is non-descriptive outside the context of the table you are examining.
  4. Abbreviations and acronyms are discouraged. Abbreviations and acronyms are endemic in the corporate world, and the non-corporate world is even worse. A lot of information can be encoded in an acronym, but it places a huge burden on newcomers. The most effective approach is to maintain a list of approved abbreviations, and try not to add to them without a good reason. You may even want to document that reason in the list. Examples include:

    Abbreviation Replaces Reason
    Amt Amount Extremely common
    Desc Description Extremely common
    Corp Corporation Common
    FDIC Federal Deposit Insurance Corporation Common; familiar to all users

    For most organizations, a reasonable list has dozens of approved abbreviations, not hundreds.

  5. Object names are pretty. Remember that object names become headers in reports and analyses. Although beauty is in the eye of the beholder, I find all caps to be particularly annoying. The object names should contain a visual clue for where the words end:

    1. Spaces: [Column Name]
    2. Camel case: ColumnName
    3. Underscore: Column_Name or COLUMN_NAME

    I recommend using spaces. They look the best when displayed in reports. And I scoff at the argument that developers have to type square brackets when they type the column name. I am confident that any developer who actually types SQL can figure out where the brackets keys are located, and they can develop the requisite finger muscles.

  6. Object names are unique. This rule is a corollary to the rule that each object has one name. If two objects are different, they should have different names. This rule forbids column names such as [City]. A better name is [Customer Mailing Address City]. This rule is especially important for ad hoc use of the DW/BI system. Although the context of [City] may be obvious during the analytic process, once that analysis is saved and shared, that context is lost. Are we looking at the customer's city or the store's city, the mailing address or the shipping address? Although we can't prevent users from changing object names once they export to Excel, we don't want to force them to do so in order to be clear.
  7. Object names are not too long. This rule conflicts with rules 3, 4, and 6. If we have unique, unabbreviated, descriptive object names, the odds are that some column names will be very long. Consider [Mandatory Second Payer Letter Opt Out Code] or [Vocational Provider Referral Category]. These are reasonably descriptive column names for someone in the insurance business, but what will happen when the user or report designer drags that column into the body of a report or analysis? The name will wrap unattractively, making the header row very fat. Or it will shrink down to a font so tiny that no one can read it. And then the user will rename the column, violating our key rule that each object has one name.

    I try to limit column names to 30 characters, though sometimes I go to 35. In order to achieve this goal, I have to register more abbreviations or acronyms than I would like.

  8. Consider prepending column names with an abbreviated table name. I hate to make this recommendation, because it violates several of my previous rules about abbreviations and short names. But I am finding myself following this practice with increasing frequency in order to guarantee consistency and uniqueness.
  9. Change names in the view layer if needed. We have always recommended a set of views in the DW/BI system which sit atop the physical tables and to which all queries are directed. The primary purpose of the view layer is to provide a layer of insulation between the BI applications and the physical database, providing a bit of flexibility to smoothly migrate change to a system already in production. Additionally, the view layer can also provide a place to put the business-oriented names into the database:
    • Our first recommendation is always to name the physical tables and columns with business-oriented names. Failing that, use the view layer. We dislike changing the names in the BI tools for several reasons:
      • Most organizations have several BI tools; the names should be consistent across all BI tools.
      • The more business logic you put into the BI tool, the more challenging it will be to migrate to a different tool.
      • If the names are only in the BI tool, there is a barrier to communication between users, front room support team, and back room database people.
  10. Be consistent! It's only a foolish consistency that's the hobgoblin of little minds. Consistency in naming is hugely valuable to your users.

7.9 When Is the Dimensional Design Done?

Bob Becker, Design Tip #108, Dec 3, 2008

There is a tendency for data warehouse project teams to jump immediately into implementation tasks as the dimensional data model design is finalized. But we'd like to remind you that you're not quite done when you think you might be. The last major design activity that needs to be completed is a review and validation of the dimensional data model before moving forward with implementation activities.

We suggest you engage in a review and validation process with several successive audiences, each with different levels of technical expertise and business understanding. The goal is to solicit feedback from interested people across the organization. The entire DW/BI team benefits from these reviews because they result in a more informed and engaged business user community. At a minimum, the design team should plan on talking to three groups:

  • Source system developers and DBAs who can often spot errors in the model very quickly
  • Core business or power users who were not directly involved in the model development process
  • Broader user community

Typically the first public design review of the detailed dimensional model is with your peers in the IT organization. This audience is often comprised of reviewers who are intimately familiar with the target business process because they wrote or manage the system that runs it. Likely, they are partly familiar with the target dimensional model because you've already been pestering them with source data questions.

The IT review can be challenging because the reviewers often lack an understanding of dimensional modeling. In fact, most of them probably fancy themselves as proficient third normal form modelers. Their tendency will be to apply transaction processing-oriented modeling rules to the dimensional model. Rather than spending the bulk of your time debating the merits of different modeling disciplines, it is best to be prepared to provide some dimensional modeling education as part of the review process.

When everyone has the basic dimensional modeling concepts down, begin with a review of the bus matrix. This will give everyone a sense for the project scope and overall data architecture, demonstrate the role of conformed dimensions, and show the relative business process priorities. Next, illustrate how the selected row on the matrix translates directly into the dimensional model. Most of the IT review session should then be spent going through the individual dimension and fact tables.

Often, the core business users are members of the modeling team and are already intimately knowledgeable about the data model, so a review session with them is not required. However, if they have not been involved in the modeling process, a similar, detailed design review should be performed with the core business users. The core users are more technical than typical business users and can handle more detail about the model. Often, especially in smaller organizations, you can combine the IT review and core user review into one session. This works if the participants already know each other well and work together on a regular basis.

Finally, the dimensional data model should be shared with the broader community of business users. Often, this is a relatively large audience. In such cases a representative subsection of the users can be selected. This session is as much education as it is design review. You want to educate people without overwhelming them, while at the same time illustrating how the dimensional model supports their business requirements. In addition, you want them to think closely about how they will use the data so they can help highlight any shortcomings in the model.

Create a presentation that starts with basic dimensional concepts and definitions, and then describe the bus matrix as your enterprise DW/BI data roadmap. Review the high level model, and finally, review the important dimensions, like customer and product.

During the broader user review, you should allocate about a third of the time to illustrate how the model can be used to answer a range of questions about the business process. Pull some interesting examples from the requirements documentation and walk through them. More analytical users will get this immediately. Reassure the rest of your audience that most of this complexity will be hidden behind a set of structured reports. The point is to show you can answer just about every question they might ask about this business process.

There are usually only minor adjustments to the model once you get to this point. After working so hard to develop the model, the users may not show what you consider to be appropriate enthusiasm. The model may seem obvious to the users and makes sense; after all, it is a reflection of their business. This is a good thing; it means you have done your job well!

Design Review Activities

The second half of the chapter focuses on the review and assessment of existing dimensional models and dimensional DW/BI systems.

1 7.10 Design Review Dos and Don'ts

Margy Ross, Design Tip #120, Feb 2, 2010

Over the years, we've described common dimensional modeling mistakes, such as the next article, 7.11, Fistful of Flaws. And we've recommended dimensional modeling best practices countless times; article 6.7, Kimball's Ten Essential Rules of Dimensional Modeling, has been widely read.

While we've identified frequently observed errors and suggested patterns, we haven't provided much guidance about the process of conducting design reviews on existing dimensional models. Kimball Group consultants perform numerous design reviews like this for clients as it's a cost-effective way to leverage our experience; here are some practical dos and don'ts for conducting a design review yourself.

Before the design review…

  • Do invite the right players. Obviously, the modeling team needs to participate, but you'll also want representatives from the BI development team (to ensure that proposed changes enhance usability) and ETL development team (to ensure that the changes can be populated). Perhaps most importantly, it's critical that folks who are really knowledgeable about the business and their needs are sitting at the table. While diverse perspectives should participate in a review, don't invite 25 people to the party.
  • Do designate someone to facilitate the review. Group dynamics, politics, and the design challenges themselves will drive whether the facilitator should be a neutral resource or involved party. Regardless, their role is to keep the team on track to achieving a common goal. Effective facilitators need the right mix of intelligence, enthusiasm, confidence, empathy, flexibility, assertiveness, and the list goes on. Don't forget a sense of humor.
  • Do agree upon the scope of the review (e.g., dimensional models focused on several tightly coupled business processes). Ancillary topics will inevitably arise during the review, but agreeing in advance on the scope makes it easier to stay focused on the task at hand.
  • Do block off time on everyone's calendar. We typically conduct dimensional model reviews as a focused 2-day effort. The entire review team needs to be present for the full two days. Don't allow players to float in and out to accommodate other commitments. Design reviews require undivided attention; it's disruptive when participants leave intermittently.
  • Do reserve the right space. The same conference room should be blocked for the full two days. Optimally, the room has a large white board; it's especially helpful if the white board drawings can be saved or printed. If a white board is unavailable, have flip charts on hand. Don't forget markers and tape; drinks and food never hurt.
  • Do assign homework. For example, ask everyone involved to make a list of their top five concerns, problem areas, or opportunities for improvement with the existing design. Encourage participants to use complete sentences when making their list so it's meaningful to others. These lists should be emailed to the facilitator in advance of the design review for consolidation. Soliciting advance input gets people engaged and helps avoid “group think” during the review.

During the design review…

  • Do check attitudes at the door. While it's easier said than done, don't be defensive about prior design decisions. Do embark on the review thinking change is possible; don't go in resigned to believing nothing can be done to improve the situation.
  • Do check laptops and smartphones at the door (at least figuratively), unless needed to support the review process. Allowing participants to check email during the sessions is no different than having them leave to attend an alternative meeting.
  • Do exhibit strong facilitation skills. Review ground rules. Ensure that everyone is participating and communicating openly. Do keep the group on track; ban side conversations and table discussions that are out of scope or spiral into the death zone. There are tomes written on facilitation best practices, so we won't go into detail here.
  • Do ensure a common understanding of the current model before delving into potential improvements. Don't presume everyone around the table already has a comprehensive perspective. It may be worthwhile to dedicate the first hour to walking through the current design and reviewing objectives. Don't be afraid to restate the obvious.
  • Do designate someone to act as scribe, taking copious notes about both the discussions and decisions being made.
  • Do start with the big picture. Just as when you're designing from a blank slate, begin with the bus matrix, then focus on a single high priority business process, starting with the granularity then moving out to the corresponding dimensions. Follow this same “peeling back the layers of the onion” method with your design review, starting with the fact table, then tackling dimension-related issues. Do undertake the meatiest issues first; don't defer the tough stuff to the afternoon of the second day.
  • Do remind everyone that business acceptance is the ultimate measure of DW/BI success; the review should focus on improving the business users' experience.
  • Do sketch out sample rows with data values during the review sessions to ensure everyone has a common understanding of the recommended changes.
  • Do close the meeting with a recap; don't let participants leave the room without clear expectations about their assignments and due dates. Do establish a time for the next follow-up.

Following the design review…

  • Do anticipate that some open issues will remain after the 2-day review. Commit to wrestling these issues to the ground, even though this can be challenging without an authoritative party involved. Don't fall victim to analysis paralysis.
  • Don't let your hard work gather dust. Do evaluate the cost/benefit for the potential improvements; some changes will be more painless (or painful) than others. Then develop action plans for implementing the improvements.
  • Do anticipate similar reviews in the future. Plan to reevaluate every 12 to 24 months. Do try to view inevitable changes to your design as signs of success, rather than failure.

Good luck with your review!

7.11 Fistful of Flaws

Margy Ross, Intelligent Enterprise, Oct 10, 2003

People often engage the Kimball Group to conduct dimensional model design reviews. In this article, we provide a laundry list of common design flaws to scout for when performing a review. We encourage you to use this list to critically review your own schemas in search of potential improvement opportunities.

What's the Grain?

When a data warehouse team proudly unrolls its draft dimensional modeling masterpiece, one of our first questions is “What's the grain of the fact table?” We need to know the specific level of detail captured in the fact table. Surprisingly, we often get inconsistent answers to this inquiry. Declaring a clear and concise definition of the fact table grain is critical to a productive modeling effort. Without agreement, the design team and business liaisons will spin in circles.

For maximum flexibility and extensibility, you should build your fact table at the lowest level of granularity possible. You can always roll up the granular details. On the other hand, there's no way to drill down into the details if you've only loaded pre-aggregated, summary information. Obviously, the lowest level of granularity depends on the business process being modeled.

Mixed Grain or Textual Facts?

Once you have established the fact table granularity, identify the facts that are consistent with this grain. If some facts are line item level metrics, while others exist only at the header level, you must either allocate the header facts to the line item grain or create a separate fact table with the granularity of one row per header.

Fact tables typically consist of foreign keys plus numeric counts and amounts measuring business performance. Optimally, the facts are additive, meaning they can be summed across any dimension. In an effort to improve performance or reduce query complexity, aggregated facts such as year-to-date totals sometimes sneak into the fact row. These totals are dangerous because they aren't perfectly additive. A year-to-date total may reduce the complexity and run time of a few queries, but having it in the fact table invites other queries to double count the year-to-date column (or worse) when more than one date is included.

You should also prohibit text fields, including cryptic indicators and flags, from entering the fact table. They almost always take up more space in the fact table than a surrogate key. More important, business users generally want to query, constrain, and report against these text fields. You can provide quicker responses and more flexible access by handling these textual values in a dimension table, along with additional descriptive roll-up attributes often associated with the indicators or flags.

Dimension Descriptors and Decodes?

Any identifiers and codes in the dimension tables should be accompanied by descriptive decodes. It's time for us to dismiss the misperception that business users prefer to work with codes. If you need to convince yourself, just stroll down to their offices to see the decode listings filling their bulletin boards or lining their monitors. Adding descriptive names makes the data more legible to business users. If deemed appropriate by the business, operational codes can accompany the descriptors as dimension attributes, but they shouldn't be the dimension table primary keys.

Design teams sometimes opt to embed complex filtering or labeling logic in the data access application rather than supporting it via a dimension table. Although query and reporting tools may let you decode within the application, we recommend that decodes be stored as data instead. Applications should be data-driven in order to minimize the impact of decode additions and changes. Placing decodes in the dimensions ensures greater report labeling consistency.

Handling of Hierarchies?

Each dimension associated with a fact table should take on a single value with each fact row. Similarly, each dimension attribute should take on one value for each dimension row. If the attributes have a many-to-one relationship, then this hierarchical relationship can be represented within a single dimension. You generally should look for opportunities to collapse dimension hierarchies whenever possible, except in the case of really large dimensions with highly volatile attribute changes. It isn't uncommon to represent multiple hierarchies in a single dimension.

Designers sometimes attempt to deal with the dimension hierarchies within the fact table. For example, rather than having a single foreign key to the product dimension, they include fact table foreign keys for the key elements of the product hierarchy, such as brand, category, and department. Before you know it, a compact fact table turns into an unruly monster, joining to dozens of dimension tables. This example is a severe case of having “too many dimensions.” If the fact table has more than 20 foreign keys, you should look for opportunities to combine or collapse them into dimensions.

In general, we discourage snowflaking, or normalizing, dimension tables. Snowflaking may reduce the disk space needed for dimension tables, but the savings are usually insignificant when compared with the entire data warehouse storage needs and seldom offset the disadvantages in ease of use or query performance.

Outriggers are a variation of the snowflake theme. Rather than normalizing the entire dimension, a cluster of relatively low cardinality or frequently reused attributes is placed in an outrigger joined to the dimension. In most cases, dimensions should be a single join away from the fact table. Be careful to avoid abusing the outrigger technique; outriggers should be the exception rather than the rule. Similarly, if your design is riddled with bridge tables to capture many valued dimension relationships, you need to go back to the drawing board. Chances are that you have an issue with the fact table's granularity.

Explicit Date Dimension?

Every fact table should have at least one foreign key to an explicit date dimension. The SQL date function doesn't support date attributes such as fiscal periods, seasons, and holidays. Rather than trying to determine these nonstandard calendar calculations in a query, you should store them in a date dimension table.

Designers sometimes avoid a date dimension altogether by representing a series of monthly buckets of facts on a single fact table row. These fixed time slots often become an access and maintenance nightmare; the recurring time buckets should be presented as separate rows in the fact table instead.

Control Numbers as Degenerate Dimensions?

In transaction-oriented fact tables, treat the operational control numbers (such as the purchase order or invoice number) as degenerate dimensions. They reside as dimension keys on the fact table, but don't join to a corresponding dimension table.

Teams are sometimes tempted to create a dimension table with information from the operational header record, such as the transaction number, transaction date, transaction type, or transaction terms. In many cases, you'd end up with a dimension table that has nearly as many rows as the fact table. A dimension table growing at nearly the same pace as the fact table is a warning sign that a degenerate dimension may be lurking within it.

Surrogate Keys?

Instead of relying on operational keys or identifiers, you should use meaningless surrogate keys for all the dimension primary keys and fact table foreign keys. The administrative overhead to manage surrogate keys is minor, while the benefits are multifold. They isolate the warehouse from operational changes (such as recycling closed account numbers), while letting the warehouse handle “not applicable” or “to be determined” conditions. Because surrogate keys are typically 4-byte integers, performance is improved due to the smaller fact keys, smaller fact tables, and smaller indices.

Surrogate keys also let you integrate data with multiple operational keys from multiple sources. Finally, they are required to support the dominant technique for tracking changes to dimension table attributes.

Slowly Changing Dimension Strategies?

Your dimensional design isn't complete until you have identified a slowly changing dimension strategy for each dimension attribute. You may opt to overwrite the column, add a new row, add a new column, or even add a new dimension to track changes.

It's important that the strategy, or combination of strategies, is well thought out before development occurs.

Well Understood Business Requirements?

Not to sound like a broken record, but there's no way to effectively conduct a design review without first having a solid understanding of your business requirements. You need to be keenly aware of both the business requirements and data realities to review a dimensional model with any sense of confidence. Business subject matter experts or liaisons are typically excellent guides along this path; you can't expect to take any shortcuts.

7.12 Rating Your Dimensional Data Warehouse

Ralph Kimball, Intelligent Enterprise, Apr 28, 2000 and May 15, 2000

This content was originally published as two consecutive articles.

Over the past two decades, data warehouses have evolved their own design techniques distinct from transaction processing systems. Dimensional design has emerged as the dominant theme for most of our data warehouses. For some years we have had a fairly stable vocabulary that includes slowly changing dimensions, surrogate keys, aggregate navigation, and conformed dimensions and facts. Yet in spite of the growing awareness of this body of practice, we still don't have good metrics for what makes a system more dimensional or less dimensional.

This article attempts to fill this gap. I'll propose 20 criteria for what makes a data warehouse system dimensional. Besides naming each of the 20 criteria, I will define the criteria in decisive ways that let you decide whether your system complies. I want you to assign a 0 (non-compliant) or a 1 (compliant) to each criterion, and then add up the zeros and ones. Your data warehouse system should then measure somewhere between a score of zero, representing a system completely unsupportive of a dimensional approach, and 20, representing a system as completely supportive as I can imagine.

Architecture Criteria

The architecture criteria are fundamental characteristics of the overall system that are not only “features” but are central to the whole way the system is organized. Architectural criteria usually extend from the back room, through the DBMS, all the way to the front room and the user's desktop.

  1. Explicit declaration. The system provides explicit database declarations that distinguish a dimensional entity from a measurement (fact) entity. These declarations are stored in the system metadata. The declarations are visible to administrators and users and affect query strategy, query performance, grouping logic, and physical storage. Facts can be declared as fully additive, semi-additive, and nonadditive. Default (automatic) aggregation techniques other than summation can be associated with facts. The default association between dimensions and facts is declared in the metadata so that the user can omit specifying the link between them. A dimension attribute included in a query is automatically the basis of a dynamic aggregation. A fact included in a query is by default summed within the context of all aggregations. Semi-additive facts and nonadditive facts are prohibited from being summed across the wrong dimensions.
  2. Conformed dimensions and facts. The system uses conformed dimensions and facts to implement drill-across queries where answer sets from different databases, different locations, and possibly different technologies can be combined into a higher level answer set by matching on the row headers supplied by the conformed dimensions. The system detects and warns against the attempted uses of unconformed facts. This is the most fundamental and profound architecture criterion. It is the basis for implementing distributed data warehouses.
  3. Dimensional integrity. The system guarantees that the dimensions and the facts maintain referential integrity. In particular, a fact may not exist unless it is in a valid framework of all its dimensions. However, a dimensional entry may exist without any corresponding facts.
  4. Open aggregate navigation. The system uses physically stored aggregates as a way to enhance performance of common queries. These aggregates, like indexes, are chosen silently by the database if they are physically present. Users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate. All query processes accessing the data, even those from different application vendors, realize the full benefit of aggregate navigation.
  5. Dimensional symmetry. All dimensions allow comparison calculations that constrain two or more disjoint values of a single attribute from a dimension in computations such as ratios or differences. Also, the underlying database engine supports an indexing scheme that allows a single indexing strategy to efficiently support query constraints on an arbitrary and unpredictable subset of the dimensions in a highly dimensional database.
  6. Dimensional scalability. The system places no fundamental constraints on either the number of members or the number of attributes within a single dimension. Dimensions with 100 million members or 1,000 textual attributes are practical. Dimensions with a billion members are possible.
  7. Sparsity tolerance. Any single measurement can exist within a space of many dimensions. Such a space can be viewed as extraordinarily sparse. The system imposes no practical limit on the degree of sparsity. A 20-dimension fact table, each of whose dimensions has a million or more members, is practical.

Administration Criteria

Administration criteria are certainly more tactical than architectural criteria, but have been chosen for this list because they are showstoppers if they are missing from a dimensional data warehouse. Administration criteria generally affect IT personnel who are building and maintaining the data warehouse.

  1.  Graceful modification. The system must allow the following modifications to be made in place without dropping or reloading the primary database:
    • Adding an attribute to a dimension
    • Adding a new kind of fact to a measurement set, possibly beginning at a specific point in time
    • Adding a whole new dimension to a set of existing measurements
    • Splitting an existing dimension into two or more new dimensions
  2.  Dimensional replication. The system supports the explicit replication of a conformed dimension outward from a dimension authority to all the relevant dimensional models, in such a way that you can perform drill-across queries on business process subject areas if they have consistent versions of the dimensions. Aggregates that are affected by changes to the content of a dimension are automatically taken offline until you can make them consistent with the revised dimension and atomic fact table.
  3.  Changed dimension notification. The system delivers upon request all the records from a production source of a dimension that have changed since the last such request. In addition, a reason code is supplied with this dimension notification that allows the data warehouse to distinguish between type 1 and type 3 slowly changing dimensions (overwrites) and type 2 slowly changing dimensions (true physical changes at a point in time).
  4.  Surrogate key administration. The system implements a surrogate key pipeline process for: a) assigning new keys when the system encounters a type 2 slowly changing dimension; and b) replacing the natural keys in a fact table record with the correct surrogate keys before loading into the fact table. In other words, the cardinality of a dimension can be made independent from the definition of the original production key. Surrogate keys, by definition, must have no semantics or ordering that makes their individual values relevant to an application. Surrogate keys must support not applicable, nonexistent, and corrupted measurement data. A surrogate key may not be visible to a user application.
  5.  International consistency. The system supports the administration of international language versions of dimensions by guaranteeing that a translated dimension possesses the same grouping cardinality as the original dimension. The system supports the UNICODE character set, as well as all common international numerical punctuation and formatting alternatives. Incompatible, language-specific collating sequences are allowed.

Expression Criteria

The final eight expression criteria are common analytic capabilities needed in most real life data warehouse situations. The business user community experiences all expression criteria directly. These expression criteria for dimensional systems are not the only features users look for in a data warehouse, but they are all capabilities needed to exploit the power of a dimensional system.

  1.  Multiple dimension hierarchies. The system allows a single dimension to contain multiple independent hierarchies. No practical limit exists to the number of hierarchies in a single dimension. Hierarchies may be complete (encompassing all the members of a dimension) or partial (encompassing only a select subset of the members of a dimension). Two hierarchies may have common levels or common attributes (fields), and may have different numbers of levels. Two hierarchies may also share one or more common levels, but otherwise have no correlation.
  2. Ragged dimension hierarchies. The system allows dimension hierarchies of indeterminate depth, such as organization charts and parts explosions, where records in the dimension can play the roles of parents as well as children. Using this terminology, a parent may have any number of children, and these children may have other children, to an arbitrary depth limited only by the number of records in the dimension. A child may have multiple parents, where these parents' “total ownership” of the child is explicitly represented and adds up to 100 percent. With a single command the system must be able to summarize a numeric measure from a fact table (or cube) on a ragged hierarchy for all members:

    • Starting with a specified parent and descending to all the lowest possible levels summarizing all intermediate levels
    • Starting with a specified parent and summarizing only children exactly N levels down from the parent or N levels up from the lowest child of any branch of the hierarchy, where N is equal to or greater than zero
    • Starting with a specified child and summarizing all the parents from that child to the supreme parent in that child's hierarchy
    • Starting with a specified child and summarizing all the parents exactly N levels upward in the hierarchy from that child.
    • Starting with a specified child and summarizing only that child's unique supreme parent

    A given ragged dimension hierarchy may contain an arbitrary number of independent families (independent supreme parents with no common children). Conversely, independent supreme parents may share some children, as stated when discussing total ownership.

  3.  Multiple valued dimensions. A single atomic measure in a fact table (or cube) may have multiple members from a dimension associated with that measure. If more than one member from a dimension is associated with a measure, an explicit allocation factor is provided that optionally lets the numeric measure spread across the dimension's associated members. In such a case, the allocation factors for a given atomic measure and a given multiple-valued dimension must add up to 100 percent.
  4.  Slowly changing dimensions. The system must explicitly support the three basic types of slowly changing dimensions: type 1, where a changed dimension attribute is overwritten; type 2, where a changed dimension attribute causes a new dimension member to be created; and type 3, where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension member record. Support for slowly changing dimensions must be system wide, as the following requirements imply:
    • Changes to a dimension that invalidate any physically stored aggregate must automatically disqualify that aggregate from use.
    • A type 2 change must trigger the automatic assignment of a new surrogate key for the new dimension member, and that key must apply for all concurrent fact records loaded into the system. In other words, the creation of a new type 2 dimension member must automatically link to the associated concurrent facts without the user or application developer needing to track beginning and ending effective dates.
    • If the system supports ragged hierarchy dimensions and/or multiple-valued dimensions, then these types of dimensions must support all three types of slowly changing dimensions.
  5.  Multiple dimension roles. A single dimension must be associated with a set of facts via multiple roles. For instance, a set of facts may have several independent time stamps that you can simultaneously apply to the facts. In this case, a single underlying time dimension must be able to attach to these facts multiple times, where each instance is semantically independent. A given set of facts may have several different kinds of dimensions, each playing multiple roles.
  6.  Hot-swappable dimensions. The system must allow an alternate instance of a dimension to swap in at query time. For example, if two clients of an investment firm wish to view the same stock market data through their own proprietary “stock ticker” dimensions, then the two clients must be able to use their versions of the dimension at query time, without requiring the fundamental fact table (or cube) of stock market facts to be duplicated. Another example of this capability would let a bank attach an extended account dimension to a specific query if the user restricts the query to a cluster of accounts of the same type.
  7.  On-the-fly fact range dimensions. The system provides direct support for dynamic value banding queries on numeric measures in a fact table (or cube). In other words, at query time the user can specify a set of value ranges and use these ranges as the grouping criteria in a query. All the normal summarizing functions (count, sum, min, max, and average) can apply within each group. The sizes of the value bands needn't be equal.
  8.  On-the-fly behavior dimensions. The system supports constraining a dimension via a simple list of that dimension. For the sake of vocabulary, this list of members is called a “behavior dimension.” The support of behavior dimensions must be system wide, as the following requirements imply:
    • A behavior dimension can be captured from a report showing on the user's screen; from a list of keys or attributes appearing in a file extracted from a production source; directly from a constraint specification; or from a union, intersection, or set difference of other behavior dimensions.
    • A user may have a library of many behavior dimensions and can attach a behavior dimension to a fact table (or cube) at query time.
    • The use of a behavior dimension in a query restricts the fact table (or cube) to the members in the study, but in no way otherwise limits the ability to select and constrain attributes of any regular dimension, including the one the behavior dimension affects directly.
    • A behavior dimension may be of unlimited size.
    • A behavior dimension may have an optional date stamp associated with each element of the list in such a way that two behavior dimensions can be merged where membership in the combined behavior dimension requires a specific time ordering.

Are You Dimensional?

A system that supported most or all these dimensional criteria would be adaptable, easier to administer, and able to address many real-world analytic challenges. The whole point of dimensional systems is that they are business issue and business user driven. I urge you to apply these criteria against your data warehouse to see how it does.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset