必须把所有的细节数据都装载到维结构中来,也许有些数据在当时看来并没有太大用处,也许用户可能只关心聚集了之后的数据,但是谁也不能保证用户的需求不会增长或者变化。
通常情况下,一个事实表对于一个业务流程,比如说,给用户开账单,可能涉及到时间,地理,产品类型多个维度,涉及到一个事实表:金额表。另外,除了简单事实表以外,可能在细节数据水平上,有多个业务流程使用统一的一个事实表,统一的事实表是简单事实表的一个补充,而不是它的替代品。
确保每一个事实表都有一个与之关联的时间维表,有时候还可能有多个时间维对应到一个事实表,比如说重要事件的时间表,重大的节日等等。
确保在一个单一事实表中的所有事实都具有相同的粒度或细节水平
比如说,你把按照月份统一以后的数据和按天统计的数据放在一个事实表中,这是不对的。
避免多对多关系出现在事实表中,这样损坏事实表本身的粒度。由于事实表和维表之间本身就是多对多关系,如果在事实表中有多对多关系要记录,请把这些关系用事实表和维表的维度关系来表示。
避免多对一的关系出现在维表中。比如说:在学生表中,有班主任属性,学生和班主任属性是多对一的关系,当在建立学生维时,这样的关系就应该分解,产生雪花模型,将班主任属性提取到比较小的子维去。
但是有时候,多对一关系可以出现在事实表中,比如说维表里的数据很多,而且维表的ROLL-UP属性经常发生变化的时候,可以考虑把这种多对一关系到事实表中处理。
尽量避免大块的描述信息出现在事实表中,比如说某个字段在report中显示的label,如果是必不可少的描述信息,请在维表中添加某个属性来记录。
确保每一个维表(除了时间维)是使用一个代理的主键,也就是没有业务意义的主键,比如sequence自增。
首先建立一些标准的维来集成企业数据,比如说:制造企业的产品维,这个维几乎在各个数据集市中都需要用到,先把这些维建立起来,可以减少以后的冗余设计。
在维度建模的过程中,不断的做到需求和实现之间的平衡,使得你的设计能能够不断的适应用户不断变化的需求。
Rule #1: Load detailed atomic data into dimensional structures.
Dimensional models should be populated with bedrock atomic details to support the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single record at a time, but you can't predict the somewhat arbitrary ways they'll want to screen and roll up the details. If only summarized data is available, then you've already made assumptions about data usage patterns that will cause users to run into a brick wall when they want to dig deeper into the details. Of course, atomic details can be complemented by summary dimensional models that provide performance advantages for common queries of aggregated data, but business users cannot live on summary data alone; they need the gory details to answer their ever-changing questions.
Rule #2: Structure dimensional models around business processes.
Business processes are the activities performed by your organization; they represent measurement events, like taking an order or billing a customer. Business processes typically capture or generate unique performance metrics associated with each event. These metrics translate into facts, with each business process represented by a single atomic fact table. In addition to single process fact tables, consolidated fact tables are sometimes created that combine metrics from multiple processes into one fact table at a common level of detail. Again, consolidated fact tables are a complement to the detailed single-process fact tables, not a substitute for them.
Rule #3: Ensure that every fact table has an associated date dimension table.
The measurement events described in Rule #2 always have a date stamp of some variety associated with them, whether it's a monthly balance snapshot or a monetary transfer captured to the hundredth of a second. Every fact table should have at least one foreign key to an associated date dimension table, whose grain is a single day, with calendar attributes and nonstandard characteristics about the measurement event date, such as the fiscal month and corporate holiday indicator. Sometimes multiple date foreign keys are represented in a fact table.
Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail.
There are three fundamental grains to categorize all fact tables: transactional, periodic snapshot, or accumulating snapshot. Regardless of its grain type, every measurement within a fact table must be at the exact same level of detail. When you mix facts representing multiple levels of granularity in the same fact table, you are setting yourself up for business user confusion and making the BI applications vulnerable to overstated or otherwise erroneous results.
Rule #5: Resove many-to-many relationships in fact tables.
Since a fact table stores the results of a business process event, there's inherently a many-to-many (M:M) relationship between its foreign keys, such as multiple products being sold in multiple stores on multiple days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, such as the multiple diagnoses associated with a health care encounter or multiple customers with a bank account. In these cases, it's unreasonable to resolve the many-valued dimensions directly in the fact table, as this would violate the natural grain of the measurement event. Thus, we use a many-to-many, dual-keyed bridge table in conjunction with the fact table.
Rule #6: Resolve many-to-one relationships in dimension tables.
Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you've spent most of your career designing entity-relationship models for transaction processing systems, you'll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
Rule #7: Store report labels and filter domain values in dimension tables.
The codes and, more importantly, associated decodes and descriptors used for labeling and query filtering should be captured in dimension tables. Avoid storing cryptic code fields or bulky descriptive fields in the fact table itself; likewise, don't just store the code in the dimension table and assume that users don't need descriptive decodes or that they'll be handled in the BI application. If it's a row/column label or pull-down menu filter, then it should be handled as a dimension attribute.
Though we stated in Rule #5 that fact table foreign keys should never be null, it's also advisable to avoid nulls in the dimension tables' attribute fields by replacing the null value with "NA" (not applicable) or another default value, determined by the data steward, to reduce user confusion if possible.
Rule #8: Make certain that dimension tables use a surrogate key.
Meaningless, sequentially assigned surrogate keys (except for the date dimension, where chronologically assigned and even more meaningful keys are acceptable) deliver a number of operational benefits, including smaller keys which mean smaller fact tables, smaller indexes, and improved performance. Surrogate keys are absolutely required if you're tracking dimension attribute changes with a new dimension record for each profile change. Even if your business users don't initially visualize the value of tracking attribute changes, using surrogates will make a downstream policy change less onerous. The surrogates also allow you to map multiple operational keys to a common profile, plus buffer you from unexpected operational activities, like the recycling of an obsolete product number or acquisition of another company with its own coding schemes.
Rule #9: Create conformed dimensions to integrate data across the enterprise.
Conformed dimensions (otherwise known as common, master, standard or reference dimensions) are essential for enterprise data warehousing. Managed once in the ETL system and then reused across multiple fact tables, conformed dimensions deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes. The Enterprise Data Warehouse Bus Matrix is the key architecture blueprint for representing the organization's core business processes and associated dimensionality. Reusing conformed dimensions ultimately shortens the time-to-market by eliminating redundant design and development efforts; however, conformed dimensions require a commitment and investment in data stewardship and governance, even if you don't need everyone to agree on every dimension attribute to leverage conformity.
Rule #10: Continuously balance requirements and realities to deliver a DW/BI solution that's accepted by business users and that supports their decision-making.
Dimensional modelers must constantly straddle business user requirements along with the underlying realities of the associated source data to deliver a design that can be implemented and that, more importantly, stands a reasonable chance of business adoption. The requirements-versus-realities balancing act is a fact of life for DW/BI practitioners, whether you're focused on the dimensional model, project strategy, technical/ETL/BI architectures or deployment/maintenance plan.
If you've read our Intelligent Enterprise articles, Toolkit books or monthly Design Tips regularly, these rules shouldn't be news to you, but here we've consolidated our rules into a single rulebook that you can refer to when you are gathered to design or review your models.
Good luck!