Time-Stamped Dimensions
If there is any uncertainty about requirements for historic data, the most common response to changes in source data is the type 2 slowly changing dimension.It is the safe choice because it preserves the association of historic dimension values with facts that have been recorded in fact tables. No information is discarded.
The type 2 response has one glaring shortcoming: it cannot tell you what the dimension looked like at any point in time. This is a particular concern if you have a dimensional data warehouse architecture or stand-alone data mart. In these architectures, the dimensional model doubles as the integrated repository of granular data. In a Corporate Information Factory, maintaining a full history in the dimensional data mart is a lesser concern since there is also an enterprise data warehouse repository to hold it. Although not configured for direct access, at least the information is not thrown away.
An additional fact table can come to the rescue and be used to track the history of changes to the dimension. This fact table has the odd characteristic of having exactly the same number of rows as the dimension table, but it does the job. Many designers instinctively gravitate to a more flexible alternative: supplementing the type 2 response with time stamps. The time-stamped dimension permits three forms of point-in-time analysis within the dimension table itself:
- Easily order a chronological history of changes
- Quickly select dimension rows that were in effect for a particular date
- Easily identify the dimension rows currently in effect
The time-stamped dimension has an unusual property. Joined to a fact table, it behaves like any other dimension table. Used on its own, it also exhibits some of the characteristics of a fact table. The time-stamped approach can also be tremendously useful to ETL developers charged with loading data into fact tables.
Point-in-Time Status of a Dimension
Often, one or more dimension tables in the data warehouse represent closely watched entities. The history of attribute values is significant and is often monitored irrespective of any associated transactions. Documents, contracts, customers, and even employees may be subjected to this deep scrutiny. When it is necessary to support point-in-time analysis within a dimension table, type 2 changes alone will not do the job.
Type 2 Not Sufficient
A type 2 slowly changing dimension preserves the history of values of an attribute and allows each fact to be associated with the correct version. Although this preserves the history of facts, it is not sufficient to provide for point-in-time analysis. What version was current on a particular date? Unless a fact exists for the date in question, it is impossible to know. This is best understood via an example. The star in below has a policy dimension. The table includes numerous attributes describing significant characteristics of a health insurance policy.
The history of these attributes is carefully followed by the business, so they are designated as type 2. The policy dimension table is associated with a fact table that tracks policy payments that have been made by the policy holder.
The slow changes that accumulate for one particular policy are illustrated in the lower part of the figure. Policy number 40111 is held by someone named Hal Smith and apparently has been active for quite some time. You can see that the policy has undergone several changes. Initially, Hal was single and his policy covered himself alone. Later, he married, but coverage was not added for his spouse. Subsequently, his spouse did become covered, and still later coverage was added for a child. When coverage was added for a second child, you can see that Hal also increased his deductible.
The insurance company needs to be able to understand what each policy looked like at any given point in time. For example, users might want to know how many policy holders were married versus how many were single on a particular date, or what the total number of covered parties was at the close of a fiscal period. Policy payments are completely irrelevant to this analysis.
Unfortunately, the design in Figure above is not able to answer these questions. Although the dimension table records all the changes to policies, it does not associate them with specific time periods. Was Hal married on November 1, 2005? The dimension table tells us that at different times he has been single and married, but not when. Unless there happens to be a row recorded for the policy in payment_facts for November 1, 2005, there is no way to know what the policy looked like on that date.
Tracking Change History Through a Fact Table
Point-in-time analysis of a closely watched dimension can be supported by creating a fact table expressly for the purpose. A row is recorded in the fact table each time the dimension changes. Each row in this fact table contains a foreign key identifying the new row in the dimension table, and one identifying the date it became effective. An additional foreign key can be maintained to indicate the date on which the row expired, which will help produce point-in-time analysis. If each change occurs for a particular reason, this may be captured via an additional dimension. An example appears in Figure below.
The fact table in the figure, policy_change_facts, logs changes to the policy dimension. Its grain is one row for each policy change. Each row contains a policy_key representing the changed policy. A transaction type dimension contains reference information indicating the reason for the change—a new policy, a policy change, or policy cancellation. Two keys refer to the day dimension. Day_key_effective indicates the day on which the policy change went into effect. Day_key_expired will be used to indicate when it was superseded by a new version.
You may notice that this fact table does not contain any facts. That is okay; it is still useful. For example, it can be used to count the number of policies in effect on a particular date with married versus single policy holders. Fact tables like this are known as factless fact tables.
The dates associated with each row merit some additional scrutiny. As noted, the fact table includes a day_key_effective, indicating when the change went into effect, as well as a day_key_expired, indicating when the change was superseded by a new version. It is useful to avoid overlapping dates, in order to avoid any confusion when building queries. For each row in the fact table, the effective and expiration dates are inclusive. If a policy changes today, today’s date is the effective date for the new version of the policy. Yesterday’s date is the expiration date for the previous version. The pair of dates can be used to determine the policy’s status at any point in time. Add one filter looking for an effective date that is before or equal to the date in question, and another for an expiration date that is greater than or equal to the date in question.
The current version of a policy has not expired, and it has no expiration date. Instead of recording a NULL key, it refers to a special row in the day dimension. This row typically contains the largest date value supported by the relational database management system (RDBMS), such as 12/31/9999. This technique avoids the need to test for null values when building queries. When the row does expire, the date will be replaced with the actual expiration date.
It is also necessary to look a bit more closely at the grain of the fact table, described earlier as “one row for each policy change.” It is important to document what exactly a “policy change” is. If Hal Smith’s family size and number of children change on the same day, is that one change or two? In the case of policy changes, it is likely to represent a set of changes that are logged in a source system as part of a single transaction. In other cases, it may be necessary to log each change individually. This may require the ability to store multiple changes for a single day. To support this, you can add a time dimension and a pair of time_keys to supplement the day_keys. The time_keys refer to the time of day the record became effective (time_key_effective) and the time of day it expired (time_key_expired.)
The policy_change_facts star effectively captures the history of changes to the policy dimension, so that people can identify what policies looked like at any particular point in time. This information can now be accessed, even for days when there are no payments. You may have noticed something peculiar about this fact table. Since it contains a row for each change to a policy, it has the same number of rows as the policy dimension itself. This is not necessarily a bad thing, but it suggests that there may be a more effective way to gather the same information.
A Time-Stamped Dimension in Action
A time-stamped version of the policy dimension table is illustrated in below Figure. The first block of attributes contains the usual dimension columns: a surrogate key, the natural key, and a set of dimension attributes. Four additional columns have been added: transaction_type, effective_date, expiration_date, and most_recent_version.
To the right of the table diagram, a grid illustrates the rows recorded in this table for policy 40111, which belongs to our old friend Hal Smith. The first row shows that Hal’s policy went into effect on February 14, 2005. At the time, he was single. The policy remained in this state through February 11, 2006. On the next day, February 12, a policy change caused the next row to become effective. This row updated the policy to show that Hal married, with his family size increasing from one family member to two. The row shows that there was only one covered party, which means that Hal’s spouse was not covered by the policy. Additional changes to the policy are reflected in the subsequent rows. Study the table closely, and you will see that the effective_date and expiration_date for each changed row line up closely; there are no gaps.
The last row in the illustration shows the current status of Hal Smith’s policy. Since it has not expired, the expiration_date is set to 12/31/9999. This date has been specifically designated for use when the policy has not expired. When we are querying the table, this value allows us to avoid the additional SQL syntax that would be necessitated if a NULL had been used.
To filter a query for policy versions that were in effect on 12/31/2006, for example, we only need to add the following to the query predicate:
WHERE
12/31/2006 >= effective_date AND
12/31/2006 <= expiration_date
The most_recent_version flag is added as a convenience, allowing browse queries to filter a dimension table for current records only. In keeping with guidelines from Chapter 3, this column contains descriptive text rather than Boolean values or “Y” and “N.” To filter the dimension for the current policy versions, a simple constraint is added:
WHERE most_recent_version = "Current"
It is also quite simple to single out a single policy and produce an ordered list of changes:
WHERE policy_number = 40111
ORDER_BY effective_date
TIP:A time-stamped dimension adds non-overlapping effective and expiration dates for each row, allowing each type 2 change to be localized to a particular point in time. The effective date can be used to order a transaction history; the effective and expiration dates allow filtering for a specific point in time; a most_recent_version column can simplify filtering for current status.
This technique can be elaborated upon for situations where multiple discrete changes must be captured during a single day. While it represents additional work to maintain a time-stamped dimension, it pays dividends when it comes to maintaining associated fact tables.
Multiple Changes on a Single Day
The policy example only allows for one version of a policy on a given day. If it is necessary to track changes at a finer grain, this can be achieved by adding columns to capture the time of day at which the record became effective and expired: effective_time and expiration_time. Depending on the business case, these columns may represent hours (24 possible values), minutes (1,440 values), or seconds (8,640 possible values). Together with the effective_date and expiration_date, they allow for multiple versions within a given day.
As with the day columns in the prior example, the time columns must allow for no overlap and leave no gaps. For policy changes recorded at the minute level, for example, if one row becomes effective at 12:48 PM, the prior row must expire at 12:47 PM. With the time columns in place, it is possible to have multiple versions of a policy on a given day. This means that qualifying a query for a particular date may pick up multiple versions of the same policy. This is not desirable if, for example, you are counting the number of covered parties across all policies on a particular date. You must now supplement the date qualifications with time qualifications, as in:
WHERE
12/31/2006 >= effective_date AND
12/31/2006 <= expiration_date AND
24:00 >= effective_time AND
24:00 <= expiration_time
You can add a last_change_of_day flag to simplify this kind of query. It will be set to “Final” for the last change to a given policy on a day. End-of-day status for a particular date can now be captured by using this SQL:
WHERE
12/31/2006 >= effective_date AND
12/31/2006 <= expiration_date AND
last_change_of_day = "Final"
Time-Stamped Dimensions and the ETL Process
Construction of a time-stamped dimension places an extra burden on the developers of the process that loads and maintains the dimension table. When a type 2 change occurs, it will be necessary to identify the prior row to update its expiration_date and most_recent_version columns. Extract, transform, load (ETL) developers may also be charged with grouping a set of changes that occur on a single day into a single time-stamped dimension record. This additional work is not trivial, but ETL developers will find benefits in other areas.
When loading transactions into a fact table, such as above policy_payment_facts, the ETL process must determine what foreign key to use for each dimension. If a payment is recorded for policy 40111, for example, which policy_key is used in the fact table? Without time stamps, it would be necessary to compare the policy characteristics in the source system with the type 2 attributes in the dimension table. If a payment comes in for policy 40111, with a family size of four and a deductible amount of 500, for example, then surrogate key 14922 should be used.
When dimension rows are time-stamped, this work is greatly simplified. To choose the correct surrogate key value, the ETL process need only know the natural key and the date of the transaction. This date can be compared to the effective_date and expiration_date columns in the dimension table to identify the correct key value for use in the fact table. This characteristic of a time-stamped dimension is particularly valuable when fact tables are implemented incrementally. Suppose, for example, that a year after the policy payment star is implemented, another star is added to track claims against policies. Like the policy payments star, policy_claim_facts will involve the policy dimension table. If each row in the policy dimension has been time-stamped, it will be easy to load the last year’s worth of claims into the new fact table. For each historic claim, the date of the transaction is used to single out the appropriate row in the policy table. Without these time stamps, it would be necessary for the source system to supply not only the date of each claim but also a complete picture of the policy, including the policy_number and each of the type 2 attributes. This set of values would be used to identify the associated row in the policy table.
It is not necessary to time-stamp every dimension. The additional work may not be necessary if point-in-time analysis is not required. Dimensions that contain reference data, such as the transaction_type table in above figure, do not require time stamps. For core dimensions that conform across multiple subject areas, however, time-stamping is a useful enhancement.
Dimension and Fact
As with any other dimension table, the time-stamped dimension can be joined to a fact table to analyze facts. From above figure, for example, can be joined to a policy_payment_facts table to study payments in the usual way. Need to know what were the total January payments on policies covering two or more people? Join the policy dimension to the fact table and constrain on covered_parties. Want to compare the payment habits of single versus married policy holders? Group the query results by marital_status.
When analyzing the dimension table without a fact table, something interesting happens. Some of the dimension attributes take on the characteristics of facts, exhibiting the property of additivity. For example, the following query captures the number of covered parties by state on December 31, 2006:
SELECT
state,
sum(covered_parties)
FROM
policy
WHERE
12/31/2006 >= effective_date AND
12/31/2006 <= expiration_date
GROUP BY
state
In this query, the dimension column covered_parties is behaving as a fact. Some people like to call it a hybrid attribute since it can be used either as a dimension or as a fact. This is a common phenomenon in time-stamped dimensions.
参考至:《Star Schema The Complete Reference》
本文原创,转载请注明出处、作者
如有错误,欢迎指正