《精通数据仓库设计》(Mastering Data Warehouse Design)中英对照——第1章


第一部分 基本概念

我们发现,理解为什么采纳某个具体的方法,能帮助我们理解这个方法的价值并加以应用。因此,这一部分的开始,我们先介绍企业信息工厂(Corporate Information Factory,CIF)这种已经被证明的、稳定的体系结构。在这种体系结构下,商业智能(BI)包含两种形式的数据存贮,每一种在BI环境下都有具体的角色。第一类数据存贮是数据仓库,它的主要角色是担当数据知识库,存贮来自不同数据源的数据,使它能被另一类数据存贮访问。另一类数据存贮就是数据集市。总的来说,设计数据仓库最有效的方法是基于实体-关系数据模型和范式技术(由 Codd 和 Date 在20世纪70、80、90年代为关系数据库创建)。

数据集市的主要角色是为企业用户提供一个容易的方法,访问优良的、集成的信息。第1章描述了几种类型的数据集市,最常用的是为联机分析处理(OLAP)创建的数据集市,OLAP最有效的设计方法是维度数据模型。

在第2章,我们继续这个基本的主题,解释最重要的关系建模技术,介绍所需要的不同类型的模型,提供建立关系模型的过程。同时,我们解释在为企业构建一个坚固的基础时,业务数据模型、系统数据模型、技术数据模型等各类数据模型之间的关系,并解释它们之间是如何互相共享或继承特性的。

1 介绍

欢迎阅读本书,这是第一本彻底描述构建一个多用途的、稳定的、可持续的,支持商业智能的数据仓库建模技术的书。这一章介绍BI及数据仓库的目标,解释他们如何组合成一个整体的企业信息工厂体系结构,讨论数据仓库建设的迭代性,论证数据仓库数据模型的重要性,以及采用这种数据模型形式的理由。我们讨论这种模型形式为什么应该基于关系设计技术,阐明是为了满足最小冗余,最大稳定性和可维护性的需要。这一章的另一节列出了可维护的数据仓库环境的特点。最后讨论这种建模方法对最终交付数据集市的影响。这一章,让读者理解后续章节的基本原理,后续章节会描述创建数据仓库模型的细节。

 

Chapter 1 Introduction

Welcome to the first book that thoroughly describes the data modeling techniques used in constructing a multipurpose, stable, and sustainable data warehouse used to support business intelligence (BI). This chapter introduces the data warehouse by describing the objectives of BI and the data warehouse and by explaining how these fit into the overall Corporate Information Factory (CIF) architecture. It discusses the iterative nature of the data warehouse construction and demonstrates the importance of the data warehouse data model and the justification for the type of data model format suggested in this book. We discuss why the format of the model should be based on relational design techniques, illustrating the need to maximize nonredundancy, stability, and maintainability. Another section of the chapter outlines the characteristics of a maintainable data warehouse environment. The chapter ends with a discussion of the impact of this modeling approach on the ultimate delivery of the data marts. This chapter sets up the reader to understand the rationale behind the ensuing chapters, which describe in detail how to create the data warehouse data model.

 

1.1商业智能概述

商业智能,在数据仓库领域,指的是一个企业研究过去的行为与活动,从而理解组织的过去,确定组织的现状,预测或者改变将来会发生的事情的能力。BI已经发展了20多年,让我们简短地回顾其中令人兴奋的、不断创新的最近10年。

Overview of Business Intelligence

BI, in the context of the data warehouse, is the ability of an enterprise to study past behaviors and actions in order to understand where the organization has been, determine its current situation, and predict or change what will happen in the future. BI has been maturing for more than 20 years. Let’s briefly go over the past decade of this fascinating and innovative history.

也许你熟悉技术采纳曲线,最早采用新技术的公司叫创新者,下一类叫作早期采纳者,然后有前半数成员、后半数成员,最后是落伍者。这个曲线是传统的钟型曲线,在开始的时候成指数增长,在后半周期市场缓慢下降。新技术一旦被引进,往往价钱昂贵且不完善,而很难应用;经过一段时间,性价比可以接受。手机(蜂窝电话)就是一个很好的例子。曾经,只有革新者(医生和律师?)带着手机,又笨重又昂贵,信号不连续,经常丢失通话。现在,你只要花60美元,随处可以拥有一个手机,且服务非常的可靠。

You’re probably familiar with the technology adoption curve. The first companies to adopt the new technology are called innovators. The next category is known as the early adopters, then there are members of the early majority, members of the late majority, and finally the laggards. The curve is a traditional bell curve, with exponential growth in the beginning and a slowdown in market growth occurring during the late majority period. When new technology is introduced, it is usually hard to get, expensive, and imperfect. Over time, its availability, cost, and features improve to the point where just about anyone can benefit from ownership. Cell phones are a good example of this. Once, only the innovators (doctors and lawyers?) carried them. The phones were big, heavy, and expensive. The service was spotty at best, and you got “dropped” a lot. Now, there are deals where you can obtain a cell phone for about $60, the service providers throw in $25 of airtime, and there are no monthly fees, and service is quite reliable.

数据仓库是这种采纳曲线另一个很好的例子。事实上,如果你还没有开始你的第一个数据仓库项目,那么没有比现在更好的开始时间了。今天的管理者期望得到(而且往往也能得到)他们所需的大部分优质、及时的信息,用来做出有依据的决策,带领企业进入下一个十年。然而,过去并非一直如此。

Data warehousing is another good example of the adoption curve. In fact, if you haven’t started your first data warehouse project, there has never been a better time. Executives today expect, and often get, most of the good, timely information they need to make informed decisions to lead their companies into the next decade. But this wasn’t always the case.

就在10年前,同样的管理者批准开发决策信息系统(Executive Information Systems,EIS)来满足他们的需要。EIS项目背后的基本理念是合理的:以及时的方式,给管理者提供容易访问的关键绩效信息。然而,很多这类系统没有实现它们的目标,主要是因为底层的体系结构不能足够快地响应企业环境的变化。早期EIS系统另一个显著的缺点是需要花费大量的精力去提供管理者所需要的数据。数据获取,即提取、转换、装载(ETL)过程,是一系列复杂的活动,它们的唯一目的是获取尽可能准确的、集成的数据,然后通过数据仓库或者操作型数据存贮(ODS)让企业访问。

Just a decade ago, these same executives sanctioned the development of executive information systems (EIS) to meet their needs. The concept behind EIS initiatives was sound—to provide executives with easily accessible key performance information in a timely manner. However, many of these systems fell short of their objectives, largely because the underlying architecture could not respond fast enough to the enterprise’s changing environment. Another significant shortcoming of the early EIS days was the enormous effort required to provide the executives with the data they desired. Data acquisition or the extract, transform, and load (ETL) process is a complex set of activities whose sole purpose is to attain the most accurate and integrated data possible and make it accessible to the enterprise through the data warehouse or operational data store (ODS).

整个过程以手工密集的活动开始:硬编码的“数据吸管”是从操作型系统获取数据、供商业分析师访问的唯一方法。这类似于早期的电话:穿着旱冰鞋的接线员必须来回奔走,手工插上正确的线缆,才能把你的电话与你呼叫的电话接通。

The entire process began as a manually intensive set of activities. Hard-coded “data suckers” were the only means of getting data out of the operational systems for access by business analysts. This is similar to the early days of telephony, when operators on skates had to connect your phone with the one you were calling by racing back and forth and manually plugging in the appropriate cords.

 

幸运的是,我们已经比那个年代前进了很多,数据仓库行业已经开发了太多的工具和技术支持数据的获取过程。现在,大多数ETL过程都已经自动化,就像今天的电话系统。同时,类似于电话的发展,这个过程保留了一些困难的,或者说本身决定的,复杂的问题。没有两个公司有同样数据获取过程,甚至不会有同样的问题。今天,大多数拥有重要数据仓库的大公司,严重依赖于 ETL工具,用于设计,构建和维护他们的BI环境。

过去十年另一个主要的改变,是引入了让“容易使用”成为现实的工具和建模技术。由 Ralph Kimball 博士等人提出的维度建模概念,在很大程度上促成了多维数据集市的广泛使用,用于支持联机分析处理(OLAP)。

Fortunately, we have come a long way from those days, and the data warehouse industry has developed a plethora of tools and technologies to support the data acquisition process. Now, progress has allowed most of this process to be automated, as it has in today’s telephony world. Also, similar to telephony advances, this process remains a difficult, if not temperamental and complicated, one. No two companies will ever have the same data acquisition activities or even the same set of problems. Today, most major corporations with significant data warehousing efforts rely heavily on their ETL tools for design, construction, and maintenance of their BI environments.

Another major change during the last decade is the introduction of tools and modeling techniques that bring the phrase “easy to use” to life. The dimensional modeling concepts developed by Dr. Ralph Kimball and others are largely responsible for the widespread use of multidimensional data marts to support online analytical processing.

 

除了多维分析,还开发了其它一些复杂的技术用于支持数据挖掘、统计分析、探索等需要。现在,一个成熟的BI环境需要比星型模式多得多:平文件、无偏数据统计子集,规范化数据结构模式等,除了星形模式,所有这些都属数据仓库必须支持的、重要的数据需求。

当然,我们不能低估互联网对数据仓库的影响。互联网消除了计算机的神秘性,管理者在日常生活中使用互联网,不再对触摸键盘心存芥蒂。终端用户工具公司认识到了互联网的影响,且大多数都利用了这一认识:它们的界面都复制了流行的互联网浏览器与搜索引擎的一些视觉特性。这些工具既强大又直观,促使商业分析师和管理者广泛使用BI。

In addition to multidimensional analyses, other sophisticated technologies have evolved to support data mining, statistical analysis, and exploration needs. Now mature BI environments require much more than star schemas— flat files, statistical subsets of unbiased data, normalized data structures, in addition to star schemas, are all significant data requirements that must be supported by your data warehouse.

Of course, we shouldn’t underestimate the impact of the Internet on data warehousing. The Internet helped remove the mystique of the computer. Executives use the Internet in their daily lives and are no longer wary of touching the keyboard. The end-user tool vendors recognized the impact of the Internet, and most of them seized upon that realization: to design their interface such that it replicated some of the look-and-feel features of the popular Internet browsers and search engines. The sophistication and simplicity of these tools has led to a widespread use of BI by business analysts and executives.

最近几年发生的另一个重要事件是:从技术追赶业务转变为业务驱动技术。在BI的早期,信息技术(IT)部门认识到了BI的价值,并努力向商业团体兜售这些价值。不幸的是,有时IT人员着手构建数据仓库,只是寄希望于商业团体会去使用它。今天,复杂的决策支持环境的价值在商业界得到广泛的认同。例如,一个有效的客户关系管理程序不能离开战略(含有相关数据集市的数据仓库)和战术(操作型数据存贮和操作型集市)的决策支持能力(见图1.1)。

Another important event taking place in the last few years is the transformation from technology chasing the business to the business demanding technology. In the early days of BI, the information technology (IT) group recognized its value and tried to sell its merits to the business community. In some unfortunate cases, the IT folks set out to build a data warehouse with the hope that the business community would use it. Today, the value of a sophisticated decision support environment is widely recognized throughout the business. As an example, an effective customer relationship management program could not exist without strategic (data warehouse with associated marts) and a tactical (operational data store and oper mart) decision-making capabilities. (See Figure 1.1)

 

BI体系结构

过去十年最重要的发展是提出了广为接受的BI体系结构,支持所有的技术需求。这种体系结构认识到EIS方法有不少重大缺陷,最严重的缺陷是EIS数据结构常常从源系统直接获取数据,导致需要非常复杂的数据获取环境,需要大量的人力和计算机资源去维护。CIF(见图1.2)体系,现在已经有大多数决策支持系统使用,通过把数据隔离成主要的5个数据库(操作型系统,数据仓库,操作型数据存贮,数据集市,操作集市)来解决这个问题,把从源系统到商业用户的数据移动过程合并为一个高效的过程。

BI Architecture

One of the most significant developments during the last 10 years has been the introduction of a widely accepted architecture to support all BI technological demands. This architecture recognized that the EIS approach had several major flaws, the most significant of which was that the EIS data structures were often fed directly from source systems, resulting in a very complex data acquisition environment that required significant human and computer resources to maintain. The Corporate Information Factory (CIF) (see Figure 1.2), the architecture used in most decision support environments today, addressed that deficiency by segregating data into five major databases (operational systems, data warehouse, operational data store, data marts, and oper marts) and incorporating processes to effectively and efficiently move data from the source systems to the business users.

(翻转90度之后的图:)

这些组件进一步分为两个主要的组。

■■“取数据入”组从操作型系统获取数据,集成,清洗并推入数据库,以方便使用。在CIF中包含如下组件:

■■操作型系统数据库(源系统)包含公司日常的商业数据,这仍然是决策支持系统最主要的数据来源。

■■ 数据仓库是集成的、包含明细的、包含历史数据的数据集合,用于支持战略决策。

■■操作型数据存贮是集成的,明细的,现在的数据集合,用于支持战术决策。

These components were further separated into two major groupings of components and processes:

■■ Getting data in consists of the processes and databases involved in acquiring data from the operational systems, integrating it, cleaning it up, and putting it into a database for easy usage. The components of the CIF that are found in this function:

■■ The operational system databases (source systems) contain the data used to run the day-to-day business of the company. These are still the major source of data for the decision support environment.

■■ The data warehouse is a collection or repository of integrated, detailed, historical data to support strategic decision-making.

■■  The operational data store is a collection of integrated, detailed, current data to support tactical decision making.

■■数据获取是一系列的过程和程序,用于从操作型系统抽取数据到数据仓库和操作型数据存贮。数据获取程序执行数据清洗、集成功能,并把数据转换为企业统一的格式。这种企业级的格式反映了一组集成的企业业务规则,这通常使数据获取层成为CIF体系中最复杂的部分。除了清洗和转换程序外,数据获取层还包含审计和控制的过程与程序,保证进入数据仓库或操作型数据存贮的数据的完整性。

■■“取信息出”由一系列过程和数据库组成,用于把BI交付给最终的企业用户和分析师,在CIF中包括如下组件:

■■从数据仓库分离出的数据集市,用于提供商业团体各种各样的决策分析支持。

■■ODS 分离出的操作集市,用于提供商业团体对现在的操作型数据进行多维访问。

■■把数据从数据仓库转移到数据集市和操作集市的过程叫数据交付。类似于数据获取层,它在移动数据的同时也对数据进行加工。只是在数据交付时,来源是数据仓库或ODS,这里已经包含了高质量的、集成的数据,且数据符合企业的业务规则。

■■  Data acquisition is a set of processes and programs that extracts data for the data warehouse and operational data store from the operational systems. The data acquisition programs perform the cleansing as well as the integration of the data and transformation into an enterprise format. This enterprise format reflects an integrated set of enterprise business rules that usually causes the data acquisition layer to be the most complex component in the CIF. In addition to programs that transform and clean up data, the data acquisition layer also includes audit and control processes and programs to ensure the integrity of the data as it enters the data warehouse or operational data store.

■■ Getting information out consists of the processes and databases involved in delivering BI to the ultimate business consumer or analyst. The components of the CIF that are found in this function:

■■ The data marts are derivatives from the data warehouse used to provide the business community with access to various types of strategic analysis.

■■ The oper marts are derivatives of the ODS used to provide the business community with dimensional access to current operational data.

■■ Data delivery is the process that moves data from the data warehouse into data and oper marts. Like the data acquisition layer, it manipulates the data as it moves it. In the case of data delivery, however, the origin is the data warehouse or ODS, which already contains high quality, integrated data that conforms to the enterprise business rules.
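The data acquisition steps described above (extract from the operational systems, cleanse, integrate into an enterprise format, and audit the load) can be pictured with a small sketch. Everything in it is an illustrative assumption rather than something taken from the book: the two source layouts, the field names, the `EnterpriseCustomer` record, and the simple rule of rejecting rows that carry no customer identifier.

```python
from dataclasses import dataclass


@dataclass
class EnterpriseCustomer:
    """Hypothetical enterprise format shared by every source system."""
    customer_id: str
    name: str
    country: str


# Hypothetical extracts from two operational systems with different layouts.
BILLING_ROWS = [{"cust_no": "0042", "cust_name": " alice smith ", "ctry": "us"}]
ORDER_ROWS = [
    {"customer": "42", "full_name": "Alice Smith", "country_code": "US"},
    {"customer": "", "full_name": "Unknown", "country_code": "US"},
]

# Each source is paired with the names of its id, name, and country fields.
SOURCES = [
    (BILLING_ROWS, ("cust_no", "cust_name", "ctry")),
    (ORDER_ROWS, ("customer", "full_name", "country_code")),
]


def to_enterprise_format(row, id_key, name_key, country_key):
    """Cleanse one source row and map it to the enterprise format."""
    raw_id = row[id_key].strip().lstrip("0")
    if not raw_id:
        return None  # assumed rule: a row without an identifier is rejected
    name = " ".join(row[name_key].split()).title()
    return EnterpriseCustomer(raw_id, name, row[country_key].strip().upper())


def acquire(sources):
    """Integrate all sources into one keyed store and keep audit counts."""
    warehouse = {}
    audit = {"read": 0, "rejected": 0, "loaded": 0}
    for rows, keys in sources:
        for row in rows:
            audit["read"] += 1
            record = to_enterprise_format(row, *keys)
            if record is None:
                audit["rejected"] += 1
                continue
            # Rows describing the same customer collapse into one record.
            warehouse[record.customer_id] = record
    audit["loaded"] = len(warehouse)
    return warehouse, audit


warehouse, audit = acquire(SOURCES)
```

The audit counts stand in for the audit and control processes mentioned in the text: three rows are read, one is rejected, and the two remaining rows integrate into a single customer record.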

CIF体系并不是一开始就如此。一开始,它由数据仓库和一些轻量级的汇总数据、高度汇总数据组成——最开始,需要历史数据的集合用来支持战略决策。一段时间后,产生了操作型数据存贮,用于支持战术决策支持系统;轻量级与高度汇总的数据存放在现在所谓的数据集市里。

让我们看看CIF的运转情况。客户关系管理(CRM)是一个普通的需求驱动器,驱动了战术信息部件(操作型系统,操作型数据存贮,操作型集市),战略信息部件(数据仓库和各种类型的数据集市)。当然,对CRM来说,这些技术是必须的,但远远不止这些技术,除了为客户和组织提供长期价值外,它还需要商业策略,企业文化与架构,客户信息等。CIF提供的架构非常适合CRM环境,在这个体系架构里,每一个部件都有专门的设计和功能。

The CIF didn’t just happen. In the beginning, it consisted of the data warehouse and sets of lightly summarized and highly summarized data—initially a collection of the historical data needed to support strategic decisions. Over time, it spawned the operational data store with a focus on the tactical decision support requirements as well. The lightly and highly summarized sets of data evolved into what we now know are data marts.

Let’s look at the CIF in action. Customer Relationship Management (CRM) is a highly popular initiative that needs the components for tactical information (operational systems, operational data store, and oper marts) and for strategic information (data warehouse and various types of data marts). Certainly this technology is necessary for CRM, but CRM requires more than just the technology —it also requires alignment of the business strategy, corporate culture and organization, and customer information in addition to technology to provide long-term value to both the customer and the organization. An architecture such as that provided by the CIF fits very well within the CRM environment, and each component has a specific design and function within this architecture.

我们会在本章后面更详细地描述每个部件。虽然CRM是数据仓库和操作型数据存贮常见的应用,但是还有很多其他的应用。例如,企业资源计划(ERP)系统的提供商,如SAP、Oracle、PeopleSoft等公司,都已接纳数据仓库,并扩充了各自的工具套件来提供所需要的功能。许多软件公司现在都提供各种插件,包含通用的分析应用,例如盈利能力分析、关键绩效指标(KPI)分析等。我们会在本章的后面章节详细地介绍CIF组件。

数据仓库的演进在帮助公司更好地服务客户、提高盈利能力方面至关重要。这既需要技术的变革,也需要一个可持续的体系结构。构建数据仓库环境的工具已经有了长足的发展,它们相当成熟,在关键企业数据的设计、实现、维护和访问方面带来极大的便利。CIF架构利用这些技术和工具的革新,创建了一个环境,把数据分成5个不同的存贮,每一种存贮担当一个关键的角色,在正确的时间、正确的地点、以正确的格式给商业团体提供正确的信息。所以,如果你在数据仓库建设上属于后半数成员甚至落伍者,请不要灰心,这样的等待是值得的。

We describe each component in more detail later in this chapter. CRM is a popular application of the data warehouse and operational data store but there are many other applications. For example, the enterprise resource planning (ERP) vendors such as SAP, Oracle, and PeopleSoft have embraced data warehousing and augmented their tool suites to provide the needed capabilities. Many software vendors are now offering various plug-ins containing generic analytical applications such as profitability or key performance indicator (KPI) analyses. We will cover the components of the CIF in far greater detail in the following sections of this chapter.

The evolution of data warehousing has been critical in helping companies better serve their customers and improve their profitability. It took a combination of technological changes and a sustainable architecture. The tools for building this environment have certainly come a long way. They are quite sophisticated and offer great benefit in the design, implementation, maintenance, and access to critical corporate data. The CIF architecture capitalizes on these technology and tool innovations. It creates an environment that segregates data into five distinct stores, each of which has a key role in providing the business community with the right information at the right time, in the right place, and in the right form. So, if you’re a data warehousing late majority or even a laggard, take heart. It was worth the wait.

什么是数据仓库

在我们开始描述建模技术前,我们先统一一些术语的定义:什么叫数据仓库,它在BI中的角色和用途,支持它的构造和使用的各种部件

数据仓库的角色和用途

我们在本章的第一节已经看到,BI 体系结构在过去的十年发生了极大的变化,从简单的报表和EIS系统,到多维分析,到数据挖掘,到数据探索。现在又引进了可定制的分析应用,这些技术是一个强壮的、成熟的BI环境的一部份。图1.3显示了这些技术发展的时间框架。考虑这些重要的、明显不同的技术和数据格式的需求,很明显,必须从一开始就有一个贮藏室,用于存贮高质量的、可信任的、灵活的、可重用的格式的数据,这些数据用于支持和维护BI环境。从一开始,数据仓库就是BI体系结构的一部份,不同的方法学及数据仓库大师给与这个部件不同的名字,如:

筹备区:一个数据仓库的变种是“后勤”筹备区,在这里从操作型系统来的数据首先被带到一起,是数据一种不正式的设计和维护分组,唯一的目的是给多维数据集市提供数据。

信息仓库:IBM公司早期对数据仓库的命名,不象筹备区定义那样清晰,在它的定义里,不仅包含历史数据仓库,还包含数据集市。

What Is a Data Warehouse?

Before we get started with the actual description of the modeling techniques, we need to make sure that all of us are on the same page in terms of what we mean by a data warehouse, its role and purpose in BI, and the architectural components that support its construction and usage.

Role and Purpose of the Data Warehouse

As we see in the first section of this chapter, the overall BI architecture has evolved considerably over the past decade. From simple reporting and EIS systems to multidimensional analyses to statistical and data mining requirements to exploration capabilities, and now the introduction of customizable analytical applications, these technologies are part of a robust and mature BI environment. See Figure 1.3 for the general timeframe for each of these technological advances.

Given these important but significantly different technologies and data format requirements, it should be obvious that a repository of quality, trusted data in a flexible, reusable format must be the starting point to support and maintain any BI environment. The data warehouse has been a part of the BI architecture from the very beginning. Different methodologies and data warehouse gurus have given this component various names such as:

A staging area. A variation on the data warehouse is the “back office” staging area where data from the operational systems is first brought together. It is an informally designed and maintained grouping of data whose only purpose is to feed multidimensional data marts.

The information warehouse. This was an early name for the data warehouse used by IBM and other vendors. It was not as clearly defined as the staging area and, in many cases, encompassed not only the repository of historical data but also the various data marts in its definition.

数据仓库环境必须包含各种技巧、功能和技术。因此,在设计时必须考虑两个问题:

首先,必须使用合适的粒度,或者说细节程度,以满足所有的数据集市。也就是说,它必须包含明细数据的“最小公分母”,既能给数据集市提供聚合的、汇总的数据,同时也能给探索系统与数据挖掘提供事务级的数据。

其次,设计数据仓库时不能向数据集市使用的各种工具妥协,除了考虑多维集市,还要容纳统计、挖掘、探索需求。另外,必须能容纳新的分析应用,为支持随时出现的新技术作准备。因此,它支持模式必须包含:星形模式、平文件、规范化的统计子集,及将来随时会带进BI的东西。考虑这些目标,让我们看看数据仓库怎样适合这些复杂的体系结构,以支持成熟的BI环境。

The data warehouse environment must align varying skill sets, functionality, and technologies. Therefore it must be designed with two ideas in mind.

First, it must be at the proper level of grain, or detail, to satisfy all the data marts. That is, it must contain the least common denominator of detailed data to supply aggregated, summarized marts as well as transaction-level exploration and mining warehouses.

Second, its design must not compromise the ability to use the various technologies for the data marts. The design must accommodate multidimensional marts as well as statistical, mining, and exploration warehouses. In addition, it must accommodate the new analytical applications being offered and be prepared to support any new technology coming down the pike. Thus the schemas it must support consist of star schemas, flat files, statistical subsets of normalized data, and whatever the future brings to BI. Given these goals, let’s look at how the data warehouse fits into a comprehensive architecture supporting this mature BI environment.

企业信息工厂(CIF)

CIF是一个广为接受的概念,对信息的存贮进行描述和分类,用于操作和管理一个成功的、强壮的BI基础架构。这些信息存贮支持三个高级的组织过程:

■■业务操作:即企业正在进行的、日常的业务处理,我们在这里找到操作型事务处理系统与外部数据。这些系统帮助运行业务,且通常高度自动化。支持这些功能的过程相对静态,只会发生跃变式的改变。也就是说,操作型过程日复一日保持稳定,只有企业有意识地推动时才会改变。

■■商业智能:为更好地了解公司及其产品、客户而进行的不断探索。业务操作过程是静态的,而商业智能除了静态过程,还包含不断演化的过程。当业务分析师和知识工作者探索可用的信息,并使用这些信息帮助他们开发新产品、衡量客户保持率、评估潜在新市场及完成其他种种任务时,这些过程会发生改变。商业智能支持企业的战略决策过程。

The Corporate Information Factory

The Corporate Information Factory (CIF) is a widely accepted conceptual architecture that describes and categorizes the information stores used to operate and manage a successful and robust BI infrastructure. These information stores support three high-level organizational processes:

■■Business operations are concerned with the ongoing day-to-day operations of the business. It is within this function that we find the operational transaction-processing systems and external data. These systems help run the business, and they are usually highly automated. The processes that support this function are fairly static, and they change only in quantum leaps. That is, the operational processes remain constant from day to day, and only change through a conscious effort by the company.

■■ Business intelligence is concerned with the ongoing search for a better understanding of the company, of its products, and of its customers. Whereas business operations processes are static, business intelligence includes processes that are constantly evolving, in addition to static processes. These processes can change as business analysts and knowledge workers explore the information available to them, using that information to help them develop new products, measure customer retention, evaluate potential new markets, and perform countless other tasks. The business intelligence function supports the organization’s strategic decision-making process.

■■商业管理:指对知识及在商业智能中发现的新知识制度化,并且带到整个企业日常的商业操作中。商业管理包括企业为满足战略决策而执行战术决策的系统。

作为一个整体,CIF能够用来识别一个机构所进行的所有信息管理活动。操作型系统仍然是企业的骨干,运行日常的业务。数据仓库收集集成的、历史的数据,支持客户分析与细分,数据集市则为商业团体提供执行这些分析的能力。操作型数据存贮及与它相连的操作集市支持近实时地获取集成的客户信息,并管理各种行动,以提供个性化的客户服务。让我们更详细地讨论CIF的各组件。

■■ Business management is the function in which the knowledge and new insights developed in business intelligence are institutionalized and introduced into the daily business operations throughout the enterprise. Business management encompasses the tactical decisions that an organization makes as it carries out its strategies.

Taken as a whole, the CIF can be used to identify all of the information management activities that an organization conducts. The operational systems continue to be the backbone of the enterprise, running the day-to-day business. The data warehouse collects the integrated, historical data supporting customer analysis and segmentation, and the data marts provide the business community with the capabilities to perform these analyses. The operational data store and associated oper marts support the near-real-time capture of integrated customer information and the management of actions to provide personalized customer service. Let’s examine each component of the CIF in a bit more detail.

操作型系统

操作型系统支持日常企业活动,关注于事务处理,从订单录入、计费到人力资源事务。在一个典型的组织内,操作型系统使用各种各样的技术和体系结构,除了企业内部定制开发的软件,还可能使用一些软件公司打包的系统。操作型系统本质上是静态的,仅仅在业务政策或流程有意识地改变,或者出于技术的原因(如系统维护或性能调整)时,才发生改变。

CIF架构内,操作型系统是大多数电子数据的来源,因为这些系统支持时间敏感的、实时的事务处理,常常为性能或事务吞吐量而优化。操作型系统的数据可能在不同的系统中复制,而且常常不同步。操作型系统代表了企业最早的业务规则应用,数据的质量直接影响所有其他信息系统的质量。

 

Operational Systems

Operational systems are the ones supporting the day-to-day activities of the enterprise. They are focused on processing transactions, ranging from order entry to billing to human resources transactions. In a typical organization, the operational systems use a wide variety of technologies and architectures, and they may include some vendor-packaged systems in addition to in-house custom-developed software. Operational systems are static by nature; they change only in response to an intentional change in business policies or processes, or for technical reasons, such as system maintenance or performance tuning.

These operational systems are the source of most of the electronically maintained data within the CIF. Because these systems support time-sensitive real time transaction processing, they have usually been optimized for performance and transaction throughput. Data in the operational systems environment may be duplicated across several systems, and is often not synchronized. These operational systems represent the first application of business rules to an organization’s data, and the quality of data in the operational systems has a direct impact on the quality of all other information used in the organization.

数据获取

许多公司被引诱跳过至关重要的数据集成阶段,而直接发布一系列不协调的、非集成的数据集市。没有数据获取层所包含的那一套统一的业务规则转换,这些公司最终建立的是一些孤立的、面向特定用户或部门的数据集市。这些集市常常不能结合起来产生有效的信息,也不能在企业内共享。跳过单一的、集成的数据获取层,最终结果是助长分析数据孤岛不受控制地激增。

Data Acquisition

Many companies are tempted to skip the crucial step of truly integrating their data, choosing instead to deploy a series of uncoordinated, unintegrated data marts. But without the single set of business rule transformations that the data acquisition layer contains, these companies end up building isolated, user- or department-specific data marts. These marts often cannot be combined to produce valid information, and cannot be shared across the enterprise. The net effect of skipping a single, integrated data acquisition layer is to foster the uncontrolled proliferation of silos of analytical data.

 

数据仓库

现在普遍接受的数据仓库定义由 Bill Inmon 在20世纪80年代提出:“面向主题的、集成的、随时间变化的、非易失的,用于战略决策的数据集合”。数据仓库是数据集成的中心点,是数据转化为信息的第一步。由于这种面向企业的定位,它为以下目标服务:

首先,数据仓库给整个企业提供统一的视图,而不管怎么使用它,这为对数据仓库中数据的解释(分析)提供了灵活性。数据仓库提供使用者一个稳定的数据源,包括前后一致的历史数据、各部门一致的可靠数据。

其次,企业作为一个整体,对历史信息有巨大的需求,数据仓库会增长到非常庞大(20到100TB甚至更多)。设计从一开始就要使用企业的业务规则,以最有效的方式适应信息的增长,供整个企业使用。

最后,数据仓库用于支持企业内各种形式的分析技术,也就是说,在数据仓库上可以建立很多数据集市,而不是每个数据集市各自提取及使用各自的数据。

Data Warehouse

The universally accepted definition of a data warehouse developed by Bill Inmon in the 1980s is “a subject-oriented, integrated, time variant and nonvolatile collection of data used in strategic decision making”1. The data warehouse acts as the central point of data integration, the first step toward turning data into information. Due to this enterprise focus, it serves the following purposes.

First, it delivers a common view of enterprise data, regardless of how it may later be used by the consumers. Since it is the common view of data for the business consumers, it supports the flexibility in how the data is later interpreted (analyzed). The data warehouse produces a stable source of historical information that is constant, consistent, and reliable for any consumer.

Second, because the enterprise as a whole has an enormous need for historical information, the data warehouse can grow to huge proportions (20 to 100 terabytes or more!). The design is set up from the beginning to accommodate the growth of this information in the most efficient manner using the enterprise’s business rules for use throughout the enterprise.

Finally, the data warehouse is set up to supply data for any form of analytical technology within the business community. That is, many data marts can be created from the data contained in the data warehouse rather than each data mart serving as its own producer and consumer of data.

操作型数据存贮

操作型数据存贮用于支持战术决策,不同于数据仓库用于支持战略决策,和数据仓库有些相似的属性,但是在其他方面有很大的不同:

■■与数据仓库一样,都是面向主题的。

■■与数据仓库一样,数据是集成的。

■■数据是实时的,如果技术允许,尽可能实时。这一点与数据仓库的历史性有明显的区别,ODS有少量历史性,尽可能的接近实时状态。

■■数据是易变的,或者说是可修改的,这与静态的数据仓库也是一个显著的区别。在这一点上,ODS类似于事务处理系统:当新的数据流入ODS,受影响的字段用新的信息重写或更新。除了审计跟踪之外,不保留先前内容的历史。

■■基本上都是明细数据,包含少量聚合或汇总数据,ODS常常被设计为用来包含事务级的数据,也就是说,主题域的最低明细级的数据。

ODS是关于客户、产品、库存等的近实时的、准确的、集成的数据源,企业的任何地方都可以访问,而不面向具体的应用。常用的ODS有4类,每一类有不同的特性和用途,它们之间最显著的不同是更新的频率,从每天到近实时(延迟小于一分钟)。业务用户经常直接访问ODS,这与数据仓库不同:很少直接在数据仓库上出报表(报表由数据集市出)。

 

Operational Data Store

The operational data store (ODS) is used for tactical decision making, whereas the data warehouse supports strategic decisions. It has some characteristics that are similar to those of the data warehouse but is dramatically different in other aspects:

■■ It is subject oriented like a data warehouse.

■■ Its data is fully integrated like a data warehouse.

■■ Its data is current—or as current as technology will allow. This is a significant difference from the historical nature of the data warehouse. The ODS has minimal history and shows the state of the entity as close to real time as feasible.

■■ Its data is volatile or updatable. This too is a significant departure from the static data warehouse. The ODS is like a transaction-processing system in that, when new data flows into the ODS, the fields affected are overwritten or updated with the new information. Other than an audit trail, no history of the previous contents is retained.

■■ Its data is almost entirely detailed with a small amount of dynamic aggregation or summarization. The ODS is most often designed to contain the transaction-level data, that is, the lowest level of detail for the subject area.

The ODS is the source of near-real-time, accurate, integrated data about customers, products, inventory, and so on. It is accessible from anywhere in the corporation and is not application specific. There are four classes of ODS commonly used; each has distinct characteristics and usage, but the most significant difference among them is the frequency of updating, ranging from daily to almost real time (subminute latency). Unlike a data warehouse, in which very little reporting is done against the warehouse itself (reporting is pushed out to the data marts), business users frequently access an ODS directly.
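The contrast between a volatile, current-valued ODS and a nonvolatile, time-variant data warehouse can be made concrete with a minimal sketch. The class names, the `apply` and `as_of` methods, and the sample customer are assumptions made up for illustration; a real ODS or warehouse would of course live in a database rather than in Python objects, and the audit trail the text mentions is left out.

```python
from datetime import date


class OperationalDataStore:
    """Keeps only the current state: new data overwrites the old value."""

    def __init__(self):
        self.current = {}

    def apply(self, key, value, as_of):
        # The effective date is accepted but not stored: no history is kept.
        self.current[key] = value


class DataWarehouse:
    """Keeps every version with its effective date and never updates in place."""

    def __init__(self):
        self.history = []

    def apply(self, key, value, as_of):
        self.history.append((key, value, as_of))

    def as_of(self, key, when):
        """Return the value that was effective for `key` on the date `when`."""
        matches = [(d, v) for k, v, d in self.history if k == key and d <= when]
        return max(matches, key=lambda m: m[0])[1] if matches else None


ods = OperationalDataStore()
dw = DataWarehouse()

# Hypothetical change feed: one customer moves from Boston to Denver.
updates = [
    ("cust-42", "Boston", date(2023, 1, 5)),
    ("cust-42", "Denver", date(2024, 3, 1)),
]
for key, city, effective in updates:
    ods.apply(key, city, effective)
    dw.apply(key, city, effective)

current_city = ods.current["cust-42"]
city_mid_2023 = dw.as_of("cust-42", date(2023, 6, 30))
```

After both updates the ODS can only answer "where is the customer now", while the warehouse can still answer "where was the customer in mid-2023", which is the historical question strategic analysis tends to ask.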

数据交付

数据交付一般限于各种操作,如数据聚合、按某些维度或业务需求过滤数据、为终端用户使用方便或支持某些BI软件工具而修改数据格式等,在整个企业交付或转换数据。在一个成熟的CIF环境下,数据交付的基础平台是相对稳定的,然而,数据集市对数据的需求为跟上商业信息的改变而发生巨大的变化,这意味着数据交付必须有足够的灵活性来跟上这种需求。

Data Delivery

Data delivery is generally limited to operations such as aggregation of data, filtering by specific dimensions or business requirements, reformatting data to ease end-user access or to support specific BI access software tools, and finally delivery or transmittal of data across the organization. The data delivery infrastructure remains fairly static in a mature CIF environment; however, the data requirements of the data marts evolve rapidly to keep pace with changing business information needs. This means that the data delivery layer must be flexible enough to keep pace with these demands.
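As a rough illustration of the operations listed above (filtering, aggregation, and reformatting for a mart), the following sketch derives a small regional sales mart from detailed warehouse rows. The row layout, the `deliver_monthly_sales` function, and the sample figures are all hypothetical and chosen only to keep the example short.

```python
from collections import defaultdict

# Hypothetical detailed (transaction-level) rows held in the data warehouse.
WAREHOUSE_SALES = [
    {"region": "EAST", "product": "widget", "month": "2024-01", "amount": 120.0},
    {"region": "EAST", "product": "widget", "month": "2024-01", "amount": 80.0},
    {"region": "WEST", "product": "widget", "month": "2024-01", "amount": 50.0},
    {"region": "EAST", "product": "gadget", "month": "2024-02", "amount": 30.0},
]


def deliver_monthly_sales(rows, region):
    """Filter by region, aggregate by month and product, reshape for the mart."""
    totals = defaultdict(float)
    for row in rows:
        if row["region"] != region:  # filtering by a business dimension
            continue
        totals[(row["month"], row["product"])] += row["amount"]  # aggregation
    # Reformatting: one flat, sorted record per month and product.
    return [
        {"month": month, "product": product, "total_amount": round(total, 2)}
        for (month, product), total in sorted(totals.items())
    ]


east_mart = deliver_monthly_sales(WAREHOUSE_SALES, "EAST")
```

Because the warehouse keeps the detailed rows, a different mart (another region, another level of summary) could be derived by changing only this delivery step, which is one way to read the text's point that delivery must stay flexible while the warehouse stays stable.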

数据集市

数据集市是数据仓库数据的一个子集,也是大多数BI分析活动发生的地方。每个数据集市的数据通常针对具体的能力或功能而裁剪,如产品盈利能力分析、KPI分析、客户统计分析等等。每一个具体的数据集市不一定适用于其他的用途。各种数据集市都有普遍性和特殊性。普遍性就是它们都包含数据仓库数据的一个子集,在物理配置上,可以和数据仓库放在一起,也可以在单独的平台上,数据量从几MB到数GB乃至TB不等。为了最大化数据仓库的投资回报率(Return on Investment,ROI),需要接纳并实现能支持全部这些分析的数据仓库体系结构。

Data Marts

Data marts are a subset of data warehouse data and are where most of the analytical activities in the BI environment take place. The data in each data mart is usually tailored for a particular capability or function, such as product profitability analysis, KPI analyses, customer demographic analyses, and so on. Each specific data mart is not necessarily valid for other uses. All varieties of data marts have universal and unique characteristics. The universal ones are that they contain a subset of data warehouse data, they may be physically collocated with the data warehouse or on their own separate platform, and they range in size from a few megabytes to multiple gigabytes to terabytes! To maximize your data warehousing ROI, you need to embrace and implement data warehouse architectures that enable this full spectrum of analysis.

元数据管理

元数据管理指在整个CIF体系内,收集、管理、配置元数据的过程集合。元数据分为3类:技术元数据,描述CIF的物理结构,即移动、转换数据的详细过程;业务元数据描述CIF 的数据结构、数据元素、业务规则、业务用例;管理元数据描述CIF的操作,包括审计、性能矩阵、数据质量矩阵等其他统计信息。

 

Meta Data Management

Meta data management is the set of processes that collect, manage, and deploy meta data throughout the CIF. The scope of meta data managed by these processes includes three categories. Technical meta data describes the physical structures in the CIF and the detailed processes that move and transform data in the environment. Business meta data describes the data structures, data elements, business rules, and business usage of data in the CIF. Finally, administrative meta data describes the operation of the CIF, including audit trails, performance metrics, data quality metrics, and other statistical meta data.
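One possible way to picture the three categories is as a small tagged catalog. The sketch below is an assumption made for illustration: the class names, fields, and sample entries are hypothetical and are not part of the CIF definition.

```python
from dataclasses import dataclass
from enum import Enum


class MetaDataCategory(Enum):
    TECHNICAL = "technical"            # physical structures and data movement
    BUSINESS = "business"              # definitions, rules, business usage
    ADMINISTRATIVE = "administrative"  # audit trails, performance and quality metrics


@dataclass(frozen=True)
class MetaDataEntry:
    category: MetaDataCategory
    subject: str
    description: str


# Hypothetical catalog entries, one per category.
CATALOG = [
    MetaDataEntry(MetaDataCategory.TECHNICAL, "customer table",
                  "physical columns and the load job that fills them"),
    MetaDataEntry(MetaDataCategory.BUSINESS, "customer",
                  "business definition and usage rules"),
    MetaDataEntry(MetaDataCategory.ADMINISTRATIVE, "nightly load",
                  "audit trail and data quality metrics"),
]


def entries_for(category):
    """Return all catalog entries that belong to the given category."""
    return [entry for entry in CATALOG if entry.category is category]
```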

信息反馈

信息反馈允许通过CIF获取的智能和知识以合适的方式共享给其他存贮的共享机制,它标志一个企业是否是一个真正的“不断学习”的企业。例如:

■■从数据集市得来的预算,反馈到数据仓库存贮,将用于历史分析。

■■传送ODS更新的数据(通过事务接口)到合适的操行型系统,使其反映最新的情况。

■■返回分析结果,如客户片区分类与生命周期价值评分数据,到操作型系统或者ODS

 

Information Feedback

Information feedback is the sharing mechanism that allows intelligence and knowledge gathered through the usage of the Corporate Information Factory to be shared with other data stores, as appropriate. It is the use of information feedback that identifies an organization as a true “learning organization.” Examples of information feedback include:

■■ Pulling derived measures such as new budget targets from data marts and feeding them back to the data warehouse where they will be stored for historical analysis.

■■ Transmitting data that has been updated in an operational data store (through the use of a Transactional Interface) to appropriate operational systems, so that those data stores can reflect the new data.

■■ Feeding the results of analyses, such as a customer’s segment classification and lifetime value score, back to the operational systems or ODS.

 

信息车间

信息车间时一套工具集合,帮助企业用户使用CIF 的资源,一个典型的功能就是提供数据和其他资源组织、分类的方法,方便用户的查找和使用,它是一种在企业内提升共享和重用分析结果的机制。在一些企业,也叫做企业门户,用于组织信息资源,并且让企业用户直接访问。信息车间的部件分类为:资源库、工具箱、工作台。

资源库和工具箱是企业创建信息车间的第一步,资源库提供CIF 可用的资源和数据目录,以企业用户容易理解的方式组织起来。这些目录类似图书馆的藏书目录, 按标准的分类方法分类并排序。这些分类法常常基于企业的组织结构或者更高层的企业过程。

工具箱是一些可重用的组件集合,如报表分析工具,企业用户可以共享,以更有效的工作及分享他人的分析成果。资源库与工具箱是信息车间的基本组成部分。

更成熟的CIF组织支持工作台的概念,用于继承信息。元数据、数据、分析工具按业务功能和任务组织起来。工作台不再使用资源库和工具箱那样使用严格的分类法,而是面向任务和工作流,支持企业用户的工作。

Information Workshop

The information workshop is the set of tools available to business users to help them use the resources of the Corporate Information Factory. The information workshop typically provides a way to organize and categorize the data and other resources in the CIF, so that users can find and use those resources. This is the mechanism that promotes the sharing and reuse of analysis across the organization. In some companies, this concept is manifested as an intranet portal, which organizes information resources and puts them at business users’ fingertips. We classify the components of the information workshop as the library, toolbox, and workbench.

The library and toolbox usually represent the organization’s first attempts to create an information workshop. The library component provides a directory of the resources and data available in the CIF, organized in a way that makes sense to business users. This directory is much like a library, in that there is a standard taxonomy for categorizing and ordering information components. This taxonomy is often based on organizational structures or high-level business processes.

The toolbox is the collection of reusable components (for example, analytical reports) that business users can share, in order to leverage work and analysis performed by others in the enterprise. Together, these two concepts constitute a basic version of the information workshop capability.

More mature CIF organizations support the information workshop concept through the use of integrated information workbenches. In the workbench, meta data, data, and analysis tools are organized around business functions and tasks. The workbench dispenses with the rigid taxonomy of the library and toolbox, and replaces it with a task-oriented or workflow interface that supports business users in their jobs.

 

操作与管理

对于一个不断增长的、可持续的CIF,操作与管理是非常必要的支撑和基础功能。在早期的CIF实现,很多公司没有认识到这些功能的重要性,在计划和实施时没有考虑。操作与管理功能包括CIF数据管理、系统管理、数据获取管理、服务管理、变更管理,每一类管理都包含一系列维护与加强这些重要进程的过程与策略。

Operations and Administration

Operation and administration include the crucial support and infrastructure functions that are necessary for a growing, sustainable Corporate Information Factory. In early CIF implementations, many companies did not recognize how important these functions were, and they were often left out during CIF planning and development. The operation and administration functions include CIF Data Management, Systems Management, Data Acquisition Management, Service Management, and Change Management. Each of these functions contains a set of procedures and policies for maintaining and enhancing these critically important processes.

数据仓库的多用途性

现在读者对数据仓库在BI中扮演的角色有了很好的理解。数据仓库不仅是操作型数据的集成点,更加是把数据提供给各企业用户的分发点。要为BI决策应用提供稳定的、永久的历史数据,必须拥有以下属性:

它必须是面向企业的。数据仓库应该是所有数据集市和分析应用的起点,它会被多个部门甚至是多个公司或分部使用。数据仓库设计组一个艰难而必须解决的问题就是数据元素及定义的冲突,必须要有企业业务人员参加。

The Multipurpose Nature of the Data Warehouse

Hopefully by now, you have a good understanding of the role the data warehouse plays in your BI environment. It not only serves as the integration point for your operational data, it must also serve as the distribution point of this data into the hands of the various business users. If the data warehouse is to act as a stable and permanent repository of historical data for use in your strategic BI applications, it should have the following characteristics:

It should be enterprise focused. The data warehouse should be the starting point for all data marts and analytical applications; thus, it will be used by multiple departments, maybe even multiple companies or subdivisions. A difficult but mandatory part of any data warehouse design team’s activities must be the resolution of conflicting data elements and definitions. The participation by the business community is also obligatory.

它的设计必须尽可能有弹性,满足变化。因为数据仓库用来存贮多年大量的、细节的、战略的数据,谁都不愿意卸载数据,重新设计数据库,然后再装入数据,为了避免这些操作,你应该考虑一下过程独立、应用独立、BI技术独立的数据模型,目标是创建一个可以轻易容纳新的数据元素,而不需要重新设计已存在的数据元素或模型的数据模型。

必须设计成能在很短的时间内装入大量数据。数据仓库的数据库设计必须最小化冗余,也即减少重复的属性和实体。大多数的数据库有批量装入工具,包含一些属性和功能用于优化大数据量的装入,如并行选项、按块装入、内置应用程序接口(API)等。这些情况,也许要关掉索引,或者需要平文件。然而,必须注意到非常重要的一点:一个不好的、低效的数据库,使用最好的装入工具也不能解决问题。

Its design should be as resilient to change as possible. Since the data warehouse is used to store massive, detailed, strategic data over multiple years, it is very undesirable to unload the data, redesign the database, and then reload the data. To avoid this, you should think in terms of a process-independent, application-independent, and BI technology-independent data model. The goal is to create a data model that can easily accommodate new data elements as they are discovered and needed without having to redesign the existing data elements or data model.

It should be designed to load massive amounts of data in very short amounts of time. The data warehouse database design must be created with a minimum of redundancy or duplicated attributes or entities. Most databases have bulk load utilities that include a range of features and functions that can help optimize this process. These include parallelization options, loading data by block, and native application program interfaces (APIs). They may mean that you must turn off indexing, and they may require flat files. However, it is important to note that a poorly or ineffectively designed database cannot be overcome even with the best load utilities.
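The idea of loading by block with indexing turned off can be sketched with Python's built-in `sqlite3` module. SQLite here is only a stand-in for a warehouse DBMS, and real bulk load utilities are vendor specific; the table, the index name, and the `bulk_load` helper are hypothetical.

```python
import sqlite3


def bulk_load(conn, rows, block_size=1000):
    """Load rows in blocks, building the index only after the load finishes."""
    # Drop the index first so it is not maintained row by row during the load.
    conn.execute("DROP INDEX IF EXISTS idx_order_customer")
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        with conn:  # one transaction per block; commits on success
            conn.executemany(
                "INSERT INTO customer_order (order_id, customer_id, amount) "
                "VALUES (?, ?, ?)",
                block,
            )
    # Rebuild the index once, after all blocks are in place.
    conn.execute("CREATE INDEX idx_order_customer ON customer_order (customer_id)")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_order ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL, "
    "amount REAL NOT NULL)"
)
rows = [(i, i % 10, float(i)) for i in range(1, 2501)]
bulk_load(conn, rows)
loaded = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
```

The sketch loads each attribute once into one table; a design with duplicated attributes would need additional inserts per row, which is the load burden the paragraph above warns about.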

 

它的设计应该使用数据交付工具进行数据抽取时最优。记住数据仓库的目标是为企业团体使用各种数据集市提供数据,所以数据仓库必须非常好的文档,以方便数据交付开发组容易地编写交付程序,数据的质量、血统、计算或者引出、意义,必须有清晰的文档。

数据格式必须支持任何可能的BI分析工具、任何可能的技术。在同一格式内,它应该包含最少的公分母级的明细数据,以支持所有的BI技术。它的设计必须没有偏见或者部门的情绪化的特殊应用。

It should be designed for optimal data extraction processing by the data delivery programs. Remember that the ultimate goal for the data warehouse is to feed the plethora of data marts that are then used by the business community. Therefore, the data warehouse must be well documented so that data delivery teams can easily create their data delivery programs. The quality of the data, its lineage, any calculations or derivations, and its meaning should all be clearly documented.

Its data should be in a format that supports any and all possible BI analyses in any and all technologies. It should contain the least common denominator level of detailed data in a format that supports all manner of BI technologies. And it must be designed without bias toward any particular department’s use.

 

支持的数据集市类型

今天,我们有太多的技术支持不同的分析需要——联机分析处理(OLAP)、探索、数据挖掘、统计数据集市,还有现在定制的分析应用。从各种技术的唯一特性,就是支持每种类型的数据集市:

OLAP数据集市:数据集市设计成支持一般的多维分析,使用OLAP软件工具,使用星型模式或者私有的“超立方”技术。星型模式或者多维数据库管理系统(MD DBMS)极大的支持多维分析,以可以接受的响应时间、重复的报表,满足稳定的需求、清楚的预定义查询。这些分析包括销售分析、产品利润分析、人力资源贡献跟踪、渠道销售分析等。

 

Types of Data Marts Supported

Today, we have a plethora of technologies supporting different analytical needs—Online Analytical Processing (OLAP), exploration, data mining and statistical data marts, and now customizable analytical applications. The unique characteristics come from the specificity of the technology supporting each type of data mart:


OLAP data mart. These data marts are designed to support generalized multidimensional analysis, using OLAP software tools. The data mart is designed using the star schema technique or proprietary “hypercube” technology. The star schema or multidimensional database management system (MD DBMS) is great for supporting multidimensional analysis in data marts that have known, stable requirements, fairly predictable queries with reasonable response times, and recurring reports. These analyses may include sales analysis, product profitability analysis, human resources headcount distribution tracking, or channel sales analysis.

探索数据仓库。大多数数据集市设计成支持各种具体的分析与报表,探索数据库用于提供探索,或则真正的“即兴”的数据导航。自从商业浏览器做出了有用的发明,分析能够格式化成另一种形式的数据集市(如OLAP),这样其他人能长期从中得益。新技术(如语言符号、编码矢量、位图技术等)大大提高了探索数据的能力及更快更有效的创建原型。

Exploration warehouse. While most common data marts are designed to support specific types of analysis and reporting, the exploration warehouse is built to provide exploratory or true “ad hoc” navigation through data. After the business explorers make a useful discovery, that analysis may be formalized through the creation of another form of data mart (such as an OLAP one), so that others may benefit from it over time. New technologies have greatly improved the ability to explore data and to create a prototype quickly and efficiently. These include token, encoded vector, and bitmap technologies.

数据挖掘或统计仓库。数据挖掘或统计仓库是一个特殊的数据集市,研究人员和分析师用于发掘数据与事件已知的与未知的关系,这些关系出乎预料。这是一个安全的港口,让人们执行查询,使用挖掘与统计算法,不用担心破坏生产数据仓库或者得到有偏见的数据(在多维设计里,只构建了已知的、已被证明的关系。)

订制的分析应用。这种新的数据集市允许廉价的、有效的定制普通的应用,这些“预制”的应用适合每个公司大多数需要,同时也能定制满足剩下的特殊的功能。这需要考虑灵活性与快速响应,以满足多样化与定制化。

Data-mining or statistical warehouse. The data-mining or statistical warehouse is a specialized data mart designed to give researchers and analysts the ability to delve into the known and unknown relationships of data and events without having preconceived notions of those relationships. It is a safe haven for people to perform queries and apply mining and statistical algorithms to data, without having to worry about disabling the production data warehouse or receiving biased data such as that contained in multidimensional designs (in which only known, documented relationships are constructed).

Customizable analytical applications. These new additions permit inexpensive and effective customization of generic applications. These “canned” applications meet a high percentage of every company’s generic needs yet can be customized for the remaining specific functionality. They require that you think in terms of variety and customization through flexibility and quick responsiveness.

 

支持的BI技术

在现实中,数据集市数据库的结构各不相同,从规范化、到反规范化、到事务平文件。理想的情况是在需求建立之后,然而,数据库的结构/解决方案往往是在知道具体的业务需要前选择的。我们那些数据仓库顾问看到,开发团队在没有做业务分析前就为使用星型模式还是规范化设计争论不休。因为某种理由,系统架构师和数据建模员总倾向于某种具体的设计技术——也许是熟悉某项技术,或者忽视了另一项技术——并且强制所有数据集市都使用这种设计,就像使惯了锤子的人,看什么都像钉子。

Types of BI Technologies Supported

The reality is that database structures for data marts vary across a spectrum from normalized to denormalized to flat files of transactions. The ideal situation is to select the database structure after the requirements are established. Unfortunately, the database structure/solution is often selected before the specific business needs are known. Those of us in the data warehouse consulting business have witnessed development teams debating star versus normalized designs before even starting business analysis. For whatever reason, architects and data modelers latch onto a particular design technique (perhaps through comfort with a particular technique or ignorance of other techniques) and force all data marts to have that one type of design. This is similar to the person who is an expert with a hammer: everything he or she sees resembles a nail.

我们推荐数据集市设计者,应该基于数据的使用和信息的类型来选择模式。当然,没有绝对,但我们觉得支持所有类型的数据集市的最好设计应该不要预建立或预确定数据关系。这里有一点重要的告诫:为数据集市提供数据的数据仓库,必须支持所有形式的分析,而不只是多维形式。

Our recommendation for data mart designs is that the schemas should be based on the usage of the data and the type of information requested. There are no absolutes, of course, but we feel that the best design to support all the types of data marts will be one that does not preestablish or predetermine the data relationships. An important caveat here is that the data warehouse that feeds the marts will be required to support any and all forms of analysis—not just multidimensional forms.

为了决定最好的数据库和后续数据集市设计以满足商业需求,我们推荐开发一个简单的矩阵,按数据挥发性、数据库类型划分,如图1.4。这个矩阵给设计者、架构师、数据库管理员看到需求与物理数据库之间的关系,就是说,挥发性、等待时间、多主题域等,与分析工具一起提供信息(通过开发的场景),如重复交付、即兴报表、产品报表、算法分析等。

To determine the best database design for your business requirements and ensuing data mart, we recommend that you develop a simple matrix that plots the volatility of the data against a type of database design required, similar to the one in Figure 1.4. Such a matrix allows designers, architects, and database administrators (DBAs) to view where the overall requirements lie in terms of the physical database drivers, that is, volatility, latency, multiple subject areas, and so on, and the analytical vehicle that will supply the information (via the scenarios that were developed), for example, repetitive delivery, ad hoc reports, production reports, algorithmic analysis, and so on.

可维护的数据仓库环境的特征

有了这些背景知识,我们讨论一个坚固的、可维护的数据仓库数据模型是什么样子呢?在设计任何数据仓库时,不管是为一个刚开始使用BI的公司,还是为一个已经有了复杂的技术和用户的公司,不管这个公司目前只使用一种BI工具还是已经有太多的BI工具使用,我们要考虑那些特征?

Characteristics of a Maintainable Data Warehouse Environment

With this as a background, what does a solid, maintainable data warehouse data model look like? What are the characteristics that should be considered when designing any data warehouse, whether for a company just beginning its BI initiative or for a company having a sophisticated set of technologies and users, whether the company has only one BI access tool today or has a plethora of BI technologies available?

创建BI 环境的方法论具有迭代性,很幸运的,我们拥有很多优秀的书籍专门讨论这种方法论(参见书末的“推荐阅读”章节)。简言之,包括以下步骤:

1、  首先,选择及文档化需要使用BI技术(某种数据集市)解决的商业问题。

2、  尽可能多的收集需求,这些需求会在下一步进行提炼。

3、  决定终端用户使用的技术,支持何种解决方法(OLAP,数据挖掘、探索、分析应用等等)。

4、  创建数据集市原型,和企业用户一起检验原型的功能,如果有必要,重新设计。

5、  分发数据仓库数据模型与商业数据模型,基于用户的需求。

6、  把数据集市的需求映射到数据仓库,并最终返回到操作型系统本身。

7、  编写代码执行ETL和数据交付过程 ,一定要包含错误检测与纠正及审计过程。

8、  测试数据仓库与数据集市创建过程,衡量数据质量参数,为环境创建合适的元数据。

9、  可以接受之后,把数据仓库与数据集市的首次迭代迁移到生产系统,培训企业的业务团队,开始计划下一次迭代。

The methodology for building a BI environment is iterative in nature. We are fortunate today to have many excellent books devoted to describing this methodology. (See the “Recommended Reading” section at the end of this book.) In a nutshell, here are the steps:

1. First, select and document the business problem to be solved with a business intelligence capability (data mart of some sort).

2. Gather as many of the requirements as you can. These will be further refined in the next step.

3. Determine the appropriate end-user technology to support the solution (OLAP, mining, exploration, analytical application, and so on).

4. Build a prototype of the data mart to test its functionality with the business users, redesigning it as necessary.

5. Develop the data warehouse data model, based on the user requirements and the business data model.

6. Map the data mart requirements to the data warehouse data model and ultimately back to the operational systems, themselves.

7. Generate the code to perform the ETL and data delivery processes. Be sure to include error detection and correction and audit trail procedures in these processes.

8. Test the data warehouse and data mart creation processes. Measure the data quality parameters and create the appropriate meta data for the environment.

9. Upon acceptance, move the first iteration of the data warehouse and the data mart into production, train the rest of the business community, and start planning for the next iteration.

警告

我们从未建议你在建立第一个分析能力(数据集市)之前,先建一个包含企业所需全部战略数据的完整数据仓库。每一个由后续数据集市解决的商业问题,都会给作为数据仓库基础的数据集合增加数据。最后,必须加入到数据仓库来支持新的数据集市的数据越来越少,甚至可以忽略,因为大多数数据已经存在于数据仓库内。

WARNING

Nowhere do we recommend that you build an entire data warehouse containing all the strategic enterprise data you will ever need before building the first analytical capability (data mart). Each successive business problem solved by another data mart implementation will add to the growing set of data serving as the foundation in your data warehouse. Eventually, the amount of data that must be added to the data warehouse to support a new data mart will be negligible because most of it will already be present in the data warehouse.

既然你不知道最终加入到数据仓库的数据量会有多大,也不知道最后会有多少BI 技术用于企业解决战略问题,因此你必须学会假设并依此作出计划。你可以假设数据仓库会成为企业内最大的数据库。数据仓库容量开始只有几GB,然后迅速扩张到几百GB,甚至TB,而现在预计为PB的情况并不少见。因此,不管现在是在BI生命周期的那个阶段,不管是刚刚开始,还是环境已经搭好了几年,关系数据库仍然是数据库管理系统的最好选择,它们有利于减少冗余、提高数据库设计效率,同时,在数据仓库部署时,你可以使用关系型DBMS所有复杂的、有用的特性:

Since you will not know how large the data warehouse will ultimately be, nor all of the BI technologies that will eventually be brought to bear upon strategic problems in your enterprise, you must make some educated assumptions and plan accordingly. You can assume that the warehouse will become one of the largest databases found in your enterprise. It is not unusual for the data warehouse size to start out in the low gigabyte range and then grow fairly rapidly to hundreds of gigabytes, then terabytes, and some now predict petabytes. So, regardless of where you are in your BI life cycle, whether just starting or several years into building the environment, the relational databases are still the best choice for your database management system (DBMS). They have the advantage of being very conducive to nonredundant, efficient database design. In addition, their deployment for the data warehouse means you can use all the sophisticated and useful characteristics of a relational DBMS:

■■通过工具访问数据(数据建模、ETL、元数据、BI访问),所有这些都使用关系型数据库的SQL

■■数据库尺寸可量测性。关系型数据库仍然在存贮大量数据方面有优势。

■■并行处理极大提高数据处理效率。关系数据库在并行处理方面非常优秀。

■■大量工具,如批量装载、碎片整理、容量重排、性能监控、备份与恢复、高效索引等。

关系型数据库是支持战略数据的理想选择。也许有一天多维数据库(MOLAP)能够在这方面与关系数据库对抗,但目前还不行。

■■ Access to the data by most any tool (data modeling, ETL, meta data, and BI access). All use SQL on the relational database.

■■ Scalability in terms of the size of data being stored. The relational databases are still superior in terms of storing massive amounts of data.

■■ Parallelism for efficient and extremely fast processing of data. The relational databases excel at this function.

■■ Utilities such as bulk loaders, defragmentation and reorganization capabilities, performance monitors, backup and recovery functions, and index wizards.

Again, the relational databases are ideal for supporting a repository of strategic data. There may come a time when the proprietary multidimensional databases (MOLAP) can effectively compete with their relational cousins, but that is not the situation currently.

数据仓库数据模型

我们曾经推荐使用关系型DBMS用于数据仓库建设,那么,在那种结构下,数据模型会有何特征?在继续讨论模型的特征之前,让我们再一次看看一些假设:

■■假设数据仓库的核心是面向企业的。这意味着数据仓库里的数据不能偏向一个部门或者企业的一部份,而忽视另一部份。因此,最终BI的能力需要进一步处理(如使用数据集市)用于为某个特定团队“定制”,但是,最开始的材料(数据)能被所有人使用。

■■作为上面假设的必然推论,数据仓库里的数据不能违反企业建立的商业规则。数据仓库的数据模型从形式到文档必须显示支持这些基本规则。

■■数据仓库必须尽快、尽可能高效的装入新数据,批窗口,也许他们曾经存在,现在正变得越来小。大量数据装入数据仓库必须发生在ETL过程,只允许用最少的时间装入数据。

■■数据仓库必须从一开始就支持多种BI技术,甚至在第一个数据集市项目还没有建立的时候。数据仓库建设若偏向于某种技术,如多维分析,会大大消除其他需要的能力,如数据挖掘与统计分析。

■■数据仓库必须优雅的接纳数据与数据结构的变化。记住,我们一开始并不知道所有的需求,也不知道所有战略数据的用途,我们假设,从一开始建立数据基础开始,就可能发生改变。

在心里记住这些假设,让我们看看理想的数据仓库的数据模型。

The Data Warehouse Data Model

Given that we recommend a relational DBMS for your data warehouse, what should the characteristics of the data model for that structure look like? Again, let’s look at some assumptions before going into the characteristics of the model:

■■ The data warehouse is assumed to have an enterprise focus at its heart. This means that the data contained in it does not have a bias toward one department or one part of the enterprise over another. Therefore, the ultimate BI capabilities may require further processing (for example, the use of a data mart) to “customize” them for a specific group, but the starting material (data) can be used by all.

■■ As a corollary to the above assumption, it is assumed that the data within the data warehouse does not violate any business rules established by the enterprise. The data model for the data warehouse must demonstrate adherence to these underlying rules through its form and documentation.

■■ The data warehouse must be loaded with new data as quickly and efficiently as possible. Batch windows, if they exist at all, are becoming smaller and smaller. The bulk of the work to get data into a data warehouse must occur in the ETL process, leaving minimal time to load the data.

■■ The data warehouse must be set up from the beginning to support multiple BI technologies—even if they are not known at the time of the first data mart project. Biasing the data warehouse toward one technology, such as multidimensional analyses, effectively eliminates the ability to satisfy other needs such as mining and statistical analyses.

■■ The data warehouse must gracefully accommodate change in its data and data structures. Given that we do not have all of the requirements or known uses of the strategic data in the warehouse from the very beginning, we can be assured that changes will happen as we build onto the existing data warehouse foundation.

With these assumptions in mind, let’s look at the characteristics of the ideal data warehouse data model.

 

非冗余

大多数数据仓库必须提供最小的装入周期和大量的数据,因此,数据模型必须包含最小冗余。冗余给装入工具增加很大的负担,也给设计者带来负担,他必须担心所有冗余的数据元素和实体在正确的时间得到正确的数据。你的数据模型引入到数据仓库的冗余越多,你最终“取数据入”的过程越复杂。这并意味着在数据仓库里不允许有冗余,在第4章,我们会介绍为什么及什么时候引入冗余,关键是深谋远虑的控制和管理冗余。

Nonredundant

To accommodate the limited load cycles and the massive amount of data that most data warehouses must have, the data model for the data warehouse should contain a minimum amount of redundancy. Redundancy adds a tremendous burden to the load utilities and to the designers who must worry about ensuring that all redundant data elements and entities get the correct data at the correct time. The more redundancy you introduce to your data warehouse data model, the more complex you make the ultimate process of “getting data in.” This does not mean that redundancy is not ever found in the data warehouse. In Chapter 4, we describe when and why some redundancy is introduced into the data warehouse. The key, though, is that redundancy is controlled and managed with forethought.

 

稳定

前面我们已经提到,我们以迭代方始构建数据仓库,这样能很快创建数据集市,但是也存在风险,可能遗失或者错误的表述一些重要的商业规则或数据元素。随着越来越多数据集市上线,这会变得更加坚决和突出。不可避免的,数据仓库和它的模型会发生变化。众所周知,在一个企业变化最多的是它的流程、应有和技术。如果我们依赖这三个因素创建一个模型,当这三者之一发生变化,我们必定要做大修。因此,作为设计者,我们必须使用数据建模技术,尽可能减轻这些问题,同时抓住企业所有重要的业务规则。减轻这些问题的最好的数据建模技术是创建一个不依赖于流程、应用、技术的数据模型。另一方面,既然变化不可避免,我们必须做好准备,当新的BI 能力和数据集市创建时,容纳新的实体和属性。数据仓库设计者必须再次使用建模技术,以合并新的变化,而不需要重新设计已经存在的元素和实体的实现。这个模型叫做系统模型,我们会在第3章进行更详细的描述。

 

Stable

As mentioned earlier, we build the data warehouse in an iterative fashion, which has the benefit of getting a data mart created quickly but runs the risk of missing or misstating significant business rules or data elements. These would be determined or highlighted as more and more data marts came online. It is inevitable that change will happen to the data warehouse and its data model. It is well known that what changes most often in any enterprise are its processes, applications, and technology. If we create a data model dependent upon any of these three factors, we can be assured of a major overhaul when one of the three changes. Therefore, as designers, we must use a data-modeling technique that mitigates this problem as much as possible yet captures the all-important business rules of the enterprise. The best data-modeling technique for this mitigation is to create a process-, application-, and technology-independent data model. On the other hand, since change is inevitable, we must be prepared to accommodate newly discovered entities or attributes as new BI capabilities and data marts are created. Again, the designer of the data warehouse must use a modeling technique that can easily incorporate a new change without someone’s having to redesign the existing elements and entities already implemented. This model is called a system model, and will be described in Chapter 3 in more detail.

一致

也许数据仓库数据模型最重要的特征是一致性,也就是最重要的资产——数据带给企业的一致性。包含所有元数据(定义、物理属性、别名、商业规则、数据所有者和管家、角色等等)的数据模型,对于企业用户最终理解他们分析的数据非常重要。数据模型创建过程必须协调待发行、数据差异与冲突问题,在任何ETL过程或数据映射之前。

Consistent

Perhaps the most essential characteristic of any data warehouse data model is the consistency it brings to the business for its most important asset: its data. The data models contain all the meta data (definitions, physical characteristics, aliases, business rules, data owners and stewards, domains, roles, and so on) that is critically important if business users are ultimately to understand what they are analyzing. The data model creation process must reconcile outstanding issues, data discrepancies, and conflicts before any ETL processing or data mapping can occur.

 

数据使用方面的灵活

数据仓库唯一重要的目的就是作为一个坚实的、可靠的、一致的数据基础,支持所有BI 应用。从现在开始应该清楚,不管你的第一个BI能力如何,你必须为所有的业务需求服务,不管他们使用何种技术。因此,数据仓库的数据模型必须保持应有和技术独立性,这样,才能实现支持任何应用和技术的理想。

另一方面,数据模型必须支撑企业建立的业务规则,这意味着数据模型不只是简单的平文件。平文件,虽然是创建星型模式、数据挖掘、数据探索子集的基础,但是不能加强,甚至证明任何已知的商业规则。作为设计者,你必须往前走一步,使用真正的特定企业规则、域、基数性、可选性等,建立一个真正的数据模型。不然,数据的后续使用可能不能掌握,违反业务规则的情况也会发生。

 

Flexible in Terms of the Ultimate Data Usage

The single most important purpose for the data warehouse is to serve as a solid, reliable, consistent foundation of data for any and all BI capabilities. It should be clear by now that, regardless of what your first BI capability is, you must be able to serve all business requirements regardless of their technologies. Therefore, the data warehouse data model must remain application and technology independent, thus making it ideal to support any application or technology.

On the other hand, the model must uphold the business rules established for the organization, and that means that the data model must be more than simply flat files. Flat files, while a useful base to create star schemas, data mining, and exploration subsets of data, do not enforce, or even document, any known business rules. As the designer, you must go one step further and create a real data model with the real business rules, domains, cardinalities, and optionalities specified. Otherwise, subsequent usage of the data could be mishandled, and violations in business rules could occur.
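To make the contrast with flat files concrete, the sketch below declares a few rules (a mandatory relationship, a domain of allowed values) as constraints and lets the database reject violations. It uses Python's `sqlite3` module as a stand-in DBMS; the tables, the status domain, and the `try_insert` helper are hypothetical examples, not rules from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    -- Optionality and cardinality: every order belongs to exactly one customer.
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    -- Domain: only these status values are valid.
    status      TEXT NOT NULL CHECK (status IN ('open', 'shipped', 'cancelled'))
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 'open')")


def try_insert(sql):
    """Return True if the insert is accepted, False if a rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# An order for a customer that does not exist violates the relationship rule.
orphan_ok = try_insert("INSERT INTO customer_order VALUES (11, 99, 'open')")
# An order with an unknown status violates the domain rule.
bad_status_ok = try_insert("INSERT INTO customer_order VALUES (12, 1, 'lost')")
```

A flat file would accept both bad rows silently; here the rules live in the model itself, so every load and every delivery program sees the same guarantees.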

 

Codd Date 前提

考虑上面我们提到的一个好的数据仓库模型的特征,我们提出,你能够使用的好的数据建模技术是基于原来关系数据库设计的——由Chris Date Ted Codd开发的实体关系图(ERD),有简单明了的构造规则,是一个已经证明了的、可靠的数据建模方法。规范化规则(将在第3章讨论)产生一个稳定的、一致的数据模型,支撑企业建立的政策与规则,同时,为后来的数据集市怎样分析数据带来极大的灵活性。在数据存贮和装入方面,这样的数据库是最有效的。然而,它不是完美的,我们会在下一节看到。

当我们的确感觉这种方法是非常典雅的,更重要的是,这种数据建模技术支撑所有我们提出的数据仓库环境的所有特征,这个数据仓库是可持续的、灵活的、可维护的、好理解的。这种数据仓库的数据模型是可以使用任意技术翻译到数据库的设计。如下所示:

The Codd and Date Premise

Given all of the above characteristics of a good data warehouse data model, we submit that the best data-modeling technique you can use is one based on the original relational database design—the entity-relationship diagram (ERD) developed by Chris Date and Ted Codd. The ERD is a proven and reliable data-modeling approach with straightforward rules of construction. The normalization rules discussed in Chapter 3 yield a stable, consistent data model that upholds the policies and rules of engagement established by the enterprise, while lending a tremendous amount of flexibility in how the data is later analyzed by the data marts. The resulting database is the most efficient in terms of storage and data loading as well. It is, however, not perfect, as we will see in the next section.

While we certainly feel that this approach is elegant in the extreme, more importantly, this data-modeling technique upholds all of the features and characteristics we specified for a sustainable, flexible, maintainable, and understandable data warehouse environment. The resultant data model for your data warehouse is translatable, using any technology, into a database design that is:

业务之间可靠。这种方式下,数据元素或实体的命名、关系、注解没有冲突。

企业内共享。从这种数据模型实现的数据仓库能够被多种数据交付过程及企业内任何地方的用户访问。

支持多种灵活的数据集市。这种数据库不会偏向BI环境的某个方向或另一个方向,所有的技术对你和你的公司都是可以选择的。

业务之间的正确性。数据仓库的数据模型能提供关于企业使用信息的方式的准确的、忠诚的表示。

变化时可调整。这样的数据库能够容纳新的数据元素和实体,同时保持与已有数据保持完整性。

Reliable across the business. It contains no contradictions in the way that data elements or entities are named, related to each other, or documented.

Sharable across the enterprise. The data warehouse resulting from the implementation of this data model can be accessed by multiple data delivery processes and users from anywhere in the enterprise.

Flexible in the types of data marts it supports. The resulting database will not bias your BI environment in one direction or another. All technological opportunities will still be available to you and your enterprise.

Correct across the business. The data warehouse data model will provide an accurate and faithful representation of the way information is used in the business.

Adaptable to changes. The resulting database will be able to accommodate new elements and entities, while maintaining the integrity of the implemented ones.

对创建数据集市的影响

现在我们已经描述了一个坚实的数据仓库数据模型的特征,并且已经推荐ERD或者叫规范化的方法,让我们来看看关于全面的BI环境的分支。

使用数据仓库最普遍的应用是多维分析,至少目前如此。在星型模式下,维度用于粗糙的关联到主题域模型——订单、顾客、产品、市场分区——时间也是一样。“今年1月到6月我们在西北区,某种商品有多少订单?”,如果我们使用数据仓库作为这个查询的数据来源,回答这一类问题需要花些功夫。它需要在好几个大表之间进行连接(订单、订单明细、产品、市场分区,SQL语句使用严格的时间片)。这种情形不可爱,也不受欢迎,对一般的企业用户,他们不熟悉SQL。因此,我们可以看到这样的情形,数据仓库的访问受到限制,仅仅那些非常精于数据设计和SQL的企业用户才能使用。如果一个企业有好的探索与挖掘技术,可能会去除所有对数据仓库的访问,而需要所有企业用户访问OLAP集市,或者探索集市,或者数据挖掘仓库。

Impact on Data Mart Creation

Now that we have described the characteristics of a solid data warehouse data model and have recommended an ERD or normalized (in the sense of Date and Codd) approach, let’s look at the ramifications that decision will have on our overall BI environment.

The most common applications that use the data warehouse data are multidimensional ones, at least today. The dimensions used in the star schemas correlate roughly to the subject areas developed in the subject area model (order, customer, product, market segment) and time. To answer the question, “How many orders for what products did we get in the Northeast section from January to June this year?” would take a significant amount of effort if we were to use the data warehouse as the source of data for that query. It would require a rather large join across several big entities (Order, Order Line Item, Product, Market Segment, with the restriction of the timeframe in the SQL statement). This is not a pretty or particularly welcomed situation for the average business user who is only distantly familiar with SQL. So, what we can see about this situation is that data warehouse access will have to be restricted to those business users who are very sophisticated in database design and SQL. If an enterprise has good exploration and mining technology, it may choose to cut off all access to the data warehouse, thus requiring all business users to access an OLAP mart, or exploration or data mining warehouse instead.
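The join described above might look roughly like the following against a normalized design. This is a minimal sketch using Python's `sqlite3` module; the table and column names, the sample rows, and the year 2003 are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE market_segment (segment_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer_order (
    order_id   INTEGER PRIMARY KEY,
    segment_id INTEGER NOT NULL REFERENCES market_segment (segment_id),
    order_date TEXT NOT NULL
);
CREATE TABLE order_line_item (
    order_id   INTEGER NOT NULL REFERENCES customer_order (order_id),
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO market_segment VALUES (1, 'Northeast'), (2, 'Southwest');
INSERT INTO product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO customer_order VALUES
    (100, 1, '2003-02-15'), (101, 1, '2003-05-01'),
    (102, 1, '2003-09-30'), (103, 2, '2003-03-10');
INSERT INTO order_line_item VALUES
    (100, 1, 5), (100, 2, 1), (101, 1, 2), (102, 1, 7), (103, 2, 4);
""")

# Four-entity join with the timeframe restriction in the WHERE clause.
QUERY = """
SELECT p.name, COUNT(DISTINCT o.order_id) AS order_count
FROM customer_order AS o
JOIN order_line_item AS li ON li.order_id = o.order_id
JOIN product AS p ON p.product_id = li.product_id
JOIN market_segment AS ms ON ms.segment_id = o.segment_id
WHERE ms.region = 'Northeast'
  AND o.order_date >= '2003-01-01' AND o.order_date < '2003-07-01'
GROUP BY p.name
ORDER BY p.name
"""
orders_by_product = conn.execute(QUERY).fetchall()
# orders_by_product == [('gadget', 1), ('widget', 2)]
```

Even on this toy schema the query needs three joins and a date restriction, which suggests why such access is usually left to a data mart built for the question.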

这是问题吗?其实不是。所有BI环境都必须具备某种形式的“后台”(back room)能力。正是在后台,我们完成集成、数据清洁、错误检测与纠正、转换,以及审计与控制机制等困难任务,以保证战略数据的质量。因此,所有BI环境都有这样一个“不对公众开放”的部分。我们只是更进了一步,主张这一部分应该被正式地建模、创建和维护。

Is this a problem? Not really. All BI environments must have “back room” capabilities of one sort or another. It is in the back room that we perform the difficult tasks of integration, data hygiene, error correction and detection, transformation, and the audit and control mechanisms to ensure the quality of the strategic data anyway. Therefore, all BI environments have this “closed off to the public” section of their environment. We have simply taken it one step further and said that this section should be formally modeled, created, and maintained.

在只有数据集市的世界里,前面描述的数据交付过程不仅要保证在正确的时间把信息正确地交付给正确的集市,还必须一遍又一遍地承担数据获取处理中的全部ETL任务。显然,如果数据交付过程只需要从一致的、质量可靠的数据源(数据仓库)中提取所需的数据,把它转换成数据集市技术要求的格式(星型模式、平面文件、规范化子集等),再交付到数据集市环境中装载,这个过程就可以大大简化。基于坚实的ERD数据模型构建数据仓库还有另一个好处:你会得到一组很好的可重用数据实体和数据元素。在只有数据集市的环境中,每个集市必须在自己的数据库内保存它需要的所有明细数据。除非两个集市共享一致性维度,否则集成两者会很困难,甚至不可能。设想存在一个明细数据的存储库,数据交付过程可以随时从中提取数据,BI访问工具在需要时也可以随时访问,而不必一遍又一遍地复制数据!这是数据仓库给你的BI环境带来的另一个显著好处。

In the data-mart-only world, the data delivery processes, described earlier, must take on not only the burden of ensuring the proper delivery of the information to the right mart at the right time but also the entire set of ETL tasks found in the data acquisition processing, over and over again. Given this situation, it should be obvious that the data delivery processes can be simplified greatly if all they have to worry about is extracting the data they specifically need from a consistent, quality source (the data warehouse), formatting it as required by the data mart technology (star schema, flat file, normalized subset, and so on), and delivering it to the data mart environment for uploading. As another benefit of constructing the data warehouse from a solid, ERD-based data model, you get a very nice set of reusable data entities and elements. In a data-mart-only environment, each mart must carry all the detailed data it requires within its database. Unless two data marts share common conformed dimensions, integrating them may be difficult, or even impossible. Imagine if a repository of detailed data existed that the data delivery processes could extract from and the BI access tools could access, if they needed to, at any time without having to replicate the data over and over! That is another significant benefit the data warehouse brings to your BI environment.
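The extract, format, and deliver steps can be sketched as a small data delivery routine that reads from an already integrated warehouse and loads a star-schema mart. This is only an illustration under stated assumptions: the table and column names (`dim_product`, `dim_market`, `fact_order_line`, and so on) are hypothetical, both databases are in-memory SQLite, and time is reduced to a month string on the fact table rather than modeled as a full date dimension.

```python
import sqlite3

# Hypothetical normalized warehouse, assumed to be integrated and cleansed already.
WAREHOUSE_SCHEMA = """
CREATE TABLE market_segment (segment_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY, segment_id INTEGER, order_date TEXT);
CREATE TABLE order_line_item (order_id INTEGER, line_no INTEGER, product_id INTEGER, quantity INTEGER);
"""

# Hypothetical star schema for an order mart: two dimensions and one fact table.
MART_SCHEMA = """
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_market (market_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_order_line (
    order_id INTEGER, product_key INTEGER, market_key INTEGER,
    order_month TEXT, quantity INTEGER
);
"""

# Extract step: one join against the warehouse, reshaped into fact rows.
EXTRACT_FACTS = """
SELECT li.order_id, li.product_id, o.segment_id,
       substr(o.order_date, 1, 7) AS order_month, li.quantity
FROM order_line_item AS li
JOIN customer_order AS o ON o.order_id = li.order_id
"""


def build_warehouse():
    """Create an in-memory warehouse with a few made-up rows."""
    wh = sqlite3.connect(":memory:")
    wh.executescript(WAREHOUSE_SCHEMA)
    wh.executemany("INSERT INTO market_segment VALUES (?, ?)",
                   [(1, "Northeast"), (2, "Southwest")])
    wh.executemany("INSERT INTO product VALUES (?, ?)",
                   [(10, "Widget"), (11, "Gadget")])
    wh.executemany("INSERT INTO customer_order VALUES (?, ?, ?)",
                   [(100, 1, "2024-02-15"), (101, 2, "2024-03-10")])
    wh.executemany("INSERT INTO order_line_item VALUES (?, ?, ?, ?)",
                   [(100, 1, 10, 3), (100, 2, 11, 1), (101, 1, 10, 5)])
    wh.commit()
    return wh


def deliver_order_mart(warehouse, mart):
    """Extract from the warehouse, format as a star schema, load into the mart."""
    mart.executescript(MART_SCHEMA)
    mart.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     warehouse.execute("SELECT product_id, product_name FROM product"))
    mart.executemany("INSERT INTO dim_market VALUES (?, ?)",
                     warehouse.execute("SELECT segment_id, region FROM market_segment"))
    mart.executemany("INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?)",
                     warehouse.execute(EXTRACT_FACTS))
    mart.commit()


def units_by_product(mart, region, first_month, last_month):
    """Mart-side query: units per product for a region and inclusive month range."""
    return mart.execute(
        """SELECT p.product_name, SUM(f.quantity)
           FROM fact_order_line AS f
           JOIN dim_product AS p ON p.product_key = f.product_key
           JOIN dim_market AS m ON m.market_key = f.market_key
           WHERE m.region = ? AND f.order_month BETWEEN ? AND ?
           GROUP BY p.product_name ORDER BY p.product_name""",
        (region, first_month, last_month)).fetchall()


if __name__ == "__main__":
    mart = sqlite3.connect(":memory:")
    deliver_order_mart(build_warehouse(), mart)
    print(units_by_product(mart, "Northeast", "2024-01", "2024-06"))
```

Note that the delivery routine contains no cleansing, matching, or error handling logic: in this sketch those tasks are assumed to have been done once, during data acquisition into the warehouse. A second mart built from the same warehouse would reuse the same `product` and `market_segment` keys, which is one way two marts could end up with conformed dimensions.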

总结

有一些BI方法论和顾问会告诉你:你不需要数据仓库;所有数据集市合在一起就构成了“数据仓库”,至少是一个虚拟的数据仓库;或者,企业真正想要的不过是一个独立的数据集市。我们发现,所有这些方法都严重缺乏可持续性和成熟度。本书采用“最佳实践”方法来创建数据仓库。我们所说的最佳实践是一组建议,告诉设计者应该采取哪些行动、避免哪些行动,从而使他们的整体工作获得最大的成功。

Summary

There are several BI methodologies and consultants who will tell you that you do not need a data warehouse, that the combination of all the data marts together creates the “data warehouse,” or at least a virtual one, or that all the business really wants is just a standalone data mart. We find all of these approaches to be seriously lacking in sustainability and sophistication. This book takes a “best practices” approach to creating a data warehouse. The best practices we use are a set of recommendations that tell designers what actions they should take or avoid, thus maximizing the success of their overall efforts.

这些建议基于我们在这个领域多年的经验、参与过的许多数据仓库项目,以及对许多成功的、可维护的数据仓库环境的观察。显然,没有一种方法是完美的,也不应该不考虑具体情况就盲目遵循任何一种方法。你应该弄清楚什么方法最适合你的环境,然后酌情应用这些规则,并在情况变化或出现新情况时调整它们。尽管有这些告诫,本书仍然充满了有用的、有价值的信息、指导和提示。在接下来的章节中,我们会更详细地描述所需的数据模型,一步一步地讲解数据仓库数据模型的构建,并讨论在创建可持续、可维护的商业智能环境的过程中可能遇到的部署问题。读完本书,你应该完全有能力着手构建自己的BI环境,并掌握针对数据仓库的最佳设计技术。

These recommendations are based on years of experience in the field, participation in many data warehouse projects, and the observation of many successful and maintainable data warehouse environments. Clearly, no one method is perfect, nor should one be followed blindly without thought being given to the specific situation. You should understand what works best in your environment and then apply these rules as you see fit, altering them as changes and new situations arise. In spite of this caveat, this book is filled with useful and valuable information, guidelines, and hints.

In the following chapters, we will describe the data models needed in more detail, go over the construction of the data warehouse data model step by step, and discuss deployment issues and problems you may encounter along the way to creating a sustainable and maintainable business intelligence environment. By the end of the book, you should be fully qualified to begin constructing your BI environment armed with the best design techniques possible for your data warehouse.

