COMP9318_WEEK2

声明：由于本人也是处于学习阶段，有些理解可能并不深刻，甚至会携带一定错误，因此请以批判的态度来进行阅读，如有错误，请留言或直接联系本人。

本周内容参照Jiawei.Han&Micheline.Kamber&Jian.Pei, DATA MINING: Concepts and Techniques, Third Edition. 版本的部分内容。

本周内容：1）Data Warehouse定义；2）OLAP介绍；3）Data Cube介绍

关键词：Data Warehouse; OLAP; Data Cube; Lattice; Roll-up; Drill-down; Slice and Dice; Pivot

首先提出问题：

什么是Data warehouse？

2. OLAP是什么？与Data Warehouse有什么关系？

问题一：什么是Data warehouse？

关于Data Warehouse 这里有很多版本的定义，我们这里采用William H. Inmon的定义：

W. H. Inmon对Data Warehouse的定义：A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

特点一，subject-oriented（面向主题）：

1）Organized around major subjects, such as customer,product, sales.

2） Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

3） Provide a simpleand conciseview around particular subject issues by excluding data that are not useful in the decision support process.

（Data Warehouse 面向的数据是有侧重点的的；且它关注的是数据的模型搭建和分析，以帮助决策者来进行决策；它针对具体的主题通过简洁明了的视图来使决策者能够明了的看见有用的信息）

特点二，integrated（整合）：

1）Constructed by integrating multiple, heterogeneous data sources

1.1）relational databases, flat files, on-line transaction records

2）Data cleaning and data integration techniques are applied.

2.1）Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.(E.g., Hotel price: currency, tax, breakfast covered, etc.)

2.2）When data is moved to the warehouse, it is converted.

（Data Warehouse 将复杂的异构的数据整合起来；这里会用到数据清洗和数据整合技术，来确保数据结构合理正确（naming conventions, encoding structures, attribute measures））

特点三，time-variant（时变）：

1）The time horizon for the data warehouse is significantly longer than that of operational systems.

1.1）Operational database: current value data.

1.2）Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

2）Every key structure in the data warehouse

2.1）Contains an element of time, explicitly or implicitly

2.2）But the key of operational data may or may not contain “time element”.

（Data Warehouse 的存在时间是久于Operational database的，所以它会存储很多历史性数据，因此我们需要在Data Warehouse的关键结构点（key structure）上显式的或隐式的存储时间标记（time element），所以Data Warehouse是具有时间性的。）

特点四，nonvolatile（非易失性）：

1）A physically separate store of data transformed from the operational environment.

2）Operational update of data does not occur in the data warehouse environment.

2.1）Does not require transaction processing, recovery, and concurrency control mechanisms

2.2）Requires only two operations in data accessing:

2.2.1）initial loading of data and access of data.

（Data Warehouse 的数据存储是独立的，且它不会实时更新数据（Does not require transaction processing, recovery, and concurrency control mechanisms），它只需要能够初始化加载数据和访问数据就行。因此在DW中不需要考虑数据冗余（data redundancy），因为即使一个个体的汇总数据是在不断更新的，但是DW不会覆盖历史数据，只会不断的存入该汇总数据。例如，Woolworths在Sydney的A2牛奶的销售汇总数据每年都是不一样的，DW会将1990年...,2015年，2016年，2017年的销售汇总数据存储起来，而不是每年都会用新数据覆盖旧数据。所以说，在DW中的data不能被删除也不能被修改。）

总结：

Data Mining:concepts and techniques P.106总结理一下William H. Inmon关于Data Warehouse的定义：In sum, a data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.

在PPT第11页，老师提供了个图，来帮助理解Data Warehouse Architecture

image.png

（1）外部数据通过Extract, Transform, Load, Refresh这些步骤，将数据进行clean , match, aggregate，然后放入DW。（2）现在，DW已经是所有subject的data一个统一体（我们还可以进行细分，将DW分成若干个Data Marts（这些mart都是针对于specific subject的））。然而在DW中，那我们用什么工具来帮助分析处理这些数据呢？当然是OLAP。（3）经过OLAP的分析处理后，我们还可以用这些已经比较直观的数据再次进行analysis, query, data Mining 来获取更加深层次的隐藏的信息。

问题二：OLAP是什么？与Data Warehouse 有什么关系？

OLAP是什么：

1.1）（来自Wikipedia）

Online analytical processing, or OLAP , is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing.[1] OLAP is part of the broader category of business intelligence, which also encompasses relational database, report writing and data mining.[2] Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM),[3] budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture.[4] The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).[5]

OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.[6]

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time.[7]

1.2）来自DATA MINING

Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles.（P.3）

（注：韩教授书中对于OLAP的操作分类略有不同：Typical OLAP operations include rollup, drill-(down, across, through), slice-and-dice, pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. OLAP operations can be implemented efficiently using the data cube structure.（DATA MINING P.151）本人更加倾向于韩教授的分类）

下面是关于Roll-up，Drill-down的解释：

Roll-up, Drill-down是OLAP的操作之一，它们允许用户在不同的汇总级别观察数据。例如，按地级市汇总的销售数据，用户可以Rool-up得到按省汇总的销售数据，可以Drill-down得到按县（区）汇总的销售数据。

如下图可直观的感受到Roll-up和Drill-down的操作。

image.png

2）OLAP与Data Warehouse 有什么关系？

Data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.（Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing and will provide an effective platform for data mining. ）（DATA MINING P.105）

新的问题产生：

3.Data Warehouse和传统的DBMS有什么不同呢？

4. 相较于OLTP, OLAP具有什么优势？

首先我们给OLTP下个定义：The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.（DATA MINING P.108）

问题三：Data Warehouse和传统的DBMS有什么不同呢？

（1）High performance for both systems

（1.1）DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery

（1.2）Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

（2）Different functions and different data:

（2.1）missing data: Decision support requires historical data which operational DBs do not typically maintain

（2.2）data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

（2.3）data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

课程PPT解释的太过笼统，下面请看韩教授对于它们不同点的详细解释：

1）Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

2）Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

3）Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model (to be discussed in Section 3.2.2) and a subjectoriented database design.

4）View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

5）Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.（DATA MINING P.108-109）

image.png

问题四：相较于OLTP, OLAP具有什么优势？

（1）Different workload:

（1.1）OLTP (on-line transaction processing)

（1.1.1）Major task of traditional relational DBMS

（1.1.2）Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

（1.2）OLAP (on-line analytical processing)

（1.2.1）Major task of data warehouse system

（1.2.2）Data analysis and decision making

（2）Queries hard/infeasible for OLTP, e.g.,

（2.1）Which week we have the largest sales?

（2.2）Does the sales of dairy products increase over time?

（2.3）Generate a spread sheet of total sales by state and by year.

（3）Difficult to represent these queries by using SQL

继续提问：5. 什么是multidimensional Model呢？

问题五：什么是multidimensional Model呢？

Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.（DATA MINING P.110）

好的，现在我们肯定有疑问，什么是fact？什么是measure？什么是dimension？什么是data cube？让我们一一解答。

什么是fact？：(PPT定义)Facts: the subject it models；(DATA MING P.111定义)Facts are numerical measures. （个人总结：fact就是用数字数据来度量的subject。）

什么是measure？:A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.(DATA MINING P.119)

什么是dimension？：Dimensions: context of the measure（Dimensions are the perspectives or entities with respect to which an organization wants to keep records. （DATA MINING P.110））让我们用实例来说明dimension。

image.png

在这个cube中，它的dimensions其实是product，location，time，那么为什么我们看到的图的三个维度分别是product，month，city呢？因为每个dimension是分hierarchy的（就如这里的city，如果我们使用OLAP工具进行操作的话，它可以roll-up到sate，也可以drill-down到stor。注意，OLAP操作和herarchy不是一类事物，详见DATA MINING P.121）。因此，我们需要给每个dimension一个类似于总称的东西。

什么是data cube？：A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.（DATA MINING P.110）（个人总结：data cube是由Tables and Spreadsheets 转化而来的更为直观立体的数据表现形式。）

在ppt22页有这张图，让我们对其进行探究。

image.png

这里的蓝色区域其实表示所有product的汇总，正如绿色方框内文字所表示的，箭头所指小方框表示“所有product在一月份在纽约的销售量”。

由此，我们可以根据如下图，得到下下一张的汇总图：

image.png

image.png

我们将汇总图局部拆解，显得更加直观

image.png

依此进行组合就可以得到上图的一个汇总图。

好的，我们还可以Lattice of the cuboids，

image.png

这张图我们可以理解为，在1层时（从下往上，0层开始）为点，2层为面，3层为立方体。依此往上，我们可以很方便的将大于3维的cube直观的展现在我们眼前。

下面我们讲一下，分析一下Common OLAP Operations，由于前面我们已经分析过Roll-up和drill-down，所以这里我们不再赘述。（详细说明可参照DATA MINING P.125）

Slice：The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.

image.png

Dice：The dice operation defines a subcube by performing a selection on two or more dimensions.

image.png

Pivot：Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cubeinto a series of 2-D planes.

image.png

COMP9318_WEEK2

你可能感兴趣的:(COMP9318_WEEK2)