Data Mining Tutorial (Part 1)



Seth Paul
Jamie MacLennan
Zhaohui Tang
Scott Oveson

Microsoft Corporation

June 2005

Chinese translation: 张晓东

 

Abstract: Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios (targeted mailing, forecasting, market basket, and sequence clustering) to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.


 


Contents

Data Mining Tutorial
Contents
Introduction
Mining Model Algorithms
Microsoft Decision Trees
Microsoft Clustering
Microsoft Naïve Bayes
Microsoft Time Series
Microsoft Association
Microsoft Sequence Clustering
Microsoft Neural Network
Microsoft Linear Regression
Microsoft Logistic Regression
Working Through the Tutorial
Preparing the SQL Server Database
Preparing the Analysis Services Database
Creating an Analysis Services Project
Creating a Data Source
Creating a Data Source View
Editing the Data Source View
Building and Working with the Mining Models
Targeted Mailing
Forecasting
Market Basket
Sequence Clustering

 


Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best model.
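To make the idea of comparing predictive accuracy concrete, here is a minimal Python sketch of a gain and lift calculation on a holdout set. It is not how the Mining Accuracy Chart tab is implemented; the function names, the two sets of model scores, and the ten test cases are all made up for illustration.

```python
from typing import Dict, Sequence


def gain_at(scores: Sequence[float], actual: Sequence[int], top_n: int) -> float:
    """Share of all positive cases found among the top_n highest-scored cases."""
    if len(scores) != len(actual):
        raise ValueError("scores and actual must have the same length")
    positives = sum(actual)
    if positives == 0:
        return 0.0
    # Rank cases from most to least likely according to the model.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:top_n])
    return found / positives


def lift_at(scores: Sequence[float], actual: Sequence[int], top_n: int) -> float:
    """Gain divided by the share of cases selected (1.0 means no better than random)."""
    return gain_at(scores, actual, top_n) / (top_n / len(actual))


# Hypothetical holdout data: 1 = customer bought, 0 = customer did not buy.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Hypothetical predicted probabilities from two candidate models.
models: Dict[str, Sequence[float]] = {
    "model_a": [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.1, 0.3],
    "model_b": [0.6, 0.7, 0.5, 0.8, 0.2, 0.4, 0.9, 0.1, 0.3, 0.2],
}

# Pick the model that captures the most buyers in the top 3 ranked cases.
best = max(models, key=lambda name: gain_at(models[name], actual, 3))

for name, scores in models.items():
    print(name, gain_at(scores, actual, 3), lift_at(scores, actual, 3))
print("best:", best)
```

The sketch ranks the same test cases by each model's score and checks how many real buyers fall in the top slice, which is the intuition behind comparing models by lift.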

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.
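As a rough illustration of what a generated prediction query looks like, the Python sketch below assembles the text of a DMX PREDICTION JOIN statement from a few inputs. It is not the Prediction Query Builder itself. The helper name build_prediction_query and the model, column, data source, and table names are assumptions chosen for the example, and the helper does no escaping of quotes or brackets.

```python
from typing import Dict


def bracket(name: str) -> str:
    """Wrap an identifier in square brackets, as DMX expects for names with spaces."""
    return "[" + name + "]"


def build_prediction_query(
    model: str,
    predict_column: str,
    data_source: str,
    source_query: str,
    key_column: str,
    column_map: Dict[str, str],
) -> str:
    """Compose a DMX prediction query as text (illustrative only, no escaping)."""
    # Map each model column to the matching column of the input rowset "t".
    on_clauses = [
        f"{bracket(model)}.{bracket(model_col)} = t.{bracket(source_col)}"
        for model_col, source_col in column_map.items()
    ]
    return (
        f"SELECT t.{bracket(key_column)}, Predict({bracket(predict_column)})\n"
        f"FROM {bracket(model)}\n"
        "PREDICTION JOIN\n"
        f"OPENQUERY({bracket(data_source)}, '{source_query}') AS t\n"
        "ON " + " AND\n   ".join(on_clauses)
    )


# All names below are hypothetical placeholders, not taken from this section.
query = build_prediction_query(
    model="TM Decision Tree",
    predict_column="Bike Buyer",
    data_source="Adventure Works DW",
    source_query="SELECT CustomerKey, Age, Gender FROM dbo.ProspectiveBuyer",
    key_column="CustomerKey",
    column_map={"Age": "Age", "Gender": "Gender"},
)
print(query)
```

The resulting text joins the mining model to new input rows and asks the model for a predicted value per row, which is the general shape of the DMX the query builder lets you inspect.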


Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it and translates them into a mining model; it is the engine behind the process. SQL Server 2005 includes nine algorithms:

· Microsoft Decision Trees
· Microsoft Clustering
· Microsoft Naïve Bayes
· Microsoft Sequence Clustering
· Microsoft Time Series
· Microsoft Association
· Microsoft Neural Network
· Microsoft Linear Regression
· Microsoft Logistic Regression

Using a combination of these nine algorithms, you can create solutions to common business problems. These algorithms are described in more detail later in this tutorial.

Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.

To demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. To make the sample database available, select it at installation time in the “Advanced” dialog for component selection.

The audience for this tutorial is business analysts, developers, and database administrators who have used data mining tools before and are familiar with data mining concepts. If you are new to data mining, download "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis Services" (http://msdn.microsoft.com/library/default.asp?url=/servers/books/sqlserver/mining.asp).


Adventure Works Database

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington, with 500 employees, and several regional sales teams are located throughout the company's market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

For more information on Adventure Works Cycles, see "Sample Databases and Business Scenarios" in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

· North America (83%)
· Europe (12%)
· Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.

The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an integrated development environment (IDE) in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

· You have powerful customization tools available to configure Business Intelligence Development Studio to suit your needs.
· You can integrate your Analysis Services project with a variety of other business intelligence projects, encapsulating your entire solution into a single view.
· Full source control integration enables your entire team to collaborate in creating a complete business intelligence solution.

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected to the server. In this way, changes are reflected directly on the server without having to deploy the solution.



To be continued...


[ I translated this article purely for study purposes. If you are interested in SQL Server 2005 technology, you are welcome to write to me. ]

Reposted from: https://www.cnblogs.com/coldwine/archive/2005/10/21/258812.html
