Getting Started with SQL Server
In past chats, we have looked at a myriad of Business Intelligence techniques that one can utilize to turn data into information. In today’s get-together we are going to look at a technique that is dear to my heart and often overlooked: data mining with SQL Server, from soup to nuts.
Microsoft has come up with a fantastic set of data mining tools which are often underutilized by Business Intelligence folks, not because they are of poor quality but rather because not many folks know of their existence, or because people have never had the opportunity to utilize them.
Rest assured that you are NOW going to get a bird’s eye view of the power of the mining algorithms in our ‘fire-side’ chat today.
As I wish to describe the “getting started” process in detail, this article has been split into two parts. The first covers exactly that (getting started), whilst the second part will discuss turning the data into real information.
So ‘grab a pick and shovel’ and let us get to it!
For today’s exercise, we start by having a quick look at our source data. It is a simple relational table within the SQLShackFinancial database that we have utilized in past exercises.
As a disclosure, I have changed the names and addresses of the true customers in the “production data” that we shall be utilizing. The names and addresses that we shall utilize come from the Microsoft Contoso database. Further, I have split the client data into two distinct tables: one containing customer numbers under 25000 and the other containing customer numbers greater than 25000. The reason for doing so will become clear as we progress.
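For readers who like to see the split spelled out, here is a minimal Python sketch of the idea. It is only an illustration: the row layout and the `customer_number` key are my assumptions, not the real table definition, and the real split was of course done in the database.

```python
THRESHOLD = 25000  # the boundary quoted in the text


def split_customers(rows, threshold=THRESHOLD):
    """Split rows into (under_threshold, over_threshold) by customer number.

    The text speaks of numbers "under" and "greater than" the boundary, so a
    row sitting exactly on the boundary lands in neither list in this sketch.
    """
    under = [r for r in rows if r["customer_number"] < threshold]
    over = [r for r in rows if r["customer_number"] > threshold]
    return under, over


# Hypothetical rows; "customer_number" is an assumed column name.
customers = [
    {"customer_number": 101},
    {"customer_number": 24999},
    {"customer_number": 25001},
    {"customer_number": 31000},
]

under, over = split_customers(customers)
print(len(under), "rows for building the models,", len(over), "rows held back")
```

The first list plays the role of the table we build our models on; the second is the data we hold back for later.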
Having a quick look at the customer table (containing customer numbers less than 25000), we find the following data.
The screenshot above shows the residential addresses of people who have applied for financial loans from SQLShack Finance.
Moreover, the data shows criteria such as the number of cars that the applicant owns, his or her marital status and whether or not he or she owns a house. NOTE that I have not mentioned the person’s income or net worth. This will come into play going forward.
Now that we have had a quick look at our raw data, we open SQL Server Data Tools (henceforward referred to as SSDT) to begin our adventure into the “wonderful world of data mining”.
Opening SSDT, we select “New” from the “File” tab on the activity ribbon and select “Project” (see above).
We select the “Analysis Services Multidimensional and Data Mining” option. We give our new project a name and click “OK” to continue.
Having clicked “OK”, we find ourselves on our working surface.
Our first task is to establish a connection to our relational data. We do this by creating a new “Data Source” (see below).
We right-click on the “Data Sources” folder (see above and to the right) and select the “New Data Source” option.
The “New Data Source” wizard is brought up. We click “Next”.
We now find ourselves looking at connections that we have used in the past, and SSDT wishes to know which (if any) of these connections we wish to utilize. We choose our “SQLShackFinancial” connection.
We select “Next”.
We are asked for our credentials (see above) and click “Next”.
We are now asked to give a name to our connection (see above).
We click “Finish”.
Our next task is to create a Data Source View. This is different to what we have done in past exercises.
The data source view permits us to create relationships (from our relational data) which we wish to carry forward into the ‘analytic world’. One may think of a “Data Source View” as a staging area for our relational data prior to its importation into our cubes and mining models.
We right-click on the “Data Source Views” folder and select “New Data Source View”.
The “Data Source View” wizard is brought up (see below).
We click “Next” (see above).
We select the “Data Source” that we defined above (see above).
The “Name Matching” dialogue box is brought into view. As we shall be working with one table for this exercise, this screen has little impact. HOWEVER, if we were relating two or more tables, this is where we would tell the system to create the necessary logical relationships between them, to ensure that our tables are correctly joined.
In our case we merely select “Next” (see above).
We are now asked to select the table or tables that we wish to utilize.
For our current exercise, I select the “Customer” table (see above) and move the table to the “Included Objects” (see below).
We then click “Next”.
We are now asked to give our “Data Source View” a name (see above) and we then click “Finish” to complete this task.
We find ourselves back on our work surface. Note that the Customer entity is now showing in the center of the screenshot above, as is the name of the “Data Source View” (see upper right).
We now right-click on the “Mining Structures” folder and select “New Mining Structure” (see above).
The “Data Mining Wizard” now appears (see below).
We click “Next”.
For the “Select the Definition Method” screen we shall accept the default “From existing relational database or data warehouse” option (see below).
We then click “Next”.
The “Create the Data Mining Structure” screen is brought into view. The wizard asks us which mining technique we wish to use. In total for this exercise, we shall be creating four mining models. “Microsoft Decision Trees” is one of the four. That said, we shall leave the default setting “Microsoft Decision Trees” as is.
We ignore the warning shown in the message box as we shall create the necessary connectivity on the next few screens.
The reader will note that the system wishes to know which “Data Source View” we wish to utilize. We select the one that we created above. We then click “Next”.
The mining wizard now asks us to let it know where the source data resides. We select the “Customer” table (see above) and we click “Next”.
At this point, we need to understand that once the model is created we shall “process” the model. Processing the model achieves two important things. First, it “trains” the model on the type of data we are utilizing by running that data through the mining algorithm that we have selected. Second, having obtained the necessary results, it compares the actual results with the predicted results. The closer the actuals are to the predicted results, the more accurate the model that we selected. The reader should note that whilst Microsoft provides us with roughly twelve mining models, NOT ALL will provide a satisfactory solution, and therefore a different model may need to be used. We shall see just this within a few minutes.
We now must specify the “training data” or, in simple terms, “introduce the Microsoft mining models to the raw customer data and see what the mining model detects”. In our case, it is the data from the “Customer” table. We must provide the system with a primary key field. Further, we must tell the system which data fields will be the inputs to the mining model, so that it can see what correlation (if any) there is between these input fields (Does the client own a house? How many cars does the person own? Is he or she married?) and what we wish to ascertain from the “Predicted” field (Is the person a good credit risk?).
For the primary key we select the field “PK_Customer_Name” (see above).
We select “Houseowner” and “Marital_Status” (see above) and “number of cars owned” (see above).
As the reader will see from the two screenshots above, we selected “Houseowner”, “Marital_Status” and the number of cars owned as our input fields.
NOTE that I have not included income, and this was deliberate for our example.
Bernie Madoff’s income was large; however, we KNOW that he would not be a good risk.
Lastly, included within the raw data is a field called “Credit Class”, which holds the KNOWN credit ratings for the clients concerned.
Last but not least, we must select the field that we wish the mining model to predict. This field is the “Credit Class” as may be seen below:
We now click “Next”.
Having clicked “Next”, we arrive at the “Specify Columns’ Content and Data Type” screen.
Credit class (the predicted field) is one of 0, 1, 2, 3 or 4. These are discrete values (see above).
The number of cars owned is also a discrete value. No person owns 1.2 cars.
House Owner is a Boolean (Y or N).
Marital (Married) status is also a Boolean value (Y or N).
We click “Next”.
SQL Server now wishes to know what percentage of the records within the customer table (RANDOMLY SELECTED BY THE MINING ALGORITHM) should be utilized to test just how closely the predicted values of “Credit Class” tie with the actual values of “Credit Class”. One normally accepts 30% as a good sample (of the population). As a reminder to the reader, the accounts within the data ALL have account numbers under 25000. We shall see why I have mentioned this again (in a few minutes).
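To make the idea of a randomly selected 30% test set concrete, here is a small Python sketch of a holdout split. It is not how Analysis Services performs the split internally; the function name, the fixed seed and the case identifiers are all assumptions made for the illustration.

```python
import random


def holdout_split(case_ids, test_fraction=0.30, seed=42):
    """Randomly set aside a fraction of the cases for testing.

    Returns (training_ids, testing_ids). The seed is fixed only so that the
    sketch gives the same split every time it is run.
    """
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical case identifiers standing in for the customer rows.
train_ids, test_ids = holdout_split(range(1, 1001))
print(len(train_ids), "training cases,", len(test_ids), "test cases")
```

The model only ever “sees” the training cases; the test cases are kept aside so that predicted and actual credit classes can be compared afterwards.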
We then click “Next”.
The system wants us to give our mining structure a name. In this case, we choose “SQLShackMainMiningModel”. This is the “mommy”. “SQLShackMainMiningModel” has four children, one being the Decision Tree model that we just created and three more which we shall create in a few moments. For the mining model name, we choose “DecisionTreeSQLShackModel”.
We now click “Finish”.
We are returned to our main working surface as may be seen above.
From the “Mining Structures” folder we double-click the “SQLShackMainMiningModel” that we just created.
The “Mining Structure” opens. In the upper left-hand side, we can see the fields for which we opted. They are shown under the Mining Structure directory (see above).
Clicking on the “Mining Models” tab, we can see the first model that we just created.
What we now wish to do is to create the remaining three models that we discussed above.
The first of the three will be a Naïve Bayes model. This is commonly used in predictive analysis. The principles behind the Naïve Bayes model are beyond the scope of this article, and the reader is referred to any good predictive analysis book.
We select the “Create a related mining model” option (see above with the pick and shovel).
The “New Mining Model” dialogue box is brought up to be completed (see above).
We give our model a name and select the algorithm type (see above).
In a similar manner, we shall create a “Clustering Model” and a “Neural Network”. The final results may be seen below:
We have now completed all the heavy work and are in a position to process our models.
We click on the “Project” tab on the main ribbon and select “SQLShackDataMining” properties (see above).
The “SQLShackDataMining Property Pages” are brought into view. Clicking on the “Deployment” tab, we select the server to which we wish to deploy our OLAP database and, in addition, give the database a name.
We then click “OK”.
We right-click on the “SQLShackMainMiningModel” and select “Process”.
We are told that our data is out of date and asked whether we want to reprocess the models (see below).
We answer “Yes”.
We are then asked for our credentials (see above). Once completed, we select “OK”.
Once the build is complete, we are taken to the “Process Mining Structure” screen. We select the run option found at the bottom of the screen (see below in the blue oval).
Processing occurs and the results are shown above.
Upon completion of processing, we click the “Close” button to leave the processing routine (see above). We now find ourselves back on our work surface.
Now that our models have been processed and tested (this occurred during the processing that we just performed), it is time to have a look at the results.
We click on the third tab, “Mining Model Viewer”.
Selecting our “Decision Tree” model as a starting point, we select zero as our background value. The astute reader will remember that zero is the best risk from our lending department. The boxes with THE DARKEST COLOUR mark the direction that we should be following (according to the predicted results of the processing).
That said, we should be looking at folks who own no cars, are not married and do not own a house. You say weird!! Not entirely. It can indicate that the person has no debt. We all know what happens after getting married and having children to raise.
Clicking the “Dependency Network” tab, we see that the mining model has found that the credit class is dependent on Houseowner, Marital Status and Num Cars Owned.
By sliding the “more selective” slider found under the text “All Links” (see above), we are telling the model to go down to the grain of the wood “to see which one of the three is the most decisive” in determining the relationship between it and the credit class (see below).
We note that “Num Cars Owned” seems to play a major role. In other words, the mining model believes that there is a strong relationship between the credit class and the number of cars that the person either owns OR is currently financing. Now the “doubting Thomas” will ask why. Mainly because cars cost money. Most people finance the purchase of cars. Credit plays a big role in financing.
A full discussion of all four algorithms, how they work and what to look for to justify selecting any one of the four over the others is certainly in order. However, in the interests of brevity and of driving home the importance of data mining itself, we shall put this discussion off until a future get-together.
We shall, however, continue to see how the system has ranked these algorithms and which of the four the process recommends.
Having now created our four mining models, we wish to ascertain which of the four has the best fit for our data and has the highest probability of rendering the best possible information.
We click on the “Mining Accuracy Chart” tab.
Note that the accuracy chart itself has four tabs. The first of the tabs is the “Input Selection”. We also note that our four mining models are present on the screen (see above).
As SQLShack Financial makes most of its earnings from lending money, and as we all realize that they wish to lend funds only to clients that they believe are a good risk (i.e. a rating of 0), they set the “Predict Value” to zero for all four algorithms (see below). When complete, our screen should look as follows (see below):
The astute reader will note that in the lower portion of the screen we are asked which dataset we wish to utilize. We accept the dataset “the mining model test cases” used by the system in the creation of our model. Later in this discussion, we shall utilize data that we have held back from the model to verify that the mining models hold for that data as well. That will be the proof of the pudding!
The lift chart (in my humble opinion) tells all. Its purpose is to show us which of the four models the system believes is the best fit for the data that we have.
I like to call the Lift Chart “the race to the top”. It informs us how much of the population should be sampled (that is, have their real credit rating checked) to make the decision on credit risk that is most beneficial to SQLShack Financial. In simple terms, it is saying to us “This model requires checking x% of the population before you can be fairly certain that your model is accurate”. Keeping things extremely simple, the lower the required sampling amount, the more certain one can be that the model is accurate and is, in fact, one of the models that we should be utilizing.
That said, the line graph in pink (see above) is a line generated by the mining structure processing. It shows the best possible outcome. It is essentially telling us “with the best possible LUCK, we only need to check the true credit ratings of 22% of our applicants”. The light blue undulating lines represent the “Decision Tree” model and the “Neural Network” model, and they peak (reach one hundred percent on the Y-axis) at a population sampling of just over 50% on the X-axis (see the graph above). That said, they are the most promising algorithms to use. The “Naïve Bayes” and “Clustering” algorithms peak closer to 100% on the X-axis and are therefore not as reliable as the “Decision Tree” and the “Neural Network” algorithms. The straight blue line from (0,0) to (100,100) is the “sheer dumb luck” line. Enough said. More on accuracy in a few minutes. Please stay tuned.
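For those who wonder where the points on such a chart come from, the following Python sketch builds the curve for one model by ranking cases on the predicted probability of credit class 0 and counting how many true class 0 cases have been found so far. This is a simplified illustration, not the code that SSDT runs; the function name and the scored cases are invented for the example.

```python
def cumulative_gains(scored_cases, target=0):
    """Build lift-chart style points for one model.

    scored_cases: list of (probability_of_target, actual_class) tuples.
    Returns (percent_of_population, percent_of_target_found) points, taking
    the cases in order of decreasing predicted probability.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    total_targets = sum(1 for _, actual in ranked if actual == target)
    if total_targets == 0:
        raise ValueError("no cases of the target class in the data")
    points = []
    found = 0
    for i, (_, actual) in enumerate(ranked, start=1):
        if actual == target:
            found += 1
        points.append((100.0 * i / len(ranked), 100.0 * found / total_targets))
    return points


# Hypothetical test cases: (predicted probability of class 0, actual class).
cases = [(0.9, 0), (0.8, 0), (0.7, 1), (0.6, 0),
         (0.4, 2), (0.3, 1), (0.2, 0), (0.1, 3)]

for pct_population, pct_found in cumulative_gains(cases):
    print(f"{pct_population:5.1f}% of population -> {pct_found:5.1f}% of class 0 found")
```

A model whose curve reaches 100% on the Y-axis early is one that pushes the good risks to the front of the queue; the “dumb luck” line is what we would get by taking the cases in random order.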
As proof of our assertions immediately above, we now have a quick look at the next tab, the “Classification Matrix”.
In this fantastic and informative tool, the model shows us how many instances the system found “where the ‘predicted’ was the SAME as the ‘actual’”. Note the first matrix (for the “Decision Tree”) and the strong diagonal between the “Actuals” on the X-axis and the “Predicted” on the Y-axis. The same is reasonably true for the “Neural Network” model (see the bottom of the screenshot below).
The reader will note that the predicted vs. actuals for the remaining two models are randomly dispersed. The more the entropy, the more doubtful the accuracy of the model (with regards to our data).
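The “strong diagonal” can be put into numbers. The Python sketch below counts each (predicted, actual) combination and reports the share of cases that sit on the diagonal. The eight cases are made up for the illustration and have nothing to do with the real results.

```python
from collections import Counter


def classification_matrix(pairs):
    """Count how often each (predicted, actual) combination occurs."""
    return Counter(pairs)


def diagonal_share(matrix):
    """Fraction of cases on the diagonal, i.e. where predicted == actual."""
    total = sum(matrix.values())
    correct = sum(n for (predicted, actual), n in matrix.items()
                  if predicted == actual)
    return correct / total


# Hypothetical (predicted, actual) credit classes for eight test cases.
pairs = [(0, 0), (0, 0), (1, 1), (2, 1), (0, 1), (3, 3), (4, 4), (2, 2)]
matrix = classification_matrix(pairs)

classes = range(5)  # credit classes 0 to 4
print("pred\\actual", *classes)
for predicted in classes:
    print(f"{predicted:>11}", *(matrix[(predicted, actual)] for actual in classes))

print(f"share on the diagonal: {diagonal_share(matrix):.2f}")
```

The closer that share is to 1, the stronger the diagonal; a matrix with counts scattered all over the grid gives a low share.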
At this point, we supposedly have two relatively reliable models with which to work. What we now must do is to verify the predicted versus the actuals. As a reminder to the reader, the only way to ensure that our algorithms are yielding correct predictions is by comparing what “they say” with the “truth” from other credit agencies. The more that match, the more accurate the algorithm will be in predicting as more and more data is added to our systems.
We now click on the “Mining Model Prediction” tab.
We note that the “Decision Tree” model has been selected. Obviously, we could have selected one of the other models, but in the interest of brevity we choose to look at the “Decision Tree”. We must now select the physical input table (with all its records) that we wish the model to act upon.
We click “Select Case Table” as shown above and select the “Customer” table (see below).
Note that the fields of the mining model are joined to the actual fields of the “Customer” table. What is now required is to remove the link from the credit class of the model to the actual credit class in the relational table, the field in the table being the known credit class. Simplistically, we want the model to predict the credit class, and we shall then see how many matches we obtain.
We right-click on the credit class link and delete it (see above).
Our screen now looks as follows (see above).
We are now in a position to create our first data mining query (DMX, or Data Mining Extensions) to “prove out” our model.
In the screen above we select the first field to be the “Credit Risk” from the data mining model.
We set its “Criteria / Argument” to 0 (see above).
We now add more fields from the “Customer” table to identify these folks!
We have now added a few more fields (from the source table) as may be seen above. Let us see what we have.
Under the words “Mining Structure” (upper left in the screenshot above) we click on the drop-down box. We select the “Query” option (see below).
The resulting DMX code from our design screen is brought into view.
This DMX code should not be confused with MDX code. Personally, I LOVE having access to the code: now that we have it, we can utilize it for reporting. More about this in the next part of this article.
Once again we select the drop-down box below the words “Mining Structure” and select “Result” (see below).
We now obtain our result set.
We note that the model has rendered 1330 rows that it believes would be a good credit risk.
Let us now throw a spanner into the works and add one more field to the mix. This field is the TRUE CREDIT RISK, and we shall tell the system that we wish to see only those records whose “true credit risk” was a 0. In short, actual equals predicted.
Our design now looks as follows.
Running our query again, we now find that:
994 rows were returned which the algorithm predicted would be a 0 and which were, in fact, a credit class of 0. That is 994 of the 1330 predicted rows, which represents 74.7% accuracy and is surprisingly good.
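What the two runs of the query are doing can be mimicked in a few lines of Python. This is only a sketch on made-up rows: the column names `predicted_class` and `actual_class` are assumptions, and the real work is of course done by the DMX query generated in the designer.

```python
# Hypothetical scored rows: the model's predicted credit class next to the
# known credit class held in the relational table.
scored_rows = [
    {"customer": "Customer A", "predicted_class": 0, "actual_class": 0},
    {"customer": "Customer B", "predicted_class": 0, "actual_class": 2},
    {"customer": "Customer C", "predicted_class": 0, "actual_class": 0},
    {"customer": "Customer D", "predicted_class": 1, "actual_class": 1},
    {"customer": "Customer E", "predicted_class": 0, "actual_class": 0},
]

# First run: only the prediction is filtered (predicted credit class = 0).
predicted_good = [r for r in scored_rows if r["predicted_class"] == 0]

# Second run: the extra criterion keeps the rows whose true class is also 0.
confirmed_good = [r for r in predicted_good if r["actual_class"] == 0]

share = len(confirmed_good) / len(predicted_good)
print(len(predicted_good), "predicted good,", len(confirmed_good),
      "confirmed good, share", share)
```

The ratio of the second count to the first is the figure we are quoting as the accuracy of the model for credit class 0.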
Thus far we have performed our exercises with account numbers less than 25000.
Let us now add a few more accounts. These accounts will have account numbers over 25000. We now load these accounts and reprocess the results. We now set up our query as follows (see below).
The result set may be seen above. We retrieved 974 rows.
Changing this query slightly and once again telling the query to only show the rows where the predicted and actual credit classes are 0 we find…
924 rows are returned, which indicates 95% accuracy (924 of 974). The only takeaway from this 95% is that we are still on track with our model: nothing more, nothing less.
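As a quick sanity check on the two percentages, the arithmetic can be redone in Python with the row counts quoted in the text. The helper name is my own; the figures are the ones reported above.

```python
def percent_confirmed(confirmed_rows, predicted_rows):
    """Percentage of rows predicted as class 0 that really were class 0."""
    if predicted_rows <= 0:
        raise ValueError("predicted_rows must be positive")
    return round(100.0 * confirmed_rows / predicted_rows, 1)


# Row counts quoted in the text.
first_batch = percent_confirmed(994, 1330)   # accounts under 25000
second_batch = percent_confirmed(924, 974)   # accounts over 25000

print(first_batch, second_batch)  # 74.7 94.9
```

The second figure comes out at 94.9%, which is the “95%” quoted above after rounding.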
It cannot be stressed enough that data mining is an iterative process. In real-life business scenarios, one would take more degrees of freedom into consideration.
Attributes such as
Etc.
Further, time must be spent in refining the actual combinations of these parameters with the myriad of mining models in order to ensure we are utilizing the most effective model(s).
In short, it is a trial-and-error exercise. The more data that we have and the more degrees of freedom that we utilize, the closer we come to the ‘truth’ and ‘reality’.
In the second part of this article (to be published soon), we shall see how we may utilize the information emanating from the models in our day-to-day reporting activities.
Lastly, common sense and understanding are prerequisites to any successful data mining project. There are no correct answers, merely close approximations. This is the reality! Happy programming!
Translated from: https://www.sqlshack.com/getting-started-sql-server-data-mining/