Data Mining Study Notes: Classification Model Selection (1) ROC

The ROC curve (Receiver Operating Characteristic curve) is a graphical method for showing the trade-off between a classification model's true positive rate and its false positive rate.

Definitions needed to read an ROC plot:

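True positive (TP): a positive sample that the model predicts as positive
False negative (FN): a positive sample that the model predicts as negative
False positive (FP): a negative sample that the model predicts as positive
True negative (TN): a negative sample that the model predicts as negative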

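True positive rate (TPR), also called sensitivity
   TPR = TP / (TP + FN)
   positive samples predicted as positive / actual positive samples
False negative rate (FNR)
   FNR = FN / (TP + FN)
   positive samples predicted as negative / actual positive samples
False positive rate (FPR)
   FPR = FP / (FP + TN)
   negative samples predicted as positive / actual negative samples
True negative rate (TNR), also called specificity
   TNR = TN / (TN + FP)
   negative samples predicted as negative / actual negative samples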
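
The four rates above follow directly from the confusion-matrix counts. Here is a minimal Python sketch; the function name `rates` and the sample counts are made up for illustration and do not come from the original notes.

```python
def rates(tp, fn, fp, tn):
    """Compute TPR, FNR, FPR and TNR from confusion-matrix counts."""
    positives = tp + fn  # actual positive samples
    negatives = fp + tn  # actual negative samples
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "TPR": tp / positives,  # sensitivity
        "FNR": fn / positives,
        "FPR": fp / negatives,
        "TNR": tn / negatives,  # specificity
    }


# Hypothetical counts: 50 actual positives, 50 actual negatives.
example = rates(tp=40, fn=10, fp=5, tn=45)
print(example)
```

Note that TPR + FNR = 1 and FPR + TNR = 1, so an ROC plot only needs TPR and FPR.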

The value of the target attribute that is chosen as the outcome of interest is called "positive".

Interpretation of a few key points on the ROC curve:

(TPR=0, FPR=0): a model that predicts every instance as negative
(TPR=1, FPR=1): a model that predicts every instance as positive
(TPR=1, FPR=0): the ideal model
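
These three points can be checked with a small sketch. It assumes labels are encoded as 1 (positive) and 0 (negative); the helper `roc_point` and the toy labels are hypothetical, introduced only for this illustration.

```python
def roc_point(y_true, y_pred):
    """Return (TPR, FPR) for binary labels, where 1 is the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


y_true = [1, 1, 1, 0, 0, 0, 0]         # toy labels: 3 positives, 4 negatives
all_negative = [0] * len(y_true)       # predicts every instance as negative
all_positive = [1] * len(y_true)       # predicts every instance as positive
ideal = list(y_true)                   # predicts every instance correctly

print(roc_point(y_true, all_negative))  # (0.0, 0.0)
print(roc_point(y_true, all_positive))  # (1.0, 1.0)
print(roc_point(y_true, ideal))         # (1.0, 0.0)
```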

(Figure to be added later.)

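A good classification model should lie as close as possible to the upper-left corner of the plot, while a model that guesses at random lies on the main diagonal connecting (TPR=0, FPR=0) and (TPR=1, FPR=1).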

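The area under the ROC curve (AUC) offers another way to evaluate a model's average performance. A perfect model has AUC = 1, and a model that simply guesses at random has AUC = 0.5. If one model is better than another, its area under the curve is larger.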
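
To make the AUC concrete, here is a rough sketch that builds the ROC points by sweeping a threshold over the model's scores and then integrates with the trapezoidal rule. It assumes the model outputs a score where higher means "more likely positive" and that labels are 1/0; the names `roc_points` and `auc` and the toy data are hypothetical. In practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to lenient."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]  # threshold above every score: everything predicted negative
    for t in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is >= the threshold t.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points  # the last point is always (1.0, 1.0)


def auc(points):
    """Area under a piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: one negative (score 0.7) is ranked above one positive (score 0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))
```

On this toy data the area comes out to 8/9: of the 9 positive/negative pairs, 8 are ranked in the right order. A model whose scores separate the classes perfectly gives 1, and a model that assigns the same score to every sample gives 0.5.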

An explanation of ROC from the Oracle forums
This explanation comes from one of our algorithm engineers:

"The ROC analysis applies to binary classification problems. One of the classes is selected as a "positive" one. The ROC chart plots the true positive rate as a function of the false positive rate. It is parametrized by the probability threshold values. The true positive rate represents the fraction of positive cases that were correctly classified by the model. The false positive rate represents the fraction of negative cases that were incorrectly classified as positive. Each point on the ROC plot represents a true_positive_rate/false_positive_rate pair corresponding to a particular probability threshold. Each point has a corresponding confusion matrix. The user can analyze the confusion matrices produced at different threshold levels and select a probability threshold to be used for scoring. The probability threshold choice is usually based on application requirements (i.e., acceptable level of false positives).

The ROC does not represent a model. Instead it quantifies its discriminatory ability and assists the user in selecting an appropriate operating point for scoring."
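
The idea of choosing an operating point from an acceptable false positive level can be sketched as follows. This only illustrates the general idea and is not how the Oracle tool works; the function `pick_threshold`, its parameters, and the toy data are assumptions made for this example.

```python
def pick_threshold(y_true, scores, max_fpr):
    """Among thresholds whose FPR is <= max_fpr, return the one with the highest TPR.

    Returns a (threshold, fpr, tpr) tuple, or None if no threshold qualifies.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    best = None
    for t in sorted(set(scores), reverse=True):
        # Confusion-matrix counts at this probability threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fpr, tpr = fp / neg, tp / pos
        if fpr <= max_fpr and (best is None or tpr > best[2]):
            best = (t, fpr, tpr)
    return best


# Toy scores treated as predicted probabilities of the positive class.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(pick_threshold(y_true, scores, max_fpr=0.0))   # no false positives tolerated
print(pick_threshold(y_true, scores, max_fpr=0.34))  # about one in three negatives may be flagged
```

With no false positives allowed the sketch settles on threshold 0.8 and catches two of the three positives. Relaxing the limit to 0.34 moves the threshold down to 0.6 and catches all three.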

I would add to this that you can select a threshold point in the build activity to bias the apply process. Currently we generate a cost matrix based on the selected threshold point rather than using the threshold point directly.

http://forums.oracle.com/forums/thread.jspa?threadID=415870&tstart=15
