ROC curve and cut-off point in Python

I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:

from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

I know that metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me which command will find the optimal cut-off point (threshold value)?

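Solution:

This answer is late, but it may still be helpful. You can do this with the Epi package in R, but I could not find a similar package or example in Python.

The optimal cut-off point is where the true positive rate is high and the false positive rate is low. Based on this logic, I have put together an example below that finds the optimal threshold.

Python code: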

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as pl
import numpy as np
from sklearn.metrics import roc_curve, auc

# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]

# dummify rank (dtype=float keeps the design matrix numeric on recent pandas)
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige', dtype=float)

# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# .loc replaces the removed .ix indexer; prestige_1 is dropped as the baseline
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])

# manually add the intercept
data['intercept'] = 1.0
train_cols = data.columns[1:]

# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())

# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])

fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

####################################
# The optimal cut off would be where tpr is high and fpr is low
# tpr - (1-fpr) is zero or near to zero is the optimal cut off point
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})

# row whose tpr - (1-fpr) is closest to zero (.iloc replaces the removed .ix)
print(roc.iloc[(roc.tf-0).abs().argsort()[:1]])

# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])
pl.show()

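The optimal cut-off point is 0.317628, so anything above this value can be labeled as 1, and anything else as 0. You can see from the output/chart that where tpr crosses 1-fpr, the tpr is about 63% and the fpr about 36%, and tpr-(1-fpr) is closest to zero in the current example.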
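The same criterion can also be written without a DataFrame. The following is only a minimal sketch: it uses synthetic labels and scores (an assumption, standing in for the admissions data, so the numbers will differ from the ones quoted here), and the names y_true, scores, best and cutoff are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-in data (an assumption, not the admissions data set)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Positives get scores shifted upward, then squashed into (0, 1)
scores = 1.0 / (1.0 + np.exp(-(rng.normal(size=500) + 1.5 * y_true - 0.75)))

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Index where tpr and 1 - fpr are closest, i.e. |tpr - (1 - fpr)| is smallest
best = np.argmin(np.abs(tpr - (1 - fpr)))
cutoff = thresholds[best]
print("cutoff=%.4f tpr=%.4f fpr=%.4f" % (cutoff, tpr[best], fpr[best]))
```

np.argmin returns the first index with the smallest absolute difference, which is the same selection rule as sorting by the absolute value of tf and taking the first row.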

Output:

        1-fpr       fpr        tf  thresholds       tpr
171  0.637363  0.362637  0.000433    0.317628  0.637795

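Hope this is helpful.

Edit

To simplify things and make the code reusable, I have written a function that finds the optimal probability cutoff point.

Python code: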

def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate

    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations
    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------
    list type, with optimal cutoff value
    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    # row whose tpr - (1-fpr) is closest to zero (.iloc replaces the removed .ix)
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]
    return list(roc_t['threshold'])

# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])

# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]

# Find prediction to the dataframe applying threshold
# (the function returns a one-element list, so compare against threshold[0])
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)

# Print confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(data['admit'], data['pred']))
# array([[175, 98],
#        [ 46, 81]])
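As a quick check, the rates at the chosen cutoff can be recomputed from the confusion matrix quoted above. This is a small sketch, not part of the original answer; it relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix quoted above: rows are actual 0/1, columns are predicted 0/1
cm = np.array([[175, 98],
               [46, 81]])
tn, fp, fn, tp = cm.ravel()

tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate (1 - specificity)
print("tpr=%.6f fpr=%.6f" % (tpr, fpr))
# prints: tpr=0.637795 fpr=0.358974
```

The tpr matches the 0.637795 in the ROC output. The fpr is 98/273 rather than the 0.362637 (99/273) in that output, most likely because the labeling step uses a strict `>` comparison while roc_curve counts scores greater than or equal to the threshold as positive, so the one observation sitting exactly at the cutoff is labeled 0 here.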

Tags: logistic-regression, python, roc

Source: https://codeday.me/bug/20190926/1820972.html
