scikit-learn模型建立与评估

文章目录

  • 1.mean_squared_error()
  • 2.预测结果准确率的比重
  • 3.ROC曲线,roc_auc_score
  • 4.交叉验证
  • 5.三分类

1.mean_squared_error()

  • 模型:LinearRegression 线性回归
  • 均方误差(mean-square error, MSE)是反映估计量与被估计量之间差异程度的一种度量。
###汽车油耗效率
import pandas as pd
import matplotlib.pyplot as plt
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
cars = pd.read_table("auto-mpg.data", delim_whitespace=True,names=columns)
print(cars.shape)
print(cars.head(5))

fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
cars.plot("weight", "mpg", kind='scatter', ax=ax1)
cars.plot("acceleration", "mpg", kind='scatter', ax=ax2)
plt.show()

scikit-learn模型建立与评估_第1张图片
scikit-learn模型建立与评估_第2张图片

import sklearn
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(cars[["weight"]], cars["mpg"])

import sklearn
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True)
lr.fit(cars[["weight"]], cars["mpg"])
predictions = lr.predict(cars[["weight"]])
print(predictions[0:5])
print(cars["mpg"][0:5])

plt.scatter(cars["weight"], cars["mpg"], c='red')
plt.scatter(cars["weight"], predictions, c='blue')
plt.show()

lr = LinearRegression()
lr.fit(cars[["weight"]], cars["mpg"])
predictions = lr.predict(cars[["weight"]])
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(cars["mpg"], predictions)
print(mse)
rmse = mse ** (0.5)
print (rmse)

scikit-learn模型建立与评估_第3张图片
在这里插入图片描述

2.预测结果准确率的比重

  • 逻辑回归算法(LogisticRegression)虽然是线性回归算法,但是其它线性回归有所不同,
    逻辑回归的预测结果只有两种,即true(1)和false(0)。因此,它的名字是回归,是一个用于分类的线性模型而不是用于回归。逻辑回归算法往往适用于数据的分类(二分类)
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
np.random.seed(8)
admissions = pd.read_csv("./data/admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)
print(admissions.head())
shuffled_index = np.random.permutation(admissions.index)#打乱索引
print (shuffled_index)
shuffled_admissions = admissions.loc[shuffled_index]
print(shuffled_admissions.head())

##划分数据集
train = shuffled_admissions.iloc[0:515]
test = shuffled_admissions.iloc[515:len(shuffled_admissions)]
#print(train.head())

model = LogisticRegression()
model.fit(train[["gpa"]], train["actual_label"])
  • 预测的准确率
labels = model.predict(test[["gpa"]])
test["predicted_label"] = labels

matches = test["predicted_label"] == test["actual_label"]
correct_predictions = test[matches]
accuracy = len(correct_predictions) / float(len(test))
print('模型预测准确率',accuracy)
  • 预测结果正确的1和0的个数
true_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 1)
true_positives = len(admissions[true_positive_filter])
print(true_positives)

true_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 0)
true_negatives = len(admissions[true_negative_filter])
print(true_negatives)

在这里插入图片描述

  • 预测的值为1的准确率
true_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 1)
true_positives = len(admissions[true_positive_filter])
false_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 1)
false_negatives = len(admissions[false_negative_filter])

sensitivity = true_positives / float((true_positives + false_negatives))
print(sensitivity)
  • 预测的值为0的准确率
true_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 0)
true_negatives = len(admissions[true_negative_filter])
false_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 0)
false_positives = len(admissions[false_positive_filter])
specificity = (true_negatives) / float((false_positives + true_negatives))
print(specificity)

3.ROC曲线,roc_auc_score

  • ROC(Receiver Operating Characteristic)曲线和AUC常被用来评价一个二值分类器(binary classifier)的优劣。
  • AUC(Area Under ROC Curve),即ROC曲线下面积。
    若学习器A的ROC曲线被学习器B的ROC曲线包围,则学习器B的性能优于学习器A的性能;若学习器A的ROC曲线和学习器B的ROC曲线交叉,则比较二者ROC曲线下的面积大小,即比较AUC的大小,AUC值越大,性能越好。
import matplotlib.pyplot as plt
from sklearn import metrics

probabilities = model.predict_proba(test[["gpa"]])
fpr, tpr, thresholds = metrics.roc_curve(test["actual_label"], probabilities[:,1])
print(thresholds)
plt.plot(fpr, tpr)
plt.show()

scikit-learn模型建立与评估_第4张图片

from sklearn.metrics import roc_auc_score
probabilities = model.predict_proba(test[["gpa"]])

# Means we can just use roc_auc_curve() instead of metrics.roc_auc_curve()
auc_score = roc_auc_score(test["actual_label"], probabilities[:,1])
print(auc_score)

在这里插入图片描述

4.交叉验证

交叉验证优点:

  • 交叉验证用于评估模型的预测性能,尤其是训练好的模型在新数据上的表现,可以在一定程度上减小过拟合。
  • 还可以从有限的数据中获取尽可能多的有效信息。
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

admissions = pd.read_csv("./data/admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)

kf = KFold(5, shuffle=True, random_state=8)
lr = LogisticRegression()
#roc_auc
accuracies = cross_val_score(lr,admissions[["gpa"]], admissions["actual_label"], scoring="roc_auc", cv=kf)
average_accuracy = sum(accuracies) / len(accuracies)

print(accuracies)
print(average_accuracy)

在这里插入图片描述

5.三分类

import pandas as pd
import matplotlib.pyplot as plt
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year", "origin", "car name"]
cars = pd.read_table("./data/auto-mpg.data", delim_whitespace=True, names=columns)
print(cars.head(5))
print(cars.tail(5))

#选取特征
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
#print dummy_cylinders
cars = pd.concat([cars, dummy_cylinders], axis=1)
print(cars.head())
dummy_years = pd.get_dummies(cars["year"], prefix="year")
#print dummy_years
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop("year", axis=1)
cars = cars.drop("cylinders", axis=1)
print(cars.head())

#训练集 测试集的建立
import numpy as np
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]

from sklearn.linear_model import LogisticRegression
#分类数
unique_origins = cars["origin"].unique()
unique_origins.sort()
print(unique_origins)

在这里插入图片描述

models = {}#保存每个类的模型

#选取one-hot编码特征列
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]

for origin in unique_origins:
    model = LogisticRegression()

    X_train = train[features]
    y_train = train["origin"] == origin
    print(y_train)

    model.fit(X_train, y_train)
    models[origin] = model
    #print(models)

testing_probs = pd.DataFrame(columns=unique_origins)
#print(testing_probs)

for origin in unique_origins:
    # Select testing features.
    X_test = test[features]
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]#predict_proba返回0,1概率
print(testing_probs)

打印每个预测结果正确的概率
scikit-learn模型建立与评估_第5张图片

你可能感兴趣的:(机器学习)