1) Basic metric: error rate
Definition: the proportion of misclassified samples among all samples
2) Basic metric: accuracy
Definition: the proportion of correctly classified samples among all samples
Interpretation: the closer the accuracy is to 1, the more accurate the model
3) Confusion matrix (binary classification)
4) Derived metric: precision
Definition: the proportion of true positives among all samples predicted as positive, i.e. TP/(TP+FP)
Example: in product recommendation, we care about how many of the items recommended to a user (predicted positive) the user actually likes (true positives)
5) Derived metric: recall
Definition: the proportion of true positives among all samples that are actually positive, i.e. TP/(TP+FN)
Example: in bank customer risk identification, we care about how many of all the genuinely risky customers the model manages to identify
6) Other metrics: ROC curve and AUC
ROC curve: plot the true positive rate on the vertical axis against the false positive rate on the horizontal axis, moving the classification threshold to trace out the curve
AUC: the area under the ROC curve (the area enclosed by the curve and the coordinate axes)
Interpretation: the closer the AUC is to 1, the better the model; an AUC of 0.5 means the model does no better than random guessing, and an AUC below 0.5 means it does worse than random guessing
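To make these definitions concrete, here is a minimal sketch (the labels and predictions below are made up purely for illustration) that computes the metrics above with sklearn:
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score
# Made-up ground truth and hard predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
# Confusion matrix for a binary problem: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8 = 0.75, so the error rate is 0.25
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(roc_auc_score(y_true, y_pred))    # 0.75; with hard 0/1 predictions this equals (TPR + TNR)/2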
Since this is a classification problem, the evaluation metrics are no longer the numeric regression metrics used before but classification metrics. Here house prices are labelled high or low according to the median of the average price; everything else is handled the same way as in the previous regression model, including removing the collinear features first.
1) Load and preprocess the data
import pandas as pd
import matplotlib.pyplot as plt
import os
os.chdir(r'C:\Users\86177\Desktop')
# Read the sample data
df = pd.read_excel('realestate_sample_preprocessed.xlsx')
# Based on the collinearity matrix, keep daytime population (the variable most correlated
# with price) and turn nighttime population and age 20-39 nighttime population into a ratio
def age_percent(row):
    if row['nightpop'] == 0:
        return 0
    else:
        return row['night20-39'] / row['nightpop']
df['per_a20_39'] = df.apply(age_percent, axis=1)
df = df.drop(columns=['nightpop', 'night20-39'])
# Create the label variable: True if the average price is at or above the median
price_median = df['average_price'].median()
print(price_median)
df['is_high'] = df['average_price'].map(lambda x: x >= price_median)
print(df['is_high'].value_counts())
# Basic overview of the dataset
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
–> Output:
30273.0
True 449
False 449
Name: is_high, dtype: int64
(898, 10)
id int64
complete_year int64
average_price float64
area float64
daypop float64
sub_kde float64
bus_kde float64
kind_kde float64
per_a20_39 float64
is_high bool
dtype: object
id 0
complete_year 0
average_price 0
area 0
daypop 0
sub_kde 0
bus_kde 0
kind_kde 0
per_a20_39 0
is_high 0
dtype: int64
2) Split the dataset into features and target
x = df[['complete_year', 'area', 'daypop', 'sub_kde',
        'bus_kde', 'kind_kde', 'per_a20_39']]
y = df['is_high']
print(x.shape)
print(y.shape)
–> Output: (the only difference from before is the y variable here)
(898, 7)
(898,)
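The steps below fit and evaluate the model on the full dataset, in line with the earlier regression example. If a held-out test set were wanted instead, a minimal sketch (assuming sklearn's train_test_split; the 80/20 split and random_state are arbitrary choices) would be:
from sklearn.model_selection import train_test_split
# Stratified 80/20 split so both classes stay balanced in train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
print(x_train.shape, x_test.shape)  # (718, 7) (180, 7)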
3) Build the classification model
Use a Pipeline to chain data processing, feature selection, and the model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
# Build the model workflow
pipe_clf = Pipeline([
    ('sc', StandardScaler()),
    ('power_trans', PowerTransformer()),
    ('polynom_trans', PolynomialFeatures(degree=3)),
    ('logistic_clf', LogisticRegression(penalty='l1', fit_intercept=True, solver='liblinear'))
])
print(pipe_clf)
–> Output: (logistic regression is chosen as the classifier, with an L1 penalty, which makes it essentially the logistic-regression counterpart of a lasso model)
Pipeline(memory=None,
steps=[('sc',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('power_trans',
PowerTransformer(copy=True, method='yeo-johnson',
standardize=True)),
('polynom_trans',
PolynomialFeatures(degree=3, include_bias=True,
interaction_only=False, order='C')),
('logistic_clf',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False)
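A point worth noting about this pipeline: PolynomialFeatures(degree=3) expands the 7 input features into every monomial of degree at most 3, bias column included, which gives C(7+3, 3) = 120 columns; the L1 penalty is what then shrinks the redundant terms back towards zero, hence the comparison with lasso. A quick sketch to confirm the expanded width:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# 7 input columns expand to C(10, 3) = 120 polynomial terms (including the bias column)
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(np.zeros((1, 7))).shape[1])  # 120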
4) Check the model's performance
import warnings
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
warnings.filterwarnings('ignore')
pipe_clf.fit(x,y)
y_predict = pipe_clf.predict(x)
print(f'Accuracy score is: {accuracy_score(y,y_predict)}')
print(f'Precision score is: {precision_score(y,y_predict)}')
print(f'Recall score is: {recall_score(y,y_predict)}')
print(f'AUC: {roc_auc_score(y,y_predict)}')
–> Output: (all of these metrics can be computed with the corresponding sklearn modules)
Accuracy score is: 0.8741648106904232
Precision score is: 0.8783783783783784
Recall score is: 0.8685968819599109
AUC: 0.8741648106904232
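Two follow-up checks, sketched here as additions on top of the fitted pipeline rather than part of the original output: roc_auc_score is normally fed the predicted probabilities from predict_proba instead of hard labels, so that it measures ranking quality across thresholds; and the KFold imported earlier can be combined with cross_val_score to estimate out-of-sample performance instead of scoring on the training data:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import roc_auc_score
# AUC computed on the predicted probabilities of the positive class (column 1)
y_proba = pipe_clf.predict_proba(x)[:, 1]
print(f'AUC (probabilities): {roc_auc_score(y, y_proba)}')
# 5-fold cross-validated accuracy as an out-of-sample estimate; the fold count is an arbitrary choice
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe_clf, x, y, cv=cv, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')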