https://tianchi.aliyun.com/competition/entrance/231784/information
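Data Mining for Beginners: Used-Car Transaction Price Prediction
This competition is essentially a regression problem, so it can be tackled with the usual data-mining libraries and frameworks.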
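catboost: lets you tell the model which columns are categorical (by index), and one-hot encodes those features whose number of distinct values does not exceed the `one_hot_max_size` threshold.
lgb: like CatBoost, LightGBM can handle categorical features when told which feature names are categorical. It does not one-hot encode the data, so it is much faster than one-hot encoding. LightGBM uses a dedicated algorithm to find the split points of categorical features.
xgb: unlike CatBoost and LightGBM, XGBoost by default does not handle categorical variables itself; like random forest, it only accepts numeric data. Categorical data therefore has to be encoded before it is passed to XGBoost, for example with label encoding, mean (target) encoding, or one-hot encoding.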
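As a rough illustration of the encoding step described for XGBoost, the sketch below applies label encoding and one-hot encoding with pandas alone. The column names and values are made up for the example and are not taken from the competition data.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"],
    "power": [90, 110, 75, 150],
})

# Label encoding: map each category to an integer code.
# Categories are sorted alphabetically: diesel -> 0, electric -> 1, gas -> 2.
df["fuel_label"] = df["fuel"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["fuel"], prefix="fuel")

print(df)
print(one_hot)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that at the cost of one extra column per category.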
Data processing: pandas, numpy
Data visualization: matplotlib, seaborn
Machine / deep learning: sklearn, keras
Exploratory analysis
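The prediction task is a regression problem. Common regression metrics are MSE, RMSE, MAE, and MAPE. This competition uses MAE (mean absolute error); the smaller the value, the better.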
$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}$$
where $y_i$ is the true value of the $i$-th sample and $\hat{y}_i$ is the predicted value of the $i$-th sample.
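To make the MAE definition concrete, here is a minimal NumPy sketch that computes it directly. The helper name `mae` and the sample numbers are only illustrative.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_i - y_hat_i| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Illustrative values: absolute errors are 0.5, 0.0, 2.0, 1.0.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Sum of absolute errors is 3.5, divided by n = 4.
print(mae(y_true, y_pred))
```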
For details, see: https://blog.csdn.net/qq_42257962/article/details/108265730
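Common evaluation metrics for classification algorithms
Binary classifiers: accuracy, [Precision, Recall, F-score, PR curve], ROC-AUC curve
Multi-class classifiers: accuracy, [macro average and micro average, F-score]
For details, see: https://blog.csdn.net/qq_29168809/article/details/102993081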
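The difference between macro and micro averaging for multi-class metrics is easiest to see on a small example. The sketch below computes both for precision in plain Python. The labels are invented, and the helper `per_class_counts` is a hypothetical name, not a library function.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives) when `label` is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    return tp, fp

# Invented three-class example with an imbalanced class 0.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
labels = sorted(set(y_true))

counts = [per_class_counts(y_true, y_pred, c) for c in labels]

# Macro average: precision per class, then the unweighted mean over classes.
# (Every class is predicted at least once here, so tp + fp is never zero.)
macro_precision = sum(tp / (tp + fp) for tp, fp in counts) / len(labels)

# Micro average: pool the counts over all classes, then compute precision once.
total_tp = sum(tp for tp, _ in counts)
total_fp = sum(fp for _, fp in counts)
micro_precision = total_tp / (total_tp + total_fp)

print("macro precision:", macro_precision)
print("micro precision:", micro_precision)
```

Macro averaging gives every class the same weight, so small classes matter as much as large ones. Micro averaging weights every sample equally and, for single-label problems like this one, equals the accuracy.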
1. Reading the data
import pandas as pd
import numpy as np
data_train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
data_test_a = pd.read_csv('used_car_testA_20200313.csv', sep=' ')
data_test_b = pd.read_csv('used_car_testB_20200421.csv', sep=' ')
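The `sep=' '` argument matters because the files are space-separated rather than comma-separated. The sketch below shows the effect on a tiny in-memory sample. The three column names and the values are assumptions for illustration, not the real file layout.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same space-separated layout.
sample = "SaleID name price\n0 736 1850\n1 2262 3600\n"

# With sep=' ', each space-delimited field becomes its own column.
df = pd.read_csv(io.StringIO(sample), sep=" ")

print(df.shape)
print(df.columns.tolist())
```

Without `sep=' '`, pandas would look for commas and read each line as a single column.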
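2. Inspecting the data
data_train.head(10) # first 10 rows
data_train.shape # number of rows and columns
data_train.info() # non-null counts, column dtypes, etc.
data_train.describe() # mean, standard deviation, min/max, etc. of the numeric columns
These commands give a rough picture of the data, in particular where values are missing.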
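Since missing values are the main thing to look for at this stage, a per-column count can help alongside `info()`. This is a minimal sketch on a toy DataFrame. The column names are placeholders, not necessarily the real fields.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries; column names are placeholders.
df = pd.DataFrame({
    "price": [1850.0, 3600.0, np.nan, 2400.0],
    "gearbox": [0.0, np.nan, np.nan, 1.0],
    "power": [60, 0, 163, 193],
})

# Number of missing values per column.
missing_count = df.isnull().sum()

# Fraction of missing values per column.
missing_ratio = df.isnull().mean()

# Show only the columns that actually have missing values.
print(missing_count[missing_count > 0])
print(missing_ratio[missing_ratio > 0])
```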
3. Classification metric calculation exercises
accuracy
sklearn.metrics.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
# Compute the accuracy score
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
0.5
Precision, Recall, F1-score
sklearn.metrics.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
sklearn.metrics.recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
## Precision, Recall, F1-score
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
f1_score = metrics.f1_score(y_true, y_pred)
print("Precision: %s \r\n"
"Recall: %s \r\n"
"F1-score: %s \r\n"
%(precision, recall, f1_score))
Precision: 1.0
Recall: 0.5
F1-score: 0.6666666666666666
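To see where these three numbers come from, the same example can be worked through by counting true positives, false positives, and false negatives by hand. This is a plain-Python sketch that treats label 1 as the positive class, matching the `pos_label=1` default.

```python
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

# Count outcomes with label 1 as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted 1, truly 1
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted 1, truly 0
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted 0, truly 1

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn)
print(precision, recall, f1)
```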
AUC
sklearn.metrics.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
## AUC
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)
0.75
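One way to read the 0.75 is as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below checks that interpretation on the same data by counting pairs in plain Python. It is meant as an illustration of the idea, not as a description of how `roc_auc_score` is implemented.

```python
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Split the scores by true label.
pos = [s for t, s in zip(y_true, y_scores) if t == 1]
neg = [s for t, s in zip(y_true, y_scores) if t == 0]

# Each (positive, negative) pair counts 1 if the positive is ranked higher,
# 0.5 if the scores tie, and 0 otherwise.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos
    for n in neg
)
auc = wins / (len(pos) * len(neg))

print(auc)
```

Of the four pairs, only (0.35, 0.4) is ranked the wrong way round, which gives 3 / 4.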
4. Regression metric calculation examples
sklearn.metrics.mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)
sklearn.metrics.mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
# MAPE is written by hand here
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))
print("MSE:" , metrics.mean_squared_error(y_true, y_pred))
print("RMSE:" , np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
print("MAE:" , metrics.mean_absolute_error(y_true, y_pred))
print("MAPE:" , mape(y_true, y_pred))
MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.1461904761904762
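As a cross-check on the library calls, MSE, RMSE, and MAE can also be computed directly from their definitions with NumPy on the same arrays. The variable names below are only for this sketch.

```python
import numpy as np

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

errors = y_pred - y_true

mse = np.mean(errors ** 2)           # mean of the squared errors
rmse = np.sqrt(mse)                  # square root of MSE, same unit as y
mae_value = np.mean(np.abs(errors))  # mean of the absolute errors

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae_value)
```

Because the errors are squared, MSE and RMSE are pulled up by the single error of 1.0 more than MAE is.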
R2-score
sklearn.metrics.r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
## R2-score
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred)
0.9486081370449679
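The R2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of `y_true`. The NumPy sketch below reproduces the value above from that definition. The variable names are illustrative.

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Residual sum of squares: squared errors of the predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of the true values.
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)

r2 = 1 - ss_res / ss_tot

print(ss_res, ss_tot)
print(r2)
```

A model that always predicts the mean of `y_true` has `ss_res == ss_tot` and therefore an R2 of 0. A perfect model has an R2 of 1.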
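This is my first time taking part in a Datawhale data-analysis study group. I have some Python basics, but many statistical concepts are still unclear to me. I hope this course gives me a solid start and helps me pick up the essential skills, such as building features and choosing models.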