Used-Car Transaction Price Prediction: Exploratory Data Analysis (EDA)

Environment: PyCharm, Python 3.7

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns  # needed by the sns.distplot calls in the plotting section

First, load the data:

train = pd.read_csv('used_car_train_20200313.csv',sep=' ')
test = pd.read_csv('used_car_testA_20200313.csv',sep=' ')

The describe() method gives summary statistics for the data:

print(train.describe())
print(test.describe())

The isnull() method shows whether any column has missing values:

print(train.isnull().any())
print(test.isnull().any())

Four features turn out to have missing values:
model True
bodyType True
fuelType True
gearbox True
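isnull().any() only reports whether a column has gaps, not how many. A minimal sketch of counting them with isnull().sum(), using a small made-up frame (the name `demo` and its values are illustrative assumptions, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
demo = pd.DataFrame({
    'model': [1.0, np.nan, 3.0, 4.0],
    'bodyType': [0.0, np.nan, np.nan, 2.0],
    'power': [75, 110, 90, 60],
})

missing_counts = demo.isnull().sum()                  # number of NaN values per column
missing_counts = missing_counts[missing_counts > 0]   # keep only the columns that have gaps
print(missing_counts)
```

Running the same two lines on train and test would give the per-column counts that the decisions below rely on.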
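In train, model has only one missing value, and in test it has none, so after looking at it I decided the single row in train with the missing model can simply be dropped.
As for bodyType, fuelType and gearbox, they have too many missing values; dropping those rows would leave too few training samples, so I fill them instead.
Each feature is filled separately, with a value chosen according to the values that feature takes: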

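# DataFrame.fillna(['bodyType'], 8, ...) would treat the list as the fill value,
# so fill each column on its own instead.
train['bodyType'] = train['bodyType'].fillna(8)
test['bodyType'] = test['bodyType'].fillna(8)

train['fuelType'] = train['fuelType'].fillna(7)
test['fuelType'] = test['fuelType'].fillna(7)

train['gearbox'] = train['gearbox'].fillna(3)
test['gearbox'] = test['gearbox'].fillna(3)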
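Dropping the single row with a missing model is described above but not shown in code. One way to do it, sketched on a made-up frame (the name `demo_train` and its values are assumptions for illustration), is dropna with the subset argument, which only looks at the listed columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; only the second row lacks a model value.
demo_train = pd.DataFrame({
    'model': [1.0, np.nan, 3.0],
    'bodyType': [np.nan, 2.0, 1.0],
    'price': [1850, 3600, 6222],
})

# Drop rows where model is NaN; NaN in other columns (e.g. bodyType) is ignored here.
demo_train = demo_train.dropna(subset=['model']).reset_index(drop=True)
print(len(demo_train))
```

reset_index(drop=True) is optional; it just keeps the row index contiguous after the drop.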

Now look at the range of power:

print(train['power'].describe())

power contains values outside its valid range, so an upper bound is applied (600 here):

train['power'] = train['power'].map(lambda x: x if(x<600) else 600)
test['power'] = test['power'].map(lambda x: x if(x<600) else 600)
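As an alternative to the lambda, pandas' Series.clip applies the same upper bound in a vectorized way. A small sketch on made-up values (the series `power` below is illustrative only):

```python
import pandas as pd

# Made-up power values, including one far above the 600 cap.
power = pd.Series([0, 75, 599, 600, 19312])

capped = power.clip(upper=600)   # values above 600 become 600, the rest are unchanged
print(capped.tolist())           # [0, 75, 599, 600, 600]
```

One difference worth knowing: clip leaves NaN as NaN, whereas the lambda version turns NaN into 600, because the comparison `x < 600` is False for NaN.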

Next, compare the distributions of train and test:
(The plotting code below was written by someone else for Jupyter notebook; if you need to run it in PyCharm, adapt it first.)

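price = train['price']       # assumption: price is the target column; it was not defined in the original snippet
price_log = np.log(price)    # assumption: price_log is the natural log of price
plt.figure(figsize=(20,15))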
plt.subplot(4,5,1)
sns.distplot(price)
plt.subplot(4,5,2)
sns.distplot(price_log,axlabel='price_log')
plt.subplot(4,5,3)
price_log.plot.box()
plt.subplot(4,5,4)
sns.distplot(train['name'],axlabel='train_name')
plt.subplot(4,5,5)
sns.distplot(test['name'],axlabel='test_name')
plt.subplot(4,5,6)
sns.distplot(np.log(test['name']+0.00001),axlabel='test_name_log')
plt.subplot(4,5,7)
sns.distplot(train['model'],axlabel='train_model')
plt.subplot(4,5,8)
sns.distplot(test['model'],axlabel='test_model')
plt.subplot(4,5,9)
sns.distplot(np.log(test['model']+0.0001),axlabel='test_model_log')
plt.subplot(4,5,10)
sns.distplot(train['brand'],axlabel='train_brand')
plt.subplot(4,5,11)
sns.distplot(test['brand'],axlabel='test_brand')
plt.subplot(4,5,12)
sns.distplot(train['bodyType'],axlabel='train_bodyType')
plt.subplot(4,5,13)
sns.distplot(test['bodyType'],axlabel='test_bodyType')
plt.subplot(4,5,14)
sns.distplot(train['fuelType'],axlabel='train_fuelType')
plt.subplot(4,5,15)
sns.distplot(test['fuelType'],axlabel='test_fuelType')
plt.subplot(4,5,16)
sns.distplot(train['gearbox'],axlabel='train_gearbox')
plt.subplot(4,5,17)
sns.distplot(test['gearbox'],axlabel='test_gearbox')
plt.subplot(4,5,18)
sns.distplot(train['power'],axlabel='train_power')
plt.subplot(4,5,19)
sns.distplot(test['power'],axlabel='test_power')
plt.figure(figsize=(20,15))  # start a new figure: the subplot indices restart here
plt.subplot(4,5,7)
sns.distplot(train['v_1'],axlabel='train_v1')
plt.subplot(4,5,8)
sns.distplot(test['v_1'],axlabel='test_v1')
plt.subplot(4,5,9)
sns.distplot(train['v_2'],axlabel='train_v2')
plt.subplot(4,5,10)
sns.distplot(test['v_2'],axlabel='test_v2')
plt.subplot(4,5,11)
sns.distplot(train['v_3'],axlabel='train_v3')
plt.subplot(4,5,12)
sns.distplot(test['v_3'],axlabel='test_v3')
plt.subplot(4,5,13)
sns.distplot(train['v_4'],axlabel='train_v4')
plt.subplot(4,5,14)
sns.distplot(test['v_4'],axlabel='test_v4')
plt.subplot(4,5,15)
sns.distplot(train['v_5'],axlabel='train_v5')
plt.subplot(4,5,16)
sns.distplot(test['v_5'],axlabel='test_v5')
plt.subplot(4,5,17)
sns.distplot(train['v_6'],axlabel='train_v6')
plt.subplot(4,5,18)
sns.distplot(test['v_6'],axlabel='test_v6')
plt.subplot(4,5,19)
sns.distplot(train['v_7'],axlabel='train_v7')
plt.subplot(4,5,20)
sns.distplot(test['v_7'],axlabel='test_v7')
plt.figure(figsize=(20,15))  # start a new figure: the subplot indices restart here
plt.subplot(4,5,1)
sns.distplot(train['v_8'],axlabel='train_v8')
plt.subplot(4,5,2)
sns.distplot(test['v_8'],axlabel='test_v8')
plt.subplot(4,5,3)
sns.distplot(train['v_9'],axlabel='train_v9')
plt.subplot(4,5,4)
sns.distplot(test['v_9'],axlabel='test_v9')
plt.subplot(4,5,5)
sns.distplot(train['v_10'],axlabel='train_v10')
plt.subplot(4,5,6)
sns.distplot(test['v_10'],axlabel='test_v10')
plt.subplot(4,5,7)
sns.distplot(train['v_11'],axlabel='train_v11')
plt.subplot(4,5,8)
sns.distplot(test['v_11'],axlabel='test_v11')
plt.subplot(4,5,9)
sns.distplot(train['v_12'],axlabel='train_v12')
plt.subplot(4,5,10)
sns.distplot(test['v_12'],axlabel='test_v12')
plt.subplot(4,5,11)
sns.distplot(train['v_13'],axlabel='train_v13')
plt.subplot(4,5,12)
sns.distplot(test['v_13'],axlabel='test_v13')
plt.subplot(4,5,13)
sns.distplot(train['v_14'],axlabel='train_v14')
plt.subplot(4,5,14)
sns.distplot(test['v_14'],axlabel='test_v14')
plt.show()  # needed outside Jupyter (e.g. in PyCharm) to display the figures

The distributions of the features are largely consistent between train and test.
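The long run of subplot calls above can also be condensed into a loop. The following is a rough sketch, not the original author's code: it uses plain matplotlib histograms instead of sns.distplot, overlays train and test in the same panel, and runs on made-up data. The helper name `plot_train_test` and the frames `demo_train` and `demo_test` are hypothetical.

```python
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


def plot_train_test(train, test, columns, ncols=5):
    """Overlay train and test histograms, one subplot per column."""
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        # density=True puts both sets on the same scale despite different sizes.
        ax.hist(train[col].dropna(), bins=30, density=True, alpha=0.5, label='train')
        ax.hist(test[col].dropna(), bins=30, density=True, alpha=0.5, label='test')
        ax.set_xlabel(col)
        ax.legend()
    # Hide any unused panels in the last row.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Synthetic stand-in data: column names mirror the post, the values are made up.
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({'power': rng.normal(120, 40, 500),
                           'v_1': rng.normal(0, 1, 500),
                           'v_2': rng.normal(0, 2, 500)})
demo_test = pd.DataFrame({'power': rng.normal(120, 40, 200),
                          'v_1': rng.normal(0, 1, 200),
                          'v_2': rng.normal(0, 2, 200)})

fig = plot_train_test(demo_train, demo_test, ['power', 'v_1', 'v_2'], ncols=2)
# In PyCharm, call plt.show() here to open the figure window.
```

With the real data, the call would take train, test and the list of feature names to compare; a mismatch between the two histograms in a panel points to a feature whose distribution differs between the sets.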
The notRepairedDamage column is of object type because it contains the value '-'; replace that value and convert the column to float:

train['notRepairedDamage'] = train['notRepairedDamage'].map(lambda x: -1 if(x=='-') else x)
train['notRepairedDamage'] = train['notRepairedDamage'].map(lambda x: float(x))
test['notRepairedDamage'] = test['notRepairedDamage'].map(lambda x: -1 if(x=='-') else x)
test['notRepairedDamage'] = test['notRepairedDamage'].map(lambda x: float(x))
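The two map calls per frame can be merged into one replace plus astype. A minimal sketch on a made-up series, assuming the column arrives from read_csv as strings such as '0.0', '1.0' and '-' (the name `damage` is illustrative):

```python
import pandas as pd

# Made-up values in the shape an object column would have after read_csv.
damage = pd.Series(['0.0', '-', '1.0', '0.0'])

# Series.replace matches whole values here, so only the lone '-' entries change.
cleaned = damage.replace('-', '-1').astype(float)
print(cleaned.tolist())   # [0.0, -1.0, 1.0, 0.0]
```

Using -1 as the placeholder follows the original code; it keeps the "unknown" entries distinct from 0 and 1.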

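That concludes the exploratory analysis. Next comes a series of preprocessing steps such as one-hot encoding, so that the "distance" between data points is fully spread out.
If you spot any mistakes, corrections are welcome.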
