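1: Detecting outliers
2: Qualitatively comparing the distributions of data of the same type
Python examples (seaborn boxplot)
1: Detecting outliers (loading the first column of the industrial steam training data)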
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_train.txt",sep='\t',encoding='utf-8')
fig = plt.figure(figsize=(4,6)) # set the figure size
sns.boxplot(y=train_data['V0'],orient='v',width=0.5) # vertical box plot of column V0, used to check it for outliers
plt.show()
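As a rough numeric companion to the box plot, the sketch below computes the fences a box plot uses by default to flag outliers: points lying more than 1.5 times the interquartile range beyond the first or third quartile. The helper name `iqr_outliers` and the sample values are made up for illustration and are not taken from the steam dataset.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return (lower fence, upper fence, outliers) under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1                             # interquartile range (box height)
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers

# Hypothetical sample with one obviously extreme value
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper, outliers)
```

For a real column, something like `iqr_outliers(train_data['V0'])` should list the same points that the box plot draws beyond its whiskers.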
2: Comparing the distributions of same-type data (less commonly used)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_train.txt",sep='\t',encoding='utf-8')
test_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_test.txt",sep='\t',encoding='utf-8')
fig = plt.figure(figsize=(4,4)) # set the figure size
plt.subplot(1, 2, 1)
sns.boxplot(y=train_data['V0'],orient='v',width=0.5)
plt.subplot(1, 2, 2)
sns.boxplot(y=test_data['V0'],orient='v',width=0.5)
# Compare the distribution of column V0 in train and test; judging from the plots, there is no obvious difference
plt.show()
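When two box plots look alike, a two-sample Kolmogorov-Smirnov test is one way to put a number on the comparison. This is a supplementary technique, not something the example above uses, and the arrays below are synthetic stand-ins for the train and test `V0` columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumption): two samples from the same distribution,
# plus one sample whose mean is shifted by two standard deviations
train_like = rng.normal(loc=0.0, scale=1.0, size=500)
test_like = rng.normal(loc=0.0, scale=1.0, size=500)
shifted = rng.normal(loc=2.0, scale=1.0, size=500)

# The KS statistic D is the largest gap between the two empirical CDFs
same = stats.ks_2samp(train_like, test_like)
diff = stats.ks_2samp(train_like, shifted)

print(f"same source:    D={same.statistic:.3f}, p={same.pvalue:.3f}")
print(f"shifted source: D={diff.statistic:.3f}, p={diff.pvalue:.3g}")
```

A small p-value suggests the two samples come from different distributions; a large one only means the test found no evidence of a difference.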
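A histogram (with the y-axis showing frequency divided by bin width) plots the distribution of the data as bars; smoothing it with a window gives the probability density curve. A Q-Q plot, in turn, plots the sample's quantiles against the quantiles of a theoretical distribution (here the normal distribution); the closer the points lie to a straight line, the closer the data is to that distribution: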
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
train_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_train.txt",sep='\t',encoding='utf-8')
test_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_test.txt",sep='\t',encoding='utf-8')
plt.figure(figsize=(10,5))
plt.subplot(1, 2, 1)
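# Plot the distribution of column V0 with its density curve (blue line) and compare it with a normal distribution fitted to the data (black line)
# Note: distplot is deprecated since seaborn 0.11
sns.distplot(train_data['V0'],fit=stats.norm)
plt.subplot(1, 2, 2)
# If the points lie along the straight line, the data is approximately normally distributed; otherwise it is not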
stats.probplot(train_data['V0'],plot=plt)
plt.show()
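To make the Q-Q reading less visual, `stats.probplot` can also be called without `plot=` to return the plotted points and the fitted line. The sketch below uses synthetic samples (an assumption, not the steam data) to show that the correlation coefficient `r` of the fit is close to 1 for normal data and lower for skewed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples (assumption): one normal but not standard normal, one skewed
normal_sample = rng.normal(loc=5.0, scale=2.0, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

# osm: theoretical normal quantiles, osr: sorted sample values
# slope, intercept, r: least-squares line through the (osm, osr) points
(osm, osr), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"normal sample: slope={slope:.2f}, intercept={intercept:.2f}, r={r_normal:.4f}")
print(f"skewed sample: r={r_skewed:.4f}")
```

For the normal sample the slope and intercept roughly recover the standard deviation and mean, which is why points on the line indicate a normal distribution in general rather than the standard normal specifically.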
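Plot the distribution curve (probability density plot) of a single variable; the area under the curve is 1. It shows how the data is distributed and can be understood as a windowed smoothing of the histogram.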
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
train_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_train.txt",sep='\t',encoding='utf-8')
test_data=pd.read_csv(r"C:\Users\Administrator\Desktop\数据挖掘项目\蒸汽预测数据\zhengqi_test.txt",sep='\t',encoding='utf-8')
plt.figure(figsize=(10,5))
ax = sns.kdeplot(train_data['V0'], color= "Red")
ax = sns.kdeplot(test_data['V0'], color= "Blue")
ax.set_xlabel('V0')
ax.set_ylabel("Probability density")
ax.legend(["train", "test"])
plt.show()
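The two claims above, that a density histogram uses frequency divided by bin width and that the smoothed curve encloses an area of 1, can be checked numerically. This is a minimal sketch on synthetic data; `scipy.stats.gaussian_kde` stands in for the smoothing that `sns.kdeplot` performs, and the bin count and grid are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data (assumption)

# Density histogram: each bar height is relative frequency / bin width,
# so the bar areas sum to 1
heights, edges = np.histogram(sample, bins=20, density=True)
hist_area = np.sum(heights * np.diff(edges))

# Gaussian KDE: a smoothed version of the histogram, evaluated on a grid
kde = stats.gaussian_kde(sample)
grid = np.linspace(sample.min() - 3.0, sample.max() + 3.0, 2001)
kde_area = np.sum(kde(grid)) * (grid[1] - grid[0])  # simple Riemann sum

print(f"histogram area: {hist_area:.4f}")
print(f"KDE area:       {kde_area:.4f}")
```

Both areas should come out at approximately 1, with the KDE value limited only by the grid resolution.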