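Stage | Operations |
---|---|
Initial inspection | Check missing values (directly with `isnull`, or visually with `missingno`); separate numeric from non-numeric features; plot the distributions of all features at once |
Preprocessing | Handle missing values (fill); split the data (keep only the values that are needed); unify data formats; feature engineering (feature encoding, converting strings to 0/1); feature derivation; dimensionality reduction (feature correlation, PCA) |
Data analysis | `groupby` to compute per-group extremes; seaborn visualization |
Prediction | Split the dataset; build a model (RandomForestRegressor, LogisticRegression, GradientBoostingRegressor); train; predict; evaluate (ROC curve, MSE, MAE, RMSE, R2) |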
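Counts: bar chart
Proportions: pie chart
Distribution across value ranges: probability density plot
Correlations: bar chart, heatmap
Distribution analysis: categorical count plot (countplot), histogram with a density trend line (distplot)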
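Project goal
Use the US Census income dataset to predict, from census data, whether a person's income exceeds $50,000 per year.
Data source
Dataset URL: https://archive.ics.uci.edu/ml/datasets/adult
Class labels (predclass): >50K, <=50K.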
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
import pandas as pd

# 1. Define the column names
headers = ['age', 'workclass', 'fnlwgt',
           'education', 'education-num',
           'marital-status', 'occupation',
           'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country',
           'predclass']
# 2. Load the training and test sets
training_raw = pd.read_csv('dataset/adult.data',
                           names=headers,
                           sep=r',\s',        # separator: a comma followed by whitespace (regex)
                           na_values=['?'],   # '?' marks a missing value
                           engine='python')   # regex separators need the python engine
test_raw = pd.read_csv('dataset/adult.test',
                       names=headers,
                       sep=r',\s',
                       na_values=['?'],
                       engine='python',
                       skiprows=1)            # skip the first line of adult.test
# 3. Concatenate both sets and rebuild the index
# (DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True
# gives the same result as append + reset_index + drop('index'))
dataset_raw = pd.concat([training_raw, test_raw], ignore_index=True)
import missingno
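# 1. View missing values as a matrix
missingno.matrix(dataset_raw, figsize=(30,5))
# 2. View missing values as a bar chart
missingno.bar(dataset_raw, sort="ascending", figsize=(30,5))
# 3. Drop rows with missing values (run after step 6 of the feature processing, once dataset_bin and dataset_con exist)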
dataset_bin = dataset_bin.dropna(axis=0)
dataset_con = dataset_con.dropna(axis=0)
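The overview table also lists the direct `isnull` check next to `missingno`. A minimal sketch of that check, assuming a small made-up frame `toy` in place of `dataset_raw` (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dataset_raw (hypothetical values)
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", np.nan, "Private", np.nan],
    "occupation": ["Adm-clerical", "Sales", np.nan, "Sales"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
# Share of missing values per column
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio)
```

Applied to `dataset_raw`, the same two lines would show which columns contain `?` placeholders that were read in as missing values.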
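import math
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Draw the distribution of every feature on a single figure
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')  # plot style (named 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6)
    fig = plt.figure(figsize=(width, height))  # figure size
    # Adjust the spacing between subplots
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # Iterate over the columns together with their position
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i+1)  # add a subplot
        ax.set_title(column)  # subplot title
        if dataset.dtypes[column] == object:  # np.object was removed in NumPy 1.24; use the builtin
            g = sns.countplot(y=column, data=dataset)  # count plot for non-numeric columns
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]  # tick labels truncated to 18 characters
            plt.xticks(rotation=25)
        else:
            # histogram with density curve for numeric columns
            # (distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is its successor)
            g = sns.distplot(dataset[column])
            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)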
# 1. Create two new DataFrames
dataset_bin = pd.DataFrame()  # holds the discretized (binned) features
dataset_con = pd.DataFrame()  # holds the continuous (unbinned) features
# 2. predclass is the label (prediction target): convert it to 0/1, with income above 50K encoded as 1.
# Both spellings are handled: with and without a trailing period
dataset_raw.loc[dataset_raw['predclass']=='>50K', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='>50K.', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='<=50K', 'predclass'] = 0
dataset_raw.loc[dataset_raw['predclass']=='<=50K.', 'predclass'] = 0
# Store the label in both DataFrames
dataset_bin['predclass'] = dataset_raw['predclass']
dataset_con['predclass'] = dataset_raw['predclass']
# Visualize the predclass label
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,1))
sns.countplot(y='predclass', data=dataset_raw)
# 3. Feature age: store it both binned and unbinned
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10)  # binned into 10 equal-width intervals
dataset_con['age'] = dataset_raw['age']  # unbinned
# Plot the binned data
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age', data=dataset_bin)
# Plot the unbinned data (histogram with density curve), split by income class
plt.subplot(1,2,2)  # second panel, so the curves do not overwrite the count plot
sns.distplot(dataset_con.loc[dataset_con['predclass']==1]['age'])  # age distribution, income > 50K
sns.distplot(dataset_con.loc[dataset_con['predclass']==0]['age'])  # age distribution, income <= 50K
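The call `pd.cut(dataset_raw['age'], 10)` above splits the observed range of the column into equal-width intervals. A small sketch of what that does, assuming a made-up `ages` series and 4 bins instead of 10 to keep the output short:

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset
ages = pd.Series([17, 25, 33, 41, 58, 90], name="age")

# An integer second argument means: that many equal-width intervals over [min, max]
binned = pd.cut(ages, 4)

# Each value is replaced by the interval it falls into
print(binned)
# Integer position of the interval for each value
print(binned.cat.codes.tolist())
```

Equal-width bins can end up very unevenly filled when a feature is skewed, which is worth keeping in mind for columns such as capital-gain and capital-loss further down.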
# 4. Feature workclass
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_raw)
# Every category other than Private is rare, so merging categories is worth considering
# Reduce the number of categories
dataset_raw.loc[dataset_raw['workclass'] == 'Without-pay','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Never-worked','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Federal-gov','workclass'] = 'Fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'State-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Local-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-not-inc','workclass'] = 'Self-emp'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-inc','workclass'] = 'Self-emp'
# Store the result
dataset_bin['workclass'] = dataset_raw['workclass']
dataset_con['workclass'] = dataset_raw['workclass']
# Plot again after merging the work classes
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_bin)
# 5. Feature occupation
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5))
sns.countplot(y="occupation", data=dataset_raw)
# Some occupation categories are small, so related ones can be merged
# Merge categories
dataset_raw.loc[dataset_raw['occupation'] == 'Adm-clerical', 'occupation'] = 'Admin'
dataset_raw.loc[dataset_raw['occupation'] == 'Armed-Forces', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Craft-repair', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Exec-managerial', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Farming-fishing', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Handlers-cleaners', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Machine-op-inspct', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Other-service', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Priv-house-serv', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Prof-specialty', 'occupation'] = 'Professional'
dataset_raw.loc[dataset_raw['occupation'] == 'Protective-serv', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Sales', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Tech-support', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Transport-moving', 'occupation'] = 'Manual Labour'
dataset_bin['occupation'] = dataset_raw['occupation']
dataset_con['occupation'] = dataset_raw['occupation']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="occupation", data=dataset_bin)
# 6. Feature native-country
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,10))
sns.countplot(y="native-country", data=dataset_raw)
# Every country other than United-States is rare, so countries are merged into regions
# Merge categories
dataset_raw.loc[dataset_raw['native-country'] == 'Cambodia' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Canada' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'China' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Columbia' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Cuba' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Dominican-Republic' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Ecuador' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'El-Salvador' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'England' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'France' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Germany' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Greece' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Guatemala' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Haiti' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Holand-Netherlands' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Honduras' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Hong' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Hungary' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'India' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Iran' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Ireland' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Italy' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Jamaica' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Japan' , 'native-country'] = 'APAC'
dataset_raw.loc[dataset_raw['native-country'] == 'Laos' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Mexico' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Nicaragua' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Outlying-US(Guam-USVI-etc)' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Peru' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Philippines' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Poland' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Portugal' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Puerto-Rico' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Scotland' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'South' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Taiwan' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Thailand' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Trinadad&Tobago' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'United-States' , 'native-country'] = 'United-States'
dataset_raw.loc[dataset_raw['native-country'] == 'Vietnam' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Yugoslavia' , 'native-country'] = 'Euro_Group_2'
dataset_bin['native-country'] = dataset_raw['native-country']
dataset_con['native-country'] = dataset_raw['native-country']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="native-country", data=dataset_bin)
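The repeated `.loc` assignments in steps 4 to 6 can also be written as a single dictionary lookup. The sketch below is one possible alternative, not the approach used in these notes; it reuses the workclass mapping from step 4 on a made-up series `toy`:

```python
import pandas as pd

# Same category merges as in step 4 (workclass)
workclass_map = {
    "Without-pay": "Not Working",
    "Never-worked": "Not Working",
    "Federal-gov": "Fed-gov",
    "State-gov": "Non-fed-gov",
    "Local-gov": "Non-fed-gov",
    "Self-emp-not-inc": "Self-emp",
    "Self-emp-inc": "Self-emp",
}

# Hypothetical sample values
toy = pd.Series(["Private", "State-gov", "Self-emp-inc", "Never-worked"], name="workclass")

# Series.replace swaps mapped values and leaves unmapped ones (here 'Private') unchanged
merged = toy.replace(workclass_map)
print(merged.tolist())
```

Keeping the mapping in one dictionary makes it easier to review which raw categories end up in which group, and a typo in a key simply leaves that category unmerged, which a quick `value_counts()` would reveal.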
# 7. Feature education
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5))
sns.countplot(y="education", data=dataset_raw)
dataset_raw.loc[dataset_raw['education'] == '10th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '11th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '12th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '1st-4th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '5th-6th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '7th-8th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '9th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-acdm' , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-voc' , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Bachelors' , 'education'] = 'Bachelors'
dataset_raw.loc[dataset_raw['education'] == 'Doctorate' , 'education'] = 'Doctorate'
dataset_raw.loc[dataset_raw['education'] == 'HS-grad' , 'education'] = 'HS-Graduate'  # the dataset spells this value 'HS-grad'
dataset_raw.loc[dataset_raw['education'] == 'Masters' , 'education'] = 'Masters'
dataset_raw.loc[dataset_raw['education'] == 'Preschool' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Prof-school' , 'education'] = 'Professor'
dataset_raw.loc[dataset_raw['education'] == 'Some-college' , 'education'] = 'HS-Graduate'
dataset_bin['education'] = dataset_raw['education']
dataset_con['education'] = dataset_raw['education']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="education", data=dataset_bin)
# 8. Feature marital-status
plt.figure(figsize=(20,3))
sns.countplot(y="marital-status", data=dataset_raw)
dataset_raw.loc[dataset_raw['marital-status'] == 'Never-married' , 'marital-status'] = 'Never-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-AF-spouse' , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-civ-spouse' , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-spouse-absent', 'marital-status'] = 'Not-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Separated' , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Divorced' , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Widowed' , 'marital-status'] = 'Widowed'
dataset_bin['marital-status'] = dataset_raw['marital-status']
dataset_con['marital-status'] = dataset_raw['marital-status']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="marital-status", data=dataset_bin)
# 9. Feature fnlwgt (final sampling weight of the census record, not a body weight): binning
dataset_bin['fnlwgt'] = pd.cut(dataset_raw['fnlwgt'], 10)
dataset_con['fnlwgt'] = dataset_raw['fnlwgt']  # unbinned
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="fnlwgt", data=dataset_bin)
# 10. Feature education-num
dataset_bin['education-num'] = pd.cut(dataset_raw['education-num'], 10)  # binned
dataset_con['education-num'] = dataset_raw['education-num']  # unbinned
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y="education-num", data=dataset_bin)
# 11. Feature hours-per-week
# Bin the weekly working time (in hours)
dataset_bin['hours-per-week'] = pd.cut(dataset_raw['hours-per-week'], 10)
dataset_con['hours-per-week'] = dataset_raw['hours-per-week']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
plt.subplot(1, 2, 1)
sns.countplot(y="hours-per-week", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['hours-per-week'])
# 12.Capital Gain
dataset_bin['capital-gain'] = pd.cut(dataset_raw['capital-gain'], 5)
dataset_con['capital-gain'] = dataset_raw['capital-gain']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
plt.subplot(1, 2, 1)
sns.countplot(y="capital-gain", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-gain'])
# 13. Feature capital-loss
dataset_bin['capital-loss'] = pd.cut(dataset_raw['capital-loss'], 5)
dataset_con['capital-loss'] = dataset_raw['capital-loss']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
plt.subplot(1, 2, 1)
sns.countplot(y="capital-loss", data=dataset_bin)
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-loss'])
# 14. Features race, sex, relationship
# No processing needed
dataset_con['sex'] = dataset_bin['sex'] = dataset_raw['sex']
dataset_con['race'] = dataset_bin['race'] = dataset_raw['race']
dataset_con['relationship'] = dataset_bin['relationship'] = dataset_raw['relationship']
Feature derivation means creating new features from existing ones.
# 1. Derived continuous feature (combining age and hours-per-week)
dataset_con['age-hours'] = dataset_con['age'] * dataset_con['hours-per-week']
dataset_bin['age-hours'] = pd.cut(dataset_con['age-hours'],10)
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age-hours', data=dataset_bin)  # horizontal count plot of the binned feature
plt.subplot(1,2,2)
# Density curves of the derived continuous feature, one per income class
sns.distplot(dataset_con.loc[dataset_con['predclass']==1]['age-hours'])
sns.distplot(dataset_con.loc[dataset_con['predclass']==0]['age-hours'])
# 2. Derived categorical feature (string concatenation of sex and marital-status)
dataset_bin['sex-marital'] = dataset_con['sex-marital'] = dataset_bin['sex'] + dataset_bin['marital-status']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y='sex-marital', data=dataset_bin)
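Machine learning algorithms take numeric inputs; converting string-valued categories into numbers is called encoding. Two schemes are used here:
Label encoding
Example: let red=1, yellow=2, blue=3. This gives every category its own label, but it also means that a model may learn a spurious ordering "red < yellow < blue".
One-hot encoding
Each sample belongs to exactly one category, so its vector is 1 at that category's position and 0 everywhere else.
Example: with three colors there are 3 bits: red = 1 0 0, yellow = 0 1 0, blue = 0 0 1. The distance between any two of these vectors is √2, so all categories are equidistant in the vector space. No ordering is implied, and algorithms that rely on vector-space distances are largely unaffected.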
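The color example can be checked numerically. This is a small illustrative sketch using NumPy only; the names `label` and `one_hot` are made up for the example and do not appear elsewhere in these notes:

```python
import numpy as np

colors = ["red", "yellow", "blue"]

# Label encoding as in the text: red=1, yellow=2, blue=3
label = {c: i + 1 for i, c in enumerate(colors)}

# One-hot encoding: each color gets one row of the 3x3 identity matrix
one_hot = {c: np.eye(len(colors))[i] for i, c in enumerate(colors)}

# With label codes, red is "further" from blue than from yellow
print(abs(label["red"] - label["blue"]), abs(label["red"] - label["yellow"]))

# With one-hot vectors, every pair of colors is the same distance apart
for a, b in [("red", "yellow"), ("red", "blue"), ("yellow", "blue")]:
    print(a, b, np.linalg.norm(one_hot[a] - one_hot[b]))
```

The unequal label distances are what can mislead distance-based or linear models, whereas tree-based models are usually less sensitive to the arbitrary order.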
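from sklearn.preprocessing import LabelEncoder

# 1. One-hot encode all features of the binned dataset
one_hot_cols = dataset_bin.columns.tolist()  # all column names as a list
one_hot_cols.remove('predclass')  # do not encode the label column
# One-hot encoding
dataset_bin_env = pd.get_dummies(dataset_bin, columns=one_hot_cols)
dataset_bin_env.head()
# 2. Label-encode all columns of the continuous dataset
encoder = LabelEncoder()
# Cast every column to string so that LabelEncoder accepts mixed column types
# (note: codes then follow string order, so numeric order is not preserved, e.g. '100' sorts before '20')
dataset_con = dataset_con.astype(str)
dataset_con_env = dataset_con.apply(encoder.fit_transform)
dataset_con_env.head()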
Purpose of feature dimensionality reduction:
# 1. Inspect feature correlations
# Draw a correlation heatmap for each of the two encoded datasets
plt.style.use('seaborn-whitegrid')  # plot style
fig = plt.figure(figsize=(20,10))
# First heatmap
plt.subplot(1,2,1)  # first subplot in a 1x2 grid
# Boolean mask shaped like the correlation matrix of dataset_bin_env (binned, one-hot encoded)
mask = np.zeros_like(dataset_bin_env.corr(), dtype=bool)  # np.bool was removed in NumPy 1.24
# Mask the upper triangle (diagonal included) so that only the lower-left part of the heatmap is drawn
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_bin_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)
# Second heatmap
plt.subplot(1,2,2)  # second subplot in a 1x2 grid
# Boolean mask shaped like the correlation matrix of dataset_con_env (continuous, label encoded)
mask = np.zeros_like(dataset_con_env.corr(), dtype=bool)
# Mask the upper triangle here as well
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_con_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)
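from sklearn.decomposition import PCA

# 2. PCA dimensionality reduction
# Reduce to 10 dimensions (keep 10 principal components, which are combinations of the original features)
X = dataset_bin_env.drop('predclass',axis=1)  # features only, without the label
pca = PCA(n_components=10)
X_reduction = pca.fit_transform(X)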
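Whether 10 components are enough can be judged from the share of variance they retain. The sketch below uses synthetic random data as a stand-in for `X` (the names `X_demo` and `pca_demo` are made up), so the printed number says nothing about the census data itself:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 200 samples, 20 features (hypothetical data)
X_demo = rng.normal(size=(200, 20))

pca_demo = PCA(n_components=10)
X_demo_reduced = pca_demo.fit_transform(X_demo)

# Fraction of the total variance kept by the 10 components
retained = pca_demo.explained_variance_ratio_.sum()
print(X_demo_reduced.shape, retained)
```

scikit-learn also accepts a float between 0 and 1 for `n_components`, in which case it keeps as many components as are needed to reach that share of explained variance.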
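# 1. Choose the dataset
# Option 1: dataset_bin_env (binned features, one-hot encoded)
# Option 2: dataset_con_env (continuous features, label encoded)
selected_dataset = dataset_bin_env
selected_dataset.head()
# 2. Split the dataset
# The source data already comes as separate training and test files, so the original split is restored here
# (.loc slices by index label and includes the end label: labels 0 to 32560 are the training rows)
train = selected_dataset.loc[:32560, :]
test = selected_dataset.loc[32561:,:]
# Name the features and labels before training
X_train = train.drop('predclass',axis=1)
y_train = train['predclass'].astype('int64')
X_test = test.drop('predclass', axis=1)
y_test = test['predclass'].astype('int64')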
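The split above relies on two properties of `.loc`: it slices by index label with the end label included, and `dropna` keeps the original labels. A toy sketch of that behavior, assuming a made-up six-row frame in which rows 0 to 2 play the role of the training file:

```python
import numpy as np
import pandas as pd

# Toy frame: labels 0-2 stand for the training file, labels 3-5 for the test file
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan]})

# Dropping rows with missing values removes labels 1 and 5 but does not renumber the rest
toy = toy.dropna()

train_demo = toy.loc[:2]  # label-based and inclusive: keeps labels 0 and 2
test_demo = toy.loc[3:]   # keeps labels 3 and 4

print(train_demo.index.tolist(), test_demo.index.tolist())
```

Had the index been reset after `dropna`, the boundary label 32560 would no longer mark the end of the training file.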
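from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# 3. Build the model (logistic regression)
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
# Confidence score per test sample (proportional to the signed distance to the decision boundary)
decision_scores = log_reg.decision_function(X_test)
print("decision_scores:",decision_scores)
# 4. Evaluate the model (plot the ROC curve)
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--')  # diagonal from (0,0) to (1,1): the ROC of a random classifier, drawn as a baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()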
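The ROC curve is often summarized by the area under it (AUC). A minimal sketch with four made-up samples in place of `y_test` and `decision_scores`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores, standing in for y_test / decision_scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the curve computed from the ROC points
fpr_demo, tpr_demo, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr_demo, tpr_demo)

# The same quantity computed directly from labels and scores
auc_direct = roc_auc_score(y_true, scores)

print(auc_from_curve, auc_direct)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a perfect ranking of positives above negatives. In this toy case 3 of the 4 positive-negative pairs are ranked correctly, so both values are 0.75.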