Feature Engineering: Extracting Features with Transformers

Preface:

Chapter 5 of Learning Data Mining with Python. We learn how to build our own transformers, along with some feature-engineering techniques.
The code and the output of each part are shown below.

Data download:

Adult dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
Internet Advertisements dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/

Load the dataset file adult.data with pandas:

import os
import pandas as pd
data_folder = r"E:\DataMining\Project\dataming_with_python\Adult"
adult_filename = os.path.join(data_folder, "adult.data")
adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt", "Education",
                           "Education-Num", "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex",
                           "Capital-gain", "Capital-loss", "Hours-per-week",
                           "Native-Country", "Earnings-Raw"])
# Drop rows where every value is missing
adult.dropna(how="all", inplace=True)
#adult.columns

Common summary statistics for continuous and ordinal features:

adult["Hours-per-week"].describe()
count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: Hours-per-week, dtype: float64

Taking the mean of the Education feature is meaningless; converting it to years of education (Education-Num) gives a feature whose mean and median are meaningful:

adult["Education-Num"].median()
10.0

Get all values of the categorical feature Work-Class:

adult["Work-Class"].unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
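Note the ' ?' entry in the output above: this dataset marks missing values with a question mark, and every string carries a leading space. A minimal cleaning sketch on a toy frame (the same calls apply to the adult columns):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Work-Class column, with ' ?' marking a missing value
df = pd.DataFrame({"Work-Class": [" State-gov", " ?", " Private"]})
# Strip the leading spaces, then turn the "?" marker into NaN
df["Work-Class"] = df["Work-Class"].str.strip().replace("?", np.nan)
print(df["Work-Class"].isnull().sum())  # 1 missing value
```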

Convert a numeric feature into a categorical one:

adult["LongHours"] = adult["Hours-per-week"] > 40
#adult["LongHours"]

Use the VarianceThreshold transformer to remove features whose variance falls below a threshold:

import numpy as np
X = np.arange(30).reshape((10, 3))
# Set the second column to all 1s
X[:, 1] = 1
X
array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)
Xt  # the second column has been removed
array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])
print(vt.variances_)  # the variance of each column
[74.25  0.   74.25]
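VarianceThreshold removes only zero-variance features by default, but its `threshold` parameter lets you drop any feature below a chosen variance. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_demo = np.arange(30).reshape((10, 3)).astype(float)
X_demo[:, 1] = 1                  # constant column, variance 0
X_demo[:, 2] = X_demo[:, 2] % 2   # alternating 0/1 column, variance 0.25

# Keep only features whose variance exceeds 1.0
vt_demo = VarianceThreshold(threshold=1.0)
Xt_demo = vt_demo.fit_transform(X_demo)
print(vt_demo.variances_)  # [74.25  0.    0.25]
print(Xt_demo.shape)       # (10, 1) -- only the first column survives
```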

Selecting the best features

Rather than searching for a well-performing subset of features, just pick well-performing individual features, judged by the accuracy each one can reach on its own.
There are many ways to measure the correlation between a single feature and a class: the chi-squared (χ²) test, mutual information, information entropy, and so on.

X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values
# Classify by pre-tax earnings (note the leading space in the raw string values)
y = (adult["Earnings-Raw"] == ' >50K').values
# Initialize a SelectKBest transformer that scores features with the chi-squared function
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)  # keep the 3 best-scoring features
Xt_chi2 = transformer.fit_transform(X, y)  # matrix containing only those 3 features
print(transformer.scores_)  # relevance score of each feature
[8.60061182e+03 2.40142178e+03 8.21924671e+07 1.37214589e+06
 6.47640900e+03]
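The mutual information mentioned above is also available as a ready-made score function in scikit-learn, `mutual_info_classif`. A sketch on synthetic data (the X and y built above could be substituted):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(14)
X_demo = rng.rand(200, 5)
# The target depends only on column 2, so it should get the highest score
y_demo = (X_demo[:, 2] > 0.5).astype(int)

transformer_mi = SelectKBest(score_func=mutual_info_classif, k=3)
Xt_mi = transformer_mi.fit_transform(X_demo, y_demo)
print(transformer_mi.scores_)  # column 2 scores far above the noise columns
```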

Measuring correlation with the Pearson correlation coefficient:

from scipy.stats import pearsonr
# SciPy's pearsonr takes two arrays, but the first must be one-dimensional;
# a small wrapper lets it handle a multi-column X like the scorers above
def multivariate_pearsonr(X, y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

# Now it can be used in a transformer, just like before
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearsonr = transformer.fit_transform(X, y)
print(transformer.scores_)
[0.2340371  0.33515395 0.22332882 0.15052631 0.22968907]
# See which feature set performs better
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring="accuracy")
scores_pearsonr = cross_val_score(clf, Xt_pearsonr, y, scoring="accuracy")
# Print each set's prediction accuracy
print("Chi-squared: {0:.1f}%".format(np.mean(scores_chi2) * 100))
print("Pearson: {0:.1f}%".format(np.mean(scores_pearsonr) * 100))
Chi-squared: 82.9%
Pearson: 77.1%

Creating features

When features are strongly correlated, or redundant, algorithms have a harder time; creating new features from the existing ones is often worthwhile, and there are many ways to do it.

import os
import pandas as pd
import numpy as np
from collections import defaultdict

# Load the advertisements data
ad_data_folder = r"E:\DataMining\Project\dataming_with_python\Adult"
ad_adult_filename = os.path.join(ad_data_folder, "ad.data")

# Function that converts a numeric-looking string to a number
def convert_number(x):
    try:
        return float(x)
    except ValueError:
        # A proper missing-value marker would be better; use 6 here for simplicity
        return 6


# Map each of the 1558 feature columns to the converter
# (this must be built before read_csv, since ads does not exist yet)
myconverters = {i: convert_number for i in range(1558)}
# Convert the final class column to 0/1 (lambda: a concise one-line function)
myconverters[1558] = lambda x: 1 if x.strip() == "ad." else 0
# Read the CSV, applying the converter functions per column
ads = pd.read_csv(ad_adult_filename, header=None, converters=myconverters)
ads.iloc[192:195, 0]
192     6.0
193     6.0
194    60.0
Name: 0, dtype: float64
# Extract the X matrix and y array
X = ads.drop(1558, axis=1).values
y = ads[1558]

Principal Component Analysis

The goal of PCA is to find combinations of features that describe the dataset using less information.
The variance of these combinations is close to the overall variance, and they are called principal components.
A principal component is typically a complex mix of the original features; for example, the first feature below is each of the original 1558 features multiplied by a different weight.

from sklearn.decomposition import PCA
import numpy as np
pca = PCA(n_components=5)  # number of principal components to keep
Xd = pca.fit_transform(X)
# Check the proportion of variance explained by each component
np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_
array([0.877, 0.121, 0.001, 0.   , 0.   ])
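The per-feature weights mentioned above live in the fitted model's `components_` attribute: one row per component, one weight per original feature. A sketch on synthetic data (the fitted `pca` above has the same attributes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(14)
X_demo = rng.rand(100, 10)

pca_demo = PCA(n_components=3)
Xd_demo = pca_demo.fit_transform(X_demo)
# One row per component, one weight per original feature
print(pca_demo.components_.shape)  # (3, 10)
# Each transformed feature is the centred data dotted with one weight row
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(manual, Xd_demo))  # True
```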
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_reduced = cross_val_score(clf, Xd, y, scoring="accuracy")
print("Accuracy after PCA: {0:.1f}".format(np.mean(scores_reduced)))
Accuracy after PCA: 0.9

PCA can also turn an otherwise hard-to-grasp dataset into a plot:

# Plot the first two features returned by PCA
%matplotlib inline
from matplotlib import pyplot as plt
# Get the class values
classes = set(y)
colors = ["red", "green"]
# zip() pairs the two lists up
for cur_class, color in zip(classes, colors):
    # Mask selecting all individuals of the current class
    mask = (y == cur_class).values
    # scatter() plots the positions; x and y here are the first two features
    plt.scatter(Xd[mask, 0], Xd[mask, 1], marker='o', color=color, label=int(cur_class))
plt.legend()
plt.show()

Creating your own transformer

1. The transformer API
fit(): receives training data and sets the internal parameters
transform(): performs the transformation; receives the training set, or new data in the same format

# TransformerMixin is the class that wires up this API
from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array
# A transformer that takes an array and discretizes it around the column means
class MeanDiscrete(TransformerMixin):
    # Define fit() and transform() matching the standard API
    def fit(self, X, y=None):
        # y defaults to None so fit can be called with X alone
        X = as_float_array(X)
        # Store each feature's mean
        self.mean = X.mean(axis=0)
        # Return self so calls can be chained, e.g. transformer.fit(X).transform(X)
        return self
    def transform(self, X):
        X = as_float_array(X)
        # Check the input has the same number of columns as fit() saw
        assert X.shape[1] == self.mean.shape[0]
        return X > self.mean

# Instantiate and test
mean_discrete = MeanDiscrete()
X_mean = mean_discrete.fit_transform(X)
print(X_mean)

(If fit() is declared as fit(self, X, y) with no default for y, this call fails with
TypeError: fit() missing 1 required positional argument: 'y' — giving y a default of None fixes it.)
# After writing a function or class, it is good practice to unit-test it
from numpy.testing import assert_array_equal
# Name test functions with a test_ prefix so test runners can find them
def test_meandiscrete():
    X_test = np.array([[ 0,  2],
                       [ 3,  5],
                       [ 6,  8],
                       [ 9, 11],
                       [12, 14],
                       [15, 17],
                       [18, 20],
                       [21, 23],
                       [24, 26],
                       [27, 29]])
    # Create a transformer instance and fit it on the test data
    mean_discrete = MeanDiscrete()
    mean_discrete.fit(X_test)
    # The means are known to be 13.5 and 15.5; check the fitted internal parameter
    assert_array_equal(mean_discrete.mean, np.array([13.5, 15.5]))
    X_transformed = mean_discrete.transform(X_test)
    X_expected = np.array([[0, 0],
                           [0, 0],
                           [0, 0],
                           [0, 0],
                           [0, 0],
                           [1, 1],
                           [1, 1],
                           [1, 1],
                           [1, 1],
                           [1, 1]])
    # Check the transform output matches
    assert_array_equal(X_transformed, X_expected)

test_meandiscrete()
# Key construct explained:
# 1. assert
# An assert statement declares that its expression must be true; if the expression
# is false, an AssertionError is raised. Think of it as raise-if-not.

# Some assert examples for reference:
#assert 1 == 1
#assert 2 + 2 == 2 * 2
#assert len(['my boy', 12]) < 10
#assert list(range(4)) == [0, 1, 2, 3]

# assert also takes an optional message argument, a string after the expression
# that explains the assertion and makes failures easier to diagnose:
#assert expression [, arguments]

#assert len(lists) >= 5, 'the list has fewer than 5 elements'
#assert 2 == 1, '2 is not equal to 1'

Putting it all together in a pipeline: the Pipeline class

from sklearn.pipeline import Pipeline
# Step 1 applies the MeanDiscrete transformer, step 2 a decision tree classifier;
# the whole pipeline is cross-validated
pipeline = Pipeline([('mean_discrete', MeanDiscrete()),
                     ('classifier', DecisionTreeClassifier(random_state=14))])
scores_mean_discrete = cross_val_score(pipeline, X, y, scoring='accuracy')
print("Mean Discrete performance: {0:.3f}".format(scores_mean_discrete.mean()))
Mean Discrete performance: 0.937
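The same pattern chains any transformer from this chapter with a classifier. A sketch combining SelectKBest with the decision tree on synthetic data (all names here are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(14)
# Non-negative synthetic features, as chi2 requires
X_demo = rng.randint(0, 10, size=(200, 6)).astype(float)
y_demo = (X_demo[:, 0] > 4).astype(int)  # target driven by column 0

pipe = Pipeline([('select', SelectKBest(score_func=chi2, k=2)),
                 ('classifier', DecisionTreeClassifier(random_state=14))])
# Feature selection is re-fitted inside each fold, avoiding leakage
scores = cross_val_score(pipe, X_demo, y_demo, scoring='accuracy')
print("Accuracy: {0:.3f}".format(scores.mean()))
```

Because selection happens inside the pipeline, it is refit on each training fold, which is the main practical reason to prefer Pipeline over transforming the whole dataset up front.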
