1. When a variable is categorical and has only a few categories, consider converting it into dummy variables.
In [34]:
embark_dummies = pd.get_dummies(train_data.Embarked)
# drop the original column
train_data.drop('Embarked',axis=1,inplace=True)
train_data = train_data.join(embark_dummies)
In [35]:
sex_dummies = pd.get_dummies(train_data.Sex)
# drop the original column
train_data.drop('Sex',axis=1,inplace=True)
train_data = train_data.join(sex_dummies)
In [36]:
train_data.head()
Out[36]:
| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | C | Q | S | female | male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | 1 | 0 | 0 | 1 | 0 | 1 |
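As a compact variant of the dummy-variable step, the sketch below encodes both columns in one `pd.get_dummies` call. The `toy` frame and its values are made up for illustration and are not the Titanic data. The `columns=` argument drops the original columns and prefixes the new ones, and `drop_first=True` is an optional choice (not used in this notebook) that removes one redundant column per feature.

```python
import pandas as pd

# Toy stand-in for the training data; the values are invented.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# Encode both columns at once. The originals are dropped automatically and
# each new column is prefixed with its source column name.
encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"], dtype=int)
print(encoded.columns.tolist())

# Optional: keep k-1 columns per feature, since the last one is implied
# by the others (useful for linear models).
encoded_k1 = pd.get_dummies(
    toy, columns=["Embarked", "Sex"], drop_first=True, dtype=int
)
print(encoded_k1.columns.tolist())
```

The prefixed names (`Embarked_S` rather than `S`) make it easier to trace a column back to its source feature once many features have been encoded.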
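2. Different features are measured on different scales, so their values generally need to be mapped onto a common scale. Otherwise, features with larger numeric values can receive disproportionately high weight, and the mismatch in scale can distort the model's results.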
In [37]:
from sklearn.preprocessing import StandardScaler
train_data['Age'] = StandardScaler().fit_transform(train_data.Age.values.reshape(-1,1))
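To make the scaling step concrete, here is a minimal sketch of the quantity that standardization computes: subtract the mean and divide by the population standard deviation. It uses only the five `Age` values visible in the `head()` output above, so the numbers will not match the scaled column in the notebook, which is fitted on the full training set.

```python
import numpy as np

# First five Age values from the head() output; the real column is longer.
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

mean = ages.mean()
std = ages.std()  # NumPy defaults to ddof=0, the population standard deviation

# z-score: how many standard deviations each value lies from the mean
z = (ages - mean) / std
print(z)
```

After this transformation the values have mean 0 and standard deviation 1, which puts `Age` on a scale comparable to other standardized features.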
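3. Continuous features, such as Fare in this dataset, can be discretized by binning. Binning also reduces the model instability that outliers can cause (provided we have already determined that the outliers carry no meaningful information). After binning, we can either convert the bins into dummy variables or map each bin to a specific numeric value.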
In [38]:
# divide the fares into 5 parts
train_data['Fare'] = pd.qcut(train_data.Fare, 5)
# factorize
train_data['Fare'] = pd.factorize(train_data.Fare)[0]
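One detail worth knowing about the cell above: `pd.factorize` numbers the bins in order of first appearance, so the resulting codes do not follow the fare order. The sketch below compares this with `pd.qcut(..., labels=False)`, which returns the rank of each bin directly. The ten fare values are a small illustrative sample, and the ordered variant is an alternative, not what the notebook does.

```python
import pandas as pd

# Small illustrative sample of fares (not the full column).
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05,
                   8.4583, 51.8625, 21.075, 11.1333, 30.0708])

# Quantile bins, as in the notebook.
binned = pd.qcut(fares, 5)

# factorize assigns 0 to the first bin it encounters, 1 to the next new
# bin, and so on, regardless of which bin holds the higher fares.
appearance_codes = pd.factorize(binned)[0]

# labels=False returns the bin's position: 0 = cheapest fifth, 4 = most expensive.
ordered_codes = pd.qcut(fares, 5, labels=False)

print(appearance_codes.tolist())
print(ordered_codes.tolist())
```

If the model should see Fare as an ordinal feature, the ordered codes preserve that information; if the bins are going to be turned into dummy variables anyway, either encoding works.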
Analyze the Pearson correlation coefficients between some of the important features.
In [39]:
corr = pd.DataFrame(train_data[['C','Q','S','female','male','Age','Pclass','Survived']])
plt.figure(figsize=(10,10))
plt.title("Pearson Correlation of Features", y=1.05, size=15)
sns.heatmap(corr.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=plt.cm.viridis, linecolor='white', annot=True)
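For readers who want to see what each heatmap cell contains, the sketch below computes the Pearson coefficient by hand and checks it against `DataFrame.corr()`. The helper `pearson` is a hypothetical name introduced here, and the data is just the five rows shown in the `head()` output, which is far too few to draw conclusions from.

```python
import numpy as np
import pandas as pd

# Three columns taken from the five rows of the head() output above.
df = pd.DataFrame({
    "female":   [0, 1, 1, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_manual = pearson(df["Pclass"], df["Survived"])
r_pandas = df.corr().loc["Pclass", "Survived"]  # corr() defaults to Pearson
print(r_manual, r_pandas)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.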
Drop features that are not useful, such as the passenger ID, name, and ticket number.
In [40]:
train_data.drop(['PassengerId',"Name","Ticket"],axis=1,inplace=True)
In [41]:
train_data.head()
Out[41]:
| | Survived | Pclass | Age | SibSp | Parch | Fare | Cabin | C | Q | S | female | male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | -0.558012 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0.607433 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 3 | -0.266651 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 1 | 0.388912 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 3 | 0.388912 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 1 |
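When doing feature engineering, we also need to apply the same processing to the test data, so that a model that runs on the training set also runs properly on the test set.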
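One way to keep the two sets consistent is to learn every transformation parameter on the training data and reuse it on the test data. The sketch below is an assumption about how that could look, not the notebook's own code. The `train` and `test` frames and their values are invented, and only two fare bins are used to keep the toy example small.

```python
import pandas as pd

# Invented toy frames standing in for the real training and test sets.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", "S", "Q"],
})
test = pd.DataFrame({
    "Age": [34.5, 47.0],
    "Fare": [7.8292, 9.0],
    "Embarked": ["Q", "S"],
})

# 1. Scaling: learn the mean and std on the training set only, then reuse them.
age_mean = train["Age"].mean()
age_std = train["Age"].std(ddof=0)
train["Age"] = (train["Age"] - age_mean) / age_std
test["Age"] = (test["Age"] - age_mean) / age_std

# 2. Binning: learn the quantile edges on the training set, then cut the
#    test set with the same edges. The outer edges are opened up so that
#    test fares outside the training range still land in a bin.
train["Fare"], edges = pd.qcut(train["Fare"], 2, labels=False, retbins=True)
edges[0], edges[-1] = -float("inf"), float("inf")
test["Fare"] = pd.cut(test["Fare"], bins=edges, labels=False)

# 3. Dummies: a category missing from the test set would otherwise yield
#    fewer columns, so align the test columns to the training columns.
train = pd.get_dummies(train, columns=["Embarked"], dtype=int)
test = pd.get_dummies(test, columns=["Embarked"], dtype=int)
test = test.reindex(columns=train.columns, fill_value=0)

print(test)
```

With this pattern the test set always has the same columns, in the same order and on the same scale, as the data the model was trained on.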