1. Reading the data
This article uses the Adult dataset, which records the incomes of adults in the United States.
import pandas as pd
from IPython.display import display
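# adult_path must already hold the path to the Adult data file.
# The file has no header row, so the column names are passed explicitly.
data = pd.read_csv(
    adult_path, header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])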
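2. Checking string-encoded categorical data
Use the value_counts method of a pandas Series to list each category and how many times it occurs. This makes misspelled or inconsistently written categories easy to spot before encoding.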
print(data.gender.value_counts())
# Output
Male 21790
Female 10771
Name: gender, dtype: int64
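To illustrate why this check is worth doing, here is a minimal sketch on a hypothetical toy column (the values are made up and are not taken from the Adult dataset). It assumes that differences in case and surrounding whitespace are the only inconsistencies, which real data may not satisfy.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
gender = pd.Series(["Male", "male", "Female", " Male", "Female"])

# Every distinct spelling is counted as its own category.
raw_counts = gender.value_counts()

# Stripping whitespace and normalizing case merges equivalent spellings.
cleaned = gender.str.strip().str.capitalize()
clean_counts = cleaned.value_counts()

print(raw_counts)
print(clean_counts)
```

If the raw counts show more categories than expected, clean the column before one-hot encoding it; otherwise each spelling becomes a separate feature.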
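3. One-hot encoding the data
The get_dummies function automatically converts every column with object dtype (such as strings) or categorical dtype into one 0/1 column per category; numeric columns are left unchanged.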
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
display(data_dummies.head(n=2))
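The effect is easier to see on a small frame. The following is a minimal sketch with a hypothetical two-column DataFrame (the name toy and its values are made up); dtype=int is passed only so that the indicator columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one string column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# The numeric column is kept as-is; the string column becomes
# one indicator column per category, named <column>_<category>.
toy_dummies = pd.get_dummies(toy, dtype=int)

print(list(toy_dummies.columns))
print(toy_dummies)
```

The numeric column comes first and the new indicator columns follow, which is why a label slice starting at age can later pick out the feature columns.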
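4. Storing the result in NumPy arrays
Use the values attribute to convert the data_dummies DataFrame into a NumPy array that serves as the feature matrix. Select only the columns that contain features (here, from age through occupation_ Transport-moving; the space after the underscore is part of the category value) and leave out the target. Note that the slice depends on column order: get_dummies places the dummy columns after the numeric ones, in the order of the original columns, so any dummy columns that come after the occupation_ block are left out of X as well.
# .loc label slices include the end label
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
5. Training the model
This example fits a logistic regression model.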
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))
# Output
Test score: 0.81
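The steps from encoding to scoring can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the frame toy, its column values, and the rule that generates the income label are all made up, and max_iter=1000 is set only to give the solver room to converge on unscaled features. The score it produces says nothing about the Adult dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; every value here is made up.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "occupation": rng.choice(["Sales", "Tech-support"], size=n),
})
# Hypothetical target rule: ages above 40 get the high-income label.
toy["income"] = np.where(toy["age"] > 40, ">50K", "<=50K")

toy_dummies = pd.get_dummies(toy, dtype=int)

# .loc label slices include the end label, so this keeps 'age' through
# 'occupation_Tech-support' and leaves out both income_* columns.
features = toy_dummies.loc[:, "age":"occupation_Tech-support"]
X = features.values
y = toy_dummies["income_>50K"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = logreg.score(X_test, y_test)
print("Test score: {:.2f}".format(score))
```

Keeping the income_ columns out of X matters: if either one were included, the model would be given its own target as a feature.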
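In addition, some categorical variables are encoded as numbers. By default get_dummies treats numeric columns as continuous and does not create dummy variables for them. There are two ways around this: convert the numeric column in the DataFrame to strings, or list the columns to encode explicitly with the columns parameter. The snippet below does both.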
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))
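The snippet above uses demo_df without defining it. The following minimal sketch fills that gap with a hypothetical demo_df (the values are made up) and compares the default behavior with the explicit columns parameter; dtype=int is passed only to keep the indicators as 0/1.

```python
import pandas as pd

# Hypothetical frame: one integer-coded category and one string category.
demo_df = pd.DataFrame({
    "Integer Feature": [0, 1, 2, 1],
    "Categorical Feature": ["socks", "fox", "socks", "box"],
})

# Default: only the string column is encoded; the integer column is kept.
default_dummies = pd.get_dummies(demo_df, dtype=int)

# Listing the columns explicitly encodes the integer column too.
explicit_dummies = pd.get_dummies(
    demo_df, columns=["Integer Feature", "Categorical Feature"], dtype=int)

print(list(default_dummies.columns))
print(list(explicit_dummies.columns))
```

When a column is named in columns, it is encoded whatever its dtype, so the astype(str) conversion is only needed if you rely on the default column selection.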