[Machine Learning] Fixed-Length Encoding with OneHotEncoder

How do you save a fitted OneHotEncoder? See: scikit learn - Save OneHot Encoder object python - Stack Overflow
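One common way to keep a fitted encoder is to serialize it with Python's standard `pickle` module. The sketch below is only an illustration under that assumption: the toy data is made up, and the round trip goes through an in-memory bytes object rather than a file.

```python
import pickle

from sklearn.preprocessing import OneHotEncoder

# Fit a small encoder on toy data (the values are illustrative only).
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["1"], ["2"], ["3"], ["5"]])

# Serialize the fitted encoder to bytes; these bytes could equally be written to a file.
blob = pickle.dumps(enc)

# Restore the encoder and check that it encodes a sample the same way.
restored = pickle.loads(blob)
before = enc.transform([["2"]]).toarray()
after = restored.transform([["2"]]).toarray()
same = bool((before == after).all())
print(same)
```

A pickled estimator should generally be loaded with the same scikit-learn version that produced it.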

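Main text:

If the encoder meets data at transform time that it did not see during fitting, it raises an error.

If you know which categories can occur, you can specify them when creating the encoder, which fixes the length of the encoding.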

python - How to add your own categories into the OneHotEncoder - Stack Overflow

Example:

from sklearn.preprocessing import OneHotEncoder

a = [['1'], ['2'], ['3'], ['5']]
enc = OneHotEncoder()
X = enc.fit_transform(a)
enc.transform([['4']])

The training data does not contain '4', even though '4' is a possible label, so transforming '4' after fitting raises an error:

ValueError: Found unknown categories ['4'] in column 0 during transform
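To see why this happens, it can help to inspect what the encoder actually learned. The following is a minimal sketch reusing the toy data from the example; the variable names `learned` and `raised` are only for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

a = [["1"], ["2"], ["3"], ["5"]]
enc = OneHotEncoder()  # handle_unknown defaults to "error"
enc.fit(a)

# Only the categories present in the training data were learned.
learned = list(enc.categories_[0])
print(learned)

# Transforming a label outside that set raises ValueError.
try:
    enc.transform([["4"]])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```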

Solution:

There can be two cases here.

1. If you know all the categories beforehand.

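Pass all the possible categories when the OneHotEncoder is initialized. The `categories` argument expects one list of categories per feature, so a single column needs a list wrapped in an outer list.

enc = OneHotEncoder(categories=[[str(i) for i in range(10)]])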
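As a quick check of case 1, the sketch below fits on the same toy data while declaring the labels '0' through '9' up front. It assumes a single feature column, so `categories` holds exactly one inner list; `all_labels` is a name introduced here for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

a = [["1"], ["2"], ["3"], ["5"]]

# Declare every possible label for the single column: "0" through "9".
all_labels = [str(i) for i in range(10)]
enc = OneHotEncoder(categories=[all_labels])

X = enc.fit_transform(a)
print(X.shape)  # the width follows the declared categories, not the data seen

# '4' was never in the training data, but it is a declared category.
row = enc.transform([["4"]]).toarray()
print(row)
```

The output always has ten columns here, which is what makes the encoding length fixed.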

2. If you don't know some categories beforehand.

# This argument is set to `error` by default, so an error is raised if an
# unknown category is encountered. With `ignore`, unknown categories are
# encoded as all zeros instead.
enc = OneHotEncoder(handle_unknown='ignore')
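For case 2, a small sketch on the same toy data shows what `handle_unknown='ignore'` does with an unseen label. Note that the output width still depends on the categories seen during fitting, so this option avoids the error but does not by itself fix the length.

```python
from sklearn.preprocessing import OneHotEncoder

a = [["1"], ["2"], ["3"], ["5"]]
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(a)

# The unseen label '4' is encoded as a row of zeros instead of raising.
row = enc.transform([["4"]]).toarray()
print(row)
```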

Usage:

# Declare the same 20 possible values (0-19) for each of the n columns.
n = y_train_pred.shape[1]
nvalues = [list(range(20)) for _ in range(n)]


nvalues
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
...]]

Data to be encoded

y_train_pred

array([[2, 2, 3, ..., 0, 5, 0],
       [2, 2, 3, ..., 0, 5, 0],
       [2, 2, 3, ..., 0, 5, 0],
       ...,
       [2, 2, 3, ..., 0, 5, 0],
       [2, 2, 3, ..., 0, 5, 0],
       [1, 1, 1, ..., 1, 1, 1]], dtype=int32)



y_train_pred.shape # (465726, 89)

Encoding

enc = OneHotEncoder(categories = nvalues)

enc.fit(y_train_pred)

enc.categories_

# Encode; toarray() already returns a dense NumPy array.
train_new_feature = enc.transform(y_train_pred).toarray()


train_new_feature
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])


train_new_feature.shape[1] # 1780

The 89 features are encoded into 1780 features (89 columns × 20 categories each).
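The same arithmetic can be checked on a small scale. The sketch below uses a made-up 6 × 3 array named `leaf_idx` as a stand-in for `y_train_pred` (the real array is not reproduced here) and assumes 20 possible values per column, as above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for y_train_pred: 6 rows, 3 columns of integer codes.
leaf_idx = np.array(
    [[2, 2, 3], [2, 2, 3], [1, 1, 1], [0, 5, 0], [2, 0, 3], [1, 1, 1]],
    dtype=np.int32,
)

n_values = 20  # assumed number of possible values per column
categories = [list(range(n_values)) for _ in range(leaf_idx.shape[1])]

enc = OneHotEncoder(categories=categories)
encoded = enc.fit_transform(leaf_idx).toarray()

# The output width is the total number of declared categories: 3 * 20.
expected_width = sum(len(c) for c in enc.categories_)
print(encoded.shape, expected_width)

# Values never seen during fitting still encode to the same width.
unseen = enc.transform(np.array([[19, 7, 12]], dtype=np.int32)).toarray()
print(unseen.shape)
```

Because the width depends only on the declared categories, training and prediction data end up with the same number of columns.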
