OneHotEncoder()函数

编码类别

1. OrdinalEncoder 哑编码

作用
有时候特征不是连续值而是间断值,例如一个人的性别的值域为["male", "female"],国籍的值域为["from Europe", "from US", "from Asia"],常用浏览器的值域为["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
['male', 'from US', 'uses Safari']['female', 'from Europe', 'uses Firefox']这两个人用[0, 1, 2]和[1, 0, 0]表示。

代码

enc = OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
print(enc.transform([['female', 'from US', 'uses Safari']]))  # [[0. 1. 1.]]

2. OneHotEncoder 独热编码

作用
分类编码变量,将每一个类可能取值的特征变换为二进制特征向量,每一类的特征向量只有一个地方是1,其余位置都是0

代码

2.1 自动推断类别

enc = OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
print(enc.categories_)
"""[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]"""
print(enc.get_feature_names())
"""['x0_female' 'x0_male' 'x1_from Europe' 'x1_from US' 'x2_uses Firefox' 'x2_uses Safari']"""
print(enc.transform([['female', 'from US', 'uses Safari'],
                     ['male', 'from Europe', 'uses Safari']]).toarray())
"""
[[1. 0. 0. 1. 0. 1.]
 [0. 1. 1. 0. 0. 1.]]
"""

可以看出性别、国籍、常用浏览器均有两个值可选,那么每两个值为一类。例如[1. 0. 0. 1. 0. 1.],前两个值[1. 0.]为female,中两个值[0. 1.]为from US,后两个值[0. 1.]为uses Safari

2.2 手动设置类别

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
print(enc.get_feature_names())
"""['x0_female' 'x0_male' 'x1_from Africa' 'x1_from Asia' 'x1_from Europe' 'x1_from US' 'x2_uses Chrome' 'x2_uses Firefox' 'x2_uses IE' 'x2_uses Safari']"""
print(enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray())
"""[[1. 0. 0. 1. 0. 0. 1. 0. 0. 0.]]"""

你可能感兴趣的:(Python)