Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K encoding, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0.
Continuing the example above:
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<... 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
By default, the values each feature can take are inferred automatically from the dataset and can be found in the categories_ attribute:
>>> enc.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]
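To make the column layout concrete, here is a minimal pure-Python sketch of the same mapping. It is not scikit-learn's implementation: the helper names fit_categories and one_hot are hypothetical, and the sketch assumes that categories are collected per column and sorted, which matches the order shown in categories_ above.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of each column (hypothetical helper)."""
    n_features = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_features)]


def one_hot(rows, categories):
    """Encode each row as the concatenation of one indicator block per feature."""
    encoded = []
    for row in rows:
        out = []
        for value, cats in zip(row, categories):
            # list.index raises ValueError if the value is not a known category.
            position = cats.index(value)
            out.extend(1.0 if k == position else 0.0 for k in range(len(cats)))
        encoded.append(out)
    return encoded


X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
categories = fit_categories(X)
encoded = one_hot([['female', 'from US', 'uses Safari'],
                   ['male', 'from Europe', 'uses Safari']], categories)
print(encoded)
```

Each feature contributes a block of two columns here, so the six output columns read as (female, male, from Europe, from US, uses Firefox, uses Safari).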
It is possible to specify this explicitly using the parameter categories. There are two genders, four possible continents and four web browsers in our dataset:
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that some categorical values of the 2nd and 3rd features are
>>> # missing from the data below
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=[...], drop=None,
              dtype=<... 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
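The width of the encoded row above (ten columns) follows directly from the explicit category lists. As a rough illustration in plain Python, independent of scikit-learn, the sketch below computes where each feature's block starts and which column is set for the transformed sample. The names offsets and active_columns are hypothetical and used only for this example.

```python
from itertools import accumulate

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
categories = [genders, locations, browsers]

# Starting column of each feature's indicator block.
offsets = [0] + list(accumulate(len(c) for c in categories))[:-1]
# Total number of output columns: 2 + 4 + 4.
n_columns = sum(len(c) for c in categories)


def active_columns(row):
    """Return the index of the single 1 in each feature's block."""
    return [start + cats.index(value)
            for value, cats, start in zip(row, categories, offsets)]


columns = active_columns(['female', 'from Asia', 'uses Chrome'])
print(n_columns, offsets, columns)  # 10 [0, 2, 6] [0, 3, 6]
```

Columns 0, 3 and 6 are exactly the positions of the ones in the transformed array shown above.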
If the training data might not contain every category that a feature can take, it is often better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error is raised, but the one-hot encoded columns for that feature are all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<... 'numpy.float64'>, handle_unknown='ignore',
              n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
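The all-zeros behaviour can be sketched in a few lines of plain Python. This is only an illustration of the encoding rule described above, not the library's code; the function name one_hot_ignore_unknown is hypothetical, and the category lists are written out by hand as they would be inferred from the two training rows.

```python
def one_hot_ignore_unknown(row, categories):
    """Encode one row; an unseen value leaves its feature's block all zeros."""
    out = []
    for value, cats in zip(row, categories):
        block = [0.0] * len(cats)
        if value in cats:
            block[cats.index(value)] = 1.0
        out.extend(block)
    return out


# Categories as they would be inferred from the two training rows.
categories = [['female', 'male'],
              ['from Europe', 'from US'],
              ['uses Firefox', 'uses Safari']]

encoded = one_hot_ignore_unknown(['female', 'from Asia', 'uses Chrome'], categories)
print(encoded)  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

One consequence of this rule is that two different unknown categories of the same feature become indistinguishable after encoding, since both map to an all-zero block.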
It is also possible to encode each column into n_categories - 1 columns instead of n_categories columns by using the drop parameter. This parameter allows the user to specify a category for each feature to be dropped. This is useful to avoid co-linearity in the input matrix for some estimators, for example when using non-regularized regression (LinearRegression), since co-linearity would cause the covariance matrix to be non-invertible. When this parameter is not None, handle_unknown must be set to 'error':
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])
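The source of the co-linearity can be seen without any linear algebra library: in a full one-hot encoding, the columns belonging to one feature always sum to 1 in every row, so they are linearly dependent on the columns of any other feature and on an intercept column. The plain-Python sketch below illustrates this under the assumption that dropping the first category simply removes the first column of each block; the encode helper is hypothetical and is not scikit-learn's implementation.

```python
categories = [['female', 'male'],
              ['from Europe', 'from US'],
              ['uses Firefox', 'uses Safari']]
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]


def encode(row, drop_first):
    """One-hot encode a row, optionally dropping each block's first column."""
    out = []
    for value, cats in zip(row, categories):
        block = [1.0 if value == c else 0.0 for c in cats]
        out.extend(block[1:] if drop_first else block)
    return out


full = [encode(row, drop_first=False) for row in X]
reduced = [encode(row, drop_first=True) for row in X]

# In the full encoding, each feature's block sums to 1 in every row.
gender_sums = [row[0] + row[1] for row in full]
location_sums = [row[2] + row[3] for row in full]

print(gender_sums, location_sums)  # [1.0, 1.0] [1.0, 1.0]
print(reduced)  # [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
```

The reduced rows match the output of drop_enc.transform(X) above: a 1 now means "not the dropped category", and the dropped category is represented by a 0 in its feature's remaining column.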