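The earlier code used the classes sklearn.pipeline.Pipeline and sklearn.preprocessing.PolynomialFeatures several times. When I went looking for material on them, I found very few articles or blog posts, apart from the official English documentation, which is in fact very well written. Since my English is limited, I decided to write some notes on these two classes.
1. The sklearn.preprocessing.PolynomialFeatures class
Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
First, note that it is a class. Its full signature is: class sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
The official description reads: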
Generate polynomial and interaction features.
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
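My understanding: this class generates polynomial features, meaning both the powers of each individual feature and the interaction (cross-product) terms between features. For example, for a two-dimensional input sample of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Parameters (the signature above has only three):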
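To make the [a, b] example concrete, here is a minimal sketch that expands one sample and compares it with the six terms written out by hand. The values a=2 and b=3 are arbitrary illustrative choices, not from the documentation.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample [a, b]; a=2 and b=3 are arbitrary illustrative values.
a, b = 2.0, 3.0
sample = np.array([[a, b]])

# Expand the sample into its degree-2 polynomial features.
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

# The same six terms written by hand, in the documented order
# [1, a, b, a^2, ab, b^2].
by_hand = np.array([[1.0, a, b, a**2, a * b, b**2]])

print(expanded)
print(np.allclose(expanded, by_hand))
```

Picking distinct values for a and b makes it easy to see which output column corresponds to which term.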
degree : integer
The degree of the polynomial features. Default = 2.
The degree of the polynomial; the default is 2.
interaction_only : boolean, default = False
If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).
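If True (the default is False), only interaction features are produced, i.e. products of distinct input features. Powers of a single feature, such as a^2, are left out.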
include_bias : boolean
If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
Whether to include the bias column (a column of all ones); the default is True.
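One way to get a feel for the three parameters is to change one at a time and watch the number of output columns. The sketch below does that on a small 3x2 matrix; the `settings` and `widths` dictionaries are just illustrative helpers.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A small input with 3 samples and 2 features.
X = np.arange(6).reshape(3, 2)

# Hypothetical labels mapped to the keyword arguments being tried.
settings = {
    "defaults": {},
    "degree=3": {"degree": 3},
    "interaction_only=True": {"interaction_only": True},
    "include_bias=False": {"include_bias": False},
}

# Record how many output columns each setting produces.
widths = {}
for label, kwargs in settings.items():
    widths[label] = PolynomialFeatures(**kwargs).fit_transform(X).shape[1]
    print(label, widths[label])
```

With two input features, the defaults give 6 columns, degree=3 gives 10, interaction_only=True drops a^2 and b^2 to leave 4, and include_bias=False drops the column of ones to leave 5.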
Attributes:
powers_ : array, shape (n_output_features, n_input_features)
powers_[i, j] is the exponent of the jth input in the ith output.
n_input_features_ : int
The total number of input features.
The number of input features.
n_output_features_ : int
The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
The number of polynomial output features. It is computed by iterating over all suitably sized combinations of the input features.
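The attributes are easier to read with a fitted object in hand. The sketch below prints n_output_features_ and powers_, then rebuilds the transformed matrix from powers_ alone to show what the exponents mean. It deliberately avoids n_input_features_, since the name of that attribute may differ between scikit-learn releases; the `rebuilt` array is only an illustration, not how the library computes its result.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6, dtype=float).reshape(3, 2)
poly = PolynomialFeatures(degree=2).fit(X)

# Number of generated columns: 1, a, b, a^2, ab, b^2.
print(poly.n_output_features_)

# One row per output column, one entry per input feature:
# powers_[i, j] is the exponent of input j in output i.
print(poly.powers_)

# Illustration only: raise every input to its exponent and multiply
# across inputs, which reproduces the transformed matrix.
rebuilt = np.prod(X[:, np.newaxis, :] ** poly.powers_[np.newaxis, :, :], axis=2)
print(np.allclose(rebuilt, poly.transform(X)))
```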
Note: Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
In other words, do not set the degree too high: the number of output features grows quickly, and the model may overfit.
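To see how quickly the feature count grows, one can count the monomials directly. The sketch below assumes the default settings (all terms up to the given degree, bias column included), in which case the count is the binomial coefficient C(n + d, d) for n input features and degree d. The helper name `n_poly_features` is hypothetical and not part of scikit-learn.

```python
from math import comb


def n_poly_features(n_features, degree):
    """Count monomials of total degree <= degree in n_features variables.

    Assumes interaction_only=False and include_bias=True.
    """
    return comb(n_features + degree, degree)


# Growth for a few input widths and degrees 1 through 4.
for n_features in (2, 10, 100):
    counts = [n_poly_features(n_features, d) for d in (1, 2, 3, 4)]
    print(n_features, counts)
```

For two features and degree 2 this gives 6, matching the [1, a, b, a^2, ab, b^2] example; for 100 features, degree 3 already yields well over a hundred thousand columns.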
Methods:
1.fit(X, y=None)
Compute number of output features.
Computes the number of output features.
2.fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a
transformed version of X.
Parameters:
X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns:
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Input: the input feature matrix.
Returns: the output feature matrix.
3.get_params([deep])
Get parameters for this estimator.
4.set_params(**params)
Set the parameters of this estimator.
5.transform(X[, y])
Transform data to polynomial features.
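The methods listed above fit together roughly as in the following sketch. The arrays `X_train` and `X_new` are made-up illustrative data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.arange(6).reshape(3, 2)  # illustrative training rows
X_new = np.array([[6, 7]])            # an illustrative unseen row

poly = PolynomialFeatures(degree=2)
poly.fit(X_train)                     # works out the output feature layout
train_out = poly.transform(X_train)   # expand the training rows
new_out = poly.transform(X_new)       # same expansion applied to new rows

# fit_transform is the one-call shortcut for fit followed by transform.
shortcut = PolynomialFeatures(degree=2).fit_transform(X_train)
same = np.allclose(train_out, shortcut)

params = poly.get_params()            # dict of the constructor arguments
poly.set_params(degree=3)             # takes effect at the next fit
wider = poly.fit_transform(X_train)

print(new_out)
print(same, params["degree"], wider.shape)
```

Calling fit once and then transform on later data is the pattern that matters when the same expansion must be applied to both training and test sets.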
Example:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)  # degree 2, everything else left at its default
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)  # degree stays at the default 2; keep interaction terms only
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
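Note: in the arrays above, each row is one sample. For instance, the row [0, 1] plays the role of [a, b] from earlier, so its polynomial output [1, a, b, a^2, ab, b^2] is [1, 0, 1, 0, 0, 1].
With interaction_only=True, only the interaction terms are kept, so the output for [a, b] becomes [1, a, b, ab]. Terms in which a feature interacts with itself, such as a^2 or a*b^2, do not appear.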
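The same idea extends to more features and higher degrees. As a sketch, with three features [a, b, c], degree=3 and interaction_only=True, the output should contain only products of distinct features: [1, a, b, c, ab, ac, bc, abc]. The values 2, 3 and 5 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample [a, b, c]; 2, 3, 5 are arbitrary illustrative values.
a, b, c = 2.0, 3.0, 5.0
sample = np.array([[a, b, c]])

poly = PolynomialFeatures(degree=3, interaction_only=True)
interactions = poly.fit_transform(sample)

# Products of distinct features only: no a^2, no a*b^2, and so on.
by_hand = np.array([[1.0, a, b, c, a * b, a * c, b * c, a * b * c]])

print(interactions)
print(np.allclose(interactions, by_hand))
```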