sklearn provides many linear regression models; for details see: https://www.cnblogs.com/pinard/p/6026343.html
sklearn wraps the common data-mining algorithms behind a uniform interface: you can generally train a model with fit, make predictions with predict, and evaluate it with score. LinearRegression implements multiple linear regression, and it can of course also be used for simple (one-variable) linear regression by passing the data as a list of lists. Below is a detailed description of LinearRegression.
Instantiation
sklearn designs its estimators with simplicity in mind, so instantiation is easy: clf = LinearRegression() is all it takes. Still, a few parameters are worth knowing:
fit_intercept: whether to fit an intercept; defaults to True.
normalize: whether to normalize the features before fitting; defaults to False.
The remaining parameters are rarely needed and are not covered here; see the official documentation for details.
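A minimal sketch of instantiation with these two parameters spelled out, assuming an sklearn version in which LinearRegression still accepts normalize (it was removed in later releases):

```python
from sklearn.linear_model import LinearRegression

# Fit an intercept (the default) and leave normalization off (also the default).
clf = LinearRegression(fit_intercept=True, normalize=False)
print(clf)
```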
Fitting
The example earlier already used fit to perform the regression computation, and the way it is called is quite simple.
fit(X, y, sample_weight=None): X and y are passed in as matrices (arrays); sample_weight is a weight for each training sample, also passed as an array.
predict(X): prediction method; returns the predicted values y_pred.
score(X, y, sample_weight=None): scoring function; returns the R² of the prediction, which is at most 1 and can be negative.
Equation
LinearRegression stores the fitted equation in two attributes: coef_ holds the regression coefficients and intercept_ holds the intercept, so to inspect the equation you simply read these two attributes.
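A minimal sketch of the fit / predict / score workflow and of reading coef_ and intercept_, on a made-up dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2*x1 + 3*x2 + 1 plus a little noise (purely for illustration).
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.01 * rng.randn(100)

clf = LinearRegression()
clf.fit(X, y)                       # train
y_pred = clf.predict(X)             # predict
print(clf.score(X, y))              # R^2 score, at most 1
print(clf.coef_, clf.intercept_)    # regression coefficients and intercept
```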
Polynomial regression is really just a variant of multiple regression: instead of supplying a feature vector X, you start from a single value x, expand it into a vector of powers up to the chosen degree, and then fit it with LinearRegression as usual. sklearn already provides this expansion in sklearn.preprocessing.PolynomialFeatures.
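A quick sketch of what PolynomialFeatures produces for a single-feature input:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])          # one feature per sample
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]  -> columns are 1, x, x^2, x^3
```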
Example
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.model_selection import train_test_split
# Simple or multiple linear regression
# lr = linear_model.LinearRegression(normalize=True)  # normalize defaults to False
boston = datasets.load_boston()
X = boston.data
y = boston.target
train_x,test_x,train_y,test_y = train_test_split(X,y,test_size=0.3)
# lr.fit(train_x,train_y)
# score = cross_val_score(lr,train_x,train_y,cv=10,scoring='neg_mean_squared_error')  # (negated) mean squared error
# print(score.mean())
# print(score)
# print(lr.score(test_x,test_y))
#Polynomial regression
from sklearn.preprocessing import PolynomialFeatures
# for k in range(1,4):
# lr_featurizer = PolynomialFeatures(degree=k)  # generates polynomial features; degree is the highest power
# print ( '-----%d-----' % k)
# X_pf_train = lr_featurizer.fit_transform(train_x)
# X_pf_test = lr_featurizer.transform(test_x)
#
# pf_scores = cross_val_score(lr, X_pf_train, train_y, cv=10, scoring='neg_mean_squared_error')
# print (pf_scores.mean())
#
# lr.fit(X_pf_train, train_y)
# print (lr.score(X_pf_test, test_y))
# print (lr.score(X_pf_train, train_y))
#
#Polynomial regression tends to overfit; regularization can help
from sklearn.linear_model import Lasso
lr_featurizer = PolynomialFeatures(degree=3)  # generates polynomial features; degree is the highest power
X_pf_train = lr_featurizer.fit_transform(train_x)
X_pf_test = lr_featurizer.transform(test_x)
# LASSO regression:
alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]
for a in alphas:
    print('----%f-----' % a)
    lasso = Lasso(alpha=a, normalize=True)
    pf_scores = cross_val_score(lasso, X_pf_train, train_y, cv=10, scoring='neg_mean_squared_error')
    print(pf_scores.mean())
    lasso.fit(X_pf_train, train_y)
    print(lasso.score(X_pf_test, test_y))
    print(lasso.score(X_pf_train, train_y))
# Ridge can be used for regularization in the same way.
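Along the same lines, a minimal sketch with Ridge (L2 regularization) instead of Lasso, reusing X_pf_train / X_pf_test from above; the alpha values are only illustrative:

```python
from sklearn.linear_model import Ridge

for a in [0.01, 0.1, 1.0]:           # candidate regularization strengths (illustrative)
    ridge = Ridge(alpha=a)
    ridge.fit(X_pf_train, train_y)
    print(a, ridge.score(X_pf_train, train_y), ridge.score(X_pf_test, test_y))
```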
Logistic regression is actually a classification algorithm.
It can output class probabilities as well as hard class labels, but it only models linear decision boundaries. It turns the discrepancy between the predicted probabilities and the true labels into a loss function and obtains the model parameters by minimizing that loss, which yields the final model.
Official API: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
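To make the "probability, then loss" description concrete, a small NumPy sketch of the sigmoid mapping and the log loss that is minimized (independent of sklearn; the scores below are made-up values standing in for w·x + b):

```python
import numpy as np

def sigmoid(z):
    # Maps the linear score w.x + b to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p_pred):
    # Binary cross-entropy; minimizing this over w, b fits the model.
    p_pred = np.clip(p_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([0, 1, 1])
scores = np.array([-2.0, 0.5, 3.0])   # hypothetical linear scores w.x + b
print(log_loss(y_true, sigmoid(scores)))
```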
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
penalty : str, 'l1' or 'l2', default: 'l2'
Used to specify the norm used in the penalization. The 'newton-cg', 'sag' and 'lbfgs' solvers support only l2 penalties.
dual : bool, default: False
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.
C : float, default: 1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept : bool, default: True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
intercept_scaling : float, default: 1
Useful only when the solver 'liblinear' is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a "synthetic" feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.
Note! The synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
Only useful when the solver is "liblinear" and fit_intercept is set to True.
solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag'}, default: 'liblinear'
Algorithm to use in the optimization problem.
For small datasets, 'liblinear' is a good choice, whereas 'sag' is faster for large ones.
For multiclass problems, only 'newton-cg', 'sag' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty.
Note that 'sag' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
a) liblinear: implemented with the open-source liblinear library; uses coordinate descent to iteratively optimize the loss function.
b) lbfgs: a quasi-Newton method; uses the second-derivative (Hessian) matrix of the loss function to iteratively optimize it.
c) newton-cg: another member of the Newton family; also uses the Hessian matrix of the loss function to iteratively optimize it.
d) sag: stochastic average gradient descent, a variant of gradient descent; unlike plain gradient descent it uses only a subset of the samples to compute the gradient in each iteration, which makes it suitable when there are many samples.
| penalty | solver | When to use |
|---------|--------|-------------|
| L1 | liblinear | liblinear is suited to small datasets. If L2 regularization still overfits (prediction is poor), consider L1; L1 is also useful when the model has many features and you want the coefficients of unimportant features driven to zero, i.e. a sparse model. |
| L2 | liblinear | liblinear only supports the OvR scheme for multinomial logistic regression, not MvM, even though MvM is generally more accurate. |
| L2 | lbfgs / newton-cg / sag | For larger datasets; these solvers support both one-vs-rest (OvR) and many-vs-many (MvM) multinomial logistic regression. |
| L2 | sag | If the sample size is very large, say more than 100,000, sag is the first choice; note it cannot be used with L1 regularization. |
Exactly how OvR and MvM differ is explained in the next section.
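As a quick illustration of the penalty/solver combinations in the table above, a sketch of how they map to constructor arguments (for the sklearn version documented here, where L1 is only supported by liblinear):

```python
from sklearn.linear_model import LogisticRegression

# L1 penalty is only available with the liblinear solver (in this sklearn version).
clf_l1 = LogisticRegression(penalty='l1', solver='liblinear')

# L2 penalty with a solver that supports the multinomial loss.
clf_l2 = LogisticRegression(penalty='l2', solver='lbfgs')

# For very large sample sizes, sag is usually the fastest L2 option.
clf_sag = LogisticRegression(penalty='l2', solver='sag', max_iter=1000)
```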
multi_class : str, {'ovr', 'multinomial'}, default: 'ovr'
Multiclass option can be either 'ovr' or 'multinomial'. If the option chosen is 'ovr', then a binary problem is fit for each label. Else the loss minimised is the multinomial loss fit across the entire probability distribution. Works only for the 'newton-cg', 'sag' and 'lbfgs' solvers.
New in version 0.18: Stochastic Average Gradient descent solver for 'multinomial' case.
The idea of OvR is simple: no matter how many classes there are, the problem is treated as a series of binary logistic regressions. For the K-th class, we take all samples of class K as positive examples and all remaining samples as negative examples, fit a binary logistic regression on that split, and obtain the classifier for class K. The classifiers for the other classes are obtained in the same way.
MvM is more involved; here its special case one-vs-one (OvO) serves as an illustration. If there are T classes, we repeatedly pick two classes, say T1 and T2, gather all samples whose label is T1 or T2, treat T1 as positive and T2 as negative, and fit a binary logistic regression to obtain the model parameters. In total T(T-1)/2 such classifiers are needed.
Clearly OvR is simpler, but its accuracy is usually slightly worse (for most sample distributions; for some distributions OvR may actually do better). MvM is more accurate but slower than OvR. If you choose ovr, all four optimizers (liblinear, newton-cg, lbfgs and sag) are available; if you choose multinomial, only newton-cg, lbfgs and sag can be used.
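A minimal sketch comparing the two multi_class modes on iris; the scores depend on the random split, and newer sklearn versions may warn that multi_class is deprecated:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ovr = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
mnl = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
for name, clf in [('ovr', ovr), ('multinomial', mnl)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```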
class_weight : dict or 'balanced', default: None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
New in version 0.17: class_weight='balanced' instead of deprecated class_weight='auto'.
Here n_samples is the number of samples, n_classes the number of classes, and np.bincount(y) gives the number of samples in each class; for example, with y = [1, 0, 0, 1, 1], np.bincount(y) = [2, 3].
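A quick sketch that evaluates the balanced-weight formula for exactly that example:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1])
n_samples, n_classes = len(y), len(np.unique(y))
weights = n_samples / (n_classes * np.bincount(y))
print(weights)   # [1.25, 0.833...]: the rarer class 0 gets the larger weight
```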
In classification we frequently run into two kinds of problems:
The first is when misclassification is very costly. For example, when separating legitimate users from illegitimate ones, classifying an illegitimate user as legitimate is very costly; we would rather classify a legitimate user as illegitimate (which can then be double-checked manually) than let an illegitimate user pass as legitimate. In this case we can appropriately raise the weight of the illegitimate class.
The second is highly imbalanced samples. Suppose we have 10,000 binary samples of legitimate and illegitimate users, 9,995 legitimate and only 5 illegitimate. If we ignore the weights, we could predict every test sample as legitimate and achieve a theoretical accuracy of 99.95%, which is meaningless. Here we can choose balanced and let the library raise the weight of the illegitimate samples automatically.
Raising the weight of a class will, compared with the unweighted case, push more samples into that high-weight class, which addresses both problems above.
For the second problem, imbalanced samples, we can alternatively use the per-sample weight parameter sample_weight instead of class_weight; sample_weight is covered in the next section.
sample_weight (a parameter of the fit method; see the fit section below). The remaining constructor parameters:
max_iter : int, default: 100
Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.
random_state : int, RandomState instance or None, default: None
The seed of the pseudo random number generator to use when shuffling the data. Used only in solvers 'sag' and 'liblinear'.
tol : float, default: 1e-4
Tolerance for stopping criteria, i.e. the error tolerance used to decide when the iterations terminate.
verbose : int, default: 0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.
warm_start : bool, default: False
When set to True, reuse the solution of the previous call to fit as initialization, otherwise just erase the previous solution. Useless for liblinear solver.
New in version 0.17: warm_start to support lbfgs, newton-cg, sag solvers.
n_jobs : int, default: 1
Number of CPU cores used during the cross-validation loop. If given a value of -1, all cores are used.
The LogisticRegression class provides the following methods; the most commonly used are fit and predict.
fit(X, y, sample_weight=None)
Fit the model according to the given training data.
Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape (n_samples,)
Target vector relative to X.
sample_weight : array-like, shape (n_samples,), optional
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
New in version 0.17: sample_weight support to LogisticRegression.
Returns:
self : object
Returns self.
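A minimal sketch of fit with per-sample weights; the weight values here are arbitrary and only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Give every sample of class 0 three times the weight of the others (illustrative).
weights = np.where(y == 0, 3.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)   # fit returns self, so calls can be chained
print(clf.score(X, y))
```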
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters:
X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns:
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
transform(X, threshold=None)
DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
Reduce X to its most important features.
Uses coef_ or feature_importances_ to determine the most important features. For models with a coef_ for each class, the absolute sum over the classes is used.
Parameters:
X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns:
X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
predict(X)
Predict class labels for samples in X.
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Samples.
Returns:
C : array, shape = [n_samples]
Predicted class label per sample.
Used to predict the labels of the samples, i.e. to classify them; X is the test set.
predict_proba(X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multi_class problem, if multi_class is set to be "multinomial" the softmax function is used to find the predicted probability of each class. Else use a one-vs-rest approach, i.e. calculate the probability of each class assuming it to be positive using the logistic function, and normalize these values across all the classes.
Parameters:
X : array-like, shape = [n_samples, n_features]
Returns:
T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
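A minimal sketch contrasting predict and predict_proba on iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))        # hard class labels
print(clf.predict_proba(X[:3]))  # one probability per class, columns ordered as clf.classes_
print(clf.classes_)
```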
Example:
import numpy as np
import matplotlib.pyplot as plt
# Use train_test_split to split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Load the iris dataset
def load_data():
    iris = datasets.load_iris()
    # Split the dataset into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.30, random_state=0)
    return X_train, X_test, y_train, y_test

# Use LogisticRegression and inspect its classification performance
def test_LogisticRegression(X_train, X_test, y_train, y_test):
    # Choose the model
    cls = LogisticRegression()  # all parameters left at their defaults
    # Train the model on the training data
    cls.fit(X_train, y_train)
    print("Coefficients:%s, intercept %s" % (cls.coef_, cls.intercept_))
    print("Residual sum of squares: %.2f" % np.mean((cls.predict(X_test) - y_test) ** 2))
    print('Score: %.2f' % cls.score(X_test, y_test))

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()  # prepare the dataset
    test_LogisticRegression(X_train, X_test, y_train, y_test)  # call test_LogisticRegression