LASSO-Logistic Models with the R Package glmnet

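     glmnet is one of the more important and popular R packages, and was once hailed as one of the "troika". As the name suggests, glmnet fits GLMs with the Elastic-Net: users can build Linear Regression, Logistic Regression, Multinomial Regression, and other models with Lasso, Elastic-Net, and related regularization. I worked through Glmnet_Vignette.pdf on CRAN and took some notes, first on Linear Regression and then on Logistic Regression (Binomial Models). Let's go straight to the code.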


#####################################################################################################

  #glmnet study notes#

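#How it works, roughly: The elastic-net penalty is controlled by alpha, and bridges the
#gap between lasso (alpha = 1, the default) and ridge (alpha = 0). The tuning parameter lambda controls the
#overall strength of the penalty.
#Lasso and ridge regression both belong to the family of penalized generalized linear models known as the Elastic Net.
#Besides the shared parameter λ, models in this family have a second parameter, α, which controls how the model
#behaves on highly correlated data.
#Lasso is α=1, ridge is α=0, and a general Elastic Net model has 0<α<1.
#The data and scripts below come from Glmnet_Vignette.pdf; the notes are my own understanding.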
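#A minimal sketch of the penalty term described above, using the form given in the vignette:
#lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta))). The helper name enet_penalty and the
#example coefficients are my own illustration and are not part of glmnet.

```r
# Hypothetical helper (not a glmnet function): elastic-net penalty for a coefficient vector
enet_penalty <- function(beta, lambda, alpha) {
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)  # L2 part, active when alpha < 1
  lasso_part <- alpha * sum(abs(beta))         # L1 part, active when alpha > 0
  lambda * (ridge_part + lasso_part)
}

beta <- c(1.5, 0, -2)  # made-up coefficients
pen_lasso <- enet_penalty(beta, lambda = 0.1, alpha = 1)    # 0.1 * 3.5
pen_ridge <- enet_penalty(beta, lambda = 0.1, alpha = 0)    # 0.1 * 6.25 / 2
pen_mixed <- enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # halfway blend of the two parts
print(c(lasso = pen_lasso, ridge = pen_ridge, mixed = pen_mixed))
```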

#Introduction: how Glmnet_Vignette.pdf describes glmnet
#Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization
#path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter
#lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix x. It fits linear, logistic
#and multinomial, poisson, and Cox regression models. A variety of predictions can be made from the fitted
#models. It can also fit multi-response linear regression.
#The authors of glmnet are Jerome Friedman, Trevor Hastie, Rob Tibshirani and Noah Simon, and the R
#package is maintained by Trevor Hastie. The matlab version of glmnet is maintained by Junyang Qian.

######################################################################################################
#I. Linear Regression
library(Matrix)
library(foreach)
library(glmnet)
data(QuickStartExample)
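#Running data() shows that x and y are two datasets shipped with glmnet
#(in recent glmnet releases this call loads a list named QuickStartExample instead, so use x = QuickStartExample$x and y = QuickStartExample$y)
#Note 1: nlambda is the number of lambda values; weights gives the weight of each observation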
fit = glmnet(x, y, alpha = 0.2, weights = c(rep(1,50),rep(2,50)), nlambda = 20)
fit
#Note 2: Df (the number of nonzero coefficients), %Dev (the percent deviance explained) and Lambda (the corresponding value of lambda).
#     Df   %Dev   Lambda
#[1,]  0 0.0000 7.939000
#[2,]  4 0.1789 4.889000
#[3,]  7 0.4445 3.011000
#[4,]  7 0.6567 1.854000
#[5,]  8 0.7850 1.142000
#[6,]  9 0.8539 0.703300
#[7,] 10 0.8867 0.433100
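#Note 2.1: in lasso regression the amount of complexity adjustment is controlled by λ. The larger λ is, the more heavily
#a linear model with many variables is penalized, so the final model ends up with fewer variables.
#The Lambda and Df values above show the same pattern.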
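#A small sketch of the same pattern on simulated data (the data, seed, and object names below are my own
#assumptions, not the vignette's): the fitted object stores lambda and df, so they can be tabulated directly.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
# Only the first five predictors carry signal in this made-up example
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)

fit_sim <- glmnet(x_sim, y_sim, alpha = 1, nlambda = 20)

# One row per lambda on the path: penalty strength and number of nonzero coefficients
path <- data.frame(lambda = fit_sim$lambda, df = fit_sim$df)
print(path)
```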
#Note 3: the vignette gives two conditions under which the computation of the lambda path stops early (glmnet itself fits by
#coordinate descent rather than gradient descent): According to the default internal settings, the computations stop if either the fractional
#change in deviance down the path is less than 10^-5 or the fraction of explained deviance reaches 0.999.
#Note 4: to choose the number of nonzero coefficients you can read the table (fit, print(fit)) or use a plot; the plot's x-axis can show 3 different measures
#Users can decide what is on the X-axis. xvar allows three measures: “norm” for the L1-norm of the coefficients
#(default), “lambda” for the log-lambda value and “dev” for %deviance explained.
#Note 5: label = TRUE marks each curve with the variable's sequence number. Users can also label the curves with variable sequence numbers simply by setting
#label = TRUE.
plot(fit, xvar = "lambda", label = TRUE);plot(fit, xvar = "lambda")
#Note 6: the axis along the top should be the number of nonzero variables at the current lambda: The axis above indicates the number of nonzero coefficients at the current Lambda,
#which is the effective degrees of freedom (df) for the lasso
cvfit = cv.glmnet(x, y, type.measure = "mse", nfolds = 20)
#Note 7: glmnet returns results for a whole sequence of lambda values (a set of models), and the user has to pick one lambda; cross-validation is the most common way to pick it
#The function glmnet returns a sequence of models for the users to choose from. In many cases, users may
#prefer the software to select one of them. Cross-validation is perhaps the simplest and most widely used
#method for that task. cv.glmnet is the main function to do cross-validation here, along with various supporting methods such as
#plotting and prediction.
#One explanation of lambda.1se and lambda.min
#lambda.min is the value of Lambda that gives minimum mean cross-validated error. The other Lambda saved is lambda.1se,
#which gives the most regularized model such that error is within one standard error of the minimum. To use
#that, we only need to replace lambda.min with lambda.1se above.
#Another explanation of lambda.1se and lambda.min
#Functions coef and predict on cv.glmnet object are similar to those for a glmnet object, except that two
#special strings are also supported by s (the values of Lambda requested): * “lambda.1se”: the largest Lambda at which
#the MSE is within one standard error of the minimal MSE. “lambda.min”: the Lambda at which the minimal MSE is achieved.
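#The two definitions above can be reproduced by hand from the cvm (mean CV error) and cvsd (its standard error)
#components of a cv.glmnet object. This is a sketch on simulated data of my own, assuming type.measure = "mse";
#the *_manual names are hypothetical.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)

cv_sim <- cv.glmnet(x_sim, y_sim, type.measure = "mse", nfolds = 10)

# lambda.min: the lambda with the smallest mean cross-validated error
i_min <- which.min(cv_sim$cvm)
lambda_min_manual <- cv_sim$lambda[i_min]

# lambda.1se: the largest lambda whose error is still within one standard error of that minimum
threshold <- cv_sim$cvm[i_min] + cv_sim$cvsd[i_min]
lambda_1se_manual <- max(cv_sim$lambda[cv_sim$cvm <= threshold])

print(c(lambda_min_manual, cv_sim$lambda.min))
print(c(lambda_1se_manual, cv_sim$lambda.1se))
```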
coef(cvfit, s = "lambda.min");as.matrix(coef(cvfit, s = "lambda.min"))
plot(cvfit)
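#Note 7.1: lambda.min (minimum error) and lambda.1se (the simplest model whose error is within one standard error of the minimum) are marked by the two vertical lines in the plot
#It includes the cross-validation curve (red dotted line), and upper and lower standard deviation curves along
#the lambda sequence (error bars). Two selected lambda’s are indicated by the vertical dotted lines (see below).
#Note 8: the result of coef() is a sparse matrix; as.matrix() turns it into an ordinary matrix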
predict(cvfit, newx = x[1:5,], s = 'lambda.min')
predict(cvfit, newx = x[1:5,], s = c(0.1,0.2))
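predict(cvfit, newx = x[1:5,], s= c("lambda.1se","lambda.min"))
#Swapping the order to c("lambda.min","lambda.1se") raises an error. The likely reason: s accepts only a single string here,
#and this vector is accepted only because it is identical to the default value of s, in which case R's match.arg()
#silently uses the first element, "lambda.1se".
#Note 9: the calls above make predictions; if s is a numeric vector with several values, the output is a matrix with one column per value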
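#To get predictions at both lambda.min and lambda.1se side by side, one option is to call predict() once per string
#and bind the results. A sketch on simulated data (data and object names are my own assumptions):

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)
cv_sim <- cv.glmnet(x_sim, y_sim)

# One predict() call per rule; sapply() binds the two result vectors into a 5 x 2 matrix
preds <- sapply(c("lambda.min", "lambda.1se"), function(rule) {
  drop(predict(cv_sim, newx = x_sim[1:5, ], s = rule))
})
print(preds)
```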
#Note 10: everything so far has been about choosing a suitable lambda; different values of alpha can be set as well
#Users can control the folds used. Here we use the same folds so we can also select a value for alpha.
foldid=sample(1:10,size=length(y),replace=TRUE)
cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)
cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)
cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)
#Note 10.1: the built-in plot functions cannot draw all of them on one plot
#There are no built-in plot functions to put them all on the same plot, so we are on our own here:
par(mfrow=c(2,2))
plot(cv1);plot(cv.5);plot(cv0)
plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)
points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")
points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")
legend("topleft",legend=c("alpha= 1","alpha= .5","alpha= 0"),pch=19,col=c("red","grey","blue"))
#Note 11: presumably lasso is called the best model here because its error is low while it also keeps fewer variables
#We see that lasso (alpha=1) does about the best here. We also see that the range of lambdas used differs
#with alpha.
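#The comparison above can also be done numerically. cv.glmnet passes extra arguments such as alpha on to glmnet
#(as the calls above already rely on), so a small grid of alpha values can be cross-validated on the same folds.
#This is only a sketch of one possible approach on simulated data; the grid and all object names are my own assumptions.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)

# Fixed fold assignment so every alpha is evaluated on identical splits
foldid <- sample(rep(1:10, length.out = n))
alphas <- c(0, 0.25, 0.5, 0.75, 1)

cv_list <- lapply(alphas, function(a) {
  cv.glmnet(x_sim, y_sim, foldid = foldid, alpha = a)
})

# For each alpha: the lambda with the lowest CV error, and that error
summary_tab <- data.frame(
  alpha      = alphas,
  lambda_min = sapply(cv_list, function(cv) cv$lambda.min),
  cv_mse     = sapply(cv_list, function(cv) min(cv$cvm))
)
best <- summary_tab[which.min(summary_tab$cv_mse), ]
print(summary_tab)
print(best)
```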
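#Note 12: the coefficients can also be bounded, for example required to lie in the interval (-0.7, 0.5)
#(lower.limits and upper.limits are the full argument names; the vignette abbreviates them to lower and upper)
tfit=glmnet(x,y,lower.limits=-.7,upper.limits=.5)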
plot(tfit)
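#A quick way to confirm the bounds numerically rather than by eye. This is a sketch on simulated data whose true
#coefficients (2 and -1.5) deliberately fall outside the limits; the data and names are my own assumptions.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)

fit_bounded <- glmnet(x_sim, y_sim, lower.limits = -0.7, upper.limits = 0.5)

# beta holds the coefficients (without the intercept) for every lambda on the path
beta_all <- as.matrix(fit_bounded$beta)
coef_range <- range(beta_all)
print(coef_range)
```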
#Note 13: some variables can be forced to stay in the model via penalty factors
#This is very useful when people have prior knowledge or preference over the variables. In many cases, some
#variables may be so important that one wants to keep them all the time, which can be achieved by setting
#corresponding penalty factors to 0:
p.fac = rep(1, 20)
p.fac[c(5, 10, 15)] = 0
pfit = glmnet(x, y, penalty.factor = p.fac)
plot(pfit, label = TRUE)
#Note 13.1: interpreting the result
#We see from the labels that the three variables with 0 penalty factors always stay in the model, while the
#others follow typical regularization paths and are shrunk to 0 eventually.
#Some other useful arguments. exclude allows one to block certain variables from being in the model at all. Of
#course, one could simply subset these out of x, but sometimes exclude is more useful, since it returns a full
#vector of coefficients, just with the excluded ones set to zero. There is also an intercept argument which
#defaults to TRUE; if FALSE the intercept is forced to be zero.
#Note 14: cross-validation can use parallel computing, for linear regression and logistic regression alike
#Parallel computing is also supported by cv.glmnet. To make it work, users must register parallel beforehand.
#We give a simple example of comparison here.
#However, I could not install the doMC package; CRAN shows its status as not available
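#One possible workaround when doMC cannot be installed is the doParallel package, which also registers a foreach
#backend. Note that this swaps in doParallel for the vignette's doMC; the two-core setting and the simulated data
#below are my own assumptions.

```r
library(glmnet)
library(doParallel)

# Register a parallel backend with 2 workers (adjust to the machine)
registerDoParallel(cores = 2)

set.seed(1)
n <- 200; p <- 50
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:5] %*% c(2, -1.5, 1, 0.5, -0.5)) + rnorm(n)
foldid <- sample(rep(1:10, length.out = n))

# Same folds for both runs, so the results should agree
cv_seq <- cv.glmnet(x_sim, y_sim, foldid = foldid)
cv_par <- cv.glmnet(x_sim, y_sim, foldid = foldid, parallel = TRUE)

stopImplicitCluster()
print(c(sequential = cv_seq$lambda.min, parallel = cv_par$lambda.min))
```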
##################################################################################################

#Open questions about Linear Regression, mainly the choice of alpha

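#1. glmnet() has an alpha argument with a default of 1 (a hint that the package leans toward lasso). Why does alpha not appear among the arguments of cv.glmnet()?
#2. Note 10 shows that users can set alpha themselves. Is that how it is handled in practice: the user tries different
#combinations of alpha and lambda and uses this kind of evaluation to find the best alpha and lambda?
To fellow bloggers on CSDN: feel free to share your understanding in the comments. Thanks!
#3. An article online says that glmnet only accepts a numeric matrix as model input, so a categorical predictor has to be
#converted into several columns containing only 0 and 1, a process called One Hot Encoding. I think this is right: since the
#input is already a matrix (see the Introduction), categorical variables certainly need to be processed first.
#4. See this article: http://blog.csdn.net/qiao1245/article/details/53021465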
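#For question 3, a common way to turn a data frame with a factor into a numeric matrix is base R's model.matrix().
#This is a sketch with a made-up data frame (column names age and city are hypothetical). Note that with the default
#treatment contrasts model.matrix() produces k-1 dummy columns per factor rather than a full one-hot encoding.

```r
library(glmnet)

set.seed(1)
dat <- data.frame(
  age  = rnorm(60, mean = 40, sd = 10),
  city = factor(sample(c("A", "B", "C"), 60, replace = TRUE), levels = c("A", "B", "C"))
)
y_sim <- 0.05 * dat$age + ifelse(dat$city == "B", 1, 0) + rnorm(60)

# Expand the factor into 0/1 columns; drop the intercept column because glmnet adds its own
x_mat <- model.matrix(~ age + city, data = dat)[, -1]
print(colnames(x_mat))

fit_cat <- glmnet(x_mat, y_sim)
```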

##################################################################################################

#II. Logistic Regression (Binomial Models)

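#Load the data; running data() shows that x and y are two datasets shipped with glmnet
#(in recent glmnet releases, use x = BinomialExample$x and y = BinomialExample$y after this call)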
data(BinomialExample)
head(x) #the data are fairly high-dimensional, i.e. fairly wide
head(y)
fit = glmnet(x, y, family = "binomial")
plot(fit, xvar = "dev", label = TRUE)
cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
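cvfit1 = cv.glmnet(x, y, family = "binomial", type.measure = "auc")
#Note 1: I tried type.measure = "response" and it indeed cannot be used; "response" is a value of the type argument of predict(), not of type.measure
#Prediction is a little different for logistic from Gaussian, mainly in the option type. “link” and “response” are never
#equivalent and “class” is only available for logistic regression. In summary, “link” gives the linear predictors,
#“response” gives the fitted probabilities, and “class” gives the predicted class label.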
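#How the three prediction types relate can be checked directly: the response is the logistic transform of the link,
#and the class is whichever label has probability above 0.5. This is a sketch on simulated 0/1 data; the data, the value
#s0 = 0.05, and the object names are my own assumptions.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- rbinom(n, size = 1, prob = plogis(x_sim[, 1] - x_sim[, 2]))

fit_bin <- glmnet(x_sim, y_sim, family = "binomial")
s0 <- 0.05  # an arbitrary lambda chosen for illustration

link_hat  <- predict(fit_bin, newx = x_sim[1:10, ], s = s0, type = "link")      # linear predictor
resp_hat  <- predict(fit_bin, newx = x_sim[1:10, ], s = s0, type = "response")  # fitted probability
class_hat <- predict(fit_bin, newx = x_sim[1:10, ], s = s0, type = "class")     # predicted label

print(cbind(link = drop(link_hat), response = drop(resp_hat)))
print(drop(class_hat))
```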
plot(cvfit)
cvfit$lambda.min
cvfit$lambda.1se
#The coefficients of the variables are shown below
coef(cvfit, s = "lambda.min")
predict(cvfit, newx = x[1:10,], s = "lambda.min", type = "class")
