When calling the glm.fit function from the stats package to fit a logistic regression, I ran into the following problem.
1. Load the data
library(C50)
# load data
data(churn)
str(churnTrain)
Inspect the data structure:
'data.frame': 3333 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
$ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
$ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
$ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
$ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
$ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
$ total_day_minutes : num 265 162 243 299 167 ...
$ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
$ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
$ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
$ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
$ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
$ total_night_minutes : num 245 254 163 197 187 ...
$ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
$ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
$ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
$ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
$ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
$ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
$ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
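Split the data into a training set and a test set
# Drop predictors that are not used in the model
churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
set.seed(2)
# Randomly assign each row to group 1 (about 70%, training) or group 2 (about 30%, test)
ind = sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
trainset = churnTrain[ind == 1, ]
testset = churnTrain[ind == 2, ]
trainset1 <- trainset[, -17] # drops only column 17 (churn, the response); the factor predictors in columns 1 and 2 remain
yc <- trainset$churn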
Fit the logistic regression classifier
library(stats)
logicon <- glm.control(
  # epsilon = 1e-8,  # positive convergence tolerance
  maxit = 500        # maximal number of IWLS iterations
  # trace = FALSE    # whether to print output for each iteration
)
logireg <- glm.fit(x = trainset1, y = yc,
  # weights = rep(1, length(yc)),
  start = NULL, etastart = NULL,
  mustart = NULL,
  # offset = rep(0, length(yc)),
  family = binomial(link = "logit"),
  control = logicon
  # intercept = TRUE
)
Error in x[good, , drop = FALSE] * w :
non-numeric argument to binary operator
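This error occurs because some of the predictors are factors. glm.fit expects x to be a numeric design matrix and converts it with as.matrix(); a data frame that contains factor columns becomes a character matrix, so the arithmetic inside glm.fit fails.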
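The mechanism can be reproduced on a small scale. The sketch below uses a hypothetical toy data frame (`toy`, with made-up columns `plan` and `minutes`) rather than the churn data, and only illustrates how as.matrix() behaves; it is not a trace of glm.fit itself.

```r
# Hypothetical toy data standing in for trainset1 (not the churn data)
toy <- data.frame(
  plan    = factor(c("no", "yes", "no", "yes")),
  minutes = c(265.1, 161.6, 243.4, 299.4)
)

# With a factor column present, every cell is coerced to character
m_mixed <- as.matrix(toy)
mode(m_mixed)   # "character"

# With only numeric columns, the matrix stays numeric
m_num <- as.matrix(toy["minutes"])
mode(m_num)     # "numeric"

# Arithmetic on the character matrix fails in the same way as in glm.fit
err <- tryCatch(m_mixed * 1, error = function(e) conditionMessage(e))
print(err)

# List the factor columns that cause the coercion
is_factor <- sapply(toy, is.factor)
names(toy)[is_factor]   # "plan"
```

Running `sapply(trainset1, is.factor)` on the real data should point at the offending columns in the same way.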
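Solution:
Drop the factor predictors (in this data set the first and second predictors, international_plan and voice_mail_plan, are factors)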
trainset1 <- trainset[,-c(1,2,17)]
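Hard-coded column positions break as soon as the column order changes, so it may be safer to select the numeric columns by type. One more point to keep in mind: glm.fit does not add an intercept column to x (its `intercept` argument only concerns the null model), so a fit on the bare predictors is not the same model that glm() would fit. The following is a minimal sketch on simulated data; `toy`, `minutes`, `calls` and the coefficients used to simulate the response are all assumptions made for illustration.

```r
# Simulated stand-in for the churn data (all names and values are hypothetical)
set.seed(1)
n <- 200
toy <- data.frame(
  plan    = factor(sample(c("no", "yes"), n, replace = TRUE)),
  minutes = rnorm(n, mean = 180, sd = 50),
  calls   = rpois(n, lambda = 2)
)
p_stay <- plogis(-1 + 0.005 * toy$minutes + 0.3 * toy$calls)
toy$churn <- factor(ifelse(runif(n) < p_stay, "no", "yes"), levels = c("yes", "no"))

# Keep only the numeric columns, selected by type instead of by position
is_num <- sapply(toy, is.numeric)
x_num  <- as.matrix(toy[, is_num, drop = FALSE])

# Add the intercept column by hand, since glm.fit will not do it
x_int <- cbind(`(Intercept)` = 1, x_num)

fit_fit <- glm.fit(x = x_int, y = toy$churn, family = binomial(link = "logit"))
fit_glm <- glm(churn ~ minutes + calls, data = toy, family = binomial(link = "logit"))

# Both fits should give the same coefficients
print(cbind(glm.fit = fit_fit$coefficients, glm = coef(fit_glm)))
```

Dropping the factor columns also discards whatever information they carry, so this route is mainly useful when those predictors are not needed.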
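Alternatively, use the glm function, which builds the design matrix from the formula, so a data set with factor predictors does not trigger this error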
fit <- glm(churn~.,data = trainset,family = binomial(link = "logit"))
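If glm.fit is still preferred, the factor predictors can be kept by building the design matrix explicitly with model.matrix(), which expands factors into dummy columns and adds the intercept. The sketch below again runs on simulated data (`toy` and its columns are hypothetical) and checks that the result agrees with glm().

```r
# Simulated stand-in for the churn data (all names and values are hypothetical)
set.seed(1)
n <- 200
toy <- data.frame(
  plan    = factor(sample(c("no", "yes"), n, replace = TRUE)),
  minutes = rnorm(n, mean = 180, sd = 50)
)
p_stay <- plogis(1 - 0.004 * toy$minutes + 0.8 * (toy$plan == "yes"))
toy$churn <- factor(ifelse(runif(n) < p_stay, "no", "yes"), levels = c("yes", "no"))

# Expand factors into dummy columns and add the intercept column
x_mm <- model.matrix(churn ~ ., data = toy)
colnames(x_mm)   # "(Intercept)" "planyes" "minutes"

fit_fit <- glm.fit(x = x_mm, y = toy$churn, family = binomial(link = "logit"))
fit_glm <- glm(churn ~ ., data = toy, family = binomial(link = "logit"))

# The two sets of coefficients should match
print(all.equal(fit_fit$coefficients, coef(fit_glm)))
```

With a factor response, the binomial family treats the first level as failure and every other level as success. Because the levels of churn are "yes" and "no" in that order, the fitted probabilities describe "no" (the customer stays), which matters when turning predictions into class labels.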