A tutorial on using the rminer R package for data mining tasks

Data mining

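A typical data mining workflow looks like this:

  1. Business understanding: what is the background, and what is the goal of the problem?
  2. Data understanding: what data is available, which of it is relevant, is there enough of it, and is it correct?
  3. Data preprocessing: data cleaning and data transformation, including feature selection.
  4. Modeling: build classification or regression models.
  5. Model evaluation: how well does the model perform (for example KS, AUC)?
  6. Model deployment: put the trained model to use.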



Data preprocessing

Printing the number of rows and columns

# simple show rows x columns function
nelems=function(d) paste(nrow(d),"x",ncol(d))
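
As a quick check, the helper can be tried on the built-in iris data set, which is only a stand-in here for the data used in this post. The sketch below repeats the one-line definition so that it runs on its own.

```r
# Same helper as above, repeated so the snippet is self-contained
nelems <- function(d) paste(nrow(d), "x", ncol(d))

# iris ships with R: 150 rows and 5 columns
print(nelems(iris))            # "150 x 5"

# Introduce three missing values, then drop the incomplete rows
d <- iris
d[1:3, "Sepal.Length"] <- NA
print(nelems(na.omit(d)))      # "147 x 5"
```

Printing the size before and after each preprocessing step is a cheap way to notice how many rows a cleaning step removed.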

Handling missing values

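library(rminer) # provides imputation()

# 1. Delete every row that contains a missing value
bank4=na.omit(bank3)

# 2. Fill missing ages with the mean of the observed ages
meanage=mean(bank3$age,na.rm=TRUE)
bank5=imputation("value",bank3,"age",Value=meanage)

# 3. Substitute NA values with the values found in the most similar case (1-nearest neighbor):
bank6=imputation("hotdeck",bank3,"age")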
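
The first two strategies can also be written in plain base R, which makes it easier to see what they do. This is a minimal sketch: the toy data frame and its column names are hypothetical and are not the bank data used above.

```r
# Hypothetical toy data: two of the five ages are missing
toy <- data.frame(age     = c(30, NA, 50, NA, 40),
                  balance = c(100, 200, 300, 400, 500))

# 1. Delete rows that contain a missing value
toy_omit <- na.omit(toy)

# 2. Fill missing ages with the mean of the observed ages
meanage <- mean(toy$age, na.rm = TRUE)   # (30 + 50 + 40) / 3 = 40
toy_mean <- toy
toy_mean$age[is.na(toy_mean$age)] <- meanage

print(nrow(toy_omit))   # 3
print(toy_mean$age)     # 30 40 50 40 40
```

Deleting rows is the simplest option but shrinks the data set, while mean filling keeps every row at the cost of reducing the variance of the filled column.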

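Modeling

fit: trains a model and tunes its hyperparameters.
predict: returns the predictions of a fitted model.
mining: runs several fit and predict executions, according to a validation method and a given number of runs.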
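
To make the role of mining more concrete, here is a minimal sketch that repeats a 5-fold cross-validation three times. It assumes the rminer package is installed and uses the built-in iris data set as a stand-in; the choice of rpart, the number of runs, and the seed value 123 are arbitrary.

```r
library(rminer)

# iris stands in for the data set; Species is the class label.
# method = c(validation method, its parameter, random seed)
M <- mining(Species ~ ., data = iris, model = "rpart",
            Runs = 3, method = c("kfold", 5, 123))

# Accuracy computed from the predictions stored in the mining object
acc <- mmetric(M, metric = "ACC")
print(acc)
```

Because mining stores the predictions of every run, the same object can be passed to mmetric again with a different metric without refitting anything.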

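library(rminer)
# math, cmath, inputs and bout are assumed to be defined earlier (not shown in this post)
# ctree: conditional inference tree
B2=fit(schoolsup~.,math[,c(inputs,bout)],model="ctree")
# rpart: decision tree
B1=fit(schoolsup~.,math[,c(inputs,bout)],model="rpart")
# mlpe: multilayer perceptron ensemble
B3=fit(schoolsup~.,math[,c(inputs,bout)],model="mlpe")
# ksvm: support vector machine
B4=fit(schoolsup~.,math[,c(inputs,bout)],model="ksvm")
# randomForest: random forest
C3=fit(Mjob~.,cmath,model="randomForest")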

To try a different algorithm, just change the model argument.
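
Since only the model name changes, several algorithms can be compared in a loop. The sketch below assumes rminer is installed, uses the built-in iris data set in place of the math data, and picks three model names and a seed arbitrarily; it scores each model on a held-out third of the rows.

```r
library(rminer)

# iris stands in for the post's data; Species is the class label
H <- holdout(iris$Species, ratio = 2/3, seed = 123)  # train/test row indices
train <- iris[H$tr, ]
test  <- iris[H$ts, ]

models <- c("rpart", "ctree", "ksvm")
acc <- sapply(models, function(m) {
  M <- fit(Species ~ ., train, model = m)    # only the model name changes
  P <- predict(M, test)                      # predictions for the unseen rows
  mmetric(test$Species, P, metric = "ACC")   # accuracy, in percent
})
print(round(acc, 1))
```

Using the same split for every model keeps the comparison fair, because each algorithm sees exactly the same training and test rows.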

Evaluation

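B1=fit(schoolsup~.,math[,c(inputs,bout)],model="rpart")
# note: test reuses the rows the model was trained on, so the metrics are optimistic
test <- math[,c(inputs,bout)]
# the target must be the same column as in the formula
y <- test$schoolsup
P1=predict(B1,test)

# "ALL" asks for every metric available for the task
m=mmetric(y,P1,metric="ALL")
print(m)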

This returns every metric available for the task.
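
When only a few metrics are needed, they can be named explicitly. It also helps to know what predict returns: as far as I understand the rminer defaults, a factor target is modeled as a probabilistic classification task, so predict gives one probability column per class, while task="class" gives class labels. The sketch below assumes rminer is installed and uses iris as a stand-in; it scores the training rows only to keep the example short.

```r
library(rminer)

# Default task for a factor target: class probabilities
fit_prob <- fit(Species ~ ., iris, model = "rpart")
P <- predict(fit_prob, iris)
print(dim(P))        # one row per case, one column per class

# task = "class": predicted labels instead of probabilities
fit_cls <- fit(Species ~ ., iris, model = "rpart", task = "class")
C <- predict(fit_cls, iris)
print(class(C))

# Ask only for the metrics of interest
print(mmetric(iris$Species, P, metric = c("ACC", "AUC")))

# Confusion matrix from the label predictions
print(mmetric(iris$Species, C, metric = "CONF"))
```

Probability outputs are the more flexible choice, since threshold-based metrics such as AUC need them, whereas label outputs are enough for accuracy or a confusion matrix.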

Values accepted by the model argument (see the rminer help page for fit):

  • naive – most common class (classification) or mean output value (regression)

  • ctree – conditional inference tree (classification and regression, uses [ctree](http://127.0.0.1:10074/help/library/rminer/help/ctree) from party package)

  • cv.glmnet – generalized linear model with lasso or elasticnet regularization (classification and regression, uses [cv.glmnet](http://127.0.0.1:10074/help/library/rminer/help/cv.glmnet) from glmnet package; note: cross-validation is used to automatically set the lambda parameter that is needed to compute the predictions)

  • rpart or dt – decision tree (classification and regression, uses [rpart](http://127.0.0.1:10074/help/library/rminer/help/rpart) from rpart package)

  • kknn or knn – k-nearest neighbor (classification and regression, uses [kknn](http://127.0.0.1:10074/help/library/rminer/help/kknn) from kknn package)

  • ksvm or svm – support vector machine (classification and regression, uses [ksvm](http://127.0.0.1:10074/help/library/rminer/help/ksvm) from kernlab package)

  • mlp – multilayer perceptron with one hidden layer (classification and regression, uses [nnet](http://127.0.0.1:10074/help/library/rminer/help/nnet) from nnet package)

  • mlpe – multilayer perceptron ensemble (classification and regression, uses [nnet](http://127.0.0.1:10074/help/library/rminer/help/nnet) from nnet package)

  • randomForest or randomforest – random forest algorithm (classification and regression, uses [randomForest](http://127.0.0.1:10074/help/library/rminer/help/randomForest) from randomForest package)

  • xgboost – eXtreme Gradient Boosting (Tree) (classification and regression, uses [xgboost](http://127.0.0.1:10074/help/library/rminer/help/xgboost) from xgboost package; note: nrounds parameter is set by default to 2)

  • bagging – bagging (classification, uses [bagging](http://127.0.0.1:10074/help/library/rminer/help/bagging) from adabag package)

  • boosting – boosting (classification, uses [boosting](http://127.0.0.1:10074/help/library/rminer/help/boosting) from adabag package)

  • lda – linear discriminant analysis (classification, uses [lda](http://127.0.0.1:10074/help/library/rminer/help/lda) from MASS package)

  • multinom or lr – logistic regression (classification, uses [multinom](http://127.0.0.1:10074/help/library/rminer/help/multinom) from nnet package)

  • naiveBayes or naivebayes – naive bayes (classification, uses [naiveBayes](http://127.0.0.1:10074/help/library/rminer/help/naiveBayes) from e1071 package)

  • qda – quadratic discriminant analysis (classification, uses [qda](http://127.0.0.1:10074/help/library/rminer/help/qda) from MASS package)

  • cubist – M5 rule-based model (regression, uses [cubist](http://127.0.0.1:10074/help/library/rminer/help/cubist) from Cubist package)

  • lm – standard multiple/linear regression (uses [lm](http://127.0.0.1:10074/help/library/rminer/help/lm))

  • mr – multiple regression (regression, equivalent to [lm](http://127.0.0.1:10074/help/library/rminer/help/lm) but uses [nnet](http://127.0.0.1:10074/help/library/rminer/help/nnet) from nnet package with zero hidden nodes and linear output function)

  • mars – multivariate adaptive regression splines (regression, uses [mars](http://127.0.0.1:10074/help/library/rminer/help/mars) from mda package)

  • pcr – principal component regression (regression, uses [pcr](http://127.0.0.1:10074/help/library/rminer/help/pcr) from pls package)

  • plsr – partial least squares regression (regression, uses [plsr](http://127.0.0.1:10074/help/library/rminer/help/plsr) from pls package)

  • cppls – canonical powered partial least squares (regression, uses [cppls](http://127.0.0.1:10074/help/library/rminer/help/cppls) from pls package)

  • rvm – relevance vector machine (regression, uses [rvm](http://127.0.0.1:10074/help/library/rminer/help/rvm) from kernlab package)

Further reading:

https://repositorium.sdum.uminho.pt/bitstream/1822/36210/1/rminer-tutorial.pdf
