Basic Algorithms for Regression, Classification and Prediction, with Python Implementations
This article is about regression, classification, and predictive analysis of data. It discusses several fairly basic algorithms, which can also be regarded as relatively simple machine learning algorithms.
1. The KNN Algorithm
The nearest-neighbour algorithm can be used for both regression and classification. Its main idea is to take the K samples nearest in the independent variables and average their dependent-variable values to produce a regression or classification result. In general, the larger K is, the smaller the variance of the output, but the bias grows correspondingly; conversely, a small K may cause overfitting. Choosing a reasonable value of K is therefore an important step in the KNN algorithm.
Advantages
First, it is simple and effective. Second, the cost of retraining is low (changes to the category system and to the training set are common in Web and e-commerce applications). Third, its time and space costs scale linearly with the size of the training set (which in some applications is not too large). Fourth, because KNN relies mainly on a limited number of neighbouring samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions overlap heavily. Fifth, the algorithm works best for automatic classification of classes with large sample sizes; classes with few samples are more prone to misclassification.
Disadvantages
• The estimate of the regression function can be highly unstable as it is an average of only a few points. This is the price that we pay for flexibility.
• Curse of dimensionality.
• Generating predictions is computationally expensive.
In summary, the nearest-neighbour algorithm is simple and easy to understand, but its computational cost grows substantially with large data volumes and high dimensionality, so it is not recommended in those situations.
Python implementation of KNN:
(The original code screenshots are omitted here. They covered: selecting the data; a fit with one and, alternatively, two predictor variables; k = 2 and k = 50 as example values; and finally two alternative ways of evaluating the model.)
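A minimal sketch of that workflow, assuming scikit-learn and a toy two-predictor dataset (the column names and data are illustrative, not the author's original code):

```python
# Minimal KNN regression sketch (illustrative data and column names).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Select the data: two predictors and one response (toy example).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["y"], test_size=0.3, random_state=0
)

# Fit KNN with k = 2 and k = 50, then evaluate with RMSE on the test set.
for k in (2, 50):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
    print(f"k={k:>2d}  test RMSE={rmse:.3f}")
```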
It is worth noting that the value of K has to be chosen with care.
One can enumerate a range of K values and pick the one that minimises the RMSE on the test set (the RMSE on the training set only increases as K increases), as in the sketch below.
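A minimal sketch of that search, continuing from the toy variables defined above (an assumption, not the author's original code):

```python
# Enumerate candidate K values and compare training and test RMSE.
results = []
for k in range(1, 51):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, knn.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
    results.append((k, train_rmse, test_rmse))

best_k = min(results, key=lambda r: r[2])[0]   # K with the smallest test RMSE
print("best K by test RMSE:", best_k)
```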
2. Regularised Regression
When fitting a regression, the two main things to weigh are variance and bias. With OLS we only account for the bias: as the number of predictors and the model complexity increase, the bias keeps shrinking, but the variance rises in return, which can seriously affect our predictions. How to balance the two is the question most worth considering.
This is where the concept of regularisation comes in.
2.1 Ridge regression
The second term is also known as the L2 regularisation (penalty) term.
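The formula itself is not reproduced in the original; in its standard form the ridge objective is

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p}\beta_j^{2}
```

where λ ≥ 0 controls the strength of the shrinkage and the second sum is the L2 penalty.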
Advantages
One advantage of ridge regression is that it addresses multicollinearity, and using the ridge model can improve predictive performance. Another advantage is that, by introducing a penalty term, the ridge model substantially alleviates the overfitting problem: the coefficients of unimportant features are shrunk until they become arbitrarily close to zero, efficiently reducing the variance and improving the performance of the prediction model.
Disadvantages
Since the penalised coefficients can become arbitrarily close to zero but never exactly zero, many features remain in the model, which makes it harder to interpret.
In summary, ridge regression can deal with multicollinearity (multicollinearity means that the explanatory variables in a linear regression model are exactly or highly correlated with one another, which distorts the model estimates or makes accurate estimation difficult). It also penalises overfitting; in practice it is worth trying it and checking the size of the resulting error.
Python implementation:
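The original screenshot is not shown; a minimal sketch with scikit-learn's `Ridge` (reusing the toy train/test split from the KNN section, so names like `X_train` are assumptions) might look like this:

```python
# Ridge regression sketch; alpha plays the role of lambda in the formula above.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)   # regularisation strength
ridge.fit(X_train, y_train)
pred = ridge.predict(X_test)
print("coefficients:", ridge.coef_)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```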
2.2 Lasso regression
The second term is also known as the L1 regularisation (penalty) term.
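Again the formula is not reproduced in the original; the standard lasso objective is

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```

with the absolute-value sum as the L1 penalty.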
Its advantages and disadvantages are similar to those of ridge regression, the main difference being that the L1 penalty can shrink some coefficients exactly to zero, so the lasso also performs feature selection.
Python implementation:
It is worth mentioning that, conveniently, once you supply the cv value, Python will automatically select the best lambda by cross-validation, as in the sketch below.
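A minimal sketch of that workflow with scikit-learn's `LassoCV` (an illustration reusing the earlier toy split, not the author's original screenshot):

```python
# LassoCV sketch: cross-validation chooses the best regularisation strength.
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5, random_state=0)   # 5-fold CV over a default alpha grid
lasso.fit(X_train, y_train)
print("best alpha (lambda):", lasso.alpha_)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, lasso.predict(X_test))))
```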
3. XGBoost
XGBoost is, to date, one of the most effective methods for data classification and prediction. Its accuracy stands out among all the methods discussed here, which gives it great practical significance.
XGBoost is based on decision trees: the outputs of many trees are summed to produce the final prediction. An ordinary decision tree is first grown as large as possible and then pruned greedily. XGBoost differs in that each newly added tree is chosen to give the best possible improvement, so that the final result approaches an optimum. In addition, XGBoost penalises model complexity through a regularisation term, which includes the number of leaf nodes in a tree and the squared L2 norm of the scores output at each leaf (I do not yet fully understand the detailed theory).
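For reference, that regularisation term is usually written (in its standard form, reconstructed here rather than copied from the original) as

```latex
\Omega(f) = \gamma\,T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```

where T is the number of leaves in the tree and w_j is the score output at leaf j.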
Algorithm
Advantages
1. Compared with ordinary gradient boosting, XGBoost is faster: its weight update is a Newton step, which needs no line search, so the step length is naturally 1.
2. It has an advantage in feature sorting: XGBoost pre-sorts the data and stores the result in block structures before training, and these blocks can be reused repeatedly in later boosting rounds.
3. XGBoost handles the bias-variance trade-off well: the regularisation term controls model complexity and helps avoid overfitting.
In summary, XGBoost is a very practical and effective algorithm.
Python implementation:
(selecting the parameters)
Parameter tuning is a crucial step: given a search range, Python will automatically select the best combination within it, as in the sketch below.
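As an illustration of that tuning step, the sketch below combines the `xgboost` scikit-learn wrapper with `GridSearchCV`; the parameter grid and the reuse of the earlier toy split are assumptions, not the author's original setup:

```python
# XGBoost tuning sketch: grid search over a small, illustrative parameter range.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
best_pred = search.best_estimator_.predict(X_test)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, best_pred)))
```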
4. LightGBM
LightGBM is an algorithm released by Microsoft in 2016 as an improvement on XGBoost. It mainly improves XGBoost's running speed, with the corresponding cost of some loss in accuracy.
Algorithm
The algorithm is similar to XGBoost, except for the direction in which trees are grown: LightGBM grows trees leaf-wise, whereas traditional algorithms grow trees depth-wise (level-wise). Its parallel-learning scheme, the feature that most distinguishes it from the others, works as follows (Sphinx):
1. Workers find local best split point {feature, threshold} on local feature set.
2. Communicate local best splits with each other and get the best one.
3. Perform the optimum split.
Advantages
1. Optimised for speed and reduced memory usage, especially when training on large amounts of data.
2. Optimised for accuracy: unlike most tree-learning algorithms, LightGBM grows trees leaf-wise rather than depth-wise (leaf-wise growth can overfit when the dataset is small).
3. Optimal splits for categorical features: LightGBM sorts the histogram of a categorical feature by its accumulated values and then finds the best split on the sorted histogram.
Python implementation:
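A minimal sketch using the `lightgbm` scikit-learn interface (the parameters and the reuse of the earlier toy split are assumptions, not the author's original code):

```python
# LightGBM sketch: leaf-wise boosted trees via the scikit-learn interface.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    num_leaves=31,       # main control on leaf-wise tree complexity
    learning_rate=0.1,
    n_estimators=200,
    random_state=0,
)
lgbm.fit(X_train, y_train)
pred = lgbm.predict(X_test)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```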
(All the English passages in this article are excerpted from a report completed jointly with my group members during the course of study.)