This is a well-written article from outside the Great Firewall, so I'm reposting it here. Original:
The post is structured as follows: first a quick refresher on gradient descent, boosting and GBDT, then the split-finding problem, and finally the two optimisations that make LightGBM fast, GOSS and EFB.
Gradient descent is an optimisation technique that reduces the loss function by following its slope (the negative gradient) with a fixed step size.
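For instance, a minimal sketch of gradient descent on a toy one-dimensional loss might look like this (the function names and constants are just for illustration):

```python
# A minimal sketch of gradient descent on a toy quadratic loss,
# assuming a hand-written loss and gradient (not LightGBM code).
def gradient_descent(grad_fn, x0, step_size=0.1, n_steps=100):
    """Follow the negative gradient with a fixed step size."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * grad_fn(x)  # move against the slope
    return x

# Example: minimise L(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # ~3.0, the minimiser of the loss
```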
Boosting is a sequential ensembling technique in which hard-to-classify instances are given more weight. This essentially means that subsequent learners put more emphasis on learning the misclassified data instances. The final model is a weighted average of the n weak learners.
Subsequent learners put more emphasis on under-trained / wrongly classified samples. The variation in dot size represents the weight given to each instance: the larger the dot, the larger the weight assigned to it. Note that all models are trained sequentially.
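As a concrete illustration of this re-weighting idea, here is a minimal AdaBoost-style sketch, assuming binary labels in {-1, +1}; it illustrates boosting in general rather than LightGBM's specific algorithm, and the function names are my own:

```python
# Illustrative AdaBoost-style boosting: misclassified instances get larger
# weights, and the final model is a weighted vote of the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_learners=5):
    n = len(y)
    weights = np.full(n, 1.0 / n)            # start with uniform instance weights
    learners, alphas = [], []
    for _ in range(n_learners):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights * (pred != y)) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # learner weight
        # Misclassified instances get larger weights for the next learner.
        weights *= np.exp(alpha * (pred != y))
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    # Weighted vote of the weak learners (labels assumed to be in {-1, +1}).
    scores = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(scores)
```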
GBDT (Gradient Boosting Decision Tree) is an ensemble of decision trees trained in sequence (i.e. a boosting ensemble). In each iteration, GBDT learns a new decision tree by fitting the residual errors (the errors accumulated up to the current iteration). This means every subsequent learner tries to learn the difference between the actual output and the weighted sum of predictions from the previous iterations. The errors are minimised using the gradient method.
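As a minimal illustration of residual fitting, the sketch below builds a gradient-boosted regression ensemble with squared error, where each tree is fit to the current residuals (which are the negative gradients of that loss); it uses sklearn's DecisionTreeRegressor, and the function names are my own:

```python
# A minimal sketch of gradient boosting for regression with squared error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())   # initial constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # each tree learns what is still missing
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_gbdt(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
```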
This brings us to the second part of the post. The costliest operation in GBDT is training the decision trees, and within that the most time-consuming task is finding the optimum split points.
What are split points?
Split points are the feature values at which the data is divided at a tree node. In the above example, the data is split at node 1 on Height (180) and at node 2 on Weight (80). The optimum splits are selected from a pool of candidate splits on the basis of information gain; in other words, the split points with maximum information gain are selected.
Split finding algorithms are used to find candidate splits.
One of the most popular split-finding algorithms is the pre-sorted algorithm, which enumerates all possible split points on pre-sorted feature values. This method is simple but highly inefficient in terms of computation and memory.
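To make this concrete, here is a minimal sketch of the pre-sorted (exhaustive) approach for a single feature in a regression tree, using variance reduction as the gain; the function name and gain choice are illustrative, not any library's actual implementation:

```python
# Exhaustive (pre-sorted) split finding for one feature: sort the values,
# try every possible split point, and keep the one with maximum gain.
import numpy as np

def best_split_presorted(feature, target):
    order = np.argsort(feature)                  # pre-sort the feature values
    f, t = feature[order], target[order]
    best_gain, best_threshold = -np.inf, None
    total_var = np.var(t) * len(t)
    for i in range(1, len(f)):
        if f[i] == f[i - 1]:
            continue                             # identical values: not a valid split
        left, right = t[:i], t[i:]
        gain = total_var - (np.var(left) * len(left) + np.var(right) * len(right))
        if gain > best_gain:
            best_gain = gain
            best_threshold = (f[i - 1] + f[i]) / 2
    return best_threshold, best_gain
```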
The second method is the histogram-based algorithm, which buckets continuous features into discrete bins and uses these bins to construct feature histograms during training. It costs O(#data * #feature) for histogram building and O(#bin * #feature) for split point finding. Since #bin << #data, histogram building dominates the computational complexity.
(Binning example: binning has greatly reduced the number of candidate splits)
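As a rough sketch of the histogram approach, the snippet below bins one feature into a fixed number of buckets, accumulates per-bin gradient and hessian sums, and scans only the bin boundaries as candidate splits; the gain formula and all names here are illustrative simplifications, not LightGBM's internals:

```python
# Histogram-based split finding: O(#data) to build the histogram,
# O(#bin) to scan the candidate splits.
import numpy as np

def best_split_histogram(feature, gradients, hessians, n_bins=32, lam=1.0):
    # Build the histogram: bucket the feature into n_bins discrete bins.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature)            # bin index for every instance
    grad_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
    hess_hist = np.bincount(bins, weights=hessians, minlength=n_bins)

    # Scan the bins: only n_bins - 1 candidate splits instead of #data.
    G, H = grad_hist.sum(), hess_hist.sum()
    g_left = h_left = 0.0
    best_gain, best_bin = -np.inf, None
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        g_right, h_right = G - g_left, H - h_left
        gain = (g_left**2 / (h_left + lam)
                + g_right**2 / (h_right + lam)
                - G**2 / (H + lam))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```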
Both LightGBM and xgboost use histogram-based split finding, in contrast to sklearn's GBM, which relies on the pre-sorted approach (one of the reasons it is slower). Now let's get to the crux of this post.
LightGBM aims to reduce the complexity of histogram building (O(#data * #feature)) by down-sampling the data and the features using GOSS and EFB. This brings the complexity down to O(#data2 * #bundle), where #data2 < #data and #bundle << #feature.
GOSS (Gradient-based One-Side Sampling) is a novel sampling method that down-samples the instances on the basis of their gradients. As we know, instances with small gradients are well trained (small training error) and those with large gradients are under-trained. A naive way to down-sample is to discard the instances with small gradients and focus solely on the instances with large gradients, but this would alter the data distribution. In a nutshell, GOSS retains the instances with large gradients while performing random sampling on the instances with small gradients.
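As an illustration of this sampling scheme, here is a minimal Python sketch that keeps a top_rate fraction of instances by absolute gradient, randomly samples an other_rate fraction of the rest, and compensates the sampled small-gradient instances with a weight of (1 - top_rate) / other_rate, as described in the paper; the function name and parameter names are my own:

```python
# A sketch of GOSS: keep all large-gradient instances, sample the rest,
# and up-weight the sampled ones to preserve the data distribution.
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))       # sort by |gradient|, descending
    top_idx = order[:n_top]                      # large-gradient instances: always kept
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    selected = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(selected))
    # Amplify the sampled small-gradient instances so the gradient
    # statistics stay approximately unbiased.
    weights[n_top:] = (1 - top_rate) / other_rate
    return selected, weights
```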
Remember that histogram building takes O(#data * #feature). If we can down-sample the #feature, we speed up tree learning. LightGBM achieves this by bundling features together. We generally work with high-dimensional data, and such data has many features which are mutually exclusive, i.e. they never take non-zero values simultaneously. LightGBM safely identifies such features and bundles them into a single feature, reducing the complexity to O(#data * #bundle) where #bundle << #feature.
(Formal algorithm for bundling features)
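As a rough illustration of the greedy bundling idea (count pairwise conflicts, i.e. rows where two features are simultaneously non-zero, then assign each feature to the first bundle whose conflict budget K allows it), here is a minimal Python sketch; the function name and the dense conflict-matrix computation are simplifications for the example, not LightGBM's actual implementation:

```python
# Greedy feature bundling: features with (almost) no conflicts end up in
# the same bundle, so they can later be merged into one feature.
import numpy as np

def greedy_bundle(X, K=0):
    nonzero = X != 0
    # Conflict counts between every pair of features.
    conflicts = nonzero.T.astype(int) @ nonzero.astype(int)
    np.fill_diagonal(conflicts, 0)

    # Process features in decreasing order of total conflicts (degree).
    order = np.argsort(-conflicts.sum(axis=1))
    bundles, bundle_conflicts = [], []
    for f in order:
        placed = False
        for i, b in enumerate(bundles):
            extra = sum(conflicts[f, g] for g in b)
            if bundle_conflicts[i] + extra <= K:   # conflict budget not exceeded
                b.append(f)
                bundle_conflicts[i] += extra
                placed = True
                break
        if not placed:
            bundles.append([f])                    # start a new bundle
            bundle_conflicts.append(0)
    return bundles
```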
EFB (Exclusive Feature Bundling) merges features to reduce the training complexity. To keep the merge reversible, the exclusive features are kept in different bins.
In the example below you can see that feature1 and feature2 are mutually exclusive. To obtain non-overlapping buckets, we add the bin range of feature1 as an offset to the values of feature2. This ensures that the non-zero data points of the bundled features (feature1 and feature2) reside in different buckets. In feature_bundle, buckets 1 to 4 contain the non-zero instances of feature1, and buckets 5 and 6 contain the non-zero instances of feature2.
(Intuitive explanation for merging features)
Blogger's note: this is the part I could never quite figure out, namely how the features inside a bundle actually get merged. Reading the algorithm above together with the example: a bin is simply a bucket. For instance, if a feature's value range is (0, 20) and we split it into two bins, then 0 to 10 is the first bin and 10 to 20 is the second. In the example above, feature1's binRange is 4 and feature2's binRange is 2, so after the first for loop binRanges is {0, 4, 6} and totalBin is 6. newBin is the value range of the merged feature; for this example it returns {1, 2, 3, 4, 5, 6}.
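To make the note above concrete, here is a minimal Python sketch of the merge step, assuming the features have already been binned and that bin index 0 means "feature not set"; the function name merge_exclusive_features and the input format are assumptions for this illustration:

```python
# Merge the features inside one bundle by giving each feature an offset
# equal to the cumulative bin range of the features before it, so its
# non-zero bins land in their own slice of the merged feature.
import numpy as np

def merge_exclusive_features(binned_features, bin_ranges):
    """binned_features: list of 1-D arrays of bin indices (0 = not set).
    bin_ranges: number of bins used by each feature (e.g. [4, 2])."""
    offsets = np.concatenate([[0], np.cumsum(bin_ranges)[:-1]])  # e.g. [0, 4]
    merged = np.zeros_like(binned_features[0])
    for feature_bins, offset in zip(binned_features, offsets):
        nonzero = feature_bins != 0
        merged[nonzero] = feature_bins[nonzero] + offset   # shift into its own bucket range
    return merged

# Example matching the post: feature1 uses bins 1..4, feature2 uses bins 1..2.
feature1 = np.array([1, 2, 0, 0, 3, 4, 0])
feature2 = np.array([0, 0, 1, 2, 0, 0, 2])
print(merge_exclusive_features([feature1, feature2], bin_ranges=[4, 2]))
# -> [1 2 5 6 3 4 6]: buckets 1-4 come from feature1, buckets 5-6 from feature2.
```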
With this we have covered most of the optimisations proposed in the original paper. I hope this post has given you some understanding of the core concepts behind the LightGBM library.
With this I complete my first attempt at writing a blog. Please feel free to share your thoughts, feedback or suggestions below.