A group of predictors is called an ensemble. The technique of aggregating the predictions of an ensemble to obtain better predictions is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
An ensemble of Decision Trees, each trained on a different random subset of the training set, is called a Random Forest.
The most popular Ensemble methods include bagging, boosting, stacking, and a few others.
A classifier that aggregates the predictions of several classifiers and predicts the class that gets the most votes is called a hard voting classifier.
Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.
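The claim that many weak learners can form a strong learner can be checked with a short calculation. The sketch below assumes an idealized setting that real ensembles do not meet: every classifier is right with the same probability `p`, and their errors are fully independent. The function name `majority_vote_accuracy` is made up for this illustration.

```python
import math

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent classifiers are right,
    when each one is right with probability p (binomial tail, summed in log space)."""
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total

# Each classifier is only slightly better than random guessing (51%)
for n in (1, 1001, 10001):
    print(n, majority_vote_accuracy(0.51, n))
```

Because classifiers trained on the same data make correlated errors, the accuracy reached in practice is lower than this idealized figure, which is why diversity matters.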
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class dataset, split 80/20 into training and test sets
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Three diverse classifiers combined by majority vote
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)

# Compare the test accuracy of each classifier with that of the ensemble
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
If all classifiers are able to estimate class probabilities (i.e., they have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting.
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
# Setting probability=True gives the SVC a predict_proba() method
svm_clf = SVC(gamma="auto", probability=True, random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
)
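To see why soft voting can disagree with hard voting, here is a small hand-made illustration in plain Python. The three probabilities are invented for the example and do not come from the classifiers above.

```python
# Hypothetical class-1 probabilities from three classifiers for one instance
probas_class1 = [0.45, 0.45, 0.90]

# Hard voting: each classifier casts one vote for its most likely class
votes = [1 if p >= 0.5 else 0 for p in probas_class1]
hard_prediction = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the probabilities, then pick the most likely class
mean_proba_class1 = sum(probas_class1) / len(probas_class1)
soft_prediction = 1 if mean_proba_class1 >= 0.5 else 0

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Two lukewarm votes for class 0 win the hard vote, but the single confident vote for class 1 pulls the average probability above 0.5, so soft voting gives more weight to confident predictions.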
One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression.
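The difference between the two sampling schemes, and the aggregation step, can be sketched with the standard library alone. The ten-instance "training set" and the prediction lists are placeholders chosen for the illustration.

```python
import random
from statistics import mean, mode

rng = random.Random(42)              # seeded only to make the sketch repeatable
training_indices = list(range(10))   # stand-in for a 10-instance training set
sample_size = 6

# Bagging: sample WITH replacement, so an index may appear more than once
bagging_sample = rng.choices(training_indices, k=sample_size)

# Pasting: sample WITHOUT replacement, so every index is distinct
pasting_sample = rng.sample(training_indices, k=sample_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))

# Aggregation: statistical mode for classification, average for regression
class_predictions = [1, 0, 1]        # hypothetical predictions of three classifiers
value_predictions = [2.0, 3.0, 4.0]  # hypothetical predictions of three regressors
print(mode(class_predictions), mean(value_predictions))
```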
Scikit-Learn offers a simple API for both bagging and pasting with the `BaggingClassifier` class (or `BaggingRegressor` for regression). The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set `bootstrap=False`). The `n_jobs` parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
The `BaggingClassifier` automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a `predict_proba()` method), which is the case with Decision Tree classifiers.
The training instances that are not sampled for a given predictor are called out-of-bag (oob) instances. Since that predictor never sees them during training, they can be used to evaluate it, and the ensemble can be evaluated by averaging the oob evaluations of its predictors. In Scikit-Learn, you can set `oob_score=True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_  # 0.9175
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)  # 0.9175
# lower than the results of 0.93/0.936 given by the authors
bag_clf.oob_decision_function_  # oob class probabilities for each training instance
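How many instances end up out-of-bag for a given predictor depends on how many draws are made. A quick calculation, assuming uniform sampling with replacement; the helper name `oob_fraction` is made up for this sketch, and the figure of 400 training instances assumes the 80/20 split of the 500-instance moons dataset used above.

```python
import math

def oob_fraction(m, n_draws):
    """Expected fraction of m training instances never picked
    in n_draws uniform draws with replacement."""
    return (1 - 1 / m) ** n_draws

# Bootstrap sample as large as the training set: approaches exp(-1), about 37%
print(oob_fraction(1000, 1000), math.exp(-1))

# Setting used above: 400 training instances, max_samples=100 draws per predictor
print(oob_fraction(400, 100))
```

With only 100 draws out of 400 instances, most of the training set is out-of-bag for any single tree, so each tree has plenty of unseen instances to be evaluated on.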
The `BaggingClassifier` class supports sampling the features as well. This is controlled by two hyperparameters: `max_features` and `bootstrap_features`.
Sampling both instances and features is called the Random Patches method. Keeping all training instances (i.e., `bootstrap=False` and `max_samples=1.0`) but sampling features (i.e., `bootstrap_features=True` and/or `max_features` smaller than 1.0) is called the Random Subspaces method.
Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
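As a minimal sketch of the Random Subspaces setting, the following trains a small bagging ensemble on the iris dataset where every tree sees all instances but only half of the features. The dataset, the ensemble size, and the variable names are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Random Subspaces: keep every instance, but give each tree 2 of the 4 features
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=False, max_features=0.5,
    random_state=42
)
subspace_clf.fit(X_iris, y_iris)

# estimators_features_ lists the feature indices seen by each tree
for features in subspace_clf.estimators_features_[:3]:
    print(features)
```

Switching to `bootstrap=True` with `max_samples` below 1.0 while keeping the feature sampling would turn this into the Random Patches method.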
from sklearn.ensemble import RandomForestClassifier

# Random Forest of 500 trees, each limited to 16 leaf nodes, trained on all CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
The following `BaggingClassifier` is roughly equivalent to the previous `RandomForestClassifier`:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1
)
It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).
A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
You can create an Extra-Trees classifier using Scikit-Learn’s `ExtraTreesClassifier` class.
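A minimal usage sketch, assuming the same kind of moons dataset as in the earlier examples; the variable names and hyperparameter values are chosen for illustration and are not tuned.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42)

# Same interface as RandomForestClassifier, but split thresholds are drawn at random
extra_clf = ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16,
                                 random_state=42)
extra_clf.fit(X_tr, y_tr)
extra_score = extra_clf.score(X_te, y_te)
print(extra_score)
```

It is hard to tell in advance whether Extra-Trees or a regular Random Forest will do better on a given dataset, so comparing both with cross-validation is usually the practical approach.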
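In a single Decision Tree, important features are likely to appear closer to the root of the tree, while unimportant features will often appear closer to the leaves (or not at all). Scikit-Learn estimates a feature's importance in a Random Forest from how much the tree nodes that use that feature reduce impurity, averaged over all trees in the forest. After training, you can access the resulting scores using the `feature_importances_` variable.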
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris.data, iris.target)
# Print each feature name next to its importance score (the scores sum to 1)
for name, score in zip(iris["feature_names"],
                       rnd_clf.feature_importances_):
    print(name, score)
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
To build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on.
Let’s take a closer look at the AdaBoost algorithm. Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$. A first predictor is trained and its weighted error rate $r_1$ is computed on the training set; see Equation 7-1.
Equation 7-1. Weighted error rate of the $j^{th}$ predictor
$$
r_j=\frac{\mathop{\sum_{i=1}^m}\limits_{\widehat y_j^{(i)}\neq y^{(i)}}w^{(i)}}{\sum_{i=1}^m w^{(i)}}
\textrm{ where } \widehat y_j^{(i)}\textrm{ is the }j^{th} \textrm{ predictor’s prediction for the }i^{th} \textrm{ instance.}
$$
The predictor’s weight $\alpha_j$ is then computed using Equation 7-2, where $\eta$ is the learning rate hyperparameter (defaults to 1).
Equation 7-2. Predictor weight
$$
\alpha_j=\eta\log \frac{1-r_j}{r_j}
$$
Next the instance weights are updated using Equation 7-3: the misclassified instances are boosted.
Equation 7-3. Weight update rule
$$
\textrm{for }i=1,2,\cdots,m\\
w^{(i)}\leftarrow\left\{\begin{array}{ll} w^{(i)}& \textrm{ if } \widehat y_j^{(i)}=y^{(i)}\\ w^{(i)}\exp\left(\alpha_j\right)& \textrm{ if }\widehat y_j^{(i)}\neq y^{(i)} \end{array}\right.
$$
Then, all the instance weights are normalized (i.e., divided by $\sum_{i=1}^m w^{(i)}$).
Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
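One round of the weight update can be traced by hand. The sketch below follows Equations 7-1 to 7-3 plus the normalization step on a five-instance toy example; the function name `adaboost_round` and the labels are invented for the illustration, and the code assumes the error rate is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost weight update following Equations 7-1 to 7-3 (assumes 0 < r < 1)."""
    # Equation 7-1: weighted error rate of the current predictor
    r = sum(w for w, t, p in zip(weights, y_true, y_pred) if p != t) / sum(weights)
    # Equation 7-2: predictor weight
    alpha = learning_rate * math.log((1 - r) / r)
    # Equation 7-3: boost the weights of the misclassified instances only
    boosted = [w * math.exp(alpha) if p != t else w
               for w, t, p in zip(weights, y_true, y_pred)]
    # Normalize so that the weights sum to 1 again
    total = sum(boosted)
    return r, alpha, [w / total for w in boosted]

m = 5
weights = [1 / m] * m
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 1]   # hypothetical predictions; only the last one is wrong
r, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(r, alpha, new_weights)
```

Here one instance out of five is misclassified, so the error rate is 0.2 and the predictor weight is log 4. After normalization the misclassified instance carries half of the total weight, which is what pushes the next predictor to focus on it.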
To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of weighted votes (see Equation 7-4).
Equation 7-4. AdaBoost predictions
$$
\widehat y(\textbf x)=\mathop{\arg\max}\limits_{k}\mathop{\sum_{j=1}^N}\limits_{\widehat y_j(\textbf x)=k}\alpha_j \textrm{ where } N \textrm{ is the number of predictors.}
$$
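Equation 7-4 amounts to a weighted vote. A minimal sketch in plain Python, with made-up predictions and predictor weights; `adaboost_predict` is a name invented for the example.

```python
from collections import defaultdict

def adaboost_predict(predictions, alphas):
    """Return the class with the largest total predictor weight (Equation 7-4)."""
    votes = defaultdict(float)
    for predicted_class, alpha in zip(predictions, alphas):
        votes[predicted_class] += alpha
    return max(votes, key=votes.get)

# Hypothetical predictions of three predictors for one instance, with their weights
print(adaboost_predict(["a", "b", "b"], [1.5, 0.6, 0.7]))
```

Class "b" gets two votes but a total weight of only 1.3, so the single vote of the more accurate predictor (weight 1.5) wins, unlike in a plain majority vote.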
Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities (i.e., if they have a `predict_proba()` method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.
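from sklearn.ensemble import AdaBoostClassifier

# 200 decision stumps (trees of depth 1) boosted with a learning rate of 0.5.
# Note: recent Scikit-Learn releases deprecate algorithm="SAMME.R";
# if the value is rejected, omit the argument.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5
)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)
accuracy_score(y_test, y_pred)  # 0.87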
Instead of tweaking the instance weights at every iteration like AdaBoost does, Gradient Boosting tries to fit the new predictor to the residual errors made by the previous predictor.
Let’s go through a simple regression example using Decision Trees as the base predictors (of course Gradient Boosting also works great with regression tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s fit a `DecisionTreeRegressor` to the training set (for example, a noisy quadratic training set):
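import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic training set; y is kept 1-D so that it lines up with the
# 1-D output of predict() (a column vector would broadcast to an m x m array)
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X[:, 0] - 0.5) ** 2
y = y + np.random.randn(m) / 10

# First tree: fit the targets directly
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Second tree: fit the residual errors left by the first tree
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Third tree: fit the residual errors left by the second tree
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# The ensemble's prediction is the sum of the predictions of all three trees
X_new = np.random.rand(200, 1)
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))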
A simpler way to train GBRT ensembles is to use Scikit-Learn’s `GradientBoostingRegressor` class.
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)
The `learning_rate` hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.
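What "scaling the contribution of each tree" means can be made concrete with a hand-rolled loop. This is a simplified sketch, not Scikit-Learn's implementation: it starts from a zero prediction and uses squared error, and the helper `fit_shrunk_ensemble` and the dataset are made up for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_demo = rng.rand(200, 1)
y_demo = 4 * (X_demo[:, 0] - 0.5) ** 2 + rng.randn(200) / 10

def fit_shrunk_ensemble(X, y, n_trees, learning_rate):
    """Fit trees to successive residuals, scaling each tree's output by learning_rate."""
    trees, prediction = [], np.zeros(len(y))
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - prediction)
        prediction += learning_rate * tree.predict(X)  # shrunken contribution
        trees.append(tree)
    return trees, prediction

def training_mse(n_trees, learning_rate):
    _, fitted = fit_shrunk_ensemble(X_demo, y_demo, n_trees, learning_rate)
    return np.mean((y_demo - fitted) ** 2)

for lr, n_trees in [(1.0, 3), (0.1, 3), (0.1, 100)]:
    print(lr, n_trees, training_mse(n_trees, lr))
```

With a learning rate of 0.1, three trees barely move the prediction toward the targets, so many more trees are needed to fit the training set; the payoff of shrinkage shows up on held-out data rather than in these training errors.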
In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to implement this is to use the `staged_predict()` method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

# Train 120 trees, then measure the validation error after each stage
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]

# argmin is 0-based, so stage index k corresponds to k + 1 trees
best_n_estimators = int(np.argmin(errors)) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)
It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting `warm_start=True`, which makes Scikit-Learn keep existing trees when the `fit()` method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

# warm_start=True keeps the trees already trained between calls to fit()
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators  # grow the ensemble by one tree
    gbrt.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping: no improvement for five iterations in a row
The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if `subsample=0.25`, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
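A minimal usage sketch of Stochastic Gradient Boosting, assuming a noisy quadratic dataset like the one used earlier; the variable names and the number of trees are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(42)
X_demo = rng.rand(200, 1)
y_demo = 4 * (X_demo[:, 0] - 0.5) ** 2 + rng.randn(200) / 10

# Each of the 50 trees is fitted on a random 25% of the training instances
sgb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=50,
                                    subsample=0.25, random_state=42)
sgb_reg.fit(X_demo, y_demo)
print(sgb_reg.predict([[0.5]]))
```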
It is possible to use Gradient Boosting with other cost functions. This is controlled by the `loss` hyperparameter.
Stacking (short for stacked generalization) is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? Each of the predictors predicts a different value, and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction.
To train the blender, a common approach is to use a hold-out set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer. The second (held-out) subset is then fed to the first-layer predictors to obtain their predictions; because the predictors never saw these instances during training, the predictions are "clean". Finally, these predicted values are used as input features, together with the target values of the held-out instances, to form the training set of the blender in the second layer.
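The hold-out procedure can be sketched by hand in a few lines. This is an illustrative implementation under simple assumptions (two first-layer classifiers, hard predictions as blender inputs, a 50/50 split); the names `first_layer`, `first_layer_outputs`, and `blender` are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = make_moons(n_samples=500, noise=0.30, random_state=42)
# Subset 1 trains the first layer; subset 2 (the hold-out set) trains the blender
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_all, y_all, test_size=0.5, random_state=42)

# First layer: each predictor is trained on subset 1 only
first_layer = [RandomForestClassifier(n_estimators=50, random_state=42),
               SVC(gamma="auto", random_state=42)]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

def first_layer_outputs(X):
    """Stack the first-layer predictions column-wise, one column per predictor."""
    return np.column_stack([predictor.predict(X) for predictor in first_layer])

# Second layer: the blender learns from hold-out predictions and hold-out targets
blender = LogisticRegression(solver="liblinear", random_state=42)
blender.fit(first_layer_outputs(X_hold), y_hold)

# A new instance goes through both layers
X_new_point = np.array([[0.5, 0.25]])
print(blender.predict(first_layer_outputs(X_new_point)))
```

Recent Scikit-Learn versions also provide `StackingClassifier` and `StackingRegressor`, which use cross-validated predictions rather than a single hold-out set to train the final estimator.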
To sum up, Ensemble Learning is a technique that aggregates the predictions of multiple models to make better predictions.
Voting Classifiers: use different classifiers, but the same training set. `VotingClassifier`.
Bagging and Pasting: use the same model, but different random subsets of the training set. The aggregation function is the statistical mode (the most frequent prediction) for classification, or the average for regression. The predictors can be trained in parallel. `BaggingClassifier`, `BaggingRegressor`.
Random Patches and Random Subspaces: sample the features as well. `BaggingClassifier` with the `max_features` and `bootstrap_features` hyperparameters.
Random Forests: only a random subset of the features is considered for each split. `RandomForestClassifier`, `RandomForestRegressor`.
Boosting (hypothesis boosting): train predictors sequentially, each trying to correct its predecessor. Because each predictor depends on the previous one, training cannot be parallelized.
AdaBoost (Adaptive Boosting): new predictors pay more attention to the training instances that the predecessor underfitted. `AdaBoostClassifier` with parameter `algorithm="SAMME.R"`.
Gradient Boosting: tries to fit the new predictor to the residual errors made by the previous predictor. `GradientBoostingRegressor`.
The error of a sample is the deviation of the sample from the (unobservable) population mean or actual function, while the residual of a sample is the difference between the sample and either (1) the (observed) sample mean or (2) the regressed (fitted) function value.
https://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics
Stacking (stacked generalization): train a model to perform aggregation. The training set is divided into two subsets. The first subset is used to train several regressors. Then these regressors make predictions on the second subset. The prediction values in combination with the target value are used as the training set of a final regressor.
Difference between `estimators` and `estimators_` in Ensemble classes:
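As one illustration of that difference, here is a small sketch using `VotingClassifier`: `estimators` is the list of (name, estimator) pairs passed to the constructor, while `estimators_` is created by `fit()` and holds the fitted clones. The dataset and variable names are chosen for the example, and other Ensemble classes may expose these attributes differently.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.30, random_state=42)

voting_demo = VotingClassifier(
    estimators=[("lr", LogisticRegression(solver="liblinear")),
                ("tree", DecisionTreeClassifier(max_depth=3, random_state=42))],
    voting="hard"
)
voting_demo.fit(X_demo, y_demo)

# estimators: the (name, estimator) pairs given to the constructor, left unfitted
name, original = voting_demo.estimators[0]
print(name, hasattr(original, "coef_"))

# estimators_: fitted clones created by fit(), in the same order, without names
fitted = voting_demo.estimators_[0]
print(type(fitted).__name__, hasattr(fitted, "coef_"))
```

The trailing underscore follows the Scikit-Learn convention that attributes learned during `fit()` end with `_`.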