[Machine Learning] Xiaosha Learns Ensemble Learning

Ensemble Learning

  • What is ensemble learning
    • The idea behind ensemble models
  • Ensemble learning strategies
    • Max Voting
    • Averaging
    • Weighted Averaging
    • Stacking
    • Bagging
    • Boosting
    • Bagging VS Boosting

What is ensemble learning

Ensemble learning, also called classifier ensembling, completes a learning task by building and combining multiple learners.
The general structure: first generate a group of "individual learners", then combine them with some strategy. The main combination strategies are averaging, voting, and learning-based methods (using a meta-learner).

The idea behind ensemble models

Three cobblers with their wits combined equal Zhuge Liang (in other words, many weak heads can beat one strong one).
Weak classifier: its classification ability is not strong, but it still performs slightly better than random guessing; it plays the role of a "cobbler".
Strong classifier: it has strong classification ability, meaning that given the features it classifies quite accurately; it plays the role of "Zhuge Liang".

集成学习策略

Max Voting

Majority rule: the class predicted by most of the base models wins.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# two base classifiers
model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)

# hard voting: the ensemble predicts the class label chosen by the majority
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(X_train, y_train)
model.score(X_test, y_test)

voting: str, {'hard', 'soft'} (default='hard')
If 'hard', uses predicted class labels for majority-rule voting. If 'soft', predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
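
For comparison, a minimal soft-voting sketch of the same ensemble (the model_soft name is an assumption here; both base models expose predict_proba, which soft voting requires):

# soft voting: average the predicted class probabilities, then take the argmax
model_soft = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='soft')
model_soft.fit(X_train, y_train)
model_soft.score(X_test, y_test)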

Averaging


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

# train each base model on the same training set
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

# average the predicted class probabilities of the three models
pred1 = model1.predict_proba(X_test)
pred2 = model2.predict_proba(X_test)
pred3 = model3.predict_proba(X_test)

final_pred = (pred1 + pred2 + pred3) / 3

Weighted Averaging


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

pred1 = model1.predict_proba(X_test)
pred2 = model2.predict_proba(X_test)
pred3 = model3.predict_proba(X_test)

# weighted average of the probabilities; the weights a1, a2, a3 should sum to 1
final_pred = a1*pred1 + a2*pred2 + a3*pred3

Stacking

Stacking is an ensemble learning technique that uses the predictions of multiple base models (for example a decision tree, kNN, or SVM) as features to train a new model, which then makes the predictions on the test set.
It aggregates multiple classification or regression models, and this can be done in stages.
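The snippet below calls a Stacking() helper that is not defined in this post. A minimal sketch of such a helper, assuming NumPy-array inputs and stratified K-fold out-of-fold prediction (the function name and signature are an assumption, not a library API):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def Stacking(model, train, y, test, n_fold=10):
    # returns averaged test-set predictions and out-of-fold predictions on the
    # training set for a single base model (train/test/y assumed NumPy arrays)
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)
    train_pred = np.zeros(len(train))
    test_pred = np.zeros((len(test), n_fold))
    for i, (tr_idx, val_idx) in enumerate(folds.split(train, y)):
        model.fit(train[tr_idx], y[tr_idx])
        train_pred[val_idx] = model.predict(train[val_idx])
        test_pred[:, i] = model.predict(test)
    return test_pred.mean(axis=1), train_pred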

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# base model 1: its out-of-fold predictions become a new feature
model1 = DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = Stacking(model=model1, train=X_train, y=y_train, test=X_test, n_fold=10)
train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)

# base model 2
model2 = KNeighborsClassifier()
test_pred2, train_pred2 = Stacking(model=model2, train=X_train, y=y_train, test=X_test, n_fold=10)
train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)

# meta-model: trained on the base models' predictions to obtain the final result
df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)
model = LogisticRegression(random_state=1)
model.fit(df, y_train)
model.score(df_test, y_test)

Bagging

(Algorithms based on bagging: Bagging meta-estimator, Random Forest)
The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a more generalized result.
The models run in parallel and are independent of each other.
Train multiple classifiers and average their outputs.
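A minimal scikit-learn sketch of this idea (X_train, y_train, X_test, y_test are placeholder names carried over from the snippets above):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# bagging meta-estimator: 100 decision trees (the default base estimator),
# each fit on a bootstrap sample of the training data
bag = BaggingClassifier(n_estimators=100, random_state=1)
bag.fit(X_train, y_train)
bag.score(X_test, y_test)

# Random Forest: bagging of decision trees plus a random feature subset per split
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)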

Boosting

(Algorithms based on boosting: AdaBoost, GBM, XGBoost, LightGBM, CatBoost)
Motivation: if a data point is incorrectly predicted by the first model, and then by the next one (and probably by all of them), will combining the predictions provide better results?

Definition: Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.

Ensembling improves the stability of a model, much like buying several stocks to reduce risk.
Multiple models are trained on the samples, every sample is predicted by each model, and the final prediction is derived from them (the models can be computed in parallel and then combined by averaging or a similar rule).
Boosting starts from weak learners and strengthens them, training with sample reweighting.
It makes the variance (or standard deviation) small: assume there are N models, each with prediction variance σ²; averaging the N models' predictions gives variance σ²/N (a short derivation follows below).
Each round of training builds on the results of the previous round.
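
A short derivation of the variance claim, under the strong assumption that the N models' predictions are independent with equal variance σ²:

\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} f_i(x)\right)
  = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}\!\left(f_i(x)\right)
  = \frac{N\sigma^2}{N^2}
  = \frac{\sigma^2}{N}

In practice the models are correlated, so the reduction is smaller than σ²/N; this is one reason bagging injects randomness through bootstrap sampling.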
Advantages of XGBoost:
1. The algorithm can be parallelized, so training is efficient;
2. Compared with other algorithms, it performs well in practice;
3. It exposes many tunable parameters, allowing flexible adjustment.
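A minimal scikit-learn sketch of the sequential idea described above (placeholder data names as before):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost: each new weak learner (a shallow tree by default) upweights the
# samples the previous learners misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
ada.score(X_test, y_test)

# Gradient boosting: each new tree fits the residual errors (negative gradients)
# of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=100, random_state=1)
gbm.fit(X_train, y_train)
gbm.score(X_test, y_test)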

Bagging VS Boosting

Bagging: leverages unstable base learners that are weak because of overfitting
Boosting: leverages stable base learners that are weak because of underfitting
XGBoost learning path (the objective and its approximation are sketched below):
1. How to construct the objective function (the objective is not obvious at first);
2. The objective function is hard to optimize directly; how can it be approximated?
3. How is the tree structure brought into the objective function?
4. It is still hard to optimize; use a greedy algorithm?
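
For reference, the objective XGBoost constructs (item 1) is the training loss plus a per-tree complexity penalty, and the approximation (item 2) is a second-order Taylor expansion of the loss. In the notation of the XGBoost paper, with T leaves and leaf weights w_j:

\mathrm{Obj} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}

\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\quad
g_i = \partial_{\hat{y}^{(t-1)}} l\!\left(y_i, \hat{y}^{(t-1)}\right),\;
h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\!\left(y_i, \hat{y}^{(t-1)}\right)

Items 3 and 4 then rewrite this objective leaf by leaf and grow each tree greedily using the resulting split gain.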
