The code and datasets in this post come from the book 《Python大数据分析与机器学习商业案例实战》.
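AdaBoost (Adaptive Boosting) is an effective and practical boosting algorithm that trains weak learners sequentially in a highly adaptive way. For classification, AdaBoost adjusts the sample weights according to the previous round's results: samples that the previous weak learner misclassified receive larger weights for the next weak learner, and correctly classified samples receive correspondingly smaller weights. Each iteration adds one new weak learner to the model. Reweighting and training repeat until the number of misclassified samples falls below a preset threshold or the iteration count reaches a specified maximum, which yields the final strong learner.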
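The reweighting loop described above can be sketched in a few lines. This is a minimal illustration of discrete AdaBoost with decision stumps, not the implementation inside scikit-learn's `AdaBoostClassifier`. The helper names `adaboost_fit` and `adaboost_predict`, the synthetic dataset, and the choice of 20 rounds are all assumptions made for the example, and labels are assumed to be encoded as -1 and +1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Train decision stumps sequentially; y must contain labels -1 and +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start from uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])  # weighted error (weights sum to 1)
        if err >= 0.5:  # no better than chance: stop adding learners
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # vote of this weak learner
        # y * pred is -1 for mistakes, so their weights grow; correct ones shrink
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
        if err <= 1e-10:  # perfect weak learner: nothing left to correct
            break
    return stumps, alphas


def adaboost_predict(X, stumps, alphas):
    """Combine the weak learners by a weighted vote."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


# Hypothetical synthetic data, used only to exercise the sketch
X_demo, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y_demo = 2 * y01 - 1  # map labels {0, 1} to {-1, +1}

stumps, alphas = adaboost_fit(X_demo, y_demo)
train_acc = np.mean(adaboost_predict(X_demo, stumps, alphas) == y_demo)
print(len(stumps), train_acc)
```

The stopping rule here is based on the weighted error and a fixed number of rounds; a production implementation would typically also expose a learning rate.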
# 1. Read the data
import pandas as pd
df = pd.read_excel('信用卡精准营销模型.xlsx')
# 2. Extract the feature variables and the target variable
X = df.drop(columns='响应')
y = df['响应']
# 3. Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# 4. Build and train the model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(random_state=123)
clf.fit(X_train, y_train)
# 5. Model prediction and evaluation
y_pred = clf.predict(X_test)
print(y_pred)
a = pd.DataFrame()
a['预测值'] = list(y_pred)
a['实际值'] = list(y_test)
a.head()
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
print(score)
y_pred_proba = clf.predict_proba(X_test)
print(y_pred_proba[0:5])
from sklearn.metrics import roc_curve
fpr, tpr, thres = roc_curve(y_test.values, y_pred_proba[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr, tpr)
plt.show()
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, y_pred_proba[:,1])
print(score)
print(clf.feature_importances_)
features = X.columns  # feature names
importances = clf.feature_importances_  # feature importances
importances_df = pd.DataFrame()
importances_df['特征名称'] = features
importances_df['特征重要性'] = importances
importances_df.sort_values('特征重要性', ascending=False)
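Output: [figure not available]
AdaBoost classifier parameters: [table not available]
Additional AdaBoost regressor parameters: [table not available]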
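One way to inspect the parameters of both estimators is to ask scikit-learn directly. The sketch below lists the constructor parameters of `AdaBoostClassifier` and shows which ones `AdaBoostRegressor` adds. The variable names are illustrative, and the exact set of names depends on the installed scikit-learn version (for example, the base-estimator parameter was renamed from `base_estimator` to `estimator` in newer releases).

```python
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# get_params() returns the constructor parameters with their current values
clf_params = AdaBoostClassifier().get_params()
reg_params = AdaBoostRegressor().get_params()

print(sorted(clf_params))  # parameters of the classifier
print(sorted(set(reg_params) - set(clf_params)))  # regressor-only parameters
```

Both estimators share `n_estimators`, `learning_rate`, and `random_state`; the regressor additionally takes `loss`, which controls how the per-sample error is computed when the weights are updated.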
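GBDT stands for Gradient Boosting Decision Tree. AdaBoost adjusts sample weights according to the classification results and iterates until it produces a strong learner. GBDT instead uses the negative gradient of the loss function as an approximation of the residual, and repeatedly fits regression trees to those residuals until it produces a strong learner. In short, AdaBoost adjusts weights, whereas GBDT fits residuals.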
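The residual-fitting idea can be illustrated with a short loop. This is a minimal sketch for squared-error loss, where the negative gradient equals the plain residual `y - prediction`; it is not how `GradientBoostingRegressor` is implemented internally. The synthetic data, the tree depth, the learning rate of 0.1, and the 50 rounds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # initial model: the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]
trees = []

for _ in range(50):
    # For squared-error loss the negative gradient is the residual itself
    residual = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)  # each new tree fits what is still unexplained
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(mse_history[0], mse_history[-1])  # training error before and after
```

With other loss functions the residual is replaced by the negative gradient of that loss, which is why the residual is described as an approximation in the general case.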
# 1. Read the data
import pandas as pd
df = pd.read_excel('产品定价模型.xlsx')
df['类别'].value_counts()
df['彩印'].value_counts()
df['纸张'].value_counts()
# 2. Encode the categorical text variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['类别'] = le.fit_transform(df['类别'])  # encode the 类别 column as integers
df['类别'].value_counts()
# df['类别'] = df['类别'].replace({'办公类': 0, '技术类': 1, '教辅类': 2})
# df['类别'].value_counts()
le = LabelEncoder()
df['纸张'] = le.fit_transform(df['纸张'])  # encode the 纸张 column as integers
# 3. Extract the feature variables and the target variable
X = df.drop(columns='价格')
y = df['价格']
# 4. Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# 5. Build and train the model
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(random_state=123)
model.fit(X_train, y_train)
# 6. Model prediction and evaluation
y_pred = model.predict(X_test)
print(y_pred[0:50])
a = pd.DataFrame()  # create an empty DataFrame
a['预测值'] = list(y_pred)
a['实际值'] = list(y_test)
a.head()
print(model.score(X_test, y_test))
from sklearn.metrics import r2_score
r2 = r2_score(y_test, model.predict(X_test))
print(r2)
print(model.feature_importances_)
features = X.columns  # feature names
importances = model.feature_importances_  # feature importances
importances_df = pd.DataFrame()
importances_df['特征名称'] = features
importances_df['特征重要性'] = importances
importances_df.sort_values('特征重要性', ascending=False)