I. Background - process flow - field meanings
II. Feature-engineering approach
III. Baseline
IV. Model stacking - uses BayesianRidge regression
V. Competition primer and walkthrough, part 1
Competition primer and walkthrough, part 2
Advanced data mining: top code from Kaggle competitions, shared
VI. Experience from Kaggle, JData, and Tianchi competitions - this one post is all you need.
1. Remove yield-rate outliers
2. The timestamps in the dataset also contain outliers
A11 - A12 = 1:00 (occasionally anomalous)
3. Convert the timestamps into time intervals
4. Bin the yield rate, then for each feature build the mean of the different yield bins per category (a sketch of steps 1-4 follows this list)
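A minimal sketch of steps 1-4, assuming the combined train/test DataFrame is called data, the yield lives in a column named target, and time columns such as A11/A12 hold "HH:MM" strings; the outlier cut-off, bin count, and column lists are illustrative assumptions, not the competition's actual values.

import numpy as np
import pandas as pd

# 1. Drop yield-rate outliers (cut-off is an assumed example);
#    keep test rows, whose target is null.
data = data[data['target'].between(0.85, 1.0) | data['target'].isnull()]

# 2. Repair anomalous timestamps: convert "HH:MM" to minutes and fix
#    negative gaps caused by steps that cross midnight.
def to_minutes(t):
    if pd.isnull(t):
        return np.nan
    h, m = str(t).split(':')[:2]
    return int(h) * 60 + int(m)

for col in ['A11', 'A12']:
    data[col + '_min'] = data[col].apply(to_minutes)
crossed = data['A12_min'] < data['A11_min']
data.loc[crossed, 'A12_min'] += 24 * 60

# 3. Turn paired timestamps into interval (duration) features.
data['A11_A12_dur'] = data['A12_min'] - data['A11_min']

# 4. Bin the yield, then give every category of every feature the mean
#    membership of each yield bin; compute the means on train rows only
#    to avoid target leakage into the test set.
data['target_bin'] = pd.cut(data['target'], bins=5, labels=False)
bin_cols = []
for b in range(5):
    col = 'yield_bin_{}'.format(b)
    data[col] = (data['target_bin'] == b).astype(float)
    bin_cols.append(col)
for feat in ['A11_A12_dur']:      # in practice, loop over all categorical features
    for col in bin_cols:
        means = data[data['target'].notnull()].groupby(feat)[col].mean()
        data['{}_{}_mean'.format(feat, col)] = data[feat].map(means)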
To be optimized:
1. Multi-model results
2. Mean features
3. Optimization of the time features
General template - boilerplate code for mainstream machine-learning models + experience sharing [xgb, lgb, Keras, LR]
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

param = {'num_leaves': 30,
         'min_data_in_leaf': 30,
         'objective': 'regression',
         'max_depth': -1,
         'learning_rate': 0.01,
         'min_child_samples': 30,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 11,
         'metric': 'mse',
         'lambda_l1': 0.1,
         'verbosity': -1}

# 10-fold cross-validation
# [2095] training's l2: 0.000183012  valid_1's l2: 0.000238902
# CV score: 0.00031
# training's l2: 0.000179558  valid_1's l2: 0.000251254
# CV score: 0.00031
folds = KFold(n_splits=10, shuffle=True, random_state=2018)
oof_lgb = np.zeros(len(train))         # out-of-fold predictions on the train set
predictions_lgb = np.zeros(len(test))  # fold-averaged predictions on the test set
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])
    num_round = 10000
    # verbose_eval / early_stopping_rounds as kwargs is the pre-4.0 lightgbm API
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
    predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)))
import xgboost as xgb

xgb_params = {'eta': 0.005, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}

folds = KFold(n_splits=15, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(train))
predictions_xgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])
    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist,
                    early_stopping_rounds=200, verbose_eval=100, params=xgb_params)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, target)))
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RepeatedKFold

# Stack the out-of-fold predictions of the two base models as meta-features
train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
oof_stack = np.zeros(train_stack.shape[0])
predictions = np.zeros(test_stack.shape[0])
for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values
    clf_3 = BayesianRidge()
    clf_3.fit(trn_data, trn_y)
    oof_stack[val_idx] = clf_3.predict(val_data)
    predictions += clf_3.predict(test_stack) / 10   # 10 = 5 splits x 2 repeats

print("Stacking CV score: {}".format(mean_squared_error(target.values, oof_stack)))
data['B11'] = 1.0
Submission result: offline 0.00011862717410224899, online:
data['B11'] padded with B10 plus one hour instead (sketch below)
Submission result: offline 0.00011745711197741756
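A minimal sketch of that padding, assuming B10/B11 hold "HH:MM" strings; add_one_hour is a hypothetical helper, not code from the original solution.

def add_one_hour(t):
    # shift an "HH:MM" string forward by one hour, wrapping past midnight
    h, m = str(t).split(':')[:2]
    return '{:02d}:{}'.format((int(h) + 1) % 24, m)

missing = data['B11'].isnull()
data.loc[missing, 'B11'] = data.loc[missing, 'B10'].apply(add_one_hour)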
Hyperparameter tuning does help model accuracy, but only to a limited extent. The biggest gains come from data cleaning, feature selection, feature fusion, and model ensembling!
Parameter explanations
Tuning guide: grid search
Tuning guide 2
https://blog.csdn.net/sb19931201/article/details/52577592
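To make the grid-search guidance in the links above concrete, a minimal sketch with sklearn's GridSearchCV wrapped around an LGBMRegressor; the grid values and CV setup are illustrative assumptions, not tuned settings for this competition.

from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Illustrative grid; values are assumptions, not recommendations.
grid = {
    'num_leaves': [15, 30, 63],
    'learning_rate': [0.01, 0.05],
    'min_child_samples': [20, 30, 50],
}
gs = GridSearchCV(
    estimator=lgb.LGBMRegressor(n_estimators=1000, subsample=0.9,
                                colsample_bytree=0.9, random_state=2018),
    param_grid=grid,
    scoring='neg_mean_squared_error',
    cv=5,
)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)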