Coursera, How to win a competition: course notes

How to win a data science competition

Course overview

What you will learn

  • how to preprocess the data
  • extract features
  • how to set up the validation correctly
  • optimize the given metric
  • A truly unique opportunity to see the detailed explanations of the winning solutions.

Course schedule

week 2 basic pipeline

  • EDA
  • Validation
  • Data leaks

week 3 improve model

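  • Metrics (evaluation criteria)

  • Mean encoding

    When a feature is categorical and has a large number of possible values (high cardinality), mean encoding is an efficient way to encode it. In practice, this kind of feature engineering can greatly improve model performance.

week 4 improve model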

  • Advanced features
  • Hyperparameter optimization
  • Ensembles

Competition Mechanics

Competition Mechanics

  • Data

  • Model

    • produce the best prediction
    • reproducible
  • Submission

  • Evaluation

    • scoring formula

    • test set

  • Leaderboard

Typical workflow:

1, analyze data

2, fit model

3, submit

4, see public score

5, repeat from step 1

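Why take part in competitions

  • great opportunity for learning
  • seeing the big picture
  • joining the community
  • earning money
    • this should not be the primary goal

Competitions vs. real-world problems

Real-world problems

  • understand the business
  • formalize the problem
  • collect data
  • clean the data
  • build models
  • evaluate models
  • deploy models

Competitions

  • clean the data
  • build models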

Recap of main ML algorithms

Linear model

  • Drawbacks:
    • many cases cannot be separated by a single straight line

Tree-based

The basic principle is divide and conquer.

Split the data with one rule first, then split each part again with another rule.

  • very useful for tabular data

Drawbacks:

  • it is hard to capture linear dependencies, because that takes too many splits

K-NN

K-nearest neighbors

Points that are close to each other tend to have similar labels.

Neural Networks

Black box

Notes

  • No single algorithm is better than all the others

  • We cannot win a competition with just one simple algorithm

Conclusion:

Exploratory data analysis

Exploratory Data Analysis: what and why?

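EDA helps you

  • understand the data better
  • build intuition about the data
  • generate hypotheses
  • find underlying patterns

Understand the data

  • what each column represents
  • whether the data make sense
  • check for anomalies in the data
    • anomalous rows do not have to be dropped; adding a column that flags them and letting the model learn from it works well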
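
The flag-column idea above could look like the following minimal sketch. The `age` column and the valid range 0 to 120 are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy data: the "age" column and the 0..120 valid range are assumptions.
df = pd.DataFrame({"age": [25, 31, -1, 47, 230, 38]})

# Keep the suspicious rows, but add an indicator column so that the model
# can learn for itself how to treat them.
df["age_is_anomaly"] = (~df["age"].between(0, 120)).astype(int)

print(df)
```

No row is removed; the model simply receives one extra binary feature.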

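Exploring anonymized data

  • anonymized features are encoded (obfuscated), but they keep the properties of the original data; for example, a linear relationship stays linear
  • some tricks can be used to decode such linear transformations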

Visualizing the data

  • Histograms

    • Caveats

      • pay attention to the binning: values that fall into the bin near 0 are not necessarily exactly 0
      • never draw a conclusion from a single plot
    • Spotting problems

      an unusual peak may mean that missing values were filled with the mean

  • Creative analysis with plot

    • a horizontal line means many rows share exactly the same value

    • color the points by class

    • plot the outliers
  • Statistics

    • describe
  • Scatter plots

    • How to use them
      • plot one feature against another

    • for a regression problem, the point size can encode the target value
    • they can be used to check whether the test data and the train data follow the same distribution

    • Another application

    • How to use it
      • for tree-based models, create a new feature: the difference or ratio between X1 and X2

    • create a new feature: which triangle a new data point falls into
  • Correlation matrix

  • Count how many meaningful feature combinations there are

  • the matshow function draws the matrix plot
    • then run k-means on the matrix and reorder the features accordingly
  • Result

Averaging each feature and then sorting the means can suggest new features

Calls for the methods above

  • Histogram

plt.hist(x)

  • plot

plt.plot(x, '.')

  • statistics

df.describe()

  • Relationships between features

plt.scatter(x1, x2)

pd.plotting.scatter_matrix(df)

df.corr()

plt.matshow(df.corr())

df.mean().sort_values().plot(style=',')

What do axis = 1 and axis = 0 actually mean?
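
A small sketch that may help answer the axis question, using a toy DataFrame invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 runs down the rows: one result per column.
col_sums = df.sum(axis=0)   # a -> 6, b -> 60

# axis=1 runs across the columns: one result per row.
row_sums = df.sum(axis=1)   # 11, 22, 33

# For drop, axis=1 means the label names a column (axis=0 would name a row).
without_b = df.drop("b", axis=1)
```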

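Drop columns that have only one value

nunique returns the number of distinct values

# number of distinct values per column (NaN counts as a value)
feats_counts = train.nunique(dropna=False)
feats_counts.sort_values()[:10]
# a column with a single distinct value carries no information
constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)

# traintest is assumed to be the concatenation of train and test
traintest.drop(constant_features, axis=1, inplace=True)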

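Remove duplicated columns

Columns whose values are exactly identical

import numpy as np
from tqdm import tqdm_notebook

dup_cols = {}  # maps each duplicated column to the first column it equals

# train_enc is assumed to be an encoded copy of train whose columns can be compared element-wise
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)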

Check how many distinct values each column has

nunique = train.nunique(dropna=False)
plt.figure(figsize=(14,6))
_ = plt.hist(nunique.astype(float)/train.shape[0], bins=100)

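Effect on models

If the relationship between two features is linear,

  • a NN or a linear regression will pick up that relationship,
  • but tree-based models will not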

Validation and overfitting

Predicting unseen data

How competitions are set up

A different definition of overfitting in competitions

The test data in a competition may be of lower quality, so in competitions overfitting means that the model performs worse on the test data than it did on the validation data

Validation strategies

  • holdout

  • K-fold

  • Leave-one-out

Pay attention to stratification when validating
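
The strategies listed above can be sketched with scikit-learn's splitters. The toy data and the fold counts are assumptions made for this example; the splitter classes are the ones scikit-learn ships in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import (
    KFold, LeaveOneOut, StratifiedKFold, train_test_split,
)

X = np.arange(20).reshape(10, 2)      # 10 toy samples
y = np.array([0] * 8 + [1] * 2)       # imbalanced labels

# Holdout: one fixed split into a train part and a validation part.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold: every sample is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_sizes = [len(val_idx) for _, val_idx in kfold.split(X)]

# Leave-one-out: K-fold with K equal to the number of samples.
n_loo_splits = LeaveOneOut().get_n_splits(X)

# Stratified K-fold keeps the class ratio similar in every fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
```

With only two positive samples, a plain K-fold may produce folds that contain none of them, which is the situation stratification is meant to avoid.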

time-based validation

id-based validation

  • when building the validation split, make sure the users in the train data are separated in the same way
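
A minimal sketch of time-based and id-based splits, assuming scikit-learn's `TimeSeriesSplit` and `GroupKFold` as stand-ins for the schemes in the notes. The user ids are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # rows are assumed to be sorted by time

# Time-based: every validation block lies strictly after its train block.
time_ok = [
    tr_idx.max() < val_idx.min()
    for tr_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X)
]

# ID-based: all rows of one user land on the same side of the split,
# so validation users are never seen during training.
users = np.repeat(["u1", "u2", "u3", "u4"], 3)   # hypothetical user ids
shared_users = [
    set(users[tr_idx]) & set(users[val_idx])
    for tr_idx, val_idx in GroupKFold(n_splits=2).split(X, groups=users)
]
```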

Validation problems

  • Validation Stage
  • Submission stage


Data leakage

Leaderboard probing

Predicting from categorical values

Some category values indicate that Y takes one specific value

Using one value to infer another

That is one seriously advanced formula

Metric

Absolute-error metrics

  • MSE
  • RMSE

  • R-squared

  • MAE

  • Update direction (gradient) of MAE

  • MAE vs MSE

Metrics with weights

  • MSPE
  • MAPE

  • RMSLE
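
Since the formula images are missing, here is a sketch of the metrics named above using their standard definitions in NumPy. MSPE and MAPE are written as percentages; the toy arrays are invented for the example.

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def rmse(y, p):
    return np.sqrt(mse(y, p))

def mae(y, p):
    return np.mean(np.abs(y - p))

def r2(y, p):
    # 1 minus the ratio of the model's MSE to the MSE of predicting the mean.
    return 1 - mse(y, p) / np.mean((y - y.mean()) ** 2)

def mspe(y, p):
    return 100 * np.mean(((y - p) / y) ** 2)

def mape(y, p):
    return 100 * np.mean(np.abs((y - p) / y))

def rmsle(y, p):
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(p)) ** 2))

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([11.0, 110.0, 1100.0])   # every prediction is 10% too high
```

On this toy data the absolute metrics are dominated by the largest target, while MSPE and MAPE treat all three rows equally because each is off by the same relative amount.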

Common optimization approaches

Loss vs. metric

  • The target metric is what we want to optimize
  • The optimization loss is what the model actually optimizes

How to optimize a metric

  • MSE and logloss can usually be used directly as the model's loss function

  • but MSPE, MAPE and RMSLE cannot

    • for example, MSPE cannot be used directly in XGBoost
  • A hand-written XGBoost loss function

  • Early stopping
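
The hand-written loss from the notes is not recoverable from the missing image. As an illustration only, the sketch below computes the per-row gradient and hessian of a squared percentage error, which is the kind of pair a custom objective for a gradient boosting library has to return. The function name `mspe_objective` is hypothetical. In XGBoost's native API the callable passed as `obj` receives the predictions and the training `DMatrix`, so the labels would come from `dtrain.get_label()`; here they are passed directly to keep the example library-free.

```python
import numpy as np

def mspe_objective(preds, labels):
    """Gradient and hessian of ((label - pred) / label)**2 with respect to pred."""
    grad = 2.0 * (preds - labels) / labels ** 2
    hess = 2.0 / labels ** 2
    return grad, hess

labels = np.array([10.0, 200.0])
preds = np.array([12.0, 150.0])
grad, hess = mspe_objective(preds, labels)
```

Constant factors such as 100/N are dropped because they do not change where the minimum is.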

Regression metrics optimization

  • Libraries that support MSE as a loss

  • Libraries that support MAE as a loss

    • MAE is also known as L1

  • MSPE / MAPE

Add weights to the samples
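
A sketch of the sample-weight trick, assuming the usual choice of weights proportional to 1/y² for MSPE and 1/|y| for MAPE. The toy values are invented. Such weights would typically be handed to a library through its sample-weight argument.

```python
import numpy as np

y = np.array([10.0, 100.0, 1000.0])   # targets
p = np.array([11.0, 90.0, 1000.0])    # some predictions

# MSPE as weighted MSE: weights proportional to 1 / y^2.
w_mspe = (1 / y ** 2) / np.sum(1 / y ** 2)
# MAPE as weighted MAE: weights proportional to 1 / |y|.
w_mape = (1 / np.abs(y)) / np.sum(1 / np.abs(y))

weighted_mse = np.sum(w_mspe * (y - p) ** 2)
mspe = np.mean(((y - p) / y) ** 2)

# The two quantities differ only by a constant factor that depends on y,
# so minimizing one also minimizes the other.
ratio = mspe / weighted_mse
```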

  • RMSLE

The target values in the dataset have to be transformed

Do the transformation beforehand
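
A minimal sketch of the target transformation for RMSLE, assuming the usual recipe: train on log1p(y) with a plain MSE loss, then map the predictions back with expm1. A straight-line fit with `np.polyfit` stands in for the model, and the synthetic data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
# Positive, skewed target built so that log1p(y) is linear in x plus noise.
y = np.expm1(0.5 + 0.8 * x + rng.normal(0, 0.1, size=200))

# 1) transform the target beforehand
z = np.log1p(y)
# 2) fit any model with plain MSE on the transformed target (here: a line)
slope, intercept = np.polyfit(x, z, deg=1)
# 3) transform the predictions back to the original scale
pred = np.expm1(slope * x + intercept)

rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y)) ** 2))
```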

Mean encodings

  • Give each category a meaningful numeric value

  • very useful with LightGBM

Add similar features

Mean encoding example
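
The example image is missing, so here is a small sketch of plain mean encoding in pandas. The `city` and `target` columns are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   1],
})

# Replace each category with the mean target observed for that category.
category_means = train.groupby("city")["target"].mean()
train["city_mean_enc"] = train["city"].map(category_means)
```

Category C has a single row, so its encoding is just that row's own target. This is the overfitting risk that regularization is meant to reduce.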

Regularization: ways to avoid overfitting

Using KFold

Add the encoded feature fold by fold with KFold, to check whether it only works locally
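
One way to sketch the KFold scheme: the encoding of each row is computed only from the other folds, so a row never sees its own target. Column names and the fold count are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

train = pd.DataFrame({
    "city":   list("AABBBCAABB"),
    "target": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})
global_mean = train["target"].mean()
train["city_enc"] = np.nan

for fit_idx, enc_idx in KFold(n_splits=5, shuffle=False).split(train):
    # Category means estimated on the other four folds only.
    fold_means = train.iloc[fit_idx].groupby("city")["target"].mean()
    train.loc[train.index[enc_idx], "city_enc"] = (
        train["city"].iloc[enc_idx].map(fold_means).to_numpy()
    )

# Categories that never appear in the other folds fall back to the global mean.
train["city_enc"] = train["city_enc"].fillna(global_mean)
```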

Smoothing

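
A sketch of smoothing, assuming the common formula (mean × n + global_mean × alpha) / (n + alpha), where n is the category size. The value of `alpha` is a hypothetical smoothing strength.

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   1],
})
global_mean = train["target"].mean()
alpha = 2.0   # hypothetical smoothing strength; larger means more shrinkage

stats = train.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (
    stats["count"] + alpha
)
train["city_smooth_enc"] = train["city"].map(smoothed)
```

Rare categories such as C are pulled towards the global mean, while frequent ones stay close to their own mean.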

Noise

Deliberately lower the quality of the train data; this could be useful in my own project

Expanding mean

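
A sketch of the expanding mean, assuming the usual cumulative-sum formulation: each row is encoded from the targets of earlier rows of the same category only. The column names are hypothetical and the rows are assumed to be in a meaningful order.

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "B", "A", "A", "B"],
    "target": [1,   0,   0,   1,   1],
})
global_mean = train["target"].mean()

# Sum and count of the targets of EARLIER rows in the same category.
cumsum = train.groupby("city")["target"].cumsum() - train["target"]
cumcnt = train.groupby("city").cumcount()

# The first row of each category has no history (0 / 0), so it falls back
# to the global mean.
train["city_exp_enc"] = (cumsum / cumcnt).fillna(global_mean)
```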

Extensions and generalizations

For regression targets we can extract

  • the median,
  • percentiles,
  • the std,
  • normal-distribution features

All of them need regularization

Extract statistics after grouping the data

  • for example, aggregate by time,
  • or aggregate over the previous few days

Aggregate over the relationships between data points

  • How to decide what to group
    • if a feature shows up in many subtrees, it is worth grouping on
    • for example, feature 1 and feature 2 show up in many subtrees

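import numpy as np
import matplotlib.pyplot as plt

# model is assumed to be a fitted tree-based model and df its feature DataFrame
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]  # indices of the 10 most important features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

  • How to combine the data effectively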

Order of tuning the model

Mean encoding

Hyperparameter optimization

Find the most influential parameters

Look up the documentation online

  • the documentation says which parameters should be tuned first
  • and what each parameter means

Libraries that can find the best parameters automatically

An example of automated parameter tuning

How parameters affect the model

  • underfitting

    the model cannot even learn the train set

  • good fit
  • overfitting

Tuning tree parameters!

Model                       Where
GBDT                        XGBoost, LightGBM
RandomForest, ExtraTrees    scikit-learn
Others                      RGF

XGBoost / LightGBM

  • max_depth
    • deeper trees help the model build useful feature interactions
    • a depth of about 7 is usually enough
  • subsample / bagging_fraction
    • the fraction of rows sampled for each tree; the smaller it is, the less likely the model is to overfit
  • colsample_bytree / colsample_bylevel / feature_fraction
  • min_child_weight / min_data_in_leaf
    • the most important ones
  • eta / learning_rate, num_round / num_iterations: learning rate and number of iterations
    • early stopping can be used: stop training once the loss starts to rise

RandomForest / ExtraTrees

  • n_estimators (the more the better)

    • the number of decision trees
    • start at around 10, increase it gradually, and watch how the metric changes
  • max_depth

    • maximum depth of the trees
  • max_features

    • how many features are used for training
  • min_samples_leaf

  • n_jobs

    • how many processes to run in parallel

Neural networks

  • PyTorch and Keras
  • number of neurons per layer
  • number of layers
  • batch size
    • how many samples are used in each training step
  • learning rate
    • needs to be chosen appropriately

Linear models

SVMs need almost no parameter tuning

  • Regularization
    • L1
    • L2
  • L1 can be used for feature selection

  • Very useful when GBDT and NN models take a long time to train

When submitting, combining the same model trained with different parameters is very effective

Statistics and distance based features

Groupby features

  • for each user, compute the lowest and the highest price

  • for each page, find the position of the lowest-priced item

  • come up with as many features as possible
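
The two groupby features above might be built like this in pandas. The column names (`user_id`, `page_id`, `position`, `price`) are assumptions, since the original table image is missing.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "page_id":  [10, 10, 11, 10, 10],
    "position": [1, 2, 1, 1, 2],
    "price":    [5.0, 3.0, 8.0, 4.0, 6.0],
})

# Per user: lowest and highest price among that user's rows.
df["user_min_price"] = df.groupby("user_id")["price"].transform("min")
df["user_max_price"] = df.groupby("user_id")["price"].transform("max")

# Per page: position of the cheapest item on that page.
cheapest = df.loc[df.groupby("page_id")["price"].idxmin(), ["page_id", "position"]]
cheapest = cheapest.rename(columns={"position": "min_price_position"})
df = df.merge(cheapest, on="page_id", how="left")
```

`transform` keeps the original row count, which makes it convenient for adding group statistics as new columns.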

Neighbors

Applied to my own project, this means

  • max and min of IT energy within a certain range
  • max and min of humidity within a certain range

Matrix Factorization / dimensionality reduction

  • Concrete methods for dimensionality reduction

  • PCA can help turn categorical features into real-valued ones

Building feature interactions

  • Option 1: concatenate first, then one-hot encode

  • Option 2: one-hot encode first, then combine

A numeric example

Common ways to combine numeric features

  • multiplication

This approach works very well with tree-based algorithms
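
A sketch of the first categorical option (concatenate, then one-hot) and of a numeric product feature. All column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": ["a", "b", "a"],
    "f2": ["x", "y", "y"],
    "n1": [2.0, 3.0, 4.0],
    "n2": [10.0, 20.0, 30.0],
})

# Categorical interaction: concatenate the two values, then one-hot encode.
df["f1_f2"] = df["f1"] + "_" + df["f2"]
onehot = pd.get_dummies(df["f1_f2"])

# Numeric interaction: multiply two numeric features.
df["n1_x_n2"] = df["n1"] * df["n2"]
```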

Steps

I do not fully understand this yet; worth trying out

t-SNE

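  • can be used for visualization
  • the result can be used as a feature (similar to a classifier's output)
  • the perplexity parameter matters a lot
  • interpret the results with care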

Ensembling

What is ensembling

  • combining different machine learning models to get a better prediction

Average

  • simply combine two models that behave differently

Weighted Average

  • give different weights to different models

Conditional averaging

  • use model 1 under some conditions and model 2 under others
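
The three averaging schemes in a few lines of NumPy. The predictions, the weights 0.7/0.3 and the threshold 50 are made-up values for illustration.

```python
import numpy as np

pred_1 = np.array([0.2, 0.6, 0.9])   # hypothetical predictions of model 1
pred_2 = np.array([0.4, 0.4, 0.7])   # hypothetical predictions of model 2
x = np.array([10, 60, 80])           # feature used for the condition

# Simple average: both models count equally.
simple_avg = (pred_1 + pred_2) / 2

# Weighted average: trust model 1 more.
weighted_avg = 0.7 * pred_1 + 0.3 * pred_2

# Conditional average: model 1 where x < 50, model 2 elsewhere.
conditional = np.where(x < 50, pred_1, pred_2)
```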

Bagging

Average the predictions of many slightly different versions of a model

example: random forest

Why use bagging

  • Errors due to bias (underfitting)
  • Errors due to variance (overfitting)

Important bagging parameters

  • seed
    • how different the models are from each other
    • row sampling
    • shuffling
    • column sampling
    • model-specific parameters
    • number of models
    • running in parallel

A hand-written bagging example
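
The original hand-written example is lost with the image. The following is a separate minimal sketch of bagging, assuming bootstrap row sampling and scikit-learn decision trees as the base model. The data and the number of bags are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

n_bags = 10
bag_preds = []
for seed in range(n_bags):
    # Bootstrap: sample rows with replacement, a different sample per seed.
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    model = DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx])
    bag_preds.append(model.predict(X))

# The bagged prediction is the average over all models.
bagged_pred = np.mean(bag_preds, axis=0)
```

Each tree overfits its own bootstrap sample; averaging them is what reduces the variance part of the error.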

Boosting

What is boosting

  • a way of weighting models: each new model is built taking into account how well the previous models did

Main types of boosting

Weight based

How it works:

The errors of the current predictions are handed to the next model to learn from

Important parameters

Residual based

The most important kind of model; almost every competition uses it

Steps:

  • fit a first model
  • compute the error it leaves
  • the following models predict this error
  • the final result is the sum of all the models
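
The residual-based steps can be sketched by hand with shallow scikit-learn trees. The learning rate, tree depth and number of rounds are assumptions; real libraries add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.5
prediction = np.zeros_like(y)   # start from a constant zero prediction
errors = []
for _ in range(20):
    residual = y - prediction                          # error left so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)      # add the correction
    errors.append(np.mean((y - prediction) ** 2))      # train MSE per round
```

The final prediction is the (scaled) sum of all trees, and the train error shrinks from round to round.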

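Well-known residual-based boosting implementations

  • XGBoost
  • LightGBM
  • H2O
  • CatBoost
    • its advantage is that you do not need to spend much time tuning the model
  • scikit-learn
    • any scikit-learn model can be used as the base model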
Stacking

How it works

Different models perform differently on different parts of the data; another model is used to predict which model does better, and then assigns weights to the different models.

An example

Things to note

  • for time-series problems the split must not be random
  • the models should be as diverse as possible
  • diversity comes from
    • different algorithms
    • different features
  • the meta model (the model on top of the models) can be as simple as possible
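
A minimal holdout-stacking sketch under these notes. The choice of base models, the three-way split and the linear meta model are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=300)

# Part A trains the base models, part B trains the meta model, the rest is test.
X_a, y_a = X[:100], y[:100]
X_b, y_b = X[100:200], y[100:200]
X_test = X[200:]

# Diverse base models: a linear model and a tree.
base_models = [LinearRegression(), DecisionTreeRegressor(max_depth=4, random_state=0)]
for m in base_models:
    m.fit(X_a, y_a)

def base_predictions(X_part):
    """Stack the base models' predictions as columns (the meta features)."""
    return np.column_stack([m.predict(X_part) for m in base_models])

# The meta model is kept simple: a linear model over the base predictions.
meta_model = LinearRegression().fit(base_predictions(X_b), y_b)
stacked_pred = meta_model.predict(base_predictions(X_test))
```

The meta model is trained on rows the base models have not seen, so it learns how much to trust each of them. For time-ordered data the three parts would have to follow the time order instead of an arbitrary slice.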

STACKNET

Unlike plain stacking, the meta model is a neural network

Real Example

Stacking Example
