week 2 basic pipeline
week 3 improve model
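Metrics (evaluation criteria)
mean encoding
If a feature is categorical and has a very large number of possible values (high cardinality), mean encoding is an efficient way to encode it. In practice, this kind of feature engineering can substantially improve model performance.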
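A minimal sketch of plain mean encoding with pandas. The column names `city` and `target` and the toy data are made up for illustration. Each category is replaced by the mean of the target over the rows of that category.

```python
import pandas as pd

# Toy data: `city` is the categorical feature, `target` is a binary label.
df = pd.DataFrame({
    "city":   ["a", "a", "a", "b", "b", "c"],
    "target": [1,   0,   1,   0,   0,   1],
})

# Mean encoding: replace each category with the mean target of that category.
category_means = df.groupby("city")["target"].mean()
df["city_mean_enc"] = df["city"].map(category_means)

print(df)
```

Computed this way, every row sees its own target, so the encoding leaks label information. In practice it is usually regularized, for example by computing it out of fold.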
week 4 improve model
Data
Model
Submission
Evaluation
value formula
test set
Leaderboard
Normal workflow:
1, analyze data
2, fit model
3, submit
4, see public score
5, go back to step 1
Real-world problems
Competitions
The basic principle is divide and conquer
First split the data with one rule, then apply another rule within each part
Drawbacks:
K-nearest neighbors
Points that are close to each other are assumed to have similar labels
Black box
No single algorithm is better than all the others on every task
We cannot win a competition with just one simple algorithm
What EDA can bring
Histograms
Points to note
Spot problems
Missing values may have been filled with the mean
Creative ways to analyze: plot
Statistics
Scatter plots
correlation metric
Count how many meaningful feature combinations the features have
After averaging and then sorting, new features can be constructed
plt.hist(x)
plt.plot(x, '.')
df.describe()
plt.scatter(x1, x2)
pd.plotting.scatter_matrix(df)
df.corr()
plt.matshow(df.corr())
df.mean().sort_values().plot(style=',')
axis = 1 vs. axis = 0: what exactly is the difference?
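A small sketch of the pandas axis semantics on a made-up DataFrame. For reductions, `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row). For `drop`, `axis=1` means the labels are column names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one result per column.
col_sums = df.sum(axis=0)

# axis=1 collapses the columns: one result per row.
row_sums = df.sum(axis=1)

# For drop, axis says what the labels refer to: axis=1 means column labels.
without_b = df.drop("b", axis=1)

print(col_sums, row_sums, without_b, sep="\n")
```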
nunique: the number of distinct values in each column
feats_counts = train.nunique(dropna=False)
feats_counts.sort_values()[:10]
# columns with a single distinct value carry no information
constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)
traintest.drop(constant_features, axis=1, inplace=True)
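Columns whose values are exact duplicates of another column
import numpy as np
from tqdm import tqdm_notebook
dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        # keep the first column, mark later identical columns as duplicates
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)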
nunique = train.nunique(dropna=False)
plt.figure(figsize=(14,6))
_ = plt.hist(nunique.astype(float)/train.shape[0], bins=100)
If the relationship between two features is linear, ...
The test data in a competition may be of lower quality, so ...
In a competition, overfitting means that the model performs worse on the test data than on the validation data
When splitting the data for validation, pay attention to stratification
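A minimal sketch of a stratified split using scikit-learn's `StratifiedKFold`. The toy labels are made up. The point is that each validation fold keeps roughly the same class ratio as the full data set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 20% positives.
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_positive_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Share of positives in this validation fold.
    fold_positive_rates.append(y[valid_idx].mean())

print(fold_positive_rates)
```

Stratification matters most for small or imbalanced data sets, where a purely random split can leave a fold with very few positives.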
time-based validation
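A minimal sketch of a time-based split, assuming a hypothetical `date` column and a made-up cutoff date. Everything before the cutoff is used for training and everything from the cutoff on for validation, so the model is never validated on data older than what it was trained on.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=10, freq="D"),
    "y": range(10),
})

cutoff = pd.Timestamp("2019-01-08")  # hypothetical split date
train = df[df["date"] < cutoff]
valid = df[df["date"] >= cutoff]

print(len(train), len(valid))
```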
id-based validation
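One way to sketch an id-based split is scikit-learn's `GroupKFold`, used here as a stand-in since the original slide is not available. The `user_id` array is made up. All rows of one id end up on the same side of the split, so no id appears in both train and validation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

user_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(len(user_id)).reshape(-1, 1)
y = np.zeros(len(user_id))

gkf = GroupKFold(n_splits=2)
overlaps = []
for train_idx, valid_idx in gkf.split(X, y, groups=user_id):
    # Ids that appear on both sides of the split (should be none).
    shared = set(user_id[train_idx]) & set(user_id[valid_idx])
    overlaps.append(len(shared))

print(overlaps)
```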
Leaderboard probing
Some category values indicate that Y takes one specific value
Use one value to guess another value
What a seriously advanced formula
MSE and logloss can generally be used directly as a model's loss function
But MSPE, MAPE, and RMSLE generally cannot
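For RMSLE, a common workaround (sketched here as an assumption, since the original slides are missing) is to train on the transformed target `z = log1p(y)` with plain MSE and map predictions back with `expm1`. The check below shows why this works: RMSLE on the raw values equals RMSE on the log-transformed values.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 90.0, 1500.0])

# RMSLE on raw targets is the same number as RMSE in log1p space.
z_true, z_pred = np.log1p(y_true), np.log1p(y_pred)
same_metric = np.isclose(rmsle(y_true, y_pred), rmse(z_true, z_pred))

# Predictions made in log space are mapped back with expm1.
y_back = np.expm1(z_pred)

print(same_metric, y_back)
```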
Hand-written loss function for XGBoost
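A custom objective for gradient boosting has to return, for every sample, the first and second derivative of the loss with respect to the raw prediction. The sketch below does this for logloss in plain NumPy and checks the gradient numerically. The function names are illustrative. With XGBoost's native API the same math would sit inside a callable of the form `obj(preds, dtrain)`, with labels read from `dtrain.get_label()`.

```python
import numpy as np

def sigmoid(raw_scores):
    return 1.0 / (1.0 + np.exp(-raw_scores))

def logloss_objective(raw_scores, labels):
    """Gradient and hessian of logloss w.r.t. raw (pre-sigmoid) scores."""
    probs = sigmoid(raw_scores)
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess

def logloss(raw_scores, labels):
    probs = sigmoid(raw_scores)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

scores = np.array([-1.0, 0.0, 2.0])
labels = np.array([0.0, 1.0, 1.0])
grad, hess = logloss_objective(scores, labels)

# Compare the analytic gradient with a central finite-difference estimate.
eps = 1e-6
numeric_grad = (logloss(scores + eps, labels) - logloss(scores - eps, labels)) / (2 * eps)

print(grad, numeric_grad, hess)
```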
Early stopping
Libraries that support MSE as a loss
Libraries that support MAE as a loss
Add weights to the samples
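A sketch of how sample weights turn a percentage-based metric into a weighted standard one. With weights proportional to `1 / y**2`, weighted MSE has the same minimizer as MSPE. With weights proportional to `1 / y`, weighted MAE corresponds to MAPE. The toy targets are made up, and the check uses the simplest possible model, a constant prediction.

```python
import numpy as np

y = np.array([1.0, 2.0, 10.0, 50.0])

# MSPE-style weights: proportional to 1 / y^2, normalized to sum to 1.
w_mspe = (1.0 / y**2) / np.sum(1.0 / y**2)
# MAPE-style weights: proportional to 1 / y, normalized to sum to 1.
w_mape = (1.0 / y) / np.sum(1.0 / y)

def mspe(y_true, pred):
    return np.mean(((y_true - pred) / y_true) ** 2)

# The best constant under weighted MSE is the weighted mean of the targets.
best_const = np.sum(w_mspe * y)

# Brute-force check: the constant that minimizes MSPE on a fine grid.
grid = np.linspace(y.min(), y.max(), 10001)
grid_best = grid[np.argmin([mspe(y, c) for c in grid])]

print(best_const, grid_best)
```

Most libraries accept such weights through a `sample_weight` argument to `fit`.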
The values in the data set need to be changed
Change them beforehand
Add similar parameters
mean encoding example
Use KFold when adding the encoding, to check whether it is only valid locally
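A minimal sketch of K-fold (out-of-fold) mean encoding, assuming hypothetical `city` and `target` columns. The category means are estimated on the other folds and applied to the held-out fold, so no row is encoded with its own target value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city":   ["a", "a", "a", "a", "b", "b", "b", "b"],
    "target": [1,   0,   1,   1,   0,   0,   1,   0],
})

global_mean = df["target"].mean()
df["city_enc"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Category means are estimated on the other folds only ...
    fold_means = df.iloc[fit_idx].groupby("city")["target"].mean()
    # ... and applied to the held-out rows.
    df.loc[df.index[enc_idx], "city_enc"] = (
        df["city"].iloc[enc_idx].map(fold_means).values
    )

# Categories that were absent from the fitting folds fall back to the global mean.
df["city_enc"] = df["city_enc"].fillna(global_mean)

print(df)
```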
Lower the quality of the train data; this could be used in my own project
All of them require regularizing the data
import numpy as np
import matplotlib.pyplot as plt
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]  # top 10 features, in ascending order
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Libraries that can automatically find the best hyperparameters
An example of automated hyperparameter tuning
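The original example was an image that did not survive. As a stand-in, here is a small sketch using scikit-learn's `RandomizedSearchCV` on synthetic data. It is not the library from the course slide, and the search space is made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space over a few random forest hyperparameters.
param_distributions = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 3, 5],
    "max_features": ["sqrt", 0.5, 1.0],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=3,                # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```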
cannot learn the train set
Model | Where |
---|---|
GBDT | XGBoost, LightGBM |
RandomForest, ExtraTrees | scikit-learn |
others | RGF |
n_estimators (the more the better)
max_depth
max_features
min_samples_leaf
n_jobs
SVMs need almost no hyperparameter tuning
Per user, compute the lowest price and the highest price
Per page, compute the position of the lowest-priced item
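These group statistics can be sketched with pandas `groupby`. The column names `user_id`, `page_id`, `position`, and `price` and the toy rows are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "page_id":  [10, 10, 11, 20, 20],
    "position": [1, 2, 1, 1, 2],
    "price":    [5.0, 3.0, 8.0, 7.0, 9.0],
})

# Per user: minimum and maximum price among that user's rows.
df["user_min_price"] = df.groupby("user_id")["price"].transform("min")
df["user_max_price"] = df.groupby("user_id")["price"].transform("max")

# Per page: position of the cheapest item on that page.
cheapest_rows = df.loc[df.groupby("page_id")["price"].idxmin(), ["page_id", "position"]]
cheapest_pos = cheapest_rows.set_index("page_id")["position"]
df["page_min_price_position"] = df["page_id"].map(cheapest_pos)

print(df)
```

`transform` returns a result aligned with the original rows, so the new features can be assigned directly as columns.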
A numeric example
This approach works very well for tree-based algorithms
What is ensembling
Average the predictions of many slightly different versions of a model
example: random forest
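A minimal sketch of this averaging idea (bagging) on synthetic data: each decision tree is fit on a different bootstrap sample and the predictions are averaged. The data and the number of models are made up. A random forest additionally samples a subset of features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_test = np.linspace(0, 6, 100).reshape(-1, 1)
y_test = np.sin(X_test).ravel()

predictions = []
for seed in range(25):
    # Each model sees a different bootstrap sample of the training rows.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx])
    predictions.append(tree.predict(X_test))

predictions = np.array(predictions)
bagged = predictions.mean(axis=0)

# Error of the averaged prediction vs. the average error of single trees.
bagged_mse = np.mean((bagged - y_test) ** 2)
mean_single_mse = np.mean((predictions - y_test) ** 2)

print(bagged_mse, mean_single_mse)
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual models.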
The errors of the current predictions are left for the next model to learn
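A minimal sketch of this residual-fitting idea (boosting) for squared error, on synthetic data. Each shallow tree is fit to what the current ensemble still gets wrong, and its prediction is added with a learning rate. The learning rate, depth, and number of rounds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction = prediction + learning_rate * tree.predict(X)
    train_mse.append(np.mean((y - prediction) ** 2))

print(train_mse[0], train_mse[-1])
```

The training error shrinks with every round, which is also why boosting needs a validation set or early stopping to avoid overfitting.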
The most important kind of model; almost every competition uses it
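Different models perform differently on different parts of the data. Use another model to predict which model does better, then use its output to assign weights to the individual models.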
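A simplified sketch of this idea with a holdout scheme: the base models are trained on one part of the data, and a linear meta-model learns how to weight their predictions on the other part. This variant learns one global weight per model rather than weights that change with the input, and the data and model choices are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = 2.0 * X.ravel() + np.sin(3 * X).ravel() + rng.normal(scale=0.1, size=400)

# Part A trains the base models; part B trains the meta-model.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

base_models = [LinearRegression(), KNeighborsRegressor(n_neighbors=5)]
for model in base_models:
    model.fit(X_a, y_a)

# Base-model predictions on part B become the meta-model's features.
meta_features = np.column_stack([m.predict(X_b) for m in base_models])
meta_model = LinearRegression().fit(meta_features, y_b)

print(meta_model.coef_)  # learned weight of each base model
```

Training the meta-model on data the base models have not seen keeps the weights from simply rewarding whichever base model overfits most.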