Blog Notes 6: [Airbnb] Automated Machine Learning (AML)

Original title: Automated Machine Learning — A Paradigm Shift That Accelerates Data Scientist Productivity @ Airbnb
Original URL: https://medium.com/airbnb-engineering/automated-machine-learning-a-paradigm-shift-that-accelerates-data-scientist-productivity-airbnb-f1f8a10d61f8
Author: Hamel Husain

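Assessment:

  1. AML can take over a lot of the work, but it cannot fully replace a data scientist, because a model ultimately needs domain knowledge and human judgement.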
    The scope of AML is ambitious, however, is it really effective? The answer is it depends on how you use it. Our view is that it is difficult to perform wholesale replacement of a data scientist with an AML framework, because most machine learning problems require domain knowledge and human judgement to set up correctly.

  2. It is especially useful for tabular data.
    Also, we have found AML tools to be most useful for regression and classification problems involving tabular datasets

Common steps that can be automated:

  • EDA plots: plotting all your variables against the target variable being predicted as well as computing summary statistics can save lots of time.
  • Feature transformation and choice of encoding method: encode categorical variables, impute missing values, encode sequences and text
  • Model selection
  • Checking results: learning curves, partial dependence plots (the relationship between a given variable and the final target)
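
The encoding and imputation step listed above can be sketched in a few lines. This is a minimal, dependency-free illustration and not what any particular AML framework does internally: the column names (`age`, `city`), the median imputation rule, and the helper names `fit_preprocessor` and `transform` are all assumptions made for the example.

```python
from statistics import median

def fit_preprocessor(rows, numeric_col, categorical_col):
    """Learn the imputation value and the category vocabulary from training rows."""
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    categories = sorted({r[categorical_col] for r in rows if r[categorical_col] is not None})
    return {
        "numeric_col": numeric_col,
        "categorical_col": categorical_col,
        "fill_value": median(observed),
        "categories": categories,
    }

def transform(rows, prep):
    """Impute missing numeric values and one-hot encode the categorical column."""
    encoded = []
    for r in rows:
        value = r[prep["numeric_col"]]
        if value is None:
            value = prep["fill_value"]
        # Categories unseen during fitting become an all-zero vector.
        one_hot = [1 if r[prep["categorical_col"]] == c else 0 for c in prep["categories"]]
        encoded.append([value] + one_hot)
    return encoded

# Hypothetical toy data; None marks a missing value.
train = [
    {"age": 30, "city": "Paris"},
    {"age": None, "city": "Tokyo"},
    {"age": 50, "city": "Paris"},
]
prep = fit_preprocessor(train, "age", "city")
features = transform(train, prep)
```

Each row becomes `[age, is_Paris, is_Tokyo]`, with the missing age filled by the training median. An AML tool would typically try several such encodings and keep the one that validates best.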

How Airbnb uses AML:

  • Model selection: unbiased presentation of challenger models
  • Detecting data leakage: because AML iterates faster, leakage can be caught early on.
https://blog.csdn.net/jiandanjinxin/article/details/54633475
https://zhuanlan.zhihu.com/p/24357137
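From these we can see that data leakage almost always arises while preparing or sampling the data, when a feature directly tied to the outcome is mistakenly included in the dataset. Slips like this are hard to spot.
Data leakage means using data that should not have been used (see http://sofasofa.io/forum_main_post.php?postid=1001477), for example:
  1. using test-set data or information when training the model
  2. using future data at the current point in time
  3. letting validation-set information take part in building the model while tuning hyperparameters with cross-validation
To expand on the third point: when standardizing features, the correct approach is to fit the standardization on the training set and then apply it to the validation set, rather than standardizing first and splitting off the validation set afterwards. Likewise, for PCA dimensionality reduction, PCA should be fitted on the training set and then applied to the validation set, rather than run on the whole dataset. This is commonly overlooked.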
  • EDA and diagnostics: As mentioned earlier, canonical diagnostics can be automatically generated, such as learning curves, partial dependence plots, feature importances, etc.
    Tasks like exploratory data analysis, pre-processing of data, hyper-parameter tuning, model selection and putting models into production can be automated to some extent with an Automated Machine Learning framework.
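
The standardization case of data leakage noted above can be made concrete with a small sketch. The numbers and the helper names `fit_scaler` and `apply_scaler` are hypothetical; the only point is that the mean and standard deviation come from the training split, so the validation rows never influence the transform.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Compute standardization statistics (mean, population std) from the given values."""
    return mean(values), pstdev(values)

def apply_scaler(values, stats):
    """Standardize values with previously fitted statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column; the last two rows are held out for validation.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, valid = data[:4], data[4:]

# Leaky: statistics are computed on all rows, so the validation values
# shift the mean and scale that would be used for the training rows.
leaky_stats = fit_scaler(data)

# Correct: fit on the training split only, then reuse those statistics.
clean_stats = fit_scaler(train)
train_scaled = apply_scaler(train, clean_stats)
valid_scaled = apply_scaler(valid, clean_stats)
```

The same fit-on-train, apply-to-validation pattern holds for PCA. Inside cross-validation, the fit has to be repeated within each fold rather than done once up front.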

Case Study: Competitive Benchmarks With Customer Lifetime Value Models

(I have already read a separate, complete blog post on this.)
Estimating customer value: The features of this model include demographic, location and activity information

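Initial approach:

They chose XGBoost. Doing so meant tolerating some error in the choice of algorithm, but the model performed well enough to make that an acceptable tradeoff: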
This algorithm performed well on closely related problems.
During model development we did ad-hoc cross validation and XGBoost seemed to do the best.
We had limited time to create this model, and spent most of our time on feature engineering, data cleaning and gluing our model into production systems — which left very little time for rigorous algorithm selection and tuning.

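To see how large the bias was, they ran AutoML and quickly obtained a comparison figure, which showed that the gap was acceptable.

In the end, it turned out that a linear model actually performed quite well too.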
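
The benchmarking idea in this case study, scoring the incumbent and its challengers on identical folds, can be sketched as follows. This is an illustrative toy and not Airbnb's setup: the data is synthetic, and the two candidates (a mean predictor and a one-feature least-squares line) are simple stand-ins for XGBoost and the models an AML framework would propose.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = mean(ys)
    return lambda x: mu

def fit_line(xs, ys):
    """Challenger candidate: ordinary least squares with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def cv_rmse(fit, xs, ys, k=5):
    """k-fold cross-validated RMSE; fold f holds out indices f, f+k, f+2k, ..."""
    squared_errors = []
    for fold in range(k):
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared_errors += [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
    return mean(squared_errors) ** 0.5

# Synthetic data: a linear trend plus a small deterministic wiggle.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 2.0 + (1.0 if i % 2 else -1.0) for i, x in enumerate(xs)]

# Every candidate is scored on the same folds, then ranked by error.
candidates = {"mean_baseline": fit_mean, "linear": fit_line}
leaderboard = sorted((cv_rmse(fit, xs, ys), name) for name, fit in candidates.items())
```

Because every candidate sees the same folds and the same metric, the ranking does not depend on which model the author happened to prefer, which is the sense in which the comparison is unbiased.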

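Takeaways:

  • Data scientists should feel a sense of urgency, because the role is becoming easier and easier to replace; a senior person at MS told me the same thing when we talked yesterday.
  • Domain knowledge and business sense are irreplaceable. I am currently putting more time into building them up in e-commerce and adjacent areas, marketing, and user analytics.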
