Blog Notes 3: [Airbnb] A data science pipeline, an industrial-grade solution

https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d
Author: Robert Chang

1. Customer Lifetime Value (LTV)

The customer lifetime value model. Use cases:
At e-commerce companies like Spotify or Netflix, LTV is often used to make pricing decisions like setting subscription fees. At marketplace companies like Airbnb, knowing users’ LTVs enable us to allocate budget across different marketing channels more efficiently, calculate more precise bidding prices for online marketing based on keywords, and create better listing segments.

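For the whole training, testing, and deployment pipeline, Airbnb uses a lot of amazing tools, so their data scientists don't have to spend much attention on data engineering. The pipeline has four main steps: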
Feature Engineering: Define relevant features
Prototyping and Training: Train a model prototype
Model Selection & Validation: Perform model selection and tuning
Productionization: Take the selected model prototype to production

2. Feature Engineering:

Airbnb's internal feature repository, Zipline, ships ready-made features (150+), which saves writing tedious Hive queries.
Some common business features:
Location: country, market, neighborhood and various geography features
Price: nightly rate, cleaning fees, price point relative to similar listings
Availability: Total nights available, % of nights manually blocked
Bookability: Number of bookings or nights booked in the past X days
Quality: Review scores, number of reviews, and amenities
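
As a rough illustration of how two of the features above could be derived from raw records, here is a minimal plain-Python sketch. It is not Zipline: the record layout, the field names (`market`, `price`, `booked_on`), the 30-day window, and the choice of "same market" as a stand-in for "similar listings" are all assumptions made up for this example.

```python
from datetime import date, timedelta
from statistics import median

# Hypothetical listing records; every field name here is an assumption.
listings = [
    {"id": 1, "market": "Paris", "price": 120.0,
     "booked_on": [date(2024, 1, 3), date(2024, 1, 20), date(2023, 11, 2)]},
    {"id": 2, "market": "Paris", "price": 80.0,
     "booked_on": [date(2024, 1, 25)]},
    {"id": 3, "market": "Paris", "price": 100.0, "booked_on": []},
]

def bookings_in_last_days(listing, as_of, days):
    """Bookability: number of bookings in the `days` days ending at `as_of`."""
    start = as_of - timedelta(days=days)
    return sum(1 for d in listing["booked_on"] if start < d <= as_of)

def relative_price(listing, all_listings):
    """Price: nightly rate divided by the median rate in the same market."""
    peers = [l["price"] for l in all_listings if l["market"] == listing["market"]]
    return listing["price"] / median(peers)

as_of = date(2024, 1, 31)
features = {
    l["id"]: {
        "bookings_30d": bookings_in_last_days(l, as_of, 30),
        "relative_price": relative_price(l, listings),
    }
    for l in listings
}
print(features)
```

A feature repository exists so that definitions like these are written once and reused, instead of being re-derived in every analysis.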

3. Prototyping and Training

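Model prototypes are built with sklearn and Spark. Haha, they use sklearn too.
- Handling missing data
- Encoding: use one-hot when there are few categories; use ordinal encoding when there are many
The difference between the two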
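
The difference between the two encodings can be sketched in plain Python. The `markets` column is made-up data, and ranking categories by frequency count is just one possible ordering chosen for this example. In sklearn, the `OneHotEncoder` and `OrdinalEncoder` classes cover the same ground.

```python
from collections import Counter

# Made-up categorical column (hypothetical data).
markets = ["Paris", "Tokyo", "Paris", "Lima", "Paris", "Tokyo"]

def one_hot(values):
    """One 0/1 column per category: fine when there are few categories."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal_by_frequency(values):
    """A single integer column, however many categories there are.
    Categories are ranked by frequency count (most frequent gets 0)."""
    counts = Counter(values)
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    mapping = {c: i for i, c in enumerate(ranked)}
    return mapping, [mapping[v] for v in values]

categories, one_hot_rows = one_hot(markets)
mapping, ordinal_col = ordinal_by_frequency(markets)

print(categories)       # column order of the one-hot matrix
print(one_hot_rows[0])  # encoding of the first value, "Paris"
print(mapping)
print(ordinal_col)
```

The one-hot matrix gains a column for every new category, while the ordinal encoding stays a single column. That is why high-cardinality features tend to favor ordinal encoding.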

4. Performing Model Selection

  • Many AutoML tools, for example:
    • TPOT
    • Auto-Sklearn
    • Auto-Weka
    • Machine-JS
    • DataRobot
  • Models: xgboost and other common models
  • Bias-variance tradeoff: weigh interpretability against complexity. A more complex model can be more accurate but is harder to interpret and more prone to overfitting. See the figure below.
    [Figure 1]
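
To make the bias-variance tradeoff concrete, here is a small sketch using k-nearest-neighbour regression in plain Python. This is a stand-in chosen for brevity, not the model from the Airbnb post; the quadratic `true_fn`, the noise level, and the sample sizes are all assumptions.

```python
import random

random.seed(0)

def true_fn(x):
    return x * x  # hypothetical ground-truth relationship

def sample(n):
    """Draw n noisy observations (x, y) of true_fn."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_fn(x) + random.gauss(0.0, 0.1)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

train, valid = sample(40), sample(40)

# Small k = flexible, complex model (low bias, high variance);
# large k = smooth, simple model (high bias, low variance).
results = {k: (mse(train, train, k), mse(train, valid, k)) for k in (1, 5, 40)}
for k, (train_err, valid_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.4f}  validation MSE={valid_err:.4f}")
```

With k=1 the model memorizes the training set, so its training error is zero while its validation error is not: that gap is overfitting. With k=40 every prediction collapses to the overall mean, which is underfitting. Model selection looks for the setting in between that minimizes validation error.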

5. Productionization: deploying the model

  • Tool used: Airbnb’s notebook translation framework — ML Automator
     ML Automator converts a Jupyter notebook into their own Airflow pipeline; see the figure below.

     - Sometimes a Hive UDF (user-defined function) has to be written in Python so that the model can be deployed in a distributed fashion
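
One common way to run Python logic inside Hive is a streaming script invoked through `TRANSFORM`, which passes rows to the script as tab-separated lines. The sketch below assumes that mechanism; it is not necessarily what ML Automator generates. The script name `revenue_udf.py`, the `listings` table, and the column names are hypothetical.

```python
import io

# Hypothetical Hive usage (table, column, and file names are assumptions):
#   ADD FILE revenue_udf.py;
#   SELECT TRANSFORM (listing_id, price, nights_booked)
#     USING 'python revenue_udf.py' AS (listing_id, revenue)
#   FROM listings;

def transform(stream_in, stream_out):
    """Read tab-separated rows (listing_id, price, nights_booked) and
    write (listing_id, revenue). Hive encodes NULL as the string \\N."""
    for line in stream_in:
        listing_id, price, nights = line.rstrip("\n").split("\t")
        if price == r"\N" or nights == r"\N":
            revenue = r"\N"  # propagate NULL
        else:
            revenue = f"{float(price) * int(nights):.2f}"
        stream_out.write(f"{listing_id}\t{revenue}\n")

# Under Hive, import sys and call transform(sys.stdin, sys.stdout) instead.
# Here, two fake rows stand in for what Hive would stream to the script.
fake_rows = io.StringIO("42\t100.0\t3\n43\t\\N\t2\n")
out = io.StringIO()
transform(fake_rows, out)
print(out.getvalue(), end="")
```

Because each row is processed independently, Hive can run many copies of the script in parallel across the cluster.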

6. Key takeaways

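  • This is really satisfying to read
  • Basic skills such as Hive and Spark still need catching up on; after all, not every company is Airbnb, with tools this slick
  • The bias-variance tradeoff concept
  • Customer Lifetime Value (LTV/CLV) looks like a composite concept; the details still take quite a few ML models to implement.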
    https://www.datascience.com/blog/intro-to-predictive-modeling-for-customer-lifetime-value
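    Using the RFM model to analyze e-commerce customer value:
    Recency: time since the most recent purchase
    Frequency: how often the customer purchases
    Monetary: how much the customer spends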
    https://www.jianshu.com/p/93085954ec4f
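
The three RFM quantities can be computed directly from an order log. The following is a minimal sketch in plain Python; the `orders` data, the customer names, and the `as_of` reference date are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order log: (customer_id, order_date, amount).
orders = [
    ("alice", date(2024, 1, 5), 50.0),
    ("alice", date(2024, 3, 1), 30.0),
    ("bob", date(2023, 12, 20), 200.0),
]

def rfm(orders, as_of):
    """Return {customer: (recency_days, frequency, monetary)}."""
    by_customer = defaultdict(list)
    for customer, day, amount in orders:
        by_customer[customer].append((day, amount))
    table = {}
    for customer, rows in by_customer.items():
        last_order = max(day for day, _ in rows)
        table[customer] = (
            (as_of - last_order).days,          # Recency
            len(rows),                          # Frequency
            sum(amount for _, amount in rows),  # Monetary
        )
    return table

table = rfm(orders, as_of=date(2024, 3, 31))
print(table)
```

Here alice bought recently and often but spent little, while bob spent more in a single older order. Segmenting customers along these three axes is the starting point of an RFM analysis.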
