The Facebook Field Guide to Machine Learning - Part 1

From: https://research.fb.com/the-facebook-field-guide-to-machine-learning-video-series/
2018.6.28

Contents
- 1 problem definition
- 2 data
- 3 evaluation
- 4 features
- 5 models
- 6 experimentation

§1 Problem Definition

1.1 Look for

  • does the event occur within a reasonable period of time?
    faster responses lead to faster iteration times
  • is the data too sparse?
    A good heuristic for imbalanced datasets is to consider the number of rare labels and work from there.
    Different algorithms suit different amounts of data and different levels of sparsity.
    (A quick sparsity check is sketched after this list.)
  • do the features contain information about the event?
  • is the true desired outcome sensitive to variation?
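
As a rough illustration of the sparsity check above, the snippet below counts the rare labels in a hypothetical binary label array; the 0.1% positive rate and the 1,000-example threshold are made-up numbers for illustration, not figures from the guide.

```python
import numpy as np

# Hypothetical binary labels for the event of interest (1 = event happened).
# The 0.1% positive rate is an assumption for illustration only.
labels = np.random.binomial(1, 0.001, size=1_000_000)

n_pos = int(labels.sum())
pos_rate = n_pos / len(labels)
print(f"positive examples: {n_pos}, positive rate: {pos_rate:.5f}")

# Rough heuristic: if the rare class has only a handful of examples,
# the problem may be too sparse for the intended model class.
if n_pos < 1000:
    print("Warning: very few positive labels; consider a longer collection window or a coarser label.")
```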

1.2 Ask yourself these questions (e.g.)

  • 1 Do you care more about precision or recall?
  • 2 Is it more important to get the prediction right for first-time users or for existing users?

Thinking things through at the problem-definition stage beats finishing an experiment only to discover the approach was wrong.

1.3 To conclude

  • 1 Determine the right task for your project
  • 2 Simple is better than complicated
  • 3 Define your label and training example precisely
  • 4 Don't prematurely optimize

§2 Data

Contents
- 1 Data recency and real time training
- -  recency, e.g. data from one year vs. the next
- -  periodicity, e.g. Black Friday vs. the week before it
- 2 Training / Prediction consistency
- 3 Records and sampling

2.2 Training / Prediction consistency

For this problem, we face two challenges:

  • 1 You need a real-time service of some kind to do the join described previously (i.e., the kind of join you would otherwise run in a database).
    Batch query systems like Hive and Presto are typically not optimized for real-time queries.
  • 2 You need to carefully consider the tradeoffs, or adjust for inaccuracies in the training data, caused by the delay between events.
    For example, some features are static and can simply be reloaded, such as a user's gender or nationality; others cannot be reloaded, such as how many likes a user has received so far. For those time-varying features, ensuring that the feature version matches the right point in time is difficult (see the sketch after this list).
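
A minimal sketch of one way to keep training and prediction consistent: log the feature values that were actually used at prediction time and later join the labels onto those snapshots, instead of recomputing time-varying features after the fact. The table layout, column names (request_id, likes_so_far, clicked), and timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical feature snapshots logged at prediction time (one row per request).
logged_features = pd.DataFrame({
    "request_id": [1, 2, 3],
    "ts": pd.to_datetime(["2018-06-01 10:00", "2018-06-01 10:05", "2018-06-01 10:10"]),
    "likes_so_far": [12, 340, 7],   # time-varying feature, frozen at prediction time
    "country": ["US", "FR", "BR"],  # static feature, safe to reload later
})

# Labels arrive later, once the outcome of each request is known.
labels = pd.DataFrame({
    "request_id": [1, 2, 3],
    "clicked": [0, 1, 0],
})

# Join the delayed labels onto the logged snapshots rather than recomputing
# time-varying features afterwards, so training sees exactly what serving saw.
training_data = logged_features.merge(labels, on="request_id", how="inner")
print(training_data)
```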

2.3 Records and sampling

  • More data is almost always better from a pure machine learning perspective, but it comes with increased training time.
  • So we need to balance the benefit of more data against slower iteration cycles and heavier data and compute loads.
  • Techniques like importance sampling can boost the accuracy of your model while lowering the computational cost of training (a downsampling sketch follows this list).
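
The snippet below is one minimal sketch of the idea: keep all positives, downsample the negatives, and attach importance weights so the downsampled negatives still represent their original mass. The 1% positive rate and the 10% keep rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical event log with very few positives (1% positive rate assumed).
df = pd.DataFrame({"label": rng.binomial(1, 0.01, size=100_000)})

keep_rate = 0.1  # keep only 10% of the negatives to shrink the training set

is_pos = df["label"] == 1
keep_neg = rng.random(len(df)) < keep_rate
sampled = df[is_pos | keep_neg].copy()

# Importance weight = 1 / sampling probability, so downsampled negatives
# still contribute their original total weight to the training loss.
sampled["weight"] = np.where(sampled["label"] == 1, 1.0, 1.0 / keep_rate)

print(len(df), "->", len(sampled), "rows;",
      "weighted negative mass:", sampled.loc[sampled["label"] == 0, "weight"].sum())
```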

§3 Evaluation

Usually a good flow is to use offline evaluation until you have a viable candidate, and then validate it with online experiments.

3.1 Baseline model

simplest possible model

3.2 Best offline practice

Split the dataset into three parts (a split example follows the list):

  • training set
  • evaluation set
    for tuning hyper-parameters
  • test set
    to report final results only; it must not be used for training or tuning
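
A minimal sketch of this three-way split with scikit-learn; the 60/20/20 proportions and the random seed are assumptions, not a recommendation from the guide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data; in practice X and y come from your logged examples.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

# Carve out the test set first, then split the rest into train / evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# 60% train / 20% evaluation (hyper-parameter tuning) / 20% test (final report only).
print(len(X_train), len(X_eval), len(X_test))
```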

3.3 Evaluation

3.3.1 cross validation

need to shuffle the dataset randomly before splitting (see the sketch below)
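
A small sketch using scikit-learn's KFold with shuffling; the model, fold count, and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# shuffle=True randomizes the example order before the folds are cut,
# which matters if the data was logged in time or category order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: mean={scores.mean():.3f}, std={scores.std():.3f}")
```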

3.3.2 progressive evaluation

  • try to mimic the online system
    train on past logged data and use the model to predict on incoming traffic
  • is a common way of making evaluation closer to the desired online use case
    e.g. extract small amounts of specific data before using lots of non-specific data; this doesn't mean that only the former is enough (a rolling-evaluation sketch follows the figure below)
    [Figure 1: selecting subsets]
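
The rolling setup below is a minimal sketch of progressive evaluation under assumed conditions: synthetic daily logs, made-up feature and column names, and a simple logistic regression; it trains on all days before day d and evaluates on day d, mimicking prediction on incoming traffic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
days = pd.date_range("2018-06-01", periods=10, freq="D")
feature_cols = [f"f{i}" for i in range(5)]

# Hypothetical logged data: one block of examples per day.
frames = []
for d in days:
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)
    f = pd.DataFrame(X, columns=feature_cols)
    f["label"] = y
    f["day"] = d
    frames.append(f)
data = pd.concat(frames, ignore_index=True)

# Progressive evaluation: train on all past days, predict on the next day.
for d in days[3:]:
    past, today = data[data["day"] < d], data[data["day"] == d]
    model = LogisticRegression().fit(past[feature_cols], past["label"])
    auc = roc_auc_score(today["label"], model.predict_proba(today[feature_cols])[:, 1])
    print(d.date(), f"AUC={auc:.3f}")
```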

3.3.3 Metrics

  • Metrics should be both interpretable and sensitive to improvements in the model
    e.g. if the base rate of positives in the data is only 0.001, then always predicting 0 already gets 99.9% of examples right, which is a very hard baseline to beat on accuracy. In this case, the metric lacks sensitivity.
  • Metrics such as the area under the curve (AUC), for classification, or r^2, for regression, have the nice property of coming with a built-in baseline -- an AUC of 0.5 / an r^2 of 0. Both correspond to the best constant predictor, obtained by predicting the empirical average (a quick check is sketched after this list).
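
As a quick check of those built-in baselines, the sketch below scores a constant predictor on synthetic data; the label distributions are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, r2_score

rng = np.random.default_rng(0)

# Classification: a constant score ranks every example equally -> AUC = 0.5.
y_cls = rng.binomial(1, 0.3, size=1000)
const_scores = np.full(len(y_cls), 0.5)
print("AUC of constant predictor:", roc_auc_score(y_cls, const_scores))

# Regression: predicting the empirical mean of the targets -> r^2 = 0.
y_reg = rng.normal(loc=3.0, scale=2.0, size=1000)
mean_pred = np.full(len(y_reg), y_reg.mean())
print("r^2 of mean predictor:", r2_score(y_reg, mean_pred))
```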

3.3.4 To conclude

In both cases, we're interested in the performance of our model on the test set. If the model performs much better on the training or evaluation set compared to the test set, we're most likely overfitting the training data, and in that case the model does not generalize well to new examples.

3.4 Calibration of model

$$\text{calibration} = \frac{\sum \text{labels}}{\sum \text{predictions}}$$

about calibration

  • ensure the model is calibrated on both the train and test sets
  • if the model is under- or over-calibrated on the test set, it is probably overfitting (a small check is sketched below)
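
A minimal sketch of the calibration check on both splits, using a synthetic imbalanced dataset and a logistic regression as stand-ins for whatever model you actually train.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def calibration(y_true, p_pred):
    # calibration = sum of labels / sum of predicted probabilities;
    # a value close to 1.0 means the model predicts the right overall rate.
    return y_true.sum() / p_pred.sum()

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("train calibration:", calibration(y_tr, model.predict_proba(X_tr)[:, 1]))
print("test calibration: ", calibration(y_te, model.predict_proba(X_te)[:, 1]))
```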

3.5 Splitting the dataset by category

Diving into the performance on different sub-sets of the evaluation data is a useful way to understand where the performance comes from and whether it is expected (a per-category example is sketched after the figure below).


[Figure 2: splitting by category and evaluating each subset separately]
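
A small sketch of per-category evaluation, assuming a hypothetical evaluation frame with a country column, binary labels, and model scores; the metric and the way predictions are simulated here are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical evaluation set with per-example predictions and a category column.
eval_df = pd.DataFrame({
    "country": rng.choice(["US", "BR", "IN"], size=3000),
    "label": rng.binomial(1, 0.2, size=3000),
})
# Simulated model scores that are mildly correlated with the labels.
eval_df["pred"] = eval_df["label"] * 0.4 + rng.random(3000) * 0.6

# Per-category AUC shows which sub-populations drive the overall performance.
per_group = eval_df.groupby("country")[["label", "pred"]].apply(
    lambda g: roc_auc_score(g["label"], g["pred"]))
print(per_group)
```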

3.6 To conclude

[Figure 3: evaluation summary]
