Machine learning pipeline: project workflow and basic operations, since 2020.03.22

1 Look at the big picture

2 Get the data

3 Discover and visualise the data to gain insights

4 Prepare the data for machine learning algorithms

5 Select a model and train it

6 Fine-tune your model

7 Present your solution

8 Launch, monitor, and maintain your system


1 Look at the big picture

(2020.04.04)

Machine learning pipeline: a sequence of data processing components. Components in a pipeline typically run asynchronously.

After understanding the problem, design the system: 1) unsupervised/supervised/reinforcement learning, 2) classification/regression/other, 3) batch learning/online learning.

Checklist for the problem-framing stage. Frame the problem:

1) Define the objective in business terms.

2) How would your solution be used?

3) What are the current solutions/workarounds (if any)?

4) How should you frame the problem (supervised/unsupervised, classification/regression/other, batch/online learning, etc.)?

5) How should performance be measured?

6) Is the performance measure aligned with the business objective?

7) What would be the minimum performance needed to reach the business objective?

8) What are comparable problems? Can you reuse experience or tools?

9) Is human expertise available?

10) How would you solve the problem manually?

11) List the assumptions you (or others) have made so far.

12) Verify assumptions if possible.

(2020.03.29 Sun)

Select a performance measure
A typical measure for regression problems: Root Mean Square Error (RMSE), the standard deviation of the system's prediction errors.

Mean Absolute Error (MAE): the average of the absolute errors, preferred when there are many outliers.

Both RMSE and MAE measure the distance between two vectors: the vector of predictions and the vector of target values.

RMSE sums squared differences and corresponds to the Euclidean norm, i.e., the l2 norm.

MAE corresponds to the Manhattan norm, i.e., the l1 norm.

Norms in general: the lk norm of a vector v is ||v||_k = (|v_1|^k + ... + |v_n|^k)^(1/k); l2 is the usual Euclidean distance and l1 the Manhattan distance.

The higher the norm index, the more it focuses on large values and neglects small ones. RMSE is therefore more sensitive to outliers than MAE, but when outliers are exponentially rare (as in a bell-shaped distribution), RMSE performs very well.
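A minimal sketch of computing both measures with scikit-learn; y_true and y_pred are made-up arrays for illustration:

```
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# hypothetical targets and predictions
y_true = np.array([200000., 320000., 150000., 410000.])
y_pred = np.array([210000., 300000., 180000., 400000.])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # l2-style measure
mae = mean_absolute_error(y_true, y_pred)           # l1-style measure
print(rmse, mae)
```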


2 Get the data

(2020.04.04)

Checklist (automate as much as possible so you can easily get fresh data):

1) List the data you need and how much you need.

2) Find and document where you can get the data.

3) Check how much space it will take.

4) Check legal obligations, and get authorisation if necessary.

5) Get access authorisation.

6) Create the workspace (with enough storage space).

7) Get the data.

8) Convert the data to a format you can easily manipulate (without changing the data itself).

9) Ensure sensitive information is deleted or protected (e.g., anonymised).

10) Check the size and type of data (time series, sample, geographical, etc.).

11) Sample a test set, put it aside, and never look at it (no data snooping).

PS:

1) Data snooping bias: when you estimate the generalisation error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected.
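A minimal sketch of step 11, assuming the raw data is already in a DataFrame named df: split once with a fixed random seed and set the test set aside.

```
from sklearn.model_selection import train_test_split

# split once, keep the seed fixed, and do not look at test_set again until the end
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
```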


3 Discover and visualise the data to gain insights

(2020.04.04)

Checklist (try to get insights from a field expert for these steps):

1) Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

2) Create a Jupyter notebook to keep a record of your data exploration.

3) Study each attribute and its characteristics: name / type (categorical, int/float, bounded/unbounded, text, structured, etc.) / % of missing values / noisiness and types of noise (stochastic, outliers, rounding errors, etc.) / usefulness for the task / type of distribution (Gaussian, uniform, logarithmic, etc.).

4) For supervised learning tasks, identify the target attribute(s).

5) Visualise the data.

6) Study the correlations between attributes.

7) Study how you would solve the problem manually.

8) Identify the promising transformations you may want to apply.

9) Identify extra data that would be useful.

10) Document what you have learned.

(2020.04.06)

Routine checks: a) find data anomalies (e.g., missing values, outliers) and clean/handle them, b) look at correlations between variables, especially between each variable and the target, c) some variables have long-tailed distributions, for which a logarithmic transform may help, d) combine variables (PCA/SVD/etc.).
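A small sketch of checks b) and c), assuming a pandas DataFrame df with a numeric target column 'target' and a long-tailed column 'feature_x' (both names hypothetical):

```
import numpy as np

# b) correlations with the target variable
corr = df.select_dtypes('number').corr()['target'].sort_values(ascending=False)
print(corr)

# c) long-tailed attribute: add a log-transformed version and compare histograms
df['feature_x_log'] = np.log1p(df['feature_x'])
df[['feature_x', 'feature_x_log']].hist(bins=50)
```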


4 Prepare the data for machine learning algorithms

(2020.04.04)

Notes: 

a) Work on copies of the data (keep the original dataset intact).

b) Write functions for all data transformations you apply, for five reasons:

    -So you can easily prepare the data the next time you get a fresh dataset

    -So you can apply these transformations in future projects

    -To clean and prepare the test set

    -To clean and prepare new data instances once your solution is live

    -To make it easy to treat your preparation choices as hyperparameters

1) Data cleaning: 

    *Fix or remove outliers (optional)

    *Fill in missing values (e.g., with 0, mean, median...) or drop their rows (or columns)

2) Feature selection (optional):

    *Drop the attributes that provide no useful information for the task

3) Feature engineering, where appropriate:

    *Discretise continuous features

    *Decompose features (e.g., categorical, date/time, etc.)

    *Add promising transformations of features (e.g., log(x), sqrt(x), x^n, etc.). For attributes with a long-tail distribution you may want a logarithm. You may find interesting correlations between attributes, in particular with the target attribute; try out various attribute combinations.

    *Aggregate features into promising new features

4) Feature scaling: standardise or normalise features
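A short sketch of steps 3) and 4) on the book's housing data (the column names are the dataset's; missing values are assumed to have been handled already):

```
from sklearn.preprocessing import StandardScaler

# 3) aggregate features into promising new ones
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']

# 4) feature scaling: fit on the training data only, then reuse the fitted scaler
scaler = StandardScaler()
housing_num_scaled = scaler.fit_transform(housing.drop('ocean_proximity', axis=1))
```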


5 Select a model and train it

(2020.04.05 Sun)

Notes: a) If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware this penalises complex models such as large neural nets or Random forests). b) Once again, try to automate these steps as much as possible.

1) Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forest, neural net, etc.) using standard parameters

2) Measure and compare their performance 

    -For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds

3) Analyse the most significant variables for each algorithm

4) Analyse the types of errors the models make

    -What data would a human have used to avoid these errors?

5) Have a quick round of feature selection and engineering

6) Have one or two more quick iterations of the five previous steps

7) Short-list the top 3 to 5 most promising models, preferring models that make different types of errors.
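A minimal sketch of steps 1) and 2): train a few quick models with default parameters and compare them with cross-validation (X_train and y_train are assumed to be the prepared training data):

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
    'forest': RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             scoring='neg_mean_squared_error', cv=5)
    rmse = np.sqrt(-scores)
    print(name, rmse.mean(), rmse.std())
```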


6 Fine-tune your model

Training a model really means two things: learning the model parameters (weights) and tuning the model hyperparameters (parameters that are not updated by the training process itself).

Fine-tuning the model means tuning the hyperparameters. Three approaches: grid search, random search, and ensemble methods.

(2020.04.05 Sun)

Notes:

a) You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning

b) As always automate what you can

1) Fine-tune the hyperparameters using cross-validation

    -Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with 0 or with median values? Or just drop the rows?)

    -Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimisation approach (e.g., Using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams)

2) Try ensemble methods. Combining your best models will often perform better than running them individually.

3) Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error.

PS: Don't tweak your model after measuring the generalisation error; you would just start overfitting the test set.

(2020.03.29 Sat)

Grid search: try every combination of the hyperparameters. Scikit-Learn's GridSearchCV implements this: pass the parameter names and all the values to try, and it evaluates each combination with cross-validation.

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

para_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 3, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [3, 4, 5]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, para_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
```

In this example, the first entry of para_grid gives 3*4 = 12 combinations and the second 1*2*3 = 6, i.e., 18 combinations in total. With cv = 5, each combination is evaluated with 5-fold cross-validation, so there are 5*18 = 90 training rounds in total.

Inspect the results via grid_search.best_params_, grid_search.best_estimator_ and grid_search.cv_results_.

Randomised Search:

As the name suggests, it samples random combinations from the hyperparameter search space.

from sklearn.model_selection import RandomizedSearchCV
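A minimal usage sketch mirroring the grid-search example above (the parameter ranges are illustrative; housing_prepared and housing_labels are the same as before):

```
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)
```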

Ensemble methods

e.g., Random Forest.

Evaluate models (especially the hyperparameters)

The goal of machine learning is a model that generalises, i.e., performs well on data it has never seen before; overfitting is the central difficulty.

Split the training data further into a training set and a validation set.

Three methods to evaluate models (and tune hyperparameters): hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.

Hold-out validation:

The data is first split into a training set (which itself contains a training part and a validation part) and a test set. Having a validation set avoids tuning the model on the test set.

Workflow: shuffle the data --> define the validation/training sets --> train on the training set and evaluate on the validation set --> once the hyperparameters are tuned, it is common to retrain the final model from scratch on all non-test data.

```
import numpy as np

# Hold-out validation (pseudocode sketch: get_model(), training_data,
# validation_data and test_data are assumed to be defined)
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# Once the hyperparameters are tuned, retrain from scratch on all non-test data
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
```

This is the simplest method, but when data is scarce each split ends up with too few samples.

K-fold validation:

Split the whole dataset into K partitions of equal size. For each of the K folds, train on K-1 partitions and validate on the remaining one, so that each partition serves as the validation set exactly once. Each fold yields one performance score, and the average of the K scores is the model's final score. This method is useful when the model's performance varies a lot depending on the split. A separate test set is still needed for final testing. After obtaining the average score, retrain the model on the training + validation data.

```
import numpy as np

# K-fold validation sketch (pseudocode: data, test_data and get_model() assumed defined)
k = 4
num_validation_samples = len(data) // k
validation_scores = []

for fold in range(k):
    # select the validation partition for this fold
    validation_data = data[num_validation_samples * fold: num_validation_samples * (fold + 1)]
    # use the remaining data for training
    training_data = np.concatenate([data[:num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1):]])
    model = get_model()  # fresh, untrained model instance
    model.train(training_data)
    validation_scores.append(model.evaluate(validation_data))

# the final validation score is the average of the K fold scores
validation_score = np.average(validation_scores)

# then train the final model on all non-test data
model = get_model()
model.train(data)
```

(Note: why take the average of the K scores? Because a single fold's score depends heavily on which samples happen to fall into the validation split; averaging over the K folds gives a lower-variance estimate of generalisation performance.)

Iterated K-fold validation with shuffling:

Unlike the previous method, the data is shuffled before each of the P times it is split into K partitions, so P*K models are trained in total. This is computationally expensive, but common in Kaggle competitions.
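In scikit-learn this scheme is available as RepeatedKFold; a minimal sketch (model, X and y assumed defined):

```
from sklearn.model_selection import RepeatedKFold, cross_val_score

# K = 5 folds, repeated P = 3 times with a fresh shuffle each time -> 15 fits
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=rkf)
```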


7 Present your solution

1) Document what you have done.

2) Create a nice presentation

    -Make sure you highlight the big picture first

3) Explain why your solution achieves the business objective.

4) Don't forget to present the interesting points you noticed along the way.

    -Describe what worked and what did not

    -List your assumptions and your system's limitations.

5) Ensure your key findings are communicated through beautiful visualisations or easy-to-remember statements (e.g., 'the median income is the number-one predictor of housing prices').


8 Launch, monitor, and maintain your system

1) Get your solution ready for production (plug into production data inputs, write unit tests, etc.).

2) Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.

    -Beware of slow degradation too: models tend to 'rot' as data evolves

    -Measuring performance may require a human pipeline (e.g., via a crowdsourcing service)

    -Also monitor your input's quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.

3) Retrain your models on a regular basis on fresh data (automate as much as possible). 


Basic operations

(2020.04.04)

1 Inspect basic information about a DataFrame

import pandas as pd; import matplotlib.pyplot as plt

df.head(n): first n rows, default n = 5

df.info(): quick description of the data

df['some_field'].value_counts(): counts of each value in the field

df.hist(bins = 50, figsize=(20,15))

plt.show(): these two commands together plot a histogram for each numerical variable; bins is the number of bars, i.e., the granularity

(2020.04.05)

np.random.permutation(n): generates a random permutation of the integers 0 to n-1, used to shuffle indices.

The simplest way to shuffle and split the data:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 42)

where housing is a DataFrame containing the fields, and random_state allows you to set the random generator seed.

A sampling method: stratified sampling.

The population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. This avoids sampling bias.

from sklearn.model_selection import StratifiedShuffleSplit as SSS

split = SSS(n_splits= 1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing['income_cat']):

    strat_train_set = housing.loc[train_index]

    strat_test_set = housing.loc[test_index]
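The income_cat column used above is assumed to exist; one common way to create it (bucketing median_income, as in the book's housing example) is:

```
import numpy as np
import pandas as pd

# bucket median_income into 5 income categories used only for stratification
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
```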

Scatter plots

1) housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude')  #, alpha = 0.1)

2) from pandas.plotting import scatter_matrix  (older pandas versions used pandas.tools.plotting)

attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']

scatter_matrix(housing[attributes], figsize=(12,8))

or

housing.plot(kind='scatter', x='median_income', y = 'median_house_value', alpha=0.1)

Correlations between attributes

corr_matrix = housing.corr()  # correlation matrix

>> corr_matrix['median_house_value'].sort_values(ascending=False)  # correlations between median_house_value and the other variables

Data cleaning

Ways to handle missing values:

1) housing.dropna(subset=['total_bedrooms'])  # drop the rows with missing values

2) housing.drop('total_bedrooms', axis = 1)  # drop the whole attribute

3) median = housing['total_bedrooms'].median(); housing['total_bedrooms'].fillna(median)  # fill with the median

For numerical (non-text) attributes you can also use an Imputer (in recent scikit-learn, SimpleImputer; see the sketch below):

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy = 'median')

housing_num = housing.drop('ocean_proximity', axis = 1)

imputer.fit(housing_num)

>> imputer.statistics_  # the same values as housing_num.median().values

Then use the trained imputer to replace the missing values with the median:

x = imputer.transform(housing_num) # type(x) = np.array

housing_tr = pd.DataFrame(x, columns = housing_num.columns)
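In recent scikit-learn versions Imputer was replaced by SimpleImputer (sklearn.impute); an equivalent sketch:

```
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
housing_num = housing.drop('ocean_proximity', axis=1)
x = imputer.fit_transform(housing_num)
housing_tr = pd.DataFrame(x, columns=housing_num.columns)
```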

Handle text and categorical attributes

Convert text labels to numbers:

>> from sklearn.preprocessing import LabelEncoder as LE

>> encoder = LE()  # encoder.classes_ shows the learned classes

>> housing_cat = housing['ocean_proximity']

>> housing_cat_encoded = encoder.fit_transform(housing_cat)  # returns an array

The problem with such numeric conversion: ML algorithms will assume that two nearby values are more similar than two distant values.

Remedy: one-hot encoding, i.e., only one attribute is equal to 1 (hot) while the others are 0 (cold).

>> from sklearn.preprocessing import OneHotEncoder as ohe

>> encoder = ohe()

>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))

>> housing_cat_1hot  # a SciPy sparse matrix

>> housing_cat_1hot.toarray()  # returns a dense NumPy array

Convert text directly to one-hot binary values:

>> from sklearn.preprocessing import LabelBinarizer as lb

>> encoder = lb()

>> housing_cat_1hot = encoder.fit_transform(housing_cat) # type(housing_cat_1hot) = array
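In scikit-learn 0.20+ the OneHotEncoder accepts string categories directly, so the intermediate LabelEncoder step is no longer needed; a minimal sketch:

```
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[['ocean_proximity']])  # sparse matrix
print(cat_encoder.categories_)
```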

Feature scaling

Two kinds: min-max scaling (normalisation) and standardisation.

Min-max scaling, a.k.a. normalisation: values are shifted and rescaled so that they end up ranging from 0 to 1. Formula: (x - min) / (max - min).

Standardisation: first subtract the mean, then divide by the standard deviation so that the resulting distribution has unit variance; standardised values have zero mean. Advantage: standardisation is much less affected by outliers. Formula: (x - mean) / std. See StandardScaler in sklearn.

Transform pipeline

>> from sklearn.pipeline import Pipeline

>> from sklearn.preprocessing import StandardScaler

...
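A minimal sketch of a numeric transform pipeline under the same assumptions as above (housing_num is the numerical part of the data):

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```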

Underfitting

If the error is too large, the model may be underfitting: the features do not provide enough information to make good predictions, or the model is not powerful enough.

Remedies:

1) select a more powerful model

2) feed the training algorithm with better features

3) reduce the constraints on the model

(2020.04.06-08)

Overfitting (more content later)

Good performance on the training set but poor performance on unseen data (e.g., the test set). Techniques that reduce overfitting are called regularisation.

1) L1/L2 regularisation. Used in supervised learning, where the goal is to minimise the error function, i.e., the difference between predictions and true values; a penalty term, the regularisation term, is added to the error function:

J' = J + \lambda ||w||_1    (L1 regularisation)

J' = J + \lambda ||w||_2^2    (L2 regularisation)

(Why does regularisation reduce overfitting? See reference 3: differentiating the cost/error function with respect to the weights shows that the regularisation term contributes to the gradient of w but not of the bias b, so regularisation leaves b unaffected. With L2 regularisation, the multiplier on w in the update changes from 1 to a factor smaller than 1, i.e., weight decay; since overfitted models tend to have large weights, shrinking the weights can be seen as reducing overfitting. Why do overfitted models have large weights? An overfitted function tries to pass through every data point, so it fluctuates strongly over small intervals, which requires large derivatives and hence large weights; L2 regularisation avoids this by constraining the weights.)
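For linear models, scikit-learn exposes L2- and L1-regularised regression as Ridge and Lasso; a minimal sketch (X_train and y_train assumed prepared; alpha plays the role of lambda):

```
from sklearn.linear_model import Ridge, Lasso

ridge_reg = Ridge(alpha=1.0)   # L2 penalty: shrinks the weights (weight decay)
lasso_reg = Lasso(alpha=0.1)   # L1 penalty: tends to drive some weights to exactly 0
ridge_reg.fit(X_train, y_train)
lasso_reg.fit(X_train, y_train)
```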

2) Pruning, for decision tree algorithms.

3) Data augmentation: artificially generate new training instances from the existing data to enlarge the training set; this helps reduce overfitting. In computer vision this is done by rotating, resizing, etc.; in NLP by synonym replacement; in speech processing by adding white noise. The new instances can also be generated on the fly during training.

4) Dropout in neural networks. During each training step, only part of the neurons take part in forward and backward propagation while the rest are left unchanged; each neuron effectively sees only part of the training samples, similar to sampling the training set as in bagging, so the final result is a combination of many neural networks. Every neuron (including input neurons but excluding output neurons) has a probability p of being temporarily dropped out at a given training step, i.e., it is entirely ignored during this step but may be active during the next one. The probability p is called the dropout rate and is often set to 50%. After training, neurons are no longer dropped. Since at each step every neuron is either kept or dropped, there are 2^N possible networks (N = total number of neurons); training for 1,000 steps can be seen as training 1,000 different networks (assuming 2^N >> 1000). These networks are not independent because they share weights, and the result can be viewed as an averaging ensemble of all the networks encountered during training. If the model overfits, increase the dropout rate; if it underfits, decrease it. It can also help to increase the dropout rate for large layers and reduce it for small layers. The drawback is slower convergence and longer training time, but usually a better model.

from tensorflow.contrib.layers import dropout
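tf.contrib was removed in TensorFlow 2.x; a minimal equivalent sketch with the Keras API (the layer sizes are illustrative):

```
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),   # dropout rate p = 50%
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(1),
])
```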


5) Max-norm regularisation

Used in neural networks: the weight vector w of a neuron's incoming connections is constrained so that ||w||_2 <= r, where r is the max-norm hyperparameter. To implement it, compute ||w||_2 after each training step and, if necessary, clip w (w <- w * r / ||w||_2).

Reducing r increases the amount of regularisation and thus reduces overfitting. Max-norm regularisation can also help alleviate the vanishing/exploding gradients problem (if you are not using Batch Normalisation).

6) Early stopping: stop training as soon as the validation-set error starts to increase (or validation-set performance starts dropping). One TensorFlow implementation: at regular intervals (e.g., every 50 steps) evaluate the model on the validation set and save a 'winner' snapshot if it outperforms the previous winner; count the steps since the last winner snapshot and set a limit (e.g., 2,000 steps); once that limit is reached, stop training and restore the last winner snapshot. Early stopping works even better when combined with other regularisation techniques.
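With Keras the same idea is available as a callback; a minimal sketch (the model and the training/validation arrays are assumed to be defined):

```
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
          epochs=1000, callbacks=[early_stopping])
```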

7) Ensemble methods: bagging combines the outputs of several models and reduces variance; boosting can reduce both bias and variance.

Cross-validation

K-fold cross-validation

>> from sklearn.model_selection import cross_val_score

>> scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring = 'neg_mean_squared_error', cv= 10)

>> rmse_scores = np.sqrt(-scores)


Notes:

(2020.04.08)

1 The l2 norm of a vector x: the square root of the sum of the squared elements.

Correspondingly, the l1 norm is the sum of the absolute values of the elements.

The l0 "norm" is the number of non-zero elements in a vector. The l1/l2 norm can also be read as the distance from x to the origin.
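A small check with NumPy (x is an arbitrary example vector):

```
import numpy as np

x = np.array([3.0, -4.0, 0.0])
l1 = np.linalg.norm(x, ord=1)   # 7.0: sum of absolute values
l2 = np.linalg.norm(x)          # 5.0: square root of the sum of squares
l0 = np.count_nonzero(x)        # 2: number of non-zero elements
```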

References:

1 A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow

2 F. Chollet, Deep Learning with Python (Chinese edition, translated by Zhang Liang)

3 https://zhuanlan.zhihu.com/p/38224147
