An Example: spam filter
Learn to flag spam given examples of spam emails and examples of regular emails.
Use traditional programming techniques → A long list of complex rules—pretty hard to maintain
Use Machine Learning techniques → A spam filter automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples.
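To make "unusually frequent in spam compared to ham" concrete, here is a minimal sketch that scores words by how much more often they appear in spam than in ham. The two tiny corpora, the `word_rates` helper, and the 0.1 smoothing constant are all made up for illustration; a real filter would use far more data and a proper classifier.

```python
from collections import Counter

# Tiny made-up corpora; a real filter would learn from thousands of emails.
spam = ["win free money now", "free credit card offer", "claim your free prize now"]
ham = ["meeting moved to monday", "are you free for lunch", "please review the report"]


def word_rates(emails):
    """Fraction of emails in which each word appears at least once."""
    counts = Counter(word for email in emails for word in set(email.split()))
    return {word: n / len(emails) for word, n in counts.items()}


spam_rate, ham_rate = word_rates(spam), word_rates(ham)

# Score each word by how much more often it shows up in spam than in ham.
# The 0.1 smoothing term (an arbitrary choice) avoids division by zero.
scores = {w: (spam_rate[w] + 0.1) / (ham_rate.get(w, 0.0) + 0.1) for w in spam_rate}
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

Note that "free" appears in every spam email here but still scores lower than "now", because it also appears in a ham email: the comparison with ham is what matters, not raw frequency.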
Machine Learning is great for:
Training to find a hypothesis……
The two core components of the learning problem:
A hypothesis space $\mathcal{H}$ is a set of functions that map $x \rightarrow y$.
We want hypothesis space that…
An example hypothesis space:
Loss function: $l: y \times y \rightarrow \mathbb{R}_+$ measures the difference between $h(x)$ and $y$
$$\begin{aligned} l(y,h(x)) &= (y-h(x))^2 && \text{(Regression)} \\ l(y,h(x)) &= \mathbf{1}[y \neq h(x)] && \text{(Classification)} \end{aligned}$$
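The two losses translate directly into code. The function names below are illustrative choices, not taken from any library.

```python
def squared_loss(y, prediction):
    """Regression loss: (y - h(x))^2."""
    return (y - prediction) ** 2


def zero_one_loss(y, prediction):
    """Classification loss: 1 if the prediction is wrong, else 0."""
    return 1 if y != prediction else 0


print(squared_loss(3.0, 2.5))         # 0.25
print(zero_one_loss("spam", "ham"))   # 1
print(zero_one_loss("spam", "spam"))  # 0
```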
The canonical training procedure of machine learning:
Error of training:
$$\hat\epsilon(h)=\min\sum_{i=1}^m l(h_\theta(x_i),y_i)$$
Virtually every machine learning algorithm has this form, just specify
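As one concrete instance of this training procedure, the sketch below picks a linear hypothesis $h_\theta(x)=\theta_0+\theta_1 x$, the squared loss, and plain gradient descent as the optimizer. The toy data, the learning rate, and the number of steps are assumptions chosen for illustration, not values from the notes.

```python
# Toy data generated from y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def h(theta, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta[0] + theta[1] * x


def training_error(theta):
    """Sum of squared losses over the training set."""
    return sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys))


theta = [0.0, 0.0]
learning_rate = 0.01  # small enough for this data; an arbitrary choice
for _ in range(2000):
    residuals = [h(theta, x) - y for x, y in zip(xs, ys)]
    # Gradient of the summed squared loss with respect to theta0 and theta1.
    grad0 = 2 * sum(residuals)
    grad1 = 2 * sum(r * x for r, x in zip(residuals, xs))
    theta = [theta[0] - learning_rate * grad0, theta[1] - learning_rate * grad1]

print(theta, training_error(theta))
```

On this noiseless data the parameters approach $\theta_0=1$ and $\theta_1=2$. Swapping in a different hypothesis, loss, or optimizer gives a different learning algorithm with the same overall shape.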
Supervised learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.
classification
regression
Here are some of the most important supervised learning algorithms:
Unsupervised learning
In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher.
Here are some of the most important unsupervised learning algorithms:
Semisupervised learning
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
An Example: Google Photo
Reinforcement Learning
Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return.
Microsoft researchers have shown that, given enough data, radically different machine learning algorithms (including fairly simple ones) perform almost identically well on the complex problem of natural language disambiguation.
As the authors put it: “these results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.”
Small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don’t abandon algorithms.
For example, when a line is fitted to life satisfaction against GDP per capita, the line obtained with some countries missing differs from the line obtained once the complete data is added.
If the training data is full of errors, outliers, and noise, it will be harder for the system to detect the underlying patterns, so the system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data.
A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
The below figure shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, we cannot trust its predictions.
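The same effect can be reproduced on synthetic data. The sketch below is not the life satisfaction dataset: it assumes a made-up linear trend with Gaussian noise and compares a degree-1 fit with a degree-9 fit using NumPy's `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus noise (not the real GDP data).
x_train = np.linspace(0.0, 1.0, 12)
y_train = 5.0 + 2.0 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_new = np.linspace(0.02, 0.98, 50)
y_new = 5.0 + 2.0 * x_new + rng.normal(scale=0.3, size=x_new.size)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


linear = Polynomial.fit(x_train, y_train, deg=1)
wiggly = Polynomial.fit(x_train, y_train, deg=9)

for name, model in [("degree 1", linear), ("degree 9", wiggly)]:
    print(name, "train:", mse(model, x_train, y_train), "new:", mse(model, x_new, y_new))
```

The degree-9 model always has training error at most that of the line, since the line is a special case of it. Its error on the new points is typically worse, which is the signature of overfitting.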
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
For example, the linear model we defined earlier has two parameters, $\theta_0$ and $\theta_1$.
In the same GDP example as above, we again fit the model on the data with some countries missing, but this time with regularization, which gives the solid blue line.
We can see that regularization forced the model to have a smaller slope. The model fits the training data a bit less well, but it generalizes better to new examples.
The amount of regularization to apply during learning can be controlled by a hyperparameter.
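A minimal sketch of how such a hyperparameter acts on the two-parameter linear model, assuming an L2 penalty `alpha * theta1**2` on the slope only (one common choice; the notes do not say which penalty was used). With the intercept unpenalized, the minimizer has a closed form. The data points and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys, alpha=0.0):
    """Least-squares line with an L2 penalty alpha * theta1**2 on the slope only."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)  # a larger alpha shrinks the slope toward 0
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Made-up points, not the GDP data from the text.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.9]

for alpha in (0.0, 1.0, 10.0, 1000.0):
    theta0, theta1 = fit_line(xs, ys, alpha)
    print(f"alpha={alpha:>6}: theta0={theta0:.3f}, theta1={theta1:.3f}")
```

With `alpha=0` this is ordinary least squares. As `alpha` grows the slope shrinks, and for very large values the line becomes almost flat at the mean of `ys`.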
Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.
The main options to fix this problem are:
The only way to know how well a model will generalize to new cases is to actually try it out on new cases.
A better option is to split your data into two sets: the training set and the test set.
Overestimates the test performance ("lucky" model)
This is the most common but wrong practice of machine learning.
⛳It is common to use 80% of the data for training and hold out 20% for testing.
Reserve some data for validation
Random sampling
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split.
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
| | Name | Type and description |
| --- | --- | --- |
| Parameters | *arrays | sequence of indexables with same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes. |
| | test_size | float or int, default=None. If float, between 0.0 and 1.0. |
| | train_size | float or int, default=None. If float, between 0.0 and 1.0. |
| | random_state | int, RandomState instance or None, default=None |
| Returns | splitting | list, length=2 * len(arrays) |
Examples
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
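To also reserve some data for validation, one simple approach is to call train_test_split twice: once to hold out the test set, and once more to carve a validation set out of the remainder. The sketch below assumes a 60/20/20 split on made-up arrays; the proportions are an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape((100, 2))
y = np.arange(100)

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Then hold out 25% of the remainder (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters are then tuned against the validation set, and the test set is touched only once at the end.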
Stratified sampling
The following code uses the pd.cut() function to create an income category attribute with 5 categories (labeled from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 from 1.5 to 3, and so on:
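import numpy as np
import pandas as pd

# Bin median income (in units of $10,000) into 5 categories labeled 1 to 5.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()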
Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn’s StratifiedShuffleSplit class:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# n_splits=1, so the loop body runs once and yields one pair of index arrays.
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Let’s see if this worked as expected. You can start by looking at the income category proportions in the test set:
>>> strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
Now you should remove the income_cat attribute so the data is back to its original state:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
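For a single split, a shorter route to a similar result is the stratify parameter of train_test_split shown in the signature earlier. The sketch below uses made-up category labels rather than the housing data, so the 80/20 class balance is an assumption chosen to make the proportions easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of category 1 and 20 of category 2.
categories = np.array([1] * 80 + [2] * 20)
X = np.arange(100).reshape((100, 1))

X_train, X_test, cat_train, cat_test = train_test_split(
    X, categories, test_size=0.2, stratify=categories, random_state=42)

# Both subsets keep the 80/20 category proportions of the full dataset.
print(np.mean(cat_train == 2), np.mean(cat_test == 2))  # 0.2 0.2
```

Without stratify, a purely random 20-sample test set could easily contain noticeably more or fewer than 4 samples of the rare category.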
What if we get an unfortunate split?
Combined Algorithm Selection and Hyperparameter (CASH) optimization problem
A set of algorithms $\mathcal{A}=\{A^{(1)},...,A^{(n)}\}$
Denote the hyperparameter space of algorithm $A^{(i)}$ as $\Lambda^{(i)}$
Denote the error of $A^{(i)}$ as $\mathcal{L}(A_\lambda^{(i)},D_{\rm train},D_{\rm valid})$ using $\lambda \in \Lambda^{(i)}$, trained on $D_{\rm train}$ and evaluated on $D_{\rm valid}$
The problem is to find the optimal algorithm and its hyperparameters:
$$A_{\lambda^*}^*=\underset{A^{(i)} \in \mathcal{A},\ \lambda \in \Lambda^{(i)}}{\operatorname{argmin}}\ \mathcal{L}\left(A_\lambda^{(i)}, D_{\rm train}, D_{\rm valid}\right)$$
$$\text{Complexity}=\sum_{A^{(i)} \in \mathcal{A}}\left|\Lambda^{(i)}\right| \cdot K \cdot O\left(A^{(i)}\right)$$
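A brute-force way to solve the CASH problem for small, finite hyperparameter spaces is an exhaustive search: train every (algorithm, hyperparameter) pair on $D_{\rm train}$, score it on $D_{\rm valid}$, and keep the argmin. The sketch below assumes two Scikit-Learn regressors, three hand-picked hyperparameter values each, synthetic data, and a single train/validation split; none of these choices come from the notes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # synthetic data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The algorithm set A, each paired with a finite hyperparameter space Lambda^(i).
search_space = [
    (Ridge, [{"alpha": a} for a in (0.01, 1.0, 100.0)]),
    (KNeighborsRegressor, [{"n_neighbors": k} for k in (1, 5, 15)]),
]

results = []
for algorithm, configs in search_space:
    for config in configs:
        # Train on D_train, evaluate the error L on D_valid.
        model = algorithm(**config).fit(X_train, y_train)
        error = mean_squared_error(y_valid, model.predict(X_valid))
        results.append((error, algorithm.__name__, config))

# argmin over every (algorithm, hyperparameter) pair
best_error, best_algorithm, best_config = min(results, key=lambda r: r[0])
print(best_algorithm, best_config, best_error)
print("models trained:", len(results))
```

The number of models trained is the sum of the hyperparameter space sizes, here 3 + 3 = 6, which matches the $\sum |\Lambda^{(i)}|$ part of the complexity expression. Repeating each evaluation several times would multiply that count accordingly.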