An Example: Electricity Prediction
Problem: Predict the peak power consumption in summer.
| Date | High Temperature (°F) | Peak Demand (GW) |
|---|---|---|
| 2011-06-01 | 84.0 | 2.651 |
| 2011-06-02 | 73.0 | 2.081 |
| 2011-06-03 | 75.2 | 1.844 |
| … | … | … |
Feature Engineering:
How do we map from $\boldsymbol{x}$ to $y$?
Exploratory Data Analysis (EDA) helps with model selection:
![Peak demand vs. high temperature](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/image-20221213192518896.png)
Why different peak demands for the same high temperature?
Using only one feature for prediction is inaccurate.
Training data: $(\boldsymbol{x}_1,y_1),(\boldsymbol{x}_2,y_2),\ldots,(\boldsymbol{x}_n,y_n)$
Add a constant dimension to the feature vector $\boldsymbol{x}$:

$$\boldsymbol{x}=\left(\begin{array}{c} 1 \\ x_1 \\ \vdots \\ x_d \end{array}\right) \in \mathbb{R}^{d+1}$$
Prediction of the hypothesis $h$ parametrized by $\boldsymbol{\theta}$:

$$h_{\boldsymbol{\theta}}(\boldsymbol{x})=\theta_0+\sum_{j=1}^d \theta_j x_j=\sum_{j=0}^d \theta_j x_j=\boldsymbol{\theta} \cdot \boldsymbol{x}$$
In Machine Learning, vectors are often represented as column vectors, which are 2D arrays with a single column.
If $\boldsymbol{\theta}$ and $\boldsymbol{x}$ are column vectors, then the prediction is $\hat y = \boldsymbol{\theta}^T\boldsymbol{x}$.
Use the squared loss (L2 loss) to compute the error on the training set:

$$\hat{\epsilon}(h)=\sum_{i=1}^n\left(h\left(\boldsymbol{x}_i\right)-y_i\right)^2$$
Or the Mean Squared Error (MSE):

$$\operatorname{MSE}\left(\mathbf{X}, h_{\boldsymbol{\theta}}\right)=\frac{1}{m} \sum_{i=1}^m\left(\boldsymbol{\theta}^T \mathbf{x}_{i}-y_{i}\right)^2$$
To simplify notation, we write $\text{MSE}(\boldsymbol{\theta})$ instead of $\text{MSE}(\mathbf{X},\boldsymbol{\theta})$.
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/image-20221213194217778.png)
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/image-20221213200243725.png)
Optimization
$$\min_{\boldsymbol{\theta}} \sum_{i=1}^n\left(h_{\boldsymbol{\theta}}\left(\boldsymbol{x}_i\right)-y_i\right)^2$$
Training error of linear regression:

$$\begin{aligned} \hat{\epsilon}(\boldsymbol{\theta}) & =\sum_{i=1}^n\left(\boldsymbol{\theta}^T \boldsymbol{x}_i-y_i\right)^2 \\ & =\|\boldsymbol{X} \boldsymbol{\theta}-\boldsymbol{y}\|^2 \end{aligned}$$
Put everything in matrix form, with the $n\times(d+1)$ data matrix (design matrix) $\boldsymbol{X}$ and the length-$n$ label vector $\boldsymbol{y}$:

$$\boldsymbol{X}=\left[\begin{array}{c}-\boldsymbol{x}_1^T- \\ -\boldsymbol{x}_2^T- \\ \vdots \\ -\boldsymbol{x}_n^T-\end{array}\right], \quad \boldsymbol{y}=\left[\begin{array}{c}y_1 \\ y_2 \\ \vdots \\ y_n\end{array}\right]$$
Computing the gradient with respect to $\boldsymbol{\theta}$ and setting it to zero yields the optimal parameter $\boldsymbol{\theta}^\ast$:

$$\begin{aligned} \hat{\epsilon}(\boldsymbol{\theta}) & =\|\boldsymbol{X} \boldsymbol{\theta}-\boldsymbol{y}\|^2 \\ \nabla_{\boldsymbol{\theta}} \hat{\epsilon}(\boldsymbol{\theta}) & =2 \boldsymbol{X}^T(\boldsymbol{X} \boldsymbol{\theta}-\boldsymbol{y})=\mathbf{0} \\ & \Rightarrow \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\theta}=\boldsymbol{X}^T \boldsymbol{y} \\ & \Rightarrow \boldsymbol{\theta}^\ast=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{aligned}$$
Let’s generate some linear-looking data to test the equation $\boldsymbol{\theta}=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y}$:
```python
import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
```

```python
>>> theta_best
array([[4.21509616],
       [2.77011339]])
```
We would have hoped for $\theta_0 = 4$ and $\theta_1 = 3$ instead of $\theta_0 = 4.215$ and $\theta_1 = 2.770$. Close enough, but the noise made it impossible to recover the exact parameters of the original function.
Now you can make predictions using $\hat{\boldsymbol{\theta}}$:
```python
>>> X_new = np.array([[0], [2]])
>>> X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
>>> y_predict = X_new_b.dot(theta_best)
>>> y_predict
array([[4.21509616],
       [9.75532293]])
```
Plot this model’s predictions:

```python
import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
```
Performing linear regression using Scikit-Learn is quite simple:
```python
>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> lin_reg.predict(X_new)
array([[4.21509616],
       [9.75532293]])
```
Now we will look at very different ways to train a Linear Regression model, better
suited for cases where there are a large number of features, or too many training
instances to fit in memory.
For a general differentiable loss function, use Gradient Descent (GD).
Concretely, you start by filling $\boldsymbol{\theta}$ with random values (random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function, until the algorithm converges to a minimum.
The gradient points in the direction in which the function increases fastest:

$$\begin{gathered} \boldsymbol{g}=\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) \\ g_j=\nabla_{w_j} J(\boldsymbol{w}) \end{gathered}$$
![Gradient descent](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/tiduxiajiang-1.png)
Progress:
First-order Taylor approximation of the objective function:
$$J(\boldsymbol{w})=J\left(\boldsymbol{w}_0\right)+\left(\boldsymbol{w}-\boldsymbol{w}_0\right)^T \boldsymbol{g}+\cdots$$
Take a step along the negative gradient with a small learning rate $\eta$:

$$J(\boldsymbol{w}-\eta \boldsymbol{g}) \approx J(\boldsymbol{w})-\eta \boldsymbol{g}^T \boldsymbol{g}$$

Since $\boldsymbol{g}^T \boldsymbol{g} \geq 0$, this step decreases the objective.
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.
Finally, not all cost functions are convex. There may be holes, ridges, plateaus, and all sorts of irregular terrain, making convergence to the minimum very difficult.
The figure below shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function.
In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if
the features have very different scales.
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
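For instance, a minimal sketch (assuming the `X` defined above) of standardizing features with Scikit-Learn’s `StandardScaler` before running Gradient Descent:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature
# Gradient Descent on X_scaled converges faster when the original
# features had very different scales.
```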
Partial derivatives of the cost function:

$$\frac{\partial}{\partial \theta_j} \operatorname{MSE}(\boldsymbol{\theta})=\frac{2}{m} \sum_{i=1}^m\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}-y^{(i)}\right) x_j^{(i)}$$
Gradient vector of the cost function:

$$\nabla_{\boldsymbol{\theta}} \operatorname{MSE}(\boldsymbol{\theta})=\left(\begin{array}{c}\dfrac{\partial}{\partial \theta_0} \operatorname{MSE}(\boldsymbol{\theta}) \\ \dfrac{\partial}{\partial \theta_1} \operatorname{MSE}(\boldsymbol{\theta}) \\ \vdots \\ \dfrac{\partial}{\partial \theta_n} \operatorname{MSE}(\boldsymbol{\theta})\end{array}\right)=\frac{2}{m} \mathbf{X}^T(\mathbf{X} \boldsymbol{\theta}-\mathbf{y})$$
Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill.
$$\boldsymbol{\theta}^{\text{(next step)}}=\boldsymbol{\theta}-\eta \nabla_{\boldsymbol{\theta}} \operatorname{MSE}(\boldsymbol{\theta})$$
A quick implementation of this algorithm:
```python
eta = 0.1           # learning rate
n_iterations = 1000
m = 100             # number of training instances (X_b has 100 rows)

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```

```python
>>> theta
array([[4.21509616],
       [2.77011339]])
```
The first 10 steps of Gradient Descent using three different learning rates.
How to set the number of iterations?
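One practical approach (a sketch, not from these notes): set a generous iteration budget and stop early once the gradient norm falls below a small tolerance, since a tiny gradient means we are very close to the minimum.

```python
eta = 0.1
tolerance = 1e-6          # hypothetical stopping threshold
max_iterations = 100_000  # generous upper bound

theta = np.random.randn(2, 1)
for iteration in range(max_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < tolerance:
        break             # gradient is tiny: close enough to the minimum
    theta = theta - eta * gradients
```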
Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance.
Due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average.
The final parameter values are good, but not optimal. When the cost function is very irregular, this randomness can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
One solution to this dilemma is to gradually reduce the learning rate (akin to simulated annealing).
When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID), to ensure that the parameters get pulled towards the global optimum, on average.
A simple way to ensure this is to shuffle the instances during training.
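A minimal SGD sketch with a simple learning schedule (reusing `X_b`, `y`, and `m` from above; the schedule constants `t0` and `t1` are illustrative choices):

```python
n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters (illustrative values)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)          # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient from a single instance
        eta = learning_schedule(epoch * m + i)        # learning rate shrinks over time
        theta = theta - eta * gradients
```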
To perform Linear Regression using SGD with Scikit-Learn, use the `SGDRegressor` class, which defaults to optimizing the squared error cost function.
```python
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
```

```python
>>> sgd_reg.intercept_, sgd_reg.coef_
(array([4.24365286]), array([2.8250878]))
```
Mini-batch Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches, as sketched below.
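A minimal mini-batch GD sketch (reusing `X_b`, `y`, and `m`; the batch size and schedule constants are illustrative choices):

```python
n_epochs = 50
minibatch_size = 20
t0, t1 = 200, 1000  # illustrative learning-schedule constants

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
t = 0
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)   # shuffle once per epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i + minibatch_size]    # one mini-batch of instances
        yi = y_shuffled[i:i + minibatch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
```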
Comparison of algorithms for Linear Regression:
An Example: Electricity Prediction
Problem: Predict the peak power consumption in all months.
Exploratory Data Analysis (EDA): peak demand vs. temperature plot
![Peak demand vs. temperature (all months)](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/image-20221214212320372.png)
Can we use linear regression again?
First, generate some nonlinear data based on a simple quadratic equation:

```python
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
```
Let’s use Scikit-Learn’s `PolynomialFeatures` class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as a new feature:
```python
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))
```
Not bad: the model estimates $\hat y = 0.56x_1^2+0.93x_1+1.78$ when in fact the original function was $y = 0.5x_1^2+1.0x_1+2.0+\text{Gaussian noise}$.
Feature map:

$$\boldsymbol{x}=\left(x_1, \ldots, x_d\right) \in \mathcal{X} \stackrel{\Phi}{\longrightarrow} \boldsymbol{z}=\left(z_1, \ldots, z_{\tilde{d}}\right) \in \mathcal{Z}, \qquad \boldsymbol{z}=\Phi(\boldsymbol{x})$$

Each $z_j=\phi_j(\boldsymbol{x})$ depends on some nonlinear transform $\phi_j(\boldsymbol{x})$. The functions $\{\phi_j(\boldsymbol{x})\}_{1\leq j\leq \tilde{d}}$ are called basis functions.
Polynomial basis functions
1-D input: $\boldsymbol{z'}=(1,\,x_1,\,x_1^2,\,x_1^3)$

![Polynomial basis functions](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/Figure1.4c.png)

2-D input: $\boldsymbol{z'}=(1,\,x_2,\,x_1x_2,\,x_1^2)$
Radial basis functions (RBF)
$$\phi_j(\boldsymbol{x})=\exp \left(-\frac{\left\|\boldsymbol{x}-\boldsymbol{\mu}_j\right\|_2^2}{2 \sigma^2}\right), \quad j=1, \ldots, \tilde{d}$$
The final hypothesis is linear in the feature space $\mathcal{Z}$ but nonlinear in the input space $\mathcal{X}$:

$$h(\boldsymbol{x})=\widetilde{\boldsymbol{\theta}} \cdot \boldsymbol{z}=\widetilde{\boldsymbol{\theta}} \cdot \Phi(\boldsymbol{x})=\sum_{j=1}^{\tilde{d}} \widetilde{\theta}_j \phi_j(\boldsymbol{x})$$
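A minimal sketch of this idea with Gaussian (RBF) features, reusing the quadratic `X` and `y` generated above; the centers $\boldsymbol{\mu}_j$ and width $\sigma$ are illustrative choices, not values from these notes:

```python
from sklearn.linear_model import LinearRegression

def rbf_features(X, centers, sigma=1.0):
    """Map each input x to (phi_1(x), ..., phi_d~(x)) using Gaussian basis functions."""
    # squared distances between every x_i and every center mu_j
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma**2))

centers = np.linspace(-3, 3, 10).reshape(-1, 1)  # 10 evenly spaced centers mu_j (assumed)
Z = rbf_features(X, centers, sigma=0.5)          # z = Phi(x), shape (m, 10)

rbf_reg = LinearRegression().fit(Z, y)           # h is linear in Z, nonlinear in x
```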
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/image-20221214221245638.png)
The figure below applies a 300-degree polynomial model to the preceding training data and compares the result with a pure linear model and a quadratic model (2nd-degree polynomial). Notice how the 300-degree polynomial model wiggles around to get as close as possible to the training instances.
How can we tell whether a model is overfitting or underfitting the data?
- cross-validation
- learning curves
```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend()
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
```
These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high.
If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.
Now let’s look at the learning curves of a 10th-degree polynomial model on the same data.
```python
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])

plot_learning_curves(polynomial_regression, X, y)
```
The error on the training data is much lower than with the Linear Regression model.
There is a gap between the curves.
This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
However, if you used a much larger training set, the two curves would continue to get closer.
One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.
Overfitting
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/Figure1.4c.png)
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/Figure1.4d.png)
![](https://typora-1309501826.cos.ap-nanjing.myqcloud.com//%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/Figure1.5.png)
http://gaolei786.github.io/statistics/prml1.html
How can we control the model complexity?
Solution: control the norm of $\boldsymbol{\theta}$!
A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data.
For example, a simple way to regularize a polynomial model is to reduce the polynomial degree.
L2-regularized linear regression (ridge regression) adds an L2-norm penalty weighted by a hyperparameter $\lambda > 0$. This forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible:

$$\widehat{\boldsymbol{w}}=\underset{\boldsymbol{w} \in \mathbb{R}^d}{\arg \min } \sum_{i=1}^n\left(\boldsymbol{w}^T \boldsymbol{x}_i-y_i\right)^2+\lambda\|\boldsymbol{w}\|_2^2$$
The hyperparameter $\lambda$ controls how much you want to regularize the model.
Compute the gradient and set it to zero:

$$\begin{aligned} \nabla_{\boldsymbol{w}} \hat{\epsilon}(\boldsymbol{w}) & =2 \boldsymbol{X}^T(\boldsymbol{X} \boldsymbol{w}-\boldsymbol{y})+2 \lambda \boldsymbol{w}=\mathbf{0} \\ & \Rightarrow\left(\boldsymbol{X}^T \boldsymbol{X}+\lambda \boldsymbol{I}\right) \boldsymbol{w}=\boldsymbol{X}^T \boldsymbol{y} \\ & \Rightarrow \boldsymbol{w}=\left(\boldsymbol{X}^T \boldsymbol{X}+\lambda \boldsymbol{I}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{aligned}$$
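A minimal NumPy sketch of this closed-form solution (assuming `X_b` is a design matrix with a bias column and `y` the target vector; `lam` is an illustrative value, and in practice the bias weight is usually left unregularized):

```python
lam = 1.0  # regularization strength lambda (illustrative value)
I = np.eye(X_b.shape[1])
# w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.inv(X_b.T.dot(X_b) + lam * I).dot(X_b.T).dot(y)
```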
Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution.
```python
>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[1.55071465]])
```
Or using Gradient Descent (GD):
$$\boldsymbol{w}^{t+1} \leftarrow \boldsymbol{w}^t-\left.\eta \nabla_{\boldsymbol{w}} \hat{\epsilon}(\boldsymbol{w})\right|_{\boldsymbol{w}=\boldsymbol{w}^t}$$
```python
>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([1.47012588])
```
The figure below shows several Ridge models trained on some linear data using different $\alpha$ values. On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded using `PolynomialFeatures(degree=10)`, then scaled using a `StandardScaler`, and finally the Ridge models are applied to the resulting features.

L1-regularized linear regression (lasso regression) adds an L1-norm penalty weighted by a hyperparameter $\lambda \geq 0$:
$$\widehat{\boldsymbol{w}}=\underset{\boldsymbol{w} \in \mathbb{R}^d}{\arg \min } \sum_{i=1}^n\left(\boldsymbol{w}^T \boldsymbol{x}_i-y_i\right)^2+\lambda\|\boldsymbol{w}\|_1$$
An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero).
For example, the dashed line in the right-hand plot (with $\alpha = 10^{-7}$) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero.
In other words, Lasso Regression automatically performs feature selection and outputs a
sparse model (i.e., with few nonzero feature weights).
The Lasso cost function is not differentiable at $w_i = 0$ (for $i = 1, 2, \ldots, n$), but Gradient Descent still works fine if you use a subgradient vector $\boldsymbol{g}$ instead whenever any $w_i = 0$.
$$g(\boldsymbol{w}, J)=\nabla_{\boldsymbol{w}} \operatorname{MSE}(\boldsymbol{w})+\alpha\left(\begin{array}{c}\operatorname{sign}\left(w_1\right) \\ \operatorname{sign}\left(w_2\right) \\ \vdots \\ \operatorname{sign}\left(w_n\right)\end{array}\right), \quad \text{where } \operatorname{sign}\left(w_i\right)=\begin{cases}-1 & \text{if } w_i<0 \\ 0 & \text{if } w_i=0 \\ +1 & \text{if } w_i>0\end{cases}$$
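A minimal subgradient-descent sketch of this update (reusing `X_b`, `y`, and `m`; `alpha` and `eta` are illustrative values):

```python
alpha = 0.1    # L1 strength (illustrative)
eta = 0.01     # learning rate (illustrative)
n_iterations = 1000

w = np.random.randn(X_b.shape[1], 1)
for iteration in range(n_iterations):
    mse_grad = 2/m * X_b.T.dot(X_b.dot(w) - y)
    subgrad = mse_grad + alpha * np.sign(w)  # np.sign returns -1, 0, or +1 per weight
    w = w - eta * subgrad
```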
Here is a small Scikit-Learn example using the `Lasso` class. Note that you could instead use `SGDRegressor(penalty="l1")`.
```python
>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
>>> lasso_reg.predict([[1.5]])
array([1.53788174])
```
It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression.
Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed.
It does not make sense for $h_{\boldsymbol{w}}$ to take values larger than 1 or smaller than 0 when we know that $y\in\{0,1\}$.
We use accuracy (0-1 loss) when validating a classification model:

$$\ell(y,h(\boldsymbol{x}))=\mathbf{1}[y \neq h(\boldsymbol{x})]$$
Assuming $h(\boldsymbol{x}) \in \mathcal{Y}$ (a hard label output) is not appropriate this time.
Instead of outputting the result directly the way the Linear Regression model does, the Logistic Regression model outputs the logistic of this result:

$$\hat p=h_{\boldsymbol{w}}(\boldsymbol{x})=\sigma(\boldsymbol{x}^T\boldsymbol{w})$$
The logistic—noted σ(·)—is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1.
$$\sigma(t)=\dfrac{1}{1+e^{-t}}$$
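A one-line NumPy sketch of this function:

```python
def sigmoid(t):
    """Logistic (sigmoid) function: maps any real t to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# sigmoid(0.0) == 0.5; sigmoid(t) -> 1 as t -> +inf and -> 0 as t -> -inf
```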
How about using the squared loss now, since $h(\boldsymbol{x}) \in [0,1]$ and $y\in\{0,1\}$?

$$\ell(h(\boldsymbol{x}),y)= \sum_{i=1}^n\left(\sigma\left(\boldsymbol{w}^T\boldsymbol{x}_i\right)-y_i\right)^2+\lambda\Omega(\boldsymbol{w})$$
This tries to match a continuous probability with discrete 0/1 labels, which is not a natural fit.
Under a proper distributional assumption, assume the conditional probability

$$P(y=1 \mid \boldsymbol{x}, \boldsymbol{w})=\sigma\left(\boldsymbol{w}^T \boldsymbol{x}\right)=\frac{1}{1+\exp \left(-\boldsymbol{w}^T \boldsymbol{x}\right)}$$
Maximum Likelihood Estimation (MLE):

$$\max_{\boldsymbol{w}} \prod_{i=1}^n \prod_{c=0}^1 P\left(y_i=c \mid \boldsymbol{x}_i, \boldsymbol{w}\right)^{\mathbf{1}\left\{y_i=c\right\}}$$
Negative average log-likelihood:

$$-\frac{1}{n} \sum_{i=1}^n \sum_{c=0}^1 \mathbf{1}\left\{y_i=c\right\} \cdot \log P\left(y_i=c \mid \boldsymbol{x}_i\right)$$
Plugging the conditional probability in:

$$\begin{aligned}& -\frac{1}{n} \sum_{i=1}^n \sum_{c=0}^1 \mathbf{1}\left\{y_i=c\right\} \cdot \log P\left(y_i=c \mid \boldsymbol{x}_i\right) \\ =\; & -\frac{1}{n} \sum_{i=1}^n\left\{y_i \log P\left(y_i=1 \mid \boldsymbol{x}_i\right)+\left(1-y_i\right) \log P\left(y_i=0 \mid \boldsymbol{x}_i\right)\right\} \\ =\; & -\frac{1}{n} \sum_{i=1}^n\left\{y_i \log \sigma\left(\boldsymbol{w}^T \boldsymbol{x}_i\right)+\left(1-y_i\right) \log \left[1-\sigma\left(\boldsymbol{w}^T \boldsymbol{x}_i\right)\right]\right\}\end{aligned}$$
The per-sample loss is

$$\ell\left(h\left(\boldsymbol{x}_i\right), y_i\right)= \begin{cases}-\log \left[\sigma\left(\boldsymbol{w}^T \boldsymbol{x}_i\right)\right] & y_i=1 \\ -\log \left[1-\sigma\left(\boldsymbol{w}^T \boldsymbol{x}_i\right)\right] & y_i=0\end{cases}$$

Check: if $y_i=1$, the loss is small when $\sigma(\boldsymbol{w}^T\boldsymbol{x}_i)$ is close to 1 and grows without bound as it approaches 0, which is exactly the behavior we want.
Logistic Regression cost function (log loss):

$$\hat{\epsilon}(\boldsymbol{w})=-\sum_{i=1}^n\left\{y_i \log h_{\boldsymbol{w}}(\boldsymbol{x}_i)+\left(1-y_i\right) \log \left[1-h_{\boldsymbol{w}}(\boldsymbol{x}_i)\right]\right\}$$
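A minimal gradient-descent sketch for this cost (not code from these notes), assuming `X_b` is a design matrix with a bias column, `y` is a column vector of 0/1 labels, and `sigmoid` is defined as above; the gradient of the averaged log loss is $\frac{1}{m}\boldsymbol{X}^T(\sigma(\boldsymbol{X}\boldsymbol{w})-\boldsymbol{y})$:

```python
eta = 0.1
n_iterations = 1000

m = len(X_b)
w = np.zeros((X_b.shape[1], 1))      # start from zero weights
for iteration in range(n_iterations):
    p = sigmoid(X_b.dot(w))          # predicted probabilities P(y=1 | x)
    gradient = X_b.T.dot(p - y) / m  # gradient of the averaged log loss
    w = w - eta * gradient
```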
Logistic Regression: Regularization
A classification dataset is said to be linearly separable if there exists a hyperplane that separates the two classes.
If the data is linearly separable (possibly after applying nonlinear features), Logistic Regression requires regularization.
Regularization prevents the weights from diverging on linearly separable data:
$$\hat{\epsilon}(\boldsymbol{w})=-\sum_{i=1}^n\left\{y_i \log h_{\boldsymbol{w}}(\boldsymbol{x}_i)+\left(1-y_i\right) \log \left[1-h_{\boldsymbol{w}}(\boldsymbol{x}_i)\right]\right\}+\lambda \sum_{j=1}^d w_j^2$$
As with regression, L1 and L2 penalties can be used.
Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature. First let’s load the data:
```python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> list(iris.keys())
['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
>>> X = iris["data"][:, 3:]  # petal width
>>> y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
```
Now let’s train a Logistic Regression model:
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)
```
Let’s look at the model’s estimated probabilities for flowers with petal widths varying
from 0 to 3 cm:
```python
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
```
Above about 2 cm the classifier is highly confident that the flower is an Iris-Virginica (it outputs a high probability for that class), while below 1 cm it is highly confident that it is not an Iris-Virginica (high probability for the "Not Iris-Virginica" class). In between these extremes, the classifier is unsure.
The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression.
$$p(y=c \mid \boldsymbol{x} ; \boldsymbol{W})=\frac{\exp \left(\boldsymbol{w}_c^T \boldsymbol{x}\right)}{\sum_{r=1}^C \exp \left(\boldsymbol{w}_r^T \boldsymbol{x}\right)}$$
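A small NumPy sketch of these softmax probabilities (`W` is an assumed $C \times (d+1)$ weight matrix with one row $\boldsymbol{w}_c$ per class):

```python
def softmax_proba(X, W):
    """Return an (n, C) matrix of class probabilities for inputs X (n, d+1) and weights W (C, d+1)."""
    scores = X.dot(W.T)                          # s_c(x) = w_c^T x for every class c
    scores -= scores.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```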
Categorical distribution assumption
Probability mass function:
$$P(y=c \mid \boldsymbol{q})=q_c, \qquad \sum_{c=1}^C q_c=1$$
Probability of rolling $m$ dice with results $\{y_1,\ldots,y_m\}$:

$$\prod_{i=1}^m \prod_{c=1}^C P\left(y_i=c \mid q_{i c}\right)^{\mathbf{1}\left\{y_i=c\right\}}$$
Softmax regression assumes that, given the parameter $\boldsymbol{W}$:

$$q_{i c}=P\left(y_i=c \mid \boldsymbol{x}_i ; \boldsymbol{W}\right)$$
The likelihood of $\boldsymbol{W}$ given all samples $\mathcal{D}=\left\{\left(\boldsymbol{x}_i, y_i\right)\right\}_{i=1}^m$:

$$\mathcal{L}(\boldsymbol{W} ; \mathcal{D})=\prod_{i=1}^m \prod_{c=1}^C P\left(y_i=c \mid \boldsymbol{x}_i ; \boldsymbol{W}\right)^{\mathbf{1}\left\{y_i=c\right\}}$$
Maximum likelihood estimation:

$$\max_{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_C} \mathcal{L}(\boldsymbol{W} ; \mathcal{D})=\max_{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_C} \prod_{i=1}^m \prod_{c=1}^C P\left(y_i=c \mid \boldsymbol{x}_i ; \boldsymbol{W}\right)^{\mathbf{1}\left\{y_i=c\right\}}$$
Equivalently, we minimize the negative log-likelihood (the cross-entropy loss):

$$\begin{aligned}\min_{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_C} & -\frac{1}{m} \sum_{i=1}^m \sum_{c=1}^C \mathbf{1}\left\{y_i=c\right\} \cdot \log P\left(y_i=c \mid \boldsymbol{x}_i ; \boldsymbol{W}\right) \\ =\min_{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_C} & -\frac{1}{m} \sum_{i=1}^m \sum_{c=1}^C \mathbf{1}\left\{y_i=c\right\} \cdot \log \frac{\exp \left(\boldsymbol{w}_c^T \boldsymbol{x}_i+b_c\right)}{\sum_{r=1}^C \exp \left(\boldsymbol{w}_r^T \boldsymbol{x}_i+b_r\right)}\end{aligned}$$
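A short sketch of this cross-entropy loss, reusing the `softmax_proba` helper above and assuming the biases are folded into `W` as an extra column; `y` is an integer label array:

```python
def cross_entropy_loss(X, y, W):
    """Average negative log-likelihood for integer labels y in {0, ..., C-1}."""
    m = len(y)
    proba = softmax_proba(X, W)                    # shape (m, C)
    correct_class_proba = proba[np.arange(m), y]   # P(y_i = c_i | x_i; W) for each sample
    return -np.mean(np.log(correct_class_proba + 1e-12))  # small epsilon avoids log(0)
```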
Let’s use Softmax Regression to classify the iris flowers into all three classes. Scikit-Learn’s `LogisticRegression` uses one-versus-all by default when you train it on more than two classes, but you can set the `multi_class` hyperparameter to `"multinomial"` to switch it to Softmax Regression instead. You must also specify a solver that supports Softmax Regression, such as `"lbfgs"` (see Scikit-Learn’s documentation for details). It also applies $\ell_2$ regularization by default, which you can control with the hyperparameter `C`.
```python
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
```

```python
>>> softmax_reg.predict([[5, 2]])
array([2])
>>> softmax_reg.predict_proba([[5, 2]])
array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])
```