Machine Learning Study Notes (6): Gradient Descent

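Basics

(1) Gradient descent is not itself a machine learning algorithm.

(2) Gradient descent is a search-based optimization method.

(3) Purpose of gradient descent: minimize a loss function.

(4) Gradient ascent: maximize a utility function.

If the learning rate eta is too small, convergence is slow; if it is too large, the iteration may fail to converge.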
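To make the effect of eta concrete, here is a minimal sketch on a one-dimensional quadratic. The loss J(theta) = (theta - 2.5)**2 - 1 and the helper name run are assumptions chosen for illustration; they do not come from these notes.

```python
def dJ(theta):
    """Derivative of the assumed loss J(theta) = (theta - 2.5)**2 - 1."""
    return 2.0 * (theta - 2.5)


def run(eta, n_steps=50, theta=0.0):
    """Take n_steps plain gradient-descent steps and return the final theta."""
    for _ in range(n_steps):
        theta = theta - eta * dJ(theta)
    return theta


small = run(eta=0.001)  # moves toward the minimum at 2.5, but very slowly
good = run(eta=0.1)     # lands close to 2.5 within 50 steps
large = run(eta=1.1)    # every step overshoots further: the iteration diverges

print(small, good, large)
```

For this particular loss, each step multiplies the distance to the minimum by (1 - 2 * eta), so any eta above 1 makes the distance grow instead of shrink.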

(Figures: gradient descent; the effect of eta)

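Not every function has a unique extremum.

Solution: run the search several times from randomized initial points.

The initial point of gradient descent is therefore also a hyperparameter.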
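The random-restart idea can be sketched as follows. The non-convex loss J(theta) = theta**4 - 3*theta**2 + theta, the helper descend, and the choice of ten starts in [-2, 2] are all assumptions made for this example.

```python
import random


def J(theta):
    """Assumed non-convex loss with two local minima."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def dJ(theta):
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def descend(theta, eta=0.01, n_steps=1000):
    """Plain gradient descent from a given starting point."""
    for _ in range(n_steps):
        theta = theta - eta * dJ(theta)
    return theta


# A single run started at 2.0 slides into the nearer, shallower minimum.
stuck = descend(2.0)

# Several runs from random starting points; keep the one with the lowest loss.
rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = min((descend(s) for s in starts), key=J)

print(stuck, J(stuck))
print(best, J(best))
```

Which minimum a run reaches depends only on where it starts, which is why the initial point behaves like a hyperparameter.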

Principle
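As a compact statement of the rule that the code in these notes implements, the update step and the linear-regression loss can be written as below. The symbols $\theta$, $\eta$, $m$ and $X_b$ mirror the variable names theta, eta, len(X_b) and X_b; the notation itself is an assumption, not taken from the original.

```latex
\theta \leftarrow \theta - \eta \, \nabla J(\theta)

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl( y^{(i)} - X_b^{(i)} \theta \bigr)^2,
\qquad
\nabla J(\theta) = \frac{2}{m} \, X_b^{T} \bigl( X_b \theta - y \bigr)
```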

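Wrapped as functions:

import numpy as np
import matplotlib.pyplot as plt

# J, dJ, plot_x and the list theta_history are not shown in these notes;
# they must be defined before these functions are called.
def gradient_descent(initial_theta, eta, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    while True:
        gradient = dJ(theta)  # slope of the loss at the current theta
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)
        # stop once the loss barely changes between two consecutive steps
        if abs(J(theta) - J(last_theta)) < epsilon:
            break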

def plot_theta_history():
    plt.plot(plot_x, J(plot_x))
    # mark every visited theta in red on top of the loss curve
    plt.plot(np.array(theta_history), J(np.array(theta_history)), color="red", marker="+")
    plt.show()

Usage:

eta = 0.01
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

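Result:

With eta too large:

To avoid an infinite loop, cap the number of iterations with n_iters.

In general, eta = 0.01 is a common default; plotting the theta history shows whether eta has been set too large.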
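A minimal sketch of the capped version, assuming the one-dimensional loss J(theta) = (theta - 2.5)**2 - 1 (this loss is an illustration, not taken from these notes). Unlike the version above, it returns the history instead of appending to a global list.

```python
def J(theta):
    """Assumed loss, used only for this example."""
    return (theta - 2.5) ** 2 - 1.0


def dJ(theta):
    return 2.0 * (theta - 2.5)


def gradient_descent(initial_theta, eta, n_iters=10_000, epsilon=1e-8):
    """Gradient descent that gives up after n_iters steps."""
    theta = initial_theta
    history = [theta]
    for _ in range(n_iters):
        last_theta = theta
        theta = theta - eta * dJ(theta)
        history.append(theta)
        # converged: the loss barely changed between two steps
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
    return theta, history


theta_ok, hist_ok = gradient_descent(0.0, eta=0.01)               # stops by itself
theta_bad, hist_bad = gradient_descent(0.0, eta=1.1, n_iters=10)  # stopped by the cap

print(theta_ok, len(hist_ok))
print(theta_bad, len(hist_bad))
```

With eta = 1.1 the loss grows at every step, so the epsilon test never fires and only the cap ends the loop.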

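Gradient Descent for Multiple Linear Regression

Wrapped as functions:

(1) def J

def J(theta, X_b, y):
    # mean squared error; report an infinite loss if the computation fails
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except Exception:
        return float('inf')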

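(2) def dJ

def dJ(theta, X_b, y):
    res = np.empty(len(theta))
    # partial derivative with respect to the intercept term theta[0]
    res[0] = np.sum(X_b.dot(theta) - y)
    # partial derivatives with respect to the remaining coefficients
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)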
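The loop in dJ can be collapsed into a single matrix product, because the first column of X_b is all ones. The sketch below checks that both forms agree on random data; the names dJ_loop and dJ_vectorized and the synthetic data are assumptions for this example.

```python
import numpy as np


def dJ_loop(theta, X_b, y):
    """Component-by-component gradient, as in the notes."""
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)


def dJ_vectorized(theta, X_b, y):
    """Same gradient written as (2 / m) * X_b^T (X_b theta - y)."""
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_b = np.hstack([np.ones((len(X), 1)), X])  # prepend the intercept column
y = rng.normal(size=100)
theta = rng.normal(size=X_b.shape[1])

same = np.allclose(dJ_loop(theta, X_b, y), dJ_vectorized(theta, X_b, y))
print(same)
```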

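(3) gradient_descent

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    # n_iters caps the number of steps so the loop always terminates
    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        # stop once the loss barely changes between two consecutive steps
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        i_iter += 1
    return theta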
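As a hedged end-to-end sketch, the three functions can be run on synthetic data and compared with the least-squares solution from np.linalg.lstsq. The data (y = 3x + 4 plus noise), the seed and the compact vectorized gradient are assumptions made for this example.

```python
import numpy as np


def J(theta, X_b, y):
    return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)


def dJ(theta, X_b, y):
    # vectorized form of the gradient of the mean squared error
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)


def gradient_descent(X_b, y, initial_theta, eta, n_iters=10_000, epsilon=1e-8):
    theta = initial_theta
    for _ in range(n_iters):
        last_theta = theta
        theta = theta - eta * dJ(theta, X_b, y)
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
    return theta


# Synthetic data: intercept 4, slope 3, unit-variance noise (all assumed).
rng = np.random.default_rng(666)
x = 2.0 * rng.random(100)
y = 3.0 * x + 4.0 + rng.normal(size=100)

X_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1, 1)])
theta = gradient_descent(X_b, y, np.zeros(X_b.shape[1]), eta=0.01)

# Closed-form least-squares solution for comparison.
theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(theta, theta_exact)
```

theta[0] is the fitted intercept and theta[1] the fitted slope; both should sit close to the least-squares values.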

Before using gradient descent, it is best to normalize the data:

from sklearn.preprocessing import StandardScaler
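When features live on very different scales, a single eta tends to be too large for some coefficients and too small for others, which is why scaling helps. A minimal sketch of the usual fit/transform pattern, assuming made-up arrays X_train and X_test:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (synthetic, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 300.0], size=(80, 2))
X_test = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 300.0], size=(20, 2))

scaler = StandardScaler()
scaler.fit(X_train)                       # learn mean and std from the training set only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)     # reuse the training statistics on the test set

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Fitting the scaler on the training set alone keeps information from the test set out of the model.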

Stochastic Gradient Descent (the implementation above is Batch Gradient Descent)

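Implementation:

def dJ_sgd(theta, X_b_i, y_i):
    # gradient estimated from a single sample (X_b_i, y_i)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # the step size shrinks as the iteration count t grows
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(0, len(X_b))  # pick one sample at random
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta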

Usage:

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b)//3)
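A self-contained sketch of running stochastic gradient descent on synthetic data. The data (y = 4x + 3 plus noise), the seeds and the use of np.random.default_rng are assumptions for this example; t0 = 5 and t1 = 50 follow the notes.

```python
import numpy as np


def dJ_sgd(theta, X_b_i, y_i):
    """Gradient estimate from one sample."""
    return X_b_i * (X_b_i.dot(theta) - y_i) * 2.0


def sgd(X_b, y, initial_theta, n_iters, t0=5.0, t1=50.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = initial_theta
    for cur_iter in range(n_iters):
        eta = t0 / (cur_iter + t1)          # decaying learning rate
        rand_i = rng.integers(0, len(X_b))  # index of one random sample
        theta = theta - eta * dJ_sgd(theta, X_b[rand_i], y[rand_i])
    return theta


# Synthetic data: intercept 3, slope 4, unit-variance noise (all assumed).
rng = np.random.default_rng(666)
m = 10_000
x = rng.normal(size=m)
y = 4.0 * x + 3.0 + rng.normal(size=m)

X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
theta = sgd(X_b, y, np.zeros(X_b.shape[1]), n_iters=m // 3)
print(theta)
```

Only a third of the samples are visited, yet the estimate should already be near the true intercept and slope. The decaying learning rate is what lets the noisy steps settle down.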

Debugging the gradient

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # numerical gradient by central differences; slow, but needs only J
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res
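A typical way to use dJ_debug is as a reference against which a hand-derived gradient is checked. The sketch below does this on random data; the name dJ_math and the data are assumptions for this example.

```python
import numpy as np


def J(theta, X_b, y):
    return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)


def dJ_math(theta, X_b, y):
    """Hand-derived gradient of the mean squared error."""
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)


def dJ_debug(theta, X_b, y, epsilon=0.01):
    """Numerical gradient by central differences."""
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_b = np.hstack([np.ones((len(X), 1)), X])
y = rng.normal(size=200)
theta = rng.normal(size=X_b.shape[1])

agree = np.allclose(dJ_math(theta, X_b, y), dJ_debug(theta, X_b, y))
print(agree)
```

Because dJ_debug calls J twice per coefficient, it is best kept for checking on small data rather than for training.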

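Summary

Batch Gradient Descent

Pros: stable; every step moves in the direction in which the loss function decreases fastest

Cons: relatively slow to compute

Stochastic Gradient Descent

Pros: relatively fast to compute

Cons: unstable; the direction of each step is uncertain and may even increase the loss

Mini-Batch Gradient Descent

Looks at k samples per step, combining the advantages of the two methods above.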
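Mini-batch gradient descent is not implemented in these notes; the following is a minimal sketch under assumptions. The function name mini_batch_gd, the batch size k = 32, sampling with replacement, and the synthetic data are all illustrative choices, while the t0/t1 learning-rate schedule is borrowed from the sgd function above.

```python
import numpy as np


def dJ_batch(theta, X_batch, y_batch):
    """Mean-squared-error gradient averaged over the rows of one batch."""
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) * 2.0 / len(X_batch)


def mini_batch_gd(X_b, y, initial_theta, n_iters, k=32, t0=5.0, t1=50.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = initial_theta
    for cur_iter in range(n_iters):
        idx = rng.integers(0, len(X_b), size=k)  # k random row indices (with replacement)
        eta = t0 / (cur_iter + t1)               # decaying learning rate
        theta = theta - eta * dJ_batch(theta, X_b[idx], y[idx])
    return theta


# Synthetic data: intercept 3, slope 4, unit-variance noise (all assumed).
rng = np.random.default_rng(666)
m = 10_000
x = rng.normal(size=m)
y = 4.0 * x + 3.0 + rng.normal(size=m)

X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
theta = mini_batch_gd(X_b, y, np.zeros(X_b.shape[1]), n_iters=1000)
print(theta)
```

Setting k = 1 recovers stochastic gradient descent. Using every row in each step instead of a random sample recovers batch gradient descent.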

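Randomness:

Pros: can escape local optima; runs faster

Many algorithms in machine learning take advantage of randomness, for example random search and random forests.

Gradient ascent: finds a maximum, i.e. maximizes a utility function.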
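Gradient ascent differs from gradient descent only in the sign of the update. A minimal sketch, assuming the utility function U(theta) = -(theta - 2.5)**2 + 1 (chosen for illustration, not taken from these notes):

```python
def dU(theta):
    """Derivative of the assumed utility U(theta) = -(theta - 2.5)**2 + 1."""
    return -2.0 * (theta - 2.5)


def gradient_ascent(initial_theta, eta, n_iters=1000):
    theta = initial_theta
    for _ in range(n_iters):
        theta = theta + eta * dU(theta)  # plus sign: climb instead of descend
    return theta


theta_max = gradient_ascent(0.0, eta=0.1)
print(theta_max)
```

Maximizing U this way is equivalent to running gradient descent on the loss -U.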
