Differences Between BGD, SGD, and MBGD in Gradient Descent

Gradient Descent

    In machine learning we usually train a model with gradient descent (or one of its variants). Training generally consists of four steps: initialize the parameters, run a forward pass to compute the error, run a backward pass to compute the gradients, and update the parameters with those gradients. We first fix some notation:
    the training set is $(x^{(i)}, y^{(i)})$, where $i$ is the sample index;
    the hypothesis is $h_{\theta}(x^{(i)})=\sum_{j=0}^{d}\theta_j x_j^{(i)}$, where $j$ indexes the features, $d$ is the number of features, and $x_0^{(i)}=1$ is the bias term;
    the corresponding loss function is $J(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2$, where $n$ is the number of training samples.
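
    As a minimal sketch of this four-step loop (not the full code, which appears at the end of the post), the snippet below assumes a generic helper `loss_and_grad` that returns the current loss and gradient; the function and parameter names here are hypothetical:

import numpy as np

def train(w, loss_and_grad, alpha=0.01, max_iters=1000, epsilon=1e-5):
    """Generic gradient-descent loop: forward pass, gradient, parameter update."""
    for _ in range(max_iters):
        loss, grad = loss_and_grad(w)              # forward pass (error) and backward pass (gradient)
        w_new = w - alpha * grad                   # step along the negative gradient
        if np.linalg.norm(w_new - w) < epsilon:    # stop once the update becomes negligible
            return w_new
        w = w_new
    return w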

Batch Gradient Descent (BGD)

    In batch gradient descent, the error and the gradient are computed over the entire training set as a whole at every step, and the parameters are updated repeatedly until the error reaches zero or falls within an acceptable tolerance.
    The partial derivative of the loss function $J(\theta)$ with respect to $\theta_j$ is:
$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_{j}}
&=\frac{\partial}{\partial\theta_j}\left(\frac{1}{2n}\sum_{i=1}^{n}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2\right)\\
&=\frac{1}{2n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_j}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2\\
&=\frac{1}{2n}\sum_{i=1}^{n}2\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\frac{\partial}{\partial\theta_j}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\\
&=\frac{1}{2n}\sum_{i=1}^{n}2\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\\
&=\frac{1}{n}\sum_{i=1}^{n}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}
\end{aligned}
$$
    The weight update rule is therefore:
$$
\begin{aligned}
\theta_j &=\theta_{j}-\alpha \frac{\partial J(\theta)}{\partial \theta_{j}}\\
&=\theta_{j}-\alpha \frac{1}{n} \sum_{i=1}^{n}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_{j}^{(i)}\\
&=\theta_{j}+\alpha \frac{1}{n} \sum_{i=1}^{n}\bigl(y^{(i)}-h_{\theta}(x^{(i)})\bigr)x_{j}^{(i)}
\end{aligned}
$$
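
    This full-batch update can also be written in vectorized NumPy form. The sketch below assumes a design matrix `X` of shape `(n, d+1)` with a leading column of ones for the bias and a target vector `y`; the function name `bgd_step` is hypothetical, and the loop-based version appears later as `func_BGD`:

import numpy as np

def bgd_step(theta, X, y, alpha):
    """One batch-gradient-descent step using the gradient averaged over all n samples."""
    n = X.shape[0]
    residual = X @ theta - y        # h_theta(x^(i)) - y^(i) for every sample
    grad = X.T @ residual / n       # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    return theta - alpha * grad     # theta_j <- theta_j - alpha * dJ/dtheta_j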

Stochastic Gradient Descent (SGD)

    Stochastic gradient descent performs one parameter update for every single training sample, and the sample order should be shuffled before each pass over the data.
    The weight update rule is: $\theta_j = \theta_j+\alpha\bigl(y^{(i)}-h_{\theta}(x^{(i)})\bigr)x_j^{(i)}$
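
    A minimal sketch of one SGD epoch, including the shuffling step mentioned above, might look like the following (the name `sgd_epoch` and its parameters are hypothetical; the loop-based version appears later as `func_SGD`):

import numpy as np

def sgd_epoch(theta, X, y, alpha):
    """One SGD pass: shuffle the sample order, then update the parameters once per sample."""
    for i in np.random.permutation(X.shape[0]):     # shuffle before each pass
        residual = y[i] - X[i] @ theta              # y^(i) - h_theta(x^(i))
        theta = theta + alpha * residual * X[i]     # per-sample update rule
    return theta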

Mini-Batch Gradient Descent (MBGD)

    MBGD is a compromise between SGD and BGD. Because SGD updates the parameters once for every single sample, the updates are very frequent, which easily causes large oscillations and makes training slow overall; BGD updates the parameters using the entire training set, so each update is slow. To speed up training, mini-batch gradient descent uses a small batch of samples as the unit for each parameter update.
    Python implementations of the three methods are given below:

import numpy as np
# Training data: y = 2x + 5 plus Gaussian noise
x = np.arange(0., 10., 0.2)
nums = x.shape[0]
b = np.full(nums, 1.)                 # bias column of ones
input_data = np.vstack((b, x)).T      # design matrix of shape (nums, 2)
target_data = 2 * x + 5 + np.random.randn(nums)
# Stopping criteria
loop_max = 1000                       # maximum number of passes over the data
epsilon = 1e-5                        # stop when the weight change is below this threshold

# Weight initialization
np.random.seed(0)
w = np.random.randn(2)
# Learning rate
alpha = 0.001

# Stochastic gradient descent (SGD)
def func_SGD(input_data, target_data, w, alpha=alpha, epsilon=epsilon, loop_max=loop_max):
    count = 0
    pre_w = w.copy()
    while count < loop_max:
        for i in np.random.permutation(input_data.shape[0]):   # shuffle before each pass
            diff = target_data[i] - np.dot(input_data[i], w)   # per-sample error
            gradient = input_data[i] * diff                    # per-sample gradient
            w = w + alpha * gradient
        if np.linalg.norm(w - pre_w) < epsilon:                # converged
            break
        pre_w = w.copy()
        count += 1
    return w

# Batch gradient descent (BGD)
def func_BGD(input_data, target_data, w, alpha=alpha, epsilon=epsilon, loop_max=loop_max):
    count = 0
    pre_w = w.copy()
    while count < loop_max:
        gradient_total = np.zeros_like(w)
        for i in range(input_data.shape[0]):
            diff = target_data[i] - np.dot(input_data[i], w)
            gradient_total += input_data[i] * diff             # accumulate over the whole training set
        w = w + alpha * gradient_total / input_data.shape[0]   # update with the average gradient
        if np.linalg.norm(w - pre_w) < epsilon:
            break
        pre_w = w.copy()
        count += 1
    return w

# Mini-batch gradient descent (MBGD)
def func_MSGD(input_data, target_data, w, batch_size=10, alpha=alpha, epsilon=epsilon, loop_max=loop_max):
    count = 0
    pre_w = w.copy()
    while count < loop_max:
        N = int(np.ceil(input_data.shape[0] / batch_size))     # number of mini-batches
        for i in range(N):
            input_temp = input_data[i * batch_size:(i + 1) * batch_size]    # slicing clamps the last batch
            target_temp = target_data[i * batch_size:(i + 1) * batch_size]
            gradient_all = np.zeros_like(w)
            for j in range(input_temp.shape[0]):
                diff = target_temp[j] - np.dot(input_temp[j], w)
                gradient_all += input_temp[j] * diff
            w = w + alpha * gradient_all / input_temp.shape[0] # average over the mini-batch
        if np.linalg.norm(w - pre_w) < epsilon:
            break
        pre_w = w.copy()
        count += 1
    return w
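
    A brief usage sketch, assuming the data, initial weights, and hyperparameters defined above; since the data are generated from y = 2x + 5 with noise, each method should drive the weights toward roughly (5, 2):

# Usage sketch (relies on input_data, target_data, w, alpha, epsilon, loop_max defined above)
w_sgd = func_SGD(input_data, target_data, w.copy())
w_bgd = func_BGD(input_data, target_data, w.copy())
w_mbgd = func_MSGD(input_data, target_data, w.copy(), batch_size=10)
print(w_sgd, w_bgd, w_mbgd)   # each should move toward roughly [5, 2] (intercept, slope)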
