The concept of the gradient
For example, for a function f(x, y), taking the partial derivative with respect to each of x and y gives the gradient vector $(\partial f/\partial x, \partial f/\partial y)^T$, written $\mathrm{grad}\, f(x, y)$ or $\nabla f(x, y)$.
At a specific point $(x_0, y_0)$, the gradient is the vector of partial derivatives evaluated at that point, $(\partial f/\partial x, \partial f/\partial y)^T\big|_{(x_0, y_0)}$, also written $\nabla f(x_0, y_0)$.
If the function has three variables, the gradient is $(\partial f/\partial x, \partial f/\partial y, \partial f/\partial z)^T$, and so on.
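As a quick numerical check of this definition (the test function f and the finite-difference step h below are illustrative choices, not from the text above):

import numpy as np

def numerical_gradient(f, point, h=1e-6):
    # approximate (∂f/∂x, ∂f/∂y, ...) with central differences
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)
    return grad

f = lambda p: p[0] ** 2 + 3 * p[1]           # f(x, y) = x^2 + 3y
print(numerical_gradient(f, [1.0, 2.0]))     # ≈ [2, 3], i.e. (∂f/∂x, ∂f/∂y) at (1, 2)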
The meaning of the gradient
The gradient at a point is the direction in which f increases fastest, and its magnitude is the rate of that increase; correspondingly, the negative gradient is the direction in which f decreases fastest. This is the property that gradient-based optimization relies on.
Loss functions
The loss function measures how much the model's predictions deviate from the true values; generally, the smaller the loss, the better the model performs.
Loss functions can be divided into empirical risk loss functions and structural risk loss functions.
The empirical risk loss measures the discrepancy between the predicted results and the actual results.
The structural risk loss is the empirical risk loss plus a regularization term.
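In symbols, with a per-sample loss L and a regularizer $\Omega(\theta)$ weighted by a coefficient $\lambda$ (this notation is an illustrative assumption, not taken from the text above), the structural risk can be written as:

$R_{\mathrm{struct}}(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m} L\left(y_i, f_\theta(x_i)\right) + \lambda\, \Omega(\theta)$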
Gradient descent and gradient ascent
For example, when we need to find the minimum of a loss function f(θ), we use gradient descent to solve for it iteratively. Conversely, we could instead maximize -f(θ), and then gradient ascent comes into play.
For intuition, imagine we are standing somewhere on a large mountain. Since we do not know the way down, we decide to take it one step at a time: at each position we compute the gradient at that point and take a step along the negative gradient direction (the direction in which the function decreases fastest), that is, the steepest way down from where we currently stand; then we compute the gradient at the new position and again take a step along the steepest, easiest-to-descend direction from there.
Walking step by step like this, we continue until we feel we have reached the foot of the mountain. Of course, proceeding this way we may never reach the true foot of the mountain, but instead end up at the low point of some local valley.
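As a minimal sketch of this iterative procedure (the one-dimensional function f(x) = x² and the step size below are illustrative choices, not from the example above):

def gradient_descent_1d(x0, learning_rate=0.1, num_steps=50):
    # minimize f(x) = x**2, whose derivative is 2*x
    x = x0
    for _ in range(num_steps):
        grad = 2 * x                   # gradient of f at the current position
        x = x - learning_rate * grad   # take a step along the negative gradient
    return x

print(gradient_descent_1d(10.0))       # approaches the minimizer x = 0

Each pass through the loop is one "step down the mountain"; the linear regression example later in this article follows exactly the same pattern with two parameters instead of one.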
Related concepts to know before studying the gradient descent algorithm:
Step size
(Learning rate): the step size determines how far we move along the negative gradient direction at each iteration of gradient descent. In the mountain example above, the step size is the length of the step we take from the current position along the steepest, easiest-to-descend direction.
Feature
(feature): the input part of a sample. For example, for two single-feature samples $(x^{(0)}, y^{(0)})$ and $(x^{(1)}, y^{(1)})$, the feature of the first sample is $x^{(0)}$ and its output is $y^{(0)}$.
Hypothesis function
(hypothesis function): in supervised learning, the function used to fit the input samples, written $h_\theta(x)$. For example, for m single-feature samples $(x^{(i)}, y^{(i)})\ (i = 1, 2, \ldots, m)$, one can use a fitting function such as:
$h_\theta(x) = \theta_0 + \theta_1 x$
Loss function
(loss function): to evaluate how well the model fits, a loss function is used to measure the goodness of fit. Minimizing the loss function means the fit is as good as possible, and the corresponding model parameters are the optimal parameters. In linear regression the loss is usually the squared difference between the sample output and the hypothesis function. For m samples $(x_i, y_i)\ (i = 1, 2, \ldots, m)$ fitted with linear regression, the loss function is:
$J(\theta_0, \theta_1) = \sum\limits_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$ (where $x_i$ is the feature of the i-th sample, $y_i$ the corresponding output, and $h_\theta(x_i)$ the hypothesis function)
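A minimal NumPy sketch of evaluating this loss (the array names and sample values below are illustrative):

import numpy as np

def squared_error_loss(theta0, theta1, x, y):
    # J(theta0, theta1) = sum_i (h_theta(x_i) - y_i)^2
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])
print(squared_error_loss(0.0, 2.0, x, y))   # loss for theta0 = 0, theta1 = 2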
The drawbacks of gradient descent include: it may converge to a local minimum rather than the global one (as in the mountain analogy above), convergence can be slow, and the result is sensitive to the choice of step size and starting point.
Below we implement a simple gradient descent algorithm in Python.
The scenario is a simple linear regression example: suppose we have a set of points stored in the file data.csv, and we use gradient descent to fit a straight line to them.
Contents of data.csv (each comma-separated x,y pair is one line of the file):
32.502345269453031,31.70700584656992 53.426804033275019,68.77759598163891 61.530358025636438,62.562382297945803 47.475639634786098,71.546632233567777 59.813207869512318,87.230925133687393 55.142188413943821,78.211518270799232 52.211796692214001,79.64197304980874 39.299566694317065,59.171489321869508 48.10504169176825,75.331242297063056 52.550014442733818,71.300879886850353 45.419730144973755,55.165677145959123 54.351634881228918,82.478846757497919 44.164049496773352,62.008923245725825 58.16847071685779,75.392870425994957 56.727208057096611,81.43619215887864 48.955888566093719,60.723602440673965 44.687196231480904,82.892503731453715 60.297326851333466,97.379896862166078 45.618643772955828,48.847153317355072 38.816817537445637,56.877213186268506 66.189816606752601,83.878564664602763 65.41605174513407,118.59121730252249 47.48120860786787,57.251819462268969 41.57564261748702,51.391744079832307 51.84518690563943,75.380651665312357 59.370822011089523,74.765564032151374 57.31000343834809,95.455052922574737 63.615561251453308,95.229366017555307 46.737619407976972,79.052406169565586 50.556760148547767,83.432071421323712 52.223996085553047,63.358790317497878 35.567830047746632,41.412885303700563 42.436476944055642,76.617341280074044 58.16454011019286,96.769566426108199 57.504447615341789,74.084130116602523 45.440530725319981,66.588144414228594 61.89622268029126,77.768482417793024 33.093831736163963,50.719588912312084 36.436009511386871,62.124570818071781 37.675654860850742,60.810246649902211 44.555608383275356,52.682983366387781 43.318282631865721,58.569824717692867 50.073145632289034,82.905981485070512 43.870612645218372,61.424709804339123 62.997480747553091,115.24415280079529 32.669043763467187,45.570588823376085 40.166899008703702,54.084054796223612 53.575077531673656,87.994452758110413 33.864214971778239,52.725494375900425 64.707138666121296,93.576118692658241 38.119824026822805,80.166275447370964 44.502538064645101,65.101711570560326 40.599538384552318,65.562301260400375 41.720676356341293,65.280886920822823 51.088634678336796,73.434641546324301 55.078095904923202,71.13972785861894 41.377726534895203,79.102829683549857 62.494697427269791,86.520538440347153 49.203887540826003,84.742697807826218 41.102685187349664,59.358850248624933 41.182016105169822,61.684037524833627 50.186389494880601,69.847604158249183 52.378446219236217,86.098291205774103 50.135485486286122,59.108839267699643 33.644706006191782,69.89968164362763 39.557901222906828,44.862490711164398 56.130388816875467,85.498067778840223 57.362052133238237,95.536686846467219 60.269214393997906,70.251934419771587 35.678093889410732,52.721734964774988 31.588116998132829,50.392670135079896 53.66093226167304,63.642398775657753 46.682228649471917,72.247251068662365 43.107820219102464,57.812512976181402 70.34607561504933,104.25710158543822 44.492855880854073,86.642020318822006 57.50453330326841,91.486778000110135 36.930076609191808,55.231660886212836 55.805733357942742,79.550436678507609 38.954769073377065,44.847124242467601 56.901214702247074,80.207523139682763 56.868900661384046,83.14274979204346 34.33312470421609,55.723489260543914 59.04974121466681,77.634182511677864 57.788223993230673,99.051414841748269 54.282328705967409,79.120646274680027 51.088719898979143,69.588897851118475 50.282836348230731,69.510503311494389 44.211741752090113,73.687564318317285 38.005488008060688,61.366904537240131 32.940479942618296,67.170655768995118 53.691639571070056,85.668203145001542 68.76573426962166,114.85387123391394 46.230966498310252,90.123572069967423 
68.319360818255362,97.919821035242848 50.030174340312143,81.536990783015028 49.239765342753763,72.111832469615663 50.039575939875988,85.232007342325673 48.149858891028863,66.224957888054632 25.128484647772304,53.454394214850524
1. Define the loss function and specify the prediction function
We define a cost (loss) function; here we choose the mean squared error cost function (squared error cost function):
$J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$
Prediction function:
$h_\theta(x) = \theta_0 + \theta_1 x$ (think of $\theta_0$ as the bias b and $\theta_1$ as the weight w)
2. Take the partial derivatives of the loss function with respect to the two variables (b and w)
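Writing the prediction as $h_\theta(x_i) = w x_i + b$, the partial derivatives of the loss above are (spelled out here for completeness):

$\frac{\partial J}{\partial b} = \frac{1}{m}\sum\limits_{i=1}^{m}\left((w x_i + b) - y_i\right), \qquad \frac{\partial J}{\partial w} = \frac{1}{m}\sum\limits_{i=1}^{m}\left((w x_i + b) - y_i\right) x_i$

These are exactly the quantities accumulated as b_gradient and w_gradient in the code below.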
3. Code
import numpy as np
'''
b      : current value of b
w      : current value of w
points : the set of sample points
Purpose: compute the average loss for the current b and w
'''
def compute_error_for_line_given_points(b, w, points):
    totalError = 0                              # accumulated squared error
    for i in range(0, len(points)):             # iterate over the point set
        x = points[i, 0]                        # x coordinate of the point
        y = points[i, 1]                        # y coordinate of the point
        totalError += (y - (w * x + b)) ** 2    # add the squared error of this point
    return totalError / float(len(points))      # average loss over all points
'''
loss = (w * x + b - y)^2
b_current    : the current value of b
w_current    : the current value of w
points       : the set of sample points
learningRate : step size
Purpose: perform one gradient descent iteration and return the updated b and w
'''
def step_gradient(b_current, w_current, points, learningRate):
    b_gradient = 0                      # accumulated partial derivative w.r.t. b
    w_gradient = 0                      # accumulated partial derivative w.r.t. w
    N = float(len(points))              # N : number of points
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # accumulate the averaged first-order derivatives with respect to b and w;
        # the leading minus sign comes from differentiating (y - (w*x + b))**2 by the
        # chain rule, and the descent step (moving along the negative gradient) is the
        # subtraction in the update below
        b_gradient += -(1/N) * (y - ((w_current * x) + b_current))
        w_gradient += -(1/N) * x * (y - ((w_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_w = w_current - (learningRate * w_gradient)
    return [new_b, new_w]
'''
points         : the set of sample points
starting_b     : initial value of b
starting_w     : initial value of w
learning_rate  : step size
num_iterations : number of iterations
'''
def gradient_descent_runner(points, starting_b, starting_w, learning_rate, num_iterations):
    b = starting_b
    w = starting_w
    for i in range(num_iterations):     # update b and w for num_iterations iterations
        b, w = step_gradient(b, w, np.array(points), learning_rate)
    return [b, w]                       # b and w after num_iterations iterations
# Adjust b and w so that the linear function deviates less from the sample points,
# i.e. so that the loss value becomes smaller
def run():
    points = np.genfromtxt("data.csv", delimiter=",")   # read all sample points with a library function
    learning_rate = 0.0001    # step size
    initial_b = 0             # initial guess for b
    initial_w = 0             # initial guess for w
    num_iterations = 1000     # number of iterations: 1000
    print("Starting gradient descent at b = {0}, w = {1}, error = {2}".format(initial_b, initial_w, compute_error_for_line_given_points(initial_b, initial_w, points)))
    print("Running...")
    [b, w] = gradient_descent_runner(points, initial_b, initial_w, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, w = {2}, error = {3}".format(num_iterations, b, w, compute_error_for_line_given_points(b, w, points)))

if __name__ == '__main__':
    run()
Output:
Starting gradient descent at b = 0, w = 0, error = 5565.107834483211
Running...
After 1000 iterations b = 0.059058556642160816, w = 1.4783313274545458, error = 112.63267078710943
Solving in this way yields (approximately) optimal values of b and w, from which we can write out the best-fit line:
$y = 1.47833x + 0.05906$
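To see the fit visually, a small matplotlib sketch like the following could be run once the values above are obtained (matplotlib is an extra dependency not used in the code above; the variable names here are illustrative):

import numpy as np
import matplotlib.pyplot as plt

points = np.genfromtxt("data.csv", delimiter=",")
x, y = points[:, 0], points[:, 1]
b, w = 0.05906, 1.47833                 # values obtained by the gradient descent above

plt.scatter(x, y, label="samples")                          # the raw points
plt.plot(x, w * x + b, color="red", label="fitted line")    # y = w*x + b
plt.legend()
plt.show()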
(Figures omitted here: 2. solving for w and b; 3. deriving the optimal equation.)
A value of b ≈ 0.089 (rather than the 0.059 obtained above) is what one gets if the loss function is taken without the factor 1/2, i.e. as:

$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$

In practice, whether or not the factor of 2 is present makes little difference to the final fitted line.
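The reason is that dropping the 1/2 simply doubles every gradient, which under the update $\theta \leftarrow \theta - \eta\,\nabla J(\theta)$ is equivalent to doubling the learning rate; the location of the minimum itself is unchanged:

$\nabla_\theta \left[\frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2\right] = 2\,\nabla_\theta \left[\frac{1}{2m}\sum\limits_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2\right]$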
The same ideas carry over to handwritten-digit recognition; the ingredients are listed below, and a small sketch of the prediction step follows the list:
Converting the 2D image into a 1D vector of points
Computing the prediction function with matrix operations
The loss function
A nonlinear factor: the ReLU function
The probabilities of the recognized digits
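A minimal sketch of what such a matrix-based prediction function could look like (the layer sizes, random weights, and softmax output below are illustrative assumptions, not taken from the text above):

import numpy as np

def relu(z):
    return np.maximum(0, z)                 # the nonlinear factor

def predict_digit_probs(image, W1, b1, W2, b2):
    x = image.reshape(-1)                   # flatten the 2D image into a 1D vector of points
    h = relu(W1 @ x + b1)                   # prediction computed with matrix operations
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())     # softmax, shifted for numerical stability
    return exp / exp.sum()                  # probability assigned to each digit 0-9

# toy usage with random weights: a 28x28 image, 32 hidden units, 10 digit classes
rng = np.random.default_rng(0)
image = rng.random((28, 28))
W1, b1 = 0.01 * rng.normal(size=(32, 784)), np.zeros(32)
W2, b2 = 0.01 * rng.normal(size=(10, 32)), np.zeros(10)
print(predict_digit_probs(image, W1, b1, W2, b2))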