Searching two weights, w1 and w2.
Suppose each axis has 64 candidate values, so exhaustive grid search would need 64 × 64 evaluations.
Instead, split each axis into 4 parts, giving 4 × 4 = 16 coarse cells.
In the first round, evaluate these 16 cells and keep the one with the smallest loss,
then subdivide that cell and repeat; a few more rounds are basically enough.
The number of evaluations drops from a product (16 × 16) to roughly a sum (16 + 16).
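A minimal sketch of this coarse-to-fine search. The toy loss function, the search range, and the number of rounds are illustrative assumptions, not from the original notes; this shortcut only works reliably when the loss surface has a single broad basin.

```python
import numpy as np

def loss(w1, w2):
    # Hypothetical loss with a single minimum at (2.0, -1.0), just for illustration.
    return (w1 - 2.0) ** 2 + (w2 + 1.0) ** 2

def coarse_to_fine_search(lo1, hi1, lo2, hi2, splits=4, rounds=3):
    # Each round evaluates only splits * splits points instead of the full grid,
    # then zooms into the best cell -- additive rather than multiplicative cost.
    for _ in range(rounds):
        w1_grid = np.linspace(lo1, hi1, splits)
        w2_grid = np.linspace(lo2, hi2, splits)
        _, best_w1, best_w2 = min((loss(a, b), a, b) for a in w1_grid for b in w2_grid)
        # Shrink the search box around the best point for the next round.
        step1 = (hi1 - lo1) / splits
        step2 = (hi2 - lo2) / splits
        lo1, hi1 = best_w1 - step1, best_w1 + step1
        lo2, hi2 = best_w2 - step2, best_w2 + step2
    return best_w1, best_w2

print(coarse_to_fine_search(-10, 10, -10, 10))  # should land close to (2.0, -1.0)
```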
Gradient descent only guarantees a local optimum. In practice, though, people have found that neural network loss surfaces do not contain many bad local minima; the trickier issue is saddle points, where the gradient is also zero.
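For the linear model ŷ = x·w used in the code below, the cost and its gradient are the standard mean-squared-error formulas, and the update rule is plain gradient descent (α is the learning rate, called `xxl` in the code):

$$
\text{cost}(w)=\frac{1}{N}\sum_{n=1}^{N}\bigl(x_n w-y_n\bigr)^2,\qquad
\frac{\partial\,\text{cost}}{\partial w}=\frac{1}{N}\sum_{n=1}^{N}2x_n\bigl(x_n w-y_n\bigr),\qquad
w \leftarrow w-\alpha\,\frac{\partial\,\text{cost}}{\partial w}
$$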
import numpy as np
import matplotlib.pyplot as plt

xxl = 0.01  # learning rate
w = 1.0     # initial guess for the weight

# Define the model: a linear model without bias, y_hat = x * w
def forward(x):
    return x * w

# Mean squared error over the whole training set (batch cost)
def cost(xs, ys):
    cost_sum = 0
    for x, y in zip(xs, ys):
        y_prediction = forward(x)
        cost_sum += (y_prediction - y) ** 2
    return cost_sum / len(xs)

# Gradient of the cost w.r.t. w, averaged over the whole training set
def gradient(xs, ys):
    grad = 0
    for x, y in zip(xs, ys):
        y_prediction = forward(x)
        grad += 2 * x * (y_prediction - y)
    return grad / len(xs)

# Define the training set
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

print('Prediction (before training)', 4, forward(4))
for epoch in range(100):
    cost_val = cost(x_data, y_data)
    grad = gradient(x_data, y_data)
    w = w - xxl * grad          # one update per epoch, using the full batch
    print("progress:", epoch, "w=", w, "loss=", cost_val)
print('Prediction (after training)', 4, forward(4))
Plotting an exponentially weighted average of the loss gives a smoother training curve.
Training must converge: if the loss diverges, training has failed, most likely because the learning rate is too large.
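A minimal sketch of such exponential smoothing applied to a list of recorded losses; the helper name `smooth` and the weight `beta = 0.9` are illustrative assumptions, not from the original notes.

```python
def smooth(losses, beta=0.9):
    # Exponentially weighted moving average: each point mixes the previous
    # smoothed value with the current raw loss, producing a less noisy curve.
    smoothed, avg = [], losses[0]
    for raw in losses:
        avg = beta * avg + (1 - beta) * raw
        smoothed.append(avg)
    return smoothed

# e.g. plt.plot(smooth(loss_history)) instead of plt.plot(loss_history)
```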
| Method | Speed | Performance (saddle points) |
| --- | --- | --- |
| Gradient descent (batch) | Fast (can be parallelized: the terms for x_i and x_{i+1} are independent) | Poor |
| Stochastic gradient descent | Slow (updates are serial: each w depends on the previous update) | Good |
So in practice we compromise between the two:
mini-batch stochastic gradient descent, i.e. updating on small batches of samples (in deep-learning code this is often just called "batch" training); a mini-batch sketch follows after the per-sample SGD code below.
# Stochastic gradient descent
import numpy as np
import matplotlib.pyplot as plt

xxl = 0.01  # learning rate
w = 1.0     # initial guess for the weight

# Define the model: y_hat = x * w
def forward(x):
    return x * w

# Loss for a single sample
def Loss_Function(x, y):
    y_prediction = forward(x)
    return (y_prediction - y) ** 2

# Gradient of the single-sample loss w.r.t. w
def gradient(x, y):
    y_prediction = forward(x)
    return 2 * x * (y_prediction - y)

# Define the training set
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

print('Prediction (before training)', 4, forward(4))
for epoch in range(100):
    for x, y in zip(x_data, y_data):
        grad = gradient(x, y)
        w = w - xxl * grad      # update after every single sample
        print('\tgradient:', x, y, grad)
        l = Loss_Function(x, y)
    print("progress:", epoch, "w=", w, "loss=", l)
print('Prediction (after training)', 4, forward(4))
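And the mini-batch compromise mentioned above, as a minimal sketch: update once per small batch instead of once per sample or once per epoch. The batch size of 2, the slightly enlarged toy data set, and the shuffling scheme are illustrative assumptions, not from the original notes.

```python
import numpy as np

xxl = 0.01  # learning rate
w = 1.0     # initial guess for the weight

x_data = np.array([1.0, 2.0, 3.0, 4.0])  # slightly larger toy set, for batching
y_data = np.array([2.0, 4.0, 6.0, 8.0])

batch_size = 2
for epoch in range(100):
    # Shuffle once per epoch, then do one update per mini-batch.
    idx = np.random.permutation(len(x_data))
    for start in range(0, len(x_data), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x_data[batch], y_data[batch]
        y_pred = xb * w
        grad = (2 * xb * (y_pred - yb)).mean()  # gradient averaged over the mini-batch
        w = w - xxl * grad

print('Prediction (after training)', 4, 4 * w)
```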