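The loss function tells us how good a given weight matrix $\mathbf{W}$ is;
the goal of optimization is to find a $\mathbf{W}$ that makes the loss as small as possible.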
import numpy as np

# Assumes X_train holds one example per column (3073 rows), Y_train holds the
# labels, and L(X, Y, W) returns the loss over the given data
bestloss = float("inf")  # start at +infinity so the first attempt always wins
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)  # get the loss over the entire training set
    if loss < bestloss:  # keep track of the best solution
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
# Assume X_test is [3073 x 10000], Y_test [10000 x 1]
scores = bestW.dot(X_test)  # 10 x 10000, the class scores for all test examples
# find the index with max score in each column (the predicted class)
Y_test_predict = np.argmax(scores, axis=0)
# and calculate accuracy (fraction of predictions that are correct);
# Y_test is flattened so the comparison is element-wise rather than broadcast
np.mean(Y_test_predict == Y_test.ravel())
# returns 0.1555
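The random search above relies on a loss function `L` that is not defined in these notes. As a minimal sketch, assuming `L` is the average multiclass SVM (hinge) loss with one example per column, it could look like the following. The synthetic data (`X_toy`, `Y_toy`), the `delta` parameter, and the reduced sizes are assumptions chosen so the example runs quickly; this is not CIFAR-10.

```python
import numpy as np

def L(X, Y, W, delta=1.0):
    """Average multiclass SVM (hinge) loss; a hypothetical stand-in for L.

    X: [D x N] data, one example per column
    Y: [N] integer class labels
    W: [C x D] weight matrix
    """
    n = X.shape[1]
    scores = W.dot(X)                              # [C x N] class scores
    correct = scores[Y, np.arange(n)]              # [N] score of the true class
    margins = np.maximum(0.0, scores - correct + delta)
    margins[Y, np.arange(n)] = 0.0                 # the sum skips j == y_i
    return margins.sum() / n

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((5, 200))   # 5 features, 200 synthetic examples
Y_toy = rng.integers(0, 3, size=200)    # 3 classes

bestloss = float("inf")
losses = []
for num in range(100):
    W = rng.standard_normal((3, 5)) * 0.0001   # random candidate weights
    loss = L(X_toy, Y_toy, W)
    losses.append(loss)
    if loss < bestloss:                        # keep the best candidate so far
        bestloss, bestW = loss, W

print('best loss after %d attempts: %f' % (len(losses), bestloss))
```

With weights this small every score is close to zero, so each of the two wrong classes contributes a margin of about `delta`, and the loss stays near 2 regardless of the draw. That illustrates why pure random guessing makes so little progress.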
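In practice, finding the best $\mathbf{W}$ directly is hard, but starting from a random matrix and improving it a little at each step is much easier.
Our strategy will be to start with random weights and iteratively refine them over time to get lower loss.
Start from a random matrix $\mathbf{W}$, add a randomly generated perturbation $\delta \mathbf{W}$, and keep the update only if the loss decreases.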
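# Xtr_cols holds the training data with one example per column, Ytr the labels
W = np.random.randn(10, 3073) * 0.001  # generate random starting W
bestloss = float("inf")
for i in range(1000):
    step_size = 0.0001
    Wtry = W + np.random.randn(10, 3073) * step_size  # perturb the current W
    loss = L(Xtr_cols, Ytr, Wtry)
    if loss < bestloss:  # accept the perturbation only if it lowers the loss
        W = Wtry
        bestloss = loss
    print('iter %d loss is %f' % (i, bestloss))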
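To make the accept-only-if-better rule concrete without the CIFAR-10 data, here is a small self-contained sketch of random local search. The loss `toy_loss` and its optimum `W_target` are hypothetical stand-ins (a squared distance to a fixed matrix), and the sizes and step size are assumptions; only the search loop mirrors the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.standard_normal((2, 3))   # hypothetical optimum of the toy loss

def toy_loss(W):
    """Stand-in loss: squared distance to W_target (not the CIFAR-10 loss)."""
    return float(np.sum((W - W_target) ** 2))

W = rng.standard_normal((2, 3)) * 0.001  # random starting W
initial_loss = toy_loss(W)
bestloss = initial_loss
step_size = 0.01
for i in range(1000):
    Wtry = W + rng.standard_normal((2, 3)) * step_size  # random perturbation
    loss = toy_loss(Wtry)
    if loss < bestloss:                  # accept only improving steps
        W, bestloss = Wtry, loss

print('loss went from %f to %f' % (initial_loss, bestloss))
```

Because a rejected perturbation leaves `W` unchanged, the recorded loss can never increase; the price is that many evaluations are wasted on directions that do not help.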
In one dimension, the gradient at a point is simply the derivative at that point:
$$\frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$$
$\color{green}{\text{Computing the gradient numerically with finite differences}}$
def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001
    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate function at x+h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h  # increment by h
        fxh = f(x)  # evaluate f(x + h)
        x[ix] = old_value  # restore to previous value (very important!)
        # compute the partial derivative
        grad[ix] = (fxh - fx) / h  # the slope
        it.iternext()  # step to next dimension
    return grad
In practice, the centered difference formula usually gives a more accurate approximation of the derivative:
$$\frac{f(x+h)-f(x-h)}{2h}$$
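A centered-difference variant of the numerical gradient could be sketched as follows. The function name `eval_numerical_gradient_centered` and the default `h` are assumptions made for illustration; the loop uses `np.ndindex` rather than `np.nditer` purely for brevity.

```python
import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
    """Numerical gradient of f at x using (f(x+h) - f(x-h)) / (2h).

    - f takes a single numpy array argument and returns a scalar
    - x is the point (numpy array) at which to evaluate the gradient
    """
    x = x.astype(float)                  # float copy; the caller's array is untouched
    grad = np.zeros_like(x)
    for ix in np.ndindex(x.shape):       # visit every index of x
        old_value = x[ix]
        x[ix] = old_value + h
        fxph = f(x)                      # f(x + h) along this coordinate
        x[ix] = old_value - h
        fxmh = f(x)                      # f(x - h) along this coordinate
        x[ix] = old_value                # restore before moving on
        grad[ix] = (fxph - fxmh) / (2 * h)
    return grad

# Usage: the gradient of sum(v**2) is 2*v
point = np.array([[1.0, 2.0], [3.0, 4.0]])
print(eval_numerical_gradient_centered(lambda v: np.sum(v ** 2), point))
```

This costs two function evaluations per coordinate instead of one, in exchange for an error that shrinks with $h^2$ rather than $h$.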
def CIFAR10_loss_fun(W):
    return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001  # random weight matrix
df = eval_numerical_gradient(CIFAR10_loss_fun, W)  # get the gradient
loss_original = CIFAR10_loss_fun(W)  # the original loss
print('original loss: %f' % (loss_original,))
# let's see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
    step_size = 10 ** step_size_log
    W_new = W - step_size * df  # new position in the weight space
    loss_new = CIFAR10_loss_fun(W_new)
    # %e prints the step size in scientific notation, matching the output below
    print('for step size %e new loss: %f' % (step_size, loss_new))
# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036
$\color{green}{\text{Computing the gradient analytically with calculus}}$
For a single example $(x_i, y_i)$, the SVM loss is
$$L_i = \sum_{j \neq y_i} \left[\max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)\right]$$
Its gradient with respect to the correct-class row $w_{y_i}$ is
$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\right) x_i$$
and for every other row $w_j$ with $j \neq y_i$ it is
$$\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i$$
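The analytic gradient of the single-example SVM loss translates fairly directly into NumPy. The sketch below assumes `W` has one row per class; the function name `svm_loss_and_grad_single` and the `delta` default are illustrative choices, not taken from the notes.

```python
import numpy as np

def svm_loss_and_grad_single(W, x, y, delta=1.0):
    """SVM loss L_i and its gradient dL_i/dW for one example.

    W: [C x D] float weights, x: [D] features, y: integer label.
    """
    scores = W.dot(x)                    # [C] class scores w_j^T x
    margins = scores - scores[y] + delta
    margins[y] = 0.0                     # the sum skips j == y
    active = margins > 0                 # indicator 1(w_j^T x - w_y^T x + delta > 0)
    loss = margins[active].sum()
    dW = np.zeros_like(W)
    dW[active] = x                       # each violating row j != y receives +x
    dW[y] = -active.sum() * x            # correct-class row: -(number of violations) * x
    return loss, dW

# With zero weights both wrong classes violate the margin by exactly delta
loss, dW = svm_loss_and_grad_single(np.zeros((3, 2)), np.array([1.0, 2.0]), y=0)
print(loss)   # 2.0
print(dW)     # rows: [-2, -4], [1, 2], [1, 2]
```

A common sanity check is to compare `dW` against a numerical gradient at a random `W`; the two should agree closely except when a margin sits right at the kink of the `max`.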
$\color{green}{\text{Updating the parameters with gradient descent}}$
# Vanilla gradient descent: the gradient is computed over the full dataset
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update
Mini-batch gradient descent:
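In practice, the gradient is usually estimated on a small batch of the training data, and the parameters are updated from that estimate. Typical batch sizes are 32, 64, 128, or 256 examples (powers of 2, because many vectorized operation implementations work faster when their inputs are sized in powers of 2).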
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update
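The mini-batch loop above is pseudocode: `sample_training_data` and `evaluate_gradient` are not defined in these notes. As a rough sketch, assuming a least-squares loss on synthetic regression data instead of the SVM loss on CIFAR-10, the helpers and a bounded version of the loop might look like this. The simplified two-argument `evaluate_gradient`, the data, the step size, and the iteration count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption, not CIFAR-10): y = X w_true + noise
X = rng.standard_normal((2000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X.dot(w_true) + 0.01 * rng.standard_normal(2000)
data = (X, y)

def sample_training_data(data, batch_size):
    """Hypothetical helper: draw batch_size examples without replacement."""
    X, y = data
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return X[idx], y[idx]

def evaluate_gradient(data_batch, weights):
    """Hypothetical helper: gradient of 0.5 * mean((Xb w - yb)^2) w.r.t. w."""
    Xb, yb = data_batch
    return Xb.T.dot(Xb.dot(weights) - yb) / len(yb)

weights = np.zeros(5)
step_size = 0.1
for _ in range(500):                     # bounded loop instead of `while True`
    data_batch = sample_training_data(data, 256)   # sample 256 examples
    weights_grad = evaluate_gradient(data_batch, weights)
    weights += -step_size * weights_grad           # perform parameter update

print(np.round(weights, 2))
```

Each update sees only 256 of the 2000 examples, so the gradient is a noisy estimate of the full-data gradient, yet the weights still settle close to `w_true` at a fraction of the cost per step.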