Notes on cs231n assignment 3

......

Q1: Image Captioning with Vanilla RNNs (25 points)

First up is rnn_step_forward, which just follows the formula directly:

next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)  # [N, H]
cache = (x, prev_h, Wx, Wh, b, next_h)

For rnn_step_backward, using the derivative of tanh,

\tanh'(z) = 1 - \tanh^2(z)

we get:

x, prev_h, Wx, Wh, b, next_h = cache
dtanh = dnext_h * (1 - next_h * next_h)  # [N, H]
db = np.sum(dtanh, axis=0)  # [H, ]
dWh = (prev_h.T).dot(dtanh)  # [H, H]
dWx = (x.T).dot(dtanh)  # [D, H]
dprev_h = dtanh.dot(Wh.T)  # [N, H]
dx = dtanh.dot(Wx.T)  # [N, D]
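
To sanity-check the backward pass against a numerical gradient (as the notebook does), here is a minimal sketch, assuming the assignment's standard file layout (cs231n/rnn_layers.py and cs231n/gradient_check.py):

import numpy as np
from cs231n.rnn_layers import rnn_step_forward, rnn_step_backward
from cs231n.gradient_check import eval_numerical_gradient_array

np.random.seed(0)
N, D, H = 4, 5, 6
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

# numerical gradient w.r.t. x; the other inputs are checked the same way
fx = lambda x: rnn_step_forward(x, prev_h, Wx, Wh, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dnext_h)
rel_error = np.max(np.abs(dx - dx_num) / np.maximum(1e-8, np.abs(dx) + np.abs(dx_num)))
print(rel_error)  # should be around 1e-8 or smaller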

rnn_forward calls rnn_step_forward in a loop; the unrolled computation looks like the figure below (taken from the lecture slides):

[Figure 1: the unrolled RNN computation graph, from the lecture slides]

Code:

N, T, D = x.shape
H = h0.shape[1]
h = np.zeros((N, T, H))
prev_h = h0
for i in range(T):
    next_h, _ = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)
    prev_h = next_h
    h[:, i, :] = prev_h
cache = (x, h0, Wh, Wx, b, h)

For rnn_backward, note that dh has shape (N, T, H): it collects the upstream gradients flowing into every h output in the figure above. After the grind of the previous assignments this is pretty easy to write now:

x, h0, Wh, Wx, b, h = cache
N, T, D = x.shape
dprev_h = np.zeros_like(h0)
dx = np.zeros_like(x)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
for i in range(T):
    if i == T-1:
        prev_h = h0
    else:
        prev_h = h[:, T-i-2, :]
    next_h = h[:, T-i-1, :]
    cache2 = (x[:, T-i-1, :], prev_h, Wx, Wh, b, next_h)
    dnext_h = dh[:, T - i - 1, :] + dprev_h
    dx1, dprev_h, dWx1, dWh1, db1 = rnn_step_backward(dnext_h, cache2)
    dx[:, T-i-1, :] = dx1
    dWx += dWx1
    dWh += dWh1
    db += db1
dh0 = dprev_h

Next is word_embedding_forward. As the hint says, NumPy integer-array indexing is all it takes:

out = W[x, :]
cache = (x, W)
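
As a quick illustration of what this fancy indexing does (a toy example, not from the assignment): for an index array x of shape (N, T), W[x] picks the corresponding row of W for every entry, giving an output of shape (N, T, D).

W = np.arange(12).reshape(4, 3)    # vocabulary of 4 words, embedding dim 3
x = np.array([[0, 2], [2, 3]])     # (N=2, T=2) word indices
out = W[x]                         # shape (2, 2, 3)
print(out[1, 0])                   # row 2 of W -> [6 7 8]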

Then word_embedding_backward. The hint says to use np.add.at: from the forward pass above, dW should accumulate dout at every index where each word is used, so:

x, W = cache
dW = np.zeros_like(W)
np.add.at(dW, x, dout)
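
The reason np.add.at is needed instead of dW[x] += dout is that the same word index can appear several times in x; fancy-index assignment only keeps the last write for a duplicated index, while np.add.at accumulates every contribution. A toy demonstration (names made up for illustration):

dW_wrong = np.zeros(4)
dW_right = np.zeros(4)
idx = np.array([1, 1, 3])          # word 1 appears twice
upd = np.array([10.0, 20.0, 5.0])

dW_wrong[idx] += upd               # index 1 only receives one of its two updates
np.add.at(dW_right, idx, upd)      # index 1 receives both
print(dW_wrong)                    # [ 0. 20.  0.  5.]
print(dW_right)                    # [ 0. 30.  0.  5.]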

Next comes the harder part of this question, the RNN for image captioning. First, a summary of the pipeline:

The input is the image features, i.e. the vectors extracted from the fc7 layer of VGG-16. These features go through an affine projection (W_proj, b_proj) to produce the RNN's initial hidden state h0. From the ground-truth captions we build the input and target sequences: by the structure of the RNN, the input captions_in is the caption minus its last token, and the target captions_out is the caption minus its first token. captions_in is then run through the word embedding; note that, as the code makes clear, the word vectors in this assignment are learned, i.e. W_embed is a trainable parameter. The embeddings go through rnn_forward, an affine layer produces the scores, the softmax loss is computed, and finally the gradients are propagated backward:

N, D = features.shape
out, cache_affine = temporal_affine_forward(features.reshape((N, 1, D)), W_proj, b_proj)
h0 = out.reshape((N, -1))
# (2)
out_word, cache_word = word_embedding_forward(captions_in, W_embed)
# (3)

if self.cell_type == 'rnn':
    h_out, cache_out = rnn_forward(out_word, h0, Wx, Wh, b)
elif self.cell_type == 'lstm':
    h_out, cache_out = lstm_forward(out_word, h0, Wx, Wh, b)
else:
    raise ValueError('Invalid cell_type "%s" while running loss function' % self.cell_type)
# (4)
score, cache_score = temporal_affine_forward(h_out, W_vocab, b_vocab)
# (5)
mask = (captions_out != self._null)
loss, dscore = temporal_softmax_loss(score, captions_out, mask)

# backward
dh_out, dW_vocab, db_vocab = temporal_affine_backward(dscore, cache_score)
grads['W_vocab'] = dW_vocab
grads['b_vocab'] = db_vocab

if self.cell_type == 'rnn':
    dout_word, dh0, dWx, dWh, db = rnn_backward(dh_out, cache_out)
elif self.cell_type == 'lstm':
    dout_word, dh0, dWx, dWh, db = lstm_backward(dh_out, cache_out)
else:
    raise ValueError('Invalid cell_type "%s" while running loss function backward' % self.cell_type)
grads['Wx'] = dWx
grads['Wh'] = dWh
grads['b'] = db

dW_embed = word_embedding_backward(dout_word, cache_word)
grads['W_embed'] = dW_embed

dfeatures, dW_proj, db_proj = temporal_affine_backward(dh0.reshape((N, 1, -1)), cache_affine)
grads['W_proj'] = dW_proj
grads['b_proj'] = db_proj

Done.

Q2: Image Captioning with LSTMs (30 points)

Much the same process; just implement the LSTM formulas step by step.
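
For reference, the step equations being implemented, with the activation vector a split into the slices i, f, o, g in the same order as the code below:

a = x W_x + h_{t-1} W_h + b \in \mathbb{R}^{N \times 4H}
i = \sigma(a_i), \quad f = \sigma(a_f), \quad o = \sigma(a_o), \quad g = \tanh(a_g)
c_t = f \odot c_{t-1} + i \odot g
h_t = o \odot \tanh(c_t)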

lstm_step_forward:

N, H = prev_h.shape
a = x.dot(Wx)+prev_h.dot(Wh)+b  # [N, 4H]
ai = a[:, :H]
af = a[:, H:2*H]
ao = a[:, 2*H:3*H]
ag = a[:, 3*H:]
i = sigmoid(ai)
f = sigmoid(af)
o = sigmoid(ao)
g = np.tanh(ag)
next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)
cache = (x, Wx, prev_h, Wh, H, a, prev_c, i, f, o, g, next_c)

lstm_step_backward:

x, Wx, prev_h, Wh, H, a, prev_c, i, f, o, g, next_c = cache
do = dnext_h * np.tanh(next_c)
dnext_c = dnext_h * o * (1-np.tanh(next_c)*np.tanh(next_c)) + dnext_c
df = dnext_c * prev_c
dprev_c = dnext_c * f
di = dnext_c * g
dg = dnext_c * i
dai = di * i * (1-i)
daf = df * f * (1-f)
dao = do * o * (1-o)
dag = dg * (1 - g * g)
da = np.concatenate((dai, daf, dao, dag), axis=1)
db = np.sum(da, axis=0)
dx = da.dot(Wx.T)
dWx = x.T.dot(da)
dprev_h = da.dot(Wh.T)
dWh = prev_h.T.dot(da)

lstm_forward:

N, T, D = x.shape
_, H = h0.shape
prev_h = h0
prev_c = np.zeros_like(prev_h)
h = np.zeros((N, T, H))
cache = []
for i in range(T):
    next_h, next_c, cache1 = lstm_step_forward(x[:, i, :], prev_h, prev_c, Wx, Wh, b)
    h[:, i, :] = next_h
    prev_h = next_h
    prev_c = next_c
    cache.append(cache1)

lstm_backward:

N, T, H = dh.shape
x = cache[T-1][0]
_, D = x.shape
dx = np.zeros((N, T, D))
dWx = np.zeros((D, 4 * H))
dWh = np.zeros((H, 4 * H))
db = np.zeros(4 * H)
dnext_h = np.zeros((N, H))
dnext_c = np.zeros((N, H))
for i in range(T):
    dnext_h = dnext_h + dh[:, T-i-1, :]
    dx1, dprev_h, dprev_c, dWx1, dWh1, db1 = lstm_step_backward(dnext_h, dnext_c, cache[T-i-1])
    dx[:, T-i-1, :] = dx1
    dWx = dWx + dWx1
    dWh = dWh + dWh1
    db = db + db1
    dnext_c = dprev_c
    dnext_h = dprev_h
dh0 = dnext_h

sample:

N, D = features.shape
out_affine, cache_affine = temporal_affine_forward(features.reshape((N, 1, D)), W_proj, b_proj)
h0 = out_affine.reshape((N, -1))
captions[:, 0] = self._start
prev_h = h0
prev_c = np.zeros_like(prev_h)
word_index = captions[:, 0]
word_embed = W_embed[word_index]
for i in range(1, max_length):
    if self.cell_type == 'rnn':
        next_h, cache = rnn_step_forward(word_embed, prev_h, Wx, Wh, b)
    elif self.cell_type == 'lstm':
        next_h, next_c, cache = lstm_step_forward(word_embed, prev_h, prev_c, Wx, Wh, b)
        prev_c = next_c
    else:
        raise ValueError('Invalid cell_type "%s" while running sample function' % self.cell_type)
    out_vocab, cache_vocab = affine_forward(next_h, W_vocab, b_vocab)
    captions[:, i] = np.argmax(out_vocab, axis=1)
    word_index = captions[:, i]
    word_embed = W_embed[word_index]
    prev_h = next_h
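
To turn the sampled index matrix back into readable text, the notebook uses decode_captions from cs231n.coco_utils; a small usage sketch, assuming a data dict loaded with load_coco_data and a trained model, as in the notebook:

from cs231n.coco_utils import sample_coco_minibatch, decode_captions

gt_captions, features, urls = sample_coco_minibatch(data, batch_size=2, split='val')
captions = model.sample(features)                 # (N, max_length) array of word indices
for sentence in decode_captions(captions, data['idx_to_word']):
    print(sentence)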

Q3: Network Visualization: Saliency maps, Class Visualization, and Fooling Images (15 points)

Loading squeezenet.ckpt may throw an error; simply deleting the if check fixes it.

For Saliency Maps, compute the gradient of the correct-class scores with respect to the input image, then take the channel-wise max of its absolute value, following the notebook's instructions:

saliency_grad = tf.gradients(correct_scores, model.image)
feed_dict={model.image: X, model.labels: y}
saliency = sess.run(saliency_grad, feed_dict=feed_dict)[0]
saliency = np.max(abs(saliency), axis=-1)
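
The notebook's show_saliency_maps already plots these; just as a minimal manual sketch (assuming saliency now has shape (N, H, W) after the channel-wise max above):

import matplotlib.pyplot as plt

plt.imshow(saliency[0], cmap=plt.cm.hot)   # saliency map of the first image
plt.axis('off')
plt.show()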

Fooling Images:

for i in range(100):
        target_score = tf.gather_nd(model.classifier, tf.stack((tf.range(X.shape[0]), model.labels), axis=1))
        pred_label = tf.argmax(model.classifier, axis=1)
        
        dX_fool_grad = tf.gradients(target_score, model.image)
        dX_t = learning_rate * dX_fool_grad / tf.norm(dX_fool_grad)
        dX, pred_label2 = sess.run([dX_t, pred_label], feed_dict={model.image:X_fooling, model.labels:[target_y]})
        print("pred_label:",pred_label2)
        if pred_label2[0]==target_y:
            print("finish step:", i)
            return X_fooling
        else:
            print("step", i)
            X_fooling = X_fooling + dX[0]
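
One thing worth noting: building tf.gather_nd / tf.gradients inside the loop adds new ops to the graph on every iteration. The same logic can construct the ops once and only run them in the loop; a rough sketch reusing the names above (not the notebook's reference solution):

target_score = tf.gather_nd(model.classifier,
                            tf.stack((tf.range(X.shape[0]), model.labels), axis=1))
pred_label = tf.argmax(model.classifier, axis=1)
grad = tf.gradients(target_score, model.image)[0]   # [0]: tf.gradients returns a list
dX_t = learning_rate * grad / tf.norm(grad)

for i in range(100):
    dX, pred = sess.run([dX_t, pred_label],
                        feed_dict={model.image: X_fooling, model.labels: [target_y]})
    if pred[0] == target_y:
        print("finish step:", i)
        break
    X_fooling = X_fooling + dX
return X_fooling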

Class visualization:

Outside the loop. The big pitfall here is l2_reg and learning_rate: if you use the plain Python variables directly in the TensorFlow computation, things go wrong and the first dimension of dx_t ends up being 25...:

scores = tf.gather_nd(model.classifier, tf.stack((tf.range(X.shape[0]), model.labels), axis=1))
l2_reg = tf.constant(l2_reg)
lr = tf.constant(learning_rate, dtype=tf.float32)
l2_norm = tf.norm(model.image)
loss = scores - l2_reg * l2_norm * l2_norm
grad = tf.gradients(loss, model.image)
dx_t = lr * grad / tf.norm(grad)

Inside the loop:

dx = sess.run(dx_t, feed_dict={model.image:X, model.labels:[target_y]})
X = X + dx[0]

The weird images it generates...:

[Figures 2 and 3: class visualization results]


Q4: Style Transfer (15 points)

To be continued.



Q5: Generative Adversarial Networks (15 points)

To be continued.



