First up is rnn_step_forward, which directly implements the step formula:
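The formula in question (reconstructed here; it matches the code below) is $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$, where $x_t$ is [N, D], $h_{t-1}$ is [N, H], $W_x$ is [D, H], $W_h$ is [H, H], and $b$ is [H].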
```python
next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)  # [N, H]
cache = (x, prev_h, Wx, Wh, b, next_h)
```
rnn_step_backward follows from the derivative of tanh, $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$, which gives:
```python
x, prev_h, Wx, Wh, b, next_h = cache
dtanh = dnext_h * (1 - next_h * next_h)  # [N, H]
db = np.sum(dtanh, axis=0)               # [H,]
dWh = (prev_h.T).dot(dtanh)              # [H, H]
dWx = (x.T).dot(dtanh)                   # [D, H]
dprev_h = dtanh.dot(Wh.T)                # [N, H]
dx = dtanh.dot(Wx.T)                     # [N, D]
```
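A quick way to sanity-check dx from this backward pass is a centered finite-difference comparison. Here is a minimal self-contained sketch; `step_f` and `num_grad` are my own helper names, not part of the assignment:

```python
import numpy as np

def step_f(x, prev_h, Wx, Wh, b):
    # forward step, same formula as rnn_step_forward
    return np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)

def num_grad(f, x, df, eps=1e-5):
    # centered finite differences of sum(f(x) * df) w.r.t. x
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + eps; pos = f(x)
        x[ix] = old - eps; neg = f(x)
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * eps)
        it.iternext()
    return grad

np.random.seed(0)
N, D, H = 3, 4, 5
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
dnext_h = np.random.randn(N, H)

next_h = step_f(x, prev_h, Wx, Wh, b)
dx = (dnext_h * (1 - next_h * next_h)).dot(Wx.T)              # analytic
dx_num = num_grad(lambda v: step_f(v, prev_h, Wx, Wh, b), x, dnext_h)
print(np.max(np.abs(dx - dx_num)))                            # should be ~1e-10
```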
rnn_forward calls rnn_step_forward in a loop; the execution of this function is illustrated in the figure below (taken from the lecture slides):
The code:
```python
N, T, D = x.shape
H = h0.shape[1]
h = np.zeros((N, T, H))
prev_h = h0
for i in range(T):
    next_h, _ = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)
    prev_h = next_h
    h[:, i, :] = prev_h
cache = (x, h0, Wh, Wx, b, h)
```
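A hypothetical shape check, assuming the assignment's rnn_forward(x, h0, Wx, Wh, b) -> (h, cache) signature:

```python
import numpy as np

N, T, D, H = 2, 3, 4, 5            # made-up sizes
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh = np.random.randn(D, H), np.random.randn(H, H)
b = np.random.randn(H)
h, cache = rnn_forward(x, h0, Wx, Wh, b)
assert h.shape == (N, T, H)        # one hidden state per timestep
```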
In rnn_backward, note that dh has shape (N, T, H): it collects the upstream gradient flowing into every h output in the figure above, so at each timestep the incoming gradient is dh[:, t, :] plus the dprev_h passed back from step t+1. After being put through the wringer by the previous assignments, this is now pretty easy to write...
```python
x, h0, Wh, Wx, b, h = cache
N, T, D = x.shape
dprev_h = np.zeros_like(h0)
dx = np.zeros_like(x)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
for i in range(T):
    if i == T - 1:
        prev_h = h0
    else:
        prev_h = h[:, T - i - 2, :]
    next_h = h[:, T - i - 1, :]
    cache2 = (x[:, T - i - 1, :], prev_h, Wx, Wh, b, next_h)
    dnext_h = dh[:, T - i - 1, :] + dprev_h
    dx1, dprev_h, dWx1, dWh1, db1 = rnn_step_backward(dnext_h, cache2)
    dx[:, T - i - 1, :] = dx1
    dWx += dWx1
    dWh += dWh1
    db += db1
dh0 = dprev_h
```
Next is word_embedding_forward; as the hint says, NumPy integer-array indexing does the job:
```python
out = W[x, :]
cache = (x, W)
```
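What that indexing does, on a toy example (sizes made up):

```python
import numpy as np

V, E = 4, 2                        # vocabulary size, embedding dimension
W = np.arange(V * E, dtype=float).reshape(V, E)
x = np.array([[0, 3, 1],
              [2, 2, 0]])          # (N, T) matrix of word indices
out = W[x, :]                      # (N, T, E): row W[x[n, t]] at each position
print(out.shape)                   # (2, 3, 2)
```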
Then word_embedding_backward. The hint says to use np.add.at; from the forward pass above, dW should accumulate the upstream gradient at every index where each word is used, so:
```python
x, W = cache
dW = np.zeros_like(W)
np.add.at(dW, x, dout)
```
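np.add.at is needed because the same word can appear at several positions in x; a buffered fancy-index += would record only one contribution per repeated row instead of summing them. A toy comparison:

```python
import numpy as np

dW_wrong = np.zeros((4, 2))
dW_right = np.zeros((4, 2))
idx = np.array([1, 1, 3])          # word 1 appears twice
g = np.ones((3, 2))                # upstream gradient for each occurrence
dW_wrong[idx] += g                 # buffered: word 1 is counted only once
np.add.at(dW_right, idx, g)        # unbuffered: both occurrences accumulate
print(dW_wrong[1], dW_right[1])    # [1. 1.] vs [2. 2.]
```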
Next comes the harder part of this question, the RNN for image captioning. First, let's lay out the plan:
The input is the image features, i.e., the vectors extracted from the fc7 layer of VGG-16. After an affine projection, these features become the RNN's initial hidden state. We then build the input and target from the ground-truth captions: given the RNN's structure, the input is the caption with its last word dropped (captions_in) and the target is the caption with its first word dropped (captions_out); see the snippet after the code below. captions_in goes through the word embedding; note that, as the code shows, the word vectors in this experiment are learned, i.e., W_embed is a trainable parameter. After that: rnn_forward, an affine layer producing the scores, the loss, and finally the backward pass to compute the gradients:
```python
N, D = features.shape
# (1) affine projection: image features -> initial hidden state h0
out, cache_affine = temporal_affine_forward(features.reshape((N, 1, D)), W_proj, b_proj)
h0 = out.reshape((N, -1))
# (2) word embedding of the input captions
out_word, cache_word = word_embedding_forward(captions_in, W_embed)
# (3) run the recurrent network over the sequence
if self.cell_type == 'rnn':
    h_out, cache_out = rnn_forward(out_word, h0, Wx, Wh, b)
elif self.cell_type == 'lstm':
    h_out, cache_out = lstm_forward(out_word, h0, Wx, Wh, b)
else:
    raise ValueError('Invalid cell_type "%s" while running loss function' % self.cell_type)
# (4) project hidden states to vocabulary scores
score, cache_score = temporal_affine_forward(h_out, W_vocab, b_vocab)
# (5) softmax loss, masking out <NULL> positions
mask = (captions_out != self._null)
loss, dscore = temporal_softmax_loss(score, captions_out, mask)

# backward, in reverse order
dh_out, dW_vocab, db_vocab = temporal_affine_backward(dscore, cache_score)
grads['W_vocab'] = dW_vocab
grads['b_vocab'] = db_vocab
if self.cell_type == 'rnn':
    dout_word, dh0, dWx, dWh, db = rnn_backward(dh_out, cache_out)
elif self.cell_type == 'lstm':
    dout_word, dh0, dWx, dWh, db = lstm_backward(dh_out, cache_out)
else:
    raise ValueError('Invalid cell_type "%s" while running loss function backward' % self.cell_type)
grads['Wx'] = dWx
grads['Wh'] = dWh
grads['b'] = db
dW_embed = word_embedding_backward(dout_word, cache_word)
grads['W_embed'] = dW_embed
dfeatures, dW_proj, db_proj = temporal_affine_backward(dh0.reshape((N, 1, -1)), cache_affine)
grads['W_proj'] = dW_proj
grads['b_proj'] = db_proj
```
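For context, the captions_in / captions_out split used above happens near the top of loss(); that part is provided by the assignment scaffold, if I recall correctly, and looks like:

```python
captions_in = captions[:, :-1]    # <START> w1 w2 ...  fed into the RNN
captions_out = captions[:, 1:]    # w1 w2 ... <END>    targets at each timestep
```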
over
The LSTM part is much the same process; just implement the formulas:
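Reconstructing the formulas the text refers to (they match the step code below): compute the pre-activation $a = x_t W_x + h_{t-1} W_h + b \in \mathbb{R}^{N \times 4H}$ and split it into quarters $a_i, a_f, a_o, a_g$; the gates are $i = \sigma(a_i)$, $f = \sigma(a_f)$, $o = \sigma(a_o)$, $g = \tanh(a_g)$, and the state updates are $c_t = f \odot c_{t-1} + i \odot g$ and $h_t = o \odot \tanh(c_t)$, where $\odot$ is elementwise multiplication.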
lstm_step_forward:
```python
N, H = prev_h.shape
a = x.dot(Wx) + prev_h.dot(Wh) + b  # [N, 4H]
ai = a[:, :H]
af = a[:, H:2*H]
ao = a[:, 2*H:3*H]
ag = a[:, 3*H:]
i = sigmoid(ai)
f = sigmoid(af)
o = sigmoid(ao)
g = np.tanh(ag)
next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)
cache = (x, Wx, prev_h, Wh, H, a, prev_c, i, f, o, g, next_c)
```
lstm_step_backward:
```python
x, Wx, prev_h, Wh, H, a, prev_c, i, f, o, g, next_c = cache
do = dnext_h * np.tanh(next_c)
dnext_c = dnext_h * o * (1 - np.tanh(next_c) * np.tanh(next_c)) + dnext_c
df = dnext_c * prev_c
dprev_c = dnext_c * f
di = dnext_c * g
dg = dnext_c * i
dai = di * i * (1 - i)
daf = df * f * (1 - f)
dao = do * o * (1 - o)
dag = dg * (1 - g * g)
da = np.concatenate((dai, daf, dao, dag), axis=1)  # [N, 4H]
db = np.sum(da, axis=0)
dx = da.dot(Wx.T)
dWx = x.T.dot(da)
dprev_h = da.dot(Wh.T)
dWh = prev_h.T.dot(da)
```
lstm_forward:
```python
N, T, D = x.shape
_, H = h0.shape
prev_h = h0
prev_c = np.zeros_like(prev_h)
h = np.zeros((N, T, H))
cache = []
for i in range(T):
    next_h, next_c, cache1 = lstm_step_forward(x[:, i, :], prev_h, prev_c, Wx, Wh, b)
    h[:, i, :] = next_h
    prev_h = next_h
    prev_c = next_c
    cache.append(cache1)
```
lstm_backward:
```python
N, T, H = dh.shape
x = cache[T - 1][0]
_, D = x.shape
dx = np.zeros((N, T, D))
dWx = np.zeros((D, 4 * H))
dWh = np.zeros((H, 4 * H))
db = np.zeros(4 * H)
dnext_h = np.zeros((N, H))
dnext_c = np.zeros((N, H))
for i in range(T):
    dnext_h = dnext_h + dh[:, T - i - 1, :]
    dx1, dprev_h, dprev_c, dWx1, dWh1, db1 = lstm_step_backward(dnext_h, dnext_c, cache[T - i - 1])
    dx[:, T - i - 1, :] = dx1
    dWx = dWx + dWx1
    dWh = dWh + dWh1
    db = db + db1
    dnext_c = dprev_c
    dnext_h = dprev_h
dh0 = dnext_h
```
sample:
```python
N, D = features.shape
out_affine, cache_affine = temporal_affine_forward(features.reshape((N, 1, D)), W_proj, b_proj)
h0 = out_affine.reshape((N, -1))
captions[:, 0] = self._start
prev_h = h0
prev_c = np.zeros_like(prev_h)
word_index = captions[:, 0]
word_embed = W_embed[word_index]
for i in range(1, max_length):
    if self.cell_type == 'rnn':
        next_h, cache = rnn_step_forward(word_embed, prev_h, Wx, Wh, b)
    elif self.cell_type == 'lstm':
        next_h, next_c, cache = lstm_step_forward(word_embed, prev_h, prev_c, Wx, Wh, b)
        prev_c = next_c
    else:
        raise ValueError('Invalid cell_type "%s" while running sample function' % self.cell_type)
    out_vocab, cache_vocab = affine_forward(next_h, W_vocab, b_vocab)
    captions[:, i] = np.argmax(out_vocab, axis=1)   # greedy: pick the highest-scoring word
    word_index = captions[:, i]
    word_embed = W_embed[word_index]
    prev_h = next_h
```
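To turn the sampled index matrix back into readable text, the assignment ships a decode_captions helper in coco_utils; a simplified stand-in (my own sketch) would look like:

```python
def decode(caption, idx_to_word):
    # caption: 1-D array of word indices; idx_to_word: index -> word mapping
    words = []
    for idx in caption:
        word = idx_to_word[idx]
        if word == '<END>':
            break
        if word != '<NULL>':
            words.append(word)
    return ' '.join(words)
```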
Moving on to the network visualization notebook: when loading squeezenet.ckpt you may get an error; deleting the if check fixes it.
For Saliency Maps, first compute the gradient of the correct class scores with respect to the input image, then take the channel-wise max of its absolute value, as the instructions describe:
```python
saliency_grad = tf.gradients(correct_scores, model.image)
feed_dict = {model.image: X, model.labels: y}
saliency = sess.run(saliency_grad, feed_dict=feed_dict)[0]
saliency = np.max(np.abs(saliency), axis=-1)
```
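In formula form, with correct-class score $s_y$ and image $x$, the saliency at pixel $(i, j)$ is $M_{ij} = \max_c \left| \partial s_y / \partial x_{ijc} \right|$, which is exactly the abs-then-channel-max computed above.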
Fooling Images:
```python
# Build the graph ops once, outside the loop (rebuilding them every
# iteration keeps growing the graph).
target_score = tf.gather_nd(model.classifier,
                            tf.stack((tf.range(X.shape[0]), model.labels), axis=1))
pred_label = tf.argmax(model.classifier, axis=1)
dX_fool_grad = tf.gradients(target_score, model.image)[0]  # tf.gradients returns a list
dX_t = learning_rate * dX_fool_grad / tf.norm(dX_fool_grad)
for i in range(100):
    dX, pred_label2 = sess.run([dX_t, pred_label],
                               feed_dict={model.image: X_fooling, model.labels: [target_y]})
    print("pred_label:", pred_label2)
    if pred_label2[0] == target_y:
        print("finish step:", i)
        return X_fooling
    else:
        print("step", i)
        X_fooling = X_fooling + dX
```
Class visualization:
Outside the loop. The big pitfall here is l2_reg and learning_rate: using the raw Python variables directly in the TensorFlow computation causes problems and makes the first dimension of dx_t become 25; wrapping them in tf.constant avoids this:
```python
scores = tf.gather_nd(model.classifier,
                      tf.stack((tf.range(X.shape[0]), model.labels), axis=1))
l2_reg = tf.constant(l2_reg)
lr = tf.constant(learning_rate, dtype=tf.float32)
l2_norm = tf.norm(model.image)
loss = scores - l2_reg * l2_norm * l2_norm
grad = tf.gradients(loss, model.image)
dx_t = lr * grad / tf.norm(grad)
```
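For reference, this is gradient ascent on the regularized objective from Simonyan et al. (the paper the notebook cites): $I^* = \arg\max_I \, s_y(I) - \lambda \lVert I \rVert_2^2$, with each ascent step L2-normalized.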
Inside the loop:
```python
dx = sess.run(dx_t, feed_dict={model.image: X, model.labels: [target_y]})
X = X + dx[0]
```
The eerie images it generates...:
To be continued.