[CS231n Assignment 3 #01] Image Captioning with Vanilla RNNs

Contents

  • Assignment overview
  • 0. Setup
  • 1. Microsoft COCO
    • 1.1 Visualizing the data
  • 2. Recurrent Neural Networks (RNNs)
    • 2.1 Vanilla RNN: single-timestep forward pass
    • 2.2 Single-timestep backward pass
    • 2.3 Forward pass over an entire sequence
    • 2.4 Backward pass over an entire sequence
  • 3. Word embeddings
    • 3.1 Word embedding forward pass
    • 3.2 Word embedding backward pass
  • 4. Temporal affine transformation
    • 4.1 Forward pass
    • 4.2 Backward pass
  • 5. Temporal softmax loss
  • 6. Image captioning with an RNN
    • 6.1 Overfitting a small dataset
    • 6.2 Test-time sampling
  • 7. Inline question

Assignment overview

  • Assignment page: Assignment #3
  • Goal: implement a vanilla RNN and use it to train a model that generates captions for images
  • Source code: RNN_Captioning.ipynb

0. Setup

In this exercise you will implement a vanilla recurrent neural network and use it to train a model that can generate captions for images.
Install h5py:
The COCO dataset we will use is stored in HDF5 format. To load HDF5 files, we need the h5py Python package.

pip install h5py

Initial setup:

# As usual, a bit of setup
import time, os, json
import numpy as np
import matplotlib.pyplot as plt

from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.rnn_layers import *
from cs231n.captioning_solver import CaptioningSolver
from cs231n.classifiers.rnn import CaptioningRNN
from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from cs231n.image_utils import image_from_url

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

1. Microsoft COCO

For this exercise we will use the 2014 release of the Microsoft COCO dataset, which has become a standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.

We have preprocessed the data and extracted features for you.
For all images we extracted features from the fc7 layer of a VGG-16 network pretrained on ImageNet; these features are stored in the files train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5.

To reduce processing time and memory requirements, we reduced the dimensionality of the features from 4096 to 512; the reduced features can be found in the files train2014_vgg16_fc7_pca.h5 and val2014_vgg16_fc7_pca.h5.

The raw images take up nearly 20GB of space, so we have not included them in the download. However, all images are taken from Flickr, and the URLs of the training and validation images are stored in the files train2014_urls.txt and val2014_urls.txt respectively, so you can download images on the fly for visualization. Since images are downloaded on the fly, you must be connected to the internet to view them.

Dealing with strings is inefficient, so we will work with an encoded version of the captions. Each word is assigned an integer ID, allowing us to represent a caption as a sequence of integers. The mapping between integer IDs and words is in the file coco2014_vocab.json, and you can use the decode_captions function from the file cs231n/coco_utils.py to convert numpy arrays of integer IDs back into strings.

There are a couple of special tokens that we add to the vocabulary: a <START> token is prepended and an <END> token is appended to every caption, and rare words are replaced with a special <UNK> token ("unknown"). In addition, since we want to train with minibatches containing captions of different lengths, we pad short captions with a special <NULL> token after the <END> token, and we do not compute loss or gradients for <NULL> tokens. Since they are a bit of a pain, we have handled all of the implementation details around special tokens for you.
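
As a concrete illustration of this encoding (the toy vocabulary and indices below are made up for illustration; the real mapping has 1004 entries and lives in coco2014_vocab.json), a caption such as "a cat on a bed" padded to a fixed length of 8 tokens could be stored as:

# Toy vocabulary, for illustration only -- not the real COCO mapping
word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, '<UNK>': 3,
               'a': 4, 'cat': 5, 'on': 6, 'bed': 7}
idx_to_word = {i: w for w, i in word_to_idx.items()}

caption = np.array([1, 4, 5, 6, 4, 7, 2, 0])  # <START> a cat on a bed <END> <NULL>
print(' '.join(idx_to_word[i] for i in caption))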

You can load all of the MS-COCO data (captions, features, URLs, and vocabulary) using the load_coco_data function from the file cs231n/coco_utils.py. Run the following:

# Load COCO data from disk; this returns a dictionary
# We'll work with dimensionality-reduced features for this notebook, but feel
# free to experiment with the original features by changing the flag below.
data = load_coco_data(pca_features=True)

# Print out all the keys and values from the data dictionary
for k, v in data.items():
    if type(v) == np.ndarray:
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))

Output:

# Validation set
val_image_idxs    (195954,) int32        # index of the image each caption belongs to
val_features      (40504, 512) float32   # image features
val_captions      (195954, 17) int32     # encoded captions
val_urls          (40504,)               # image download URLs

# Training set
train_features    (82783, 512) float32
train_urls        (82783,)
train_captions    (400135, 17) int32     # encoded captions
train_image_idxs  (400135,) int32        # index of the image each caption belongs to

# word <-> index mappings
word_to_idx  1004
idx_to_word  1004

1.1 Visualizing the data

It is always a good idea to look at examples from the dataset before working with it.

You can use the sample_coco_minibatch function from the file cs231n/coco_utils.py to sample minibatches of data from the data structure returned by load_coco_data. Run the following to sample a small minibatch of training data and show the images and their captions. Running it several times and looking at the results helps you get a feel for the dataset.

# The sample_coco_minibatch function (from cs231n/coco_utils.py)
def sample_coco_minibatch(data, batch_size=100, split='train'):
    split_size = data['%s_captions' % split].shape[0]
    mask = np.random.choice(split_size, batch_size)
    captions = data['%s_captions' % split][mask]
    image_idxs = data['%s_image_idxs' % split][mask]
    image_features = data['%s_features' % split][image_idxs]
    urls = data['%s_urls' % split][image_idxs]
    return captions, image_features, urls

# Decode integer-encoded captions back into strings
def decode_captions(captions, idx_to_word):
    singleton = False
    if captions.ndim == 1:
        singleton = True
        captions = captions[None]
    decoded = []
    N, T = captions.shape
    for i in range(N):
        words = []
        for t in range(T):
            word = idx_to_word[captions[i, t]]
            if word != '<NULL>':
                words.append(word)
            if word == '<END>':
                break
        decoded.append(' '.join(words))
    if singleton:
        decoded = decoded[0]
    return decoded
# Download an image given its URL (from cs231n/image_utils.py)
import urllib.request, urllib.error, urllib.parse, os, tempfile
from imageio import imread  # provides imread() used below (older versions used scipy.misc.imread)
def image_from_url(url):
    """
    Read an image from a URL. Returns a numpy array with the pixel data.
    We write the image to a temporary file then read it back. Kinda gross.
    """
    try:
        f = urllib.request.urlopen(url)
        _, fname = tempfile.mkstemp()
        with open(fname, 'wb') as ff:
            ff.write(f.read())
        img = imread(fname)
        # On Windows, removing the temp file can fail with "the file is being used
        # by another process", so the removal is wrapped in a try/except and skipped.
        try:
            os.remove(fname)
        except Exception as e: print(e)
        return img
    except urllib.error.URLError as e:
        print('URL Error: ', e.reason, url)
    except urllib.error.HTTPError as e:
        print('HTTP Error: ', e.code, url)

Display a minibatch of data:

# Sample a minibatch and show the images and captions
batch_size = 3

captions, features, urls = sample_coco_minibatch(data, batch_size=batch_size)
for i, (caption, url) in enumerate(zip(captions, urls)):
    plt.imshow(image_from_url(url))
    plt.axis('off')
    caption_str = decode_captions(caption, data['idx_to_word'])
    plt.title(caption_str)
    plt.show()

Sample output:
(figure: sampled training images displayed with their decoded captions)

2. Recurrent Neural Networks (RNNs)

As discussed in lecture, we will use a recurrent neural network (RNN) language model for image captioning. The file cs231n/rnn_layers.py contains implementations of the different layer types that are needed for recurrent neural networks, and the file cs231n/classifiers/rnn.py uses these layers to implement an image captioning model.

We will first implement the different types of RNN layers in cs231n/rnn_layers.py.

2.1 Vanilla RNN: single-timestep forward pass

Open the file cs231n/rnn_layers.py. This file implements the forward and backward passes for the different types of layers that are commonly used in recurrent neural networks.

First implement the function rnn_step_forward, which implements the forward pass for a single timestep of a vanilla recurrent neural network. After doing so, run the following to check your implementation. You should see errors on the order of e-8 or less.

# step forward
# h(t) = tanh(wh*h(t-1) + wx*x(t)+b)
def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    z = prev_h.dot(Wh) + x.dot(Wx) + b
    next_h = np.tanh(z)
    cache = (next_h,Wh,Wx,prev_h,x)

    return next_h, cache

Test:

N, D, H = 3, 10, 4

x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)
Wx = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)
Wh = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)
b = np.linspace(-0.2, 0.4, num=H)

next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
expected_next_h = np.asarray([
  [-0.58172089, -0.50182032, -0.41232771, -0.31410098],
  [ 0.66854692,  0.79562378,  0.87755553,  0.92795967],
  [ 0.97934501,  0.99144213,  0.99646691,  0.99854353]])

print('next_h error: ', rel_error(expected_next_h, next_h))

2.2 Single-timestep backward pass

In the file cs231n/rnn_layers.py, implement the rnn_step_backward function. After doing so, run the following to check your implementation. You should see errors on the order of e-8 or less.
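
For reference, writing the step as $h_t = \tanh(z)$ with $z = h_{t-1} W_h + x_t W_x + b$, the gradients used in the code below follow directly from the chain rule:

$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h_t} \odot (1 - h_t^2), \qquad
\frac{\partial L}{\partial x_t} = \frac{\partial L}{\partial z} W_x^\top, \qquad
\frac{\partial L}{\partial h_{t-1}} = \frac{\partial L}{\partial z} W_h^\top
$$

$$
\frac{\partial L}{\partial W_x} = x_t^\top \frac{\partial L}{\partial z}, \qquad
\frac{\partial L}{\partial W_h} = h_{t-1}^\top \frac{\partial L}{\partial z}, \qquad
\frac{\partial L}{\partial b} = \sum_{n=1}^{N} \left(\frac{\partial L}{\partial z}\right)_n
$$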

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
  
    (next_h,Wh,Wx,prev_h,x) = cache
    dtanh = dnext_h * (1 - next_h * next_h) # (N,H)
    dx = dtanh.dot(Wx.T)
    dprev_h = dtanh.dot(Wh.T)
    dWx = x.T.dot(dtanh)
    dWh = prev_h.T.dot(dtanh)
    db = np.sum(dtanh,axis=0)

    return dx, dprev_h, dWx, dWh, db

Gradient check:

from cs231n.rnn_layers import rnn_step_forward, rnn_step_backward
np.random.seed(231)
N, D, H = 4, 5, 6
x = np.random.randn(N, D)
h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

out, cache = rnn_step_forward(x, h, Wx, Wh, b)

dnext_h = np.random.randn(*out.shape)

fx = lambda x: rnn_step_forward(x, h, Wx, Wh, b)[0]
fh = lambda prev_h: rnn_step_forward(x, h, Wx, Wh, b)[0]
fWx = lambda Wx: rnn_step_forward(x, h, Wx, Wh, b)[0]
fWh = lambda Wh: rnn_step_forward(x, h, Wx, Wh, b)[0]
fb = lambda b: rnn_step_forward(x, h, Wx, Wh, b)[0]

dx_num = eval_numerical_gradient_array(fx, x, dnext_h)
dprev_h_num = eval_numerical_gradient_array(fh, h, dnext_h)
dWx_num = eval_numerical_gradient_array(fWx, Wx, dnext_h)
dWh_num = eval_numerical_gradient_array(fWh, Wh, dnext_h)
db_num = eval_numerical_gradient_array(fb, b, dnext_h)

dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

print('dx error: ', rel_error(dx_num, dx))
print('dprev_h error: ', rel_error(dprev_h_num, dprev_h))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))

2.3 Forward pass over an entire sequence

Now that you have implemented the forward and backward passes for a single timestep of a vanilla RNN, you will combine these pieces to implement an RNN that processes an entire sequence of data $(t_1, t_2, \dots, t_T)$.

In the file cs231n/rnn_layers.py, implement the function rnn_forward. This should use the rnn_step_forward function that you defined above. After doing so, run the following to check your implementation. You should see errors on the order of e-7 or less.

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    N, T, D = x.shape
    H = b.shape[0]
    # Each single step cached (next_h, Wh, Wx, prev_h, x); across a full forward
    # pass only the hidden states change, so it is enough to store h (plus h0).
    h = np.zeros((N, T, H))
    prev_h = h0  # initial hidden state
    for i in range(T):  # loop over the T timesteps
        next_h, cache_ = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)
        prev_h = next_h
        h[:, i, :] = prev_h
    cache = (h, Wh, Wx, b, x, h0)
    return h, cache
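
The notebook's check cell for rnn_forward is not reproduced here, but a lightweight consistency test (a sketch with made-up shapes, not the official test values) is to compare rnn_forward against manually stepping rnn_step_forward:

np.random.seed(231)
N, T, D, H = 2, 3, 4, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

h, _ = rnn_forward(x, h0, Wx, Wh, b)

# Reference: unroll the single-step function by hand
h_ref = np.zeros((N, T, H))
prev_h = h0
for t in range(T):
    prev_h, _ = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)
    h_ref[:, t, :] = prev_h

print('rnn_forward consistency error: ', rel_error(h, h_ref))  # should be (close to) 0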

2.4 Backward pass over an entire sequence

In the file cs231n/rnn_layers.py, implement the backward pass for a vanilla RNN in the function rnn_backward. This should run back-propagation over the entire sequence, making calls to the rnn_step_backward function that you defined earlier. You should see errors on the order of e-6 or less.

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H).

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None

    (h,Wh,Wx,b,x,h0) = cache

    N, T, H = dh.shape
    _,_,D = x.shape
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    # Note a common pitfall: the upstream gradient dh only contains the gradients
    # coming from each timestep's output, so we must also accumulate the gradient
    # flowing from each hidden state back to the previous timestep.
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))

    for i in range(T - 1, -1, -1):  # iterate backward through time
        prev_h = h0 if i == 0 else h[:,i-1,:]
        cachei = (h[:,i,:], Wh, Wx, prev_h, x[:,i,:])
        dx[:, i, :], dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:,i,:] + dprev_h_,cachei)
        dWh += dWh_
        dWx += dWx_
        db += db_
    dh0 = dprev_h_
    return dx, dh0, dWx, dWh, db
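
A numeric gradient check in the same style as the single-step check above (shapes here are illustrative, not the notebook's exact cell); the gradients for Wx, Wh, and b can be checked the same way:

np.random.seed(231)
N, T, D, H = 2, 10, 3, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

out, cache = rnn_forward(x, h0, Wx, Wh, b)
dout = np.random.randn(*out.shape)

dx, dh0, dWx, dWh, db = rnn_backward(dout, cache)

fx = lambda x: rnn_forward(x, h0, Wx, Wh, b)[0]
fh0 = lambda h0: rnn_forward(x, h0, Wx, Wh, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dout)
dh0_num = eval_numerical_gradient_array(fh0, h0, dout)

print('dx error: ', rel_error(dx_num, dx))    # expect on the order of e-6 or less
print('dh0 error: ', rel_error(dh0_num, dh0))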

3. Word embeddings

In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary will be associated with a vector, and these vectors will be learned jointly with the rest of the system.

3.1 Word embedding forward pass

In the file cs231n/rnn_layers.py, implement the function word_embedding_forward to convert words (represented by integers) into vectors. Run the following to check your implementation. You should see an error on the order of e-8 or less.

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    #                                                                            #
    # HINT: This can be done in one line using NumPy's array indexing.           #
    ##############################################################################
    # Conceptually, x would first be converted to one-hot vectors of shape (N, T, V)
    # and multiplied by W; since each one-hot vector selects exactly one row of W,
    # we can simply index into W directly.
    out = W[x]
    cache = (x,W.shape)
    return out, cache
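
A tiny illustration of the fancy-indexing trick (toy numbers, not from the notebook): W[x] gathers one row of W per index and broadcasts over the (N, T) index array.

V, D = 5, 3
W_toy = np.arange(V * D).reshape(V, D)       # row v is the vector for word v
x_toy = np.array([[0, 3, 1, 2]])             # one caption of four word indices, shape (1, 4)
out_toy, _ = word_embedding_forward(x_toy, W_toy)
print(out_toy.shape)                         # (1, 4, 3)
print(np.allclose(out_toy[0, 1], W_toy[3]))  # True: timestep 1 picked up word 3's vector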

3.2 Word embedding backward pass

Implement the backward pass for the word embedding function in the function word_embedding_backward. After doing so run the following to numerically gradient check your implementation. You should see an error on the order of e-11 or less.

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.
    HINT: Look up the function np.add.at
    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    """np.ufunc.at(a, indices, b=None)
    Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. 
    For addition ufunc, this method is equivalent to a[indices] += b, 
    except that results are accumulated for elements that are indexed more than once. 
    For example, a[[0,0]] += 1 will only increment the first element once because of buffering, 
    whereas add.at(a, [0,0], 1) will increment the first element twice.
    """
    x, W_shape = cache
    dW = np.zeros(W_shape)
    # np.add.at performs an unbuffered, in-place accumulation into dW: every
    # occurrence of an index in x adds its slice of dout, unlike dW[x] += dout,
    # which applies only one update per repeated index.
    np.add.at(dW,x,dout)
    return dW
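
The following toy example (not part of the assignment) shows why np.add.at is needed when a word index appears more than once:

dW_buf = np.zeros(3)
dW_buf[[0, 0, 2]] += 1.0            # buffered: index 0 is incremented only once
dW_acc = np.zeros(3)
np.add.at(dW_acc, [0, 0, 2], 1.0)   # unbuffered: both updates to index 0 accumulate
print(dW_buf)   # [1. 0. 1.]
print(dW_acc)   # [2. 0. 1.]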

4. Temporal affine transformation

At every timestep we use an affine function to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary.
Because this is very similar to the affine layer that you implemented in assignment 2, we have provided this function for you in the temporal_affine_forward and temporal_affine_backward functions in the file cs231n/rnn_layers.py.
Run the following to perform numeric gradient checking on the implementation. You should see errors on the order of e-9 or less.

4.1 Forward pass

def temporal_affine_forward(x, w, b):
    """
    Forward pass for a temporal affine layer. The input is a set of D-dimensional
    vectors arranged into a minibatch of N timeseries, each of length T. We use
    an affine function to transform each of those vectors into a new vector of
    dimension M.

    Inputs:
    - x: Input data of shape (N, T, D)
    - w: Weights of shape (D, M)
    - b: Biases of shape (M,)

    Returns a tuple of:
    - out: Output data of shape (N, T, M)
    - cache: Values needed for the backward pass
    """
    N, T, D = x.shape
    M = b.shape[0]
    out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    cache = x, w, b, out
    return out, cache

4.2 Backward pass

def temporal_affine_backward(dout, cache):
    """
    Backward pass for temporal affine layer.

    Input:
    - dout: Upstream gradients of shape (N, T, M)
    - cache: Values from forward pass

    Returns a tuple of:
    - dx: Gradient of input, of shape (N, T, D)
    - dw: Gradient of weights, of shape (D, M)
    - db: Gradient of biases, of shape (M,)
    """
    x, w, b, out = cache
    N, T, D = x.shape
    M = b.shape[0]

    dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
    db = dout.sum(axis=(0, 1))

    return dx, dw, db
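
The notebook's gradient-check cell is not reproduced above, so here is a check following the same pattern as the earlier ones (the shapes are illustrative assumptions, not the official values):

np.random.seed(231)
N, T, D, M = 2, 3, 4, 5
x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

out, cache = temporal_affine_forward(x, w, b)
dout = np.random.randn(*out.shape)

fx = lambda x: temporal_affine_forward(x, w, b)[0]
fw = lambda w: temporal_affine_forward(x, w, b)[0]
fb = lambda b: temporal_affine_forward(x, w, b)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
dw_num = eval_numerical_gradient_array(fw, w, dout)
db_num = eval_numerical_gradient_array(fb, b, dout)

dx, dw, db = temporal_affine_backward(dout, cache)
print('dx error: ', rel_error(dx_num, dx))   # expect on the order of e-9 or less
print('dw error: ', rel_error(dw_num, dw))
print('db error: ', rel_error(db_num, db))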

5. Temporal softmax loss

In an RNN language model, at every timestep we produce a score for each word in the vocabulary. We know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.

However there is one wrinkle: since we operate over minibatches and different captions may have different lengths, we append <NULL> tokens to the end of each caption so they all have the same length.
We don’t want these tokens to count toward the loss or gradient, so in addition to scores and ground-truth labels our loss function also accepts a mask array that tells it which elements of the scores count towards the loss.

Since this is very similar to the softmax loss function you implemented in assignment 1, we have implemented this loss function for you; look at the temporal_softmax_loss function in the file cs231n/rnn_layers.py.

Suppose the input to the softmax layer is $x$ of shape $(N, T, V)$ with labels $y$ of shape $(N, T)$.
Simply reshaping them to $x \in \mathbb{R}^{NT \times V}$ and $y \in \mathbb{R}^{NT}$ reduces this to the same softmax loss we implemented before:

def temporal_softmax_loss(x, y, mask, verbose=False):
    """
    A temporal version of softmax loss for use in RNNs. We assume that we are
    making predictions over a vocabulary of size V for each timestep of a
    timeseries of length T, over a minibatch of size N. The input x gives scores
    for all vocabulary elements at all timesteps, and y gives the indices of the
    ground-truth element at each timestep. We use a cross-entropy loss at each
    timestep, summing the loss over all timesteps and averaging across the
    minibatch.

    As an additional complication, we may want to ignore the model output at some
    timesteps, since sequences of different length may have been combined into a
    minibatch and padded with NULL tokens. The optional mask argument tells us
    which elements should contribute to the loss.

    Inputs:
    - x: Input scores, of shape (N, T, V)
    - y: Ground-truth indices, of shape (N, T) where each element is in the range
         0 <= y[i, t] < V
    - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not
      the scores at x[i, t] should contribute to the loss.

    Returns a tuple of:
    - loss: Scalar giving loss
    - dx: Gradient of loss with respect to scores x.
    """

    N, T, V = x.shape

    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat /= N
    dx_flat *= mask_flat[:, None]

    if verbose:
        print('dx_flat: ', dx_flat.shape)

    dx = dx_flat.reshape(N, T, V)

    return loss, dx
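
A quick sanity check (a sketch in the spirit of the notebook's check, not its exact cell): with near-zero random scores the softmax is almost uniform, so the per-timestep loss is about ln(V) and the total loss is roughly T * p * ln(V), where p is the fraction of unmasked positions.

def check_loss(N, T, V, p):
    x = 0.001 * np.random.randn(N, T, V)
    y = np.random.randint(V, size=(N, T))
    mask = np.random.rand(N, T) <= p
    print(temporal_softmax_loss(x, y, mask)[0])

check_loss(100, 1, 10, 1.0)     # expect about ln(10) ~= 2.3
check_loss(100, 10, 10, 1.0)    # expect about 10 * ln(10) ~= 23
check_loss(5000, 10, 10, 0.1)   # expect about 2.3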

6. Image captioning with an RNN

Now that you have implemented the necessary layers, you can combine them to build an image captioning model. Open the file cs231n/classifiers/rnn.py and look at the CaptioningRNN class.

Implement the forward and backward pass of the model in the loss function. For now you only need to implement the case where cell_type='rnn' for vanilla RNNs; you will implement the LSTM case later. After doing so, run the following to check your forward pass using a small test case; you should see errors on the order of e-10 or less.

class CaptioningRNN(object):
    """
    A CaptioningRNN produces captions from image features using a recurrent
    neural network.

    The RNN receives input vectors of size D, has a vocab size of V, works on
    sequences of length T, has an RNN hidden dimension of H, uses word vectors
    of dimension W, and operates on minibatches of size N.

    Note that we don't use any regularization for the CaptioningRNN.
    """

    def __init__(self, word_to_idx, input_dim=512, wordvec_dim=128,
                 hidden_dim=128, cell_type='rnn', dtype=np.float32):
        """
        Construct a new CaptioningRNN instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V entries,
          and maps each string to a unique integer in the range [0, V).
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - hidden_dim: Dimension H for the hidden state of the RNN.
        - cell_type: What type of RNN to use; either 'rnn' or 'lstm'.
        - dtype: numpy datatype to use; use float32 for training and float64 for
          numeric gradient checking.
        """
        if cell_type not in {'rnn', 'lstm'}:
            raise ValueError('Invalid cell_type "%s"' % cell_type)

        self.cell_type = cell_type
        self.dtype = dtype
        self.word_to_idx = word_to_idx
        self.idx_to_word = {i: w for w, i in word_to_idx.items()}
        self.params = {}

        vocab_size = len(word_to_idx)

        self._null = word_to_idx['<NULL>']
        self._start = word_to_idx.get('<START>', None)
        self._end = word_to_idx.get('<END>', None)

        # Initialize word vectors
        self.params['W_embed'] = np.random.randn(vocab_size, wordvec_dim)
        self.params['W_embed'] /= 100

        # Initialize CNN -> hidden state projection parameters
        self.params['W_proj'] = np.random.randn(input_dim, hidden_dim)
        self.params['W_proj'] /= np.sqrt(input_dim)
        self.params['b_proj'] = np.zeros(hidden_dim)

        # Initialize parameters for the RNN
        dim_mul = {'lstm': 4, 'rnn': 1}[cell_type]
        self.params['Wx'] = np.random.randn(wordvec_dim, dim_mul * hidden_dim)
        self.params['Wx'] /= np.sqrt(wordvec_dim)
        self.params['Wh'] = np.random.randn(hidden_dim, dim_mul * hidden_dim)
        self.params['Wh'] /= np.sqrt(hidden_dim)
        self.params['b'] = np.zeros(dim_mul * hidden_dim)

        # Initialize output to vocab weights
        self.params['W_vocab'] = np.random.randn(hidden_dim, vocab_size)
        self.params['W_vocab'] /= np.sqrt(hidden_dim)
        self.params['b_vocab'] = np.zeros(vocab_size)

        # Cast parameters to correct dtype
        for k, v in self.params.items():
            self.params[k] = v.astype(self.dtype)


    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to compute
        loss and gradients on all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T) where
          each element is in the range 0 <= y[i, t] < V

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the last word
        # and will be input to the RNN; captions_out has everything but the first
        # word and this is what we will expect the RNN to generate. These are offset
        # by one relative to each other because the RNN should produce word (t+1)
        # after receiving word t. The first element of captions_in will be the START
        # token, and the first element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = (captions_out != self._null)

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

        # Word embedding matrix
        W_embed = self.params['W_embed']

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.    #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        # Step 1: project the input image features from (N, D) to (N, H) and use
        # the result as the initial hidden state of the captioning RNN.
        # (Arguably it could also make sense to feed the image features at every timestep.)
        features_proj, cache_proj = affine_forward(features, W_proj, b_proj)
        # features_proj, cache_proj = affine_relu_forward(features, W_proj, b_proj)
        # Step 2: embed the input caption indices, (N, T) -> (N, T, W)
        captions_embedding, cache_embedding = word_embedding_forward(captions_in, W_embed)
        # Step 3: run the RNN over the embedded captions, (N, T, W) -> (N, T, H)
        if self.cell_type == 'rnn':
            hiddens,cache_rnn = rnn_forward(captions_embedding,features_proj,Wx,Wh,b)
        else:
            hiddens,cache_rnn = None,None
        # Step 4: use the temporal affine layer to map hidden states to vocabulary scores
        scores, cache_out = temporal_affine_forward(hiddens, W_vocab, b_vocab)
        # Step 5: compute the temporal softmax loss between the scores and the target captions
        loss, dscores = temporal_softmax_loss(scores, captions_out, mask)

        # Backward pass: propagate gradients through the layers in reverse order
        dhiddens,grads["W_vocab"],grads["b_vocab"] = temporal_affine_backward(dscores,cache_out)
        if self.cell_type == 'rnn':
            dembedding,dh0,grads["Wx"],grads["Wh"],grads["b"] = rnn_backward(dhiddens,cache_rnn)
        else:
            dembedding,dh0 = None,None
        grads["W_embed"] = word_embedding_backward(dembedding,cache_embedding)
        #dfeatures,grads["W_proj"],grads["b_proj"] = affine_relu_backward(dh0,cache_proj)
        dfeatures, grads["W_proj"], grads["b_proj"] = affine_backward(dh0, cache_proj)

        return loss, grads

Note: if you add an extra ReLU when transforming the CNN features (the commented-out affine_relu_forward / affine_relu_backward lines above), the resulting loss will differ slightly from the expected value.
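
Before running the notebook's own check (which compares against a hard-coded expected loss), a minimal smoke test like the following can catch shape bugs; the toy vocabulary and sizes below are made up for illustration:

np.random.seed(231)
word_to_idx_toy = {'<NULL>': 0, '<START>': 1, '<END>': 2,
                   'cat': 3, 'on': 4, 'a': 5, 'bed': 6}
N, D, T = 4, 20, 8

toy_model = CaptioningRNN(word_to_idx_toy, input_dim=D, wordvec_dim=16,
                          hidden_dim=32, cell_type='rnn', dtype=np.float64)

features = np.random.randn(N, D)
captions = np.random.randint(len(word_to_idx_toy), size=(N, T))

loss, grads = toy_model.loss(features, captions)
print('loss:', loss)                                            # a finite scalar
print(sorted(grads.keys()) == sorted(toy_model.params.keys()))  # True: one gradient per parameter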

6.1 Overfitting a small dataset

Similar to the Solver class that we used to train image classification models on the previous assignment, on this assignment we use a CaptioningSolver class to train image captioning models. Open the file cs231n/captioning_solver.py and read through the CaptioningSolver class; it should look very familiar.

Once you have familiarized yourself with the API, run the following to make sure your model overfits a small sample of training data (50 examples, via max_train=50). You should see a final loss of less than 0.1.

np.random.seed(231)

small_data = load_coco_data(max_train=50)

small_rnn_model = CaptioningRNN(
          cell_type='rnn',
          word_to_idx=data['word_to_idx'],
          input_dim=data['train_features'].shape[1],
          hidden_dim=512,
          wordvec_dim=256,
        )

small_rnn_solver = CaptioningSolver(small_rnn_model, small_data,
           update_rule='adam',
           num_epochs=50,
           batch_size=25,
           optim_config={
             'learning_rate': 5e-3,
           },
           lr_decay=0.95,
           verbose=True, print_every=10,
         )

small_rnn_solver.train()

# Plot the training losses
plt.plot(small_rnn_solver.loss_history)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training loss history')
plt.show()

If you added a ReLU activation after the affine transform of the CNN features (as in the commented-out lines above), the loss may not drop below 0.1 within 50 epochs.
(figure: training loss history of the small overfitting run)

6.2 Test-time sampling

Unlike classification models, image captioning models behave very differently at training time and at test time. At training time, we have access to the ground-truth caption, so we feed ground-truth words as input to the RNN at each timestep.
At test time, we sample from the distribution over the vocabulary at each timestep, and feed the sample as input to the RNN at the next timestep.

In the file cs231n/classifiers/rnn.py, implement the sample method for test-time sampling. After doing so, run the following to sample from your overfitted model on both training and validation data. The samples on training data should be very good; the samples on validation data probably won’t make sense.

Sampling code:

    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for input
        feature vectors.

        At each timestep, we embed the current word, pass it and the previous hidden
        state to the RNN to get the next hidden state, use the hidden state to get
        scores for all vocab words, and choose the word with the highest score as
        the next word. The initial hidden state is computed by applying an affine
        transform to the input image features, and the initial word is the <START>
        token.

        For LSTMs you will also have to keep track of the cell state; in that case
        the initial cell state should be zero.

        Inputs:
        - features: Array of input image features of shape (N, D).
        - max_length: Maximum length T of generated captions.

        Returns:
        - captions: Array of shape (N, max_length) giving sampled captions,
          where each element is an integer in the range [0, V). The first element
          of captions should be the first sampled word, not the <START> token.
        """
        if len(features.shape) == 1:
            # (D) -> (N,D)
            features = features[None,:]
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
        W_embed = self.params['W_embed']
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        ###########################################################################
        # TODO: Implement test-time sampling for the model. You will need to      #
        # initialize the hidden state of the RNN by applying the learned affine   #
        # transform to the input image features. The first word that you feed to  #
        # the RNN should be the <START> token; its value is stored in the        #
        # variable self._start. At each timestep you will need to do to:          #
        # (1) Embed the previous word using the learned word embeddings           #
        # (2) Make an RNN step using the previous hidden state and the embedded   #
        #     current word to get the next hidden state.                          #
        # (3) Apply the learned affine transformation to the next hidden state to #
        #     get scores for all words in the vocabulary                          #
        # (4) Select the word with the highest score as the next word, writing it #
        #     (the word index) to the appropriate slot in the captions variable   #
        #                                                                         #
        # For simplicity, you do not need to stop generating after an <END> token#
        # is sampled, but you can if you want to.                                 #
        #                                                                         #
        # HINT: You will not be able to use the rnn_forward or lstm_forward       #
        # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
        # a loop.                                                                 #
        #                                                                         #
        # NOTE: we are still working over minibatches in this function. Also if   #
        # you are using an LSTM, initialize the first cell state to zeros.        #
        ###########################################################################
        # Step 1: project the image features to get the initial hidden state
        affine_out, cache_affine = affine_forward(features, W_proj, b_proj)
        # Step 2: the first input word is <START>; it will be embedded to shape (N, W)
        prev_words = [self._start] * N
        pre_h = affine_out
        # Step 3: unroll the RNN one step at a time
        for t in range(max_length):
            #embedding = W_embed[prev_words]
            embedding, cache_embedding = word_embedding_forward(prev_words, W_embed)
            if self.cell_type == 'rnn':
                next_h,rnn_cache = rnn_step_forward(embedding,pre_h,Wx,Wh,b)
            else:
                next_h, rnn_cache = None, None
            # Step 4: compute vocabulary scores.
            # temporal_affine_forward expects (N, T, H), so add a singleton time axis:
            # (N, H) -> (N, 1, H), giving scores out of shape (N, 1, V)
            pre_h = next_h
            next_h = next_h[:, None, :]
            out, out_cache = temporal_affine_forward(next_h, W_vocab, b_vocab)
            # (N, 1, V) -> (N, V)
            out = out.squeeze(1)

            # out, out_cache = affine_forward(next_h, W_vocab, b_vocab)  # equivalently, with input of shape (N, H)
            # Step 5: greedily pick the highest-scoring word for each sequence, shape (N,)
            prev_words = np.argmax(out,axis=1)
            captions[:, t] = prev_words
        return captions

Display the results:

for split in ['train', 'val']:
    minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)
    if minibatch is None: continue
    
    gt_captions, features, urls = minibatch
    gt_captions = decode_captions(gt_captions, data['idx_to_word'])

    sample_captions = small_rnn_model.sample(features)
    sample_captions = decode_captions(sample_captions, data['idx_to_word'])

    for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):
        plt.imshow(image_from_url(url))
        plt.title('%s\n%s\nGT:%s' % (split, sample_caption, gt_caption))
        plt.axis('off')
        plt.show()

(figure: sampled captions on training and validation images, shown with the ground-truth captions)

7. Inline question

In our current image captioning setup, our RNN language model produces a word at every timestep as its output. However, an alternate way to pose the problem is to train the network to operate over characters (e.g. 'a', 'b', etc.) instead of words, so that at every timestep it receives the previous character as input and tries to predict the next character in the sequence. For example, the network might generate a caption like

'A', ' ', 'c', 'a', 't', ' ', 'o', 'n', ' ', 'a', ' ', 'b', 'e', 'd'

Can you describe one advantage of an image-captioning model that uses a character-level RNN? Can you also describe one disadvantage?
HINT: there are several valid answers, but it might be useful to compare the parameter space of word-level and character-level models.

My answer:

  • With a character-level RNN the vocabulary is much smaller, and so is the embedding matrix (the number of words is far larger than the number of characters, so a word embedding matrix is much bigger). On the other hand, predicting the next character from the current one is harder, because a single character carries much less information than a word, and there is no guarantee that the generated characters form valid words; that requires a good model and enough training data. Which level to use can depend on the task; see the quick parameter-count comparison below.
  • Character-level RNNs also have to handle much longer sequences, so the model needs to capture long-range dependencies. Another problem with character language models is that, in addition to syntax and semantics, they also have to learn spelling, which requires more training data. As a result, word-level language models usually achieve lower error than character-level ones.
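
As a rough back-of-the-envelope comparison of the two parameter spaces (the hidden and embedding sizes are the ones used earlier in this assignment; the character vocabulary size of about 30 is an assumption):

# Count only the parameters that depend on the vocabulary size V:
# the embedding matrix (V x wordvec_dim) and the hidden-to-vocab affine (hidden_dim x V + V).
def vocab_dependent_params(V, wordvec_dim=256, hidden_dim=512):
    return V * wordvec_dim + hidden_dim * V + V

print('word-level (V=1004):', vocab_dependent_params(1004))  # ~7.7e5 parameters
print('char-level (V=30):  ', vocab_dependent_params(30))    # ~2.3e4 parameters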

Some blog posts and discussions related to this question:

  • What is the difference between word-based and char-based text generation RNNs?

This is because char-based RNN LMs require much bigger hidden layer to successfully model long-term dependencies which means higher computational costs.
one of the fundamental differences between the word level and character level models is in the number of parameters the RNN has to access during the training and test.
The smaller is the input and output layer of RNN, the larger needs to be the fully connected hidden layer, which makes the training of the model expensive.

  • Advantage of character based language models over word based
