CPN Fashion Keypoint Detection: Paper and Code Walkthrough (Keras)

Fashion keypoint paper walkthrough


Here we analyze the GitHub project FashionAI_KeyPoint_Detection_Challenge_Keras. The code was written for the fashion keypoint task of the Tianchi competition, and many of its ideas come from the paper Cascaded Pyramid Network for Multi-Person Pose Estimation. Let's first look at some model predictions.

Sample predictions (images omitted) for each garment type: Dress, Blouse, Outwear, Skirt, Trousers.

Below we analyze the model architecture described in the paper, along with its code implementation.

I. The CPN Network

Figure 1. CPN network architecture

The implementation is built around the CPN (Cascaded Pyramid Network). This network helps us localize the many occluded and invisible keypoints, which the paper calls "hard keypoints".
CPN consists of two parts, a GlobalNet and a RefineNet:
CPN =
GlobalNet (essentially an FPN, i.e. a feature-pyramid feature extraction stage, responsible for locating the easier keypoints) +
RefineNet (handles the harder keypoints by integrating the feature representations of every pyramid level and computing their loss)

Many challenging keypoints (heavily occluded or invisible points, complex backgrounds) are hard to localize, for two reasons: 1. "hard" keypoints cannot be identified from their local appearance alone, e.g. some points on the torso; 2. "hard" keypoints are not handled explicitly during training.

1. Feature extraction with FPN

Two figures are enough to understand the principle here.

Figure 2. FPN details
Figure 3. Spatial vs. semantic information at different feature levels

The Feature Pyramid Network was designed with both accuracy and speed in mind. It replaces the feature extractor of detectors such as Faster R-CNN, generating multi-scale feature maps whose information quality is better than an ordinary feature pyramid built for detection. FPN consists of a bottom-up pathway and a top-down pathway. The bottom-up pathway is usually a convolutional network: going up, the spatial resolution decreases while higher-level structures are detected, so the semantic value of the features increases. Figure 3 above illustrates, very conveniently, how much spatial versus semantic information each feature level carries.

From that figure it is clear that as the convolutional depth increases, the semantic information grows while the spatial resolution (what we usually call the spatial information) keeps dropping. SSD (Single Shot MultiBox Detector) actually skips the lower layers because their semantic value is low; to avoid a significant speed penalty it does not use those layers for detection, which is why SSD only detects large objects well.

From Figure 2 we can see that the left side extracts features bottom-up through convolutions, while the right side works top-down. Although the semantic information at the topmost layer is very rich after many convolutions, after all this downsampling and upsampling the object locations are no longer accurate. FPN therefore adds lateral connections between the reconstruction layers and the corresponding feature maps to help the detector predict locations better (my interpretation: the shallow layers' spatial information is injected into the deep layers that carry rich semantic information). These lateral connections also act as skip connections, similar to what residual networks do.

Let's now describe the FPN feature extraction process in detail.

Figure 4. Overall FPN structure

Figure 5. Residual Network structure

Image features are extracted with a ResNet (for the principles, see the article "TensorFlow (II): Residual Network principles and official code walkthrough").
The C1-C5 in Figure 4 correspond to the stages of the residual network in Figure 5.

    1. Bottom-up pathway
      The bottom-up pathway is composed of many convolutional blocks; it is simply the forward pass. Feature maps generally shrink through convolution and pooling, though some layers keep the input size. The network is divided into 5 stages, C1-C5, and the authors take the last layer of each stage (which carries the strongest semantic features) as the output that forms the pyramid. Concretely, they use the feature activation output of the last residual block of each stage, denoted [C2, C3, C4, C5], corresponding to conv2_x, conv3_x, conv4_x, conv5_x. conv1 is not included in the pyramid because of its memory footprint.
    2. Top-down pathway and lateral connections
      The top-down pathway upsamples the network's features, and the lateral connections fuse each upsampled result with the bottom-up feature map of the same size (using Add, not concat). After fusion, a 3*3 convolution is applied to each merged map to suppress the aliasing effect of upsampling. The resulting [P2, P3, P4, P5] then correspond one-to-one with the bottom-up outputs [C2, C3, C4, C5] (a minimal Keras sketch of this merge step follows).

2. GlobalNet structure

Figure: GlobalNet structure (global net.png)

In the paper the authors apply 3*3 convolutions to produce the keypoint heatmaps; here are my own thoughts on these structures. For deconvolution, dilated convolution and upsampling, see my article "Animated demonstrations and explanations of deconvolution, dilated convolution, upsampling and unpooling".

The convolution layers can be understood as feature extraction. Deconvolution (transposed convolution) increases the resolution while having trainable parameters, which makes the later concat and Add operations possible; upsampling likewise increases the resolution. So why does this structure use deconvolution earlier and upsampling later? Because the earlier part cares more about learning: a transposed convolution has trainable weights, whereas upsampling (a fixed interpolation, e.g. nearest-neighbor or bilinear) has no weights to train. So a common design is to enlarge feature maps with transposed convolutions in the front part of the network and with plain upsampling afterwards. Moreover, the front part convolves different levels and then merges them, collecting different amounts of spatial and semantic information from each level; convolving the merged information again yields more complete features. Dilated convolutions are also used here to enlarge the receptive field.
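A tiny sketch of the contrast just described (shapes are illustrative): Conv2DTranspose upsamples with a learned kernel, while UpSampling2D is a fixed, parameter-free interpolation.

from keras.layers import Input, Conv2DTranspose, UpSampling2D
from keras.models import Model

x = Input(shape=(64, 64, 128))
learned = Conv2DTranspose(128, (3, 3), strides=(2, 2), padding='same')(x)  # trainable upsampling
fixed   = UpSampling2D((2, 2))(x)                                          # no trainable weights

Model(x, [learned, fixed]).summary()  # the transposed conv contributes parameters; UpSampling2D contributes none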

3. RefineNet structure

Figure: RefineNet structure (Refine Net.png)

II. OHEM (Online Hard Example Mining)

The model uses OHEM. Hard example mining here means that during training, among the losses of different samples (or the different losses within one sample), we pick the ones with the largest loss and train on them again, i.e. give them extra training. In this network, the model keeps computing the per-keypoint losses of a sample, sorts them, and then continues training on the top-K losses.

III. Code Walkthrough

Dataset introduction and download: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
Code download: FashionAI_KeyPoint_Detection_Challenge_Keras

The file layout looks like this:


Figure: repository layout (代码结构目录.png)

Below we explain what each .py file does.

3.1 Data generation (the generator code is covered together with training, in 3.2.3)

3.2 Model training

3.2.1 train.py — the training entry point
import sys
sys.path.insert(0, "../data_gen/")
sys.path.insert(0, "../unet/")

import argparse
import os
from fashion_net import FashionNet
from dataset import getKpNum
import tensorflow as tf
from keras import backend as k

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpuID", default=0, type=str, help='gpu id')
    parser.add_argument("--category", help="specify cloth category")
    parser.add_argument("--network", help="specify  network arch'")
    parser.add_argument("--batchSize", default=8, type=int, help='batch size for training')
    parser.add_argument("--epochs", default=20, type=int, help="number of traning epochs")
    parser.add_argument("--resume", default=False, type=bool,  help="resume training or not")
    parser.add_argument("--lrdecay", default=False, type=bool,  help="lr decay or not")
    parser.add_argument("--resumeModel", help="start point to retrain")
    parser.add_argument("--initEpoch", type=int, help="epoch to resume")


    args = parser.parse_args()

    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpuID)


    # TensorFlow wizardry
    config = tf.ConfigProto()

    # Don't pre-allocate memory; allocate as-needed
    config.gpu_options.allow_growth = True

    # Allow the process to use up to the full GPU memory
    config.gpu_options.per_process_gpu_memory_fraction = 1.0

    # Create a session with the above options specified.
    k.tensorflow_backend.set_session(tf.Session(config=config))

    if not args.resume :
        xnet = FashionNet(512, 512, getKpNum(args.category))
        xnet.build_model(modelName=args.network, show=True)
        xnet.train(args.category, epochs=args.epochs, batchSize=args.batchSize, lrschedule=args.lrdecay)
    else:
        xnet = FashionNet(512, 512, getKpNum(args.category))
        xnet.resume_train(args.category, args.resumeModel, args.network, args.initEpoch,
                          epochs=args.epochs, batchSize=args.batchSize)

1. Argument reference
gpuID: index of the GPU to train on.
category: one of skirt, dress, trousers, blouse, outwear, all. With all, the keypoints of every garment type are trained; otherwise only those of the chosen garment type.
network: name of the network architecture, used later for saving the model and for locating problems in the logs.
batchSize: batch size, default 8.
epochs: number of epochs, default 20.
resume: whether to resume training.
lrdecay: whether to use learning-rate decay.
resumeModel: the checkpoint from which training continues.
initEpoch: the epoch to start from (useful when resuming).

2. Decide whether we are training from scratch

if not args.resume:
    # first-time training
    ...
else:
    # resume from the previous checkpoint
    ...

3. Model creation, training and resuming
(1) Model initialization: xnet = FashionNet(512, 512, getKpNum(args.category))
(2) Model creation: xnet.build_model(modelName=args.network, show=True)
(3) Training: xnet.train(args.category, epochs=args.epochs, batchSize=args.batchSize, lrschedule=args.lrdecay)
(4) Resuming: xnet.resume_train(args.category, args.resumeModel, args.network, args.initEpoch, epochs=args.epochs, batchSize=args.batchSize)
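A typical first-time run might therefore look like this (flag values are illustrative):

python train.py --gpuID 0 --category all --network v2 --batchSize 8 --epochs 20

and a resumed run would add --resume True --resumeModel <checkpoint path> --initEpoch <epoch>.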

3.2.2 refinenet_mask_v3.py — building and compiling the model

from refinenet import load_backbone_res101net, create_global_net_dilated, create_stack_refinenet
from keras.models import *
from keras.layers import *
from keras.optimizers import Adam, SGD
from keras import backend as K
import keras

def Res101RefineNetMaskV3(n_classes, inputHeight, inputWidth, nStackNum):
    model = build_resnet101_stack_mask_v3(inputHeight, inputWidth, n_classes, nStackNum)
    return model

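# Note (added): euclidean_loss below is the plain L2 norm over the whole
# output tensor, with no per-pixel or per-sample averaging.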
def euclidean_loss(x, y):
    return K.sqrt(K.sum(K.square(x - y)))

def apply_mask_to_output(output, mask):
    output_with_mask = keras.layers.multiply([output, mask])
    return output_with_mask

def build_resnet101_stack_mask_v3(inputHeight, inputWidth, n_classes, nStack):

    input_mask = Input(shape=(inputHeight//2, inputHeight//2, n_classes), name='mask')
    input_ohem_mask = Input(shape=(inputHeight//2, inputHeight//2, n_classes), name='ohem_mask')

    # backbone network
    input_image, lf2x, lf4x, lf8x, lf16x = load_backbone_res101net(inputHeight, inputWidth)

    # global net
    g8x, g4x, g2x = create_global_net_dilated((lf2x, lf4x, lf8x, lf16x), n_classes)

    s8x, s4x, s2x = g8x, g4x, g2x

    g2x_mask = apply_mask_to_output(g2x, input_mask)

    outputs = [g2x_mask]
    for i in range(nStack):
        s8x, s4x, s2x = create_stack_refinenet((s8x, s4x, s2x), n_classes, 'stack_'+str(i))
        if i == (nStack-1): # last stack with ohem_mask
            s2x_mask = apply_mask_to_output(s2x, input_ohem_mask)
        else:
            s2x_mask = apply_mask_to_output(s2x, input_mask)
        outputs.append(s2x_mask)

    model = Model(inputs=[input_image, input_mask, input_ohem_mask], outputs=outputs)

    adam = Adam(lr=1e-4)
    model.compile(optimizer=adam, loss=euclidean_loss, metrics=["accuracy"])
    return model

1. Model instantiation inputs
FashionNet is initialized with an input height and width of 512. With category "all", the complete list of keypoint keys is:

    ALL_KP_KESY = ['image_id','neckline_left', 'neckline_right', 'center_front', 'shoulder_left', 'shoulder_right',
                 'armpit_left', 'armpit_right', 'waistline_left', 'waistline_right', 'cuff_left_in', 'cuff_left_out', 'cuff_right_in',
                 'cuff_right_out', 'top_hem_left', 'top_hem_right', 'waistband_left', 'waistband_right', 'hemline_left', 'hemline_right' ,
                 'crotch', 'bottom_left_in' , 'bottom_left_out', 'bottom_right_in' ,'bottom_right_out']

2. Model construction
(1) Build the training model.
self.model = Res101RefineNetMaskV3(self.nClass, self.inputHeight, self.inputWidth, nStackNum=2)

(2) Set up the input tensors

input_image = xresnet.input
input_mask = Input(shape=(inputHeight//2, inputHeight//2, n_classes), name='mask')
input_ohem_mask = Input(shape=(inputHeight//2, inputHeight//2, n_classes), name='ohem_mask')

Readers may ask why there are two mask input tensors. input_mask marks which keypoint channels are valid for the sample's garment category, while input_ohem_mask keeps only the keypoints with the largest losses: it is effectively an "AND" of the original mask with a top-K loss selection, filtering out the low-loss keypoints. Don't worry if this is not yet clear; it is covered in detail below.
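A toy illustration of what apply_mask_to_output does (pure numpy, my own example): multiplying by a 0/1 mask zeroes out the channels, i.e. the keypoints, that we do not want to supervise.

import numpy as np

output = np.random.rand(2, 2, 3)   # pretend heatmap with 3 keypoint channels
mask = np.zeros_like(output)
mask[:, :, 1] = 1.0                # keep only channel 1
masked = output * mask             # element-wise, what keras.layers.multiply computes
print(masked[:, :, 0].max(), masked[:, :, 1].max())   # 0.0 vs. nonzero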

(3) Build the model structure

    # backbone network
    input_image, lf2x, lf4x, lf8x, lf16x = load_backbone_res101net(inputHeight, inputWidth)

    # global net
    g8x, g4x, g2x = create_global_net_dilated((lf2x, lf4x, lf8x, lf16x), n_classes)

    s8x, s4x, s2x = g8x, g4x, g2x

    g2x_mask = apply_mask_to_output(g2x, input_mask)

    outputs = [g2x_mask]
    for i in range(nStack):
        s8x, s4x, s2x = create_stack_refinenet((s8x, s4x, s2x), n_classes, 'stack_'+str(i))
        if i == (nStack-1): # last stack with ohem_mask
            s2x_mask = apply_mask_to_output(s2x, input_ohem_mask)
        else:
            s2x_mask = apply_mask_to_output(s2x, input_mask)
        outputs.append(s2x_mask)

a. Load the pretrained ResNet101 backbone
input_image, lf2x, lf4x, lf8x, lf16x = load_backbone_res101net(inputHeight, inputWidth)
The backbone receives the 512x512 input and returns feature maps at 1/2, 1/4, 1/8 and 1/16 resolution (256, 128, 64 and 32 pixels; lf4x actually comes out as 127x127 and is zero-padded to 128). Let's look at the function.

def load_backbone_res101net(inputHeight, inputWidth):
    from resnet101 import ResNet101
    xresnet = ResNet101(weights='imagenet', include_top=False, input_shape=(inputHeight, inputWidth, 3))

    xresnet.load_weights("../../data/resnet101_weights_tf.h5", by_name=True)

    lf16x = xresnet.get_layer('res4b22_relu').output
    lf8x = xresnet.get_layer('res3b2_relu').output
    lf4x = xresnet.get_layer('res2c_relu').output
    lf2x = xresnet.get_layer('conv1_relu').output

    # add one padding for lf4x whose shape is 127x127
    lf4xp = ZeroPadding2D(padding=((0, 1), (0, 1)))(lf4x)

    return (xresnet.input, lf2x, lf4xp, lf8x, lf16x)

As you can see, the pretrained weights are loaded, and the feature maps of the different levels are fetched by layer name.

b. Build the GlobalNet
g8x, g4x, g2x = create_global_net_dilated((lf2x, lf4x, lf8x, lf16x), n_classes)
Below is the code that builds this part. Tip: the GlobalNet figure above helps in understanding the structure.

def create_global_net_dilated(lowlevelFeatures, n_classes):
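    # IMAGE_ORDERING is defined in the repo's shared settings (presumably 'channels_last' for the TensorFlow backend)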
    lf2x, lf4x, lf8x, lf16x = lowlevelFeatures

    o = lf16x

    o = (Conv2D(256, (3, 3), dilation_rate=(2, 2), activation='relu', padding='same', name='up16x_conv', data_format=IMAGE_ORDERING))(o)
    o = (BatchNormalization())(o)

    o = (Conv2DTranspose(256, kernel_size=(3, 3), strides=(2, 2), name='upsample_16x', activation='relu', padding='same',
                    data_format=IMAGE_ORDERING))(o)
    o = (concatenate([o, lf8x], axis=-1))
    o = (Conv2D(128, (3, 3), dilation_rate=(2, 2), activation='relu', padding='same', name='up8x_conv', data_format=IMAGE_ORDERING))(o)
    o = (BatchNormalization())(o)
    fup8x = o

    o = (Conv2DTranspose(128, kernel_size=(3, 3), strides=(2, 2), name='upsample_8x', padding='same', activation='relu',
                         data_format=IMAGE_ORDERING))(o)
    o = (concatenate([o, lf4x], axis=-1))
    o = (Conv2D(64, (3, 3), dilation_rate=(2, 2), activation='relu', padding='same', name='up4x_conv', data_format=IMAGE_ORDERING))(o)
    o = (BatchNormalization())(o)
    fup4x = o

    o = (Conv2DTranspose(64, kernel_size=(3, 3), strides=(2, 2), name='upsample_4x', padding='same', activation='relu',
                         data_format=IMAGE_ORDERING))(o)
    o = (concatenate([o, lf2x], axis=-1))
    o = (Conv2D(64, (3, 3), dilation_rate=(2, 2), activation='relu', padding='same', name='up2x_conv', data_format=IMAGE_ORDERING))(o)
    o = (BatchNormalization())(o)
    fup2x = o

    out2x = Conv2D(n_classes, (1, 1), activation='linear', padding='same', name='out2x', data_format=IMAGE_ORDERING)(fup2x)
    out4x = Conv2D(n_classes, (1, 1), activation='linear', padding='same', name='out4x', data_format=IMAGE_ORDERING)(fup4x)
    out8x = Conv2D(n_classes, (1, 1), activation='linear', padding='same', name='out8x', data_format=IMAGE_ORDERING)(fup8x)

    x4x = UpSampling2D((2, 2), data_format=IMAGE_ORDERING)(out8x)
    eadd4x = Add(name='global4x')([x4x, out4x])

    x2x = UpSampling2D((2, 2), data_format=IMAGE_ORDERING)(eadd4x)
    eadd2x = Add(name='global2x')([x2x, out2x])

    return (fup8x, eadd4x, eadd2x)
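To check the plumbing, here is a small shape sanity check I would run (a sketch: it assumes the repo's refinenet.py is importable, and uses 24 keypoint channels, the size of the "all" key list above; the repo's getKpNum may count an extra background channel):

from keras.layers import Input
from refinenet import create_global_net_dilated

# feature maps as the ResNet101 backbone produces them for a 512x512 input
lf2x  = Input(shape=(256, 256, 64))    # conv1_relu, 2x downsampled
lf4x  = Input(shape=(128, 128, 256))   # res2c_relu (127x127, zero-padded to 128)
lf8x  = Input(shape=(64, 64, 512))     # res3b2_relu
lf16x = Input(shape=(32, 32, 1024))    # res4b22_relu

fup8x, g4x, g2x = create_global_net_dilated((lf2x, lf4x, lf8x, lf16x), 24)
print(g2x)  # expect shape (?, 256, 256, 24): half the input resolution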

c. Build the RefineNet

   for i in range(nStack):
        s8x, s4x, s2x =  create_stack_refinenet((s8x, s4x, s2x), n_classes, 'stack_'+str(i))
        outputs.append(s2x)

Below is the detailed code that builds the RefineNet:


def create_stack_refinenet(inputFeatures, n_classes, layerName):
    f8x, f4x, f2x = inputFeatures

    # two 1x1 convs: f8x -> fup8x
    fup8x = (Conv2D(256, kernel_size=(1, 1), name=layerName+'_refine8x_1', padding='same', activation='relu'))(f8x)
    fup8x = (BatchNormalization())(fup8x)

    fup8x = (Conv2D(128, kernel_size=(1, 1), name=layerName+'refine8x_2', padding='same', activation='relu'))(fup8x)
    fup8x = (BatchNormalization())(fup8x)

    out8x = fup8x
    fup8x = UpSampling2D((4, 4), data_format=IMAGE_ORDERING)(fup8x)

    # one 1x1 conv: f4x -> fup4x
    fup4x = (Conv2D(128, kernel_size=(1, 1), name=layerName+'refine4x', padding='same', activation='relu'))(f4x)
    fup4x = (BatchNormalization())(fup4x)
    out4x = fup4x
    fup4x = UpSampling2D((2, 2), data_format=IMAGE_ORDERING)(fup4x)

    # 1 conv f2x -> fup2x
    fup2x = (Conv2D(128, (1, 1), activation='relu', padding='same', name=layerName+'refine2x_conv'))(f2x)
    fup2x = (BatchNormalization())(fup2x)

    # concat f2x, fup8x, fup4x
    fconcat = (concatenate([fup8x, fup4x, fup2x], axis=-1, name=layerName+'refine_concat'))

    # 1x1 to map to required feature map
    out2x = Conv2D(n_classes, (1, 1), activation='linear', padding='same', name=layerName+'refine2x')(fconcat)

    return out8x, out4x, out2x

d. Decide which stack uses the OHEM mask

for i in range(nStack):
    s8x, s4x, s2x = create_stack_refinenet((s8x, s4x, s2x), n_classes, 'stack_'+str(i))
    if i == (nStack-1): # last stack with ohem_mask
        s2x_mask = apply_mask_to_output(s2x, input_ohem_mask)
    else:
        s2x_mask = apply_mask_to_output(s2x, input_mask)
    outputs.append(s2x_mask)

Here nStack=2, so only the last stack has its output multiplied by the OHEM mask, output_with_mask = keras.layers.multiply([output, mask]); this keeps only the highest-loss keypoints, while the earlier outputs are trained on the full set of category-valid keypoints.

Next comes fashion_net, the most important and the hardest part to understand; but before that we need to go through the data_generator code.

3.2.3 data_generator.py

import os
import cv2
import pandas as pd
import numpy as np
import random

from kpAnno import KpAnno
from dataset import getKpNum, getKpKeys, getFlipMapID,  generate_input_mask
from utils import make_gaussian, load_annotation_from_df
from data_process import pad_image, resize_image, normalize_image, rotate_image, \
    rotate_image_float, rotate_mask, crop_image
from ohem import generate_topk_mask_ohem

class DataGenerator(object):

    def __init__(self, category, annfile):
        self.category = category
        self.annfile  = annfile
        self._initialize()

    def get_dim_order(self):
        # default tensorflow dim order
        return "channels_last"

    def get_dataset_size(self):
        return len(self.annDataFrame)

    def generator_with_mask_ohem(self, graph, kerasModel, batchSize=16, inputSize=(512, 512), flipFlag=False, cropFlag=False,
                            shuffle=True, rotateFlag=True, nStackNum=1):

        '''
        Input:  batch_size * Height (512) * Width (512) * Channel (3)
        Input:  batch_size * 256 * 256 * Channel (N+1). Mask for each category. 1.0 for valid parts in category. 0.0 for invalid parts
        Output: batch_size * Height/2 (256) * Width/2 (256) * Channel (N+1)
        '''
        xdf = self.annDataFrame

        targetHeight, targetWidth = inputSize

        # train_input: npfloat,  height, width, channels
        # train_gthmap: npfloat, N heatmap + 1 background heatmap,
        train_input = np.zeros((batchSize, targetHeight, targetWidth, 3), dtype=np.float)
        train_mask = np.zeros((batchSize, int(targetHeight // 2), int(targetWidth // 2), getKpNum(self.category)), dtype=np.float)
        train_gthmap = np.zeros((batchSize, int(targetHeight // 2), int(targetWidth // 2), getKpNum(self.category)), dtype=np.float)
        train_ohem_mask = np.zeros((batchSize, int(targetHeight // 2), int(targetWidth // 2), getKpNum(self.category)), dtype=np.float)
        train_ohem_gthmap = np.zeros((batchSize, int(targetHeight // 2), int(targetWidth // 2), getKpNum(self.category)), dtype=np.float)

        ## generator need to be infinite loop
        while 1:
            # random shuffle at first
            if shuffle:
                xdf = xdf.sample(frac=1)
            count = 0
            for _index, _row in xdf.iterrows():
                xindex = count % batchSize
                xinput, xhmap = self._prcoess_img(_row, inputSize, rotateFlag, flipFlag, cropFlag, nobgFlag=True)
                xmask = generate_input_mask(_row['image_category'],
                                            (targetHeight, targetWidth, getKpNum(self.category)))

                xohem_mask, xohem_gthmap = generate_topk_mask_ohem([xinput, xmask], xhmap, kerasModel, graph,
                                            8, _row['image_category'], dynamicFlag=False)

                train_input[xindex, :, :, :] = xinput
                train_mask[xindex, :, :, :] = xmask
                train_gthmap[xindex, :, :, :] = xhmap
                train_ohem_mask[xindex, :, :, :] = xohem_mask
                train_ohem_gthmap[xindex, :, :, :] = xohem_gthmap

                # if refinenet enable, refinenet has two outputs, globalnet and refinenet
                if xindex == 0 and count != 0:
                    gthamplst = list()
                    for i in range(nStackNum):
                        gthamplst.append(train_gthmap)

                    # last stack will use ohem gthmap
                    gthamplst.append(train_ohem_gthmap)

                    yield [train_input, train_mask, train_ohem_mask], gthamplst

                count += 1

    def _initialize(self):
        self._load_anno()

    def _load_anno(self):
        '''
        Load annotations from train.csv
        '''
        # Todo: check if category legal
        self.train_img_path = "../../data/train"

        # read into dataframe
        xpd = pd.read_csv(self.annfile)
        xpd = load_annotation_from_df(xpd, self.category)
        self.annDataFrame = xpd

    def _prcoess_img(self, dfrow, inputSize, rotateFlag, flipFlag, cropFlag, nobgFlag):

        mlist = dfrow[getKpKeys(self.category)]
        imgName, kpStr = mlist[0], mlist[1:]

        # read kp annotation from csv file
        kpAnnlst = list()
        for _kpstr in kpStr:
            _kpAn = KpAnno.readFromStr(_kpstr)
            kpAnnlst.append(_kpAn)

        assert (len(kpAnnlst) == getKpNum(self.category)), str(len(kpAnnlst))+" is not the same as "+str(getKpNum(self.category))


        xcvmat = cv2.imread(os.path.join(self.train_img_path, imgName))
        if xcvmat is None:
            return None, None

        #flip as first operation.
        # flip image
        if random.choice([0, 1]) and flipFlag:
            xcvmat, kpAnnlst = self.flip_image(xcvmat, kpAnnlst)

        #if cropFlag:
        #    xcvmat, kpAnnlst = crop_image(xcvmat, kpAnnlst, 0.8, 0.95)

        # pad image to 512x512
        paddedImg, kpAnnlst = pad_image(xcvmat, kpAnnlst, inputSize[0], inputSize[1])

        assert (len(kpAnnlst) == getKpNum(self.category)), str(len(kpAnnlst)) + " is not the same as " + str(
            getKpNum(self.category))

        # output ground truth heatmap is 256x256
        trainGtHmap = self.__generate_hmap(paddedImg, kpAnnlst)

        if random.choice([0,1]) and rotateFlag:
            rAngle = np.random.randint(-1*40, 40)
            rotatedImage,  _ = rotate_image(paddedImg, list(), rAngle)
            rotatedGtHmap  = rotate_mask(trainGtHmap, rAngle)
        else:
            rotatedImage  = paddedImg
            rotatedGtHmap = trainGtHmap

        # resize image
        resizedImg    = cv2.resize(rotatedImage, inputSize)
        resizedGtHmap = cv2.resize(rotatedGtHmap, (inputSize[0]//2, inputSize[1]//2))

        return normalize_image(resizedImg), resizedGtHmap


    def __generate_hmap(self, cvmat, kpAnnolst):
        # kpnum + background
        gthmp = np.zeros((cvmat.shape[0], cvmat.shape[1], getKpNum(self.category)), dtype=np.float)

        for i, _kpAnn in enumerate(kpAnnolst):
            if _kpAnn.visibility == -1:
                continue

            radius = 100
            gaussMask = make_gaussian(radius, radius, 20, None)

            # avoid out of boundary
            top_x, top_y = int(max(0, _kpAnn.x - radius/2)), int(max(0, _kpAnn.y - radius/2))
            bottom_x, bottom_y = int(min(cvmat.shape[1], _kpAnn.x + radius/2)), int(min(cvmat.shape[0], _kpAnn.y + radius/2))

            top_x_offset = int(top_x - (_kpAnn.x - radius/2))
            top_y_offset = int(top_y - (_kpAnn.y - radius/2))

            gthmp[ top_y:bottom_y, top_x:bottom_x, i] = gaussMask[top_y_offset:top_y_offset + bottom_y-top_y,
                                                                  top_x_offset:top_x_offset + bottom_x-top_x]

        return gthmp

    def flip_image(self, orgimg, orgKpAnolst):
        flipImg = cv2.flip(orgimg, flipCode=1)
        flipannlst = self.flip_annlst(orgKpAnolst, orgimg.shape)
        return flipImg, flipannlst


    def flip_annlst(self, kpannlst, imgshape):
        height, width, channels = imgshape

        # flip first
        flipAnnlst = list()
        for _kp in kpannlst:
            flip_x = width - _kp.x
            flipAnnlst.append(KpAnno(flip_x, _kp.y, _kp.visibility))

        # exchange location of flip keypoints, left->right
        outAnnlst = flipAnnlst[:]
        for i, _kp in enumerate(flipAnnlst):
            mapId = getFlipMapID('all', i)
            outAnnlst[mapId] = _kp

        return outAnnlst
  1. First, the DataGenerator member variables:
    a. self.category: the keypoint category; here we use all.
    b. self.annfile: the keypoint annotation file, in .csv format.
    c. _initialize(): reads the annotation file into a pandas DataFrame.

  2. Now let's look at the generator_with_mask_ohem function.
    It uses the pattern

while 1:
    yield ...

to produce training batches endlessly; of course, the generated batches have to meet certain requirements.
a. The input size is 512*512.
b. Optional shuffling, with a sampling fraction of 1, i.e. the whole dataset is shuffled:

            if shuffle:
                xdf = xdf.sample(frac=1)

c. Iterate over the training set

for _index, _row in xdf.iterrows():
    xindex = count % batchSize
    ...

Here xindex is the index within the current batch.

 xinput, xhmap = self._prcoess_img(_row, inputSize, rotateFlag, flipFlag, cropFlag, nobgFlag=True)

This call reads the image, applies the augmentations (an optional flip and rotation; cropping is commented out), pads it to 512x512, and also produces the ground-truth heatmap.

d. Next is the heatmap generation step, i.e. how the ground truth is produced. Why heatmaps at all? Because during training we cannot compute the loss against the single annotated coordinate alone: with a one-hot target, a prediction that is only slightly off would be penalized as heavily as one that is far off, and only an exact hit would give zero loss. That is clearly not what we want, so we blur each annotated keypoint with a Gaussian: the annotated point gets the highest probability, and the probability of being the keypoint decays with distance.

     trainGtHmap = self.__generate_hmap(paddedImg, kpAnnlst)
    def __generate_hmap(self, cvmat, kpAnnolst):
        # kpnum + background
        gthmp = np.zeros((cvmat.shape[0], cvmat.shape[1], getKpNum(self.category)), dtype=np.float)

        for i, _kpAnn in enumerate(kpAnnolst):
            if _kpAnn.visibility == -1:
                continue

            radius = 100
            gaussMask = make_gaussian(radius, radius, 20, None)

            # avoid out of boundary
            top_x, top_y = int(max(0, _kpAnn.x - radius/2)), int(max(0, _kpAnn.y - radius/2))
            bottom_x, bottom_y = int(min(cvmat.shape[1], _kpAnn.x + radius/2)), int(min(cvmat.shape[0], _kpAnn.y + radius/2))

            top_x_offset = int(top_x - (_kpAnn.x - radius/2))
            top_y_offset = int(top_y - (_kpAnn.y - radius/2))

            gthmp[ top_y:bottom_y, top_x:bottom_x, i] = gaussMask[top_y_offset:top_y_offset + bottom_y-top_y,
                                                                  top_x_offset:top_x_offset + bottom_x-top_x]

First the ground-truth heatmap gthmp is defined: a 3-dimensional array of feature-map height, feature-map width, and number of keypoints getKpNum(self.category).

Then keypoints that are absent (visibility == -1) are skipped, which leaves their channel all zeros:

        for i, _kpAnn in enumerate(kpAnnolst):
            if _kpAnn.visibility == -1:
                continue

Finally, let's see how the ground truth itself is generated. Here radius = 100, and the Gaussian patch is produced by the make_gaussian function:

def make_gaussian(width, height, sigma=3, center=None):
    '''
        generate 2d gaussian heatmap
    :return:
    '''

    x = np.arange(0, width, 1, float)
    y = np.arange(0, height, 1, float)[:, np.newaxis]

    if center is None:
        x0 = width // 2
        y0 = height // 2
    else:
        x0 = center[0]
        y0 = center[1]

    return np.exp( -4*np.log(2)*((x-x0)**2 + (y-y0)**2)/sigma**2)
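A quick note on the formula (my own observation): the constant -4*log(2) makes sigma the full width at half maximum, since exp(-4 ln 2 * d^2 / sigma^2) = 2^(-4 d^2 / sigma^2); with sigma=20 as used above, a pixel at distance d = sigma/2 = 10 from the center gets a value of exactly 0.5.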

The image below shows the generated Gaussian patch.
Figure: Gaussian patch (relative_mask.jpg)

But we still need to map the generated Gaussian mask onto the full-size heatmap:

top_x, top_y = int(max(0, _kpAnn.x - radius/2)), int(max(0, _kpAnn.y - radius/2))
bottom_x, bottom_y = int(min(cvmat.shape[1], _kpAnn.x + radius/2)), int(min(cvmat.shape[0], _kpAnn.y + radius/2))
top_x_offset = int(top_x - (_kpAnn.x - radius/2))
top_y_offset = int(top_y - (_kpAnn.y - radius/2))
gthmp[ top_y:bottom_y, top_x:bottom_x, i] = gaussMask[top_y_offset:top_y_offset + bottom_y-top_y,top_x_offset:top_x_offset + bottom_x-top_x]
Figure: Gaussian patch placed on the full heatmap (absolute_mask.jpg)
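A worked example of the boundary handling (numbers are my own): for a keypoint at x = 30 with radius = 100, top_x = max(0, 30 - 50) = 0 and top_x_offset = 0 - (30 - 50) = 20, so only columns 20..99 of the 100-wide Gaussian patch are copied into gthmp, exactly the part that still fits inside the image.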

e. Having covered ground-truth generation, let's see how the initial input mask is built. It is created by the generate_input_mask function.

def generate_input_mask(image_category, shape, nobgFlag=True):
    import numpy as np
    # 0.0 for invalid key points for each category
    # 1.0 for valid key points for each category
    h, w, c = shape
    mask = np.zeros((h // 2, w // 2, c), dtype=np.float)

    for key in getKpKeys(image_category)[1:]:
        index = get_kp_index_from_allkeys(key)
        mask[:, :, index] = 1.0

    # for last channel, background
    if nobgFlag:     mask[:, :, -1] = 0.0
    else:   mask[:, :, -1] = 1.0

    return mask

If the category contains a given keypoint, every entry of that keypoint's channel in the mask is set to 1; otherwise the channel is set to 0.
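A quick usage sketch (assuming the repo's data_gen modules are on the path, and that category strings match the CSV's image_category values, e.g. 'skirt'):

import numpy as np
from dataset import getKpNum, generate_input_mask

mask = generate_input_mask('skirt', (512, 512, getKpNum('all')))
print(mask.shape)                   # (256, 256, getKpNum('all'))
print(np.flatnonzero(mask[0, 0]))   # the channel indices valid for 'skirt'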

f. Next, how the top-K OHEM mask is generated.

def generate_topk_mask_ohem(input_data, gthmap, keras_model, graph, topK, image_category, dynamicFlag=False):
    '''
    :param input_data: input
    :param gthmap:  ground truth
    :param keras_model: keras model
    :param graph:  tf graph, to work around a threading issue
    :param topK: number of kp selected
    :return:
    '''

    # do inference, and calculate loss of each channel
    mimg, mmask = input_data
    ximg  = mimg[np.newaxis,:,:,:]
    xmask = mmask[np.newaxis,:,:,:]

    if len(keras_model._input_layers) == 3:
        # use original mask as ohem_mask
        inputs = [ximg, xmask, xmask]
    else:
        inputs = [ximg, xmask]

    with graph.as_default():
        keras_output = keras_model.predict(inputs)

    # heatmap of last stage
    outhmap = keras_output[-1]

    channel_num = gthmap.shape[-1]

    # calculate loss
    mloss = list()
    for i in range(channel_num):
        _dtmap = outhmap[0, :, :, i]
        _gtmap = gthmap[:, :, i]
        loss   = np_euclidean_l2(_dtmap, _gtmap)
        mloss.append(loss)

    # refill input_mask, set topk as 1.0 and fill 0.0 for rest
    # fixme: topk may differ between categories
    if dynamicFlag:
        topK = getKpNum(image_category)//2

    ohem_mask   = adjsut_mask(mloss, mmask, topK)

    ohem_gthmap = ohem_mask * gthmap

    return ohem_mask, ohem_gthmap

The model takes its three inputs as a list; the initial call uses inputs = [ximg, xmask, xmask], where the third mask stands in for input_ohem_mask. The code assumes by default that keras_model._input_layers includes the OHEM mask input, i.e. has length 3.

        keras_output = keras_model.predict(inputs)

Next the model runs prediction on these inputs to obtain keras_output. Since nStack is set to 2 here, the model has three outputs, and we take the last stack (the RefineNet output), outhmap = keras_output[-1], as the predicted feature map.
Then, for each channel, the Euclidean distance between the predicted feature map and the ground truth is computed and recorded as that keypoint's loss.

    mloss = list()
    for i in range(channel_num):
        _dtmap = outhmap[0, :, :, i]
        _gtmap = gthmap[:, :, i]
        loss   = np_euclidean_l2(_dtmap, _gtmap)
        mloss.append(loss)

Finally we keep topK keypoints for the OHEM mask. In this code path topK is fixed at 8 (the generator calls with dynamicFlag=False); with dynamicFlag=True it would instead be getKpNum(image_category)//2.

def adjsut_mask(loss, input_mask,  topk):
    # pick topk loss from losses
    # fill topk with 1.0 and fill the rest as 0.0
    assert (len(loss) == input_mask.shape[-1]), \
        "shape should be same" + str(len(loss)) + " vs " + str(input_mask.shape)

    outmask = np.zeros(input_mask.shape, dtype=np.float)

    topk_index = sorted(range(len(loss)), key=lambda i:loss[i])[-topk:]

    for i in range(len(loss)):
        if i in topk_index:
            outmask[:,:,i] = 1.0

    return outmask
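A toy run of adjsut_mask (sic, the repo's spelling; the numbers are my own): with the losses below and topk=2, only channels 1 and 2, which carry the two largest losses, stay selected.

import numpy as np

losses = [0.1, 0.9, 0.5, 0.3]
input_mask = np.ones((4, 4, 4))
out = adjsut_mask(losses, input_mask, 2)
print(out[0, 0])   # [0. 1. 1. 0.]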

The mask can be understood as a keypoint selector: whichever keypoints we want to predict, we set their channels in the mask to 1. OHEM exploits exactly this, re-selecting the hard-to-recognize keypoints for a further round of training.
Finally, the element-wise product ohem_gthmap = ohem_mask * gthmap yields the OHEM ground-truth heatmap.

g. Finally, let's see how the final model inputs and labels are assembled:

train_input[xindex, :, :, :] = xinput
train_mask[xindex, :, :, :] = xmask
train_gthmap[xindex, :, :, :] = xhmap
train_ohem_mask[xindex, :, :, :] = xohem_mask
train_ohem_gthmap[xindex, :, :, :] = xohem_gthmap
# if refinenet enabled, the model has multiple outputs: globalnet plus the refinenet stacks
if xindex == 0 and count != 0:
    gthamplst = list()
    for i in range(nStackNum):
        gthamplst.append(train_gthmap)

    # last stack will use ohem gthmap
    gthamplst.append(train_ohem_gthmap)

    yield [train_input, train_mask, train_ohem_mask], gthamplst

Once per full batch (if xindex == 0 and count != 0:), the inputs and targets are yielded.

3.2.4 fashion_net.py
import sys
sys.path.insert(0, "../data_gen/")
sys.path.insert(0, "../eval/")

from data_generator import DataGenerator
from keras.callbacks import ModelCheckpoint, CSVLogger
from keras.models import load_model
from data_process import pad_image, normalize_image
import os
import cv2
import numpy as np
import datetime
from eval_callback import NormalizedErrorCallBack
#from keras.utils.training_utils import multi_gpu_model
from refinenet_mask_v3 import Res101RefineNetMaskV3, euclidean_loss
from resnet101 import Scale
import tensorflow as tf

class FashionNet(object):

    def __init__(self, inputHeight, inputWidth, nClasses):
        self.inputWidth = inputWidth
        self.inputHeight = inputHeight
        self.nClass = nClasses

    def build_model(self, modelName='v2', show=False):
        self.modelName = modelName
        self.model = Res101RefineNetMaskV3(self.nClass, self.inputHeight, self.inputWidth, nStackNum=2)
        self.nStackNum = 2

        # show model summary and layer name
        if show:
            self.model.summary()
            for layer in self.model.layers:
                print(layer.name, layer.trainable)

    def train(self, category, batchSize=8, epochs=20, lrschedule=False):
        trainDt = DataGenerator(category, os.path.join("../../data/train/Annotations", "train_split.csv"))
        trainGen = trainDt.generator_with_mask_ohem( graph=tf.get_default_graph(), kerasModel=self.model,
                                    batchSize= batchSize, inputSize=(self.inputHeight, self.inputWidth),
                                    nStackNum=self.nStackNum, flipFlag=False, cropFlag=False)

        normalizedErrorCallBack = NormalizedErrorCallBack("../../trained_models/", category, True)

        csvlogger = CSVLogger( os.path.join(normalizedErrorCallBack.get_folder_path(),
                               "csv_train_"+self.modelName+"_"+str(datetime.datetime.now().strftime('%H:%M'))+".csv"))

        xcallbacks = [normalizedErrorCallBack, csvlogger]

        self.model.fit_generator(generator=trainGen, steps_per_epoch=trainDt.get_dataset_size()//batchSize,
                                 epochs=epochs,  callbacks=xcallbacks)

    def load_model(self, netWeightFile):
        self.model = load_model(netWeightFile, custom_objects={'euclidean_loss': euclidean_loss, 'Scale': Scale})

    def resume_train(self, category, pretrainModel, modelName, initEpoch, batchSize=8, epochs=20):
        self.modelName = modelName
        self.load_model(pretrainModel)
        refineNetflag = True
        self.nStackNum = 2

        modelPath = os.path.dirname(pretrainModel)

        trainDt = DataGenerator(category, os.path.join("../../data/train/Annotations", "train_split.csv"))
        trainGen = trainDt.generator_with_mask_ohem(graph=tf.get_default_graph(), kerasModel=self.model,
                                                    batchSize=batchSize, inputSize=(self.inputHeight, self.inputWidth),
                                                    nStackNum=self.nStackNum, flipFlag=False, cropFlag=False)


        normalizedErrorCallBack = NormalizedErrorCallBack("../../trained_models/", category, refineNetflag, resumeFolder=modelPath)

        csvlogger = CSVLogger(os.path.join(normalizedErrorCallBack.get_folder_path(),
                                           "csv_train_" + self.modelName + "_" + str(
                                               datetime.datetime.now().strftime('%H:%M')) + ".csv"))

        self.model.fit_generator(initial_epoch=initEpoch, generator=trainGen, steps_per_epoch=trainDt.get_dataset_size() // batchSize,
                                 epochs=epochs, callbacks=[normalizedErrorCallBack, csvlogger])


    def predict_image(self, imgfile):
        # load image and preprocess
        img = cv2.imread(imgfile)
        img, _ = pad_image(img, list(), 512, 512)
        img = normalize_image(img)
        input = img[np.newaxis,:,:,:]
        # inference
        heatmap = self.model.predict(input)
        return heatmap


    def predict(self, input):
        # inference
        heatmap = self.model.predict(input)
        return heatmap

This is the entry point for the whole model, covering both training and prediction.
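As a usage sketch, inference on a single image could look like the following (paths are hypothetical; since the training graph has three inputs, we feed all-ones masks, which leaves every keypoint channel unmasked):

import cv2
import numpy as np
from dataset import getKpNum
from data_process import pad_image, normalize_image
from fashion_net import FashionNet

xnet = FashionNet(512, 512, getKpNum('all'))
xnet.load_model("../../trained_models/all/weights.hdf5")   # hypothetical checkpoint path

img = cv2.imread("test.jpg")                               # hypothetical test image
img, _ = pad_image(img, list(), 512, 512)
ximg = normalize_image(img)[np.newaxis, :, :, :]
ones = np.ones((1, 256, 256, getKpNum('all')), dtype=np.float32)

heatmaps = xnet.predict([ximg, ones, ones])
final = heatmaps[-1]   # last RefineNet stage, shape (1, 256, 256, n_classes)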

  1. The constructor
def __init__(self, inputHeight, inputWidth, nClasses):
    self.inputWidth = inputWidth
    self.inputHeight = inputHeight
    self.nClass = nClasses

This defines the input image height and width and the number of keypoint classes.

  2. Load the model and print its structure
def build_model(self, modelName='v2', show=False):
    self.modelName = modelName
    self.model = Res101RefineNetMaskV3(self.nClass, self.inputHeight, self.inputWidth, nStackNum=2)
    self.nStackNum = 2
     # show model summary and layer name
    if show:
        self.model.summary()
        for layer in self.model.layers:
            print(layer.name, layer.trainable)

There is a hyperparameter here, nStackNum, with default 2. It controls how many RefineNet stages are stacked and hence at which stage the hard keypoints are re-trained with OHEM; the smaller it is, the earlier the targeted training of hard-to-recognize points kicks in.

  3. First-time training
def train(self, category, batchSize=8, epochs=20, lrschedule=False):
        trainDt = DataGenerator(category, os.path.join("../../data/train/Annotations", "train_split.csv"))
        trainGen = trainDt.generator_with_mask_ohem( graph=tf.get_default_graph(), kerasModel=self.model,
                                    batchSize= batchSize, inputSize=(self.inputHeight, self.inputWidth),
                                    nStackNum=self.nStackNum, flipFlag=False, cropFlag=False)

        normalizedErrorCallBack = NormalizedErrorCallBack("../../trained_models/", category, True)

        csvlogger = CSVLogger( os.path.join(normalizedErrorCallBack.get_folder_path(),
                               "csv_train_"+self.modelName+"_"+str(datetime.datetime.now().strftime('%H:%M'))+".csv"))

        xcallbacks = [normalizedErrorCallBack, csvlogger]

        self.model.fit_generator(generator=trainGen, steps_per_epoch=trainDt.get_dataset_size()//batchSize,
                                 epochs=epochs,  callbacks=xcallbacks)

a. First, the input path of the annotation file is defined:

trainDt = DataGenerator(category, os.path.join("../../data/train/Annotations", "train_split.csv"))

b. Define the generator for the input data:

trainGen = trainDt.generator_with_mask_ohem( graph=tf.get_default_graph(), kerasModel=self.model,
                                    batchSize= batchSize, inputSize=(self.inputHeight, self.inputWidth),
                                    nStackNum=self.nStackNum, flipFlag=False, cropFlag=False)
c. Define the callbacks:
normalizedErrorCallBack = NormalizedErrorCallBack("../../trained_models/", category, True)
csvlogger = CSVLogger( os.path.join(normalizedErrorCallBack.get_folder_path(),
                               "csv_train_"+self.modelName+"_"+str(datetime.datetime.now().strftime('%H:%M'))+".csv"))
xcallbacks = [normalizedErrorCallBack, csvlogger]

References:
[1] FashionAI_KeyPoint_Detection_Challenge_Keras
[2] An introduction to the intuition, architecture and performance of FPN (Feature Pyramid Networks)
[3] Feature Pyramid Networks for Object Detection
